Data Transfer & Management
Last update: October 1, 2018

Introduction

Transferring data includes moving files from local machines to XSEDE, as well as transfers between XSEDE resources. This section gives a high-level overview of the recommended XSEDE data transfer methods.

There are a variety of methods for transferring files across XSEDE. XSEDE provides a Web Browser interface (easy to use, universally available), command-line interfaces (for casual use, scripting, and automation), and application programming interfaces (for developing applications). You may choose between Globus, globus-url-copy and uberftp, or scp and sftp. See Table 1 below for details on each method.

Table 1. Data Transfer Methods

Usage Mode | Transfer Method | Things to know
Web Browser Interface | Globus Web application | easy-to-use web interface; uses XSEDE web single sign-on; desktop download available
Command Line Interface | Globus Command Line Interface (CLI) | managed, reliable, and auto-tuned transfer; supports scripting; uses XSEDE web single sign-on; requires command line/scripting skill; requires a Python installation
Command Line Interface | globus-url-copy & uberftp | high-performance transfer with tuning options; requires advanced knowledge for authentication and performance tuning; no automatic failure recovery
Command Line Interface | scp & sftp | easy, familiar interface; must use local (resource-specific) username and password; no automatic failure recovery
Application Programming Interface | Globus SDK for Python | designed for application developers; popular Python language; simplifies code; use in Jupyter notebooks; requires Python programming skill; some advanced features are not exposed
Application Programming Interface | Globus REST API | usable with most development environments; full access to all Globus service features; requires programming skill; requires familiarity with REST and OAuth 2.0

Globus Intro & Setup

Designed specifically for researchers, Globus provides fast, reliable, and secure file transfer among XSEDE resources or between an XSEDE resource and another system (such as a campus cluster, lab server, or personal computer). Beyond file transfer, Globus allows researchers to securely share data with collaborators, and to publish data for broader access as required by many data management plans.

Use Globus to:

  • Move data between any two systems: in XSEDE, at your campus or laboratory, or at your home. Easily move small numbers of very large files (even terabyte-sized) as well as large numbers of small files (thousands or more at a time).
  • Share data with your colleagues and collaborators, whether the data is stored at your laboratory on campus, your office, your home, or on selected XSEDE resources.

With Globus:

  • You don't have to move your data to a public cloud service to share it.
  • You have full control over what data is shared and with whom.
  • You can share data with colleagues who aren't registered with XSEDE.

Globus services are available to everyone with an XSEDE User Portal account.

  • Use the Globus interfaces described below, including the Web interface (used by most people), the command-line interface (used by advanced users and developers), and the REST APIs (used by application developers).
  • Move files to and from any XSEDE resource on which you have a current allocation (permission to use).
  • Download and install the Globus Connect Personal software on your personal computer to transfer files between it and XSEDE or other systems.
  • Enable sharing on your Globus Connect Personal endpoints. (See below.)
  • Download and install the Globus Connect Server software on any of your personal or campus systems and use it to provide multi-user file transfer access on those systems. File sharing on multi-user systems is limited to XSEDE service providers and is otherwise available with a Globus subscription.

Set Up Globus

Make sure that Globus knows you are registered with XSEDE. This will allow you to use any of the features listed above. First, open a web browser and navigate to the Globus website.

  1. Click the "Login" button at the top-right of the page.

    • If you've recently used Globus, you'll be automatically logged in. See Add XSEDE to your Globus Profile.

    • If you haven't used Globus recently (or ever), you will be prompted to select an organization where you already have an account. Type or select "XSEDE" to specify that you want to use your XSEDE identity, as in Figure 1 below.

    Figure 1. Use your XSEDE identity to log in to Globus
  2. After clicking "Continue," you will be asked to enter your XSEDE username and password. The web page will have the familiar XSEDE interface and will have an address beginning with "weblogin.xsede.org", as in Figure 2 below.

    Figure 2. Enter your XSEDE username and password at the XSEDE login page
  3. Once you enter your XSEDE username and password, you will be logged into Globus and you should see the main file transfer page shown in Figure 3.

    Figure 3. Successfully logged in to Globus

Add XSEDE to your Globus Profile

If you've used Globus before, then you already have a Globus profile. Add your XSEDE identity to your Globus profile to enable the XSEDE features described here.

After you are logged into Globus, click the "Account" button in the upper-right of the web page. You will see a list of the identities that Globus has associated with your profile. If XSEDE isn't already listed, click "Add Linked Identity" in the upper-right of the web page. You will be asked to select an organization. Type or select "XSEDE" to specify that you want to use your XSEDE identity.

Figure 4. Link your XSEDE identity to an existing Globus profile

You'll be redirected to the XSEDE login page shown in Figure 2, where you can enter your XSEDE username and password and then give Globus permission to access your XSEDE identity information. Once you successfully authenticate to XSEDE, you will have access to all of XSEDE's Globus features as described below.

Using Globus

Move Files

Once you are logged in to Globus, the Transfer Files web interface (see Figure 3) is self-explanatory and you will probably understand how to use it without further help. If you need some hints or further details, the Globus website provides an excellent Getting Started guide. The next few sections explain how to find XSEDE's systems within Globus so you can move data to and from them, and how to access your local systems (personal or campus computers).

Find XSEDE's Resources

As you can see in Figure 3 above, Globus lets you move files to and from endpoints. All XSEDE resources have Globus endpoints that allow you to move files to and from them. All of XSEDE's endpoints have the word "XSEDE" in their names, making them easy to look up. Just click the "start here" box in Globus, and a panel will appear where you can type the endpoint's name. Type "XSEDE," and all of the XSEDE endpoints will appear, as shown in Figure 5. (Make sure the endpoint you pick is "owned" by XSEDE.) XSEDE routinely keeps its endpoints up to date in Globus. If the resource you are looking for is especially new or is no longer online and you can't find it in the list, use the XSEDE Help Desk to inquire about its availability. Select the endpoint you want to move files to or from, and you will be connected to that system and shown the files and directories there.

Figure 5. Type "XSEDE" to get a list of all of XSEDE's Globus Endpoints

Find your Local Computers

You can easily create a Globus endpoint on your own computer so that you can upload or download files. To do this, download and install the Globus Connect Personal software. (Globus Connect Personal is available for Windows, Mac, and Linux systems.) You can set Globus Connect Personal up as a background task (so that it is always running in the background) or you can start and stop it like any other application. While Globus Connect Personal is running, your system will be available in Globus's transfer interface (but only to you!) and you will be able to start transfers between it and any other Globus endpoint, including XSEDE resources.

Note that you can start and stop Globus Connect Personal on your system whenever you need to, even shutting your system down, hibernating, and moving from one network to another. Globus will automatically find your system and continue any active transfers whenever you are connected to a network and have the tool running. Globus Connect Personal is designed to work with most firewalls and NAT devices. It does not require administrative privileges to run on your system.

Find Other Research Systems

Many colleges, universities, and national research institutions and laboratories offer Globus endpoints for their systems. To find an endpoint associated with an organization, try typing the name of the organization in the "start here" endpoint box in Globus. You may be surprised how many organizations are already listed and available. You will typically need to have an account with the organization to access its endpoints.

Any multi-user HPC or shared storage system can be configured as a Globus endpoint using Globus Connect Server. On most servers, installing Globus Connect Server requires just a few commands. Once the software is installed, any user with a local account on that system can move and share files between it and an XSEDE resource. Information on Globus Connect Server is available on the Globus documentation web site.

Share Data

You can share files from a Globus endpoint with your colleagues and collaborators. Your collaborators will receive an email message from you inviting them to login to Globus and access the files you've shared with them. You can give them read-only or both read and write access to your files. They can log into Globus using their campus credentials (with InCommon-participating campuses), their XSEDE account (if they have one), or a new Globus account.

Share your Personal Servers

To enable sharing on your own system running Globus Connect Personal, join XSEDE's "XSEDE Globus Plus Users" group. When logged in to Globus, click the "Groups" button in the top bar and select "Search for Groups" as shown in Figure 6.

Figure 6. Search for the XSEDE Globus Plus Users group

You will be taken to a search page. Type "XSEDE Globus Plus" in the search box and press Enter. Click on the "XSEDE Globus Plus Users" group in the search results box, and you will be shown brief information about the group and given a button labeled "Join Group" on the right side of the page, as shown in Figure 7.

Figure 7. Join the XSEDE Globus Plus Users group

When you click the "Join Group" button, you will be asked for some very brief information about yourself (name, email, organization, etc.). The first question on the form is, "Select which username to join as." Click the pop-up list and select your XSEDE identity if it is available in the list. (If your XSEDE identity isn't in the list, you may use another identity to join.) Click the button "Submit Application to Join." You will receive an email message shortly (during normally staffed hours) letting you know that your request has been approved.

Once Globus Plus has been enabled for your account, follow Globus's instructions for enabling sharing on your system and creating a Shared Endpoint.

Share Files Across XSEDE

As of January 2017, only the XSEDE resources managed by the San Diego Supercomputer Center and the Wrangler system at TACC and Indiana University allow sharing on their endpoints. If you would like to use sharing on any other XSEDE endpoint, please contact the XSEDE Help Desk.

Share Files on Other Systems

To enable sharing on any other multi-user system running Globus Connect Server (such as a shared campus system), you or your campus will need a Globus subscription. Many campuses have subscriptions already; enquire with your local staff to find out if your systems include Globus and sharing.

Globus Interfaces

You can access Globus features using the Web browser interface (described above), a command-line interface, a Python software development kit (SDK), or a Web-standard REST API for non-Python applications.

If you use Globus's CLI, SDK, or REST APIs, you are encouraged to use the developer mailing list to engage directly with the Globus engineering team and other developers.

Command Line Interface (CLI)

You can use Globus's command line interface (CLI) to script, or automate, your use of Globus services. For example, you might write a script that automatically uploads the output data produced by an application when the application completes.

Follow Globus's CLI installation instructions to install the CLI on your system. Once installed, follow Globus's CLI QuickStart guide to quickly become familiar with the CLI and how to use it in your work.
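As an illustration of the scripting use case above, here is a minimal Python sketch that submits a transfer through the Globus CLI once an application has finished. It assumes the CLI is installed and that you have already run "globus login". The endpoint IDs, paths, and function names are hypothetical placeholders, not values from this guide. The sketch builds the command as an argument list so it can be inspected before anything is run.

```python
import shutil
import subprocess

# Hypothetical placeholders: replace with your own endpoint IDs.
SOURCE_ENDPOINT = "SOURCE-ENDPOINT-UUID"
DEST_ENDPOINT = "DEST-ENDPOINT-UUID"


def build_transfer_command(src_endpoint, src_path, dst_endpoint, dst_path, label):
    """Return the argument list for a recursive, checksum-verified `globus transfer`."""
    return [
        "globus", "transfer",
        f"{src_endpoint}:{src_path}",
        f"{dst_endpoint}:{dst_path}",
        "--recursive",
        "--verify-checksum",
        "--label", label,
    ]


def upload_results(src_path, dst_path, label="application output"):
    """Submit the transfer if the Globus CLI is on PATH; otherwise only report it."""
    command = build_transfer_command(
        SOURCE_ENDPOINT, src_path, DEST_ENDPOINT, dst_path, label
    )
    if shutil.which("globus") is None:
        print("Globus CLI not found; would have run:", " ".join(command))
        return None
    # check=True raises CalledProcessError if the CLI reports a failure.
    return subprocess.run(command, check=True)
```

A job script could call upload_results("/scratch/joeuser/run42/", "/archive/run42/") as its last step. Both paths here are made up for the example.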

Software Development Kit (SDK)

Use the Globus Software Development Kit (SDK) for Python to write applications that use Globus services.

Follow Globus's SDK installation instructions to install the SDK in your Python development environment, and then follow the SDK Tutorial to quickly become familiar with the SDK and how to use it. The full SDK documentation includes complete reference information as well as helpful examples.

REST API

If you write applications in a language other than Python, or if you require advanced features that aren't provided in the SDK, Globus also provides Web-standard REST APIs with complete access to all Globus services. Access to Globus's REST APIs is controlled by OAuth 2.0 (OAuth2) access tokens, which are obtained by authenticating with Globus Auth, Globus's access management service.

You are also encouraged to use the developer mailing list to engage directly with the Globus engineering team and other developers.

Globus References

The Globus services are quite powerful and include many features not mentioned above. You can learn more using the following resources.

globus-url-copy and uberftp

The "globus-url-copy" and "uberftp" commands are command-line implementations of the GridFTP protocol that underlies all XSEDE transfer mechanisms. Use these commands to transfer large files, but be aware that if the transfer fails for any reason, you will need to restart it yourself.

Here's a sample transfer from PSC's Bridges to TACC's Stampede 2 optimized for large files:

login1$ globus-url-copy -stripe -tcp-bs 8388608 \
   gsiftp://gridftp.bridges.psc.edu/scratcha/joeuser/mylargefile \
   gsiftp://gridftp.stampede2.tacc.xsede.org:2811/scratch/joeuser/mylargefile

Resource | GridFTP Server
bridges.psc.xsede.org | gsiftp://gridftp.bridges.psc.edu:2811/
comet.sdsc.xsede.org | gsiftp://oasis-dm.sdsc.xsede.org:2811
stampede2.tacc.xsede.org | gsiftp://gridftp.stampede2.tacc.xsede.org:2811/
supermic.cct-lsu.xsede.org | gsiftp://smic1.hpc.lsu.edu:2811/
wrangler.tacc.xsede.org | gsiftp://gridftp.wrangler.tacc.xsede.org:2811
xstream.stanford.xsede.org | gsiftp://xstream.stanford.xsede.org:2811/

For advanced users, speedpage.psc.edu provides information on transfer speeds you can expect using globus-url-copy with the optimized parameters above.

scp & sftp

You may also use one of these command-line tools to transfer small (< 2 GB) files between XSEDE resources and/or your local machine. From Linux or Mac, you can run these commands directly from the terminal. From Windows, use your ssh client. Both scp and sftp are easy to use and secure.

Data Integrity and Validation

Data protection mechanisms are incorporated into much of the infrastructure and software used in XSEDE, and in most cases users are not required to take any special steps to ensure the integrity of their data. However, there are situations in which a user may wish to check that a transferred file has been copied correctly to the new system, or check that a file has not been changed since it was originally created. In these situations, checksums may be used to generate a cryptographic hash of one or more files. Cryptographic hashes have the property that they always produce the same value when operating on the same input data, so they can be saved and then compared against a recomputed hash to verify that a file is exactly the same as it was when the original checksum was generated. Even a single bit change in a multi-terabyte file will produce a different checksum value, so a successful checksum comparison provides a strong guarantee that data has not been altered in any way.

We recommend that users utilize the "sha256sum" command to create and check cryptographic hashes. This command should be available on most UNIX systems, as well as most XSEDE resources. To generate a checksum for a given file, run sha256sum with the name of the file (or files) you wish to check; the command will report the checksum of each file on a separate line, followed by the filename:

login1$ sha256sum filename1 filename2
9db55391e52a4a84944c6c9817ab8d0445547e8934d88d26032cc4747e196039  filename1
a6483e57971627e4e2403c6d3e38b205c70db2221f0b9fe46781e0af76192ef5  filename2

To save the generated checksums for comparison, redirect the output to a file:

login1$ sha256sum filename1 filename2 > checksums.out

You can then use the contents of this file to verify that the files are exactly the same on any given system, or on the same system at a later date, using the "-c" flag to sha256sum:

login1$ sha256sum -c checksums.out
filename1: OK
filename2: OK

Wildcards can also be used with the sha256sum command. For example, a user could generate checksums for all the files in a directory using the command:

login1$ sha256sum * > checksums.out

After transferring these files to another XSEDE resource, the user could verify that the data was transferred completely and correctly using the saved output file. If any of the files has been corrupted or incompletely transferred, the check will produce output like the following:

login1$ sha256sum -c checksums.out
filename1: FAILED
filename2: OK
sha256sum: WARNING: 1 of 2 computed checksums did NOT match

In this situation, the user should retransfer the files in question or restore from a backup copy of the data.

In order to verify data integrity at a later date, you must have a record of the original checksum values to compare to the present value. Therefore, generate and save checksums when data is first created or before it is transferred into XSEDE, even if you do not immediately intend to perform verification against those checksums.
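Where sha256sum is not available (for example, on a Windows desktop), the same check can be approximated with Python's standard hashlib module. The sketch below is illustrative only: the function names are our own, and it handles just the simple "digest, two spaces, filename" lines shown above, not escaped filenames or the other special cases that sha256sum supports.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file, base_dir="."):
    """Check each '<hex digest>  <filename>' line; return {filename: True or False}."""
    results = {}
    for line in Path(checksum_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with a leading '*'
        target = Path(base_dir) / name
        # A missing file counts as a failed check rather than raising an error.
        results[name] = target.is_file() and sha256_of_file(target) == expected.lower()
    return results
```

Calling verify_checksums("checksums.out") in the directory holding the transferred files returns a dictionary in which any False entry marks a file to retransfer.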

Integrity and Transfer Mechanisms

Some data transfer mechanisms, including GridFTP, provide options to generate and compare checksums as part of the transfer operation. When using Globus to manage GridFTP transfers, include the "--verify-checksum" option in command-line invocations, or select the "Verify Checksum" option in the web interface. Secure copy (scp) provides some protection for data integrity during the transfer because data is encrypted in transit, but it does not perform end-to-end validation of data integrity. Users should therefore perform additional verification if data integrity is important.

Assessing Data Transfer Performance

The key principle to understand is that any given data transfer can only perform as well as the slowest component involved in the transfer. It may be a single slow network link, the disk drive on your laptop, or an overloaded file system on a supercomputer, but regardless, the overall transfer can perform no faster than the slowest component in the chain. It is important that you understand this when assessing data transfer performance.

For example, for most transfers into an XSEDE system coming from a desktop or laptop system, the slowest component will be the disk drive on the desktop/laptop. Individual hard drives are often the "weak link" in this scenario, with maximum speeds under 100MB/sec either due to the drive itself or the interface to the drive. If you are transferring files from a desktop or laptop system, you should not expect to see performance better than 100MB/sec, and sustained rates may be lower. If you see significantly less than that, you may wish to check the performance of the network to which your laptop or desktop is connected. There are many freely available bandwidth-checking utilities on the web that can give you practical readings, and your local IT administrator can give you details of the maximum performance achievable on your local network.
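The "slowest component" principle can be turned into a quick best-case estimate. The following Python sketch is only an illustration: the component names and rates are hypothetical, chosen to match the 100MB/sec laptop drive discussed above, and the function names are our own.

```python
def bottleneck_rate(rates_mb_per_s):
    """The end-to-end rate is capped by the slowest component in the chain."""
    return min(rates_mb_per_s.values())


def transfer_hours(size_gb, rates_mb_per_s):
    """Best-case hours to move size_gb gigabytes (taking 1 GB = 1000 MB)."""
    seconds = size_gb * 1000 / bottleneck_rate(rates_mb_per_s)
    return seconds / 3600


# Hypothetical component rates in MB/sec for a laptop-to-XSEDE transfer.
rates = {"laptop disk": 100, "campus network": 125, "remote file system": 1000}

print(f"bottleneck: {bottleneck_rate(rates)} MB/sec")
print(f"500 GB takes at least {transfer_hours(500, rates):.1f} hours")
```

With these assumed numbers the laptop disk sets the pace, so upgrading the network alone would not shorten the transfer.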

Note that the limitation on drive performance is inherent to the drives themselves. Some XSEDE sites and systems, such as Wrangler, offer a "data dock" service allowing you to physically deliver drives to the site and have data loaded on-site. However, since the performance of the drives is the limiting factor in performance, it will rarely be the case that this is a time-efficient practice. It is most useful in situations where the data is located in a network-limited location or for reasons of security.

To aid in understanding the potential bottlenecks for transfer performance, we have compiled the following table, which shows the network, storage, and measured performance peaks for most of the largest XSEDE sites and systems. The measured performance numbers represent the highest performance seen in real-world tests transferring large files, so they can be treated as roughly the highest level of performance you should expect if you have a source storage system and network that can provide that level of performance on a consistent basis.

XSEDE Site Data Transfer Statistics

Site | Resource | Server Count | Network Bandwidth/Node | Storage Bandwidth/Node | WAN Bandwidth | Max Total Bandwidth | Max Node Bandwidth | Max Measured BW
SDSC | All | 5 | 50 | 50 | 40 | 40 | 40 | 5.576
TACC | Stampede 2 | 3 | 20 | 100 | 100 | 60 | 20 | 4.904
TACC | Ranch | 2 | 40 | 40 | 100 | 80 | 40 | 3.776
PSC | Bridges | 2 | 40 | 100 | 30 | 30 | 30 | 6.08
NICS | All | 8 | 10 | 40 | 100 | 80 | 10 | 3.178

All values are in gigabits per second; divide a value by eight to get gigabytes per second. Since all sites have multiple Globus/GridFTP servers supporting each endpoint, we provide both the per-node network capacity and the total network capacity of the site's connection to the XSEDE/Internet2 network backbone. The maximum total and node bandwidth numbers represent the lowest value of the various inputs and thus show what the maximum possible performance is for a given endpoint.
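The "lowest value of the various inputs" rule can be made concrete with a short Python sketch. The formulas below are our own reading of that rule, not a documented XSEDE method. They take the per-node limit as the minimum of the network, storage, and WAN figures, and the total as the minimum of the aggregated node capacity and the WAN figure. This reading reproduces every row of the table above.

```python
def max_node_bandwidth(network_per_node, storage_per_node, wan):
    """A single server is limited by its slowest input (all in Gbit/s)."""
    return min(network_per_node, storage_per_node, wan)


def max_total_bandwidth(servers, network_per_node, storage_per_node, wan):
    """All servers together are limited by their combined capacity or the WAN link."""
    return min(servers * network_per_node, servers * storage_per_node, wan)


def gbit_to_gbyte(gbit_per_s):
    """Convert gigabits per second to gigabytes per second."""
    return gbit_per_s / 8


# (site, resource): (server count, network/node, storage/node, WAN), from the table.
sites = {
    ("SDSC", "All"): (5, 50, 50, 40),
    ("TACC", "Stampede 2"): (3, 20, 100, 100),
    ("TACC", "Ranch"): (2, 40, 40, 100),
    ("PSC", "Bridges"): (2, 40, 100, 30),
    ("NICS", "All"): (8, 10, 40, 100),
}

for (site, resource), (servers, net, storage, wan) in sites.items():
    total = max_total_bandwidth(servers, net, storage, wan)
    node = max_node_bandwidth(net, storage, wan)
    print(f"{site} {resource}: total {total} Gbit/s, per node {node} Gbit/s")
```

For comparison, the best measured rate for Bridges, 6.08 gigabits per second, is 0.76 gigabytes per second, well below its 30 gigabit per second theoretical limit.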

It is also important to remember that these are shared resources, so it will rarely be the case that your transfer has all of the capacity available to it. More usually your transfer will be sharing bandwidth with other users, other applications, and other systems. In particular, the file systems on all XSEDE resources are shared, and the total storage and network bandwidth is used by both data transfer applications and running applications at almost all times.

Beyond the capabilities of the network and storage systems involved, the single biggest factor in your transfer performance is the size of the files being transferred. Each file transfer spends time on the network setting up and tearing down its connection, and on high-bandwidth networks like XSEDE's this overhead is relatively costly. For example, going from file sizes of 10MB to 1GB can improve your average transfer performance from a few MB/sec to over 1GB/sec. If you need to transfer a large number of files, you will get the best performance by first bundling them into a single tar file and copying that single file.
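On systems without a tar command, the bundling step can be done with Python's standard tarfile module. This is a minimal sketch with hypothetical function and path names. It writes an uncompressed archive, and the archive should be created outside the directory being bundled so that it does not try to include itself.

```python
import tarfile
from pathlib import Path


def bundle_directory(source_dir, archive_path):
    """Pack everything under source_dir into one uncompressed tar archive."""
    source = Path(source_dir)
    with tarfile.open(archive_path, "w") as archive:
        # arcname stores paths relative to the directory name, not the full path.
        archive.add(source, arcname=source.name)
    return Path(archive_path)


def list_bundle(archive_path):
    """Return the sorted names of the regular files stored in the archive."""
    with tarfile.open(archive_path, "r") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())
```

For example, bundle_directory("results", "results.tar") produces one file to transfer. After the transfer, list_bundle("results.tar") on the destination shows what the archive contains before you unpack it.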