Data Management, Transfer & Synchronization
Last update: March 6, 2022
Transferring data includes moving files from local machines to XSEDE, as well as transfers between XSEDE resources. This section gives a high-level overview of the recommended XSEDE data transfer methods.
There are a variety of methods for transferring files across XSEDE. XSEDE provides a Web Browser interface (easy to use, universally available), command-line interfaces (for casual use, scripting, and automation), and application programming interfaces (for developing applications). You may choose between Globus, scp, or sftp. See Table 1 below for details on each method.
|Usage Mode||Transfer Method||Things to know|
|Web Browser Interface||Globus Web application||easy-to-use web interface; uses XSEDE web single sign-on; desktop download available|
|Command Line Interface||Globus Command Line Interface (CLI)||managed, reliable, and auto-tuned transfer; supports scripting; uses XSEDE web single sign-on; requires command line/scripting skill; requires a Python installation|
| ||scp or sftp||easy, familiar interface; must use local (resource-specific) username and password; no automatic failure recovery|
|Application Programming Interface||Globus SDK for Python||designed for application developers; popular Python language; simplifies code; use in Jupyter notebooks; requires Python programming skill; some advanced features are not exposed|
| ||Globus REST API||usable with most development environments; full access to all Globus service features; requires programming skill; requires familiarity with REST and OAuth 2.0|
Designed specifically for researchers, Globus provides fast, reliable, and secure file transfer among XSEDE resources or between an XSEDE resource and another system (such as a campus cluster, lab server, or personal computer). Beyond file transfer, Globus allows researchers to securely share data with collaborators, and to publish data for broader access as required by many data management plans.
Use Globus to:
- Move data between any two systems: in XSEDE, at your campus or laboratory, or at your home. Easily move small numbers of very large files (even terabyte-sized) as well as large numbers of small files (thousands or more at a time).
- Share data with your colleagues and collaborators, whether the data is stored at your laboratory on campus, your office, your home, or on selected XSEDE resources.
- You don't have to move your data to a public cloud service to share it.
- You have full control over what data is shared and with whom.
- You can share data with colleagues who aren't registered with XSEDE.
Globus services are available to everyone with an XSEDE User Portal account.
- Use the Globus interfaces described below, including the Web interface (used by most people), the command-line interface (used by advanced users and developers), and the REST APIs (used by application developers).
- Move files to and from any XSEDE resource on which you have a current allocation (permission to use).
- Download and install the Globus Connect Personal software on your personal computer to transfer files between it and XSEDE or other systems.
- Enable sharing on your Globus Connect Personal endpoints. (See below.)
- Download and install the Globus Connect Server software on any of your personal or campus systems and use it to provide multi-user file transfer access on those systems. File sharing on multi-user systems is limited to XSEDE service providers and is otherwise available with a Globus subscription.
Make sure that Globus knows you are registered with XSEDE. This will allow you to use any of the features listed above. First, open a web browser and navigate to the Globus website.
Click the "Login" button at the top-right of the page.
If you've recently used Globus, you'll be automatically logged in. See Add XSEDE to your Globus Profile.
If you haven't used Globus recently (or ever), you will be prompted to select an organization where you already have an account. Type or select "XSEDE" to specify that you want to use your XSEDE identity, as in Figure 1 below.
After clicking Continue, you will be asked to enter your XSEDE username and password. The web page will have the familiar XSEDE interface, will display the CILogon banner, and will have a web address beginning with "idp.xsede.org" as in Figure 2 below.
Once you enter your XSEDE username and password, you will be asked to provide multi-factor authentication using XSEDE's Duo service, as shown in Figure 3. (If you are not already enrolled with XSEDE's Duo service, you will instead be asked to enroll with Duo. Follow the on-screen instructions to enroll with XSEDE's Duo service.)
After you authenticate with XSEDE's Duo, you will be logged into Globus and will see the File Manager page shown in Figure 4.
If you've used Globus before, then you already have a Globus profile. Add your XSEDE identity to your Globus profile to enable the XSEDE features described here.
After you are logged into Globus, click the "Account" button on the left sidebar of the web page. You will see a list of the identities that Globus has associated with your profile. If XSEDE isn't already listed, click "Link Another Identity" on the right side of the web page. You will be asked to select an organization. Type or select "XSEDE" to specify that you want to use your XSEDE identity.
You'll be redirected to the XSEDE login page shown in Figure 2, where you can enter your XSEDE username and password and then give Globus permission to access your XSEDE identity information. Once you successfully authenticate to XSEDE, you will have access to all of XSEDE's Globus features as described below.
Once you are logged in to Globus, the File Manager interface (see Figure 4) is self-explanatory and you will probably understand how to use it without further help. If you need some hints or further details, the Globus website provides an excellent Getting Started guide. The next few sections explain how to find XSEDE's systems within Globus so you can move data to and from them, how to upload and download files, and how to set up your local system (personal or campus computer) for larger transfers that run in the background.
Globus lets you move files to and from collections. All XSEDE resources have Globus collections for the storage available to authorized researchers. All of XSEDE's collections have the word "XSEDE" in their names, making them easy to look up. Just click the Collection Search box at the top of the File Manager, type "XSEDE," and all of XSEDE's collections will appear, as shown in Figure 6. (Make sure the collection you pick is "owned" by XSEDE and displays the green "Greek columns" icon that signifies the collection is owned by a Globus-supporting organization.) XSEDE keeps its collections up-to-date in Globus on a routine basis. If the resource you are looking for is especially new or is no longer online and you can't find it in the list, use the XSEDE Help Desk to inquire about its availability. Select the collection you want to move files to or from, and the File Manager will allow you to browse the files and folders there.
You can easily create a Globus collection on your own computer so you can transfer lots of files or very large files to or from your computer in the background while you do other work. To do this, download and install the Globus Connect Personal software. (Globus Connect Personal is available for Windows, Mac, and Linux systems.) Globus Connect Personal runs as a background task, and you can start and stop it like any other application. While Globus Connect Personal is running, your system will be available in Globus's File Manager interface (but only to you!) and you will be able to start transfers between it and any other Globus collection, including XSEDE resources.
You can start and stop Globus Connect Personal on your system whenever you need to, even shutting your system down, hibernating, and moving from one network to another. Globus will automatically find your system and continue any active transfers whenever you are connected to a network and have the tool running. Globus Connect Personal is designed to work with most firewalls and NAT devices. It does not require administrative privileges to run on your system.
Many colleges, universities, research institutions, and laboratories offer Globus collections for their systems. To find a collection provided by an organization, try typing the name of the organization in the Collection Search box at the top of the File Manager page. You may be surprised how many organizations are already listed and available. To confirm the organization that provides a collection, look for the green "Greek columns" icon that signifies the collection is owned by a Globus-supporting organization. Also check the "Owner" field in the listing. You'll need to be authorized by the organization to access its collections.
Any Linux-based server system can host Globus collections by installing Globus Connect Server. On most servers, installing Globus Connect Server requires just a few commands. Once the software is installed, anyone with a local account on the server can move and share files between it and other Globus collections, including XSEDE resource collections. Information on Globus Connect Server is available on the Globus website.
XSEDE resources with the latest version of Globus software allow you to upload or download files with your web browser. When viewing a folder on an XSEDE resource in the File Manager, the Upload icon will be active. (See Figure 7.)
Click the Upload button and your browser will open a file selection dialog to choose the file(s) you want to upload. You may select multiple files, but you may not select folders. Once you've selected the files to upload, Globus will upload the files to the folder you are viewing in the File Manager. See Figure 8.
You can also download files to your computer. When viewing a folder in the File Manager, select a single file by clicking its name. (Download is currently only available for single files.) The Download button will activate if this feature is available on this XSEDE server. See Figure 9.
When you click the Download button, you might need to re-authenticate. Follow the prompts, and your file will download to your computer as shown in Figure 10. You may then open it in any application.
If the Upload and Download buttons do not activate as described above, the XSEDE resource you are using hasn't been updated to the latest Globus software. You can use Globus Connect Personal to transfer files to/from this resource (see above "Find Your Local Computers") or request an upgrade by submitting a support ticket.
You can share files from a Globus collection with your colleagues and collaborators. Your colleagues will receive an email message inviting them to log in to Globus and access the files you've shared with them. You can give them read-only access or both read and write access to your files. They can log in to Globus using their campus credentials (with InCommon-participating campuses), their XSEDE account (if they have one), or a new Globus account.
You can access Globus features using the Web browser interface (described above), a command-line interface, a Python software development kit (SDK), or a Web-standard REST API for non-Python applications.
If you use Globus's CLI, SDK, or REST APIs, you are encouraged to use the developer mailing list to engage directly with the Globus engineering team and other developers.
You can use Globus's command line interface (CLI) to script, or automate, your use of Globus services. For example, you might write a script that automatically uploads the output data produced by an application when the application completes.
Follow Globus's CLI installation instructions to install the CLI on your system. Once installed, follow Globus's CLI QuickStart guide to quickly become familiar with the CLI and how to use it in your work.
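The upload-on-completion scenario mentioned above might look roughly like the following sketch. The collection UUIDs, the paths, and the helper names run and upload_results are placeholders invented for illustration. The sketch assumes the CLI is installed and logged in, and that `globus transfer` accepts the `--recursive`, `--sync-level`, and `--label` options (check `globus transfer --help` on your system). By default it only prints the command it would run.

```shell
# Hypothetical collection UUIDs; substitute the real ones for your systems.
SRC_COLLECTION="${SRC_COLLECTION:-aaaaaaaa-0000-0000-0000-000000000000}"
DST_COLLECTION="${DST_COLLECTION:-bbbbbbbb-0000-0000-0000-000000000000}"

# Print the command when DRY_RUN=1 (the default here); otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Submit a recursive transfer of one output directory.
upload_results() {
    src_dir=$1
    dst_dir=$2
    run globus transfer \
        "${SRC_COLLECTION}:${src_dir}" \
        "${DST_COLLECTION}:${dst_dir}" \
        --recursive \
        --sync-level checksum \
        --label "results upload"
}

# Typical use: run the application, then upload only if it succeeded.
#   ./my_app && DRY_RUN=0 upload_results /scratch/run42/out /project/run42/out
```

Keeping the dry-run default makes it easy to confirm the source and destination before any data moves.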
Use the Globus Software Development Kit (SDK) for Python to write applications that use Globus services.
Follow Globus's SDK installation instructions to install the SDK in your Python development environment, and then follow the SDK Tutorial to quickly become familiar with the SDK and how to use it. The full SDK documentation includes complete reference information as well as helpful examples.
If you write applications in a language other than Python, or if you require advanced features that aren't provided in the SDK, Globus also provides Web-standard REST APIs with complete access to all Globus services. Access to Globus's REST APIs is controlled by OAuth 2.0 (OAuth2) access tokens, which are obtained by authenticating with Globus Auth, Globus's access management service.
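As a rough illustration of a token-authenticated REST call, the sketch below builds a request that lists recent tasks from the Globus Transfer API. The variable GLOBUS_TOKEN and the function name list_tasks are assumptions made for this example. Obtaining the access token from Globus Auth is not shown. The request is only printed unless DRY_RUN=0 is set.

```shell
# Base URL of the Globus Transfer REST API.
TRANSFER_API="https://transfer.api.globus.org/v0.10"

# List recent transfer tasks. GLOBUS_TOKEN is a hypothetical environment
# variable holding an OAuth2 access token obtained from Globus Auth.
# With DRY_RUN=1 (the default) the request is printed instead of sent.
list_tasks() {
    url="${TRANSFER_API}/task_list"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "curl -s -H 'Authorization: Bearer <token>' $url"
    else
        curl -s -H "Authorization: Bearer ${GLOBUS_TOKEN}" "$url"
    fi
}

# Usage:
#   DRY_RUN=0 GLOBUS_TOKEN=... list_tasks
```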
You are also encouraged to use the developer mailing list to engage directly with the Globus engineering team and other developers.
You may also use one of these command-line tools to transfer small (< 2 GB) files between XSEDE resources and/or your local machine. From Linux or Mac, you can run these commands directly from the terminal. From Windows, use your SSH client. Both scp and sftp are easy to use and secure.
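One way to apply the size guidance above in a script is sketched below. The function name pick_tool is made up for this example, and the 2 GB threshold is taken here as 2,000,000,000 bytes. The host name and username in the comments are placeholders.

```shell
# Threshold from the guidance above: command-line copy for files under 2 GB.
LIMIT_BYTES=2000000000

# Print "scp" if the file is under the limit, otherwise "globus".
pick_tool() {
    size=$(($(wc -c < "$1")))
    if [ "$size" -lt "$LIMIT_BYTES" ]; then
        echo scp
    else
        echo globus
    fi
}

# Example commands for a small file (username and host are placeholders):
#   scp results.dat username@login.example.org:/path/to/dir/
#   sftp username@login.example.org
```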
Data protection mechanisms are incorporated into much of the infrastructure and software used in XSEDE, and in most cases users are not required to take any special steps to ensure the integrity of their data. However, there are situations in which a user may wish to check that a transferred file has been copied correctly to the new system, or check that a file has not been changed since it was originally created. In these situations, checksums may be used to generate a cryptographic hash of one or more files. Cryptographic hashes have the property that they always produce the same value when operating on the same input data, so they can be saved and then compared against a recomputed hash to verify that a file is exactly the same as it was when the original checksum was generated. Even a single bit change in a multi-terabyte file will produce a different checksum value, so a successful checksum comparison provides a strong guarantee that data has not been altered in any way.
We recommend that users utilize the "sha256sum" command to create and check cryptographic hashes. This command should be available on most UNIX systems, as well as most XSEDE resources. To generate a checksum for a given file, run sha256sum with the name of the file (or files) you wish to check; the command will report the checksum of each file on a separate line, followed by the filename:
login1$ sha256sum filename1 filename2
9db55391e52a4a84944c6c9817ab8d0445547e8934d88d26032cc4747e196039 filename1
a6483e57971627e4e2403c6d3e38b205c70db2221f0b9fe46781e0af76192ef5 filename2
To save the generated checksums for comparison, redirect the output to a file:
login1$ sha256sum filename1 filename2 > checksums.out
You can then use the contents of this file to verify that the files are exactly the same on any given system, or on the same system at a later date, using the "-c" flag to sha256sum:
login1$ sha256sum -c checksums.out
filename1: OK
filename2: OK
Wildcards can also be used with the sha256sum command. For example, a user could generate checksums for all the files in a directory using the command:
login1$ sha256sum * > checksums.out
After transferring these files to another XSEDE resource, the user could verify that the data was transferred completely and correctly using the saved output file. If any of the files has been corrupted or incompletely transferred, the check will produce output like the following:
login1$ sha256sum -c checksums.out
filename1: FAILED
filename2: OK
sha256sum: WARNING: 1 of 2 computed checksums did NOT match
In this situation, the user should retransfer the files in question or restore from a backup copy of the data.
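When many files are involved, it can help to extract just the names that failed so they can be retransferred. The following is a minimal sketch. The function name failed_files is invented for this example, and it assumes the GNU sha256sum output format shown above, where failing entries are reported as "name: FAILED".

```shell
# Print the name of every file in the manifest that fails verification
# (including files that are missing or unreadable).
# Run it from the directory that holds the transferred files.
failed_files() {
    sha256sum -c "$1" 2>/dev/null | sed -n 's/: FAILED.*//p'
}

# Usage:
#   cd /path/to/transferred/files
#   failed_files checksums.out
```

An empty result means every file listed in the manifest verified successfully.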
In order to verify data integrity at a later date, you must have a record of the original checksum values to compare to the present value. Therefore, generate and save checksums when data is first created or before it is transferred into XSEDE, even if you do not immediately intend to perform verification against those checksums.
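For data organized in nested folders, a manifest covering the whole tree could be produced along the following lines. This is a sketch rather than an XSEDE-provided tool. The function name make_manifest and the example paths are assumptions. The manifest is written outside the data directory so that it is not hashed itself.

```shell
# Write a checksum manifest for every regular file under a directory,
# including subdirectories. Paths are recorded relative to that directory.
# The body runs in a subshell so the caller's working directory is unchanged.
make_manifest() (
    cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2
)

# Usage (paths are placeholders):
#   make_manifest /path/to/data > /path/to/data.sha256
#   cd /path/to/data && sha256sum -c /path/to/data.sha256
```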
All Globus transfers between two servers (including a Globus Connect Personal server on your own computer) automatically use advanced checksums to verify file integrity during and after each transfer. If an integrity check fails, Globus will automatically re-transfer the file. You can disable this feature using the "Transfer & Sync Options" interface (see Figure 14), but it is enabled by default. IMPORTANT: Browser uploads and downloads do not include integrity checks, so you should perform these yourself as described above.
The key principle to understand is that any given data transfer can only perform as well as the slowest component involved in the transfer. It may be a single slow network link, the disk drive on your laptop, or an overloaded file system on a supercomputer, but regardless, the overall transfer can perform no faster than the slowest component in the chain. It is important that you understand this when assessing data transfer performance.
For example, for most transfers into an XSEDE system coming from a desktop or laptop system, the slowest component will be the disk drive on the desktop/laptop. Individual hard drives are often the "weak link" in this scenario, with maximum speeds under 100MB/sec either due to the drive itself or the interface to the drive. If you are transferring files from a desktop or laptop system you should not expect to see performance better than 100MB/sec, or even lower on a sustained basis. If you see significantly less than that you may wish to check the performance of the network to which your laptop or desktop is connected - there are many freely available bandwidth-checking utilities on the web that can give you practical readings, and your local IT administrator can give you details of the maximum performance achievable on your local network.
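To put the 100MB/sec figure in perspective, a back-of-the-envelope estimate can be scripted as below. The function name transfer_seconds is made up for this sketch, and the 1 TB size is an illustrative value, not a measurement.

```shell
# Estimate transfer time in whole seconds (rounded up) from a size in
# bytes and a sustained rate in bytes per second.
transfer_seconds() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# 1 TB read from a single drive at about 100 MB/sec:
transfer_seconds 1000000000000 100000000   # prints 10000 (about 2.8 hours)
```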
Note that the limitation on drive performance is inherent to the drives themselves. Some XSEDE sites and systems, such as Wrangler, offer a "data dock" service allowing you to physically deliver drives to the site and have data loaded on-site. However, since the performance of the drives is the limiting factor in performance, it will rarely be the case that this is a time-efficient practice. It is most useful in situations where the data is located in a network-limited location or for reasons of security.
To aid in understanding the potential bottlenecks for transfer performance, we have compiled the following table, which shows the network, storage, and measured performance peaks for most of the largest XSEDE sites and systems. The measured performance numbers represent the highest performance seen in real-world tests transferring large files, so they can be treated as roughly the highest level of performance you should expect if you have a source storage system and network that can provide that level of performance on a consistent basis.
|Site||Resource||Server Count||Network Bandwidth/Node||Storage Bandwidth/Node||WAN Bandwidth||Max Total Bandwidth||Max Node Bandwidth||Max Measured BW|
All values are in Gigabits per second. Divide this value by eight to get the Gigabytes per second value. Since all sites have multiple Globus/GridFTP servers supporting each endpoint, we provide both the per-node network capacity and the total network capacity of the site's connection to the XSEDE/Internet2 network backbone. The maximum total and node bandwidth numbers represent the lowest value of the various inputs and thus show what the maximum possible performance is for a given endpoint.
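The two rules above (divide by eight, and take the lowest of the inputs) can be written as a short sketch. The function names and the sample figures of 40, 10, and 100 Gigabits per second are hypothetical and do not describe any particular XSEDE site.

```shell
# Convert gigabits per second to gigabytes per second (divide by eight).
gbps_to_gbytes() {
    LC_ALL=C awk -v g="$1" 'BEGIN { printf "%.2f\n", g / 8 }'
}

# The achievable rate is the smallest of the component rates.
bottleneck() {
    printf '%s\n' "$@" | sort -n | head -n 1
}

# Hypothetical node: 40 Gb/s network, 10 Gb/s storage, 100 Gb/s WAN.
gbps_to_gbytes "$(bottleneck 40 10 100)"   # prints 1.25
```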
It is also important to remember that these are shared resources, so it will rarely be the case that your transfer has all of the capacity available to it. More usually your transfer will be sharing bandwidth with other users, other applications, and other systems. Of particular salience is that the file systems on all XSEDE resources are shared, and the total storage and network bandwidth will be utilized by both data transfer applications and active applications at almost all times.
Beyond the capabilities of the network and storage systems involved, the single biggest factor in your transfer performance will be the size of the files being transferred, as there is time spent on the network setting up and tearing down the connection for each file transfer, and for high bandwidth networks like XSEDE this is relatively costly. For example, going from file sizes of 10MB to 1GB can improve your average transfer performance from a few MB/sec to over 1GB/sec. If you need to transfer a large number of files you will get the best performance by first bundling them into a single tar file and copying the single file.
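Bundling a directory before transfer might be done as in the sketch below. The function name bundle_dir and the paths are placeholders. The archive is left uncompressed here, which is a choice made for simplicity.

```shell
# Bundle a directory into one archive so it travels as a single large file.
# The archive stores paths relative to the directory's parent.
bundle_dir() {
    src_dir=$1
    archive=$2
    tar -cf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}

# Usage (paths are placeholders):
#   bundle_dir /scratch/run42/results /scratch/run42/results.tar
# After the transfer, unpack on the destination with:
#   tar -xf results.tar
```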
Some research projects—particularly those with team members at multiple campuses—need to maintain copies of the project's data on a campus system and on an XSEDE system. Keeping these copies synchronized is important, but it shouldn't be time-consuming. You can automate this synchronization using Globus. The steps to set this up are as follows.
- Confirm that you can transfer data between your campus and XSEDE.
- Determine the direction(s) in which synchronization must happen.
- Set up a repeating synchronization transfer task.
- Monitor the synchronization to confirm it's working properly.
To synchronize data between your campus and XSEDE data storage, you first need to be able to transfer data between your project's campus data storage and your project's XSEDE data storage. XSEDE uses Globus for data transfers, and many research campuses also have Globus set up for research use. If your campus doesn't already have Globus set up for your use, you can do it yourself.
Follow the instructions for transferring data with Globus, and make a copy of your project's data on your project's XSEDE storage (or on your campus storage, depending on where the data starts). Once you have a copy in both places, you're ready to set up synchronization.
The manner in which your research project produces new data will determine how the data should be synchronized between campus and XSEDE.
- If new data is produced only in one place (either on campus or on XSEDE) and you need it to automatically appear in the other place, you'll need synchronization in only one direction: either from campus to XSEDE, or from XSEDE to campus.
- If new data is produced both on campus and on XSEDE, you'll need bi-directional synchronization.
To synchronize in a single direction, you'll create a repeating task that synchronizes new data from the source to the destination. Your source is where new data is created, and your destination is where the copy is needed. For example, you can create a task that copies new data from your project's campus storage to your project's XSEDE storage every hour. The task will only transfer data that isn't already on the XSEDE storage.
For bi-directional synchronization, you'll create two repeating tasks: one in each direction. Each task will only transfer data that appears on the source and isn't already on the destination.
Assuming you've already used Globus to make a copy of your data, setting up a repeating synchronization task will be easy. The first step is exactly the same as for the first data transfer you performed. Log in to the Globus web app and locate the data source and the destination. (The source is where new data will appear, and the destination is where the new data will be copied to.) Figure 15 shows the setup for synchronizing a folder called "sequencer-data" on campus storage to XSEDE's Darwin resource at the University of Delaware.
After you've located the data source and destination, and clicked to select the folder to be synchronized, click Transfer & Timer Options between the two Start buttons. (The place to click is circled in red in Figure 15.) This will display the options for synchronization, task start time, and repeating, shown in Figures 16 and 17. Please, always check the "sync" box! (If you don't check this box, every file will be transferred every time the task runs, even if it's already on the destination.)
In the example in Figure 16, we set options to transfer new files and files with newer modification time on the source system, and to copy the modification time along with the file's contents. (You could instead transfer only new files, or new files plus files that have changed size, or new files plus files whose checksums at source and destination do not match.) We also set an option to terminate the transfer if a quota error is detected on the destination (i.e., you ran out of available storage). (Without that option, Globus will continue attempting to transfer data until the quota is increased, you manually cancel the task, or several days pass without any improvement in the error.)
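The "new files and files with newer modification time" rule can be illustrated with a local analogy. The sketch below compares two ordinary directories. It is not how Globus implements synchronization, and the function names are invented for this example. The `-nt` test is not strictly POSIX but is supported by bash, dash, and most other shells.

```shell
# Return success if the file should be copied: it is missing at the
# destination, or the source copy has a newer modification time.
needs_transfer() {
    [ ! -e "$2" ] || [ "$1" -nt "$2" ]
}

# List the files in src_dir (top level only) that a sync would copy.
files_to_sync() {
    src_dir=$1
    dst_dir=$2
    for f in "$src_dir"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        if needs_transfer "$f" "$dst_dir/$name"; then
            echo "$name"
        fi
    done
}
```

Files that already exist at the destination with the same or a newer timestamp are skipped, which is why a repeated run moves only what has changed.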
In Figure 17, we show how to set options to repeat the synchronization every two hours, ending at 11pm on November 19, 2022. (That's when this particular research project's XSEDE allocation ends.) You could instead specify a number of times to repeat the task (e.g., once per day for 60 days) or make it continue indefinitely until you manually delete the timer. You may also set the time of the first synchronization if you don't want it to begin immediately.
When you're finished setting your synchronization and repeating options, scroll up to the top of the File Manager window and click the Start button. Figure 18 shows how Globus will let you know that your task has started.
If you've determined that your data only needs to be synchronized in one direction, you're finished! You can move on to the next section on monitoring your synchronization.
If you've determined that you need bi-directional synchronization, you can repeat this process and create a second synchronization task in the other direction. The easiest way to do that is to remain on the File Manager screen, double-check that the Transfer & Timer Options are still the way you need them, and click the other Start button (the one with the arrow in the other direction). This will create the exact same synchronization task with the Source and Destination swapped.
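For readers who prefer the command line, a single bi-directional pass could be sketched with the Globus CLI as follows. This is a one-time submission rather than a repeating timer. The collection UUIDs, paths, and function names are placeholders, and the sketch assumes `globus transfer` supports `--recursive` and `--sync-level mtime`. By default it only prints the commands.

```shell
# Hypothetical collection UUIDs for the campus and XSEDE storage.
CAMPUS_COLLECTION="${CAMPUS_COLLECTION:-aaaaaaaa-0000-0000-0000-000000000000}"
XSEDE_COLLECTION="${XSEDE_COLLECTION:-bbbbbbbb-0000-0000-0000-000000000000}"

# Print the command when DRY_RUN=1 (the default here); otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy files that are new or have a newer modification time on the source.
sync_one_way() {
    run globus transfer "$1" "$2" --recursive --sync-level mtime
}

# Submit one task in each direction.
sync_both_ways() {
    campus_path=$1
    xsede_path=$2
    sync_one_way "${CAMPUS_COLLECTION}:${campus_path}" "${XSEDE_COLLECTION}:${xsede_path}"
    sync_one_way "${XSEDE_COLLECTION}:${xsede_path}" "${CAMPUS_COLLECTION}:${campus_path}"
}

# Usage:
#   DRY_RUN=0 sync_both_ways /project/sequencer-data /project/sequencer-data
```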
Your synchronization task will run according to the schedule you set. You'll receive an email notification each time the task runs. Each notification contains a summary of what was transferred. You can also view the task history in the Globus web app. As shown in Figure 19, click the Activity icon in the left side of the Globus web app, and you'll see recent activity.
Click the Timers tab at the top of the display, as shown in Figure 20, to view your active timers. Here, you can cancel a timer by clicking the trash can on the right side of the timer entry.
Click the arrow on the right side of any entry in the Timers list to see details, as shown in Figure 21.
The Timer Log tab on the details page displays a list of every time the task has run so far, as shown in Figure 22. For each task execution, you can view a summary of what was transferred.
It's easy to maintain a synchronized copy of your project's data in two or more locations. This can enable collaboration with research partners, facilitate automated data processing, or gather data from sources at multiple campuses. If you can transfer data between two locations—such as your campus and XSEDE—you can also keep the copies synchronized with a repeating synchronization task. Once set up, the synchronization will happen automatically until your repeating schedule ends or you cancel the task.