Mac/Linux – Parallel Rsync Utility

I’m in the process of migrating 10 TB of data from an NFS share to a CIFS share. While I was talking over the details with my team lead, he mentioned that he would slap me if I did the transfer serially :-). With that motivation I wrote prsync_transfer! He was joking of course, but in all seriousness he is right. If you run rsync serially, the initial “receiving file list…” step can take a long time to complete, especially if you have a lot of small files to transfer. After the jump I’ll show you the utility I wrote to work around this.

prsync_transfer is an executable script written in Ruby that rsyncs the contents of one folder into another in parallel. I already lied a little bit: it doesn’t do all of the rsync’ing in parallel, it only runs the transfers of files and folders beginning with a-z and A-Z in parallel. The reason is that files and folders beginning with alpha characters usually make up the bulk of any transfer. The default order in which the rsync jobs run is:

  1. alpha characters
  2. non-alpha characters
  3. files and folders with a leading whitespace
  4. hidden files and folders
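
As a rough illustration of how top-level entries fall into those four jobs, here is a minimal Ruby sketch. It is not taken from prsync_transfer itself; `job_for` and `group_by_job` are hypothetical names, and treating only ASCII letters as “alpha” is my assumption here.

```ruby
# Hypothetical helper: decide which of the four jobs a top-level entry
# name belongs to, based only on its first character.
def job_for(name)
  case name
  when /\A[a-zA-Z]/ then 1  # alpha characters (the part run in parallel)
  when /\A\s/       then 3  # leading whitespace
  when /\A\./       then 4  # hidden files and folders
  else                   2  # everything else: digits, punctuation, ...
  end
end

# Group a list of entry names by job number, in job order.
def group_by_job(names)
  names.group_by { |name| job_for(name) }.sort.to_h
end

entries = ["Applications", "1thisisatest", " scratch", ".ssh", "docs"]
group_by_job(entries).each do |job, names|
  puts "job #{job}: #{names.inspect}"
end
```

Note that a name like “1thisisatest” lands in the non-alpha job because only the first character is checked, not the length of the name.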

After each job finishes, you will receive a report on how its transfer went (i.e. whether rsync exited with 0 or with a non-zero status).

The usage looks like this:

./prsync_transfer <-rsync_options> <source> <target> <log_location> (jobs_to_run EX: 1-3)

./prsync_transfer -avP ~/source ~/target ~/log

By default, the utility runs jobs 1-4, but you can specify which jobs to run. The following will only rsync the files and folders beginning with alpha and non-alpha characters (jobs 1 and 2):

./prsync_transfer -avP ~/source ~/target ~/log 1-2
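
To make the jobs argument concrete, here is a small sketch of how a range such as “1-2” could be turned into a list of job numbers. `parse_jobs` and `VALID_JOBS` are hypothetical names, not the utility’s actual code, and accepting a single number like “3” is an assumption on my part.

```ruby
# Hypothetical parser for the optional jobs argument, e.g. "1-3" or "2".
VALID_JOBS = (1..4).freeze

def parse_jobs(spec)
  # No argument given: fall back to the default of all four jobs.
  return VALID_JOBS.to_a if spec.nil? || spec.empty?

  match = /\A(\d)(?:-(\d))?\z/.match(spec)
  raise ArgumentError, "bad job spec: #{spec.inspect}" unless match

  first = match[1].to_i
  last  = (match[2] || match[1]).to_i
  unless VALID_JOBS.cover?(first) && VALID_JOBS.cover?(last) && first <= last
    raise ArgumentError, "jobs must be within 1-4: #{spec.inspect}"
  end

  (first..last).to_a
end

p parse_jobs(nil)    # all four jobs
p parse_jobs("1-2")  # alpha and non-alpha only
```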

You can also do includes and excludes, just mind the quoting that is needed:

./prsync_transfer "-avP --exclude=Caches*" ~/source ~/target ~/log

Each rsync job redirects its stdout and stderr to separate files in whatever folder you specify for the log location, which makes the output easier to review. Example:

a.log
a_error.log
b.log
b_error.log
non-alpha_character.log
non-alpha_character_error.log

The information below is a bit outdated. Please see “Update 2” note.

If you use this utility you can check how far along the data transfer is by running “ps aux | grep [r]sync”. If you see it’s on the e-h transfers, you know that all of the a-d transfers are done. If for some reason you need to stop it, you can then go into the utility and comment out the batches that have finished. This is largely unnecessary, since rsync doesn’t re-transfer what has already been copied, but it may save you some time because rsync still needs to compare the sending and receiving ends first:

#start_parallel_rsync(A_TO_D)
start_parallel_rsync(E_TO_H)
start_parallel_rsync(I_TO_L)

You can grab a copy of the utility from my GitHub page. I welcome suggestions for improvements, and if anyone has any questions I’d be happy to answer them.

Update: There was an issue with the non-alpha character job. It wasn’t rsync’ing files and folders whose names were longer than one character (ex: ‘1’ was transferred but ‘1somefilename’ was not). I’ve fixed it by modifying that job’s exclude list. Thanks to Mat X for pointing this out!

Update 2: One thing that bugged me about the parallel alpha character transfer implementation was that you had to wait until an entire batch of transfers was done before starting another (ex: starting the e-h transfers only after all of the a-d transfers were done). So I rewrote that part of the utility to start a new transfer as soon as another one finishes (ex: starting the ‘e’ transfer as soon as one of the a-d transfers is done). This change should help speed up long transfers. The maximum number of parallel transfers running at any one time is still 4, but I plan to add the ability to specify how many (with the default being 4). You can find the new version here, and the old version here.
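
The “start the next transfer as soon as a slot frees up” idea can be sketched in Ruby with a shared queue and a fixed number of worker threads. This is a simplified illustration under my own assumptions, not the code from the new version: `run_pool` is a hypothetical name, and the block passed to it stands in for spawning rsync and returning its exit status.

```ruby
# Run every job with at most max_parallel running at once. Each worker
# takes the next job from the queue as soon as it finishes its current
# one, so no job waits for a whole batch to complete.
# Returns a hash of job => result (e.g. an exit status).
def run_pool(jobs, max_parallel = 4, &work)
  queue = Queue.new
  jobs.each { |job| queue << job }
  queue.close  # once drained, pop returns nil and the workers stop

  results = Queue.new
  workers = Array.new(max_parallel) do
    Thread.new do
      while (job = queue.pop)
        results << [job, work.call(job)]
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }.to_h
end

# Stand-in work: sleep briefly and report success. A real job would run
# rsync for the entries starting with this letter and return its status.
statuses = run_pool(("a".."h").to_a) do |letter|
  sleep(rand * 0.05)
  0
end
statuses.sort.each { |letter, code| puts "#{letter}: exited with #{code}" }
```

Threads are a reasonable fit for this sketch because each worker would spend nearly all of its time waiting on a child process rather than running Ruby code.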

4 thoughts on “Mac/Linux – Parallel Rsync Utility”

    • Hi Mat,

      Thanks for your comment, and you’re right! During my tests I named my non-alpha folders as single numbers and not like this, “1thisisatest”. I will work on fixing this.

  1. The script as it is today requires that the source be a directory. It would be nice if the source could be a remote system. When syncing from one location to another, using remote mount points is not a good idea, or not possible depending on the situation.

    rsync example:
    rsync -avz system1:/vol/test /test

    This uses the area /vol/test on the system “system1” as the source.

    Thanks!

    • Hi Lewis,

      I completely agree. It’s actually been on my list of things to try and do. I initially didn’t do it because I didn’t need that functionality. There’s also the fact that supporting it would require some form of automatic authentication to be set up beforehand (SSH keys, Kerberos tickets, etc.). Otherwise the script would be prompting for a password frequently. However, I think it’s time I begin working on implementing this functionality.

      Thanks for the feedback!
      Riley
