I’m in the process of migrating 10 TB of data from an NFS share to a CIFS share, and while talking over the details with my team lead, he mentioned that he would slap me if I did the transfer serially :-). With that motivation, I wrote prsync_transfer! He was joking, of course, but in all seriousness he is right. If you run rsync serially, the initial “receiving file list…” phase can take a long time to complete, especially if you have a lot of small files to transfer. After the jump I’ll show you the utility I wrote to get around this.
prsync_transfer is an executable script written in Ruby that rsyncs the contents of one folder into another in parallel. I lied a little bit there: it doesn’t do all of the rsync’ing in parallel, only the transfer of files and folders beginning with a-z and A-Z. The reason is that files and folders beginning with alpha characters usually make up the bulk of any transfer. By default, the rsync jobs run in this order:
- alpha characters
- non-alpha characters
- files and folders with a leading whitespace
- hidden files and folders
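To make the four groups concrete, here is a minimal sketch of how an entry name could be sorted into one of them. The `job_for` helper is hypothetical and is not taken from the utility itself, which works through rsync include and exclude patterns; the job numbers simply follow the order of the list above.

```ruby
# Hypothetical helper (not part of prsync_transfer): map an entry name
# to one of the four default jobs, numbered in the order listed above.
def job_for(name)
  case name
  when /\A[A-Za-z]/ then 1  # alpha characters
  when /\A\s/       then 3  # leading whitespace
  when /\A\./       then 4  # hidden files and folders
  else                   2  # everything else: non-alpha characters
  end
end
```

Under this sketch, `job_for("Documents")` lands in job 1 and `job_for("1somefilename")` in job 2.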
After each job finishes, you get a report on how each transfer did (i.e. whether or not it exited with 0).
The usage looks like this:
./prsync_transfer <-rsync_options> <source> <target> <log_location> (jobs_to_run EX: 1-3)
./prsync_transfer -avP ~/source ~/target ~/log
By default, the utility runs jobs 1-4 (numbered in the order listed above), but you can specify which jobs to run. The following will only rsync the files and folders beginning with alpha and non-alpha characters:
./prsync_transfer -avP ~/source ~/target ~/log 1-2
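For illustration, a job range like `1-2` could be turned into a list of job numbers roughly as follows. `parse_jobs` is a hypothetical helper, and accepting a single number such as `3` is an assumption of this sketch, not something the utility is known to support.

```ruby
# Hypothetical helper: turn "1-3" (or, by assumption, a single "2")
# into a list of job numbers. Defaults to all four jobs.
def parse_jobs(spec = "1-4")
  first, last = spec.split("-", 2).map { |n| Integer(n) }
  last ||= first
  unless first >= 1 && last <= 4 && first <= last
    raise ArgumentError, "jobs must fall within 1-4: #{spec}"
  end
  (first..last).to_a
end
```

With this sketch, `parse_jobs("1-2")` gives the alpha and non-alpha jobs only, and calling it with no argument gives all four.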
You can also use includes and excludes; just mind the quoting that is needed:
./prsync_transfer "-avP --exclude=Caches*" ~/source ~/target ~/log
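The quoting matters because the whole option string reaches the script as a single argument, and the quotes also keep the shell from expanding the `*`. A minimal sketch of how such a string could be split back into separate rsync arguments, using Ruby's standard `Shellwords` module (the `rsync_argv` helper is hypothetical and may not match how the utility actually builds its command):

```ruby
require "shellwords"

# Hypothetical helper: build the argument list for one rsync call
# from the quoted option string plus source and target.
def rsync_argv(options, source, target)
  ["rsync", *Shellwords.split(options), source, target]
end

# Shellwords.split only splits on shell word boundaries; it does not expand
# globs, so the exclude pattern is handed to rsync untouched.
```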
Each rsync job redirects its stdout and stderr to a separate file in the folder you specify as the log location, which makes the output easier to review.
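As a rough sketch of the per-job logging idea (not the utility's actual code), a command can be run with its output sent to its own log file via `Process.spawn`. This assumes the intent is to capture stdout and stderr; `run_logged` is a hypothetical name.

```ruby
# Hypothetical helper: run one command with stdout and stderr written
# to the given log file, and return its exit status.
def run_logged(command, log_path)
  pid = Process.spawn(*command, [:out, :err] => [log_path, "w"])
  Process.wait(pid)
  $?.exitstatus
end

# Example call (paths are placeholders):
# run_logged(["rsync", "-avP", "/source/", "/target/"], "/log/alpha.log")
```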
The information below is a bit outdated. Please see the “Update 2” note.
If you use this utility, you can see how far along the data transfer is by running “ps aux |grep [r]sync”. If you see that it’s on the e-h transfers, you know that all of the a-d transfers are done. If for some reason you need to stop it, you can go into the utility and comment out what has already been done. This is largely unnecessary, since rsync doesn’t re-transfer what has already been copied, but it may save you some time, as rsync still needs to compare the sending and receiving ends first.
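A side note on the `[r]sync` pattern: the brackets keep grep from listing itself, because the pattern matches the text `rsync` but not the literal text `[r]sync` that appears in grep's own command line. The sketch below shows the same effect with a Ruby regex standing in for grep; the sample process lines are made up.

```ruby
# Hypothetical helper: keep only the lines that match the [r]sync pattern.
def rsync_lines(lines)
  lines.grep(/[r]sync/)  # [r] is a one-character class, so this matches "rsync"
end

sample = [
  "user  101  rsync -avP /source/a* /target",  # made-up ps output
  "user  202  grep [r]sync"                    # the grep process itself
]

puts rsync_lines(sample)  # prints only the first sample line
```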
You can grab a copy of the utility from my GitHub page. I welcome suggestions for improvements, and if anyone has any questions I’d be happy to answer them.
Update: There was an issue with the non-alpha character job. It wasn’t rsync’ing files and folders whose names were longer than one character (ex: ‘1’ vs ‘1somefilename’). I’ve fixed it by modifying that job’s exclude list. Thanks to Mat X for pointing this out!
Update 2: One thing that bugged me about the parallel alpha character transfer implementation was that you had to wait until an entire batch of transfers was done before starting another (ex: starting the e-h transfers only after the a-d transfers are done). So I rewrote that part of the utility to start a new transfer as soon as another finishes (ex: starting the ‘e’ transfer as soon as one of the a-d transfers is done). This change should help speed up long transfers. The maximum number of parallel transfers at any one time is still 4, but I plan to add the ability to specify how many (with the default being 4). You can find the new version here, and the old version here.
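The scheduling change described in Update 2 can be sketched as a small worker pool: a fixed number of workers pull letters from a shared queue, so a new transfer starts the moment any worker is free. This is only an illustration under my own assumptions, not the code in the utility; `run_pool` and its `max_parallel` parameter are hypothetical names.

```ruby
# Hypothetical sketch: run jobs with at most max_parallel in flight,
# starting the next job as soon as any worker becomes free.
def run_pool(jobs, max_parallel = 4, &block)
  queue = Queue.new
  jobs.each { |job| queue << job }
  queue.close                      # pop returns nil once the queue is drained
  results = Queue.new

  workers = Array.new([max_parallel, jobs.size].min) do
    Thread.new do
      while (job = queue.pop)
        results << [job, block.call(job)]
      end
    end
  end
  workers.each(&:join)

  # Collect the [job, result] pairs into a hash keyed by job.
  Array.new(results.size) { results.pop }.to_h
end

# Example call (paths are placeholders; each block would run one rsync):
# letters = ("a".."z").to_a + ("A".."Z").to_a
# run_pool(letters) { |l| system("rsync", "-a", "--include=/#{l}*", "--exclude=/*", "/source/", "/target/") }
```

Compared with fixed batches, one slow letter no longer holds up the three workers that have already finished.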