Today we released rsnapshot-parallel to GitHub. We've used rsnapshot for years, but recently, with the proliferation of virtual machines, we've noticed our backups running much longer, often well into the workday.
A quick look at our ZFS backup pool with iostat showed that the drives were usually running at only 30% utilization. Those are rookie numbers – we want it over 90% if we're pushing a backup onto the pool.
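For anyone who wants to check the same thing, something along these lines is all it takes (the interval and device names are whatever suits your pool):

    iostat -x 5
    # watch the %util column for the pool's member drives; ours sat around 30% during backups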
Looking more deeply, with strace, we saw rsync, the underlying program that rsnapshot uses, doing a cycle of “read/write/read/write” – while it was reading it wasn’t writing, and while it was writing it wasn’t reading. LWN took a look at rsync performance a while ago, but they found that CPU governors were their main problem. We tried replicating their results, and, well, today’s kernel must be much better at CPU scaling, so we had to look elsewhere.
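If you want to watch the pattern for yourself, attaching strace to a running rsync is enough; the PID lookup below is just one illustrative way to find the process:

    # attach to the most recently started rsync and trace only its read/write syscalls
    strace -f -tt -e trace=read,write -p "$(pgrep -n -x rsync)"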
The easy approach to solving this problem is to run as many rsync jobs at a time as the drives can handle, and there are a few tools out there for doing that. parasync runs its own rsync-control algorithm, but only in one direction, the wrong one for us. There has been some discussion on the rsnapshot-users mailing list about ways to do this: one person said he’d hacked GNU Parallel into rsnapshot, but we couldn’t find any released code, and most people suggested maintaining a separate config file for each host, which we didn’t want to do.
rsnapshot-parallel still uses one config file, /etc/rsnapshot.conf, in the same format as rsnapshot; we just change our cron job to call rsnapshot-parallel shortly before backup time, and it does the work of creating the individual config files, PID files, and rsnapshot entries for us. If we add a new VM to our cloud, we add it to rsnapshot.conf with our normal setup, and rsnapshot-parallel takes care of the rest the next time it comes around.
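As a purely illustrative example (the hostnames, paths, and schedule here are placeholders, not anything rsnapshot-parallel mandates), the two pieces look roughly like this:

    # /etc/rsnapshot.conf: per-host backup lines, tab-separated, exactly as with plain rsnapshot
    backup    root@vm01.example.com:/    vm01/
    backup    root@vm02.example.com:/    vm02/

    # crontab: call rsnapshot-parallel instead of rsnapshot shortly before backup time
    30 22 * * *    root    /usr/local/bin/rsnapshot-parallel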
So, back to figuring out how many rsyncs to run at a time. Our first approach was to see how bad it would be to run every rsync job at the same time (we have about 22 hosts in this backup), so we added a “bomb” mode (to fork-bomb all the backups at once). To our surprise, the system handled this just fine and scheduled the jobs rather nicely. Sure, not all of them got full bandwidth, but the important thing was that our backup drives were humming along at about 93-98% utilization. For some sense of scale, this system has a 6-core Xeon with 32GB of RAM (nothing fancy, really).
That approach is overkill, though: it certainly takes more memory (a.k.a. money) to run all those rsyncs at once, and the flood of random writes will likely slow the backups down a bit, so in rsnapshot-parallel the default knob staggers the backups, launching a new one every 10 minutes. The ideal launch interval will depend on the size of your backups, your disk speed, your available memory, and more. In a future version it probably makes sense to keep a certain number of rsyncs running dynamically, but this is working OK for now.
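For a sense of what the staggering amounts to, here is a minimal sketch (not rsnapshot-parallel's actual code, and the config directory is a made-up path):

    # sketch: launch one rsnapshot run per generated per-host config, ten minutes apart
    for cfg in /etc/rsnapshot-parallel.d/*.conf; do   # hypothetical location of the generated configs
        rsnapshot -c "$cfg" daily &                   # -c points rsnapshot at an alternate config file
        sleep 600                                     # drop the sleep and you have bomb mode
    done
    wait                                              # don't exit until the last backup finishes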
Many of our VM instances are small and don’t take much time to back up individually, but doing them sequentially, with rsync causing its own delays, meant that sometimes we wouldn’t see the last VM backup finish until the twelve-hour mark. Running rsnapshot-parallel in bomb mode, we had all of those small VMs backed up in under forty minutes, which is a huge win for us. The big-data servers took longer to come in, but those are mostly onsite anyway, so our Internet line cleared quickly and our cloud instances weren’t busy running backups during the workday.
This initial release of rsnapshot-parallel only runs in ‘daily’ mode, as we use filesystem-level snapshots for versioning. A future version could add cron jobs for hourly/weekly/monthly/yearly intervals. We run a pretty basic config file, and if your config file is heavily modified or does really fancy stuff, rsnapshot-parallel might not work out of the box. If you add features or fix problems, please send a pull request.