mk01 wrote: @ghaskins: when you say "I think rsync has better unreliable connection recovery options than zfs send", I would not agree.
How so? Note that I said specifically "unreliable connection recovery options". Say you have 64GB of differential data to send (a common value for me when I unload my digital cameras). If you send 63GB with zfs send and then the connection dies, you have to resend the entire 64GB on restart (IIUC). With rsync, it simply picks up where it left off and sends the remaining 1GB.
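As a minimal sketch of that resume behavior (the paths and hostname are hypothetical; --partial tells rsync to keep partially transferred files so a rerun can continue them):

    # Hypothetical paths; --partial keeps any partially transferred file on
    # the receiver instead of deleting it.
    rsync -a --partial /tank/photos/ backup-host:/backup/photos/

    # If the link drops mid-transfer, rerunning the same command resumes
    # from the partial file rather than resending everything from scratch.
    rsync -a --partial /tank/photos/ backup-host:/backup/photos/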
Since my use case is across a 4Mb/s internet connection, where bandwidth is limited (~1.5 days to transfer 64GB) and interruptions are fairly probable, I see this as a giant advantage.
Aside from the inconvenience of restarting a 1.5-day transfer over and over, this ignores the issue many have reported of zpools becoming corrupted when a zfs-recv is interrupted. It also ignores the fact that zfs-send requires you to keep your snapshots synchronized on both sides of the link, which can be problematic.
mk01 wrote: Maybe it looks scary on the first full send (a few-TB filesystem with only a final snapshot)...
I am not really sure what you are getting at here. zfs-send and rsync are similar in that both have a relatively large initial transfer, followed by relatively small incremental updates governed by the delta in the dataset. That is not "scary" on either front. It's just physics we have to live with.
mk01 wrote: ...but otherwise, if you think about what rsync does versus what zfs snapshot followed by send -I does, they are incomparable.
If you change one file on a filesystem with a few million files, many of them part of a package where a homogeneous state across all files is needed (like dBase databases, MySQL databases, even an iTunes library, iPhoto, Aperture, Logic), rsync would run for hours just to send one new file.
I am not sure you understand how rsync works. It only sends the delta, just like zfs-send -I. If there were changes, only the subset of data within the files that changed is transmitted. ZFS has the theoretical advantage that it already knows the delta in the dataset ahead of time; rsync has to discover it, using things like mtime/size comparisons and checksums. In practice, however, it is very efficient and fast.
For instance, I have 3TB of data across approximately 3 million files, and a typical rsync runs for about 5 minutes, in which it figures out there are only a few dozen megabytes worth of changes to send. If the delta is larger, it takes longer in proportion to the amount of data to send, gated by my internet bandwidth. But then again, so does zfs-send.
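For comparison, the two incremental paths look roughly like this (pool, dataset, and host names are hypothetical):

    # zfs-send incremental: requires that both sides already share the
    # @yesterday snapshot, and sends everything between it and @today.
    zfs snapshot tank/media@today
    zfs send -I tank/media@yesterday tank/media@today | \
        ssh backup-host zfs recv backup/media

    # rsync incremental: no snapshot bookkeeping; the delta is computed on
    # the fly from file metadata and checksums.
    rsync -a --delete /tank/media/ backup-host:/backup/media/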
mk01 wrote: And if a transaction occurs in between (one which needs to update multiple files), you will never be able to reconstruct a consistent state.
A snapshot is a snapshot: it freezes the state and happens in a matter of seconds, and the same goes for send.
Now this is actually a valid point. Being able to atomically snapshot the filesystem gives you a much higher probability of maintaining file-level consistency for changes in flight. For me, this is not an option b/c I can't get reasonable performance out of Zevo CE 1.1.1 and was forced to fall back on JHFS+ (for now). This isn't a big deal for me either, though, because the data I care the most about are static media files (photos and videos of my family) that wouldn't be subject to in-flight updates.
mk01 wrote: Once the snapshot is received and visible on the other side, you are happy and sure you are fine.
"Once .. received" are the operative words. See my above comments about unreliable connection restart above where I stated why I think this falls apart with zfs-send.
mk01 wrote: You can't say that with rsync. It's impossible. Just go through the various --delete options: --delete-before, --delete-after, --delete-during, --delete-delayed... it's scary. Not that those options shouldn't be implemented, but they are the result of all the dramatic risks they try to avoid.
It's not impossible; it's quite simple, actually. I simply set rsync to perform all of its operations "in place" and to synchronize deletes, etc. The client-side driver script then zfs-snapshots the backend storage at the conclusion. Therefore, if the rsync dies in the middle, the snapshot is never taken, since the client-side script doesn't complete.
In practice, this means that it doesn't really matter in what order you delete (before, after, during, etc.) or update files. You only care that each snapshot was generated when an rsync job believed it had successfully completed a replica. If an update dies in the middle, no snapshot is created, and the next connection will just continue where it left off; see the sketch below.
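A minimal sketch of that driver script, assuming hypothetical host, pool, and dataset names (the real thing would want more error handling):

    #!/bin/sh
    # Hypothetical names throughout; a sketch of the client-driven approach.
    set -e

    # Mirror local data onto the backup server, updating files in place and
    # propagating deletes. If this dies mid-run, set -e aborts the script
    # here and no snapshot is taken.
    rsync -a --inplace --delete /tank/media/ backup-host:/backup/media/

    # Only reached after rsync reports success: freeze the now-complete
    # replica with a ZFS snapshot on the backup side.
    TS=$(date +%Y%m%d-%H%M%S)
    ssh backup-host zfs snapshot backup/media@$TS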
You might argue that zfs-send preserves data integrity end-to-end, but I speculate that with ZFS and ECC RAM on both ends of the link, an rsync transfer should not be susceptible to introducing errors into your data either, since TCP/SSH verify transmission integrity.
The biggest argument I can see for using zfs-send over rsync is if you are using dedup and/or compressible files, as they are likely to retain these properties in transmission. For me, I don't use dedup and my files are mostly incompressible anyway. However, my link is relatively slow, the data set is large, and the potential for interrupted transfers is relatively high. Using rsync over zfs-send is a no-brainer (for me), even once zfs-send becomes a viable option. But I can see someone in an environment where the connection is more reliable, bandwidth is more abundant, and dedup is in use going the zfs-send route. It's just not for me.
Based on all this, and your valid point about the atomic-snapshot benefit, the ideal solution (for me, at least) would be to snapshot the local filesystem, rsync from that snapshot to the backup server, snapshot the backup server, and then delete the local snapshot; something like the sketch below. I think this would give me the best of both worlds and avoid all of the problems I mentioned with zfs-send.
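Assuming a ZFS source again, with hypothetical names, and a platform that exposes the .zfs/snapshot directory, the combined workflow might look like:

    #!/bin/sh
    # Sketch of the combined approach; all names are hypothetical.
    set -e
    TS=$(date +%Y%m%d-%H%M%S)

    # 1. Atomically snapshot the local filesystem so rsync reads a frozen,
    #    consistent view of the data.
    zfs snapshot tank/media@backup-$TS

    # 2. Replicate from the snapshot directory rather than the live
    #    filesystem. An interrupted run simply resumes on the next run.
    rsync -a --inplace --delete \
        /tank/media/.zfs/snapshot/backup-$TS/ backup-host:/backup/media/

    # 3. Freeze the successful replica on the backup server.
    ssh backup-host zfs snapshot backup/media@backup-$TS

    # 4. Discard the local snapshot now that it has been replicated.
    zfs destroy tank/media@backup-$TS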
Kind Regards,
-Greg