I'm currently trying to copy a pool over to a new set of drives, and it's taking a really long time, as I keep having to interrupt and resume the transfer in order to maintain any kind of decent speed.
I started out using the following command:
zfs send -Rw zbackup@snapshot | zfs receive -usdv zbackup2
It seemed to work fine for the most part, with larger snapshots averaging around 200 MB/s, while smaller ones had more variable speeds (understandable, since they may not have enough large files to get fully up to speed). I'm sending between two directly attached pools (no networking involved), and the disks are in Thunderbolt 3 enclosures, so bandwidth shouldn't be an issue.
However, while transferring a dataset containing large media files (set to recordsize=1M), I noticed some large snapshots were taking an excruciatingly long time, way down in the 3 MB/s range. Running zpool iostat -v shows each pool only reading or writing around 30 MB every 10 seconds or so.
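Next time I kick one of these off I might put pv in the middle of the pipeline so I can watch the stream rate directly rather than inferring it from iostat. Something like this (assuming pv is installed; the flags just print the current rate, average rate and bytes transferred):

zfs send -Rw zbackup@snapshot | pv -rab | zfs receive -usdv zbackup2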
This was bad enough that I interrupted it and tried continuing the send from the last received snapshot using:
zfs send -RwI @<last_snapshot> zbackup/dataset@snapshot | zfs receive -usdv zbackup2
Speed would immediately pick up again, but only until I hit another slow send.
The issue is bad enough that I'm actually finding it faster to use a script to perform the sends one at a time (sending each snapshot in turn, incrementally from the previous one, if any). I'm still hitting some unexpectedly slow sends even with this method, but at least it seems to recover on the next one.
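For reference, the script is doing roughly the following for each dataset (a simplified sketch; error handling and property handling are omitted, and the dataset name is just a placeholder):

#!/bin/sh
# Simplified per-snapshot loop: send the oldest snapshot in full, then each
# later one incrementally from the one before it.
src="zbackup/dataset"
prev=""
for snap in $(zfs list -H -t snapshot -o name -s creation -d 1 "$src"); do
    if [ -z "$prev" ]; then
        zfs send -w "$snap" | zfs receive -usdv zbackup2
    else
        zfs send -w -i "$prev" "$snap" | zfs receive -usdv zbackup2
    fi
    prev="$snap"
done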
What I'm struggling to figure out is why the performance is so unreliable. All of the drives involved are CMR, so I'm not running into SMR cache issues, and each should average around 150 MB/s read/write for the larger files. Both pools have plenty of free space, so it shouldn't be a low-free-space or excessive-fragmentation issue either: the source pool is a backup target, and while zpool list does report some fragmentation (27%), that has never affected performance before, and the pool still has well over 25% of its capacity free (~2 TB of an 8 TB pool).
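(Those numbers come from something along these lines, just reading the frag and cap columns for both pools:)

zpool list -o name,size,alloc,free,frag,cap zbackup zbackup2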
While running zpool iostat -v I'm not seeing any unusual differences in read/write activity between the drives – they are all read from and written to at pretty much the same speed. The slowness appears to come from long periods spent sending/receiving nothing at all (or doing so at a very slow rate). I know a send stream can contain a lot of data that isn't record data, since it also covers other changes made to the dataset (e.g. deletions), but I wouldn't expect that to add this much overhead.
The weirdest thing is that it seems to be pure hit and miss whether a transfer is slow or not – I can run one of the commands above (or similar) and it will be really slow, but if I cancel it and run it again it will go at full speed, and zpool iostat -v confirms it (so the speed-up isn't down to caching; these datasets are too big for that anyway).
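When it next happens I'm also tempted to isolate the send side from the receive side, e.g. with the same pv trick but dumping the stream instead of receiving it; if that alone is slow, at least I'll know the bottleneck is on the read/send side rather than the destination pool:

zfs send -RwI @<last_snapshot> zbackup/dataset@snapshot | pv -rab > /dev/null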
Could it be a metadata issue, with ZFS struggling to locate the data to send quickly enough? Is there anything I can do to confirm that, or to speed it up? Then again, the pool I'm replicating from is itself a backup pool: it basically only receives incremental snapshots from a single source, and those are exactly what I'm now trying to send as a replication stream, so I would expect their data to be at least somewhat grouped together, which makes "hard to find" seem odd.
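If it is something like that, I'm guessing the request-size and latency histograms from zpool iostat might show it (lots of small, scattered reads and/or long disk waits on the source pool while a slow send is running), e.g.:

zpool iostat -r zbackup 10
zpool iostat -w zbackup 10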