hard disk optimisation for performance purposes


hard disk optimisation for performance purposes

Post by grahamperrin » Tue Oct 16, 2012 1:59 pm

As someone asked about fragmentation, I'll share some recently bookmarked information.

[zfs-discuss] resilver = defrag? (2010-09-09) began with three questions:

Orvar Korvar wrote:A) Resilver = Defrag. True/false?

B) If I buy larger drives and resilver, does defrag happen?

C) Does zfs send zfs receive mean it will defrag?


http://www.mail-archive.com/zfs-discuss ... 42045.html (2010-09-10):

Darren J Moffat wrote:… take a step back and ask "why are you even worried about fragmentation ?" "do you know you have a pool that is fragmented?" "is it actually causing you a performance problem?"


http://www.mail-archive.com/zfs-discuss ... 42055.html (2010-09-11):

Richard Elling wrote:It really depends on your definition of "fragmentation." This term is used differently for various file systems. The UFS notion of fragmentation is closer to the ZFS notion of gangs.


http://www.mail-archive.com/zfs-discuss ... 42105.html (2010-09-13):

Richard Elling wrote:… I suggest deprecating the use of the term "defragmentation."


http://www.mail-archive.com/zfs-discuss ... 42158.html (2010-09-15):

Richard Elling wrote:… Several features work against HDD optimization. Redundant copies of the metadata are intentionally spread across the media, so that there is some resilience to media errors. Entries into the ZIL can also be of varying size and are allocated in the pool -- solved by using a separate log device. COW can lead to disk [de]fragmentation (in the Wikipedia sense) for files which are larger than the recordsize.

Continuing to try to optimize for HDD performance is just a matter of changing the lipstick on the pig.


More recently:

[illumos-Discuss] Block pointer rewrite? (2011-02-07):

Haudy Kazemi wrote:I have a scriptable idea that offers a way to re-balance data on VDEVs without using block pointer rewrite and without doing a full backup/restore. I haven't tested this yet. Comments welcome. …


https://groups.google.com/a/zfsonlinux. ... cC7WrQrE8J (2012-05-09):

Christ Schlacta wrote:I posted a request to the list for offline block pointer rewrite, which in theory at least should be way simpler than regular block pointer rewrite, but noone even seems to have noticed it.


https://groups.google.com/a/zfsonlinux. ... YnY9uKHtwJ (2012-09-28):

mblahay wrote:Is this block-pointer-rewrite on any of the open source to-do lists? … get the add disk and rebalance functionality to work …


What free space thresholds/limits are advisable for 640 GB and 2 TB hard disk drives with ZEVO ZFS on OS X? (2012-09-28)

block pointer rewrite (BPR)

Post by grahamperrin » Tue Oct 16, 2012 8:38 pm

Thanks to irc://irc.freenode.net/#illumos we identified a significant e-mail about block pointer rewrite (BPR):

[discuss] Re: [developer] Block pointer rewrite? (2012-01-09):

Matt Ahrens wrote:… I implemented most of BP rewrite several years back, at Sun/Oracle. I don't know what plans Oracle has for this work, but given its absence in S11, I wouldn't bank on it being released. There are several obstacles that they would have to overcome. Performance was a big problem -- like with dedup, we must store a giant table of translations. Also, the code didn't layer well; many other features needed to "know about" bprewrite. Maintaining it would add significant cost to future projects. …


I'm told that BPR was mentioned at the 2012 illumos ZFS Day.

Key phrases from the chat in IRC: The Holy Grail, tilting at windmills, etc. :-) So whilst I look to illumos for upstream discussions in general, I do not expect to see BPR put forward as an illumos project. The chat was logged: http://echelog.com/logs/browse/illumos/1350424800

Re: hard disk optimisation for performance purposes

Post by grahamperrin » Tue Oct 16, 2012 9:48 pm

Rules of thumb (based on what I read)

With rotational hard disks: aim to never use more than eighty percent of the capacity of a pool. zpool list is our friend.
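
A minimal check, assuming a hypothetical pool named tank (substitute your own pool name):

  # the CAP column reports the percentage of pool capacity in use;
  # on rotational disks, aim to keep it at or below 80%
  zpool list tank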

If you use much more than eighty percent – so much more that performance is unacceptably reduced – and if you later find that reducing usage does not yield the required improvement in performance:

  • be prepared to use at least one additional disk for a suitably large separate pool, and to zfs send and receive whilst that new pool is fresh (a sketch follows below).

If the file system to be sent is relatively large:

  • be prepared to wait a relatively long time for the receive to complete without interruption.
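
A rough sketch of that send and receive, with made-up names (oldpool, newpool, a snapshot called migrate), assuming the -R and -d options are available in your zfs build:

  # recursively snapshot the file systems to be moved
  zfs snapshot -r oldpool@migrate
  # replicate everything – descendent file systems, snapshots and
  # properties – into the freshly created pool
  zfs send -R oldpool@migrate | zfs receive -d -F newpool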

Please note: not all usage above the eighty percent mark leads to reduced performance. No cause for alarm.

This topic is for cases where write patterns, with extremely little free space remaining, lead to performance issues that are both (a) unacceptable and (b) persistent long after free space is regained. These cases are probably rare.

Personally

I have a single-disk 2 TB pool that I happily pushed (for test purposes) to nearly zero bytes free; since then I have kept it around 96 percent full. With this pool, file system performance is certainly far from optimal, but that doesn't bother me because I use the disk infrequently.

Looking ahead

illumos gate - Feature #2605: Partial/incremental ZFS send/receive - illumos.org – work in progress, celebrated around the tenth anniversary of ZFS.

Link

Post by grahamperrin » Sat Apr 06, 2013 3:18 am

Some discussion of fragmentation, workarounds and benefits at the tail end of post viewtopic.php?p=4582#p4582 under performance degradation over time …

Re: hard disk optimisation for performance purposes

Post by raattgift » Sat Apr 06, 2013 5:27 am

WRT (a)-(d) in your IRC log http://echelog.com/logs/browse/illumos/1350424800

+ (e) horrible layering, putting something that migrates blocks around *under* ZFS

As you know, this can be done now with Core Storage: because CS LVs end up being assigned /dev/diskNN, and because ZEVO is not bright enough to look deeper, ZEVO will just treat an LV like any other block device and put a GPT label on it when it is used in a command like zpool create.
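
For example – a sketch only, where the LVG UUID, the LV name and size, and the resulting /dev/disk5 are all hypothetical:

  # create a new logical volume in an existing logical volume group
  diskutil cs createVolume <lvgUUID> jhfs+ ForZFS 128g
  # suppose it appears as /dev/disk5; ZEVO treats it as an ordinary
  # block device and writes its own GPT label when the pool is created
  zpool create tank /dev/disk5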

As you also know, this confuses Core Storage such that it's hard to modify the LV in any way. You may not know that you have to write several blocks of zeros into the start of the LV's /dev/diskNN device then newfs_hfs the /dev/diskNN before doing a CS operation like deleteLV, resizeLV (which is "unsupported"), or resizestack (which is even more "unsupported", but hugely useful when layering mutually-ignorant disk block management systems).
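
The pattern is roughly this (disk number and LV UUID are hypothetical; triple-check the disk number before running dd):

  # clobber the start of the LV so Core Storage no longer balks
  dd if=/dev/zero of=/dev/disk5 bs=128k count=100
  # lay down a plain HFS+ file system that CS is happy to manipulate
  newfs_hfs /dev/disk5
  # CS operations on the LV should now succeed, for example:
  diskutil cs resizeLV <lvUUID> 10g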

If Apple decides to support arbitrary GPT labels in LVs, you could just grow an LV and zfs will Do The Right Thing if the pool autoexpand property is "on" or otherwise if "zpool expand" is run afterwards.
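
If that ever became possible, the sequence would presumably be something like the following (pool name and size are made up):

  # allow the pool to grow when its underlying device grows
  zpool set autoexpand=on tank
  # grow the LV underneath the pool
  diskutil cs resizeLV <lvUUID> 256g
  # check whether the extra capacity has been picked up
  zpool list tank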

I have (in household "production") a pool which is constructed like this.

I am not entirely happy that the boot volume is not mirrored. I may do something about that at some point, as I have done with my favourite household workstation. (The mirroring there uses SoftRAID's excellent software, but that does not provide data integrity or the highly flexible online dataset/mountpoint management that ZFS does.)

LVG 1

PVs:
disk0 (ssd, 250GB)
disk2 (rotating rust, 1.5TB)

LV A: JHFS+ (boot volume, 400G)
LV B: GPT label, ZFS in partition 2. (128.8 GB)
LV C: JHFS+ (mostly empty) (600G)

Crucially, this is 10.8.3 on a platform that supports Fusion Drives (a late 2012 Mac Mini stuffed with memory).
It therefore reports:

localhost kernel[0]: thr 0xffffff802d41caa0 Composite Disk alg="bloomclock" unit_nbytes=131072

making this a not-totally-silly LVG. Not all kernels support Fusion Drive; most older hardware won't boot kernels built for newer hardware; most newer hardware won't boot kernels built for older hardware; and the kernel itself has a key that unlocks the CPDK MigrCS algorithms. This is mostly, but not entirely, driven by market segmentation; there is checksummed metadata in CPDKs that will eat older processors for breakfast. :/

LVG 2

PV:
disk1 (ssd, 256GB)

LV Z: GPT label, ZFS in partition 2 (128.8 GB)
LV Y: GPT label, ZFS in partition 2 (200 MB), a SLOG for a rotating rust pool
LV X: similar to LV Y
LV W: similar to LV Y
LV V: similar to LV Y
LV U: 64 GB (a cache vdev, definitely oversized, due for shrinking to ca. 10G)
LV T: 50 GB (a cache vdev, unused, because the IOKit volume UUID does not persist properly and so it is always UNAVAILABLE at boot; it is also oversized, due for deletion)

I have a pool with one storage vdev; it is mainly used for my $HOME. Its layout:

mirror-0
GPTE_ ... at diskLVBs2
GPTE_ ... at diskLVZs2
logs
GPTE_ ... at diskLVWs2
cache
GPTE_ ... at diskAnotherSSD

I will probably ditch the L2ARC or give it a small slice. The mere existence of big L2ARC cache vdevs can lead to having too little space in ARC, which devastates performance. Additionally, I've seen zero read slowdowns on the pool; the occasional scrub, the COW nature, and the use patterns mean that pretty much all reads on that side of the mirror are serviced from the ssd PV.

The working set of the boot volume also largely fits in the ssd PV, so the rotating rust PV tends to be in power-saving mode almost always. If a block on the composite disk side of the mirror is on the rotating rust, then the read is usually serviced from the other side of the mirror first (this is the way vdev_mirror.c works), and the 128k chunk the read was waiting on will in due course be migrated onto the ssd PV. It's neato.

I am undecided about the SLOG for this pool.

Operationally, I have grown the mirror vdevs, and will do so again after changing LVU and LVT to free up space on the non-composite LVG.
It'll involve:

1. Verify that I have a bootable backup and usable backups for all data.
2. zpool offline pool LVU; zpool remove pool LVU
3. dd if=/dev/zero bs=128k count=100 of=/dev/diskLVU
4. same as 3 for LVT
5. newfs_hfs /dev/diskLVU; newfs_hfs /dev/diskLVT
6. diskutil cs deleteLV LVT
7. diskutil cs resizeLV LVU 10g
8. zpool add pool cache /dev/diskLVU
9. zpool detach pool LVZ
10. dd if=/dev/zero ... of=LVZ; newfs_hfs LVZ
11. diskutil cs resizeLV LVZ 228g
12. zpool attach pool LVB LVZ; zpool set autoexpand=on pool
13. zpool detach pool LVB
14. dd if=/dev/zero ... of=LVA; newfs_hfs LVA
15. diskutil cs resizeLV LVB 228g
16. zpool attach pool LVZ LVA
17. zpool list pool

I could of course make a third side to the mirror, but step (1) is adequate for a small amount of data; restoring, if something goes wrong and makes the pool and boot volume unusable, won't be a 72-hour process.

If LVG 2 were also a composite disk, I could just create a pair of new LVs, one on each LVG. I could then zpool replace pool LVA LVa, wait for a resilver, then zpool replace pool LVZ LVz, and see the pool expand when that finishes. That would lead to a lot of block migrations, though, so it would be a utility win rather than a performance win. Resizing the existing LVs does not disturb the bloom filter / clock scores much, and won't use many "cold" blocks that are on the spinning rust PV.

Also, if the pool on 2 large composite CPDKs were sufficiently large, there would inevitably be times when both sides of the mirror vdev would have to wait on the slow spinning rust devices.

One feature requiring block pointer rewrite is shrinking pools. With two CS CPDKs, I could create a new smaller pool with new smaller LVs on each side of the mirror. zfs send/zfs receive to migrate the data. Destroy the old pool. Destroy the old LVs. All online, with hot blocks migrating to the SSD in due course, no messy wiring or plugging, and no new hardware. BPRW would avoid the short unavailability of data needed to adjust mountpoints.
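
A sketch of that workaround, with made-up names (bigpool, smallpool, new LVs at /dev/disk6 and /dev/disk7, a snapshot called move):

  # build the smaller pool on the pair of new, smaller LVs
  zpool create smallpool mirror /dev/disk6 /dev/disk7
  # migrate the data with the usual replication stream
  zfs snapshot -r bigpool@move
  zfs send -R bigpool@move | zfs receive -d -F smallpool
  # once satisfied, retire the old pool, then delete its LVs via diskutil cs
  zpool destroy bigpool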

Another feature requiring BPRW is changing the replication level. With three CS CPDKs I could create a new pool with three LVs for a raidz1. Same zfs send/receive migration required, same fiddling with mountpoints. Going from raidz1 to a mirror or mirrors also follows the same approach: build a new pool with new LVs in the existing CPDK LVGs (for 2 mirrors, add a fourth CS LVG), migrate the data accordingly. All the same disadvantages compared to real online BPRW.
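
For the raidz1 variant, the new pool would be created along these lines (hypothetical device nodes) before the same send/receive migration:

  # one new LV from each of the three CPDKs
  zpool create newpool raidz1 /dev/disk6 /dev/disk7 /dev/disk8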

One big gotcha of course is that "Fusion Drive" block migration only happens when an appropriate SSD is one of the PVs and any other PV is a non-SSD. This means each CPDK needs an SSD. Also, you will get unpredictable performance fall-offs if ZFS data gets migrated to the slower disk. Working set management could be a little hairy.

However, if the working set fits comfortably within the set of SSDs in the "Fusion" LVGs, then zfs fragmentation wouldn't be relevant, as seek times vanish. Likewise, the BigMigrCS (and to an extent the MigrCS) has the side effect of improving on-disk locality of reference for blocks accessed or created at about the same time. This effectively "defragments" zfs data as it becomes old enough to be moved off the SSD PV. (JHFS+ LVs gain from this too).

As with Ahrens's explanation, there is a giant table of mappings between LV block addresses and LVG block addresses, maintained by CS. ZFS is unaware of it; there is at present no means of exposing the mapping to ZFS, and nothing in ZFS that would take advantage of it anyway.

Unfortunately, the extra physical devices and software layers tend to increase vdev fragility (while increasing cost), and may actually worsen performance. Additionally, it is unlikely to be supported by anyone.

However, it is a step towards (e), and can be done in 10.8.3 on many Macs.

(One could also do (e) with, for example, iSCSI or Xsan. A number of university computer centres have done precisely that over the years, although there are critical gotchas with that approach as well. An example: http://utcc.utoronto.ca/~cks/space/blog ... rverDesign )