Advanced ZFS question on logbias


Postby macz » Fri Jul 28, 2017 8:39 pm

Probably for one of the zfs devs...

I'm running a ZFS pool on all supercap-protected Intel S3500 SSDs and doing sync-write work to them. I am trying to speed up the writes while also reducing the redundant double writing of the data to the pool, since without a log device the sync writes are written to the SSDs twice.

Adding a dedicated ZIL device would take care of the double writing and the fragmentation of the pool, but unless it's NVMe or something faster than the striped SSDs of the pool, it will probably actually slow things down.

Reading what logbias=throughput does, it sounds like it forces the pool to not use a ZIL at all, not just ignore a separate log device (that can be done other ways). It sounds like logbias=throughput commits each write to the main pool before the ack, and if the pool is striped SSDs, that should be fast enough?
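
For reference, here is how I would expect to check and flip the property per dataset; the pool/dataset names are just placeholders for my setup:

Code: Select all
# current value (the default is "latency")
zfs get logbias tank/vmstore
# switch that dataset to throughput mode
zfs set logbias=throughput tank/vmstore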

Thoughts?

Re: Advanced ZFS question on logbias

Postby rottegift » Sun Jul 30, 2017 4:48 pm

You probably don't want to touch the logbias property without good reason.

The property is only relevant for synchronous write system calls, and is designed to allow those calls to return to userland more quickly than if they had to wait for the transaction group to be committed to the disk (which might take up to 15 seconds depending on load and underlying performance of the storage devices).

Most such system calls are expected to introduce substantial latency in general (and notably on unjournalled HFS+) and applications don't really expect otherwise.

All NFS writes, however, are (absent dangerous configuration by the server *and* client) always synchronous. Consequently writing out a large file from an NFS client to an NFS server (where the server-side storage is a ZFS dataset) will be very slow unless a low-write-latency log device is active on the pool. If one has such a configuration and a client is certain to be doing frequent bulk writes to a well-defined location *and* there is no log device configured, then logbias=throughput is likely to help.
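
As an aside, you can tell at a glance whether such a log device is active; a minimal check, with the pool name as a placeholder:

Code: Select all
# a separate log vdev shows up under a "logs" heading in the pool config
zpool status tank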

That behaviour was typical in the early days of ZFS, but it is fairly rare now: NFSv4 has mitigations for bulk writes, log devices are very commonly active on high-latency pools (where the storage devices are typically rotating disks), and network performance is much better, so one is no longer desperately trying to contain additional sources of latency on top of those that were historically introduced by, for example, waiting in transmit queues on the client's CSMA/CD interface on old unswitched thicknet Ethernet.

The bulk-synchronous-write pattern is pretty rare in the non-NFS case too. At the margins there are async bulk writes coupled to an fsync() system call, and in that case returning quickly from fsync() is probably not worth the additional latency likely to be imposed on the pool overall (assuming many threads are active across multiple datasets).

I'm not sure why you're worried about non-contiguous data blocks on an ssd-based pool; there's almost zero penalty for (reading) those. Writing synchronously when you do not have to will practically always slow down write performance, no matter what logbias is set to. Among the reasons are that aggregation opportunities are lost in an effort to keep sync system call latencies down. Setting logbias=throughput will not generally make much difference in the case of directly-attached storage.

Finally, be careful about changing logbias on a mounted dataset; bugs have been found in the past couple of years (some in ZOL, their issue 541 comes to mind, could be pool-destroying), and since changing the property is rare, it is frankly not especially well covered by testing and user experience.

Lastly, what are you actually doing that generates a workload of many synchronous writes? Without knowing that, it's hard to offer you more specific guidance. The general guidance is "well, you don't want to do that if you can avoid it, and if you can't avoid it then use a pool with minimal random write latency". Only synchronous activity is logged via the ZIL, which lives in the pool (absent a log vdev) or on the separate log device, but of course if you're deliberately synchronously writing many megabytes per second, that data is likely going to end up being logged to the pool or slog respectively.
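
If you want to quantify it, zilstat is the easy way to watch ZIL traffic; for example, two-second samples:

Code: Select all
# sample ZIL activity every 2 seconds, 10 times
zilstat 2 10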

ETA: logbias still looks flaky. https://github.com/zfsonlinux/zfs/issues/6315 -- we have enough code with its origins in ZFS on Linux (and have seen metaslab bugs) that I would be very wary.

ETA2: the ticket says that it's reproducible in other OpenZFS ports too (notably OpenIndiana, which is illumos), so be extra wary.

Re: Advanced ZFS question on logbias

Postby macz » Mon Jul 31, 2017 5:04 am

@rottegift


Thanks for the long reply. It sounds like you need more detailed info on what I am doing, so here you go! Thanks again for helping out.

Firstly, I used to run ZFS on an old OS X 10.6 Server hackintosh using Apple's unreleased code. It ran on an E8400, and that server was over 6 years old and did a great job. Upgrading to 10.11 and using OpenZFS on OS X was OK, but it was clear that I needed to do something new, since the hardware was just too slow and was just waiting to die. With 6+ years of 24/7 uptime, I was shocked at just how reliable it all was.


So I found a nice half-depth server that ran dual L5640s and was going to be perfect for my needs as a Plex server/file server etc., but the issue I had hackintoshing it was the lack of (non-existent) drivers for LSI cards; I was using a 4e4i card to provide some additional drives beyond the 12 internal ones and also to connect to the external 24-drive shelf over 8088.

So I decided to install ESXi on it and set up an all-in-one: an OmniOS/napp-it (Solaris-based) VM is installed on the only local VMFS store (SATA connector), gets loaded by ESXi, and provides ZFS-backed NFS shares back to ESXi to host the rest of my VMs. I have a 2x Intel S3500 supercap SSD stripe (for speed) as the machine pool, running on the LSI card passed through to OmniOS, plus a data pool of regular drives, and of course a pool on the backup shelf that gets booted for backup runs.


So, now the meat. Since all NFS in ESXi is sync, and the VMs (just a handful, about 7, a mix of Linux, OS X, and vCenter) are running on that NFS-served, ZFS-backed datastore, I want to reduce the overhead and keep the SSDs from having to write all VM data twice: once for the ZIL sync write and a second time when ZFS lays it back down to 'permanent' storage. An external ZIL (SLOG) would keep the 2 Intel S3500 SSDs from doing 2 writes, but it seems like a waste. Everything here is internal to a server on UPS backup, so unless something panics there should be no interruption, and the SSDs are all supercap-protected anyway.

I still don't feel like setting sync=disabled on the datastore. But as I read it, Oracle designed logbias specifically for situations where databases write 2 different kinds of data to the storage pool, data files and logs, and it allows the data to be written with logbias=throughput, which commits those writes directly to the pool rather than to any ZIL, thus freeing the ZIL device to serve the smaller, more latency-critical log writes.
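
In other words, the kind of data/log split I'm describing would look something like this (dataset names and recordsize are only illustrative, not my actual layout):

Code: Select all
# database data files: stream straight to the pool, skip the ZIL data copy
zfs create -o logbias=throughput -o recordsize=8k tank/oradata
# redo/transaction logs: keep the low-latency ZIL path
zfs create -o logbias=latency tank/oralogs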

So I was thinking that with 2 striped SSDs I should be able to write the data directly without the ZIL, as it should be fast enough since there is no seek penalty on SSD. I feel that ZFS is still optimized for spindles and not really tuned for or aware of SSDs.

That way the SSD sees half the write traffic, since the ZIL is not written first; it should even speed things up, since again I don't see how a ZIL write on SSD can be any faster than committing the data. ZIL writes to spindle disks are faster because ZFS doesn't try to place the ZIL write carefully, it just writes it wherever the heads happen to be to reduce latency so the ack gets back to the sync flush sooner, but this behaviour makes no sense in an all-flash pool.


So, in sum: an all-flash pool of striped SSDs, served over an internal virtual network running at 10G+ speeds via NFS (which requires sync), is writing the data twice. I'm looking for a good way to optimize this, both to avoid writing the data to the storage twice and to keep sync from increasing the fragmentation that it imparts on a pool without a SLOG/ZIL device.

The reason I am not really looking to use a SLOG is that I can't afford a Zeus drive or some other storage that is faster than the striped SSDs, and just adding a small 80GB Intel S3500 as a SLOG would fix the fragmentation on the 2x S3500 main pool but might actually be slower than just committing the writes.

So logbias=throughput tells ZFS not to use the ZIL at all and to commit all writes directly to disk (which on all-SSD should be fast enough, and just as fast if not faster than using the in-pool ZIL), and since the disks are supercap-protected the writes will get written.

Another way of doing this is tuning the ZIL: since the ZIL only gets writes smaller than some threshold value, tuning that number would force all writes to bypass the ZIL. But that tuning is a behind-the-scenes global knob, whereas logbias is a property that can be set with dataset-level granularity.
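
From what I can tell, on OmniOS that tunable would go in /etc/system, something like the following (the value shown is, I believe, the default of 32k; making it smaller pushes more writes straight into the pool):

Code: Select all
* /etc/system -- writes larger than this many bytes skip the data copy into the
* in-pool ZIL and are committed straight to their final location in the pool
set zfs:zfs_immediate_write_sz = 0x8000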

Thoughts?

As you can see, this is not really an OS X ZFS question but a 'how ZFS works' kind of question.
I wish I could have stuck with OS X on bare metal and used OpenZFS on OS X, but it just didn't work out due to the lack of HBA drivers. And now that I am running certain workloads in ESXi VMs I really can't go back; it's pretty slick and really stable.

Another issue I have is that in my all-Mac home, now that all storage is being provided by Solaris/napp-it SMB, it uses Windows-style SMB/NFSv4 permission and ACL models, so trying to set up my SMB shares for Mac users is complicated; understanding the POSIX vs ACL vs SMB permission levels is giving me a headache!

Thanks

Here is a zilstat 2 3 of the server running idle:

Code: Select all
   N-Bytes  N-Bytes/s N-Max-Rate    B-Bytes  B-Bytes/s B-Max-Rate    ops  <=4kB 4-32kB >=32kB
     62384      31192      62384    1310720     655360    1310720     10      0      0     10
    148304      74152     148304     786432     393216     786432      6      0      0      6
     10632       5316      10632     532480     266240     532480      6      2      0      4

Re: Advanced ZFS question on logbias

Postby rottegift » Tue Aug 01, 2017 3:06 pm

Just to reiterate, logbias=throughput looks badly broken.

https://github.com/zfsonlinux/zfs/issues/6315

The crucial things here are: reproduced on OpenIndiana (illumos), "zfs create ... -V ... -o logbias=throughput ...", and the panic will make the pool *unimportable* (i.e., you try to import, and your machine will panic).

The worse thing is that it's probably leaving a timebomb in one metaslab, and if your pool is not very busy with allocations and deallocations, it might not be hit until there's a new allocation into the metaslab that got messed up by logbias=throughput.

If you've EVER done logbias=throughput, you should probably run zdb -mmmmm on the pool and see if that completes normally (if it does, you're probably in OK shape). If zdb fails at all (run against an unimported pool), you should start thinking about how to rebuild the pool.
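
That check is just the following (pool name as a placeholder):

Code: Select all
# walk all metaslab space maps in detail
zdb -mmmmm tank
# for a pool that is exported / not imported, add -e
zdb -e -mmmmm tank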

There are recent ZIL changes in master that will help reduce the additional writing incurred by bulk sync writes, and further ZIL changes in experimental zfsonlinux PRs that may also help reduce the amount of data that has to go out into a slogless pool, mainly by improving aggregation among multiple (synchronous-system-call) writers, although the primary focus of those PRs is to avoid critical-section waits imposed by the serialization requirement of the log's on-device semantics (which also affects slogs).

In the meantime, you can investigate zfs_immediate_write_sz versus the wsize option on your clients' mount_nfs or equivalent, especially if you do TCP mounts. Making both larger will keep large NFS writes out of the in-pool ZIL, but will increase the latency of each write(2) system call on the client, since the server's write(2) call will have to wait for the TXG to fully sync.
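
Client-side, that is just a mount option; the syntax varies by OS, but on a Linux guest it would look something like this (server, paths, and sizes are only examples):

Code: Select all
# large TCP writes from the NFS client
mount -t nfs -o vers=3,proto=tcp,wsize=1048576,rsize=1048576 server:/tank/vmstore /mnt/vmstore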

If you want or need to keep each NFS write small then a slog will at least avoid double-writing into the storage vdevs, if that's important to you for SSD lifetime reasons (log writes into the pool are pretty dense for the data involved, so the write amplification is probably not going to make much difference for SSDs built in say the last two years). You don't look hugely performance-constrained by sync writing, so a faster slog than your storage vdevs is not obviously worth shopping for; it'd only be about write amplification. (Support for TRIM/UNMAP will eventually land in openzfs and should be straightforward to port to o3x, and that will help too, assuming your odd tree of VMs and so forth passes trims down to the physical device.)
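
If you do decide a slog is worth it, adding one is a one-liner (device names are placeholders; a mirror is safer if the log matters to you):

Code: Select all
zpool add tank log c3t1d0
# or, mirrored:
zpool add tank log mirror c3t1d0 c3t2d0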

Re: Advanced ZFS question on logbias

Postby macz » Sun Aug 06, 2017 4:37 am

@rottegift

Thanks again for taking the time to write such a detailed response...

I have looked at the bug report you keep referring to, but I fail to see how it is a general logbias bug. He is creating a ZVOL with logbias=throughput (I don't think that is what it was designed for) and also using it via a hypervisor as a system disk.

I have seen countless threads where people are using logbias on standard ZFS file systems, and I have not found a single report of issues.

Also, regarding the panic in question: if he switches the virtual device from IDE to virtio, the panic goes away. So could this be more of an issue with virtualization and ZVOLs rather than with logbias?

Anyway, I guess my overall point here is that ZFS on an all-flash SSD pool does not appear to be optimized to take advantage of near-zero latency; it is designed, at best, for spinning rust plus separate, low-latency log devices.

I would just like to optimize this all-Intel-S3500, capacitor-backed SSD pool for sync/NFS operation. The two areas are 1) not having to write data twice to the SSD pool, since the ZIL write is no faster than just committing the data fully and then acking, and 2) speed.

A SLOG SSD would keep the pool from being fragmented and would reduce wear from the double writes, but since the main pool is striped SSDs it might even slow things down, although speed is not the primary goal.

There has to be a way, and I thought logbias was the mechanism, to tell ZFS to bypass the ZIL and commit sync writes to storage immediately.

