Does ZEVO support TRIM?


Does ZEVO support TRIM?

Post by dburkland » Mon Dec 24, 2012 2:04 pm

I just enabled TRIM on my MacBook Pro's SSD using Trim Enabler, and am wondering whether ZEVO is TRIM-aware.

Thanks,

Dan

Re: Does ZEVO support TRIM?

Post by jollyjinx » Wed Apr 10, 2013 11:28 am

Can't anybody from ZEVO answer that? I have an MBPr with the built-in SSD, which supports TRIM out of the box, and wondered whether that is supported by ZFS.

a) Is TRIM supported for Apple's SSDs that support TRIM out of the box?
b) Is TRIM supported for third-party SSDs where it was enabled using Trim Enabler?

I'm guessing it should be handled identically in both cases.

Re: Does ZEVO support TRIM?

Post by raattgift » Thu Apr 11, 2013 3:02 am

TRIM is not supported by many ZFS implementations, and it's fairly new in those that do support it (FreeBSD's was committed to HEAD in late September 2012).

There was a lot of discussion about TRIM in the OpenSolaris context in 2008; some of it can be found in the zfs-discuss archives of the time. SSD controllers have advanced substantially since then, and few decent SSDs have trouble dealing with bursts of writes even when TRIM is not used. There are likely to be anecdotes about real-world performance on the lists specific to those implementations/ports of ZFS.

TRIM is not likely to make much of a difference in practice for *storage* vdevs, even if done remarkably well on the TRIM-issuing side: in part because ZFS's write strategies allow the SSD's erase-block migration time to prepare enough free blocks to avoid spare-area pressure when a TXG moves into the commit phase, and in part because ZFS puts very little pressure on any internal COW system that deals with short writes, by virtue of ZFS's own COW and its use of multiples of ashift when committing writes to physical devices. Heavily sequential workloads are better off on as-fast-as-possible rotating rust, which is better at purely sequential writes than all but the most expensive write-optimized SSDs, and comparable for purely sequential reads. ARC/L2ARC will lessen the impact of any nonsequential traffic hitting such a pool.

TRIM would be of most use in a pool in which the storage vdevs are only SSDs, *and* where the pool is limited by IOPS, *and* where the traffic pattern involves lots of multi-kilobyte ephemeral files. (The second you add a slower storage device in the same vdev, any IOPS recovered by TRIM become academic; a slower device in a different vdev in the same pool likewise erodes the potential IOPS gain from TRIM.) In a Mac context, the build directories of package systems like MacPorts are the most obvious case of that; backupd's cleanups would also result in lots of TRIM activity if an SSD is used as the Time Machine volume (which is probably extremely rare). Otherwise you'd be thinking of cases wherein a Mac hosts a random-workload database with a large working set and lots of writes, and the backing-store vdevs are all SSDs.

Cache vdevs (as opposed to storage vdevs) are written to so slowly that there is going to be no advantage to TRIM except on very old SSDs. TRIM doesn't benefit a read-random/write-slow-sequential pattern at all. Moreover, when the cache vdev fills up, data at the front is simply overwritten; there is no TRIM issued, as there's nothing to TRIM. (You could trim a few blocks from the front, but why? You might need to read them back into ARC!) A TRIM *might* be useful when a cache vdev is first connected. Maybe.

Log vdevs could benefit from TRIM if they were pretty busy; writes are bursty and sequential. Log vdevs are only ever read from in unusual circumstances (a linear read at pool import time is the most common pattern). Anything written into a log vdev will simply be overwritten; an SSD suitable for that role will have enough spare space and a good internal block-migration system, so it won't need TRIM, but a TRIM after each TXG is fully committed might improve performance where an SSD is used in multiple different ways, including as a *heavily* used log vdev. Most log vdevs are just not that heavily used... maximum occupancy tends to be in the low numbers of megabytes, so the TRIM isn't really doing all that much compared to a hundreds-of-GBytes modern SSD.
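
(If you're curious how busy your own log device really is, watching per-vdev occupancy and operations is enough to tell -- a sketch, with a placeholder pool name:)

Code:
# per-vdev alloc, ops and bandwidth, refreshed every 10 seconds; "tank" is a placeholder
$ zpool iostat -v tank 10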

Indeed, an SSD which is shared with a non-zfs, non-transactional, block-rewriting filesystem is the most obvious place where TRIMs issued from the zfs subsystem could help -- but it will mainly help the performance of the other filesystem, rather than ZFS. The biggest advantages would be giant TRIMs as vdevs are brought online. The whole of a log or cache vdev could be TRIMmed then, and it could be profitable to TRIM the big known empty chunks on a metaslab-by-metaslab basis during the import process. That would be easy, and not slow down importation terribly for small numbers of vdevs with typical numbers of metaslabs. TRIMming a storage vdev that's part of an active pool is harder, more prone to error, and in many cases not clearly a performance win for zfs.

TRIM support is unlikely to be in CE 1.1.1 and I would not think it would be much of a priority for ZEVO.

If, however, you were to migrate a pool between ZEVO CE 1.1.1 and a ZFS implementation that supports TRIM, *and* could show a clear performance benefit for a fairly natural use case (especially if that use case fits with what GreenBytes does), then I could see it becoming a priority. :-)

Here's a question for you: do you know if Core Storage emits TRIMs in your "Fusion Drive" setups, and whether those TRIMs are issued in relation to activity in a non-JHFS+ LV? Or is it all related to activity in the HFS layer? If you're seeing (Rd|Wr)BgMigrCS activity, are TRIMs also happening then? (I do know that fsck_hfs will report trimming activity from time to time (also, strings /sbin/fsck_hfs | grep -i trim) including when the HFS volume is in a CS LV where the LVG is a composite disk containing a TRIM-capable SSD.)

Code:
# find /System/Library/Extensions/CoreStorage.kext -type f -exec strings '{}' + | grep -i trim
/trim
%s: add_extent_for_trim failed err=%d, off=%llu nblks=%llu
%s: send_trims_down failed err=
# find /System/Library/Filesystems/zfs.fs -type f -exec strings '{}' + | grep -i trim
stringByTrimmingCharactersInSet:
cla:ZFS-TimeMachine-master # find /System/Library/Filesystems/hfs.fs -type f -exec strings '{}' + | grep -i trim
Trimming unused blocks._
Trimming unused blocks._
...


Additionally one could consider this information:

Code:
# sysctl -a | grep -i trim
kern.jnl_trim_flush: 240
vfs.generic.jnl.kdebug.trim: 0


Speculation: The former is likely to be a block threshold for issuing TRIMs (it's not a since-boot-time count of TRIMs; it's 240 on several very different systems). The latter is probably a toggle that will likely spam out extra information into one's logs.

Update: the latter is documented in the open source bit of vfs/vfs_journal.c, which is also where the HFS trim mechanism is implemented.

Code:
 * Set sysctl vfs.generic.jnl.kdebug.trim=1 to enable KERNEL_DEBUG_CONSTANT
 * logging of trim-related calls within the journal.  (They're
 * disabled by default because there can be a lot of these events,
 * and we don't want to overwhelm the kernel debug buffer.  If you
 * want to watch these events in particular, just set the sysctl.)


The first is unknown, but everyone seems to have it set to "240". It is probably the number of trimmed extents that causes a journal flush.
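
(If you want to watch those trim calls yourself, flipping the documented toggle is just a sysctl away -- root required, and it will add noise to the kernel debug buffer:)

Code:
# enable KERNEL_DEBUG_CONSTANT logging of journal trim calls, per the vfs_journal.c comment above
$ sudo sysctl -w vfs.generic.jnl.kdebug.trim=1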

Update 2: there are plenty of dtrace fbt probes for trim available that you could very likely watch to answer the question I asked you. :-)
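
Something along these lines should enumerate and count them as they fire (untested off the top of my head; the exact probe names will vary with kernel version and loaded kexts):

Code:
# count trim-related kernel function entries by module and function (needs root)
$ sudo dtrace -n 'fbt::*trim*:entry { @[probemod, probefunc] = count(); }'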

Re: Does ZEVO support TRIM?

Post by grahamperrin » Fri Apr 12, 2013 12:50 am

raattgift wrote:… TRIM support is unlikely to be in CE 1.1.1


I had the same guess.

(There's nothing obviously TRIM-related amongst beta stuff that I cached privately. That said, I would have given little thought to TRIM at the time. And what I cached is not a comprehensive collection.)

raattgift wrote:… and I would not think it would be much of a priority for ZEVO. …


My first thought in response to the opening post was, there's GreenBytes expertise around flash storage. Now for me there's an overriding thought –

raattgift wrote:… clear performance benefit for a fairly natural use case …


I'm vaguely interested in this stuff because in a few months, or maybe next year, I'll get an Apple laptop with flash storage. But I don't imagine testing in any great detail; raattgift's post is reassuring.

Re: Does ZEVO support TRIM?

Post by jollyjinx » Fri Apr 12, 2013 7:26 am

I was not at all concerned about the speed advantage that TRIM would give me. A ZFS home does not feel fast compared to HFS+ anyway.
Thanks @raattgift for the long comment.

I am using a 15" MBPr with a flash drive and ZFS and as I'm not using 100% of the ZFS volume. So TRIM would substantially prolong the livetime of the drive as the wear leveling would not have to copy as much as with TRIM. Block that are TRIMed do not need to be copied for later references but immediately go into the spare block list of the SSD. thus lower write amplification and longer disk life.
I also have a mac mini where I use two SSDs as ZFS cache, there TRIM would make no difference as the disks are full all the time anyways.

Given that the disk I have is a 512 GB drive (the Samsung 830, from what I can see), I probably have around 3,000 write cycles - so 1.5 PB to write; with a write amplification of 3 that still leaves 500 TB to write before the drive dies. It's not much of a concern, but I always strive for the best technical solution, which TRIM would be in my case.
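
In numbers (back of the envelope, assuming ~3,000 P/E cycles and a write amplification factor of 3):

Code:
# raw NAND endurance, then usable host writes after write amplification
$ echo "512 * 3000" | bc
1536000        # GB of NAND writes, i.e. ~1.5 PB
$ echo "512 * 3000 / 3" | bc
512000         # GB of host writes, i.e. ~500 TB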

Re: Does ZEVO support TRIM?

Post by raattgift » Sun Apr 14, 2013 7:30 am

jollyjinx wrote:A ZFS home does not feel fast compared to HFS+ anyways.


Really? On an identical partition on identical hardware connected identically?

[I live with two busy $HOMEs (and other stuff like the MacPorts /opt hierarchy) on pools which each have two SSDs in a mirrored vdev. Once the searchfs traffic that happens at startup abates, performance is pretty comparable, and gets much better than JHFS+ as the ARC heats up. ARC+UBC >> UBC. (In fact, ARC > UBC generally.)]

jollyjinx wrote:I also have a mac mini where I use two SSDs as ZFS cache, there TRIM would make no difference as the disks are full all the time anyways.


Be careful. The mere existence of oversized cache vdevs can lead to performance problems, since each record in the L2ARC consumes ~256 bytes of core at all times, even when the underlying record is no longer reachable and thus will never be pulled into the ARC. Unreachable records only go away when overwritten (or when the pool is otherwise offlined), and cache vdevs are filled in a circle - a record that becomes unreachable immediately after going into L2ARC may stay there a long time, consuming 256 bytes of system RAM.

Additionally, since ZEVO's ARC is deliberately kept small, you may end up with almost no free arc_buf_t entries if you have big cache vdevs carrying many GBytes. That will destroy performance at best (especially write performance) and at worst may lead to a system crash.

If you have a huge SSD and want cache vdev(s), use GPT partitions and make the cache vdevs use slice devices. A good size is single-digit GBytes per slice. You probably don't want more total L2ARC space than a very small multiple of total system RAM.
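
Adding such slices is painless -- a sketch, with made-up pool and slice names; create the small GPT partitions first with diskutil or Disk Utility:

Code:
# add one ~5 GiB slice per SSD as a cache device ("tank" and the slices are placeholders)
$ sudo zpool add tank cache /dev/disk4s3 /dev/disk5s3
# confirm they show up under "cache" and watch their occupancy over time
$ zpool iostat -v tank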

Moreover, the filling thread that copies blocks from the ARC to the cache vdevs isn't especially smart; your L2ARC will have diminishing returns. The L2ARC hit/miss rate is not easily extracted in the ZEVO CE 1.1.1 port, although in principle you could do so with dtrace fbt exit probes.

jollyjinx wrote: It's not much of a concern, but I always thrive for the best technical solution which TRIM would be in my case.


You are right that modern SSDs will last years before raising the first warnings of write failure, even under heavy load.

TRIM is only "the best technical solution" if it helps performance in real world usage patterns (questionable) AND if the code does not have any bugs or other issues which may compromise data integrity.

TRIM code for ZFS is *very* fresh -- September 2012 for FreeBSD's merge, and it's not even in most other implementations.

TRIMming on log vdevs or in the on-storage-vdev ZILs is especially scary -- I managed to zap a log vdev with nasty results. You never want a log vdev going wrong when a pool is offline (or at all, really). :/

(FWIW, I had to turn to OpenIndiana to "zpool import -m" the pool at all and to "zpool remove [missing log vdev]", but I was impatient and careless and did that while the pool was degraded (OI had trouble seeing all the pool's disks). Unfortunately that left me with a pool that reliably crashes FreeBSD 9.1 and 10.8.3 with the ZEVO CE 1.1.1 implementation. So it's back to OI to zfs send incrementals of the data out of this bad pool, which I'll have to destroy.
The operation which zapped the log vdev is not especially different from a TRIM hitting the wrong LBAs because of the strange range rules in some controller chipsets, or from barrier bugs, or from the complete absence of guarantees about when a TRIM will happen and when one can reuse an LBA that has just been TRIMmed.)
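
For the record, the recovery sequence on the OI side was essentially the following (a sketch; the pool name is a placeholder and the missing log vdev is identified by the guid shown in "zpool status"):

Code:
# import despite the missing log device, then remove that device by its guid
$ zpool import -m brokenpool
$ zpool remove brokenpool <guid-of-missing-log-vdev>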


Re: Does ZEVO support TRIM?

Post by jollyjinx » Mon Apr 15, 2013 4:09 am

I have not seen any problems with that large a ZFS cache. ZFS is not using much of the 16 GB. On the other hand, the machine isn't doing much anyway.

As background, my (power-saving) server is set up as follows:

Mac mini 2012, 16 GB RAM
10 TB raidz2 (4x USB3, 1x FW800; 2 TB 2.5" WD Passports)
2x 256 GB Samsung SSDs (internal):
- 2x 50 GB RAID 0 HFS+ as root
- 2x 115 GB zfs cache
- 2x 1 GB (mirrored) zfs log

The machine is used for (zfs) backups, satellite recording (EyeTV netstreams) and later conversion via HandBrake, and serving the videos via AirVideo.
I would love to throw away HFS+ completely, but as long as there is no ZFS boot support, that is out of the question, so I'm stuck with HFS+ and ZFS on one disk on the rMBP. I don't trust HFS+ to keep my sources any more after Time Machine failed badly (it did not do any backups for one month, even though it appeared to back up correctly every hour [a problem known to Apple]).

FYI: the machine uses 18 W when running idle (including disks) and 4 W when asleep - which it is most of the time. Maximum usage is about 57 W when HandBrake is converting things. Measurements were done with a sem16+.

Re: Does ZEVO support TRIM?

Post by raattgift » Mon Apr 15, 2013 7:42 am

Your configuration is reasonable; I have a couple of similar setups on recent Mac minis.

Gigabytes of L2ARC will take a long time to fill, even with fairly heavy use -- l2arc_feed_thread() runs periodically and by design does not move a huge volume of data onto cache vdevs. As noted in the long "Level 2 ARC" comment block in the standard zfs/arc.c, this is to avoid "clogging" the cache vdev with writes and to avoid churn. It also allows for better scheduling of writes onto the device. The hottest blocks always stay near the head of the MFU queue, so they are unlikely to be copied into the L2ARC; the L2ARC's existence encourages this result. Additionally, sequential prefetches and writes are not L2ARC-eligible unless they are rapidly reused (and therefore end up in the MFU queue rather than just the MRU one).

Moreover, the problem with a huge L2ARC is the number of objects stored in the cache vdevs; each object -- which may range in size from 512 bytes to 256 KiB -- consumes ~256 bytes of RAM, and in ZEVO there is a low cap on the amount of RAM the entire zfs subsystem will use. If the objects are all large, a huge L2ARC is not a problem, but it might not be adding any value if it's not serving up many read IOPS ("zpool iostat -v" will tell you that; zstat's "ARC overall" hit percentage is important: if it's anywhere over 95%, you likely have enough ARC+L2ARC, and will only see diminishing returns as you add more). If there are many small objects, you will end up with less space in the main ARC and performance will tank in two ways: firstly, writes will throttle terribly, and secondly, reads will be served by the slower L2ARC or even the storage vdevs.

Write throttling is the worst of these, especially if the cache vdev devices have very low read latencies. Whenever *anything* is written through the ZPL, it goes into the ARC. (Synchronous writes also go into the ZIL or log vdev.) If there is lots of space in the ARC, the transaction group will stay open for several seconds, accumulating dirty ARC objects. The txg then transitions to the quiescing and syncing phases, and the writes are scheduled out in a large burst.

However, when there is very little ARC space, the txg transitions much more quickly to the quiescing phase, which will block new writes; it may spend some time quiescing, depending on the write load. Essentially it waits for a tiny number of threads (possibly even one) to send a record's worth of data or to suspend the current write operation. Then, given write pressure and a tiny quiesced txg, writes are synced. The act of writing out a txg can also cause other writes to be necessary in the same transaction group -- internal zfs metadata, POSIX metadata ([amc]times; directory data updates), and so forth. Each of these may in turn have to wait for an ARC block to become available through eviction or through further sync activity. However, L2ARC metadata is *not* fully releasable. If there simply is too little space in the ARC because of things that cannot be evicted or released, a pattern of salvage IO emerges (from reducing txg_time and clamping ARC occupancy) such that the pool's storage vdevs are absolutely hammered with tiny IOPS, which generally brings a system to its knees. It may not be possible to recover from that pattern in generic zfs implementations. :-(

Again, you will only see this pattern given sufficient system uptime and a usage pattern that favours L2ARC eligibility, leading to many arc_buf_hdr_t objects and few free arc_buf_t ones, as reported by zstat.

It is, however, easy to avoid this particular problem simply by using a smaller set of cache vdevs. Adding many GiB of L2ARC is not likely to hugely improve your Mac's performance if you're not using it as a server for dozens of busy clients. A few extra GiB per pool is likely to push your "ARC overall" hits to somewhere above 95%, and you are unlikely to do much better than that.
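
To put a number on why size matters (pure back-of-the-envelope, and it assumes one of those 115 GiB cache vdevs eventually fills with smallish ~8 KiB records):

Code:
# 115 GiB of L2ARC as 8 KiB records, at ~256 bytes of header per record
$ echo "115 * 1024 * 1024 / 8 * 256 / 1024 / 1024" | bc
3680           # MiB of RAM consumed just to index one cache device

Against an ARC that is deliberately kept small, that is clearly untenable.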

With large SSDs it is better to make a pool out of slices of two or more of them and to use them for datasets with the most latency-sensitive workloads.

I'm pretty sure your 2x115GiB cache vdevs are essentially wasted, with only hundreds of MiB of occupancy after a day or two of uptime.

The workload you describe is highly sequential and your raidz2 vdev will handle that fine, although there is likely a speed mismatch across the devices in the vdev that won't help you.

After an uptime of a week or two check the occupancy of the cache vdevs (first column of "zpool iostat -v"), and size the cache vdevs to that, or even smaller, since the "alloc" includes stale and unreachable entries that will never be read from the device, and that tends to be a substantial percentage of a large cache vdev.

The space not used for cache I'd put toward whatever data you have that can fit in a small, low-latency, fast pool.

On one of my workstations, that includes my $HOME (with separate datasets for ~/Library/Safari and ~/Library/Saved Application State, so that I can snapshot them frequently and recover an old ~/Library/Safari/LastSession.plist and the like thanks to local snapdir=on settings), my MacPorts /opt (which requires a LaunchDaemon plist that copes with /opt taking its time to mount, which freaks out launchd thanks to MacPorts' use of symlinks in /Library/LaunchDaemons), and my squid3 cache_dir.
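
For anyone curious, setting those nested datasets up is nothing exotic -- a sketch with a placeholder user name; the actual layout is in the listing that follows:

Code:
# per-directory datasets so they can be snapshotted (and rolled back) independently
$ sudo zfs create -o mountpoint=none ssdpool/xxx/Library
$ sudo zfs create -o mountpoint="/Users/xxx/Library/Safari" ssdpool/xxx/Library/Safari
$ sudo zfs snapshot ssdpool/xxx/Library/Safari@$(date "+%Y%m%d-%H%M")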

Code:
NAME                                         USED   AVAIL   REFER  MOUNTPOINT
...
ssdpool/DATA/opt                           32.7Gi  82.6Gi  27.3Gi  /opt
ssdpool/DATA/squidcache                    21.7Gi  8.34Gi  21.2Gi  /Volumes/ssdpool/DATA/squidcache
ssdpool/xxx                                117Gi  33.3Gi   110Gi  /Users/xxx
ssdpool/xxx/Library                        1.49Gi  33.3Gi   272Ki  none
ssdpool/xxx/Library/Safari                 1.09Gi  33.3Gi   118Mi  /Users/xxx/Library/Safari
ssdpool/xxx/Library/SavedApplicationState   411Mi  33.3Gi  37.0Mi  /Users/xxx/Library/Saved Application State
...


And in another machine

Code:
NAME                       AVAIL    USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
homepool                  32.6Gi   159Gi    6.64Mi  45.4Mi              0      159Gi
...
homepool/DATA/XPLANE10    32.6Gi  60.4Gi    1.09Gi  59.3Gi              0          0


As the ARC heats up, performance is noticeably better than when the same data was in a softraid mirror on the same two SSDs.

I have regularly shrunk the boot volume and increased the pool on paired SSDs in my Macintoys that have them.

Likewise, I moved the cache vdevs (for my spinning-rust pools these were 40-60 GiB slices of the SATA 3 SSDs, which was a total waste of space) onto fast 128 GiB USB3 flash sticks, and then, after some analysis of actual use, decided to use per-pool pairs of 5 GiB partitions living on two of those sticks, which has - perhaps counterintuitively - dramatically improved overall performance and system robustness.
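
The migration itself is trivial, since cache devices can be dropped from a live pool and re-added elsewhere at will -- a sketch, with made-up pool and device names:

Code:
# drop the old oversized cache slices, then add the small partitions on the flash sticks
$ sudo zpool remove rustpool /dev/disk5s2 /dev/disk6s2
$ sudo zpool add rustpool cache /dev/disk10s1 /dev/disk11s1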

My big frustration of the moment is how hard it is to get all of /Library/Server off the boot volume in 10.8.3 Server without running into enormous problems at system startup time.

Like you, I am tired of *actual* data loss and corruption in JHFS+ volumes, especially corruption that went undetected for some time (and has thus been propagated to backups). ZEVO CE has been great for that; otherwise it would all be mounted across the network from OI or FreeBSD 9.1 servers, with the attendant performance degradation, administration hassles, security hazards, and so forth.

Re: Does ZEVO support TRIM?

Post by emory » Mon Apr 15, 2013 9:20 am

raattgift wrote:
jollyjinx wrote:A ZFS home does not feel fast compared to HFS+ anyways.


Really? On an identical partition on identical hardware connected identically?


Anecdotally, yes. That's why I went down the road of JHFS+ Fusion Drives for home directories (and symlinking Documents/Music onto a FreeNAS share). I don't have that exact case documented, but my FreeNAS over gigabit Ethernet (raidz, 3x 3 TB 7200 rpm) is faster for sequential read/write than a local ZEVO mirror (2x 1 TB 7200 rpm).

I have a Google Docs spreadsheet (https://docs.google.com/a/hellyeah.com/spreadsheet/ccc?key=0Av2d4b91SLePdE1CdjVDSldMaUM5eUxCSFV1MEtfbFE#gid=2) available though like I said it's anecdotal.
