Flooding and choking


Flooding and choking

Post by grahamperrin » Tue Apr 16, 2013 10:43 pm

Placeholder for discussion. Drawing from at least:


Which methods are best for measuring factors that might contribute to a flood or choke?

Focus on IOPS?

I'm familiar with Brendan's blog post « Top 10 DTrace scripts for Mac OS X » (highlights), some of which are HFS Plus-specific, but there's a bigger picture:

  • not all approaches to DTrace work as required with OS X
  • and so on

– so some of the ZFS-related IO stuff that I see online ends up bookmarked but unread, or read but not digested.

Re: Flooding and choking

Post by raattgift » Thu Apr 18, 2013 3:42 am

The problem here is that USB is slowwwwwwww. Let this script run for a bit; it times the function boundaries of zio_write_phys.

This is during heavy traffic mainly to an SSD-based pool, with some light (~ 1 MB/s) background I/O on a 3-way mirror of FW800 disks. Your numbers will certainly differ for your "tall" pool and the like, but any distribution skewed towards longer times is bad. Large numbers of phys writes lasting longer than 5 seconds are likely causing throttling, and anything approaching 30 seconds certainly is.

Dirtied blocks are locked in memory for at least this amount of time.

Let it run for a minute or two, then interrupt it for the graph.

Code: Select all
$ sudo dtrace -n 'fbt:com.getgreenbytes.filesystem.zfs:zio_write_phys:entry { self->s = timestamp; }
    arc_buf_add_ref:return /self->s/ {
        @["ns"] = quantize(timestamp - self->s); self->s = 0; }'
dtrace: description 'fbt:com.getgreenbytes.filesystem.zfs:zio_write_phys:entry ' matched 2 probes
^C

  ns                                               
           value  ------------- Distribution ------------- count   
         2097152 |                                         0       
         4194304 |@@                                       1       
         8388608 |@@@@@@@@@@@                              6       
        16777216 |@@@@@@@@@                                5       
        33554432 |@@@@@                                    3       
        67108864 |                                         0       
       134217728 |                                         0       
       268435456 |@@                                       1       
       536870912 |                                         0       
      1073741824 |                                         0       
      2147483648 |@@                                       1       
      4294967296 |@@@@@@@@@                                5       
      8589934592 |                                         0       


You could also trace zio_read_phys using the same one-liner (just change "write" to "read"). A skew towards slow reads indicates bad performance, which you'll also "feel". Reads are not so critical, since fewer resources are locked down for them. Moreover, they generally pile up waiting for writes to complete, although the scheduling algorithm will opportunistically do a read that is sufficiently near a write (approximately the same track).

Code: Select all
dtrace -n 'fbt:com.getgreenbytes.filesystem.zfs:zio_read_phys:entry { self->s = timestamp; }
    arc_buf_add_ref:return /self->s/ {
        @["ns"] = quantize(timestamp - self->s); self->s = 0; }'
dtrace: description 'fbt:com.getgreenbytes.filesystem.zfs:zio_read_phys:entry ' matched 2 probes
^C

  ns                                               
           value  ------------- Distribution ------------- count   
           32768 |                                         0       
           65536 |                                         1       
          131072 |@@@                                      462     
          262144 |@@@@                                     483     
          524288 |@@@@@@@@@@@@@@@@@@@@@                    2769     
         1048576 |@@@@@@                                   784     
         2097152 |@@                                       259     
         4194304 |@                                        186     
         8388608 |@                                        128     
        16777216 |                                         34       
        33554432 |                                         7       
        67108864 |                                         26       
       134217728 |@                                        79       
       268435456 |                                         7       
       536870912 |                                         36       
      1073741824 |                                         28       
      2147483648 |                                         34       
      4294967296 |                                         26       
      8589934592 |                                         13       
     17179869184 |                                         10       
     34359738368 |                                         3       
     68719476736 |                                         0       


You can see who is doing the writing:

Code: Select all
dtrace -n 'syscall::write:entry /fds[arg0].fi_fs == "zfs"/ { @[execname] = count(); }'
dtrace: description 'syscall::write:entry ' matched 1 probe
^C

  SafariForWebKit                                                   1
  Terminal                                                          1
  BetterTouchTool                                                   2
  XXX                                                             3
  dd                                                              128


It's also probably useful to figure out what's hitting your ARC the most in general:

Code: Select all
$ sudo dtrace -n 'fbt:com.getgreenbytes.filesystem.zfs:arc_buf_add_ref:entry  { @[execname] = count() }'
dtrace: description 'fbt:com.getgreenbytes.filesystem.zfs:arc_buf_add_ref:entry  ' matched 1 probe
^C

  Dropbox                                                           2
  storeagent                                                        3
  ubd                                                               3
  lssave                                                            4
  BetterTouchTool                                                   5
  EyeTV Helper                                                     10
  squid                                                            11
  SystemUIServer                                                   13
  com.apple.iClou                                                  16
  XXXX                                                            18
  log_file_daemon                                                  18
  XXXX                                                          18
  Finder                                                           21
  getopt                                                           29
  taskgated                                                        34
  Skype                                                            40
  Terminal                                                         40
  WebProcess                                                       47
  fseventsd                                                        87
  dbfseventsd                                                     100
  pbs                                                             102
  cfprefsd                                                        130
  CalendarAgent                                                   176
  mds                                                             199
  mdworker                                                        537
  XXX                                                           605
  DragThing                                                       790
  kernel_task                                                    1137
  SafariForWebKit                                                2017
  zfs                                                           28546


And finally, while not really diagnostic, this traces reads and writes to zfs-backed files; after interruption it will show a summary of the top ten for the duration of the trace.

Code: Select all
$ sudo dtrace -qn 'syscall::read:entry,syscall::write:entry /fds[arg0].fi_fs == "zfs"/ { @[probefunc, fds[arg0].fi_pathname] = count(); printf("%Y  %s  %s\n", walltimestamp, stringof(probefunc), stringof(fds[arg0].fi_pathname)); } END { trunc(@,10); }'


Naturally you can change "zfs" to "hfs" etc. in that one-liner.
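
For example, an HFS+ variant of the same trace needs only the fi_fs predicate changed (a sketch; everything else is identical):

Code: Select all
$ sudo dtrace -qn 'syscall::read:entry,syscall::write:entry /fds[arg0].fi_fs == "hfs"/ { @[probefunc, fds[arg0].fi_pathname] = count(); printf("%Y  %s  %s\n", walltimestamp, stringof(probefunc), stringof(fds[arg0].fi_pathname)); } END { trunc(@,10); }'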

Again, though, *THE* key thing is the amount of time it takes to complete writes out to the devices. If that's slow, you will be unhappy.
It's going to be slow for any nontrivial amount of writing to a USB2 device, even a device that is remarkably, atypically, unusually, and astonishingly well behaved, which few USB2 mass storage devices really are.

Edit: zio_flush is also interesting to quantize; that is a write-cache flush (DKIOCFLUSHWRITECACHE), which happens for synchronous writes and, at the very least, for the labels at each TXG. This will trace only writes, naturally. Shorter times and short-skewed distributions are better; the timing of this is critical in gating the overall IOPS of any pool, but this will trace all pools with traffic. Fiddling around with a walk up the structures to the spa_name so that one can trace only one pool is an exercise for the reader. :-)

Code: Select all
dtrace -n 'fbt:com.getgreenbytes.filesystem.zfs:zio_flush:entry { self->s = timestamp; }
    arc_buf_add_ref:return /self->s/ {
        @["ns"] = quantize(timestamp - self->s); self->s = 0; }'

           value  ------------- Distribution ------------- count   
          524288 |                                         0       
         1048576 |@@@@                                     1       
         2097152 |@@@@                                     1       
         4194304 |@@@@                                     1       
         8388608 |@@@@@@@@@@@@@@@                          4       
        16777216 |@@@@                                     1       
        33554432 |                                         0       
        67108864 |                                         0       
       134217728 |                                         0       
       268435456 |                                         0       
       536870912 |                                         0       
      1073741824 |@@@@                                     1       
      2147483648 |                                         0       
      4294967296 |@@@@@@@                                  2       
      8589934592 |                                         0       
...
    274877906944 |                                         1       


Note the extreme outlier: that's a flush to a pool made up of a set of external drives that go into idle power saving mode. ZEVO delays waking up external drives.

Re: Flooding and choking

Post by grahamperrin » Sun Apr 21, 2013 3:31 am

raattgift wrote:… any distribution skewed towards longer times is bad. Large numbers of phys writes lasting longer than 5 seconds is likely causing throttling, and anything approaching 30 seconds certainly is. …


Here this morning, one attempt failed (no measurement was gained):

Code: Select all
macbookpro08-centrim:~ gjp22$ date
Sun 21 Apr 2013 08:28:25 BST
macbookpro08-centrim:~ gjp22$ uptime
 8:28  up 17:01, 4 users, load averages: 1.73 1.22 1.50
macbookpro08-centrim:~ gjp22$ sudo dtrace -n 'fbt:com.getgreenbytes.filesystem.zfs:zio_write_phys:entry { self->s = timestamp; }
>     arc_buf_add_ref:return /self->s/ {
>         @["ns"] = quantize(timestamp - self->s); self->s = 0; }'
Password:
dtrace: description 'fbt:com.getgreenbytes.filesystem.zfs:zio_write_phys:entry ' matched 2 probes
Client side error: Connection interrupted
^C


During that period:

  • a scrub of the pool named tall began at 08:30:40
  • Mail – which uses a large number of accounts and an extraordinary amount of data, and so takes a long time to present its windows – was opened around 08:32:45
  • the interruption was logged by dtrace at 08:36:55.574

Then a relatively extreme example, measured over around 4.5 minutes:

Code: Select all
macbookpro08-centrim:~ gjp22$ date
Sun 21 Apr 2013 09:14:39 BST
macbookpro08-centrim:~ gjp22$ sudo dtrace -n 'fbt:com.getgreenbytes.filesystem.zfs:zio_write_phys:entry { self->s = timestamp; }
    arc_buf_add_ref:return /self->s/ {
        @["ns"] = quantize(timestamp - self->s); self->s = 0; }'
Password:
dtrace: description 'fbt:com.getgreenbytes.filesystem.zfs:zio_write_phys:entry ' matched 2 probes
^C

  ns                                               
           value  ------------- Distribution ------------- count   
          524288 |                                         0       
         1048576 |@                                        3       
         2097152 |@                                        5       
         4194304 |                                         1       
         8388608 |                                         0       
        16777216 |@                                        3       
        33554432 |@@@@@@@@@@@@@@@@@                        69       
        67108864 |@@@@@@@@@                                36       
       134217728 |@@@                                      11       
       268435456 |@@                                       8       
       536870912 |                                         0       
      1073741824 |                                         2       
      2147483648 |@@                                       9       
      4294967296 |@@@@                                     16       
      8589934592 |                                         1       
     17179869184 |                                         0       

macbookpro08-centrim:~ gjp22$ date
Sun 21 Apr 2013 09:19:16 BST
macbookpro08-centrim:~ gjp22$ zpool status
  pool: gjp22
 state: ONLINE
 scan: scrub repaired 0 in 34h29m with 0 errors on Thu Apr 18 10:25:47 2013
config:

   NAME                                         STATE     READ WRITE CKSUM
   gjp22                                        ONLINE       0     0     0
     GPTE_71B8BDA2-3EBA-4B91-9E1C-2AE2B1DAAD06  ONLINE       0     0     0  at disk8s2
   cache
     GPTE_2605CCB0-67B7-4C93-A4B1-83EF764CE617  ONLINE       0     0     0  at disk5s2

errors: No known data errors

  pool: tall
 state: ONLINE
 scan: scrub in progress since Sun Apr 21 08:30:40 2013
    77.8Gi scanned out of 2.02Ti at 27.3Mi/s, 20h44m to go
    0 repaired, 3.76% done
config:

   NAME                                         STATE     READ WRITE CKSUM
   tall                                         ONLINE       0     0     0
     GPTE_78301A52-4AFF-4D96-8DE9-E76ABC14909C  ONLINE       0     0     0  at disk2s2
     GPTE_99056308-F5E2-4314-852C-4DA04732A2D0  ONLINE       0     0     0  at disk4s2

errors: No known data errors

  pool: zhandy
 state: ONLINE
 scan: scrub canceled on Sun Apr  7 02:29:14 2013
config:

   NAME                                         STATE     READ WRITE CKSUM
   zhandy                                       ONLINE       0     0     0
     GPTE_A54431D5-B46F-44A9-83B4-76802A584C6E  ONLINE       0     0     0  at disk7s2

errors: No known data errors
macbookpro08-centrim:~ gjp22$


If I read that correctly:

  • most physical writes took less than 0.134217728 second
  • seventeen took longer than 4.294967296 seconds
  • one of the seventeen took longer than 8.589934592 seconds but less than 17.179869184

During that period:

  • I began and ended a small Time Machine backup to tall/com.apple.backupd on USB 2.0 on the old Sitecom dock
  • coinciding with receipt, by child datasets of tall/backups, of sends from two other pools on different buses, one of which (zhandy) is USB 2.0 on the newer Cerulian hub
  • coinciding with a launch of Preview
  • broadly coinciding with attachment of a 225.13 GB .sparsebundle that's stored on zhandy and browsing the attached image with Finder
  • all during a scrub.

Re: Flooding and choking

Post by raattgift » Sun Apr 21, 2013 11:02 am

Ah, yeah, I was trying to be clever at too low a level and made two thinkos.

Try this one instead. spa_sync() does quite a lot of work in committing a transaction group, but that will include all the writes, including synchronous ones, to the physical devices. We want to time only the entry and return of spa_sync itself, not to stop the clock on return from some related, asynchronously called function. In particular, it's not capturing the "choking" in the ARC (I'll get back to you on that one), but it will tell you if writes out to your device take a long time.

This will need to run for a little while, as spa_sync may be called as infrequently as every half-minute.

Code: Select all
dtrace -n 'fbt:com.getgreenbytes.filesystem.zfs:spa_sync:entry { self->s = timestamp; }
spa_sync:return /self->s/ {
             @["units"] = quantize(timestamp - self->s);
             self->s = 0; }'


"Units" here again are nominally ns, but timestamp is only strictly useful for relative timings rather than precision counts. The quantization into buckets is pretty close if one ignores the four least significant digits of each bucket (i.e., on most Macs, tens of microseconds is approximately the precision of the actual time in the thread).

In both of these, the multimodal distribution is because I have (on each Mac) a fast busy pool made of SSDs and at least one active pool made up of FW800 disks.

Code: Select all
  units                                             
           value  ------------- Distribution ------------- count   
           16384 |                                         0       
           32768 |@@@@@@@@@                                28       
           65536 |@@@@@@@@@@@@@@@@@                        55       
          131072 |                                         0       
          262144 |@                                        2       
          524288 |                                         0       
         1048576 |                                         0       
         2097152 |@                                        3       
         4194304 |@@@                                      9       
         8388608 |                                         0       
        16777216 |                                         0       
        33554432 |@                                        3       
        67108864 |@                                        4       
       134217728 |@@@                                      11       
       268435456 |@@@                                      10       
       536870912 |@                                        3       
      1073741824 |                                         1       
      2147483648 |                                         0       


Code: Select all
  units                                             
           value  ------------- Distribution ------------- count   
           16384 |                                         0       
           32768 |@@@                                      9       
           65536 |@@@@@@                                   18       
          131072 |                                         0       
          262144 |                                         0       
          524288 |                                         1       
         1048576 |                                         0       
         2097152 |@                                        2       
         4194304 |@                                        3       
         8388608 |                                         0       
        16777216 |                                         0       
        33554432 |@@@                                      10       
        67108864 |@@@@@@@@@@@@@@@@@                        51       
       134217728 |@@@@@@                                   18       
       268435456 |@@                                       6       
       536870912 |                                         0       


Note that the exercise (and the message before) here is tracing *ONLY WRITES*. The origin of this thread was an observation of WIRED sizes growing when writing to a slow pool, and my interest here is in quantifying how long IO takes to complete, and then comparing that with dirty pages in the ARC.

In zfs, reads are subordinate to writes (writes block reads, not the other way around - that's why there's always a ZIL and why you can configure a separate log vdev, so that small, frequent sync writes do not kill read performance). If your write speed is slow, then uncached reads will also be slow. If your write speed is slow, then dirty buffers will take longer to clean, as well.

If you have lots of write pressure (where writing by unthrottled processes is faster than writing to storage devices), then dirty buffers will grow in number and in the amount of time they remain dirty; dirty buffers may evict objects from the read cache, which in turn is likely to increase the need for uncached reads.

So, the faster a transaction is fully committed, the better overall performance will be. When there are very few writes, spa_sync() will finish very quickly. (There are other optimizations for idle pools, involving deferred frees and so forth; that will make spa_sync() return even faster). When the devices are slow to write everything and sync the labels, then spa_sync() will take longer to return.
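
A coarser, non-DTrace complement is zpool iostat, which reports per-pool operations and bandwidth rather than latency distributions (a sketch; the pool name and interval are just examples, and I'm assuming ZEVO's zpool supports the usual syntax):

Code: Select all
$ zpool iostat tall 5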

The io::: provider in Mac OS X is a bit odd; it's not clear whether /usr/bin/iopending is useful in diagnosing zfs, although it sure is for hfs.
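
For reference, the generic io-provider latency pattern looks like the following (a sketch only; given the oddness above, it may or may not attribute zfs I/O sensibly):

Code: Select all
$ sudo dtrace -n 'io:::start { ts[arg0] = timestamp; }
    io:::done /ts[arg0]/ {
        @["ns"] = quantize(timestamp - ts[arg0]); ts[arg0] = 0; }'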

Re: Flooding and choking

Post by grahamperrin » Mon Apr 22, 2013 7:11 pm

Code: Select all
macbookpro08-centrim:~ gjp22$ date
Tue 23 Apr 2013 00:39:47 BST
macbookpro08-centrim:~ gjp22$ sudo dtrace -n 'fbt:com.getgreenbytes.filesystem.zfs:spa_sync:entry { self->s = timestamp; }
> spa_sync:return /self->s/ {
>              @["units"] = quantize(timestamp - self->s);
>              self->s = 0; }'
Password:
dtrace: description 'fbt:com.getgreenbytes.filesystem.zfs:spa_sync:entry ' matched 2 probes
Client side error: Connection interrupted
^C

  units                                             
           value  ------------- Distribution ------------- count   
           16384 |                                         0       
           32768 |@                                        7       
           65536 |@@@@@@@@@@@                              64       
          131072 |                                         0       
          262144 |                                         0       
          524288 |@                                        4       
         1048576 |@@                                       10       
         2097152 |                                         2       
         4194304 |                                         0       
         8388608 |@                                        3       
        16777216 |@                                        4       
        33554432 |@                                        7       
        67108864 |@                                        5       
       134217728 |                                         2       
       268435456 |@@@                                      17       
       536870912 |@@@@                                     24       
      1073741824 |@@                                       13       
      2147483648 |@@                                       9       
      4294967296 |@@@@@@@@@@@                              61       
      8589934592 |                                         0       

macbookpro08-centrim:~ gjp22$ date
Tue 23 Apr 2013 00:47:22 BST
macbookpro08-centrim:~ gjp22$


During that period of around seven minutes:

  • Time Machine backup to ZFS file system tall/com.apple.backupd
  • scrub of pool zhandy
  • installation of a patch to NeoOffice – installations of this type are usually time-consuming; this one completed at 00:49.

From my point of view, Time Machine in recent versions of Mountain Lion seems to be a good citizen – when the Mac is busy, Time Machine-related writes take a back seat. Now, for example, it appears to have got no further than 90 KB of 11.29 GB after an hour or so. I expect bursts of activity after I go to sleep.

raattgift wrote:… The origin of this thread was an observation of WIRED sizes growing when writing to a slow pool …


I thought first of the DiskWarrior case, which seems much more read-oriented.

Re: Flooding and choking

Post by raattgift » Wed Apr 24, 2013 10:25 pm

grahamperrin wrote:
Code: Select all
      4294967296 |@@@@@@@@@@@                              61       


It's levelling off in that bucket because you are doing enough writing that you are moving a txg from open to quiesce every five seconds.

Since only one txg per state (open, quiescing, syncing) per pool is available, that means you have to complete a spa_sync in under five seconds or threads sending writes into the pool are paused.
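
To see that cadence directly, the time between successive spa_sync entries can be quantized with the same probe as above (a sketch; all pools are lumped into one aggregation, so a busy pool and an idle pool will blur together):

Code: Select all
dtrace -n 'fbt:com.getgreenbytes.filesystem.zfs:spa_sync:entry /last/ {
        @["ns between spa_sync entries"] = quantize(timestamp - last); }
    fbt:com.getgreenbytes.filesystem.zfs:spa_sync:entry { last = timestamp; }'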

If we assume that you are writing full blast to the USB 2 device, there is the problem that USB2 is a half-duplex communications bus. So during the spa_sync, you aren't really able to do much reading *from disk*, and since spa_sync is firing all the time, uncached read performance will be slow. That is, the bus is almost certain to be the limiting factor, rather than disk rotational latency or track switch and settling time. That's a first order explanation, as small caches (and opportunistic and speculative cache-fills) at various levels in the path do help somewhat.

grahamperrin wrote:I thought first of the DiskWarrior case, which seems much more read-oriented


If your dataset has atime=on, each read that DiskWarrior performs will adjust the atime of the band file; the bands directory's atime will be bumped a lot too, even when DiskWarrior (or fsck_hfs, or whatever) is issuing only reads to the contained hfs filesystem. Although these are gathered together for scheduling purposes, it's useful to remember that an isolated atime update will result in a COW of the relevant ZAP data, which may spill across multiple records. That COW also triggers writes in the metaslab spacemap. Each of those writes in turn triggers writes up the Merkle tree towards the dataset root and ultimately the uberblocks.

If DW reads a lot of band files while scanning, there will be a lot of writes, amplified by the nature of ZFS.
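
A rough way to watch that amplification is to count physical writes while a nominally read-only scan runs, reusing the probe from earlier in the thread (a sketch; this counts writes for all pools, and the 10-second tick is arbitrary):

Code: Select all
sudo dtrace -n 'fbt:com.getgreenbytes.filesystem.zfs:zio_write_phys:entry { @["phys writes"] = count(); }
    tick-10s { printa(@); trunc(@); }'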

Again, because of USB 2's relative slowness and half duplex channel, and because ZFS (sensibly) prioritizes writes over uncached (and especially low-priority) reads, the DW pattern almost inevitably will hurt performance.

Turning atime off in the dataset the sparsebundle lives in is a good idea. The number of writes should fall considerably, and performance should increase.
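
To check and change it (a sketch; the pool name is just the one mentioned in this thread, and the right target is whichever dataset actually holds the sparsebundle):

Code: Select all
$ zfs get -r atime zhandy
$ sudo zfs set atime=off zhandy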

Scrubs are low-priority reads, and when you're bus-bandwidth-limited rather than IOPS-limited, the seeks that scrubs cause should not really get in the way of other I/O that badly.

backupd typically does relatively few reads of the destination Time Machine volume, and its writes are easy to schedule and sequentialize. It also tries to avoid being the highest-priority writer. The reads that backupd does do are also highly cacheable by all but the smallest ARC, and also by UBC. Read hits of cached HFS blocks by the UBC will suppress atime updates in the zfs dataset; however, read hits of cached bandfile (and associated metadata) blocks by UBC or ARC will generate atime updates in the zfs dataset. The cached HFS blocks are rapidly invalidated from the UBC when the dmg is unmounted, but the bandfile blocks and associated *zfs* metadata may stick around in one or more of UBC, ARC, and any cache vdev (l2arc) the pool may have online.

Re: Flooding and choking

Post by grahamperrin » Fri Apr 26, 2013 2:00 pm

raattgift wrote:… If your dataset has atime=on …


Nope:

Code: Select all
macbookpro08-centrim:~ gjp22$ zpool history gjp22 tall zhandy | grep atime
2012-09-04.19:41:23 zfs set atime=off gjp22
2012-09-04.19:41:58 zfs set atime=off tall
2012-12-08.18:31:00 zfs create -o casesensitivity=insensitive -o normalization=formD -o atime=off -o snapdir=visible -o compression=gzip-9 -o canmount=noauto tall/backups/zhandy
2012-12-08.18:31:08 zfs create -o casesensitivity=insensitive -o normalization=formD -o atime=off -o snapdir=visible -o compression=gzip-9 -o canmount=noauto tall/backups/zhandy/Pocket Time Machine
2013-03-12.23:05:23 zfs create -o casesensitivity=insensitive -o compression=gzip-9 -o atime=off tall/backups/zhandy
2013-03-12.23:05:59 zfs create -o casesensitivity=insensitive -o compression=gzip-9 -o atime=off -o canmount=noauto tall/backups/gjp22
2012-12-06.18:00:52 zpool create -o ashift=12 -O casesensitivity=insensitive -O normalization=formD -O atime=off -O snapdir=visible -O compression=gzip-9 zhandy /dev/disk2
2012-12-07.20:23:24 zfs create -o casesensitivity=insensitive -o normalization=formD -o atime=off -o snapdir=visible -o compression=gzip-9 zhandy/Pocket Time Machine
2012-12-08.16:34:14 zfs create -o casesensitivity=insensitive -o normalization=formD -o atime=off -o snapdir=visible -o compression=gzip-9 zhandy/Pocket Time Machine

