Performance issues

Performance issues

Post by ghaskins » Wed Dec 05, 2012 6:08 pm

Hello all,

First off, thanks to all of those involved in ZFS development in general, and in making it available on OS X in particular. I have been waiting for this project to stabilize so that I may begin to use it for 3TB of precious personal data (photos/videos of my family, etc.). I have been using HFS+ over iSCSI to a 5-bay Synology box for years now. While this box is great, and has been extremely reliable and flexible for me, it is ultimately gated by the GigE connection to a maximum of about 115MB/s (and it lacks ZFS features such as end-to-end data integrity).

I have been tracking Zevo development for some time (I was one of the "Zevo silver" customers), and was delighted to see Zevo CE come out recently. So much so that I started outfitting my MacPro with a new SAS setup, with the hope of eventually replacing the iSCSI/HFS+ setup with ZFS to gain performance _and_ integrity. After working through some issues with raw performance to the spindles in my SAS setup, I am a bit disappointed with the overall performance of the end result once ZFS is layered in, so I am posting some info about my setup in the hope that someone out there may spy a configuration issue, or to otherwise open a discussion about what's wrong.

So first, some background. I am running an "early 2009" quad-core 2.93GHz "Nehalem" MacPro with 16GB RAM, an LSI 9207-8e HBA (Astek Corp HBA driver), 4x 3TB WD 6Gbps 7200rpm SAS drives (WD3001FYYG-01SL3), Zevo 1.1.1, and OS X 10.7.5.

I have detailed performance metrics (QuickBench, dd, etc.) for all kinds of configurations: single spindles over HFS+, all 4 spindles over Apple software RAID-0, all 4 in a raidz, striped+mirrored vdevs, SSD L2ARC, SSD ZIL, L2ARC+ZIL. You name it. I'll spare you the graphs unless someone asks.

In summary, each spindle in my system does sequential writes somewhere in the 65-165MB/s range, starting at 65MB/s with 4k blocks and moving rapidly up towards the 165MB/s peak as the block size increases. Likewise, random writes start at about 2MB/s for 4k blocks, moving rapidly into double digits and peaking at 114MB/s for 1024k blocks.

I can easily hit 165MB/s with large transfers to a single spindle, and ~700MB/s in a RAID-0/HFS+ setup, so the system at least seems able to handle the raw bandwidth.
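For reference, here is a minimal sketch of the kind of sequential dd test behind these numbers (the volume path is a placeholder; vary bs to trace the per-block-size curve):

Code: Select all
# large sequential write to a single-spindle JHFS+ volume (path is a placeholder)
dd if=/dev/zero of=/Volumes/single-disk/testfile bs=1m count=4096

# smaller blocks for the low end of the curve, e.g. 4k blocks totalling 1GiB
dd if=/dev/zero of=/Volumes/single-disk/testfile bs=4k count=262144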

Now, I totally understand that a redundant mirror/raidz has more overhead and fewer effective spindles' worth of capacity and IO performance than a 4x RAID-0; I'm not expecting it to deliver ~700MB/s of sustained transfer. The problem is that I can barely manage to get more than one spindle's worth of bandwidth out of ZFS, no matter what configuration or tweak I try, and I have seen ZFS do much better on other platforms such as OpenIndiana. So I am scratching my head and wondering whether I am doing something wrong, or whether I am simply running into a limitation of the Zevo/OSX platform compared to other ZFS platforms. Any help understanding this is appreciated.

Here's what I see. Regardless of whether I use all 4 disks in a single raidz vdev, or set up two vdevs each consisting of a two-way mirror, I never get more than 165-180MB/s out of a large dd transfer (bs=1m count=4k), and I never see higher than 233MB/s out of QuickBench "extended" runs.

I haven't focused on the stripe+mirror configuration as much, because in my setup it actually performed worse (for reasons still TBD), so most of my experience is with the 4x raidz. I would expect ballpark throughput for large transfers on this array to be on the order of 3 x 165MB/s = 495MB/s, minus some overhead, putting it in the 400MB/s range. What I see instead, as noted above, is more like 165-230MB/s. The interesting thing is that zpool iostat never shows any one drive going over 65MB/s.
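For anyone who wants to watch the same thing, this is roughly how I monitor per-drive throughput while a test write runs (the pool name and path are placeholders for my actual setup):

Code: Select all
# terminal 1: large sequential write into a filesystem on the pool
dd if=/dev/zero of=/tank/test.dat bs=1m count=4096

# terminal 2: per-device throughput, refreshed every second
zpool iostat -v tank 1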

Interestingly enough, I happen to be intimately familiar with the value 65MB/s: it is the maximum sequential large-block write throughput of these drives when either WCE=0 or the queue depth (QD) is 1 (I had an initial per-spindle performance problem that I tracked down to WCE=0). I know I have persistently set WCE=1 in the mode page of the drives, and have confirmed that each drive outside of ZFS performs as expected. So part of me is wondering whether Zevo is somehow inadvertently disabling the write cache, or limiting IOs such that the drives are effectively operating at QD=1.

I also happen to know that 65MB/s is approximately the per-spindle performance for random writes with bs=64k, QD=254 and WCE=1. So at first I thought that perhaps the ZIL was causing too many seeks, reducing the 165MB/s sequential peak to the 65MB/s random peak. I threw an SSD ZIL in there to take the random IOP load off the spindles, and saw no improvement (still 65MB/s per spindle, max). This is further supported by an OpenIndiana array with 8 spindles of lower-end 5400rpm SATA drives that regularly pushes 90MB/s per spindle, 480MB/s aggregate (and there is no dedicated ZIL device on that setup).
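For completeness, attaching the SSD log and cache devices was done roughly like this (the pool name and GPTE identifiers below are placeholders, not my actual IDs):

Code: Select all
# dedicated SSD log device (SLOG) for the ZIL -- identifiers are placeholders
zpool add tank log GPTE_00000000-0000-0000-0000-000000000001

# SSD L2ARC cache device
zpool add tank cache GPTE_00000000-0000-0000-0000-000000000002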

So my questions are: Does Zevo manage WCE as the Solaris implementation is rumored to do? If so, what are the criteria, and is it possible that the logic is inverted in my scenario? Is there a way to tell (à la the Linux sdparm tool) on OS X? Is it possible that WCE is fine, but Zevo is failing to issue many queued commands in parallel? Is something else suspected to be wrong? Or is this just a limitation of the current IO scheduler employed in Zevo?
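For reference, on a Linux box sdparm answers the WCE question directly; I haven't found an OS X equivalent, which is partly why I'm asking (the device node below is just an example):

Code: Select all
# query the Write Cache Enable bit in the caching mode page
sdparm --get=WCE /dev/sda

# enable write cache and save it persistently in the mode page
sdparm --set=WCE=1 --save /dev/sda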

Kind Regards,
-Greg

cross references and thanks

Post by grahamperrin » Wed Dec 05, 2012 11:35 pm

ghaskins wrote:… I can barely manage to get more than one spindle's worth of bandwidth out of ZFS …


For other readers: there's a similar observation amongst the cases under Performance Observation (2012-09-26).

Greg, other posts that come to mind include:

  • write throttle (2012-10-17, under Kernel Panic (idle state)) – I don't know whether throttling with ZEVO differs from throttling in an OS such as OpenIndiana
  • Performance issue with FW800 connection order? (2012-09-24) – no FireWire in this topic, but I wonder whether a change in order of things might yield an improvement in performance.

The level of detail in your opening post is excellent – thanks. For the future you might like to expand the opening subject line, just a little, to distinguish it amongst search results.

I guess that performance is a consideration in development/tuning of the custom ZFS memory manager for ZEVO … but at this stage (CE 1.1.1) in development, not an overriding consideration.

This topic deserves more than my guesswork – I'll bookmark and follow with interest.

Re: cross references and thanks

Post by ghaskins » Thu Dec 06, 2012 8:39 pm

Hello Graham

grahamperrin wrote:
The level of detail in your opening post is excellent – thanks. For the future you might like to expand the opening subject line, just a little, to distinguish it amongst search results.



Good point, and noted. I'll remember that for the next post.

grahamperrin wrote:
I guess that performance is a consideration in development/tuning of the custom ZFS memory manager for ZEVO … but at this stage (CE 1.1.1) in development, not an overriding consideration.



This is completely understandable. While I am by no stretch an OS X internals expert, I am a Linux kernel programmer and a systems guy in general, so let me know if there is some way I can help track this down. I have a vested interest in seeing Zevo become the best it can be.

Kind Regards,
-Greg

Re: Performance issues

Post by ghaskins » Mon Dec 10, 2012 8:57 am

So I learned something interesting this weekend:

Concurrent writers of different types can achieve higher aggregate throughput than a single writer alone, or than a group of similar writers.

Some background: as mentioned, my 4x 3TB 7200rpm SAS drives can hit about 175MB/s peak individually, and 700MB/s in aggregate across a RAID-0 stripe. In a raidz, I would expect the theoretical maximum to be no more than 525MB/s, since one spindle's worth of IO goes to parity.

A single "dd if=/dev/zero of=/zfs/test.dat bs=1m count=4k" results in somewhere between 180MB/s and 230MB/s, far short of the 525MB/s theoretical maximum. Running two in parallel simply divides the available throughput between them.

However, what I discovered this weekend was very interesting. I began migrating my 3TB of data (about 3 million files from my home directory) to the Zevo setup, via rsync from a JHFS+ volume mounted over iSCSI. The iSCSI link was the bottleneck in this case, generating only about 50-80MB/s of traffic, with the ZFS array mostly idle apart from a burst of about 180MB/s roughly every 5 seconds (I assume this was the TXG commit firing).

Anyway, for fun, I started a dd as above while this was going on, and was shocked to see an easily reproducible >350MB/s! Activity Monitor shows aggregate bursts between 400-500MB/s, and zpool iostat shows per-drive bursts over 120MB/s. These figures are still far short of the theoretical maximums of 525, 700, and 175MB/s respectively, yet they represent a huge improvement over the maximums I observed with a single test thread.
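For anyone who wants to try to reproduce the effect, the setup was roughly this (the source path, destination path, and pool name are placeholders for mine):

Code: Select all
# terminal 1: metadata-heavy stream from the iSCSI-mounted JHFS+ volume into the pool
rsync -a /Volumes/iscsi-home/ /tank/home/

# terminal 2: a large sequential writer on the same pool
dd if=/dev/zero of=/tank/test.dat bs=1m count=4096

# terminal 3: watch aggregate and per-device throughput
zpool iostat -v tank 1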

I'm not sure what all this means, but it does give me hope that with the right tuning, or a future update to the code, I'll be able to fully exploit the investment I made in the SAS setup. My guess is that this has something to do with either the TXG mechanism (did the higher metadata activity of the rsync drive more frequent commits?) or the write throttle. I'll let the ZFS experts weigh in.

Re: Performance issues

Post by maxijazz » Sun Dec 30, 2012 1:28 am

I have 2 pools in my MacPro:
Code: Select all
NAME                                           STATE     READ WRITE CKSUM
   basic                                          ONLINE       0     0     0
     mirror-0                                     ONLINE       0     0     0
       GPTE_6D8FB03E-C9CA-4D1D-8DBE-C487EF97D0DD  ONLINE       0     0     0  at disk4s2
       GPTE_7770868B-4233-4E3A-8783-CD55C68F01B0  ONLINE       0     0     0  at disk5s2
   cache
     GPTE_162250FE-5A24-4D1B-94CB-6BDE0A075765    ONLINE       0     0     0  at disk0s5

and
Code: Select all
NAME                                         STATE     READ WRITE CKSUM
   extra                                        ONLINE       0     0     0
     GPTE_DBF727E1-7B96-48B1-9C18-0CDAA6D18A18  ONLINE       0     0     0  at disk2s2

Average sequential write speed for the "basic" pool is about 130MB/s (with peaks at 170MB/s), which feels sufficient for my purposes.
The "extra" pool is slow and I plan to destroy it in the future.

The other day I initiated a secure erase of several huge files on the "extra" pool, which lasted for over an hour. The "Locum" process (which executes the erase procedure) was using almost 100% CPU (as usual), which translates to 100% of one core out of the 8 real cores (16 virtual) in my Mac.
During that time I tried to copy some files to the "basic" pool and, to my astonishment, it was extremely slow, only about 10MB/s. Immediately after the secure erase completed, the "normal" 130MB/s throughput was restored.

I'll let those smarter than me think about it.

Secure erasure

Post by grahamperrin » Sun Dec 30, 2012 7:16 am

Did you use Disk Utility – or diskutil(8) – to attempt the secure erasure?

Consider:


… wonder whether the data is truly erased.

Performance during attempted secure erasure by Disk Utility

Post by grahamperrin » Sun Dec 30, 2012 7:27 am

maxijazz wrote:… During that time … extremely slow …


If a hidden image written by diskutil(8) causes free space within a pool to fall below ten percent, then performance will suck.
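A quick way to see whether a pool is approaching that threshold (using the pool name from the post above):

Code: Select all
# the CAP column shows how full the pool is; above roughly ninety percent, expect degradation
zpool list extra

# per-dataset space accounting (USED / AVAIL / REFER)
zfs list -r extra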

maxijazz wrote:Immediately after the secure erase completed, the "normal" 130MB/s throughput was restored. …


Hurrah!

In Super User (moved from Ask Different): What free space thresholds/limits are advisable for 640 GB and 2 TB hard disk drives with ZEVO ZFS on OS X?


maxijazz, I guess that the temporary loss of performance around your basic pool was partly due to the suck around your extra pool.

cache at any slice of disk0 – consider latency

Post by grahamperrin » Sun Dec 30, 2012 7:45 am

maxijazz wrote:… 

Code: Select all
NAME                                           STATE     READ WRITE CKSUM
   basic                                          ONLINE       0     0     0
     mirror-0                                     ONLINE       0     0     0
       GPTE_6D8FB03E-C9CA-4D1D-8DBE-C487EF97D0DD  ONLINE       0     0     0  at disk4s2
       GPTE_7770868B-4233-4E3A-8783-CD55C68F01B0  ONLINE       0     0     0  at disk5s2
   cache
     GPTE_162250FE-5A24-4D1B-94CB-6BDE0A075765    ONLINE       0     0     0  at disk0s5




If your HFS Plus OS X startup volume is also on a slice of disk0, then consider the latency effect of cache traffic and system I/O competing for that one device.
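A quick way to check which volumes share that physical device (disk numbers vary from Mac to Mac):

Code: Select all
# list every slice of disk0; the startup volume and the cache slice
# will both appear here if they share the device
diskutil list disk0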


Re: Performance issues

Post by John Wiegley » Thu Feb 14, 2013 10:04 pm

You can't compare RAID-Z to RAID-0; they are entirely different things. When using RAID-Z, you can expect write speeds to be as fast as the slowest spindle in the group. See https://blogs.oracle.com/roch/entry/when_to_and_not_to for more details.

Re: Performance issues

Post by ghaskins » Fri Feb 15, 2013 7:45 am

Hi John,

John Wiegley wrote:You can't compare RAID-Z to RAID-0, they are entirely different things.


Yes, but any mention of RAID-0 was explicitly framed as not directly comparable; it was provided to illustrate that the hardware is not the bottleneck, nothing more.

John Wiegley wrote:When using RAID-Z, you can expect write speeds to be as fast as the slowest spindle in the group. See https://blogs.oracle.com/roch/entry/when_to_and_not_to for more details.


Actually, this is a common misconception: it confuses IOPS with IO bandwidth, which are not the same thing. It is absolutely true that raidz1 IOPS are gated by the slowest device in the array. In terms of bandwidth (which is what I was measuring here), however, a raidz1 will generally perform similarly to a RAID-0 or RAID-5, in that bandwidth is aggregated across the spindles (minus the parity overhead). This is true on most ZFS implementations, except Zevo CE 1.1.1, where there seems to be some kind of suboptimal code path affecting at least some of us in the community.
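As a rough back-of-envelope, assuming four drives that each stream at about 165MB/s in a single raidz1 vdev (my numbers from above; yours will differ):

Code: Select all
# streaming write bandwidth scales with the data (non-parity) disks:
#   (4 - 1) * 165MB/s = ~495MB/s, minus overhead
# small random-write IOPS do not scale: every write touches all disks in the
# vdev, so the vdev delivers roughly the IOPS of one (the slowest) disk
echo $(( (4 - 1) * 165 ))    # -> 495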

But you don't have to take my word for it. Try setting up a dual-boot box and running something like OmniOS, OpenIndiana, or FreeNAS alongside OS X + Zevo. On the former you will generally see bandwidth approximately equal to N-1 of your underlying devices; on the latter, something typically much lower, often a fraction of even a single spindle. That is the phenomenon I was reporting.

Kind Regards,
-Greg