Hello all,
First off, thanks to everyone involved in ZFS development in general, and in making it available on OSX in particular. I have been waiting for this project to stabilize so that I can begin to use it for 3TB of precious personal data (photos/videos of my family, etc). I have been using HFS+ over iSCSI to a 5-bay Synology box for years now. While this box is great, and has been extremely reliable and flexible for me, it is ultimately gated by the GigE connection to a 115MB/s maximum (and it lacks ZFS features like end-to-end data integrity).
I have been tracking Zevo development for some time (I was one of the "Zevo Silver" customers), and was delighted to see Zevo CE come out recently. So much so that I started outfitting my MacPro with a new SAS setup, with the hope of eventually replacing the iSCSI/HFS+ setup with ZFS to gain performance _and_ integrity. After working through some issues with raw performance to the spindles on my SAS setup, I am a bit disappointed with the overall performance once ZFS is layered in, so I am posting some info about my setup in the hope that someone out there may spot a configuration issue, or to otherwise open a discussion about what's wrong.
So first, some background. I am running an "early 2009" quad-core 2.93GHz "Nehalem" MacPro with 16GB RAM, an LSI 9207-8e HBA (Astek Corp HBA driver), 4x 3TB WD 6Gbps 7200rpm SAS drives (WD3001FYYG-01SL3), Zevo 1.1.1, and OSX 10.7.5.
I have detailed performance metrics (quickbench, dd, etc) for all kinds of configurations: single spindles over HFS+, all 4 spindles in an Apple software RAID0, all 4 in a raidz, striped+mirrored vdevs, SSD L2ARC, SSD ZIL, L2ARC+ZIL. You name it. I'll spare you the graphs unless someone asks.
In summary, each spindle in my system operates somewhere in the 65MB/s - 165MB/s range, starting at 65MB/s for 4k sequential blocks and climbing rapidly toward the 165MB/s peak as the block size increases. Likewise, random writes start at about 2MB/s for 4k blocks, move rapidly into double digits, and peak at 114MB/s for 1024k blocks.
I can easily hit 165MB/s with large transfers to a single spindle, and ~700MB/s in a RAID0/HFS+ setup, so the system at least seems able to handle the raw bandwidth.
Now I totally understand that a redundant mirror/raidz has more overhead and fewer effective spindles' worth of capacity and IO performance than a 4x RAID0, and I'm not expecting it to deliver ~700MB/s of sustained transfer. The problem is that I can barely get more than a single spindle's worth of bandwidth out of ZFS, no matter what configuration or tweak I try, and I have seen ZFS do much better on other platforms such as OpenIndiana. So I am scratching my head, wondering whether I am doing something wrong or simply running into a limitation of the Zevo/OSX platform compared to other ZFS platforms. Any help understanding this is appreciated.
Here's what I see. Regardless of whether I use all 4 disks in a single RAIDZ vdev, or set up two vdevs that are each a two-way mirror, I never get more than 165MB/s-180MB/s out of a large dd transfer (bs=1m count=4k), and I never see higher than 233MB/s out of quickbench "extended" runs.
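For reference, the large dd transfers are along these lines (the pool name and path here are placeholders, not my actual ones, and /dev/zero isn't a perfect stand-in for real data, but it's fine for measuring raw throughput):

    # sequential write: 4GB of 1MB blocks to a dataset on the pool
    dd if=/dev/zero of=/Volumes/tank/testfile bs=1m count=4096

    # sequential read of the same file back
    dd if=/Volumes/tank/testfile of=/dev/null bs=1m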
I haven't focused on the stripe+mirror configuration as much, because in my setup it actually performed worse (for reasons still TBD), so most of my experience is with the 4x RAIDZ. For large transfers I would expect that array to land somewhere on the order of 3 x 165MB/s = 495MB/s, minus some overhead, putting it in the 400MB/s range. What I see instead, as noted above, is more like 165MB/s-230MB/s. The interesting thing is that zpool iostat never shows any one drive going over 65MB/s.
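In case anyone wants to sanity-check my methodology, this is roughly how I've been watching per-device throughput while the dd runs ("tank" again stands in for my actual pool name):

    # per-vdev and per-device bandwidth, sampled every 5 seconds
    zpool iostat -v tank 5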
Interestingly enough, I happen to be intimately familiar with the value of 65MB/s. It's the maximum sequential large-block write throughput of these drives when either WCE=0 or QD=1 (I had an initial problem with per-spindle performance that I tracked down to WCE=0). I know I have persistently set WCE=1 in the mode page of the drives, and have confirmed that each drive performs as expected outside of ZFS. So part of me is wondering if Zevo is somehow inadvertently disabling the write cache, or limiting outstanding IOs such that each drive is effectively operating at QD=1.
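For reference, this is roughly what the check/set looks like with Linux sdparm (the device name is just an example); part of what I'm asking below is whether there is an OSX equivalent:

    # read WCE from the caching mode page; 1 means the volatile write cache is enabled
    sdparm --get=WCE /dev/sdb

    # set it in the saved (persistent) mode page if it isn't
    sdparm --set=WCE=1 --save /dev/sdb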
I also happen to know that 65MB/s is approximately what any given spindle delivers for random writes with bs=64k at QD=254/WCE=1. So at first I was thinking that perhaps the ZIL was causing too many seeks, dragging the 165MB/s sequential peak down to the 65MB/s random peak. I threw an SSD ZIL in there to take the random IO load off the spindles, and saw no improvement (still 65MB/s per spindle, max). This is further supported by looking at an OI array with 8 spindles of lower-end 5400rpm SATA drives, which regularly pushes 90MB/s per spindle, 480MB/s aggregate (and there is no dedicated ZIL device on that setup).
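For completeness, the SSD log/cache experiments were along the lines of the following (the disk identifiers are examples, not my actual devices):

    # dedicate an SSD partition as a separate ZIL/log device
    zpool add tank log /dev/disk5s1

    # and/or as an L2ARC cache device
    zpool add tank cache /dev/disk5s2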
So my questions are: Does Zevo manage WCE the way the Solaris implementation is rumored to? If so, what are the criteria, and is it possible the logic is inverted in my scenario? Is there a way to tell from OSX (a la the Linux sdparm tool)? Is it possible that WCE is fine, but Zevo is failing to issue many queued commands in parallel (one crude probe is sketched below)? Is something else suspected to be wrong? Or is this just a limitation of the current IO scheduler employed in Zevo?
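The crude probe I have in mind for the queuing question is several concurrent sequential writers against the pool; if per-drive throughput stays pinned at ~65MB/s even with multiple streams in flight, that would point toward something like QD=1 behavior rather than the ZIL. Roughly (pool path is again a placeholder):

    # launch 4 concurrent 1GB sequential writers and wait for them all
    for i in 1 2 3 4; do
      dd if=/dev/zero of=/Volumes/tank/probe$i bs=1m count=1024 &
    done
    wait

Happy to run that, or anything else people suggest, and post the numbers.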
Kind Regards,
-Greg