hanging on 1.9.0

Posted: Mon Jun 17, 2019 8:03 pm
by mauricev
I have a 3 TB ZFS mirror using zfs 1.9.0 under macOS 10.14.5. I'm rsyncing files to it while also running a virtual machine. After a while, I/O hangs. What's going on?

Re: hanging on 1.9.0

Posted: Tue Jun 18, 2019 4:45 am
by tangles
Not much by the sounds of it…

What's going on? who knows…
You switched it off? you closed your eyes?
Sorry for taking the piss… but mate… we're not mind readers.

Please provide a bit more info. Here are some suggestions…

Hardware description of Mac and zpool connectivity.
Output of:
zpool status and zpool list so we can see how your pool is set up and what state it's in.
zfs get all on <dataset in question>
zpool iostat -v 1 600 while running rsync to see if any vdev has poor I/O.

By providing the above, the community will have a better chance to help you.
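
The commands listed above can be gathered into one report with a small wrapper. This is only a sketch: the function name `collect_zfs_diag` and the report file name are made up for illustration, and the pool and dataset are whatever you pass in. The 1-second interval and 600 samples are the iostat values suggested above.

```shell
# Sketch: run the requested ZFS diagnostics one after another.
# collect_zfs_diag is a hypothetical helper name, not part of ZFS.
collect_zfs_diag() {
    diag_pool="$1"              # pool name, e.g. externalhd
    diag_dataset="$2"           # dataset in question
    diag_samples="${3:-600}"    # number of 1-second iostat samples

    echo "== zpool status =="
    zpool status -v "$diag_pool"
    echo "== zpool list =="
    zpool list "$diag_pool"
    echo "== zfs get all =="
    zfs get all "$diag_dataset"
    echo "== zpool iostat =="
    zpool iostat -v "$diag_pool" 1 "$diag_samples"
}

# Example, started while the rsync is running:
#   collect_zfs_diag externalhd externalhd 600 > zfs-report.txt 2>&1
```

The iostat step takes about ten minutes at the default sample count, so the report is only complete once that finishes.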

Cheers,

Re: hanging on 1.9.0

Posted: Tue Jun 18, 2019 6:47 am
by mauricev
Trash can Mac with a JMicron-based 2-bay disk enclosure connected via USB 3.
Code:
 pool: externalhd
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
   still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
   the pool may no longer be accessible by software that does not support
   the features. See zpool-features(5) for details.
  scan: scrub in progress since Tue Jun 18 10:26:29 2019
   826G scanned at 810M/s, 26.0G issued at 25.5M/s, 2.41T total
   0 repaired, 1.05% done, 1 days 03:12:32 to go
config:

   NAME                                            STATE     READ WRITE CKSUM
   externalhd                                      ONLINE       0     0     0
     mirror-0                                      ONLINE       0     0     0
       media-A1232949-F65F-A64B-B241-D4DBBA49E0A0  ONLINE       0     0     0
       media-448C5ECE-160D-7446-BF08-3BED3AE018B5  ONLINE       0     0     0

errors: No known data errors

As you can see, I'm running a scrub.

Code:
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
externalhd  2.72T  2.41T   319G        -         -    16%    88%  1.00x  ONLINE  -


I saw this message when trying to reboot after it hung:
Code:
Warning: Pool 'externalhd' has encountered an uncorrectable I/O failure and has been suspended.


Bad USB cable? Bad controller electronics?

Re: hanging on 1.9.0

Posted: Tue Jun 18, 2019 7:14 am
by mauricev
I am separately running a Mac program, Drive Genius, which is supposed to monitor disk health. It's now reporting that one of the two disks, a Toshiba DT01ACA300, has a significant number of damaged areas. This implies the disk is very sick, but the scrub is still running and not detecting any errors. How are the errors apparent to Drive Genius but not to zfs? Why should the errors of one disk hang the whole pool?

Re: hanging on 1.9.0

Posted: Tue Jun 18, 2019 7:36 am
by sean
mauricev wrote:...one of the two disks, a Toshiba DT01ACA300, has a significant number of damaged areas. This implies the disk is very sick, but the scrub is still running and not detecting any errors. How are the errors apparent to it, but not zfs?


The disk and the filesystem can have very different points of view. Drives can attempt to remap bad blocks, etc. I think the more relevant piece is that your scrub is only ~1% complete, so it's premature to think that zfs won't find any errors. I don't think there is much point in waiting around to see if it does, though.

mauricev wrote:Why should the errors of one disk hang the whole pool?


It doesn't take many resets and/or timeouts to send drive performance right down the drain, and I suspect you're getting a LOT of them. Since you have a mirror, I would stop the scrub, pull the drive, put in a replacement, resilver, and carry on.
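
The stop, pull, replace, resilver sequence could look roughly like the commands below. This is a sketch that only prints the commands so they can be reviewed first. `plan_mirror_swap` is a made-up helper name, and both device arguments are placeholders: the thread never says which of the two `media-...` devices is the failing one, and the replacement's device name depends on how macOS enumerates it.

```shell
# Sketch: print (do not run) the commands for swapping one side of a mirror.
plan_mirror_swap() {
    swap_pool="$1"   # pool name
    swap_old="$2"    # failing device as shown in zpool status
    swap_new="$3"    # replacement device (placeholder)

    echo "zpool scrub -s $swap_pool"                      # stop the running scrub
    echo "zpool offline $swap_pool $swap_old"             # take the bad disk out of service
    echo "# physically swap the drive, then:"
    echo "zpool replace $swap_pool $swap_old $swap_new"   # starts the resilver
    echo "zpool status -v $swap_pool"                     # watch resilver progress
}

# Example with placeholder device names:
#   plan_mirror_swap externalhd media-OLD-DISK disk4
```

Printing rather than executing is deliberate here, since picking the wrong device in a two-way mirror removes the only good copy.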

Re: hanging on 1.9.0

Posted: Tue Jun 18, 2019 3:47 pm
by lundman
"an uncorrectable I/O failure and has been suspended."


ZFS detected that the disk had more or less vanished, and was forced to give up - you will not get more data from ZFS after that. You can issue "zpool clear pool" and "zpool clear pool device" to ask it to retry talking to the disk, but it seems likely the disk will glitch again.
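
With the pool name from this thread, the retry could be wrapped as below. A minimal sketch: `retry_suspended_pool` is a hypothetical helper name, and the optional second argument is whichever device name `zpool status` shows for the disk that dropped out.

```shell
# Sketch: ask ZFS to retry I/O on a suspended pool, then show its health.
retry_suspended_pool() {
    retry_pool="$1"     # pool name, e.g. externalhd
    retry_device="$2"   # optional: one device within the pool

    if [ -n "$retry_device" ]; then
        zpool clear "$retry_pool" "$retry_device" || return 1
    else
        zpool clear "$retry_pool" || return 1
    fi
    # -x limits the output to pools that still have problems.
    zpool status -x "$retry_pool"
}

# Example:
#   sudo sh -c '. ./retry.sh; retry_suspended_pool externalhd'
```

If the disk glitches again, as predicted above, the pool will simply be suspended again.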

Re: hanging on 1.9.0

Posted: Thu Jun 20, 2019 7:13 am
by mauricev
I replaced the disk and the pool seems to be working normally.

Re: hanging on 1.9.0

Posted: Tue Mar 10, 2020 12:07 am
by Sigmoid
Okay, I had a similar but not identical experience (on 1.9.4), and I'd like to get some advice. I'm on an old MacBook Pro running High Sierra with FireWire but a dead Thunderbolt port (and only USB2), so I'm using FW800 for storage (slow but mostly adequate).

I've been using a mirrored pool for years and years now, but it's full and I'm moving on. I got a second-hand 4-drive bay, configured it to JBOD, loaded it with four brand spanking new Toshiba P300 3TB drives, and created a RAIDZ2 pool on them. Then I proceeded to rsync everything over from the old pool to the new one. At ~20MB/s average transfer it could be far better, but whatev, I'm on old hardware and copying from FW800 to FW800 over a single connection.

However, today I woke up to see that the rsync was hanging. It was still showing "36% 17.76MB/s 0:04:55" for the progress on the last file, but judging by where the copy was when I went to bed, it seems to have been showing that for hours. Also, I waited around and nothing happened. I could browse the old pool just fine, but when I navigated to the new zfs filesystem I was copying to, Finder hung. I restarted Finder with option-command-esc, opened a terminal and tried to look at the new filesystem. Bash hung too.

I tried restarting the computer but it hung during the restart. I swallowed hard and just powered down the machine. On restarting, I could successfully mount both pools, and everything copied to the new pool up until the time of the freeze seems okay. ZFS sees nothing wrong with the pools.

Code:
bash-3.2$ zfs --version
zfs-1.9.4-0
zfs-kmod-1.9.4-0

bash-3.2$ zpool list
NAME          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
Borogove     10.9T   997G  9.90T        -         -     0%     8%  1.00x  ONLINE  -
Jabberwocky  1.36T  1.25T   116G        -         -     7%    91%  1.00x  ONLINE  -

bash-3.2$ zpool status -v Borogove
  pool: Borogove
 state: ONLINE
  scan: none requested
config:

   NAME                                            STATE     READ WRITE CKSUM
   Borogove                                        ONLINE       0     0     0
     raidz2-0                                      ONLINE       0     0     0
       media-A51B7409-55CB-F843-8965-056CB0EA2580  ONLINE       0     0     0
       media-E237643B-EFEF-3040-8338-50A9B97DC6CE  ONLINE       0     0     0
       media-B446407F-226D-3F42-BF90-45B23938F8EB  ONLINE       0     0     0
       media-671B767A-A3B1-7B45-8B72-A0AA31AD6726  ONLINE       0     0     0

errors: No known data errors
bash-3.2$ zpool status -v Jabberwocky
  pool: Jabberwocky
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
   still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
   the pool may no longer be accessible by software that does not support
   the features. See zpool-features(5) for details.
  scan: none requested
config:

   NAME                                            STATE     READ WRITE CKSUM
   Jabberwocky                                     ONLINE       0     0     0
     mirror-0                                      ONLINE       0     0     0
       media-1878AB63-8C37-C543-9B55-62325B92CF75  ONLINE       0     0     0
       media-DA0E5735-D45E-2946-90EC-BE3B8E1BA013  ONLINE       0     0     0

errors: No known data errors


I tried looking at log entries, but found nothing that could help me figure out what happened. I checked power settings and the machine shouldn't have gone to sleep or tried to turn off hard drives.

Any idea where to look for pointers to the cause of the issue? Also, if any of the drives are faulty, I'd be really happy to take it back, but I'd have to know which one... I don't know what kind of drive diag works over FW800; Disk Utility says SMART isn't supported on these drives.

Re: hanging on 1.9.0

Posted: Tue Mar 10, 2020 2:55 am
by nodarkthings
As for SMART diagnostics, you might try SAT SMART Driver (https://binaryfruit.com/drivedx/usb-drive-support); it might be able to read SMART data on your external drive.
P.S.: I've never tried it with FW...

Re: hanging on 1.9.0

Posted: Tue Mar 10, 2020 3:05 am
by Sigmoid
Okay I have something. It's flipping me out honestly.

Code:
bash-3.2$ zpool status -v Borogove
  pool: Borogove
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-HC
  scan: none requested
config:

   NAME                                            STATE     READ WRITE CKSUM
   Borogove                                        UNAVAIL      0     0     0  insufficient replicas
     raidz2-0                                      UNAVAIL      2    32     0  insufficient replicas
       media-A51B7409-55CB-F843-8965-056CB0EA2580  REMOVED      0     0     0
       media-E237643B-EFEF-3040-8338-50A9B97DC6CE  REMOVED      0     0     0
       media-B446407F-226D-3F42-BF90-45B23938F8EB  REMOVED      0     0     0
       media-671B767A-A3B1-7B45-8B72-A0AA31AD6726  REMOVED      0     0     0

errors: List of errors unavailable (insufficient privileges)


Well, the next time it happens I'll make sure to run it with sudo. Unfortunately I tried zpool clear first, and things froze up. It's weird that it says "REMOVED". Makes me wonder if it's a connectivity thing. Now the weird thing is that the old pool is daisy-chained from the new box, and that was still up, so it's not a bus-wide issue.
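
To catch this state the next time without reading the whole listing, the config table of `zpool status` can be filtered for anything that is not ONLINE or has nonzero error counters. A sketch, assuming the column layout shown in the status output above (NAME, STATE, READ, WRITE, CKSUM); `flag_bad_vdevs` is a made-up name.

```shell
# Sketch: read `zpool status` output on stdin and print each config line
# whose state is not ONLINE or whose READ/WRITE/CKSUM counter is nonzero.
flag_bad_vdevs() {
    awk '
        # The config table starts after the NAME/STATE header line.
        $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next }
        # The "errors:" line ends the table.
        /^errors:/                    { in_cfg = 0 }
        in_cfg && NF >= 5 {
            if ($2 != "ONLINE" || $3 != 0 || $4 != 0 || $5 != 0)
                print $1, $2, "read=" $3, "write=" $4, "cksum=" $5
        }
    '
}

# Example:
#   sudo zpool status -v Borogove | flag_bad_vdevs
```

Run against the listing above, it would report all six lines of the table, since neither the pool, the raidz2 vdev, nor any of the four disks is ONLINE. On a healthy pool it prints nothing.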

I'm pissed because I'm not prepared to drop a shitload of money on a new file server, or a new Mac AND a Thunderbolt 3 enclosure, and I got the FireWire enclosure (OWC Mercury Elite Pro Qx2) second hand, as these aren't being made anymore. At this point I'm praying to god that one of the hard drives is faulty, and not the enclosure.

I started a scrub, we'll see what's up with that, but of course that only checks the media with data on it.

P.S. BTW, strangely, it would seem that once again the transfer froze up around the time the display blanked. At first I thought it might be a power management thing, but again, power management is set to never turn anything off. Now I've even turned off display blanking. It would be nice if I could see a log of FireWire subsystem messages, like on Linux, but I haven't found any such thing so far.
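
One place to look on High Sierra is the unified log, queried with the `log` command, plus the power-management history from `pmset -g log`. The following is a sketch, not a known-good recipe: the predicate is a guess at what might surface FireWire-related messages, the helper names are made up, and the kernel may simply not have logged anything useful about the bus.

```shell
# Sketch for macOS 10.12 or later (unified logging).
# Show recent log entries whose sender or message text mentions FireWire.
fw_log() {
    fw_window="${1:-12h}"   # how far back to search, e.g. 30m, 12h, 1d
    log show --last "$fw_window" --predicate \
        'senderImagePath CONTAINS[c] "firewire" OR eventMessage CONTAINS[c] "firewire"'
}

# Power-management history filtered to display events, to check whether the
# hang really lines up with the display blanking.
display_events() {
    pmset -g log | grep -i "display"
}

# Example:
#   fw_log 12h > firewire-log.txt
#   display_events | tail -n 20
```

Comparing the timestamps from the two outputs against the time the rsync stalled would show whether the display blanking and the drives going REMOVED really coincide.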