Massive failures during file copies

All your general support questions for OpenZFS on OS X.

Massive failures during file copies

Postby Gerk » Tue Sep 20, 2016 12:52 pm

Had this happen a couple of times now. I didn't get there quickly enough to copy/paste a proper zpool status, but it basically showed a ton of devices offline and a pile of errors.
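Since the pool state is gone by the time the machine locks up, something like this hypothetical watcher can keep appending status snapshots to a log so the last good one survives a lockup. The pool name, log path, and polling approach are my assumptions based on this thread, not anything shipped with O3X:

```shell
#!/bin/bash
# Hypothetical status watcher; POOL and LOG defaults are assumptions
# taken from the pool shown in this thread.
POOL="${POOL:-zRAID}"
LOG="${LOG:-/tmp/zpool-watch.log}"

snapshot() {
    # Append one timestamped status capture to the log.
    {
        date
        zpool status -v "$POOL" 2>&1
        echo "---- end of snapshot ----"
    } >> "$LOG"
}

# Take a snapshot right away if zpool is installed; to keep watching,
# run e.g.:  while sleep 5; do snapshot; done
if command -v zpool >/dev/null 2>&1; then
    snapshot
fi
```

Reading the log after a reboot should show the pool's state right up to the moment the machine stopped responding.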

This is with known working drives and controller (I just finished re-testing them all using the hardware RAID setup and wrote to almost half the size of the array with no issues).

Here's a snippet of the system.log: https://gist.github.com/Gerk/9ba9c6d74b ... 5128a14baa

If it means anything, I have lz4 compression enabled and was just doing some burn-in testing with file copies. This happened during a pretty simple file copy (duplicating about 4 files of roughly 1 GB each).

Freshly installed OS X 10.11.6, using the latest 1.5.2 release.

Code:
$ sudo zpool status -v
  pool: zRAID
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-HC
  scan: resilvered 131M in 0h0m with 0 errors on Tue Sep 20 13:35:18 2016
config:

   NAME                                            STATE     READ WRITE CKSUM
   zRAID                                           UNAVAIL      0     0     0  insufficient replicas
     raidz2-0                                      UNAVAIL      0    77     0  insufficient replicas
       media-844F2B4B-DA52-D54D-91FC-F683CE1A8644  FAULTED      0     9     0  too many errors
       media-8FEF6055-61B7-1847-A871-4A33AD9C2726  FAULTED      0     4     0  too many errors
       media-FD4416B8-CCA6-0A49-9ED8-C6D0C6488EC2  FAULTED      0    73     0  too many errors
       media-C948E9F3-37B8-3D49-85D7-AFD9F71BF325  FAULTED      3    93     0  too many errors
       media-BF79C110-4AED-A243-A5AC-A481827BF75A  FAULTED      0    55     0  too many errors
       media-CF80C451-6C58-CA4D-BC4E-FE0703E3115E  ONLINE       0    55     1
     raidz2-1                                      UNAVAIL      0     8     0  insufficient replicas
       media-F3A37669-7818-2543-9602-9231F1B41593  FAULTED      4     9     0  too many errors
       media-37275731-F50A-0041-BE47-763ACD314FE3  ONLINE       0     0     0
       media-E453976D-F3CD-3F42-B216-DC91363B4676  FAULTED      0     1     0  too many errors
       media-73F5F336-AB9E-0C47-8627-A2D898C6A4CC  FAULTED      4    11     0  too many errors
       media-10060C2E-13C4-5348-9D04-DD238CB1134D  ONLINE       0     0     0
       media-0C568C7E-AAF8-7D4A-9DD3-EB245FE2A5AE  ONLINE       0     0     0

errors: List of errors unavailable (insufficient privileges)
Gerk
 
Posts: 9
Joined: Mon Sep 19, 2016 9:45 am

Re: Massive failures during file copies

Postby Gerk » Tue Sep 20, 2016 12:59 pm

And the (virtual) machine ended up fully locking up. After a restart the pool is back online, with some partially copied files left in a zombie state.

Code:
$ zpool status
  pool: zRAID
 state: ONLINE
  scan: resilvered 364M in 0h0m with 0 errors on Tue Sep 20 14:02:53 2016
config:

   NAME                                            STATE     READ WRITE CKSUM
   zRAID                                           ONLINE       0     0     0
     raidz2-0                                      ONLINE       0     0     0
       media-844F2B4B-DA52-D54D-91FC-F683CE1A8644  ONLINE       0     0     0
       media-8FEF6055-61B7-1847-A871-4A33AD9C2726  ONLINE       0     0     0
       media-FD4416B8-CCA6-0A49-9ED8-C6D0C6488EC2  ONLINE       0     0     0
       media-C948E9F3-37B8-3D49-85D7-AFD9F71BF325  ONLINE       0     0     0
       media-BF79C110-4AED-A243-A5AC-A481827BF75A  ONLINE       0     0     0
       media-CF80C451-6C58-CA4D-BC4E-FE0703E3115E  ONLINE       0     0     0
     raidz2-1                                      ONLINE       0     0     0
       media-F3A37669-7818-2543-9602-9231F1B41593  ONLINE       0     0     0
       media-37275731-F50A-0041-BE47-763ACD314FE3  ONLINE       0     0     0
       media-E453976D-F3CD-3F42-B216-DC91363B4676  ONLINE       0     0     0
       media-73F5F336-AB9E-0C47-8627-A2D898C6A4CC  ONLINE       0     0     0
       media-10060C2E-13C4-5348-9D04-DD238CB1134D  ONLINE       0     0     0
       media-0C568C7E-AAF8-7D4A-9DD3-EB245FE2A5AE  ONLINE       0     0     0

errors: No known data errors

Re: Massive failures during file copies

Postby Gerk » Tue Sep 20, 2016 1:03 pm

Aaaaand, I can recreate it on demand: pretty much exactly the same issue when I selected a half-dozen files and hit Duplicate in Finder.

Code:
Sep 20 14:06:17 freenas kernel[0]: Task timeout reset bus
Sep 20 14:06:17 freenas kernel[0]: disk8s1: device/channel is not attached.
Sep 20 14:06:17 --- last message repeated 3 times ---
Sep 20 14:06:17 freenas kernel[0]: disk9s1: device/channel is not attached.
Sep 20 14:06:37 freenas kernel[0]: Task timeout reset bus
Sep 20 14:06:54 --- last message repeated 10 times ---
Sep 20 14:06:57 freenas kernel[0]: Task timeout reset bus
Sep 20 14:07:17 --- last message repeated 21 times ---
Sep 20 14:07:29 freenas kernel[0]: Device 7/0 removed.
Sep 20 14:07:29 freenas kernel[0]: Device 0/0 removed.
Sep 20 14:07:30 freenas kernel[0]: disk8s1: device/channel is not attached.
Sep 20 14:07:30 --- last message repeated 6 times ---
Sep 20 14:07:30 freenas kernel[0]: disk9s1: device/channel is not attached.
Sep 20 14:07:30 --- last message repeated 4 times ---
Sep 20 14:07:30 freenas zed[608]: eid=18 class=probe_failure pool=zRAID
Sep 20 14:07:30 freenas zed[610]: eid=19 class=probe_failure pool=zRAID
Sep 20 14:07:31 freenas kernel[0]: disk8s1: device/channel is not attached.
Sep 20 14:07:31 --- last message repeated 2 times ---
Sep 20 14:07:31 freenas kernel[0]: disk9s1: device/channel is not attached.
Sep 20 14:07:31 --- last message repeated 2 times ---
Sep 20 14:07:31 freenas zed[612]: eid=20 class=probe_failure pool=zRAID
Sep 20 14:07:31 freenas zed[614]: eid=21 class=probe_failure pool=zRAID
Sep 20 14:07:45 freenas kernel[0]: ZFS: Device removal detected: 'disk8'
Sep 20 14:07:45 freenas kernel[0]: ZFS: Device removal detected: 'disk8s1'
Sep 20 14:07:45 freenas kernel[0]: ZFS: Device removal detected: 'disk8s9'
Sep 20 14:07:45 freenas kernel[0]: ZFS: Device removal detected: 'disk9'
Sep 20 14:07:45 freenas kernel[0]: ZFS: Device removal detected: 'disk9s1'
Sep 20 14:07:45 freenas kernel[0]: ZFS: Device removal detected: 'disk9s9'
Sep 20 14:07:45 freenas InvariantDisk[49]: Removing symlink: /var/run/disk/by-path/PCI0@0-PE50@16-S1F0@0-@7:0
Sep 20 14:07:45 freenas InvariantDisk[49]: Removing symlink: /var/run/disk/by-path/PCI0@0-PE50@16-S1F0@0-@8:9
Sep 20 14:07:45 freenas InvariantDisk[49]: Removing symlink: /var/run/disk/by-id/media-5D7044B7-9771-BA48-897A-762ACD58983A
Sep 20 14:07:45 freenas InvariantDisk[49]: Removing symlink: /var/run/disk/by-path/PCI0@0-PE50@16-S1F0@0-@8:1
Sep 20 14:07:45 freenas InvariantDisk[49]: Removing symlink: /var/run/disk/by-id/media-E453976D-F3CD-3F42-B216-DC91363B4676
Sep 20 14:07:45 freenas InvariantDisk[49]: Removing symlink: /var/run/disk/by-path/PCI0@0-PE50@16-S1F0@0-@8:0
Sep 20 14:07:45 freenas InvariantDisk[49]: Removing symlink: /var/run/disk/by-path/PCI0@0-PE50@16-S1F0@0-@7:9
Sep 20 14:07:45 freenas InvariantDisk[49]: Removing symlink: /var/run/disk/by-id/media-DDF320E1-AEFB-6140-BA46-5D181E5A4986
Sep 20 14:07:45 freenas InvariantDisk[49]: Removing symlink: /var/run/disk/by-path/PCI0@0-PE50@16-S1F0@0-@7:1
Sep 20 14:07:45 freenas InvariantDisk[49]: Removing symlink: /var/run/disk/by-id/media-37275731-F50A-0041-BE47-763ACD314FE3
Sep 20 14:07:51 freenas kernel[0]: Task timeout reset bus
Sep 20 14:07:51 freenas kernel[0]: disk4s1: device/channel is not attached.
 

Re: Massive failures during file copies

Postby Gerk » Tue Sep 20, 2016 1:04 pm

Code:
$ sudo zpool status -v
Password:
  pool: zRAID
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
   Sufficient replicas exist for the pool to continue functioning in a
   degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
   repaired.
  scan: resilvered 364M in 0h0m with 0 errors on Tue Sep 20 14:02:53 2016
config:

   NAME                                            STATE     READ WRITE CKSUM
   zRAID                                           DEGRADED     0     0     0
     raidz2-0                                      ONLINE       0     0     0
       media-844F2B4B-DA52-D54D-91FC-F683CE1A8644  ONLINE       0     0     0
       media-8FEF6055-61B7-1847-A871-4A33AD9C2726  ONLINE       0     0     0
       media-FD4416B8-CCA6-0A49-9ED8-C6D0C6488EC2  ONLINE       0     0     0
       media-C948E9F3-37B8-3D49-85D7-AFD9F71BF325  ONLINE       0     0     0
       media-BF79C110-4AED-A243-A5AC-A481827BF75A  ONLINE       0     0     0
       media-CF80C451-6C58-CA4D-BC4E-FE0703E3115E  ONLINE       0     0     0
     raidz2-1                                      DEGRADED     0     0     0
       media-F3A37669-7818-2543-9602-9231F1B41593  ONLINE       0     0     0
       media-37275731-F50A-0041-BE47-763ACD314FE3  FAULTED      3    21     0  too many errors
       media-E453976D-F3CD-3F42-B216-DC91363B4676  FAULTED      0     8     0  too many errors
       media-73F5F336-AB9E-0C47-8627-A2D898C6A4CC  ONLINE       0     0     0
       media-10060C2E-13C4-5348-9D04-DD238CB1134D  ONLINE       0     0     0
       media-0C568C7E-AAF8-7D4A-9DD3-EB245FE2A5AE  ONLINE       0     0     0

errors: No known data errors

Re: Massive failures during file copies

Postby Brendon » Tue Sep 20, 2016 1:14 pm

I suspect that those "device removed" messages are key - we print them in response to notifications from the OS that the HDD has disconnected. If your HDDs disconnect, we obviously can't write to them.

I guess the question on your mind is: does O3X have hardware-specific device drivers in it? The answer is no; the OS itself provides the low-level I/O.

You need to ensure that your hardware setup/configuration is valid.

Brendon
 
Posts: 286
Joined: Thu Mar 06, 2014 12:51 pm

Re: Massive failures during file copies

Postby Gerk » Tue Sep 20, 2016 2:48 pm

As I stated in the previous messages, the exact same hardware has been running on the same VM setup for days, including a burn-in with 18 TB of writes with zero issues (when assembling the drives with the hardware RAID card). The issue only appeared when switching to individual disks and assembling them with ZFS.

The reason I tried it this way is that I had the exact same issues on my real server hardware when first trying to go the ZFS route (the drives have been attached to it for 2 years with the same card and drive chassis).

Again, no issues until I switched to using ZFS, so something is going on there. There are no other messages in any of the logs about drives dropping (assuming ZFS is capturing that).

Re: Massive failures during file copies

Postby Gerk » Tue Sep 20, 2016 3:04 pm

So, just to make things even more confusing, I replicated this issue (it happens every time). As soon as things locked up and I started seeing console messages about drives not being attached, I ran diskutil list, which showed all drives still physically attached to the OS, so there's something else going on here.

Code:
$ diskutil list
/dev/disk0 (internal, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *42.9 GB    disk0
   1:                        EFI EFI                     209.7 MB   disk0s1
   2:                  Apple_HFS El Capitan              42.1 GB    disk0s2
   3:                 Apple_Boot Recovery HD             650.0 MB   disk0s3
/dev/disk1 (external, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *2.0 TB     disk1
   1:                        ZFS                         2.0 TB     disk1s1
   2: 6A945A3B-1DD2-11B2-99A6-080020736631               8.4 MB     disk1s9
/dev/disk2 (external, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *2.0 TB     disk2
   1:                        ZFS                         2.0 TB     disk2s1
   2: 6A945A3B-1DD2-11B2-99A6-080020736631               8.4 MB     disk2s9
/dev/disk3 (external, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *2.0 TB     disk3
   1:                        ZFS                         2.0 TB     disk3s1
   2: 6A945A3B-1DD2-11B2-99A6-080020736631               8.4 MB     disk3s9
/dev/disk4 (external, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *2.0 TB     disk4
   1:                        ZFS                         2.0 TB     disk4s1
   2: 6A945A3B-1DD2-11B2-99A6-080020736631               8.4 MB     disk4s9
/dev/disk5 (external, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *2.0 TB     disk5
   1:                        ZFS                         2.0 TB     disk5s1
   2: 6A945A3B-1DD2-11B2-99A6-080020736631               8.4 MB     disk5s9
/dev/disk6 (external, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *2.0 TB     disk6
   1:                        ZFS                         2.0 TB     disk6s1
   2: 6A945A3B-1DD2-11B2-99A6-080020736631               8.4 MB     disk6s9
/dev/disk7 (external, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *2.0 TB     disk7
   1:                        ZFS                         2.0 TB     disk7s1
   2: 6A945A3B-1DD2-11B2-99A6-080020736631               8.4 MB     disk7s9
/dev/disk8 (external, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *2.0 TB     disk8
   1:                        ZFS                         2.0 TB     disk8s1
   2: 6A945A3B-1DD2-11B2-99A6-080020736631               8.4 MB     disk8s9
/dev/disk9 (external, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *2.0 TB     disk9
   1:                        ZFS                         2.0 TB     disk9s1
   2: 6A945A3B-1DD2-11B2-99A6-080020736631               8.4 MB     disk9s9
/dev/disk10 (external, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *2.0 TB     disk10
   1:                        ZFS                         2.0 TB     disk10s1
   2: 6A945A3B-1DD2-11B2-99A6-080020736631               8.4 MB     disk10s9
/dev/disk11 (external, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *2.0 TB     disk11
   1:                        ZFS                         2.0 TB     disk11s1
   2: 6A945A3B-1DD2-11B2-99A6-080020736631               8.4 MB     disk11s9
/dev/disk12 (external, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *2.0 TB     disk12
   1:                        ZFS                         2.0 TB     disk12s1
   2: 6A945A3B-1DD2-11B2-99A6-080020736631               8.4 MB     disk12s9
/dev/disk13 (external, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *2.0 TB     disk13
   1:                        ZFS                         2.0 TB     disk13s1
   2: 6A945A3B-1DD2-11B2-99A6-080020736631               8.4 MB     disk13s9
/dev/disk14 (external, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *2.0 TB     disk14
   1:                        ZFS                         2.0 TB     disk14s1
   2: 6A945A3B-1DD2-11B2-99A6-080020736631               8.4 MB     disk14s9
/dev/disk15 (external, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *2.0 TB     disk15
   1:                        ZFS                         2.0 TB     disk15s1
   2: 6A945A3B-1DD2-11B2-99A6-080020736631               8.4 MB     disk15s9
/dev/disk16 (external, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *2.0 TB     disk16
   1:                        ZFS                         2.0 TB     disk16s1
   2: 6A945A3B-1DD2-11B2-99A6-080020736631               8.4 MB     disk16s9
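As a cross-check on the diskutil list output above, a hypothetical probe like this reads a little data from each raw device; a disk that has dropped off the bus tends to error or hang here even while diskutil still lists it. The disk1..disk16 names are assumed from this thread:

```shell
#!/bin/bash
# Hypothetical responsiveness probe; disk1..disk16 are assumptions taken
# from this thread. Reads 1 MiB from the start of each raw device.

check_dev() {
    # Report whether a short read from the given device/file succeeds.
    if dd if="$1" of=/dev/null bs=1024k count=1 2>/dev/null; then
        echo "$1: readable"
    else
        echo "$1: READ FAILED"
    fi
}

for n in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16; do
    if [ -e "/dev/rdisk$n" ]; then
        check_dev "/dev/rdisk$n"
    fi
done
```

A device that is "listed" but prints READ FAILED (or stalls for the bus timeout) would match the "device/channel is not attached" kernel messages above.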

Re: Massive failures during file copies

Postby Gerk » Tue Sep 20, 2016 3:41 pm

And this last time I did manage to see at least one disk actually become detached from the OS when going the ZFS route.

I'm going to try this again. The closest test I can get to what is going on here is to stripe all 16 drives with Apple's software RAID, and then try it again with all the drives added to a single ZFS pool.

And when doing this with the AppleRAID, not a single hiccup. Here's the script I cobbled together to make this hRAID. I'll do the same to make a similar zRAID and then test it with the same approach.

Code:
#!/bin/bash

# Whole disks to wipe, and the HFS+ slices (s2) each one ends up with
# after GPT partitioning (s1 is the EFI partition).
DLIST="disk1 disk2 disk3 disk4 disk5 disk6 disk7 disk8 disk9 disk10 disk11 disk12 disk13 disk14 disk15 disk16"
ALIST="disk1s2 disk2s2 disk3s2 disk4s2 disk5s2 disk6s2 disk7s2 disk8s2 disk9s2 disk10s2 disk11s2 disk12s2 disk13s2 disk14s2 disk15s2 disk16s2"

# Repartition each disk as GPT with a single HFS+ volume named after it.
for d in $DLIST
do
   diskutil partitionDisk "/dev/$d" GPTFormat HFS+ "$d" 100%
done

echo "Creating striped array"

# Stripe all 16 HFS+ slices into one AppleRAID volume called hRAID.
diskutil appleRAID create stripe hRAID HFS+ $ALIST


I then took a temp file (5 GB of binary data that I stole from the Blackmagic Disk Speed Test app) and copied it to hRAID. Worked well; in fact it was incredibly fast (almost 1500 MB/sec read and write when I tested with that app). Real-world it was nice and fast. Once that file was copied, I did Select All and Duplicate, and repeated that until I had close to 100 GB of data on the device. No problems, not so much as a hiccup.
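For the record, the Finder Select All/Duplicate loop can also be scripted so the test is repeatable. This is a hypothetical equivalent; the seed path, destination, and copy count are assumptions (20 copies of a 5 GB seed is roughly the 100 GB the Finder loop reached):

```shell
#!/bin/bash
# Hypothetical scripted version of the Finder duplicate stress test.
# SEED, DEST, and COPIES defaults are assumptions, not from the thread.
SEED="${SEED:-/Volumes/hRAID/seedfile}"
DEST="${DEST:-/Volumes/hRAID}"
COPIES="${COPIES:-20}"

duplicate_n() {
    # Make $3 copies of file $1 into directory $2, named copy-1..copy-N.
    local src="$1" dir="$2" n="$3" i=1
    while [ "$i" -le "$n" ]; do
        cp "$src" "$dir/copy-$i" || { echo "copy $i failed"; return 1; }
        i=$((i+1))
    done
    echo "made $n copies"
}

# Only run if the seed file actually exists.
if [ -f "$SEED" ]; then
    duplicate_n "$SEED" "$DEST" "$COPIES"
fi
```

Running the same script against the HFS+ stripe and the ZFS pool makes the two tests directly comparable.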

Next up, I'll take pretty much the exact same approach with a ZFS pool: same drives, same hardware, same OS.

Re: Massive failures during file copies

Postby Gerk » Tue Sep 20, 2016 3:52 pm

So I have now destroyed the appleRAID and built a quick and dirty array like this:

Code:
sudo zpool create zRAID disk1 disk2 disk3 disk4 disk5 disk6 disk7 disk8 disk9 disk10 disk11 disk12 disk13 disk14 disk15 disk16


And within 10 seconds of running the Blackmagic Disk Speed Test app, it had already dropped a whole bunch of drives.

Code:
$ zpool status
  pool: zRAID
 state: UNAVAIL
status: One or more devices are faulted in response to persistent errors.  There are insufficient replicas for the pool to
   continue functioning.
action: Destroy and re-create the pool from a backup source.  Manually marking the device
   repaired using 'zpool clear' may allow some data to be recovered.
  scan: none requested
config:

   NAME        STATE     READ WRITE CKSUM
   zRAID       UNAVAIL      0     0     0  insufficient replicas
     disk1     FAULTED      0    17     0  too many errors
     disk2     ONLINE       0     0     0
     disk3     FAULTED      0    16     0  too many errors
     disk4     FAULTED      0    10     0  too many errors
     disk5     ONLINE       0     0     0
     disk6     FAULTED      0    21     0  too many errors
     disk7     FAULTED      0    17     0  too many errors
     disk8     FAULTED      0    12     0  too many errors
     disk9     ONLINE       0     0     0
     disk10    FAULTED      0     4     0  too many errors
     disk11    ONLINE       0     0     0
     disk12    FAULTED      0    12     0  too many errors
     disk13    FAULTED      0     5     0  too many errors
     disk14    FAULTED      0     5     0  too many errors
     disk15    FAULTED      0     8     0  too many errors
     disk16    FAULTED      0    11     0  too many errors

errors: 123 data errors, use '-v' for a list


I'm not sure how else I could demonstrate this more clearly, but there is definitely something going on at the ZFS level here ... no?

Re: Massive failures during file copies

Postby Brendon » Tue Sep 20, 2016 8:19 pm

Further testing and consultation in the IRC channel revealed that the OP experienced an apparent hardware failure on HFS+ as well. Seems this is not a ZFS issue per se.

- Brendon

