Trouble with a pool whilst DiskWarrior 4.4 made preparations

This forum is to find answers to problems you may be having with ZEVO Community Edition.

Moderators: jhartley, MSR734, nola

Trouble with a pool whilst DiskWarrior 4.4 made preparations

Post by grahamperrin » Sat Apr 06, 2013 9:54 am

There's an overview topic for this pool:


The trouble here

In as few words as possible:

The most recent scrub repaired 0 with 0 errors on 2013-04-05 at 06:15.

I used fsck_hfs to verify the HFS Plus file system of a .sparsebundle that's stored in a ZFS child file system:

Volumes/tall/com.apple.backupd/macbookpro08-centrim.sparsebundle

No problems found by fsck_hfs, but considering recent incidents involving the pool I decided to experiment with DiskWarrior 4.4.

The first and second runs of DiskWarrior were disrupted by Time Machine routines; no replacement of the directory.

Before or during a third run of DiskWarrior: to avoid disruption, I switched off Time Machine.

Trouble began at 2013-04-06 10:07:56.000:

Code: Select all
2013-04-06 10:07:56.000 kernel[0]: zfs_vdm_completion: media 0xffffff80320df200, device/channel is not attached (6), 0 bytes (of 12288)
2013-04-06 10:07:56.000 kernel[0]: zfs_vdm_completion: media 0xffffff80320df200, device/channel is not attached (6), 0 bytes (of 12288)
2013-04-06 10:07:56.000 kernel[0]: zfs_vdm_completion: media 0xffffff80320df200, device/channel is not attached (6), 0 bytes (of 12288)
2013-04-06 10:07:56.000 kernel[0]: zfs_vdm_completion: media 0xffffff80320df200, device/channel is not attached (6), 0 bytes (of 12288)
2013-04-06 10:07:56.000 kernel[0]: zfs_vdm_completion: media 0xffffff80320df200, device/channel is not attached (6), 0 bytes (of 20480)
2013-04-06 10:07:56.000 kernel[0]: zfs_vdm_completion: media 0xffffff80320df200, device/channel is not attached (6), 0 bytes (of 20480)
2013-04-06 10:07:56.000 kernel[0]: zfs_vdm_completion: media 0xffffff80320df200, device/channel is not attached (6), 0 bytes (of 8192)
2013-04-06 10:07:56.000 kernel[0]: zfs_vdm_completion: media 0xffffff80320df200, device/channel is not attached (6), 0 bytes (of 8192)
2013-04-06 10:07:56.000 kernel[0]: zfs_vdm_completion: media 0xffffff80320df200, device/channel is not attached (6), 0 bytes (of 8192)
2013-04-06 10:07:56.000 kernel[0]: zfs_vdm_completion: media 0xffffff80320df200, device/channel is not attached (6), 0 bytes (of 16384)
2013-04-06 10:07:56.000 kernel[0]: ZFSLabelScheme:willTerminate: this 0xffffff803430f200 provider 0xffffff80320df200 'zfs vdev for 'tall''
2013-04-06 10:07:56.000 kernel[0]: ________________________________________
2013-04-06 10:07:56.000 kernel[0]: ZFS WARNING: 'error from: fs.zfs.probe_failure'
2013-04-06 10:07:56.000 kernel[0]: pool: 'tall'
2013-04-06 10:07:56.000 kernel[0]: vdev_type: 'disk'
2013-04-06 10:07:56.000 kernel[0]: vdev_path: '/dev/dsk/GPTE_99056308-F5E2-4314-852C-4DA04732A2D0'
2013-04-06 10:07:56.000 kernel[0]: parent_type: 'root'
2013-04-06 10:07:56.000 kernel[0]: prev_state: 0
2013-04-06 10:07:56.000 kernel[0]: ________________________________________
2013-04-06 10:07:56.000 kernel[0]: ZFS WARNING: 'error from: fs.zfs.data'
2013-04-06 10:07:56.000 kernel[0]: pool: 'tall'
2013-04-06 10:07:56.000 kernel[0]: ________________________________________
2013-04-06 10:07:56.000 kernel[0]: ZFS WARNING: 'error from: fs.zfs.data'
2013-04-06 10:07:56.000 kernel[0]: pool: 'tall'

… 


GPTE_99056308-F5E2-4314-852C-4DA04732A2D0 links to the larger and newer of the two disks.

DiskWarrior seemed to complete its preparations a few minutes later; within its report is a 10:16 time stamp.

It was impossible to export the pool. A normal restart routine brought the Mac close to a halt (observed in verbose mode), but zpool was a stray command that the operating system could not kill so ultimately, a forced restart was required.

The pool is left with some errors, I can probably rollback to rid myself of most of those.

Critically

I did nothing to disconnect or disturb the disks, so the reported detachment is a puzzle.

Considerations

No external hub for these disks at the time. Both were directly connected to USB 2.0 ports of one hi-speed bus in the MacBookPro5,2, nothing else on that bus.

For at least some of the time, the Mac was extraordinarily busy with at least:

  • concurrent runs of two commands that took probably a few hours to traverse my ZEVO home directory gjp22 (afterthought: those commands were long running but probably not significant contributors to business)
  • DiskWarrior itself (example)

The first debug build of OS X SAT SMART Driver 0.6 was present – but missing 64-bit objects.

The File Anti-Virus feature of Kaspersky Security was enabled, so the comparisons by DiskWarrior took longer.

Time Machine backups within the sparse bundle disk image might include an EICAR test file … which (in combination with protection by Kaspersky Security) might have affected comparisons at the HFS Plus level, but should not affect the state of the pool at the ZFS level.

Additional details

Screenshots and other files at http://www.wuala.com/grahamperrin/publi ... allery&id= may be of interest, but I haven't finished arranging things in and around that folder so please treat that collection as volatile.

A highlight from the detailed report produced by DiskWarrior:

Disk: "Time Machine Backups"
Detected Media Errors in Time Machine Backups


For this, the report without detail seems clearer:

  • Media errors were encountered during rebuilding. Some data may be missing because it could not be read from disk.


I'm not sure about that description by DiskWarrior. Considering the timing of the reported detachment of the device/channel, I suspect that media errors were truly encountered:

  • after the rebuilt directory was prepared (but not committed)
  • whilst comparing with the existing directory.

Comparisons

device/channel is not attached

– appears in three other topics:


In the latter there's suspicion of imperfect hardware.

In my case I reckon that hardware is good, but I'm open to suggestions.

Now

I'm scrubbing the pool …
grahamperrin Offline

User avatar
 
Posts: 1596
Joined: Fri Sep 14, 2012 10:21 pm
Location: Brighton and Hove, United Kingdom

IOUSBMassStorageClass device unresponsive after resets

Post by grahamperrin » Sat Apr 06, 2013 2:31 pm

grahamperrin wrote:Trouble began at 2013-04-06 10:07:56.000: … 


Immediately before the first ZFS-related message for the trouble, there was a USB-related message:

2013-04-06 10:07:56.000 kernel[0]: IOUSBMassStorageClass[0xffffff803430ca00]: The device is still unresponsive after 6 consecutive USB Device Resets; it will be terminated.

A search for part of that message finds a few discussions. Amongst the results:


… I should add to another recent topic:


All USB-related message from kernel, from boot to forced restart:

Code: Select all
2013-04-05 15:05:34.000 kernel[0]: IOBluetoothUSBDFU::probe
2013-04-05 15:05:34.000 kernel[0]: IOBluetoothUSBDFU::probe ProductID - 0x8217 FirmwareVersion - 0x0201
2013-04-05 15:05:34.000 kernel[0]: [BroadcomBluetoothHCIControllerUSBTransport][start] -- completed
2013-04-05 16:07:42.000 kernel[0]: USBF:   1460. 82   AppleUSBEHCI[0xffffff802f712000]::Found a transaction which hasn't moved in 5 seconds on bus 0x24, timing out! (Addr: 5, EP: 0)
2013-04-05 16:07:48.000 kernel[0]: USBF:   1466. 82   AppleUSBEHCI[0xffffff802f712000]::Found a transaction which hasn't moved in 5 seconds on bus 0x24, timing out! (Addr: 5, EP: 0)
2013-04-05 16:07:54.000 kernel[0]: USBF:   1472. 83   AppleUSBEHCI[0xffffff802f712000]::Found a transaction which hasn't moved in 5 seconds on bus 0x24, timing out! (Addr: 5, EP: 0)
2013-04-05 16:08:00.000 kernel[0]: USBF:   1478. 85   AppleUSBEHCI[0xffffff802f712000]::Found a transaction which hasn't moved in 5 seconds on bus 0x24, timing out! (Addr: 5, EP: 0)
2013-04-05 16:08:06.000 kernel[0]: USBF:   1484. 86   AppleUSBEHCI[0xffffff802f712000]::Found a transaction which hasn't moved in 5 seconds on bus 0x24, timing out! (Addr: 5, EP: 0)
2013-04-05 16:08:12.000 kernel[0]: USBF:   1490. 88   AppleUSBEHCI[0xffffff802f712000]::Found a transaction which hasn't moved in 5 seconds on bus 0x24, timing out! (Addr: 5, EP: 0)
2013-04-05 16:08:12.000 kernel[0]: USBF:   1490. 89   IOUSBCompositeDriver[0xffffff8033504a00](DataTraveler 400) GetFullConfigDescriptor(0) returned NULL
2013-04-05 16:08:54.000 kernel[0]: USBMSC Identifier (non-unique): 5B9108000040 0x951 0x1614 0x110
2013-04-05 16:11:30.000 kernel[0]: USBMSC Identifier (non-unique): 322549FBA004 0x152d 0x2329 0x0
2013-04-06 07:28:36.000 kernel[0]: USBF:   56716.367   AppleUSBEHCI[0xffffff802f712000]::Found a transaction past the completion deadline on bus 0x24, timing out! (Addr: 5, EP: 2)
2013-04-06 07:58:05.000 kernel[0]: USBF:   58486. 55   AppleUSBEHCI[0xffffff802f712000]::Found a transaction past the completion deadline on bus 0x24, timing out! (Addr: 5, EP: 2)
2013-04-06 08:57:08.000 kernel[0]: USBF:   62029. 18   AppleUSBEHCI[0xffffff802f712000]::Found a transaction past the completion deadline on bus 0x24, timing out! (Addr: 5, EP: 2)
2013-04-06 09:32:28.000 kernel[0]: USBF:   64149.611   AppleUSBEHCI[0xffffff802f712000]::Found a transaction past the completion deadline on bus 0x24, timing out! (Addr: 5, EP: 2)
2013-04-06 09:46:38.000 kernel[0]: USBF:   64999.542   AppleUSBEHCI[0xffffff802f71f000]::Found a transaction past the completion deadline on bus 0x26, timing out! (Addr: 2, EP: 2)
2013-04-06 09:46:38.000 kernel[0]: USBF:   64999.542   AppleUSBEHCI[0xffffff802f712000]::Found a transaction past the completion deadline on bus 0x24, timing out! (Addr: 5, EP: 1)
2013-04-06 09:46:44.000 kernel[0]: USBF:   65005.546   AppleUSBEHCI[0xffffff802f712000]::Found a transaction which hasn't moved in 5 seconds on bus 0x24, timing out! (Addr: 5, EP: 0)
2013-04-06 09:52:07.000 kernel[0]: USBF:   65328.917   AppleUSBEHCI[0xffffff802f712000]::Found a transaction past the completion deadline on bus 0x24, timing out! (Addr: 5, EP: 1)
2013-04-06 09:52:38.000 kernel[0]: USBF:   65359.931   AppleUSBEHCI[0xffffff802f712000]::Found a transaction past the completion deadline on bus 0x24, timing out! (Addr: 5, EP: 1)
2013-04-06 10:07:16.000 kernel[0]: USBF:   66237.311   AppleUSBEHCI[0xffffff802f712000]::Found a transaction past the completion deadline on bus 0x24, timing out! (Addr: 5, EP: 1)
2013-04-06 10:07:24.000 kernel[0]: USBF:   66245.313   AppleUSBEHCI[0xffffff802f712000]::Found a transaction which hasn't moved in 5 seconds on bus 0x24, timing out! (Addr: 5, EP: 0)
2013-04-06 10:07:56.000 kernel[0]: IOUSBMassStorageClass[0xffffff803430ca00]: The device is still unresponsive after 6 consecutive USB Device Resets; it will be terminated.
grahamperrin Offline

User avatar
 
Posts: 1596
Joined: Fri Sep 14, 2012 10:21 pm
Location: Brighton and Hove, United Kingdom

Re: Trouble with a pool whilst DiskWarrior 4.4 made preparat

Post by raattgift » Sun Apr 07, 2013 8:38 pm

The most likely explanation is that you are flooding your USB bus with traffic. A USB2 host controller must poll each device (usually round-robin), directing it to send whatever data they have to the host and to receive whatever data the host has for the selected client. The problem is that once a transfer has started it must run to completion or time-out, and the time-out dance is fairly long. WIth a full buss that has *any* slow device on it, the fast devices experience queues, which may overrun, like this:

2013-04-05 16:07:42.000 kernel[0]: USBF: 1460. 82 AppleUSBEHCI[0xffffff802f712000]::Found a transaction which hasn't moved in 5 seconds on bus 0x24, timing out! (Addr: 5, EP: 0)

"Friends don't let friends use USB2 for heavy USB mass storage device class traffic". However, if you must, use multiple USB2 busses (i.e. multiple host controllers).

(Note that USB 2 "hi-speed bus" is one of the biggest misnomers in computer peripheral history. It is not and has never been meant for primary disk storage, and miserable failures when trying to do so are commonplace.)

This

"2013-04-06 10:07:56.000 kernel[0]: ZFS WARNING: 'error from: fs.zfs.probe_failure'"

is a reasonable response to host controller timeouts; timeouts lose transactions; probes are transactions. At some point you want to stop retrying and declare the USB MSC device unusable, and hope to be able to then better use the rest of the devices attached to the bus.

It would also have been reasonable to outright FAULT the device at that point, and since that would FAULT the vdev, it would have FAULTed your pool. I would not say your are lucky that didn't happen because:

"The pool is left with some errors"

THAT is the TRUE bug here. ZFS should always err on the side of data integrity, even at the cost of avoidable data unavailable.

Although you deliberately configured a pool with considerable disregard for data integrity (you have reduced scope for on-line correction; just ZFS metadata by default), the same pool configuration totally disregards data availability, and so availability should fail first, before integrity problems set in.

I don't want to dwell on it, but I would not use third party metadata modifying tools on a Time Machine Volume. In 10.8, tmutil and backupd will do whatever (HFS) metadata repairs they find necessary. A simple "tmutil latestbackup" or "tmutil listbackups" is virtually certain to be sufficient. If you must touch that (HFS) metadata at all, use fsck_hfs with a -R option (and for a typical Time Machine backup set a large -c option, which will dramatically shrink the time necessary); use of -n (and/or -l first, but see the -c caveat; you may need the -f option) would be a good way of arming you with a good idea of what fsck_hfs is likely to do. I would not use of Disk Warrior at all, much as Alsoft's tool was a life-and-performance-saver in earlier days of Mac OS X and its predecessors.

(One of the things I like most about ANY zfs for Mac OS X is not having to turn to such a tool, given proper configuration of pools and a reasonable zfs send/recv based backup strategy geared towards fast recovery of data. Additional local not-USB-2 storage is now much cheaper than several hours of time with a repair tool for half-TB-or-bigger datasets.)

(Finally, since you have an Alsoft license, it might be useful for you to bring up the DW errors with them.)
raattgift Offline


 
Posts: 98
Joined: Mon Sep 24, 2012 11:18 pm

Links

Post by grahamperrin » Tue Apr 16, 2013 11:32 pm

Thanks. An earlier topic and a spin-off:

grahamperrin Offline

User avatar
 
Posts: 1596
Joined: Fri Sep 14, 2012 10:21 pm
Location: Brighton and Hove, United Kingdom

Re: Trouble with a pool whilst DiskWarrior 4.4 made preparat

Post by grahamperrin » Fri Apr 26, 2013 2:05 pm

At viewtopic.php?p=4719#p4719

raattgift wrote:… If DW reads a lot of band files while scanning, there will be a lot of writes …


I imagine that during a rebuild, writes are frequent but small, and limited to relatively few bands – those occupied by the rebuilt directory.
grahamperrin Offline

User avatar
 
Posts: 1596
Joined: Fri Sep 14, 2012 10:21 pm
Location: Brighton and Hove, United Kingdom


Return to Troubleshooting

Who is online

Users browsing this forum: bileyqrkq, ilovezfs and 0 guests

cron