Frequent Kernel Panics - zap->zap_u.zap_fat.zap_phys->zap_ma


Re: Frequent Kernel Panics - zap->zap_u.zap_fat.zap_phys->za

Post by raattgift » Mon Apr 01, 2013 1:36 pm

"ZFS assertion failed: zap->zap_u.zap_fat.zap_phys->zap_magic == 0x2F52AB2ABULL (0x0 == 0x2f52ab2ab)"

That's pretty clear. The ZAP is the ZFS Attribute Processor; it sits on top of the DMU (as does the ZPL, the ZFS POSIX Layer) and manages objects that provide an efficient attribute/value data store. The assertion in question checks that the in-core structure it is looking at really is a fat ZAP object by looking for a magic number. Instead of the magic number it sees zero, so the assertion fails (causing a panic).
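If you want to see what is actually there, zdb can dump object metadata without mounting anything. A sketch, assuming a pool named 'tank' and the damaged dataset 'tank/data' (both placeholders); note that zdb walking a damaged ZAP can trip the very same assertion, though in userland it merely aborts zdb rather than panicking the machine:

    # dump per-object metadata for the dataset; fat ZAP objects report
    # their type, and more -d repetitions dump progressively more detail
    sudo zdb -dddd tank/data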

You can expect fat ZAPs to be in play when opening a dataset -- any dataset -- so this type of cross-platform assertion failure is fairly commonly observed in the presence of on-disk structure corruption, usually after a violent disconnection while the dataset structures were being updated.

Note that the problem may not be a read error in the strict sense -- that is, what was written is being returned faithfully. Scrubs will NEVER detect such a problem, EVER: every block checksums correctly, because the blocks faithfully store bad content. One example case is a single-device pool where the ZIL is slow (slow bus, slow device, lots of traffic); it can also happen with a poor choice of SLOG, where the separate log device is unreliable at pool import time. You could look for ZIL errors in /var/log/system.log for evidence of this. There may also be a transaction group scheduling problem -- a bad assumption somewhere up the ZPL/ZAP stack that critically related sets of DMU objects will be dealt with in the same TXG commit; those bugs may or may not be cross-platform.
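For the log check, something along these lines (the exact message text varies between releases, so treat the patterns as a starting point):

    # look for ZIL / log device complaints around the times of the panics
    grep -iE 'zil|zfs' /var/log/system.log
    # rotated logs are compressed; bzgrep (or zgrep) reads those
    bzgrep -iE 'zil|zfs' /var/log/system.log.*.bz2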

It is highly unlikely that you will recover your damaged dataset; you should restore it from a backup.

However, *if* you want to poke at the dataset (or you have to), then as I noted, you might find that an older snapshot (if one exists) is usable in *read only* mode. (Please note that 1 April is International Backup Day, and you should have known-to-be-restorable backups on hand; zfs send/receive to a pool on a different system is a good approach.)

It would be sensible to set the canmount=off property on the bad dataset, and to create (and mount read-only) clones of older snapshots. You should copy the data out of the dataset (/usr/bin/rsync -avhPEs from/ to/) to a new dataset the moment you have read-only access to it; you should not rely on a rollback or a clone to be free from ZAP problems that may bring down your system again.
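Concretely, a sketch -- 'tank/data', the @earlier snapshot, and the mount paths are placeholders for your own names:

    # keep the bad dataset from mounting at pool import
    sudo zfs set canmount=off tank/data
    # clone an older snapshot, read-only from the moment it is created
    sudo zfs clone -o readonly=on tank/data@earlier tank/rescue
    # copy everything out as soon as the clone is mounted
    /usr/bin/rsync -avhPEs /Volumes/rescue/ /Volumes/fresh-dataset/

Setting readonly=on at clone creation matters: it keeps anything from dirtying the clone before you have copied the data out.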

Finally, note that you can still archive the dataset in question, since zfs send and receive deal with DMU objects without considering their semantics in upper layers (like the ZAP). The catch is that the archive preserves the bad fat ZAP DMU object, which will trigger this same assertion in any pool on any zfs platform the moment it is processed.
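A sketch of that archive path (host and dataset names are placeholders; the -u keeps the received copy unmounted, so the receiving box does not immediately trip the same assertion):

    # snapshots do not require a mount, so this works on the damaged dataset
    sudo zfs snapshot tank/data@rescue
    # stream it to a pool on another system; the stream is not interpreted
    sudo zfs send tank/data@rescue | ssh root@otherbox 'zfs receive -u backup/data'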

Since the moment in question is at or very near the mount, be careful. Panics are crashes and thus are a form of unpredictable violent disconnect, so they pose some risk to data availability, especially to any dataset that was mounted read-write, and to non-ZFS filesystems as well, such as your JHFS+ boot partition.

Re: Frequent Kernel Panics - zap->zap_u.zap_fat.zap_phys->za

Post by raattgift » Mon Apr 01, 2013 1:43 pm

"the dataset corruption occurred when the machine was switched off (holding power button down)."

That is very very very likely.

So likely that performing the scrub and memory testing was akin to the historical advice of "repair your permissions" in the face of any and all Mac OS X errors.

The best way to escape further holes in your uptime is to "zfs destroy -R" the dataset and restore from backups.
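In rough outline, with names as placeholders (and make sure the restore source predates the corruption):

    # destroy the dataset along with all of its snapshots and clones
    sudo zfs destroy -R tank/data
    # restore from a known-good stream taken before the first panic
    sudo zfs receive tank/data < /backups/data-known-good.zfs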

Note that the bad DMU object could be stored in backups, depending on how you made them.

Re: Frequent Kernel Panics - zap->zap_u.zap_fat.zap_phys->za

Post by ChrisS » Tue Apr 02, 2013 3:48 pm

Thanks for the explanation.

It accounts for how the problem was transferred to the backup (made by sending incremental snapshots) -- which is a shame, as it would have been pretty easy to choose a snapshot from before the first problem.

I will try the clone and recover. It is now essential not to destroy the dataset: this pool is the backup disk, and the original disk has since been wiped and initialized. I have the current state of the data but not the snapshot history.
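To pick the right snapshot I will list them oldest-first, something like this (with 'tank/data' standing in for my real dataset name):

    # list snapshots with creation times, sorted oldest first
    zfs list -t snapshot -o name,creation -s creation -r tank/data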

Re: Frequent Kernel Panics - zap->zap_u.zap_fat.zap_phys->za

Post by grahamperrin » Wed Apr 03, 2013 2:00 am

From viewtopic.php?p=4532#p4532 and relating to this topic:

raattgift wrote:…  a fully consistent filesystem that nevertheless contains a bad object; rolling back a few TXGs at import time *MAY* have decommitted the bad ZAP DMU object, but it also likely would have decommitted other (possibly good and important) data at the same time. If the label txg values agree, it is almost never a good idea to roll back to previous txgs at import, especially not without involving an operator.
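For reference, the import-time rewind mentioned there looks like this in upstream ZFS -- I have not verified that ZEVO exposes these flags, so treat it as a sketch:

    # dry run: report the txg the pool would rewind to, changing nothing
    sudo zpool import -F -n tank
    # the real thing: discards the last few txgs, hence the operator caveat
    sudo zpool import -F tank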


From above:

"the dataset corruption occurred when the machine was switched off (holding power button down)."

That is very very very likely. …


Do you mean that corruption of on-disk structure is rare, and that forcing off is the likeliest explanation for the corruption in this (rare) case?

Or do you mean that any forced restart or forced shutdown, or any kernel panic, is generally likely to cause an on-disk structure corruption that might ultimately be this disruptive?

I ask because in my environment (aggressively testing more than ZEVO) I have encountered probably hundreds of panics or forced restarts.

Maybe I'm shielded in some ways by the mysteries of Core Storage, which is beneath the ZFS pool that's my home.

… what was written is being returned faithfully. Scrubs will NEVER detect such a problem …


So if what's written is likely to cause a panic, is there no way (other than a panic) to identify that faithful but troublesome data?

I don't want to hijack this topic with conceptual stuff, so feel free to spin off to another …

Re: Frequent Kernel Panics - zap->zap_u.zap_fat.zap_phys->za

Post by raattgift » Wed Apr 03, 2013 6:37 pm

grahamperrin wrote:Do you mean that corruption of on-disk structure is rare, and that forcing off is the likeliest explanation for the corruption in this (rare) case?


Yes.

grahamperrin wrote:Or do you mean that any forced restart or forced shutdown, or any kernel panic, is generally likely to cause an on-disk structure corruption that might ultimately be this disruptive?


No.

grahamperrin wrote:I ask because in my environment (aggressively testing more than ZEVO) I have encountered probably hundreds of panics or forced restarts.


Sun and other zfs developers have triggered (and encountered accidentally) many thousands of violent disconnections.

The problem is that they have not encountered all possible types of hardware combinations; some handle deliberate power removal differently from others.


grahamperrin wrote:Maybe I'm shielded in some ways by the mysteries of Core Storage, which is beneath the ZFS pool that's my home.


Unlikely.

grahamperrin wrote:So if what's written is likely to cause a panic, is there no way (other than a panic) to identify that faithful but troublesome data?


True for any filesystem on any medium. If higher layer software insists on writing out bad data, no matter how carefully you store that bad data and retrieve it later, it's still bad.

In this case, the DMU object is good, but what was put in the DMU object is bad.

Sending this particular corrupt dataset to a zfs developer -- not necessarily (just) a ZEVO one, since this is cross-platform -- may help squash a write scheduling bug that manifests in the ZFS attribute processing layer.
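A compressed stream is one convenient form for that (names are placeholders, and remember the stream carries your data, not just the metadata):

    # capture the damaged dataset as a portable, self-contained stream
    sudo zfs snapshot tank/data@fordev
    sudo zfs send tank/data@fordev | gzip > ~/corrupt-zap-dataset.zfs.gz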

FWIW there was some public discussion of a case of this particular assert failure in zap_deref_phys on the recently retired zfs-discuss@opensolaris.org mailing list.

http://thr3ads.net/zfs-discuss/2012/04/ ... ort-UPDATE