disk image approaches to JHFS+ on ZFS

disk image approaches to JHFS+ on ZFS

Post by grahamperrin » Mon Nov 05, 2012 11:28 am

Spun off from viewtopic.php?p=2550#p2550 and echoing part of viewtopic.php?p=217#p217

Disk images, zfs scrub and dealing with errors

If scrub reveals an error in the part of the dataset that stores a sparse disk image:

  • it may be difficult to identify the affected file(s) within the image.

If scrub reveals an error in the part of the dataset that stores a sparse bundle disk image:

  • we might identify an affected band, but it may be difficult to identify the affected file(s) within that band
  • if something other than a band is affected, difficulties may be greater.

Example

Echoed from viewtopic.php?p=1256#p1256

Code:
sh-3.2$ sudo zpool status -xv -T d
Sat 22 Sep 07:56:58 2012
  pool: blocky-OS
 state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
 scan: scrub repaired 1.62Mi in 5h22m with 50 errors on Sat Sep 22 07:06:38 2012
config:

   NAME                                         STATE     READ WRITE CKSUM
   blocky-OS                                    ONLINE       0     0    50
     GPTE_31EC193D-5063-4C3C-A3F4-B09A8CBB3C6D  ONLINE       0     0   217  at disk5s2
     GPTE_EF222F60-36F0-4B9C-86C1-0FB57C2000BE  ONLINE       0     0     0  at disk8s1

errors: Permanent errors have been detected in the following files:

        blocky-OS:/macbookpro08-centrim.sparsebundle/bands/6cce
        blocky-OS:/macbookpro08-centrim.sparsebundle/bands/6ccf
sh-3.2$ clear


… and, zooming in on the potential for problems:

Code:
        blocky-OS:/macbookpro08-centrim.sparsebundle/bands/6cce
        blocky-OS:/macbookpro08-centrim.sparsebundle/bands/6ccf


If you encounter that type of situation, with bands of a sparse bundle disk image that provides a JHFS+ volume, how will you deal with the likelihood of corresponding errors on the JHFS+ volume?

As fsck_hfs may not find those errors, you could:

  1. abandon the current disk image (abandoning all current files on the JHFS+ volume), or
  2. mount a snapshot to restore an older version of the disk image (sketched below).
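
For the second option, a minimal sketch, assuming the image lives on a dataset named blocky-OS/images with a snapshot that predates the errors (the dataset, snapshot and mount point names are illustrative):

Code:
# List snapshots of the dataset that holds the disk image.
zfs list -t snapshot -r blocky-OS/images

# Either copy the image out of the read-only snapshot directory
# (the .zfs directory may need: zfs set snapdir=visible blocky-OS/images) …
cp -R /Volumes/blocky-OS/images/.zfs/snapshot/2012-09-21/macbookpro08-centrim.sparsebundle \
      /Volumes/blocky-OS/images/

# … or roll the whole dataset back to that snapshot
# (add -r if more recent snapshots exist; they will be destroyed).
sudo zfs rollback blocky-OS/images@2012-09-21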

To minimise the possibility of error

Consider storing your disk images (nothing else) on a child file system, and set copies=2 or copies=3 for that file system.
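
A minimal sketch of that arrangement (the pool and dataset names are illustrative):

Code:
# A child file system dedicated to disk images.
sudo zfs create blocky-OS/images

# Store two copies of every block written to this dataset.
# Note: copies affects only data written after the property is set.
sudo zfs set copies=2 blocky-OS/images
sudo zfs get copies blocky-OS/images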

afterthoughts

Post by grahamperrin » Tue Nov 06, 2012 9:09 pm

grahamperrin wrote: To minimise the possibility of error

Consider storing your disk images (nothing else) on a child file system, and set copies=2 or copies=3 for that file system.


Additionally: an appropriate quota might prevent free space issues associated with Time Machine.
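
For example (the dataset name and size are illustrative):

Code:
# Cap the space that the Time Machine sparse bundle can consume,
# so that its growth cannot exhaust free space in the pool.
sudo zfs set quota=500G blocky-OS/images

# Or use refquota to exclude snapshot usage from the limit.
sudo zfs set refquota=500G blocky-OS/images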

Consistency

Post by grahamperrin » Thu Dec 20, 2012 10:00 pm

If physical connection to the ZFS storage is lost before the disk image is detached, then a shutdown or restart of the operating system will require force. (It will be impossible to unmount the JHFS+ file system.)
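
A sketch of the safe ordering (the device node is illustrative): detach the image before the ZFS storage goes away.

Code:
# Unmount the JHFS+ volume and detach the disk image first …
hdiutil detach /dev/disk9

# … then export the pool (or otherwise disconnect the storage).
sudo zpool export blocky-OS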

Without intending to test robustness, I often experienced disruptions (such as kernel panics) but never noticed a significant problem with the JHFS+ file system. I assume that robustness at the ZFS level allows journalling to be effective at the JHFS+ level.

However, I estimate that very few of the recent disruptions coincided with significant writes to JHFS+.

Rewind a few months. Before using ZFS I routinely used a sparse bundle disk image to store .vdi virtual disk images for VirtualBox. There was probably a period of transition when that .sparsebundle, which I moved to ZFS, contained those .vdi files … and it's possible that at least one panic occurred whilst VirtualBox wrote to a .vdi … but I never paid great attention to the integrity of .vdi files in those situations.

Side note: whilst I still use that JHFS+ .sparsebundle on ZFS, I now prefer to store the .vdi files on ZFS (not on JHFS+).

A more thorough test of robustness might pay attention to the integrity of files on JHFS+ on ZFS on Core Storage following, say, either of the events below (a verification sketch follows the list):

  • untimely physical disconnection of a disk of the Core Storage pool; or
  • a simple kernel panic.
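
One way to make such a test measurable, as a sketch (the JHFS+ mount point is illustrative): record a checksum manifest before the disruption and compare afterwards.

Code:
# Before the disruption: record checksums of every file on the JHFS+ volume.
find /Volumes/MacBookPro08 -type f -exec shasum -a 256 {} + | sort -k 2 > ~/before.sha256

# After the event and a remount: recompute and compare.
find /Volumes/MacBookPro08 -type f -exec shasum -a 256 {} + | sort -k 2 > ~/after.sha256
diff ~/before.sha256 ~/after.sha256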

Link

Post by grahamperrin » Sun Mar 03, 2013 3:13 pm


Re: disk image approaches to JHFS+ on ZFS

Post by raattgift » Sun Mar 03, 2013 5:32 pm

Have you read the hdiutil(1) man page? It is long and detailed.

UDSB DMGs are bundle-backed sparse images; ranges of LBAs are stored in individual band files, and the mapping between the start of a band file's range and the LBA within the virtual disk image is usually just a simple multiplication.
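
To illustrate that multiplication with one of the bands from the scrub output earlier in the thread (a sketch; the 8 MiB band size is an assumption, and the authoritative value is the band-size key in the bundle's Info.plist):

Code:
BAND_SIZE=8388608                          # 8 MiB, assumed; read band-size from Info.plist
echo $(( 0x6cce * BAND_SIZE ))             # first byte offset covered by band 6cce: 233656287232
echo $(( (0x6cce + 1) * BAND_SIZE - 1 ))   # last byte offset covered by band 6cce: 233664675839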

DMGs in general are pretty agnostic about the data within the hosted filesystem; you need to use the latter's metadata to determine what occupies any given range of filesystem blocks, and the hosted partition data to determine where those FSBs are in LBA terms.

Core Storage is not as well documented; most of what exists publicly is in the diskutil man page and in things like the Apple-open-sourced fs_usage.c file. CS does checksum its metadata (this can be seen with DTrace) but does not make many guarantees for the data hosted within LVs; additionally, violent interference with writes to CS metadata can lead to LVs being made pretty much permanently unavailable (certainly fsck_cs is limited in this regard). JHFS+ within a CS LV is only marginally safer than JHFS+ within a GPT partition alone, and only in the single-PV case. CPDKs (composite multi-PV "disks") are subject to block migrations that introduce additional data-loss risk, because the migrated data is not checked to ensure that it will read back identically once migrated. The risk is roughly similar to that introduced by good third-party JHFS+ defragmentation utilities.
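
The LVG / PV / LV structure being described can be inspected with diskutil:

Code:
# Show every Core Storage logical volume group, its physical volumes,
# and the logical volumes it hosts.
diskutil coreStorage list

# Details for a particular device or slice, such as disk4s2 from the pool below.
diskutil info disk4s2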

The copies=n (n > 1) property for ZFS datasets is insufficient to protect against the loss of a complete LV. ZFS data availability comes from having multiple copies of the data scattered across wholly independent physical devices (and once you do that, the copies property becomes much less compelling). If you lose LVG metadata during a violent crash (i.e., it checksums bad at reboot), a single-disk pool will likely never be importable without major offline surgery, no matter what value is used for the pool's datasets' copies property. It would be much faster and easier to restore from a backup into a new pool.

However, you can certainly do things like this:

Code:
  pool: xpool
 state: ONLINE
  scan: scrub repaired 0 in 0h33m with 0 errors on Sat Mar 2 10:32:54 2013
config:

   NAME                                           STATE     READ WRITE CKSUM
   xpool                                          ONLINE       0     0     0
     mirror-0                                     ONLINE       0     0     0
       GPTE_200D94D0-A77C-48A8-BB40-192BD2474736  ONLINE       0     0     0  at disk4s2
       GPTE_3EC99D9C-54F7-426B-8B36-8A9B6526F4DA  ONLINE       0     0     0  at disk23s2
   cache
     GPTE_315646A9-0718-44A6-8287-3CDB3B036103    ONLINE       0     0     0  at disk27s2

Here, disk4s2 is within a CPDK LVG whose physical volumes are a Thunderbolt-attached SATA SSD and a big FW800-attached drive. It shares the LVG with the startup volume, on which is installed a variant of 10.8.2 that supports block migration ("fusion disk"). Hot ZFS and JHFS+ blocks from the two LVs stay on the SSD just fine. Violent crashes have damaged JHFS+ data and provoked resilvering fixes from the other half of the mirror.

The second half of the mirror lives in a different CS LVG, entirely on a different Thunderbolt-attached SATA SSD. There is only that PV in the LVG. There are other LVs though, which are used for L2ARC cache and separate ZIL (log) devices for ZFS pools that are spread across multiple FW800- or USB3-attached traditional disks.
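
For reference, a sketch of how such LVs are attached to a pool as cache and log devices (the pool and device names are illustrative):

Code:
# Add a Core Storage LV (exposed as a disk slice) as an L2ARC cache device …
sudo zpool add tank cache disk27s2

# … and another as a separate ZIL (log) device.
sudo zpool add tank log disk28s2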

Disk27 is a USB3-attached thumb drive which helped when the ZFS blocks in the first half of the mirror were mainly too cold to be on the SSD; it now impedes performance by its very existence (eating arc_buf_hdr_t memory, and returning L2ARC data more slowly than it could be retrieved from either half of the mirror). Its removal is imminent.
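
Cache (and log) devices can be removed from a live pool, so that removal is a one-liner; a sketch using the device named in the status output above:

Code:
sudo zpool remove xpool GPTE_315646A9-0718-44A6-8287-3CDB3B036103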

In the case of this pool, violent disconnections (including crashes and power losses) have been repaired just fine by the resilvering mechanism. copies is 1 for all of the datasets.

hdiutil does a file system consistency check on hosted (J)HFS(+) filesystems as necessary. I do Time Machine backups of the startup volume mentioned above into a dataset in another pool that is also a mirror (but only on FW800 disks). tmutil -- another tool with an excellent man page -- creates a UDSB in the target dataset. Violent disconnections have been repaired firstly by ZFS resilvering, and secondly by the fsck_hfs that hdiutil does at attach time, except for once. That once led to a terribly damaged catalog file in the hosted JHFS+ file system within the UDSB; rolling back to a previous ZFS snapshot of the dataset the UDSB resides in recovered from that quickly and with minimal data loss.

Had the dataset not been snapshotted, the whole set of time machine data likely would have been a write-off, being so complicated and time-consuming to recover that it simply would not have been worthwhile.

Additionally, forcing a TXG rollback for an unmirrored pool that refuses to import is complicated enough and risky enough that the data redundancy of dataset copies > 1 would not significantly speed recovery, or likely help with the most obvious case of data unavailability, namely inconsistent JHFS+ metadata (or file data!) arising from interrupted writes into the mounted-by-tmutil DMG-hosted volume.
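
For context, the supported form of that rollback is recovery mode at import time, if the platform's zpool supports it; a sketch with an illustrative pool name (read the zpool man page before trying either command):

Code:
# Dry run: report what a recovery-mode import would discard, without importing.
sudo zpool import -F -n tank

# Recovery-mode import: rewinds to an earlier TXG (discarding the last few
# seconds of writes) if that is what it takes to make the pool importable.
sudo zpool import -F tank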

tl;dr: if your data isn't immediately available after a crash, you are probably better off recovering it from actual backups. ZFS gives lots of tools to increase immediate data availability (distributing data across multiple physical devices per vdev; dataset snapshots) that help mitigate the various things that commonly lead to data hosted within a DMG becoming unavailable. Use them. It's much better than trying to figure out how to dig through various abstraction layers when you have a DMG that refuses to checksum correctly after a crash, or that refuses to boot, or that returns bad data to applications.

Re: disk image approaches to JHFS+ on ZFS

Post by grahamperrin » Mon Mar 04, 2013 2:22 pm

raattgift wrote: Have you read the hdiutil(1) man page? …


Yes, thanks. Unless I'm missing something, that manual doesn't bring me closer to answering the question in Ask Different.

Re: Link

Post by ilovezfs » Thu Apr 18, 2013 10:51 am



Graham, I made an attempt at answering your question at Ask Different:

http://apple.stackexchange.com/question ... rse-bundle

Reposting my answer here:

Assuming you can attach the sparsebundle, you should be able to do this using fileXray, which is $79 for a personal use license and found at http://filexray.com

fileXray is able to "reverse map volume storage" meaning it can "determine which file a given block or byte offset on a volume belongs to." The relevant option is --who_owns_byte, explained on page 172 of the documentation, which can be found at http://filexray.com/fileXray.pdf

Now according to page 54 of the documentation, "It is important to note that a device dump must be 'raw'—that is, it must not require any additional transformations such as decompression or decryption. In other words, for a disk image file to be used directly by fileXray, the image must not be compressed, encrypted, or sparse. fileXray will reject such an image." However, it goes on to say, "If you do have such an image that is compressed, encrypted, or sparse, either convert it using the Mac OS X hdiutil command-line program, or simply attach it (optionally without mounting it) using hdiutil and use fileXray on the resultant block device instead of the image file."

So once you have the sparsebundle attached or mounted, the question is what byte offset to provide to the --who_owns_byte option.

Apple provides "routines to manipulate a sparse bundle" at http://www.opensource.apple.com/source/ ... seBundle.c

Based on that code, we can see in the routine doSparseRead that the bandName of the first band for a given offset is the bandNum as a hexadecimal number. In particular, asprintf(&bandName, "%s/bands/%x", ctx->pathname, bandNum).

The bandNum is the offset / blockSize, since off_t bandNum = (offset + nread) / blockSize and nread starts at 0, and the division truncates; so the offsets covered by a band run from bandNum * blockSize up to, but not including, bandNum * blockSize + blockSize. Note that the blockSize is the bandSize, since off_t blockSize = ctx->bandSize;

Looking at doSparseWrite seems to give the same answer, since off_t bandNum = (offset + written) / blockSize; with written initialized at 0 and asprintf(&bandName, "%s/bands/%x", ctx->pathname, bandNum);
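
Putting that together for the two damaged bands from the scrub output earlier in the thread, a sketch (the image path and the 8 MiB band size are assumptions; the exact fileXray invocation is left to its documentation):

Code:
# Attach the sparse bundle without mounting the hosted JHFS+ volume;
# hdiutil prints the /dev/disk entries to point fileXray at.
hdiutil attach -nomount /Volumes/blocky-OS/macbookpro08-centrim.sparsebundle

# Byte offsets at the start of each damaged band: values in these ranges
# are what --who_owns_byte expects (see page 172 of the documentation).
BAND_SIZE=8388608            # assumed; read band-size from the bundle's Info.plist
for band in 0x6cce 0x6ccf; do
  echo "band $band starts at byte offset $(( band * BAND_SIZE ))"
done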

It would be great if someone with access to fileXray could try this.

Thanks

Post by grahamperrin » Fri Apr 19, 2013 1:56 pm


Re: Thanks

Post by ilovezfs » Fri Apr 19, 2013 10:39 pm


You're welcome. I hope it helps when you do have a chance to pursue this further. I checked the diff between SparseBundle.c in hfs-191 and hfs-195; the only difference is a new sync_volume routine, which isn't related.

