0 bytes free and zfs destroy hangs

Postby DanielSmedegaardBuus » Wed Apr 19, 2023 3:08 am

Hi :)

I have a large zpool that had something like 50 GB free (of 5-something TB total).

I was creating md5 hashes of all files on it, and thought that while it was crunching through everything, I'd pipe the checksums into a little shell loop that writes each hash to its file as an extended attribute. I made a snapshot before doing any of this.

The thing I hadn't considered was that adding extended attributes takes up space. As in, all the remaining space. I have 0 bytes free, and my long-running commands both gave up.

Forgetting that I had made a snapshot, I moved some files off the pool. Then, remembering the snapshot, I found the oldest snapshot, which used about 15 GB, and issued a zfs destroy command on it.

Several hours later, this command was still hanging and the disks were sleeping. zpool iostat revealed zero activity.

At this point, I just wanna roll back to the most recent snapshot. I've mounted it and I'm comparing its files list to that of the current state of the pool, in case I forgot that I added or changed something since the snapshot. The xattrs are fine to drop - I can just recreate them from the generated hashes.
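For anyone wanting to script that comparison, here is a minimal sketch in Python. It assumes the snapshot and the live filesystem are both mounted and readable; the function names and both root paths are placeholders, not anything from the pool in question. It only compares relative paths and file sizes, so it will not notice a file whose content changed without changing size.

```python
import os

def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found

def compare_trees(snapshot_root, live_root):
    """Return (only_in_snapshot, only_in_live, size_differs) as sorted lists."""
    snap = list_files(snapshot_root)
    live = list_files(live_root)
    # lstat does not follow symlinks, so dangling links do not raise here.
    size_differs = sorted(
        rel for rel in snap & live
        if os.lstat(os.path.join(snapshot_root, rel)).st_size
        != os.lstat(os.path.join(live_root, rel)).st_size
    )
    return sorted(snap - live), sorted(live - snap), size_differs
```

Anything listed in the second result (present only in the live tree) is what a rollback would discard.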

But I'm really afraid to do anything at this stage, with a zfs destroy command hanging on a previous snapshot. Accidentally destroying the pool would be really bad, since this is actually me recovering another (raidz1) pool, which has already lost one drive and has errors on some of its files.

What is safe to do at this point?

Thank you.

EDIT: I'm now wondering if removing all xattrs with a find -exec command might actually free up the GBs spent on adding them?

Re: 0 bytes free and zfs destroy hangs

Postby DanielSmedegaardBuus » Wed Apr 19, 2023 4:46 am

I've just ordered an 18TB Toshiba enterprise drive to hold a backup of the zpool before doing anything (fingers crossed there'll be no power outages), but I'd still like any feedback on this :)

Re: 0 bytes free and zfs destroy hangs

Postby lundman » Wed Apr 19, 2023 2:10 pm

Making a backup is a good idea; do that first.

Most likely, when it hit 0 bytes free, it actually stopped syncing, so the snapshot destroy probably didn't do anything.

You can always run spindump to see what it is doing (search for zfs in the output).

There is a "rollback" bug we discovered this week, so if that is what you had in mind, you might want to "import -N" and rollback to avoid it. Or wait for the next build with the fixes in it.

Re: 0 bytes free and zfs destroy hangs

Postby DanielSmedegaardBuus » Wed Apr 19, 2023 10:02 pm

All right, thank you for the tips, Lundman :)

I grepped the spindump output for zfs and got a bunch of wait thingies, like,

Code:
  Thread 0x112c97a          519 samples (1-519)       priority 80 (base 80)
 *519  call_continuation + 23 (kernel + 116055) [0xffffff800021c557]
   *519  taskq_thread + 313 (zfs + 73785) [0xffffff7f84bc7039]
     *519  spl_cv_wait + 57 (zfs + 4489) [0xffffff7f84bb6189]
       *519  msleep + 98 (kernel + 5427922) [0xffffff800072d2d2]
         *519  ??? (kernel + 5426694) [0xffffff800072ce06]
           *519  lck_mtx_sleep + 126 (kernel + 500174) [0xffffff800027a1ce]
             *519  thread_block_reason + 175 (kernel + 546063) [0xffffff800028550f]
               *519  ??? (kernel + 550218) [0xffffff800028654a]
                 *519  machine_switch_context + 205 (kernel + 1580685) [0xffffff8000381e8d]


And a whole lot of these xattr-related ones,

Code:
  Thread 0x243b414          519 samples (1-519)       priority 46 (base 37)
  519  <truncated backtrace>
    519  __getattrlist + 10 (libsystem_kernel.dylib + 115254) [0x7fff59129236]
     *519  hndl_unix_scall64 + 22 (kernel + 120390) [0xffffff800021d646]
       *519  unix_syscall64 + 616 (kernel + 6301720) [0xffffff8000802818]
         *519  getattrlist + 112 (kernel + 2605888) [0xffffff800047c340]
           *519  ??? (kernel + 2606142) [0xffffff800047c43e]
             *519  ??? (kernel + 2596364) [0xffffff8000479e0c]
               *519  ??? (kernel + 2585915) [0xffffff800047753b]
                 *519  vn_getxattr + 977 (kernel + 2886385) [0xffffff80004c0af1]
                   *519  zfs_vnop_getxattr + 522 (zfs + 1415578) [0xffffff7f84d0e99a]
                     *519  zpl_xattr_get + 229 (zfs + 1438869) [0xffffff7f84d14495]
                       *519  zpl_xattr_get_dir + 136 (zfs + 1440712) [0xffffff7f84d14bc8]
                         *519  zfs_lookup + 269 (zfs + 1369533) [0xffffff7f84d035bd]
                           *519  zfs_dirlook + 232 (zfs + 1337336) [0xffffff7f84cfb7f8]
                             *519  zfs_dirent_lock + 910 (zfs + 1336702) [0xffffff7f84cfb57e]
                               *519  zap_lookup_norm + 53 (zfs + 1023333) [0xffffff7f84caed65]
                                 *519  dmu_buf_hold + 68 (zfs + 264884) [0xffffff7f84bf5ab4]
                                   *519  dbuf_read + 499 (zfs + 220259) [0xffffff7f84beac63]
                                     *519  spl_cv_wait + 57 (zfs + 4489) [0xffffff7f84bb6189]
                                       *519  msleep + 98 (kernel + 5427922) [0xffffff800072d2d2]
                                         *519  ??? (kernel + 5426694) [0xffffff800072ce06]
                                           *519  lck_mtx_sleep + 126 (kernel + 500174) [0xffffff800027a1ce]
                                             *519  thread_block_reason + 175 (kernel + 546063) [0xffffff800028550f]
                                               *519  ??? (kernel + 550218) [0xffffff800028654a]
                                                 *519  machine_switch_context + 205 (kernel + 1580685) [0xffffff8000381e8d]


And I killed off the xattr loop more than a day ago, so it's somehow still stuck on trying to write xattrs - I'm guessing :)

zfs destroy is also still hanging. Anyway, I'll avoid doing anything on the system and wait for the other drive to arrive tomorrow, then zfs send the latest snapshot to it just in case. That'll leave me with a single-drive backup, a maybe-good RAIDZ-2 backup, and the original RAIDZ-1 with a failed drive. Good to know about the rollback bug :)

Thanks!

Re: 0 bytes free and zfs destroy hangs

Postby lundman » Thu Apr 20, 2023 4:09 am

The top one is just a thread sleeping.

The second one is indeed a hang. The "*519 dbuf_read + 499" frame is interesting; there is probably another thread in there that is holding that lock.
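To make that kind of reading less manual, here is a rough sketch in Python that scans spindump text and reports, per thread, the innermost zfs frame other than spl_cv_wait, i.e. the place the thread is waiting in. The function name and the regular expressions are assumptions based only on the two excerpts quoted above; real spindump output has more line shapes than this handles, and the result does not by itself tell a hung thread from an idle worker.

```python
import re

# Both patterns are guesses fitted to the excerpts in this thread.
THREAD_RE = re.compile(r"^\s*Thread (0x[0-9a-f]+)")
FRAME_RE = re.compile(r"^\s*\*?\d+\s+(.+?)(?: \+ \d+)? \((\S+) \+ \d+\) \[")

def zfs_wait_sites(spindump_text):
    """Map thread id -> innermost zfs frame that is not spl_cv_wait."""
    sites = {}
    thread = None
    for line in spindump_text.splitlines():
        header = THREAD_RE.match(line)
        if header:
            thread = header.group(1)
            continue
        frame = FRAME_RE.match(line)
        if frame and thread is not None:
            symbol, image = frame.groups()
            if image == "zfs" and symbol != "spl_cv_wait":
                # Frames are listed outermost first, so the last one wins.
                sites[thread] = symbol
    return sites
```

On the two stacks quoted earlier, this would report taskq_thread for the first thread (an idle worker) and dbuf_read for the second, which is the frame singled out above.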

I don't think you'll be able to zfs send until you reboot; it's just stuck. But do try.

Re: 0 bytes free and zfs destroy hangs

Postby DanielSmedegaardBuus » Tue Apr 25, 2023 6:47 am

Oh, okay :D

I think it must simply be stuck trying to write an xattr with 0 bytes free? Waiting for free space to appear. My loop was still trying to write xattrs long after the fs was filled up, so I'm guessing it's somehow stuck on something to do with these requests? I have an error log of 334,485 xattr errors like

Code:
xattr: [Errno 28] No space left on device: '/Volumes/deck/<file>'


all made after the drive filled up.
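A loop like that could bail out at the first "No space left on device" instead of retrying hundreds of thousands of times. A minimal sketch in Python, under the assumption that the attribute write is wrapped in a callable: `write_attr` here is a hypothetical stand-in for whatever actually sets the xattr, and `tag_files` is an illustrative name, not the loop that was actually run.

```python
import errno

def tag_files(paths, write_attr):
    """Call write_attr(path) for each path, stopping at the first ENOSPC.

    Returns (done, failed, out_of_space): the paths tagged successfully,
    (path, error) pairs for other OSErrors, and whether ENOSPC ended the run.
    """
    done, failed = [], []
    for path in paths:
        try:
            write_attr(path)
        except OSError as exc:
            if exc.errno == errno.ENOSPC:
                # Errno 28: the pool is full, so further writes are pointless.
                return done, failed, True
            failed.append((path, exc))
        else:
            done.append(path)
    return done, failed, False
```

Whether stopping early would have avoided the hang described here is not established; it only keeps the loop from issuing further writes once the filesystem reports it is full.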

Well, anyway, the pool is still running, and I can read from both its fses, and zfs send, even delete files (though it doesn't seem to free up space ... perhaps because some of those pending xattr requests are immediately eating it up? Though they shouldn't really be pending, as they've returned an error. But what do I know :) )

I'm still rsyncing onto my new drive - zfs send was terribly slow - and although the system isn't very responsive, it's been copying for several days now, and there's about 20-25% left, so it's looking good. I'll try a reboot at the end and do a rollback, or try destroying the older snapshot again, see if I can get that pool back to normal.

Thanks :)

Re: 0 bytes free and zfs destroy hangs

Postby DanielSmedegaardBuus » Tue May 02, 2023 3:48 am

Okay, just wanted to provide an update. I had to restart halfway through as the machine was zfs memory swapping into a death spiral, but luckily it still mounted on reboot, so I could finish the rsync. Of course, it's zfs, so it's extremely slow, but you already know that, being a developer. So it only just finished after a couple of weeks.

I can try some rollback if you'd like for testing? But you already mentioned that there's a known bug with rollback in the latest release, so perhaps that isn't relevant? I'd also like not to tickle the beast, because when ZFS breaks anywhere, everything is broken until reboot, but yeah, I expect you know that too.

I'm currently checksumming the latest copy, so I can compare it to the original, which since it's ZFS makes the UI extremely slow and typing cumbersome, so I probably won't read any replies until it's finished, in like a week or so, but let me know if you need any data. I'll be happy to provide it.
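For reference, a checksum comparison of this kind can be sketched in Python as below. This is only an illustration under assumptions: the function names are made up, both trees are assumed to be mounted and readable, and md5 is used because that is the hash mentioned earlier in the thread. Reading every file on both sides is exactly as slow as the storage allows.

```python
import hashlib
import os

def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files are not loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def md5_manifest(root):
    """Map each file's path relative to root to its md5 hex digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = md5_of_file(full)
    return manifest

def diff_manifests(original, copy):
    """Return (missing_from_copy, extra_in_copy, changed) as sorted lists."""
    missing = sorted(set(original) - set(copy))
    extra = sorted(set(copy) - set(original))
    changed = sorted(
        path for path in set(original) & set(copy)
        if original[path] != copy[path]
    )
    return missing, extra, changed
```

Building the two manifests separately means the original pool only has to be read once, and the comparison itself touches no disks.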

Cheers

