Abort trap 6 on zpool status

All your general support questions for OpenZFS on OS X.

Abort trap 6 on zpool status

Postby tangent » Wed Jul 11, 2018 1:28 pm

Recently, I'm getting the following error with 1.7.3 BETA on macOS 10.13.6:

Code: Select all
$ zpool status
  pool: SmallOne
 state: ONLINE
  scan: scrub repaired 0 in 28h3m with 0 errors on Mon Jul  9 04:03:51 2018
Abort trap: 6


Following the error should be a listing of the devices in the pool, then information about the other pool on this system.

If I query the second pool explicitly, I get much the same thing:

Code: Select all
$ zpool status BigOne
  pool: BigOne
 state: ONLINE
  scan: scrub repaired 0 in 39h36m with 0 errors on Mon Jul  9 15:36:07 2018
Abort trap: 6


Both pools show up in a zpool list command:

Code: Select all
$ zpool list
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
SmallOne   3.62T  2.16T  1.47T        -         -    17%    59%  1.00x  ONLINE  -
BigOne     12.7T  10.7T  1.94T        -         -    21%    84%  1.00x  ONLINE  -


I assume this error means something is broken, but "Abort trap: 6" is not enough for me to go on.

I tried running a SMART utility on the system, and it claims all of the physical devices are fine.
tangent
 
Posts: 47
Joined: Tue Nov 11, 2014 6:58 pm

Re: Abort trap 6 on zpool status

Postby lundman » Thu Jul 12, 2018 9:16 pm

Abort 6 is a software call to abort(). Ie, usually a deliberate call by the developer, or the software detecting some unusual event, to cause program to terminate.

For example, if you set environment variable ZFS_ABORT=1 all zfs software will about, JUST before they exit. So you can inspect the core-dump file and confirm no-memory leaks etc. If you have that set, you should certainly try "unset ZFS_ABORT".

But can be many other things. You can simply:

"lldb zpool -- status"

and when it aborts, lldb will kick in and you can run commands like "bt" to see the stack.
User avatar
lundman
 
Posts: 1335
Joined: Thu Mar 06, 2014 2:05 pm
Location: Tokyo, Japan

Re: Abort trap 6 on zpool status

Postby tangent » Fri Jul 13, 2018 6:23 am

Attempting to obtain a backtrace with that method gave me this:

Code: Select all
...following three lines repeated about 300 times...
Traceback (most recent call last):
  File "<string>", line 1, in <module>
NameError: name 'run_one_line' is not defined
Current executable set to 'zpool' (x86_64).
(lldb) settings set -- target.run-args  "status"
(lldb) bt
error: invalid process


I took that spam at the beginning as a hint that I couldn't do this without debugging symbols, and that the 1.7.3 beta release I was running didn't have them, so I decided to build from source according to the wiki and try again. That gave me the same results, except that zpool status now warns me that I haven't upgraded my pool yet, but that's on purpose.

Then I thought maybe the problem is one of the many unreasonable (IMAO) differences between gdb and lldb, so I crafted my own lldb commands and managed to obtain the backtrace:

Code: Select all
$ lldb
(lldb) target create zpool
...noise noise noise...
(lldb) run status
...same zpool status output...
Process 61612 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x00007fff74ab3b66 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fff74ab3b66 <+10>: jae    0x7fff74ab3b70            ; <+20>
    0x7fff74ab3b68 <+12>: movq   %rax, %rdi
    0x7fff74ab3b6b <+15>: jmp    0x7fff74aaaae9            ; cerror_nocancel
    0x7fff74ab3b70 <+20>: retq   
Target 0: (zpool) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x00007fff74ab3b66 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff74c7e080 libsystem_pthread.dylib`pthread_kill + 333
    frame #2: 0x00007fff74a0f1ae libsystem_c.dylib`abort + 127
    frame #3: 0x000000010041f60a libzfs.2.dylib`zpool_vdev_name [inlined] devid_str_decode at devid.h:46 [opt]
    frame #4: 0x000000010041f605 libzfs.2.dylib`zpool_vdev_name [inlined] devid_to_path at libzfs_pool.c:3642 [opt]
    frame #5: 0x000000010041f605 libzfs.2.dylib`zpool_vdev_name(hdl=<unavailable>, zhp=<unavailable>, nv=<unavailable>, name_flags=<unavailable>) at libzfs_pool.c:3811 [opt]
    frame #6: 0x0000000100003169 zpool`max_width(zhp=0x0000000100607070, nv=0x000000010060d040, depth=4, max=10, name_flags=0) at zpool_main.c:1523 [opt]
    frame #7: 0x000000010000329b zpool`max_width(zhp=0x0000000100607070, nv=0x00000001006090e8, depth=2, max=<unavailable>, name_flags=0) at zpool_main.c:1548 [opt]
    frame #8: 0x000000010000329b zpool`max_width(zhp=0x0000000100607070, nv=0x0000000100607430, depth=0, max=<unavailable>, name_flags=0) at zpool_main.c:1548 [opt]
    frame #9: 0x0000000100004688 zpool`status_callback(zhp=0x0000000100607070, data=<unavailable>) at zpool_main.c:6631 [opt]
    frame #10: 0x00000001000014a6 zpool`pool_list_iter(zlp=0x0000000100606cc0, unavail=1, func=(zpool`status_callback at zpool_main.c:6332), data=0x00007ffeefbfe650) at zpool_iter.c:172 [opt]
    frame #11: 0x000000010000160c zpool`for_each_pool(argc=<unavailable>, argv=<unavailable>, unavail=1, proplist=<unavailable>, func=(zpool`status_callback at zpool_main.c:6332), data=0x00007ffeefbfe650) at zpool_iter.c:246 [opt]
    frame #12: 0x0000000100008447 zpool`zpool_do_status(argc=0, argv=0x00007ffeefbff8f0) at zpool_main.c:6766 [opt]
    frame #13: 0x0000000100005a0a zpool`main(argc=<unavailable>, argv=0x00007ffeefbff8e0) at zpool_main.c:8108 [opt]
    frame #14: 0x00007fff74963015 libdyld.dylib`start + 1


That looks like a useful, clean backtrace. Here's hoping it gives you a quick "aha" moment.

I anticipate that you might change either of the git master branches and have me try again, so I'll ask my next question in advance: do I have to build with zfsadm again, or can I just "git pull && make -j12 && sudo make install" now that the trees have been configured with zfsadm?
tangent
 
Posts: 47
Joined: Tue Nov 11, 2014 6:58 pm

Re: Abort trap 6 on zpool status

Postby lundman » Mon Jul 16, 2018 4:10 pm

Hmm so maybe some string manipulation issue in "devid_str_decode()".

if (devid_str_decode(devid_str, &devid, &minor) != 0)
return (NULL);


static inline
int
devid_str_decode(
char *devidstr,
ddi_devid_t *retdevid,
char **retminor_name)
{
abort();
}


Wha ? Ok, so devid stuff seems to be illumos specific and we don't have support to it - and maybe it isn't expected to be called at all. Makes one wonder why it is called though.. Possibly because the pool health is not "GOOD". Maybe some further porting work is required in this area.
User avatar
lundman
 
Posts: 1335
Joined: Thu Mar 06, 2014 2:05 pm
Location: Tokyo, Japan

Re: Abort trap 6 on zpool status

Postby tangent » Tue Jul 17, 2018 5:28 am

lundman wrote:devid stuff seems to be illumos specific
I wondered if it had anything to do with the /var/run/disk/by-* stuff, so I exported the problem pool and tried reimporting it both by-id and by-serial, and that didn't help.

Then, thinking that maybe one of the vdevs had gone MIA in a way that ZFS couldn't diagnose properly, I did some web searching for another command to get this information, and came up with "zpool list -v". As you recall from above, that command ran just fine without -v, but adding it caused a repeat of the crash; the same backtrace results, except that it goes through zpool`list_callback() instead of status_callback().

The problem pool is arranged as two 2-way mirrors, with each mirrored pair in a separate 2-drive Thunderbolt enclosure. That means if either of the enclosures had gone MIA, the pool couldn't mount, lacking any copies of roughly half the data on the pool. Also, my SMART utility sees all four drives.

Is there anything else I can test to help you find out why ZFS is failing while enumerating vdevs?
tangent
 
Posts: 47
Joined: Tue Nov 11, 2014 6:58 pm

Re: Abort trap 6 on zpool status

Postby lundman » Tue Jul 17, 2018 4:29 pm

I think we should simply comment out those calls to abort() and see what paths end up looking funny.

Ultimately, we should just "port" those functions over, to do bare minimum, since abort()ing isn't exactly a stellar choice.
User avatar
lundman
 
Posts: 1335
Joined: Thu Mar 06, 2014 2:05 pm
Location: Tokyo, Japan

Re: Abort trap 6 on zpool status

Postby tangent » Tue Jul 17, 2018 6:03 pm

I made a more conservative patch than you suggest, since that would have left return-params uninitialized, and I'm glad I did because now it crashes in a semi-orderly way somewhere else, rather than go scribbling off into kernel RAM.

The patch:

Code: Select all
diff --git a/lib/libspl/include/devid.h b/lib/libspl/include/devid.h
index 5406c33b5..2da904fa6 100644
--- a/lib/libspl/include/devid.h
+++ b/lib/libspl/include/devid.h
@@ -27,6 +27,7 @@
 #ifndef _LIBSPL_DEVID_H
 #define    _LIBSPL_DEVID_H
 
+#include <memory.h>
 #include <stdlib.h>
 
 typedef int ddi_devid_t;
@@ -43,7 +44,9 @@ devid_str_decode(
     ddi_devid_t *retdevid,
     char **retminor_name)
 {
-   abort();
+    memset(retdevid, 0, sizeof(*retdevid));
+    *retminor_name = 0;
+    return 0;
 }
 
 static inline
@@ -54,28 +57,26 @@ devid_deviceid_to_nmlist(
     char *minor_name,
     devid_nmlist_t **retlist)
 {
-   abort();
+    *retlist = 0;
+    return 0;
 }
 
 static inline
 void
 devid_str_free(char *str)
 {
-   abort();
 }
 
 static inline
 void
 devid_free(ddi_devid_t devid)
 {
-   abort();
 }
 
 static inline
 void
 devid_free_nmlist(devid_nmlist_t *list)
 {
-   abort();
 }
 
 static inline


The result:

Code: Select all
$ sudo lldb zpool
lldb> run status
....noise noise noise...
Process 37011 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
    frame #0: 0x000000010041f609 libzfs.2.dylib`zpool_vdev_name [inlined] devid_to_path at libzfs_pool.c:3656 [opt]
   3653      /*
   3654       * In a case the strdup() fails, we will just return NULL below.
   3655       */
-> 3656      path = strdup(list[0].devname);
   3657   
   3658      devid_free_nmlist(list);
   3659   
Target 0: (zpool) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
  * frame #0: 0x000000010041f609 libzfs.2.dylib`zpool_vdev_name [inlined] devid_to_path at libzfs_pool.c:3656 [opt]
    frame #1: 0x000000010041f609 libzfs.2.dylib`zpool_vdev_name(hdl=0x0000000102010a00, zhp=0x0000000100700df0, nv=0x0000000100706dc0, name_flags=8) at libzfs_pool.c:3811 [opt]
    frame #2: 0x0000000100003169 zpool`max_width(zhp=0x0000000100700df0, nv=0x0000000100706dc0, depth=4, max=10, name_flags=0) at zpool_main.c:1523 [opt]
    frame #3: 0x000000010000329b zpool`max_width(zhp=0x0000000100700df0, nv=0x0000000100702e68, depth=2, max=<unavailable>, name_flags=0) at zpool_main.c:1548 [opt]
    frame #4: 0x000000010000329b zpool`max_width(zhp=0x0000000100700df0, nv=0x00000001007011b0, depth=0, max=<unavailable>, name_flags=0) at zpool_main.c:1548 [opt]
    frame #5: 0x0000000100004688 zpool`status_callback(zhp=0x0000000100700df0, data=<unavailable>) at zpool_main.c:6631 [opt]
    frame #6: 0x00000001000014a6 zpool`pool_list_iter(zlp=0x0000000100700a40, unavail=1, func=(zpool`status_callback at zpool_main.c:6332), data=0x00007ffeefbfe8a0) at zpool_iter.c:172 [opt]
    frame #7: 0x000000010000160c zpool`for_each_pool(argc=<unavailable>, argv=<unavailable>, unavail=1, proplist=<unavailable>, func=(zpool`status_callback at zpool_main.c:6332), data=0x00007ffeefbfe8a0) at zpool_iter.c:246 [opt]
    frame #8: 0x0000000100008447 zpool`zpool_do_status(argc=0, argv=0x00007ffeefbffb48) at zpool_main.c:6766 [opt]
    frame #9: 0x0000000100005a0a zpool`main(argc=<unavailable>, argv=0x00007ffeefbffb38) at zpool_main.c:8108 [opt]
    frame #10: 0x00007fff612b6015 libdyld.dylib`start + 1
    frame #11: 0x00007fff612b6015 libdyld.dylib`start + 1


Presumably that illegal instruction is due to trying to dereference one of those pointers I zeroed.

I think I just hit my quota of scary-ops-on-a-live-filesystem for the week.
tangent
 
Posts: 47
Joined: Tue Nov 11, 2014 6:58 pm

Re: Abort trap 6 on zpool status

Postby lundman » Wed Jul 18, 2018 4:27 pm

Everything in lib/ is userland, and not so scary - You have already got the status information back from the kernel, we are just trying to parse it.

Most likely you are calling strdup() with a NULL string, but I think we should look earlier than that:

Code: Select all
         char *newdevid = path_to_devid(path);

         if (newdevid == NULL ||
             strcmp(devid, newdevid) != 0) {
            char *newpath;

            if ((newpath = devid_to_path(devid)) != NULL) {
               /*


It only tries to call "devid_to_path()" if the return on the first line has "path_to_devid()" a non-NULL value. So the whole code could be skipped, if we make sure path_to_devid() returns NULL. As to why it doesn't I am not sure, but we could probably just stick in a "return NULL;" the top of it and see how much further it gets.
User avatar
lundman
 
Posts: 1335
Joined: Thu Mar 06, 2014 2:05 pm
Location: Tokyo, Japan

Re: Abort trap 6 on zpool status

Postby tangent » Wed Jul 18, 2018 7:36 pm

lundman wrote:Everything in lib/ is userland

Well in that case, it means I can run it without installing first, if only I can figure out how. My attempt (lldb cmd/zpool/.libs/zpool) yielded a complaint from lldb:

Code: Select all
error: libzfs.2.dylib debug map object file '/usr/local/src/o3x/zfs/lib/libzfs/.libs/libzfs_pool.o' has changed (actual time is 2018-07-18 21:25:00.000000000, debug map time is 2018-07-17 20:00:54.000000000) since this executable was linked, file will be ignored


I assume that's because I'm not setting the macOS equivalents of LD_* properly, as the libtool wrapper (cmd/zpool/zpool) will, but I can't use the wrapper with lldb short of editing the lldb command into it, which means it breaks on every build, so I'm stuck.

Installing gets me past that and results in the same thing as in the prior try:

Code: Select all
Process 71821 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
    frame #0: 0x000000010041f608 libzfs.2.dylib`zpool_vdev_name [inlined] devid_to_path at libzfs_pool.c:3656 [opt]
   3653      /*
   3654       * In a case the strdup() fails, we will just return NULL below.
   3655       */
-> 3656      path = strdup(list[0].devname);
   3657   
   3658      devid_free_nmlist(list);
   3659   
Target 0: (zpool) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
  * frame #0: 0x000000010041f608 libzfs.2.dylib`zpool_vdev_name [inlined] devid_to_path at libzfs_pool.c:3656 [opt]
    frame #1: 0x000000010041f608 libzfs.2.dylib`zpool_vdev_name(hdl=0x0000000101800000, zhp=0x0000000102002130, nv=0x0000000102005fd0, name_flags=8) at libzfs_pool.c:3810 [opt]
    frame #2: 0x0000000100003169 zpool`max_width(zhp=0x0000000102002130, nv=0x0000000102005fd0, depth=4, max=10, name_flags=0) at zpool_main.c:1523 [opt]
    frame #3: 0x000000010000329b zpool`max_width(zhp=0x0000000102002130, nv=0x00000001020041d8, depth=2, max=<unavailable>, name_flags=0) at zpool_main.c:1548 [opt]
    frame #4: 0x000000010000329b zpool`max_width(zhp=0x0000000102002130, nv=0x0000000102002520, depth=0, max=<unavailable>, name_flags=0) at zpool_main.c:1548 [opt]
    frame #5: 0x0000000100004688 zpool`status_callback(zhp=0x0000000102002130, data=<unavailable>) at zpool_main.c:6631 [opt]
    frame #6: 0x00000001000014a6 zpool`pool_list_iter(zlp=0x00000001007006f0, unavail=1, func=(zpool`status_callback at zpool_main.c:6332), data=0x00007ffeefbfe610) at zpool_iter.c:172 [opt]
    frame #7: 0x000000010000160c zpool`for_each_pool(argc=<unavailable>, argv=<unavailable>, unavail=1, proplist=<unavailable>, func=(zpool`status_callback at zpool_main.c:6332), data=0x00007ffeefbfe610) at zpool_iter.c:246 [opt]
    frame #8: 0x0000000100008447 zpool`zpool_do_status(argc=0, argv=0x00007ffeefbff8b0) at zpool_main.c:6766 [opt]
    frame #9: 0x0000000100005a0a zpool`main(argc=<unavailable>, argv=0x00007ffeefbff8a0) at zpool_main.c:8108 [opt]
    frame #10: 0x00007fff688ff015 libdyld.dylib`start + 1


Just to be clear, I removed the "if" line at the top of path_to_devid(), leaving only the "return (NULL)" to run unconditionally now. That's what you wanted me to test, right?

I've attached the patch so far, since pasting it in last time gave me errors on trying to reapply it, presumably because phpBB is munging the whitespace somehow.
Attachments
zpool-status-crash-fix.patch.gz
(542 Bytes) Downloaded 490 times
tangent
 
Posts: 47
Joined: Tue Nov 11, 2014 6:58 pm

Re: Abort trap 6 on zpool status

Postby lundman » Wed Jul 18, 2018 9:57 pm

Ah the script "cmd.sh" in the zfs tree lets you run commands from the build tree:

Code: Select all
# ./cmd.sh zpool status


it does the LD binding (DYLD) for you. I think I also have a debug.sh that runs "lldb" before the exec $cmd. Not sure it is part of the repo though.
User avatar
lundman
 
Posts: 1335
Joined: Thu Mar 06, 2014 2:05 pm
Location: Tokyo, Japan

Next

Return to General Help

Who is online

Users browsing this forum: No registered users and 34 guests