Page 1 of 2

Rsync --server writing to zfs folder quits?

PostPosted: Fri Aug 24, 2018 6:08 pm
by tim.rohrer
I've recently implemented zfs on my MacOS server and created tank0/ArchivesLocal to receive rsync'd files from clients. The rsync-based script I use has been in place for years. Nothing else in the system has changed other than installing OpenZFS on OS X on the server and creating the tank0/ArchivesLocal dataset to receive the remote files.

However, some files will copy (not more than about 10% of the total from this particular client), and then the `rsync --server` process seems to just quit, leading to a write error from the remote system.

I've tried setting a `--timeout=3600` for rsync, and checking file permissions on the server, but nothing seems to work. Due to the lack of logging I've been able to find on the server, I'm at a loss. As I said, the only change I've made is to install o3x and move the zfs filesystem.

How do I access logging the zfs logging?

Or, does someone know why this is happening or what I should check?

Thanks!

Re: Rsync --server writing to zfs folder quits?

PostPosted: Sun Aug 26, 2018 3:54 pm
by lundman
Sounds like `rsync --server` is crashing, you should perhaps run dtruss on it and see why it dies.

Even just a "sudo dtruss -fn rsync" might be enough

Re: Rsync --server writing to zfs folder quits?

PostPosted: Mon Aug 27, 2018 6:07 am
by tim.rohrer
Thank you, @lundman. I had not used dtruss before. Here are the last few lines:

Code: Select all
20351/0x1caa02:  lstat64("Users/eclaire/Library/Application Support/MobileSync/Backup/3b409f1ca0ef5f73c9d902bdfcd4e3e0c6c6b2cc/76/76e479dc8e361b13fd50537bc52a182fc1c136d2\0", 0x7FFEE42B9D20, 0x0)       = 0 0
20351/0x1caa02:  stat64_extended(0x7FFEE42BA6B0, 0x7FFEE42B9980, 0x7FEF0E5410B0)       = 0 0
20351/0x1caa02:  listxattr(0x7FFEE42BA6B0, 0x7FEF0F030A00, 0x400)       = 21 0
20351/0x1caa02:  getxattr("Users/eclaire/Library/Application Support/MobileSync/Backup/3b409f1ca0ef5f73c9d902bdfcd4e3e0c6c6b2cc/76/76e479dc8e361b13fd50537bc52a182fc1c136d2\0", 0x7FEF0F030A00, 0x0)       = 31 0
20351/0x1caa02:  getxattr("Users/eclaire/Library/Application Support/MobileSync/Backup/3b409f1ca0ef5f73c9d902bdfcd4e3e0c6c6b2cc/76/76e479dc8e361b13fd50537bc52a182fc1c136d2\0", 0x7FEF0F030A00, 0x7FEF0E5403F0)       = 31 0
20351/0x1caa02:  lstat64("Users/eclaire/Library/Application Support/MobileSync/Backup/3b409f1ca0ef5f73c9d902bdfcd4e3e0c6c6b2cc/76/76e5ecec7d8380f77726e720a4709428027efa8f\0", 0x7FFEE42B9D20, 0x0)       = 0 0
20351/0x1caa02:  stat64_extended(0x7FFEE42BA6B0, 0x7FFEE42B9980, 0x7FEF0E663E30)       = 0 0
20351/0x1caa02:  listxattr(0x7FFEE42BA6B0, 0x7FEF0F030A00, 0x400)       = 0 0
20351/0x1caa02:  lstat64("Users/eclaire/Library/Application Support/MobileSync/Backup/3b409f1ca0ef5f73c9d902bdfcd4e3e0c6c6b2cc/76/76fb3c685d422334fd25d9777d64db74af7cb25a\0", 0x7FFEE42B9D20, 0x0)       = 0 0
20351/0x1caa02:  stat64_extended(0x7FFEE42BA6B0, 0x7FFEE42B9980, 0x7FEF0E663E30)       = 0 0
20351/0x1caa02:  listxattr(0x7FFEE42BA6B0, 0x7FEF0F030A00, 0x400)       = 21 0
20351/0x1caa02:  getxattr("Users/eclaire/Library/Application Support/MobileSync/Backup/3b409f1ca0ef5f73c9d902bdfcd4e3e0c6c6b2cc/76/76fb3c685d422334fd25d9777d64db74af7cb25a\0", 0x7FEF0F030A00, 0x0)       = 31 0
20351/0x1caa02:  getxattr("Users/eclaire/Library/Application Support/MobileSync/Backup/3b409f1ca0ef5f73c9d902bdfcd4e3e0c6c6b2cc/76/76fb3c685d422334fd25d9777d64db74af7cb25a\0", 0x7FEF0F030A00, 0x7FEF0E6638A0)       = 31 0
20351/0x1caa02:  select(0x4, 0x7FFEE42B8ED0, 0x7FFEE42B8E50, 0x7FFEE42B8DD0, 0x7FFEE42B8F70)       = 1 0
dtrace: error on enabled probe ID 2172 (ID 161: syscall::write:return): invalid kernel access in action #13 at DIF offset 68
20352/0x1caa06:  select(0x1, 0x7FFEE42B8150, 0x7FFEE42B80D0, 0x7FFEE42B8050, 0x7FFEE42B81F0)       = 0 0
20351/0x1caa02:  select(0x4, 0x7FFEE42B8ED0, 0x7FFEE42B8E50, 0x7FFEE42B8DD0, 0x7FFEE42B8F70)       = 0 0
20352/0x1caa06:  select(0x1, 0x7FFEE42B8150, 0x7FFEE42B80D0, 0x7FFEE42B8050, 0x7FFEE42B81F0)       = 1 0
dtrace: error on enabled probe ID 2174 (ID 159: syscall::read:return): invalid kernel access in action #13 at DIF offset 68
20352/0x1caa06:  sigaction(0x1E, 0x7FFEE42B7FC8, 0x0)       = 0 0
20352/0x1caa06:  sigaction(0x1F, 0x7FFEE42B7FC8, 0x0)       = 0 0
20352/0x1caa06:  __semwait_signal(0x703, 0x0, 0x1)       = -1 Err#60
20351/0x1caa02:  select(0x4, 0x7FFEE42B8ED0, 0x7FFEE42B8E50, 0x7FFEE42B8DD0, 0x7FFEE42B8F70)       = 1 0
dtrace: error on enabled probe ID 2174 (ID 159: syscall::read:return): invalid kernel access in action #13 at DIF offset 68
20351/0x1caa02:  sigaction(0x1E, 0x7FFEE42B8D48, 0x0)       = 0 0
20351/0x1caa02:  sigaction(0x1F, 0x7FFEE42B8D48, 0x0)       = 0 0
20351/0x1caa02:  wait4(0xFFFFFFFFFFFFFFFF, 0x7FFEE42B7F64, 0x1)       = 20352 0
20351/0x1caa02:  wait4(0xFFFFFFFFFFFFFFFF, 0x7FFEE42B7F64, 0x1)       = -1 Err#10
20351/0x1caa02:  sigreturn(0x7FFEE42B8438, 0x1E, 0x0)       = 0 Err#-2
20351/0x1caa02:  kill(20352, 30)       = -1 Err#3
20351/0x1caa02:  __semwait_signal(0x903, 0x0, 0x1)       = -1 Err#60
dtrace: error on enabled probe ID 2366 (ID 897: syscall::thread_selfid:return): invalid user access in action #5 at DIF offset 0
dtrace: error on enabled probe ID 2366 (ID 897: syscall::thread_selfid:return): invalid user access in action #5 at DIF offset 0


The `invalid kernel access in action #13` also appears at various times in the rest of the output.

Thanks!

Re: Rsync --server writing to zfs folder quits?

PostPosted: Mon Aug 27, 2018 4:27 pm
by lundman
Code: Select all
20351/0x1caa02:  wait4(0xFFFFFFFFFFFFFFFF, 0x7FFEE42B7F64, 0x1)       = -1 Err#10
20351/0x1caa02:  sigreturn(0x7FFEE42B8438, 0x1E, 0x0)       = 0 Err#-2
20351/0x1caa02:  kill(20352, 30)       = -1 Err#3
20351/0x1caa02:  __semwait_signal(0x903, 0x0, 0x1)       = -1 Err#60


So it receives signal 10, which is SIGBUS. Can you try a different version of rsync in case it is a known problem, if we manage to make it crash in all versions, we must be returning something invalid.

Re: Rsync --server writing to zfs folder quits?

PostPosted: Wed Aug 29, 2018 8:22 pm
by tim.rohrer
Sorry for the delay. I was out for a couple of days.

Interestingly, this client (and the server) are using rsync 3.1.2 which I had built some time ago. I move the server (using MacPorts) to 3.1.3, but that doesn't seem to solve the problem. I will need to install 3.1.3 on the client, which I'll work on, and then report back here.

Tim

Re: Rsync --server writing to zfs folder quits?

PostPosted: Thu Aug 30, 2018 8:54 am
by tim.rohrer
Code: Select all
dtrace: error on enabled probe ID 2172 (ID 161: syscall::write:return): invalid kernel access in action #13 at DIF offset 68
 7193/0x2ab2fc:  select(0x1, 0x7FFEEBDC2150, 0x7FFEEBDC20D0, 0x7FFEEBDC2050, 0x7FFEEBDC21E0)       = 0 0
 7192/0x2ab2f7:  select(0x4, 0x7FFEEBDC2ED0, 0x7FFEEBDC2E50, 0x7FFEEBDC2DD0, 0x7FFEEBDC2F60)       = 0 0
 7193/0x2ab2fc:  select(0x1, 0x7FFEEBDC2150, 0x7FFEEBDC20D0, 0x7FFEEBDC2050, 0x7FFEEBDC21E0)       = 0 0
 7192/0x2ab2f7:  select(0x4, 0x7FFEEBDC2ED0, 0x7FFEEBDC2E50, 0x7FFEEBDC2DD0, 0x7FFEEBDC2F60)       = 0 0
 7193/0x2ab2fc:  select(0x1, 0x7FFEEBDC2150, 0x7FFEEBDC20D0, 0x7FFEEBDC2050, 0x7FFEEBDC21E0)       = 0 0
 7192/0x2ab2f7:  select(0x4, 0x7FFEEBDC2ED0, 0x7FFEEBDC2E50, 0x7FFEEBDC2DD0, 0x7FFEEBDC2F60)       = 0 0
 7193/0x2ab2fc:  select(0x1, 0x7FFEEBDC2150, 0x7FFEEBDC20D0, 0x7FFEEBDC2050, 0x7FFEEBDC21E0)       = 1 0
dtrace: error on enabled probe ID 2174 (ID 159: syscall::read:return): invalid kernel access in action #13 at DIF offset 68
 7193/0x2ab2fc:  sigaction(0x1E, 0x7FFEEBDC1FC8, 0x0)       = 0 0
 7193/0x2ab2fc:  sigaction(0x1F, 0x7FFEEBDC1FC8, 0x0)       = 0 0
 7193/0x2ab2fc:  __semwait_signal(0x703, 0x0, 0x1)       = -1 Err#60
 7192/0x2ab2f7:  select(0x4, 0x7FFEEBDC2ED0, 0x7FFEEBDC2E50, 0x7FFEEBDC2DD0, 0x7FFEEBDC2F60)       = 1 0
dtrace: error on enabled probe ID 2174 (ID 159: syscall::read:return): invalid kernel access in action #13 at DIF offset 68
 7192/0x2ab2f7:  sigaction(0x1E, 0x7FFEEBDC2D48, 0x0)       = 0 0
 7192/0x2ab2f7:  sigaction(0x1F, 0x7FFEEBDC2D48, 0x0)       = 0 0
 7192/0x2ab2f7:  kill(7193, 30)       = 0 0
 7192/0x2ab2f7:  wait4(0xFFFFFFFFFFFFFFFF, 0x7FFEEBDC27CC, 0x1)       = 7193 0
 7192/0x2ab2f7:  wait4(0xFFFFFFFFFFFFFFFF, 0x7FFEEBDC27CC, 0x1)       = -1 Err#10
 7192/0x2ab2f7:  sigreturn(0x7FFEEBDC2C80, 0x1E, 0x0)       = 0 Err#-2
 7192/0x2ab2f7:  __semwait_signal(0x1403, 0x0, 0x1)       = -1 Err#60
  604/0x2acb41:  thread_selfid(0xC, 0x0, 0x0)       = 2804545 0
dtrace: error on enabled probe ID 2366 (ID 897: syscall::thread_selfid:return): invalid user access in action #5 at DIF offset 0


Using MacPorts rsync 3.1.3 on both client and server. I guess I'll try compiling myself unless you have another idea? After that, I can try rolling back the versions, but I need to keep ACL support.

Thanks.

Re: Rsync --server writing to zfs folder quits?

PostPosted: Fri Aug 31, 2018 7:37 am
by tim.rohrer
I compiled 3.1.0 for both the server and the client. Here was the result:

Code: Select all
27357/0x2d1a2f:  openat_nocancel(0xFFFFFFFFFFFFFFFE, 0x7FFEDFDD8F50, 0xA02)       = 1 0
27357/0x2d1a2f:  fchmod(0x1, 0x180, 0x0)       = 0 0
27357/0x2d1a2f:  select(0x5, 0x7FFEDFDD8800, 0x7FFEDFDD8780, 0x7FFEDFDD8700, 0x7FFEDFDD8898)       = 2 0
dtrace: error on enabled probe ID 2174 (ID 159: syscall::read:return): invalid kernel access in action #13 at DIF offset 68
dtrace: error on enabled probe ID 2172 (ID 161: syscall::write:return): invalid kernel access in action #13 at DIF offset 68
27356/0x2d1a2e:  select(0x4, 0x7FFEDFDD7F70, 0x7FFEDFDD7EF0, 0x7FFEDFDD7E70, 0x7FFEDFDD8008)       = 1 0
27357/0x2d1a2f:  select(0x1, 0x7FFEDFDD8800, 0x7FFEDFDD8780, 0x7FFEDFDD8700, 0x7FFEDFDD8898)       = 1 0
dtrace: error on enabled probe ID 2174 (ID 159: syscall::read:return): invalid kernel access in action #13 at DIF offset 68
dtrace: error on enabled probe ID 2174 (ID 159: syscall::read:return): invalid kernel access in action #13 at DIF offset 68
27357/0x2d1a2f:  sigaction(0x1E, 0x7FFEDFDD8668, 0x0)       = 0 0
27357/0x2d1a2f:  sigaction(0x1F, 0x7FFEDFDD8668, 0x0)       = 0 0
27357/0x2d1a2f:  unlink("Users/eclaire/Pictures/Photos Library-network-network-network.photoslibrary/Database/Albums/.x7qrdVEeTeGKAjXrnAh+%Q.apalbum.AZFaLz\0", 0x0, 0x0)       = 0 0
27357/0x2d1a2f:  select(0x0, 0x0, 0x0, 0x0, 0x7FFEDFDD8628)       = 0 0
27356/0x2d1a2e:  select(0x4, 0x7FFEDFDD7F70, 0x7FFEDFDD7EF0, 0x7FFEDFDD7E70, 0x7FFEDFDD8008)       = 1 0
dtrace: error on enabled probe ID 2174 (ID 159: syscall::read:return): invalid kernel access in action #13 at DIF offset 68
27356/0x2d1a2e:  sigaction(0x1E, 0x7FFEDFDD7DD8, 0x0)       = 0 0
27356/0x2d1a2e:  sigaction(0x1F, 0x7FFEDFDD7DD8, 0x0)       = 0 0
27356/0x2d1a2e:  kill(27357, 30)       = 0 0
27356/0x2d1a2e:  select(0x0, 0x0, 0x0, 0x0, 0x7FFEDFDD7D98)       = -1 Err#4
27356/0x2d1a2e:  wait4(0xFFFFFFFFFFFFFFFF, 0x7FFEDFDD7804, 0x1)       = 27357 0
27356/0x2d1a2e:  wait4(0xFFFFFFFFFFFFFFFF, 0x7FFEDFDD7804, 0x1)       = -1 Err#10
27356/0x2d1a2e:  sigreturn(0x7FFEDFDD7CD0, 0x1E, 0x0)       = 0 Err#-2
27356/0x2d1a2e:  select(0x0, 0x0, 0x0, 0x0, 0x7FFEDFDD7D98)       = 0 0
dtrace: error on enabled probe ID 2366 (ID 897: syscall::thread_selfid:return): invalid user access in action #5 at DIF offset 0
dtrace: error on enabled probe ID 2366 (ID 897: syscall::thread_selfid:return): invalid user access in action #5 at DIF offset 0


I've now tried three different versions of rsync (3.1.0, 3.1.2, 3.1.3). I think I can roll back to the 3.0 series, but that is several years old, so....

How do we proceed?

Thank you!

Re: Rsync --server writing to zfs folder quits?

PostPosted: Sun Sep 02, 2018 6:01 pm
by lundman
Hmm can you make it dump core (ulimit -c unlimited) and we can peek at it with lldb "lldb rsync rsync.core". It might reveal some clues as to what result is unexpected.

You did ask about logging earlier, you can
Code: Select all
# sysctl kstat.zfs.darwin.tunable.vnop_debug=1
# log stream --source --predicate 'senderImagePath CONTAINS "zfs" OR senderImagePath CONTAINS "spl"'


But it's pretty verbose, and can be hard to find anything related. dtruss'ing the rsync binary is more likely to show you the last call(s) that may be involved with the crash.

Re: Rsync --server writing to zfs folder quits?

PostPosted: Mon Sep 03, 2018 8:12 am
by tim.rohrer
This level of depth into kernel/os workings is pretty new to me, but I'm researching and testing.

I've read `ulimit -c unlimited` needs to be run from the terminal that runs the program that crashes. However, on the server I don't know how to control this.

Or, am I misunderstanding...should I be getting this core dump from the client?

Re: Rsync --server writing to zfs folder quits?

PostPosted: Mon Sep 03, 2018 2:48 pm
by tim.rohrer
After extensive research, I created a plist under /Library/LaunchDaemons and loaded it.

Code: Select all
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
        "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
  <dict>
    <key>Label</key>
    <string>limit.core</string>
    <key>ProgramArguments</key>
    <array>
      <string>launchctl</string>
      <string>limit</string>
      <string>core</string>
      <string>unlimited</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>ServiceIPC</key>
    <false/>
  </dict>
</plist>


Afterwards, I've checked multiple terminals and found `ulimit -c` produces "unlimited". Creating a sample c program that will segfault results in a core.PID being written to /cores, and the crash report shows up on Console. Likewise, opening TextEdit and then issuing `killall -ABRT TextEdit` results in a crash report. However it does not produce a core.PID under /cores.

I've tested multiple rsyc runs from the client and watched dtruss output until rsync --server dies, and neither a crash report nor a core.PID file is generated to any location that I can see. Thinking the issue might be because I'm running rsync as root, I did a sudo -i on a remote terminal and then ran TextEdit, which failed since the UIDs of the terminal user and the window owner didn't match. This report showed up as a crash report in Console.

Because this particular situation is not likely related to OpenZFS, I'll write this up for AskDifferent and see if there is something wrong with my plist.

But, if anyone sees an obvious reason why the core dump is not being generated, I'd appreciate the help.

Tim