Hi,

We're encountering a rather weird, sporadic hang during a large-file 
transfer. A bit of background on the setup:

 - Host A rsyncs a MySQL database from Host B. Both sides use rsync 
   3.1.1 on Debian Jessie.

 - Host B serves the files using rsync --daemon from an ext4 filesystem 
   residing on a thin LVM snapshot.

 - Host A uses the following command to rsync the files to a btrfs subvolume:
   rsync -og -avPH --log-file=/var/log/mysql-rsync.log --debug=ALL 
--chown=user:user --delete --inplace rsync://hostb.example.com/mysql/ 
/media/btrfs/db-snapshot-$$/mysql
   (I know parts of the command (-og + -a + --chown) don't make much 
   sense :)

 - Both hosts sit on the same LAN, connected to a Juniper EX4200 stack 
   with active-backup interfaces. MTU is standard 1500 and there are no 
         known network issues. Host A has Intel I350 (igb) NICs, while Host B 
has
   BCM5709 (bnx2) NICs. Both hosts run Linux 3.16.36.

Once every few (~3-5) transfers, the rsync process hangs with both sides 
waiting on select() on the TCP socket. From the transfer logs I can see 
that the hang always happens when about to receive the same 32GB file, 
already a couple of GBs into the total transfer.

Host A reports:
....
04:41:03 [6281] [receiver] send_msg_int(100, 1701)
04:41:03 [6281] recv_files(mysql/innodb_table_stats.ibd)
04:41:03 [6281] sender finished //mysql/innodb_table_stats.ibd
04:41:03 [6281] send_files(1749, //mysql/user.MYD)
04:41:03 [6281] sender finished //mysql/user.MYD
04:41:03 [6281] send_files(1750, //mysql/user.MYI)
04:41:03 [6281] sender finished //mysql/user.MYI
04:41:03 [6281] send_files(1806, //some_dbase)
04:41:03 [6281] send_files(1808, //some_dbase/actions.ibd)
04:41:03 [6281] >f.st...... mysql/innodb_table_stats.ibd
04:41:03 [6281] finishing mysql/innodb_table_stats.ibd
04:41:03 [6281] set modtime of mysql/innodb_table_stats.ibd to (1491343251) Wed 
Apr  5 01:00:51 2017
04:41:03 [6281] [receiver] send_msg_int(100, 1703)
04:41:03 [6281] recv_files(mysql/user.MYD)
04:41:03 [6281] >f.st...... mysql/user.MYD
04:41:03 [6281] finishing mysql/user.MYD
04:41:03 [6281] set modtime of mysql/user.MYD to (1487686529) Tue Feb 21 
16:15:29 2017
04:41:03 [6281] [receiver] send_msg_int(100, 1749)
04:41:03 [6281] recv_files(mysql/user.MYI)
04:41:03 [6281] >f..t...... mysql/user.MYI
04:41:03 [6281] finishing mysql/user.MYI
04:41:03 [6281] set modtime of mysql/user.MYI to (1487689840) Tue Feb 21 
17:10:40 2017
04:41:03 [6281] [receiver] send_msg_int(100, 1750)
04:41:03 [6281] recv_files(some_dbase)
04:41:03 [6281] .d..t...... some_dbase/
<silence>

while Host B reports:
....
01:41:03 [3726] send_files(1749, //mysql/user.MYD)
01:41:03 [3726] <f.st...... mysql/user.MYD
01:41:03 [3726] sender finished //mysql/user.MYD
01:41:03 [3726] send_files(1750, //mysql/user.MYI)
01:41:03 [3726] <f..t...... mysql/user.MYI
01:41:03 [3726] sender finished //mysql/user.MYI
01:41:03 [3726] send_files(1806, //some_dbase)
01:41:03 [3726] .d..t...... some_dbase/
01:41:03 [3726] send_files(1808, //some_dbase/actions.ibd)
<silence>

The backtraces of the hung processes are:

Host A (receiver):
 #0  0x00007f87d7da3873 in select () from /lib/x86_64-linux-gnu/libc.so.6
 #1  0x00007f87d88dd7a2 in perform_io (needed=needed@entry=4, 
flags=flags@entry=17) at io.c:742
 #2  0x00007f87d88deef7 in raw_read_int () at io.c:908
 #3  0x00007f87d88df683 in read_a_msg () at io.c:1423
 #4  0x00007f87d88e0676 in read_buf (f=f@entry=4, buf=buf@entry=0x7ffd814c0df0 
"\300\037L\201\375\177", len=len@entry=1) at io.c:1835
 #5  0x00007f87d88e157b in read_ndx (f=f@entry=4) at io.c:2222
 #6  0x00007f87d88be377 in read_ndx_and_attrs (f_in=4, f_out=6, 
iflag_ptr=0x7ffd814c0f28, type_ptr=0x7ffd814c0f27 "\200\b", 
     buf=0x7ffd814c1fc0 "", len_ptr=0x7ffd814c0f2c) at rsync.c:320
 #7  0x00007f87d88c5b5e in recv_files (f_in=5, f_out=-2125728480, 
local_name=0x7ffd814c0fc0 "some_dbase") at receiver.c:548
 #8  0x00007f87d88d0787 in do_recv (f_in=4, f_out=6, local_name=0x0) at 
main.c:906
 #9  0x00007f87d88d1218 in client_run (f_in=4, f_out=4, pid=-1, argc=<optimized 
out>, argv=<optimized out>) at main.c:1213
 #10 0x00007f87d88b4330 in start_client (argv=<optimized out>, argc=<optimized 
out>) at main.c:1382
 #11 main (argc=2, argv=0x7f87d9000240) at main.c:1651

Host B (sender):
 #0  0x00007f9a7130a873 in __select_nocancel () at 
../sysdeps/unix/syscall-template.S:81
 #1  0x00007f9a71e447a2 in perform_io (needed=needed@entry=4, 
flags=flags@entry=17) at io.c:742
 #2  0x00007f9a71e45e2e in raw_read_buf (buf=0x7ffe7c1aa380 "", len=4) at 
io.c:895
 #3  0x00007f9a71e47643 in read_buf (f=<optimized out>, 
buf=buf@entry=0x7ffe7c1aa380 "", len=len@entry=4) at io.c:1840
 #4  0x00007f9a71e47851 in read_int (f=<optimized out>) at io.c:1711
 #5  0x00007f9a71e2f652 in receive_sums (f=<optimized out>) at sender.c:96
 #6  send_files (f_in=4, f_out=1930300096) at sender.c:332
 #7  0x00007f9a71e37f6c in do_server_sender (argv=<optimized out>, 
argc=<optimized out>, f_out=3, f_in=3) at main.c:840
 #8  start_server (f_in=3, f_out=3, argc=2, argv=0x7f9a7302f240) at main.c:1107
 #9  0x00007f9a71e5a34e in rsync_module (f_in=3, f_out=3, i=0, 
addr=0x7f9a72098518 <numeric_ids> "\377\377\377\377", 
     host=0x7f9a72098c80 <name_buf> "host-a.example.com") at clientserver.c:1009
 #10 0x00007f9a71e5b260 in start_daemon (f_in=3, f_out=3) at clientserver.c:1100
 #11 0x00007f9a71e4f5d1 in start_accept_loop (port=4, fn=0x7ffe7c1aa160, 
fn@entry=0x7f9a71e5b120 <start_daemon>) at socket.c:618
 #12 0x00007f9a71e5bb46 in daemon_main () at clientserver.c:1202
 #13 0x00007f9a71e1aeb1 in main (argc=0, argv=0x0) at main.c:1626
 
The worrysome part is the Send-Q on Host A, which last time contained 263392
unACK'd bytes. The rest of the queues involved were empty. tcpdump 
reveals no retransmission attempts, only 2 window updates and a stream 
of TCP keepalives after the point where the transfer halts:

 Bytes  Time (s)       SRC           DST        Proto    Details
 12995  1.637946       hostb        hosta       RSYNC    Server MOTD
 86     1.637963       hosta       hostb        TCP      39115→rsync(873) [ACK] 
Seq=1 Ack=37686543 Win=23993 Len=0 TSval=2255190023 TSecr=100923227
 86     1.638406       hostb        hosta       TCP      [TCP Window Update] 
rsync(873)→39115 [ACK] Seq=37686543 Ack=1 Win=1688 Len=0 TSval=100923227 
TSecr=2255190022
 1514   1.641767       hostb        hosta       RSYNC    Server MOTD
 86     1.641802       hosta       hostb        TCP      39115→rsync(873) [ACK] 
Seq=1 Ack=37687971 Win=23985 Len=0 TSval=2255190024 TSecr=100923228
 1381   1.641938       hostb        hosta       RSYNC    Server MOTD
 86     1.641955       hosta       hostb        TCP      39115→rsync(873) [ACK] 
Seq=1 Ack=37689266 Win=23977 Len=0 TSval=2255190024 TSecr=100923228
 86     1.647888       hostb        hosta       TCP      [TCP Window Update] 
rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=6840 Len=0 TSval=100923230 
TSecr=2255190024
 86     121.931963     hosta       hostb        TCP      [TCP Keep-Alive] 
39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255220097 
TSecr=100923230
 86     121.932062     hostb        hosta       TCP      [TCP Window Update] 
rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=100953302 
TSecr=2255190024
 86     242.248005     hosta       hostb        TCP      [TCP Keep-Alive] 
39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255250176 
TSecr=100953302
 86     242.248090     hostb        hosta       TCP      [TCP Keep-Alive ACK] 
rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=100983381 
TSecr=2255190024
 86     362.567974     hosta       hostb        TCP      [TCP Keep-Alive] 
39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255280256 
TSecr=100983381
 86     362.568085     hostb        hosta       TCP      [TCP Keep-Alive ACK] 
rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=101013461 
TSecr=2255190024
 86     482.891958     hosta       hostb        TCP      [TCP Keep-Alive] 
39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255310336 
TSecr=101013461
 86     482.892026     hostb        hosta       TCP      [TCP Keep-Alive ACK] 
rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=101043542 
TSecr=2255190024

>From the backtraces it appears as though the sender is waiting for the
checksums the client has sent, while the client is waiting for the next file id
to come in. I can only assume the deadlock is because of the client's checksums
sitting in its Send-Q. The only potentially related part I can find in the
traffic captures are a couple of Zero Window responses by hostb, which however
are followed by window updates with non-zero sizes.

Any ideas or hints what else to look for?

Regards,
Apollon

-- 
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html

Reply via email to