[Kernel-packages] [Bug 1796469] Re: aws s3 cp --recursive hangs on the last file on a large file transfer to instance
Can also confirm that the 4.15.0-1025-aws kernel appears to have solved the timeout issues we were seeing with `git clone` processes.

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-aws in Ubuntu.
https://bugs.launchpad.net/bugs/1796469

Title:
  aws s3 cp --recursive hangs on the last file on a large file transfer
  to instance

Status in linux-aws package in Ubuntu:
  Confirmed

Bug description:
  aws s3 cp --recursive hangs on the last file on a large transfer to
  an instance. I have confirmed that this works on version
  Linux/4.15.0-1021-aws.

  aws cli version:
    aws-cli/1.16.23 Python/2.7.15rc1 Linux/4.15.0-1023-aws botocore/1.12.13

  Ubuntu version:
    Description: Ubuntu 18.04.1 LTS
    Release:     18.04
    eu-west-1 - ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20180912 - ami-00035f41c82244dab

  Package version (linux-aws):
    Installed: 4.15.0.1023.23
    Candidate: 4.15.0.1023.23
    Version table:
      *** 4.15.0.1023.23 500
          500 http://eu-west-1.ec2.archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
          500 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages
          100 /var/lib/dpkg/status
      4.15.0.1007.7 500
          500 http://eu-west-1.ec2.archive.ubuntu.com/ubuntu bionic/main amd64 Packages

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/1796469/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1796469] Re: aws s3 cp --recursive hangs on the last file on a large file transfer to instance
Trying it again with the newly posted 4.15.0-1025-aws kernel, we can no longer reproduce the issue with pg_dump.
[Kernel-packages] [Bug 1796469] Re: aws s3 cp --recursive hangs on the last file on a large file transfer to instance
We ran into the exact same issue with pg_dump after upgrading to Bionic (4.15.0-1023-aws kernel) on r5.xlarge (a next-gen Nitro instance). After 16-80 MB of data is received, tcpdump shows the receive window collapsing to 1 (not quite suddenly, but quickly), and it never recovers. We have engaged with AWS support on this issue, and they referred us to this bug.

Other than downgrading the kernel, has anyone discovered any other workarounds?
[Kernel-packages] [Bug 1796469] Re: aws s3 cp --recursive hangs on the last file on a large file transfer to instance
The problem also affects transfers from RDS instances - for example, it prevents creating an export file with pg_dump from an RDS Postgres database.

The problem seems to be caused by a sudden collapse of the downloader's TCP receive window to "1" (usually with scale 7, so 1*2^7 = 128 bytes) after transferring several tens of GB of data over a single connection. The TCP receive window never recovers.

Analysis of a single stalled transfer from RDS PostgreSQL:

$ tcpdump -nn -vvv -r pcap --dont-verify-checksums | xz -1 > pcap.txt.xz
$ xzgrep wscale pcap.txt.xz
10.16.14.237.33578 > 10.16.10.102.5432: Flags [S], seq 3273208485, win 26883, options [mss 8961,sackOK,TS val 1284903196 ecr 0,nop,wscale 7], length 0
10.16.10.102.5432 > 10.16.14.237.33578: Flags [S.], seq 2908863056, ack 3273208486, win 28960, options [mss 1460,sackOK,TS val 120076048 ecr 1284903196,nop,wscale 10], length 0

The window scale for this TCP connection is 7, so each "win N" value has to be multiplied by 2^7 = 128. The TCP window size parameter at the beginning of the file is, for example, 852, which after scaling means 852*2^7 = 109056 bytes:

$ xzgrep '10.16.14.237.33578' pcap.txt.xz | head -10 | tail -1
10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2256009, ack 201198741, win 852, options [nop,nop,TS val 1284911021 ecr 120078004], length 0

But near the end of the file it is 1, which after scaling means 1*2^7 = 128 bytes:

$ xzcat pcap.txt.xz | tail -1000 | grep '10.16.14.237.33578' | head -1
10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2266022, ack 3742538664, win 1, options [nop,nop,TS val 1286238401 ecr 120409852], length 0

And indeed the RDS server is sending 128 bytes of data and waiting for an acknowledgement before sending another 128:

13:47:27.170534 IP (tos 0x0, ttl 255, id 11479, offset 0, flags [DF], proto TCP (6), length 180)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [P.], seq 3742538664:3742538792, ack 2266022, win 174, options [nop,nop,TS val 120409903 ecr 1286238401], length 128
13:47:27.170542 IP (tos 0x0, ttl 64, id 28256, offset 0, flags [DF], proto TCP (6), length 52)
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2266022, ack 3742538792, win 1, options [nop,nop,TS val 1286238605 ecr 120409903], length 0
13:47:27.374539 IP (tos 0x0, ttl 255, id 11480, offset 0, flags [DF], proto TCP (6), length 180)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [P.], seq 3742538792:3742538920, ack 2266022, win 174, options [nop,nop,TS val 120409954 ecr 1286238605], length 128

The switch from "win 600" (76800 bytes) to "win 1" (128 bytes) is sudden, and I have no idea what could have caused it:

13:33:26.230782 IP (tos 0x0, ttl 64, id 24124, offset 0, flags [DF], proto TCP (6), length 52)
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2266022, ack 3741933728, win 600, options [nop,nop,TS val 1285397677 ecr 120199669], length 0
13:33:26.230868 IP (tos 0x0, ttl 255, id 7295, offset 0, flags [DF], proto TCP (6), length 4396)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [.], seq 3741933728:3741938072, ack 2266022, win 174, options [nop,nop,TS val 120199669 ecr 1285397677], length 4344
13:33:26.230918 IP (tos 0x0, ttl 255, id 7298, offset 0, flags [DF], proto TCP (6), length 43492)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [.], seq 3741938072:3741981512, ack 2266022, win 174, options [nop,nop,TS val 120199669 ecr 1285397677], length 43440
13:33:26.230932 IP (tos 0x0, ttl 255, id 7328, offset 0, flags [DF], proto TCP (6), length 13708)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [P.], seq 3741981512:3741995168, ack 2266022, win 174, options [nop,nop,TS val 120199669 ecr 1285397677], length 13656
13:33:26.230948 IP (tos 0x0, ttl 255, id 7338, offset 0, flags [DF], proto TCP (6), length 2948)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [.], seq 3741995168:3741998064, ack 2266022, win 174, options [nop,nop,TS val 120199669 ecr 1285397677], length 2896
13:33:26.230969 IP (tos 0x0, ttl 255, id 7340, offset 0, flags [DF], proto TCP (6), length 5348)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [P.], seq 3741998064:3742003360, ack 2266022, win 174, options [nop,nop,TS val 120199669 ecr 1285397677], length 5296
13:33:26.231759 IP (tos 0x0, ttl 255, id 7344, offset 0, flags [DF], proto TCP (6), length 4396)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [.], seq 3742003360:3742007704, ack 2266022, win 174, options [nop,nop,TS val 120199669 ecr 1285397677], length 4344
13:33:26.231775 IP (tos 0x0, ttl 255, id 7347, offset 0, flags [DF], proto TCP (6), length 2876)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [.], seq 3742007704:3742010528, ack 2266022, win 174, options [nop,nop,TS val 120199669 ecr 1285397677], length 2824
13:33:26.233238 IP (tos 0x0, ttl 64, id 24125, offset 0, flags [DF], proto TCP (6), length 52)
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2266022, ack 3742010528, win 1,
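The window-scale arithmetic used throughout the analysis above can be sketched as follows (a minimal illustration; the function name is mine, and the values are the ones observed in this capture):

```python
# The "win N" field advertised in each segment is multiplied by
# 2^wscale (negotiated via the wscale option in the SYN/SYN-ACK)
# to obtain the actual receive window in bytes.

def effective_window(win: int, wscale: int) -> int:
    """Return the receive window in bytes for an advertised win value."""
    return win << wscale  # equivalent to win * 2**wscale

# wscale 7 was negotiated by the downloader in this capture:
print(effective_window(852, 7))  # healthy window early in the capture: 109056
print(effective_window(600, 7))  # just before the collapse: 76800
print(effective_window(1, 7))    # collapsed window: 128 bytes
```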
[Kernel-packages] [Bug 1796469] Re: aws s3 cp --recursive hangs on the last file on a large file transfer to instance
For anyone using ansible, this is how I rolled back to version 21:

- name: Removing awscli
  pip:
    name: awscli
    state: absent

- name: Wait for sudo
  become: yes
  shell: while sudo fuser /var/lib/dpkg/lock >/dev/null 2>&1; do sleep 1; done;

- name: remove aws kernel version
  become: yes
  apt:
    name: "{{ item }}"
    state: absent
  with_items:
    - linux-aws
    - linux-image-aws
    - linux-headers-aws

- name: Wait for sudo
  become: yes
  shell: while sudo fuser /var/lib/dpkg/lock >/dev/null 2>&1; do sleep 1; done;

- name: Pin aws kernel
  become: yes
  apt:
    deb: http://launchpadlibrarian.net/385721541/linux-image-aws_4.15.0.1021.21_amd64.deb

- name: Wait for sudo
  become: yes
  shell: while sudo fuser /var/lib/dpkg/lock >/dev/null 2>&1; do sleep 1; done;

- name: Pin aws kernel dependency
  become: yes
  apt:
    deb: http://launchpadlibrarian.net/385721540/linux-headers-aws_4.15.0.1021.21_amd64.deb

- name: Wait for sudo
  become: yes
  shell: while sudo fuser /var/lib/dpkg/lock >/dev/null 2>&1; do sleep 1; done;

- name: Pin aws kernel version
  become: yes
  apt:
    deb: http://launchpadlibrarian.net/385721538/linux-aws_4.15.0.1021.21_amd64.deb

- name: Installing awscli
  pip:
    name: awscli
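As a hypothetical companion to the rollback above (my own sketch, not from this thread): a trivial post-rollback check that the running kernel is the pinned 4.15.0-1021-aws build rather than an affected one. The set of affected versions is an assumption based on the reports in this bug.

```python
# Hypothetical sanity check after the ansible rollback: compare the
# running kernel release string against the builds reported bad here.
import platform

AFFECTED_KERNELS = {"4.15.0-1023-aws"}  # assumption: builds reported in this bug
PINNED_KERNEL = "4.15.0-1021-aws"       # the build the playbook pins

def rollback_ok(release: str) -> bool:
    """True if the given `uname -r` string is not a known-affected kernel."""
    return release not in AFFECTED_KERNELS

# On a live host one would pass platform.release():
# print(rollback_ok(platform.release()))
print(rollback_ok(PINNED_KERNEL))        # True: pinned build is fine
print(rollback_ok("4.15.0-1023-aws"))    # False: still on the affected build
```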
[Kernel-packages] [Bug 1796469] Re: aws s3 cp --recursive hangs on the last file on a large file transfer to instance
Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: linux-aws (Ubuntu)
   Status: New => Confirmed
[Kernel-packages] [Bug 1796469] Re: aws s3 cp --recursive hangs on the last file on a large file transfer to instance
Not sure if it's related, but I'll leave this here.

We have a bunch of EC2 instances running Ubuntu 18.04 LTS and GitLab Runner, with the latter configured to execute build jobs inside Docker containers. Our EC2 instances are of the t2.medium, t2.xlarge and t3.medium sizes and run in the eu-west-1 region (in different AZs).

After upgrading from kernel 4.15.0-1021-aws to 4.15.0-1023-aws we've seen a lot of build failures across all of our GitLab Runner instances on Ubuntu 18.04 LTS. The problems happen intermittently. Specifically, the build jobs fail during the startup phase where GitLab Runner clones the git repository. The git clone operation fails with:

```
Cloning repository...
Cloning into '/builds/my-group/my-project'...
error: RPC failed; curl 18 transfer closed with outstanding read data remaining
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
```

According to the documentation at https://curl.haxx.se/libcurl/c/libcurl-errors.html this means:

> CURLE_PARTIAL_FILE (18)
>
> A file transfer was shorter or larger than expected. This happens when the
> server first reports an expected transfer size, and then delivers data that
> doesn't match the previously given size.

When this problem happened, it seems that the bandwidth used by the `git clone` operation eventually dropped to a few kByte/s or to zero (the reporting wasn't granular enough to determine which). I assume that the GitLab server later closed/timed out the connection before the git client had downloaded all of the data it expected, leaving us with a `CURLE_PARTIAL_FILE` error. We've also seen larger (10s-100s of MBs) file downloads hang indefinitely (e.g. apt install).

We've rolled back the changes made during the last few weeks one by one and eventually found that downgrading to 4.15.0-1021-aws solved the issue on all of our GitLab Runner instances (running GitLab Runner 11.2.0, 11.2.1 and 11.3.1).
I couldn't spot anything obviously related to the issue in the changelog here:
https://changelogs.ubuntu.com/changelogs/pool/main/l/linux-aws/linux-aws_4.15.0-1023.23/changelog

FWIW, this is what `docker version` reports:

```
Client:
 Version:      18.06.1-ce
 API version:  1.38
 Go version:   go1.10.3
 Git commit:   e68fc7a
 Built:        Tue Aug 21 17:24:56 2018
 OS/Arch:      linux/amd64
 Experimental: false

Server:
 Engine:
  Version:      18.06.1-ce
  API version:  1.38 (minimum version 1.12)
  Go version:   go1.10.3
  Git commit:   e68fc7a
  Built:        Tue Aug 21 17:23:21 2018
  OS/Arch:      linux/amd64
  Experimental: false
```
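The CURLE_PARTIAL_FILE failure mode described above can be sketched in a few lines (my own illustration, not GitLab or curl code): the server announces a transfer size, the stalled connection is eventually closed, and the client is left with fewer bytes than announced.

```python
# Illustration of what curl error 18 reports: the delivered byte count
# does not match the size the server announced (e.g. Content-Length).

def is_partial_transfer(expected_bytes: int, received_bytes: int) -> bool:
    """Mirror CURLE_PARTIAL_FILE: transfer shorter (or larger) than announced."""
    return received_bytes != expected_bytes

# A stall followed by a server-side timeout leaves the client short:
print(is_partial_transfer(10_485_760, 2_097_152))   # True: cut off mid-transfer
print(is_partial_transfer(10_485_760, 10_485_760))  # False: complete transfer
```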
[Kernel-packages] [Bug 1796469] Re: aws s3 cp --recursive hangs on the last file on a large file transfer to instance
Does anyone have any more information about this?