Updating this thread in case anyone else finds themselves in this boat... This problem is still ongoing. Here are the things I've tried:

*Updated kernel to 3.16.0-29*

No change. The Launchpad thread on this issue (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811) indicates that the fix in 3.14+ might be a regression, but that it should be fixed in the 15.04 release. Hopefully that fix makes its way back to LTS.

*WAL-E master*

No change after installing master, but it did seem to take longer for the "rides the rocket" errors to start compounding. That's just anecdotal though; I don't have any real timings.

*Pool Size*

I am currently trying a backup-push using --pool-size 1. This has been running for 12+ hours now and has only caused the error a few times. I'm hoping that even if this takes a couple of days I can get a complete basebackup into S3.

I haven't tried the --cluster-read-rate-limit option yet. If the pool size change above doesn't pan out, this option is next. My main concern with it is that I don't have a sense of what rate to pass in, so the pool size option was a little easier to attempt. If anyone has a suggestion on how to compute that number, please let me know.

On Fri, Jan 9, 2015 at 9:01 PM, Brian Scholl <[email protected]> wrote:

> Hello Daniel,
>
> Thanks for the response! I will try both of these on a test server. It
> might take a few days but I'll update this thread when I have an update.
>
> Have a great weekend!
> Brian
>
> On Fri, Jan 9, 2015 at 6:32 PM, Daniel Farina <[email protected]> wrote:
>
>> On Fri, Jan 9, 2015 at 1:31 PM, <[email protected]> wrote:
>> > Hello!
>> >
>> > First of all, this is my first post to this user group. If I'm in the wrong
>> > place please don't hesitate to point me in a different direction.
>>
>> You got it right :)
>>
>> > Starting around mid-December I've been unable to complete a backup-push.
>> > After running for an hour or so the server stops responding to network
>> > requests. The only thing I can do is wait until backup-push finishes and
>> > then I can ssh back in to the server.
>>
>> Maybe it's swamping everything. Try the I/O rate limiting option (see
>> readme).
>>
>> > Once back online I can find the following problems:
>> >
>> > dmesg repeats this error: [1107575.808936] xen_netfront: xennet: skb rides
>> > the rocket: 19 slots
>> > Wal-e complains about HTTP 500 when pushing files to S3 (sorry, I don't have
>> > a copy of this error handy)
>>
>> That's potentially important. Can you make it handy?
>>
>> > My server is configured as follows (let me know if more info is helpful):
>> >
>> > amazon ec2 i2.4xlarge
>> > ubuntu 14.04 lts
>> > postgres 9.3
>> > wal-e 7.3
>> > database size is ~2.4TB
>> >
>> > From what I've been able to find so far there may be a bug in the xennet
>> > driver that is causing the "rides the rocket" error, see here and here.
>> > I've tried turning some of the suggested features off with ethtool as
>> > suggested in the links and it seems to have prevented the "rides to the
>> > rocket" errors but backup-push still doesn't complete.
>> >
>> > I've since used an older backup-push to get another server going for testing
>> > and it too has the same problem.
>> >
>> > Has anyone else seen this? If so, were you able to resolve it?
>>
>> Nope.
>>
>> Also, try the current WAL-E master. Compared to 0.7.3, I have
>> drastically optimized the buffer management. Performance is perhaps
>> even ten times better, which matters for an instance of your size.
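
P.S. On the --cluster-read-rate-limit number: one rough way to get a starting value, which I have not tested, is to divide the cluster size by the time you're willing to let the backup run. The sketch below assumes the option takes bytes per second (that is my reading of the readme), treats ~2.4TB as 2.4 * 10^12 bytes, and ignores compression and S3 upload speed. The function name is made up for illustration and is not part of WAL-E.

```python
def read_rate_limit(db_bytes: int, target_hours: float) -> int:
    """Return the read rate (bytes/second) needed to read db_bytes
    in target_hours. Hypothetical helper, not part of WAL-E."""
    if target_hours <= 0:
        raise ValueError("target_hours must be positive")
    return int(db_bytes / (target_hours * 3600))


# ~2.4TB cluster from this thread, taken as decimal terabytes (assumption).
DB_BYTES = 2_400_000_000_000

if __name__ == "__main__":
    for hours in (24, 48, 72):
        rate = read_rate_limit(DB_BYTES, hours)
        print(f"{hours}h -> {rate} bytes/s (~{rate / 1e6:.1f} MB/s)")
```

For a 24 hour window this works out to roughly 27.8 MB/s. The result is only the minimum rate needed to finish in that window; whether the network stays responsive at that rate is something that would still have to be found by trial.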
