Updating this thread in case anyone else finds themselves in this boat...

This problem is still ongoing. Here is what I've tried so far:

*Updated kernel to 3.16.0-29*
No change. The Launchpad thread on this issue (
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811) indicates
that the fix in 3.14+ might be a regression but should be fixed in the
15.04 release.  Hopefully that fix makes its way back to the LTS release.

*WAL-E master*
No change after installing master, although it did seem to take longer for
the "rides the rocket" errors to start compounding.
That's only anecdotal, though; I don't have any real timings.

*Pool Size*
I am currently trying a backup-push with --pool-size 1.  It has been
running for 12+ hours and has triggered the error only a few times.  I'm
hoping that even if it takes a couple of days I can get a complete
base backup into S3.

I haven't tried the --cluster-read-rate-limit option yet.  If the
pool size change above doesn't pan out, that option is next.  My main
concern with it is that I don't have a sense of what rate to pass in, so
the pool size option was a little easier to attempt.  If anyone has a
suggestion on how to compute that number, please let me know.
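
For what it's worth, one rough way to get a starting number is to work
backwards from how long the backup is allowed to take.  The sketch below
assumes the option takes a value in bytes per second (that is how I read
the README, but please check) and treats the ~2.4TB figure as decimal
terabytes.  The function name and the 24-hour target are made up for
illustration; they are not part of WAL-E.

```python
def read_rate_limit(cluster_bytes: int, target_hours: float) -> int:
    """Return the read rate, in bytes per second, needed to get through
    cluster_bytes within target_hours (hypothetical helper, not WAL-E API)."""
    if target_hours <= 0:
        raise ValueError("target_hours must be positive")
    return int(cluster_bytes / (target_hours * 3600))


if __name__ == "__main__":
    cluster_bytes = 2_400_000_000_000  # ~2.4 TB, decimal units assumed
    # Candidate value to try with --cluster-read-rate-limit for a
    # backup that should finish in roughly one day.
    print(read_rate_limit(cluster_bytes, 24))
```

For a 24-hour window this works out to a bit under 28 MB/s.  That is
the minimum read rate that still finishes in time, so anything lower
stretches the backup out, and it says nothing about what rate the
network driver can actually tolerate; that part would still have to be
found by experiment.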


On Fri, Jan 9, 2015 at 9:01 PM, Brian Scholl <[email protected]> wrote:

> Hello Daniel,
>
> Thanks for the response!  I will try both of these on a test server.  It
> might take a few days but I'll update this thread when I have an update.
>
> Have a great weekend!
> Brian
>
>
>
>
> On Fri, Jan 9, 2015 at 6:32 PM, Daniel Farina <[email protected]> wrote:
>
>> On Fri, Jan 9, 2015 at 1:31 PM,  <[email protected]> wrote:
>> > Hello!
>> >
>> > First of all, this is my first post to this user group.  If I'm in the
>> wrong
>> > place please don't hesitate to point me in a different direction.
>>
>> You got it right :)
>>
>> > Starting around mid-December I've been unable to complete a backup-push.
>> > After running for an hour or so the server stops responding to network
>> > requests.  The only thing I can do is wait until backup-push finishes
>> and
>> > then I can ssh back in to the server.
>>
>> Maybe it's swamping everything. Try the I/O rate limiting option (see
>> readme).
>>
>> > Once back online I can find the following problems:
>> >
>> > dmesg repeats this error: [1107575.808936] xen_netfront: xennet: skb
>> rides
>> > the rocket: 19 slots
>> > Wal-e complains about HTTP 500 when pushing files to S3 (sorry, I don't
>> have
>> > a copy of this error handy)
>>
>> That's potentially important. Can you make it handy?
>>
>> > My server is configured as follows (let me know if more info is
>> helpful):
>> >
>> > amazon ec2 i2.4xlarge
>> > ubuntu 14.04 lts
>> > postgres 9.3
>> > wal-e 0.7.3
>> > database size is ~2.4TB
>> >
>> > From what I've been able to find so far there may be a bug in the xennet
>> > driver that is causing the "rides the rocket" error, see here and here.
>> > I've tried turning some of the suggested features off with ethtool as
>> > suggested in the links and it seems to have prevented the "rides the
>> > rocket" errors but backup-push still doesn't complete.
>> >
>> > I've since used an older backup-push to get another server going for
>> testing
>> > and it too has the same problem.
>> >
>> > Has anyone else seen this?  If so, were you able to resolve it?
>>
>> Nope.
>>
>> Also, try the current WAL-E master. Compared to 0.7.3, I have
>> drastically optimized the buffer management. Performance is perhaps
>> even ten times better, which matters for an instance of your size.
>>
>
>
