On Tue, Feb 11, 2014 at 2:51 PM, Kevin Harriss <[email protected]> wrote:
> I have a master slave streaming replication setup working with the master
> pushing WAL segments to S3. However, on the slave it says it is still
> waiting to start up when I try to query against it. I looked in the logs and
> it looks like it keeps trying to pull and fetch the same WAL segment over
> and over again. Any ideas on how to fix this? The log is below.
>
> Thanks,
>
> Kevin
>
> wal_e.operator.s3_operator INFO     MSG: begin wal restore
>         STRUCTURED: time=2014-02-11T22:16:36.292176-00 pid=21049
> action=wal-fetch key=s3://.../.../wal_005/000000010000001C000000CE.lzo
> prefix=s3://.../... seg=000000010000001C000000CE state=begin
> wal_e.worker.s3_worker INFO     MSG: completed download and decompression
>         DETAIL: Downloaded and decompressed
> "s3://.../.../wal_005/000000010000001C000000CE.lzo" to
> "pg_xlog/RECOVERYXLOG"
>         STRUCTURED: time=2014-02-11T22:16:37.383960-00 pid=21049
> wal_e.operator.s3_operator INFO     MSG: complete wal restore
>         STRUCTURED: time=2014-02-11T22:16:37.384938-00 pid=21049
> action=wal-fetch key=s3://.../.../wal_005/000000010000001C000000CE.lzo
> prefix=s3://.../... seg=000000010000001C000000CE state=complete
> wal_e.operator.s3_operator INFO     MSG: begin wal restore
>         STRUCTURED: time=2014-02-11T22:16:37.601913-00 pid=21059
> action=wal-fetch key=s3://.../.../wal_005/000000010000001C000000CE.lzo
> prefix=s3://.../... seg=000000010000001C000000CE state=begin
> wal_e.worker.s3_worker INFO     MSG: completed download and decompression
>         DETAIL: Downloaded and decompressed
> "s3://.../.../wal_005/000000010000001C000000CE.lzo" to
> "pg_xlog/RECOVERYXLOG"
>         STRUCTURED: time=2014-02-11T22:16:38.124682-00 pid=21059
> wal_e.operator.s3_operator INFO     MSG: complete wal restore
>         STRUCTURED: time=2014-02-11T22:16:38.125453-00 pid=21059
> action=wal-fetch key=s3://.../.../wal_005/000000010000001C000000CE.lzo
> prefix=s3://.../... seg=000000010000001C000000CE state=complete

The fact that WAL-E reports the download as complete is troubling: the
fetch itself appears to succeed, yet the same segment is requested again.

I've seen WAL corruption manifest this way: if memory serves, Postgres
will look at the segment, give up, and then try restoring it again
without so much as a peep.  Is Postgres complaining anywhere in its
own log?
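
To confirm the loop from the WAL-E side, one option is to count how
often each segment shows up with state=begin in the wal-fetch output.
This is only a rough sketch: repeated_fetches is a made-up helper name,
and the regex assumes the "seg=... state=begin" layout seen in the log
quoted above.

```python
import re
from collections import Counter

# Matches the "seg=<24 hex digits> state=begin" pair from WAL-E's
# STRUCTURED log lines, as they appear in the quoted log.
FETCH_BEGIN = re.compile(r"seg=([0-9A-F]{24})\s+state=begin")


def repeated_fetches(log_text):
    """Return {segment: count} for segments whose fetch began more than once."""
    counts = Counter(FETCH_BEGIN.findall(log_text))
    return {seg: n for seg, n in counts.items() if n > 1}
```

A segment that appears here with a steadily growing count is being
re-requested by Postgres rather than failing to download.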

Sadly, the last time I tracked this down, the corruption was severe:
when I downloaded the WAL segment to break it open, I noticed it had
very much the wrong file size, as did all the WAL leading up to it
before an EBS crash.  Somehow the server continued on happily for
hours afterwards, which did not make for an easy recovery (I was lucky
that there was no double failure and that pg_resetxlog plus
dump/restore was available to me).
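
A quick way to run the same file-size check on a downloaded and
decompressed segment might look like the sketch below.  It assumes the
cluster uses PostgreSQL's default 16 MiB WAL segment size (a build
compiled with a different segment size needs a different expected
value), and wal_segment_size_ok is a hypothetical helper name.

```python
import os

# Assumption: default PostgreSQL WAL segment size of 16 MiB.
DEFAULT_WAL_SEGMENT_BYTES = 16 * 1024 * 1024


def wal_segment_size_ok(path, expected=DEFAULT_WAL_SEGMENT_BYTES):
    """Return True if the file at `path` is exactly `expected` bytes long."""
    return os.path.getsize(path) == expected
```

A correct size does not prove the segment is intact, but a wrong one
is a strong hint that it is not.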

It could also be a more pedestrian bug somewhere else, but if so, it'd
be the first.

Try a new base backup/restore and cross your fingers.  It may also be
worth preserving 000000010000001C000000CE, running it through
xlogdump, and submitting what you find to pgsql-bugs if things are
amiss.
