Hello,
We're running Postgres with wal-e inside LXC containers, trying to set up a
master with a hot standby.
WAL segments are successfully being uploaded by the master and fetched by the
hot standby -- however, the standby is using a large amount of CPU.
The relevant sections of our postgresql.conf (same on both instances):
archive_command = 'test ! -d /etc/wal-e.d-WRITE/env || envdir /etc/wal-e.d-WRITE/env wal-e wal-push %p'
archive_mode = on
archive_timeout = 1min
hot_standby = on
hot_standby_feedback = on
max_wal_senders = 20
wal_buffers = 16MB
wal_keep_segments = 5
wal_level = 'hot_standby'
wal_receiver_timeout = 1min
wal_sender_timeout = 1min
wal_sync_method = 'fdatasync'
and the recovery.conf (only on the hot_standby):
standby_mode = on
restore_command = 'envdir /etc/wal-e.d-READ/env wal-e wal-fetch "%f" "%p"'
recovery_end_command = 'test ! -d /etc/wal-e.d-WRITE/env || envdir /etc/wal-e.d-WRITE/env wal-e backup-push /var/lib/postgresql/data LATEST'
trigger_file = '/tmp/standby_off'
Running docker top, we can see that wal-e is using a lot of CPU:
# docker top container axk-%cpu -o pid,%cpu,command
PID %CPU COMMAND
51161 5.4 /usr/bin/python
/usr/local/bin/wal-e wal-prefetch /var/lib/postgresql/data/pg_xlog
00000001000000000000005E
51441 5.2 /usr/bin/python
/usr/local/bin/wal-e wal-fetch 00000001000000000000005A pg_xlog/RECOVERYXLOG
51449 5.2 /usr/bin/python
/usr/local/bin/wal-e wal-fetch 00000001000000000000005A pg_xlog/RECOVERYXLOG
51500 5.0 /usr/bin/python
/usr/local/bin/wal-e wal-fetch 00000001000000000000005A pg_xlog/RECOVERYXLOG
51728 3.0 /usr/bin/python
/usr/local/bin/wal-e wal-fetch 00000001000000000000005A pg_xlog/RECOVERYXLOG
51731 3.0 /usr/bin/python
/usr/local/bin/wal-e wal-fetch 00000001000000000000005A pg_xlog/RECOVERYXLOG
51732 3.0 /usr/bin/python
/usr/local/bin/wal-e wal-fetch 00000001000000000000005A pg_xlog/RECOVERYXLOG
51733 2.0 /usr/bin/python
/usr/local/bin/wal-e wal-fetch 00000001000000000000005A pg_xlog/RECOVERYXLOG
51735 2.0 /usr/bin/python
/usr/local/bin/wal-e wal-fetch 00000001000000000000005A pg_xlog/RECOVERYXLOG
51738 2.0 /usr/bin/python
/usr/local/bin/wal-e wal-fetch 00000001000000000000005A pg_xlog/RECOVERYXLOG
50632 0.0 postgres: (42039) idle
51497 0.0 sh -c envdir /etc/wal-e.d-READ/env
wal-e wal-fetch "00000001000000000000005A" "pg_xlog/RECOVERYXLOG"
64809 0.0 /bin/bash /docker-entrypoint.sh
postgres
65153 0.0 postgres
65158 0.0 postgres: logger process
65159 0.0 postgres: startup process waiting
for 00000001000000000000005A
65301 0.0 postgres: checkpointer process
65302 0.0 postgres: writer process
65304 0.0 postgres: stats collector process
As the output shows, the wal-e processes account for nearly all of the CPU usage.
(Note: I can't reproduce this anymore, but yesterday each process was using
10-11% CPU. The docker top output above is from today, while the logs below
are from yesterday, which is why the WAL sequence numbers differ slightly.)
In our logs, we see this every second or so:
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2015-09-15T20:02:45.810760-00 pid=31029
wal_e.operator.backup INFO MSG: begin wal restore
STRUCTURED: time=2015-09-15T20:02:50.216222-00 pid=31029
action=wal-fetch key=swift://host/wal_005/000000010000000000000054.lzo
prefix=prefix/ seg=000000010000000000000054 state=begin
requests.packages.urllib3.connectionpool INFO Starting new HTTP
connection (1):
requests.packages.urllib3.connectionpool INFO Starting new HTTP
connection (1):
swiftclient INFO REQ: curl -i
http://swift//wal_005/000000010000000000000054.lzo -X GET -H "X-Auth-Token:
AUTH_"
swiftclient INFO RESP STATUS: 404 Not Found
swiftclient INFO RESP HEADERS: [('date', 'Tue, 15 Sep 2015 20:02:51
GMT'), ('content-length', '70'), ('content-type', 'text/html;
charset=UTF-8'), ('x-trans-id', '')]
swiftclient INFO RESP BODY: <html><h1>Not Found</h1><p>The resource
could not be found.</p></html>
swiftclient ERROR Object GET failed:
wal_005/000000010000000000000054.lzo 404 Not Found [first 60 chars of
response] <html><h1>Not Found</h1><p>The resource could not be found.<
Traceback (most recent call last):
File
"/usr/local/lib/python2.7/dist-packages/swiftclient/client.py", line 1390,
in _retry
service_token=self.service_token, **kwargs)
File
"/usr/local/lib/python2.7/dist-packages/swiftclient/client.py", line 949,
in get_object
http_response_content=body)
ClientException: Object GET failed:
wal_005/000000010000000000000054.lzo 404 Not Found [first 60 chars of
response] <html><h1>Not Found</h1><p>The resource could not be found.<
wal_e.blobstore.swift.utils WARNING MSG: could no longer locate object
while performing wal restore
DETAIL: The absolute URI that could not be located is
swift://wal_005/000000010000000000000054.lzo.
HINT: This can be normal when Postgres is trying to detect what
timelines are available during restoration.
STRUCTURED: time=2015-09-15T20:02:51.271368-00 pid=31029
lzop: <stdin>: not a lzop file
wal_e.operator.backup INFO MSG: complete wal restore
STRUCTURED: time=2015-09-15T20:02:51.372213-00 pid=31029
action=wal-fetch key=wal_005/000000010000000000000054.lzo prefix=/
seg=000000010000000000000054 state=complete
That object (*54.lzo) does not yet exist in Swift; however, the master
eventually creates and uploads it, the hot standby downloads it, happily
applies it, and then starts looking for *55.lzo.
I think that's all expected -- but I can't nail down what's using so much
CPU if it's only making an HTTP request every second. One thing I did notice
is that the PIDs are constantly changing. Maybe each failed fetch is
resulting in new workers being spun up?
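To test that worker-churn idea, I threw together a quick sketch that samples the wal-fetch PIDs twice and reports which ones exited or appeared in between. The walfetch_pids helper and the 2-second window are my own invention, not part of wal-e:

```python
import subprocess
import time

def walfetch_pids():
    """Return the set of PIDs of running wal-e wal-fetch processes (empty if none)."""
    out = subprocess.run(["ps", "-eo", "pid,args"],
                         capture_output=True, text=True).stdout
    return {int(line.split(None, 1)[0])
            for line in out.splitlines()[1:]   # skip the ps header row
            if "wal-fetch" in line}

before = walfetch_pids()
time.sleep(2)  # sample again a couple of seconds later
after = walfetch_pids()
print("exited:", sorted(before - after))
print("spawned:", sorted(after - before))
```

If the "exited"/"spawned" sets are non-empty on every sample, that would confirm the workers are being torn down and respawned on each retry.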
Is this expected? Should it be taking this much CPU time? Short of
reducing the number of prefetchers (most of them aren't actually prefetching
anyway), is there any action I can take?
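For what it's worth, the only knob I've found so far is wal-fetch's --prefetch option. Assuming the wal-e version we're running supports it, something like this in recovery.conf should disable prefetching entirely:

```
# --prefetch 0 disables prefetch workers (assumes wal-fetch accepts --prefetch)
restore_command = 'envdir /etc/wal-e.d-READ/env wal-e wal-fetch --prefetch 0 "%f" "%p"'
```

I haven't yet confirmed whether this actually brings the CPU usage down.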
Thank you for your time,
Travis
--
You received this message because you are subscribed to the Google Groups
"wal-e" group.