Hello,
We're running Postgres with wal-e inside LXC containers, trying to set up a
master with a hot standby.
WAL segments are successfully being uploaded by the master and fetched by the
hot standby -- however, the standby is using a large amount of CPU.
The relevant sections of our postgresql.conf (same on both instances):
archive_command = 'test ! -d /etc/wal-e.d-WRITE/env || envdir /etc/wal-e.d-WRITE/env wal-e wal-push %p'
archive_mode = on
archive_timeout = 1min
hot_standby = on
hot_standby_feedback = on
max_wal_senders = 20
wal_buffers = 16MB
wal_keep_segments = 5
wal_level = 'hot_standby'
wal_receiver_timeout = 1min
wal_sender_timeout = 1min
wal_sync_method = 'fdatasync'
and the recovery.conf (only on the hot_standby):
standby_mode = on
restore_command = 'envdir /etc/wal-e.d-READ/env wal-e wal-fetch "%f" "%p"'
recovery_end_command = 'test ! -d /etc/wal-e.d-WRITE/env || envdir /etc/wal-e.d-WRITE/env wal-e backup-push /var/lib/postgresql/data LATEST'
trigger_file = '/tmp/standby_off'
Running docker top, we can see that wal-e is using a lot of CPU:
# docker top container axk-%cpu -o pid,%cpu,command
PID %CPU COMMAND
51161 5.4 /usr/bin/python
/usr/local/bin/wal-e wal-prefetch /var/lib/postgresql/data/pg_xlog
00000001000000000000005E
51441 5.2 /usr/bin/python
/usr/local/bin/wal-e wal-fetch 00000001000000000000005A pg_xlog/RECOVERYXLOG
51449 5.2 /usr/bin/python
/usr/local/bin/wal-e wal-fetch 00000001000000000000005A pg_xlog/RECOVERYXLOG
51500 5.0 /usr/bin/python
/usr/local/bin/wal-e wal-fetch 00000001000000000000005A pg_xlog/RECOVERYXLOG
51728 3.0 /usr/bin/python
/usr/local/bin/wal-e wal-fetch 00000001000000000000005A pg_xlog/RECOVERYXLOG
51731 3.0 /usr/bin/python
/usr/local/bin/wal-e wal-fetch 00000001000000000000005A pg_xlog/RECOVERYXLOG
51732 3.0 /usr/bin/python
/usr/local/bin/wal-e wal-fetch 00000001000000000000005A pg_xlog/RECOVERYXLOG
51733 2.0 /usr/bin/python
/usr/local/bin/wal-e wal-fetch 00000001000000000000005A pg_xlog/RECOVERYXLOG
51735 2.0 /usr/bin/python
/usr/local/bin/wal-e wal-fetch 00000001000000000000005A pg_xlog/RECOVERYXLOG
51738 2.0 /usr/bin/python
/usr/local/bin/wal-e wal-fetch 00000001000000000000005A pg_xlog/RECOVERYXLOG
50632 0.0 postgres: (42039) idle
51497 0.0 sh -c envdir /etc/wal-e.d-READ/env
wal-e wal-fetch "00000001000000000000005A" "pg_xlog/RECOVERYXLOG"
64809 0.0 /bin/bash /docker-entrypoint.sh
postgres
65153 0.0 postgres
65158 0.0 postgres: logger process
65159 0.0 postgres: startup process waiting
for 00000001000000000000005A
65301 0.0 postgres: checkpointer process
65302 0.0 postgres: writer process
65304 0.0 postgres: stats collector process
As the output shows, the wal-e processes account for nearly all of the CPU usage.
(Note: I can't reproduce this anymore, but yesterday each process was using
10-11% CPU. The docker top output above is from today, while the logs below
are from yesterday, which is why the WAL sequence numbers differ slightly.)
In our logs, we see this every second or so:
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2015-09-15T20:02:45.810760-00 pid=31029
wal_e.operator.backup INFO MSG: begin wal restore
STRUCTURED: time=2015-09-15T20:02:50.216222-00 pid=31029
action=wal-fetch key=swift://host/wal_005/000000010000000000000054.lzo
prefix=prefix/ seg=000000010000000000000054 state=begin
requests.packages.urllib3.connectionpool INFO Starting new HTTP
connection (1):
requests.packages.urllib3.connectionpool INFO Starting new HTTP
connection (1):
swiftclient INFO REQ: curl -i
http://swift//wal_005/000000010000000000000054.lzo -X GET -H "X-Auth-Token:
AUTH_"
swiftclient INFO RESP STATUS: 404 Not Found
swiftclient INFO RESP HEADERS: [('date', 'Tue, 15 Sep 2015 20:02:51
GMT'), ('content-length', '70'), ('content-type', 'text/html;
charset=UTF-8'), ('x-trans-id', '')]
swiftclient INFO RESP BODY: <html><h1>Not Found</h1><p>The resource
could not be found.</p></html>
swiftclient ERROR Object GET failed:
wal_005/000000010000000000000054.lzo 404 Not Found [first 60 chars of
response] <html><h1>Not Found</h1><p>The resource could not be found.<
Traceback (most recent call last):
File
"/usr/local/lib/python2.7/dist-packages/swiftclient/client.py", line 1390,
in _retry
service_token=self.service_token, **kwargs)
File
"/usr/local/lib/python2.7/dist-packages/swiftclient/client.py", line 949,
in get_object
http_response_content=body)
ClientException: Object GET failed:
wal_005/000000010000000000000054.lzo 404 Not Found [first 60 chars of
response] <html><h1>Not Found</h1><p>The resource could not be found.<
wal_e.blobstore.swift.utils WARNING MSG: could no longer locate object
while performing wal restore
DETAIL: The absolute URI that could not be located is
swift://wal_005/000000010000000000000054.lzo.
HINT: This can be normal when Postgres is trying to detect what
timelines are available during restoration.
STRUCTURED: time=2015-09-15T20:02:51.271368-00 pid=31029
lzop: <stdin>: not a lzop file
wal_e.operator.backup INFO MSG: complete wal restore
STRUCTURED: time=2015-09-15T20:02:51.372213-00 pid=31029
action=wal-fetch key=wal_005/000000010000000000000054.lzo prefix=/
seg=000000010000000000000054 state=complete
That object (*54.lzo) does not yet exist in Swift; however, the master
eventually creates and uploads it, the hot standby downloads it, happily
applies it, and then starts looking for *55.lzo.
I think that's all expected -- but I can't nail down what's using so much
CPU if it's only making an HTTP request every second. One thing I did notice
is that the PIDs are constantly changing. Maybe each failed fetch is
resulting in new workers being spun up?
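To test that worker-churn idea, I threw together a quick sketch that samples the wal-fetch PIDs twice and reports which ones exited or appeared in between. The walfetch_pids helper and the 2-second window are my own invention, not part of wal-e:

```python
import subprocess
import time

def walfetch_pids():
    """Return the set of PIDs of running wal-e wal-fetch processes (empty if none)."""
    out = subprocess.run(["ps", "-eo", "pid,args"],
                         capture_output=True, text=True).stdout
    return {int(line.split(None, 1)[0])
            for line in out.splitlines()[1:]   # skip the ps header row
            if "wal-fetch" in line}

before = walfetch_pids()
time.sleep(2)  # sample again a couple of seconds later
after = walfetch_pids()
print("exited:", sorted(before - after))
print("spawned:", sorted(after - before))
```

If the "exited"/"spawned" sets are non-empty on every sample, that would confirm the workers are being torn down and respawned on each retry.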
Is this expected? Should it be taking this much CPU time? Short of
reducing the number of prefetchers (most of them aren't actually prefetching
anyway), is there any action I can take?
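For what it's worth, the only knob I've found so far is wal-fetch's --prefetch option. Assuming the wal-e version we're running supports it, something like this in recovery.conf should disable prefetching entirely:

```
# --prefetch 0 disables prefetch workers (assumes wal-fetch accepts --prefetch)
restore_command = 'envdir /etc/wal-e.d-READ/env wal-e wal-fetch --prefetch 0 "%f" "%p"'
```

I haven't yet confirmed whether this actually brings the CPU usage down.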
Thank you for your time,
Travis
--
You received this message because you are subscribed to the Google Groups
"wal-e" group.