Re: [PERFORM] Recovery will take 10 hours

2006-04-24 Thread Simon Riggs
On Sun, 2006-04-23 at 22:46 -0600, Brendan Duddridge wrote:

> So how do you overlap the restore process with the retrieving of files?

The restore command can be *anything*. You just write a script...

> Our restore command is:
> 
> restore_command = 'gunzip %p'
> 
> If I change it to:
> 
> restore_command = 'gunzip %p &'
> 
> to execute the restore command in the background, will that do the  
> trick?

No, but you can execute a shell script that does use & internally.

> But I don't think the real problem was the retrieval of the files. It  
> only
> took maybe 1/2 a second to retrieve the file, but often took anywhere  
> from
> 5 to 30 seconds to process the file. More so on the longer end of the  
> scale.

Sorry, thought you meant the decompression time.

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com/


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Recovery will take 10 hours

2006-04-24 Thread Markus Schaber
Hi, Brandan,

Brendan Duddridge wrote:

> So how do you overlap the restore process with the retrieving of files?

You need a shell script as restore command that does both uncompressing
the current file, and starting a background decompress of the next
file(s). It also has to check whether the current file is already in
progress from a last run, and wait until this is finished instead of
decompressing it. Seems to be a little complicated than it sounds first.

> restore_command = 'gunzip %p &'

Warning: Don't do it this way!

It will break things because PostgreSQL will try to access a
not-completely-restored wal file.


HTH,
Markus
-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Recovery will take 10 hours

2006-04-23 Thread Brendan Duddridge

Hi Simon,

The backup with 3120 WAL files was a 2 day old base backup. We've moved
to a 1 day base backup now, but that'll still be 1600 WALs or so a day.
That will probably take 5 hours to restore I suspect. Unless we move to
2 or more base backups per day. That seems crazy though.

So how do you overlap the restore process with the retrieving of files?

Our restore command is:

restore_command = 'gunzip %p'

If I change it to:

restore_command = 'gunzip %p &'

to execute the restore command in the background, will that do the  
trick?


But I don't think the real problem was the retrieval of the files. It  
only
took maybe 1/2 a second to retrieve the file, but often took anywhere  
from
5 to 30 seconds to process the file. More so on the longer end of the  
scale.


Thanks,


Brendan Duddridge | CTO | 403-277-5591 x24 |  [EMAIL PROTECTED]

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Apr 23, 2006, at 10:06 AM, Simon Riggs wrote:


On Thu, 2006-04-20 at 13:29 -0600, Brendan Duddridge wrote:


We had a database issue today that caused us to have to restore to
our most recent backup. We are using PITR so we have 3120 WAL files
that need to be applied to the database.


How often are you taking base backups?


After 45 minutes, it has restored only 230 WAL files. At this rate,
it's going to take about 10 hours to restore our database.



Most of the time, the server is not using very much CPU time or I/O
time. So I'm wondering what can be done to speed up the process?


You can improve the performance of a recovery by making your restore
command overlap retrieval of files. The recovery process waits  
while the
restore command returns a file, so by doing an asynchronous  
lookahead of

one file you can be gunzipping the next file while the current one is
being processed.

I'll either document this better, or build an overlap into the restore
command processing itself, so the script doesn't need to do this.

--
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com/


---(end of  
broadcast)---

TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that  
your

   message can get through to the mailing list cleanly





---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] Recovery will take 10 hours

2006-04-23 Thread Simon Riggs
On Thu, 2006-04-20 at 13:29 -0600, Brendan Duddridge wrote:

> We had a database issue today that caused us to have to restore to  
> our most recent backup. We are using PITR so we have 3120 WAL files  
> that need to be applied to the database.

How often are you taking base backups?

> After 45 minutes, it has restored only 230 WAL files. At this rate,  
> it's going to take about 10 hours to restore our database.

> Most of the time, the server is not using very much CPU time or I/O  
> time. So I'm wondering what can be done to speed up the process?

You can improve the performance of a recovery by making your restore
command overlap retrieval of files. The recovery process waits while the
restore command returns a file, so by doing an asynchronous lookahead of
one file you can be gunzipping the next file while the current one is
being processed.

I'll either document this better, or build an overlap into the restore
command processing itself, so the script doesn't need to do this.

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com/


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Recovery will take 10 hours

2006-04-20 Thread Gavin Sherry
On Thu, 20 Apr 2006, Brendan Duddridge wrote:

> Hi Tomas,
>
> Hmm... ktrace -p PID -c returns immediately without doing anything
> unless I've previously done a ktrace -p PID.
>
> According to the man page for ktrace's -c flag:
>-c  Clear the trace points associated with the specified file
> or processes.

On other systems, strace/truss with -c produces a list of sys calls with
the number of times they've been called in the elapsed period.

To answer your other question, temporarily disabling fsync during the
recovery should speed it up.

For future reference, processing thousands of WAL files for recovery is
not ideal. You should be doing a base backup much more often.

Gavin

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Recovery will take 10 hours

2006-04-20 Thread Brendan Duddridge

Hi Tomas,

Hmm... ktrace -p PID -c returns immediately without doing anything  
unless I've previously done a ktrace -p PID.


According to the man page for ktrace's -c flag:
  -c  Clear the trace points associated with the specified file  
or processes.


When I run ktrace on OS X Server 10.4.6 it returns to the console  
immediately, however the ktrace.out file gets larger and larger until  
I issue another ktrace command with the -c flag. It never sits  
waiting for keyboard input.



I haven't been able to find any way of generating the stats yet. The  
man page for ktrace or kdump doesn't mention anything about stats.



Thanks,



Brendan Duddridge | CTO | 403-277-5591 x24 |  [EMAIL PROTECTED]

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Apr 20, 2006, at 6:51 PM, Tomas Vondra wrote:

So I tried ktrace -p PID and it created a really big file. I had  
to do

'ktrace -p PID -c' to get it to stop.

The ktrace.out file is read using kdump, but there's a lot of binary
data in there intermixed with some system calls.


Yes, that's what (s|k)trace does - it attaches to the process, and
prints out all the system calls, parameters, return values etc. That
gives you "exact" overview of what's going on in the program, but it's
a little bit confusing if you are not familiar with that and/or you're
in a hurry.

But Luke Lonergan offered a '-c' switch, which gives you a statistics
of the used system calls. This way you can see number of calls for
individual syscalls and time spent in them. That could give you a hint
why the process is so slow (for example there can be an I/O bottleneck
or something like that).

Just do 'ktrace -p PID -c' for about 30 seconds, then 'Ctrl-C' and  
post

the output to this mailing list.

t.v.





---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Recovery will take 10 hours

2006-04-20 Thread Brendan Duddridge

Hi Tom,

Well, we started the restore back up with the WAL archives copied to  
our local disk.


It's going at about the same pace as with the restore over NFS.

So I tried ktrace -p PID and it created a really big file. I had to  
do 'ktrace -p PID -c' to get it to stop.


The ktrace.out file is read using kdump, but there's a lot of binary  
data in there intermixed with some system calls.


For example:

15267 postgres RET   read 8192/0x2000
15267 postgres CALL  lseek(153,0,2)
15267 postgres RET   lseek 0
15267 postgres CALL  lseek(127,0,2)
15267 postgres RET   lseek 0
15267 postgres CALL  lseek(138,0,2)
15267 postgres RET   lseek 0
15267 postgres CALL  lseek(153,0,2)
15267 postgres RET   lseek 0
15267 postgres CALL  lseek(127,0,2)
15267 postgres RET   lseek 0
15267 postgres CALL  read(5,25225728,8192)
15267 postgres GIO   fd 5 read 8192 bytes
   "\M-P]\0\^A\0\0\0\^A\0\0\^A\M^H,\M-5`\0\0\0\^C\M-6r fill,  
polyester has a subtle sheen, machine wash\0\0\0Xreverses to\
 solid colour, polyester fill, polyester has a subtle sheen,  
machine wash\^_\^Y7\M-3\0\0\0\0\0\0\0\0\0\0\0\0\0\0oG\0\

\b\0\^[)\^C \M^Or\M-#\^B\0\0\0\0\0A\M-&\M-]

... lots of data 

\M^K$\0\0\0\fcomplete\0\0\0HCustom-width Valanceless  
Aluminum Mini Blinds 37 1/4-44" w. x 48" l.\0\0\0\M-P1" aluminum\
 slats, valanceless headrail and matching bottom rail,  
hidden brackets, clear acrylic tilt wand, extra slats with rou\
te holes in the back, can be cut down to minimum width of  
14", hardware. . .\0\0\^Aq1" aluminum slats, valanceless he\
adrail and matching bottom rail, hidden brackets, clear  
acrylic tilt wand, extra slats with route holes in the back, \
can be cut down to minimum width of 14", hardware and  
instructions included, wipe with a dam"

15267 postgres RET   read 8192/0x2000
15267 postgres CALL  lseek(138,0,2)
15267 postgres RET   lseek 0
15267 postgres CALL  lseek(158,317251584,0)
15267 postgres RET   lseek 0
15267 postgres CALL  write(158,35286464,8192)
15267 postgres GIO   fd 158 wrote 8192 bytes
   "[EMAIL PROTECTED],[EMAIL PROTECTED] \^C?\M^X 
[EMAIL PROTECTED]@[EMAIL PROTECTED]>[EMAIL PROTECTED]@$>[EMAIL PROTECTED]>[EMAIL PROTECTED]@$=\
[EMAIL PROTECTED]@$"[EMAIL PROTECTED]@[EMAIL PROTECTED]<[EMAIL PROTECTED]<[EMAIL PROTECTED]@$<8 
[EMAIL PROTECTED];[EMAIL PROTECTED];\M-([EMAIL PROTECTED];[EMAIL PROTECTED];[EMAIL PROTECTED]:[EMAIL PROTECTED]:\M^H\


etc...

I'm not sure that really tells me anything though other than the WAL  
archives don't actually archive SQL, but store only the database  
changes.



Thanks,


Brendan Duddridge | CTO | 403-277-5591 x24 |  [EMAIL PROTECTED]

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Apr 20, 2006, at 3:19 PM, Tom Lane wrote:


Brendan Duddridge <[EMAIL PROTECTED]> writes:

Do you mean do a kill -QUIT on the postgres process in order to
generate a stack trace?


Not at all!  I'm talking about tracing the kernel calls it's making.
Depending on your platform, the tool for this is called strace,
ktrace, truss, or maybe even just trace.  With strace you'd do
something like

strace -p PID-of-process 2>outfile
... wait 30 sec or so ...
control-C

Not sure about the APIs for the others but they're probably roughly
similar ... read the man page ...

regards, tom lane

---(end of  
broadcast)---

TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match





---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Recovery will take 10 hours

2006-04-20 Thread Brendan Duddridge

Thanks Tom,

We are storing only the WAL archives on the NFS volume. It must have  
been a hiccup in the NFS mount. Jeff Frost asked if we were using  
hard or soft mounts. We were using soft mounts, so that may be where  
the problem lies with the PANIC.


Is it better to use the boot volume of the database machine for  
archiving our WAL files instead of over the NFS mount? I'm sure it's  
probably not a good idea to archive to the same volume as the pg_xlog  
directory, so that's why I thought maybe using the boot drive would  
be better. We'll just have to make sure we don't fill up the drive.  
Although I know that PostgreSQL often writes to the /data directory  
that is located on the boot drive. It might not be good to start  
archiving there. Our table spaces are on a separate RAID.


If we need to restore in the future we'll just have to copy the WAL  
files from the boot drive of our database machine over the NFS to the  
restore machine.


Thanks,


Brendan Duddridge | CTO | 403-277-5591 x24 |  [EMAIL PROTECTED]

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Apr 20, 2006, at 5:29 PM, Tom Lane wrote:


Brendan Duddridge <[EMAIL PROTECTED]> writes:

Oops... forgot to mention that both files that postgres said were
missing are in fact there:


Please place the blame where it should fall: it's your archive restore
command that's telling postgres that.


There didn't seem to be any issues with the NFS mount. Perhaps it
briefly disconnected and came back right away.


Unstable NFS mounts are Really Bad News.  You shouldn't be expecting
to run a stable database atop such a thing.

If it's not the database but only the WAL archive that's NFS'd, it  
might

be possible to live with it, but you'll need to put some defenses into
your archive restore script to cope with such events.

As far as restarting goes: I think you can restart from here without
first redoing your base-backup restore, but as previously noted it'll
still read through the same WAL files it looked at before.  You won't
save much except the time to redo the base restore.

regards, tom lane





---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Recovery will take 10 hours

2006-04-20 Thread Tom Lane
Brendan Duddridge <[EMAIL PROTECTED]> writes:
> Oops... forgot to mention that both files that postgres said were  
> missing are in fact there:

Please place the blame where it should fall: it's your archive restore
command that's telling postgres that.

> There didn't seem to be any issues with the NFS mount. Perhaps it  
> briefly disconnected and came back right away.

Unstable NFS mounts are Really Bad News.  You shouldn't be expecting
to run a stable database atop such a thing.

If it's not the database but only the WAL archive that's NFS'd, it might
be possible to live with it, but you'll need to put some defenses into
your archive restore script to cope with such events.

As far as restarting goes: I think you can restart from here without
first redoing your base-backup restore, but as previously noted it'll
still read through the same WAL files it looked at before.  You won't
save much except the time to redo the base restore.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Recovery will take 10 hours

2006-04-20 Thread Brendan Duddridge

Well our restore command is pretty basic:

restore_command = 'gunzip %p'

I'm not sure why that would succeed then fail.


Brendan Duddridge | CTO | 403-277-5591 x24 |  [EMAIL PROTECTED]

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Apr 20, 2006, at 5:20 PM, Tom Lane wrote:


Brendan Duddridge <[EMAIL PROTECTED]> writes:

However, as I just finished posting to the list, the process died
with a PANIC error:



[2006-04-20 16:41:28 MDT] LOG: restored log file
"0001018F0034" from archive
[2006-04-20 16:41:35 MDT] LOG: restored log file
"0001018F0035" from archive
[2006-04-20 16:41:38 MDT] LOG: restored log file
"0001018F0036" from archive
sh: line 1: /wal_archive/0001018F0037.gz: No such file or
directory
[2006-04-20 16:41:46 MDT] LOG: could not open file "pg_xlog/
0001018F0037" (log file 399, segment 55): No such file or
directory
[2006-04-20 16:41:46 MDT] LOG: redo done at 18F/36FFF254
sh: line 1: /wal_archive/0001018F0036.gz: No such file or
directory
[2006-04-20 16:41:46 MDT] PANIC: could not open file "pg_xlog/
0001018F0036" (log file 399, segment 54): No such file or
directory


This looks to me like a bug in your archive restore command.  It had
just finished providing 0001018F0036 at 16:41:38, why was
it not able to do so again at 16:41:46?

regards, tom lane

---(end of  
broadcast)---

TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that  
your

   message can get through to the mailing list cleanly





---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Recovery will take 10 hours

2006-04-20 Thread Jeff Frost


Brendan,

Is your NFS share mounted hard or soft?  Do you have space to copy the files 
locally?  I suspect you're seeing NFS slowness in your restore since you 
aren't using much in the way of disk IO or CPU.


-Jeff

On Thu, 20 Apr 2006, Brendan Duddridge wrote:

Oops... forgot to mention that both files that postgres said were missing are 
in fact there:


A partial listing from our wal_archive directory:

-rw--- 1 postgres staff 4971129 Apr 19 20:08 0001018F0036.gz
-rw--- 1 postgres staff 4378284 Apr 19 20:09 0001018F0037.gz

There didn't seem to be any issues with the NFS mount. Perhaps it briefly 
disconnected and came back right away.



Thanks!



Brendan Duddridge | CTO | 403-277-5591 x24 |  [EMAIL PROTECTED]

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Apr 20, 2006, at 5:11 PM, Brendan Duddridge wrote:


Hi Jeff,

The WAL files are stored on a separate server and accessed through an NFS 
mount located at /wal_archive.


However, the restore failed about 5 hours in after we got this error:

[2006-04-20 16:41:28 MDT] LOG: restored log file "0001018F0034" 
from archive
[2006-04-20 16:41:35 MDT] LOG: restored log file "0001018F0035" 
from archive
[2006-04-20 16:41:38 MDT] LOG: restored log file "0001018F0036" 
from archive
sh: line 1: /wal_archive/0001018F0037.gz: No such file or 
directory
[2006-04-20 16:41:46 MDT] LOG: could not open file 
"pg_xlog/0001018F0037" (log file 399, segment 55): No such file 
or directory

[2006-04-20 16:41:46 MDT] LOG: redo done at 18F/36FFF254
sh: line 1: /wal_archive/0001018F0036.gz: No such file or 
directory
[2006-04-20 16:41:46 MDT] PANIC: could not open file 
"pg_xlog/0001018F0036" (log file 399, segment 54): No such file 
or directory
[2006-04-20 16:41:46 MDT] LOG: startup process (PID 9190) was terminated by 
signal 6
[2006-04-20 16:41:46 MDT] LOG: aborting startup due to startup process 
failure

[2006-04-20 16:41:46 MDT] LOG: logger shutting down



The /wal_archive/0001018F0037.gz is there accessible on the NFS 
mount.


Is there a way to continue the restore process from where it left off?

Thanks,


Brendan Duddridge | CTO | 403-277-5591 x24 |  [EMAIL PROTECTED]

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Apr 20, 2006, at 3:19 PM, Jeff Frost wrote:


On Thu, 20 Apr 2006, Brendan Duddridge wrote:


Hi,

We had a database issue today that caused us to have to restore to our 
most recent backup. We are using PITR so we have 3120 WAL files that need 
to be applied to the database.


After 45 minutes, it has restored only 230 WAL files. At this rate, it's 
going to take about 10 hours to restore our database.


Most of the time, the server is not using very much CPU time or I/O time. 
So I'm wondering what can be done to speed up the process?


Brendan,

Where are the WAL files being stored and how are they being read back?

--
Jeff Frost, Owner   <[EMAIL PROTECTED]>
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly





---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

 http://www.postgresql.org/docs/faq





---(end of broadcast)---
TIP 6: explain analyze is your friend



--
Jeff Frost, Owner   <[EMAIL PROTECTED]>
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] Recovery will take 10 hours

2006-04-20 Thread Tom Lane
Brendan Duddridge <[EMAIL PROTECTED]> writes:
> However, as I just finished posting to the list, the process died  
> with a PANIC error:

> [2006-04-20 16:41:28 MDT] LOG: restored log file  
> "0001018F0034" from archive
> [2006-04-20 16:41:35 MDT] LOG: restored log file  
> "0001018F0035" from archive
> [2006-04-20 16:41:38 MDT] LOG: restored log file  
> "0001018F0036" from archive
> sh: line 1: /wal_archive/0001018F0037.gz: No such file or  
> directory
> [2006-04-20 16:41:46 MDT] LOG: could not open file "pg_xlog/ 
> 0001018F0037" (log file 399, segment 55): No such file or  
> directory
> [2006-04-20 16:41:46 MDT] LOG: redo done at 18F/36FFF254
> sh: line 1: /wal_archive/0001018F0036.gz: No such file or  
> directory
> [2006-04-20 16:41:46 MDT] PANIC: could not open file "pg_xlog/ 
> 0001018F0036" (log file 399, segment 54): No such file or  
> directory

This looks to me like a bug in your archive restore command.  It had
just finished providing 0001018F0036 at 16:41:38, why was
it not able to do so again at 16:41:46?

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Recovery will take 10 hours

2006-04-20 Thread Brendan Duddridge
Oops... forgot to mention that both files that postgres said were  
missing are in fact there:


A partial listing from our wal_archive directory:

-rw--- 1 postgres staff 4971129 Apr 19 20:08  
0001018F0036.gz
-rw--- 1 postgres staff 4378284 Apr 19 20:09  
0001018F0037.gz


There didn't seem to be any issues with the NFS mount. Perhaps it  
briefly disconnected and came back right away.



Thanks!



Brendan Duddridge | CTO | 403-277-5591 x24 |  [EMAIL PROTECTED]

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Apr 20, 2006, at 5:11 PM, Brendan Duddridge wrote:


Hi Jeff,

The WAL files are stored on a separate server and accessed through  
an NFS mount located at /wal_archive.


However, the restore failed about 5 hours in after we got this error:

[2006-04-20 16:41:28 MDT] LOG: restored log file  
"0001018F0034" from archive
[2006-04-20 16:41:35 MDT] LOG: restored log file  
"0001018F0035" from archive
[2006-04-20 16:41:38 MDT] LOG: restored log file  
"0001018F0036" from archive
sh: line 1: /wal_archive/0001018F0037.gz: No such file  
or directory
[2006-04-20 16:41:46 MDT] LOG: could not open file "pg_xlog/ 
0001018F0037" (log file 399, segment 55): No such file  
or directory

[2006-04-20 16:41:46 MDT] LOG: redo done at 18F/36FFF254
sh: line 1: /wal_archive/0001018F0036.gz: No such file  
or directory
[2006-04-20 16:41:46 MDT] PANIC: could not open file "pg_xlog/ 
0001018F0036" (log file 399, segment 54): No such file  
or directory
[2006-04-20 16:41:46 MDT] LOG: startup process (PID 9190) was  
terminated by signal 6
[2006-04-20 16:41:46 MDT] LOG: aborting startup due to startup  
process failure

[2006-04-20 16:41:46 MDT] LOG: logger shutting down



The /wal_archive/0001018F0037.gz is there accessible on  
the NFS mount.


Is there a way to continue the restore process from where it left off?

Thanks,


Brendan Duddridge | CTO | 403-277-5591 x24 |  [EMAIL PROTECTED]

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Apr 20, 2006, at 3:19 PM, Jeff Frost wrote:


On Thu, 20 Apr 2006, Brendan Duddridge wrote:


Hi,

We had a database issue today that caused us to have to restore  
to our most recent backup. We are using PITR so we have 3120 WAL  
files that need to be applied to the database.


After 45 minutes, it has restored only 230 WAL files. At this  
rate, it's going to take about 10 hours to restore our database.


Most of the time, the server is not using very much CPU time or I/ 
O time. So I'm wondering what can be done to speed up the process?


Brendan,

Where are the WAL files being stored and how are they being read  
back?


--
Jeff Frost, Owner   <[EMAIL PROTECTED]>
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of  
broadcast)---

TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that  
your

  message can get through to the mailing list cleanly





---(end of  
broadcast)---

TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq





---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Recovery will take 10 hours

2006-04-20 Thread Brendan Duddridge

Hi Tom,

I found it... it's called ktrace on OS X Server.

However, as I just finished posting to the list, the process died  
with a PANIC error:


[2006-04-20 16:41:28 MDT] LOG: restored log file  
"0001018F0034" from archive
[2006-04-20 16:41:35 MDT] LOG: restored log file  
"0001018F0035" from archive
[2006-04-20 16:41:38 MDT] LOG: restored log file  
"0001018F0036" from archive
sh: line 1: /wal_archive/0001018F0037.gz: No such file or  
directory
[2006-04-20 16:41:46 MDT] LOG: could not open file "pg_xlog/ 
0001018F0037" (log file 399, segment 55): No such file or  
directory

[2006-04-20 16:41:46 MDT] LOG: redo done at 18F/36FFF254
sh: line 1: /wal_archive/0001018F0036.gz: No such file or  
directory
[2006-04-20 16:41:46 MDT] PANIC: could not open file "pg_xlog/ 
0001018F0036" (log file 399, segment 54): No such file or  
directory
[2006-04-20 16:41:46 MDT] LOG: startup process (PID 9190) was  
terminated by signal 6
[2006-04-20 16:41:46 MDT] LOG: aborting startup due to startup  
process failure

[2006-04-20 16:41:46 MDT] LOG: logger shutting down


Would turning off fsync make it go faster? Maybe it won't take 10  
hours again if we start from scratch.


Also, what if we did just start it up again? Will postgres realize  
that the existing wal_archive files have already been processed and  
just skip along until it finds one it hasn't processed yet?


Thanks,


Brendan Duddridge | CTO | 403-277-5591 x24 |  [EMAIL PROTECTED]

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Apr 20, 2006, at 3:19 PM, Tom Lane wrote:


Brendan Duddridge <[EMAIL PROTECTED]> writes:

Do you mean do a kill -QUIT on the postgres process in order to
generate a stack trace?


Not at all!  I'm talking about tracing the kernel calls it's making.
Depending on your platform, the tool for this is called strace,
ktrace, truss, or maybe even just trace.  With strace you'd do
something like

strace -p PID-of-process 2>outfile
... wait 30 sec or so ...
control-C

Not sure about the APIs for the others but they're probably roughly
similar ... read the man page ...

regards, tom lane

---(end of  
broadcast)---

TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match





---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] Recovery will take 10 hours

2006-04-20 Thread Brendan Duddridge

Hi Jeff,

The WAL files are stored on a separate server and accessed through an  
NFS mount located at /wal_archive.


However, the restore failed about 5 hours in after we got this error:

[2006-04-20 16:41:28 MDT] LOG: restored log file  
"0001018F0034" from archive
[2006-04-20 16:41:35 MDT] LOG: restored log file  
"0001018F0035" from archive
[2006-04-20 16:41:38 MDT] LOG: restored log file  
"0001018F0036" from archive
sh: line 1: /wal_archive/0001018F0037.gz: No such file or  
directory
[2006-04-20 16:41:46 MDT] LOG: could not open file "pg_xlog/ 
0001018F0037" (log file 399, segment 55): No such file or  
directory

[2006-04-20 16:41:46 MDT] LOG: redo done at 18F/36FFF254
sh: line 1: /wal_archive/0001018F0036.gz: No such file or  
directory
[2006-04-20 16:41:46 MDT] PANIC: could not open file "pg_xlog/ 
0001018F0036" (log file 399, segment 54): No such file or  
directory
[2006-04-20 16:41:46 MDT] LOG: startup process (PID 9190) was  
terminated by signal 6
[2006-04-20 16:41:46 MDT] LOG: aborting startup due to startup  
process failure

[2006-04-20 16:41:46 MDT] LOG: logger shutting down



The /wal_archive/0001018F0037.gz is there accessible on  
the NFS mount.


Is there a way to continue the restore process from where it left off?

Thanks,


Brendan Duddridge | CTO | 403-277-5591 x24 |  [EMAIL PROTECTED]

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Apr 20, 2006, at 3:19 PM, Jeff Frost wrote:


On Thu, 20 Apr 2006, Brendan Duddridge wrote:


Hi,

We had a database issue today that caused us to have to restore to  
our most recent backup. We are using PITR so we have 3120 WAL  
files that need to be applied to the database.


After 45 minutes, it has restored only 230 WAL files. At this  
rate, it's going to take about 10 hours to restore our database.


Most of the time, the server is not using very much CPU time or I/ 
O time. So I'm wondering what can be done to speed up the process?


Brendan,

Where are the WAL files being stored and how are they being read back?

--
Jeff Frost, Owner   <[EMAIL PROTECTED]>
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of  
broadcast)---

TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that  
your

  message can get through to the mailing list cleanly





---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Recovery will take 10 hours

2006-04-20 Thread Jeff Frost

On Thu, 20 Apr 2006, Brendan Duddridge wrote:


Hi,

We had a database issue today that caused us to have to restore to our most 
recent backup. We are using PITR so we have 3120 WAL files that need to be 
applied to the database.


After 45 minutes, it has restored only 230 WAL files. At this rate, it's 
going to take about 10 hours to restore our database.


Most of the time, the server is not using very much CPU time or I/O time. So 
I'm wondering what can be done to speed up the process?


Brendan,

Where are the WAL files being stored and how are they being read back?

--
Jeff Frost, Owner   <[EMAIL PROTECTED]>
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Recovery will take 10 hours

2006-04-20 Thread Luke Lonergan
Title: Re: [PERFORM] Recovery will take 10 hours



Brendan,

  strace –p  -c

Then do a “CTRL-C” after a minute to get the stats of system calls.

- Luke

On 4/20/06 2:13 PM, "Brendan Duddridge" <[EMAIL PROTECTED]> wrote:

Hi Tom,

Do you mean do a kill -QUIT on the postgres process in order to 
generate a stack trace?

Will that affect the currently running process in any bad way? And 
where would the output go? stdout?

Thanks,


Brendan Duddridge | CTO | 403-277-5591 x24 |  [EMAIL PROTECTED]

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Apr 20, 2006, at 2:17 PM, Tom Lane wrote:

> Brendan Duddridge <[EMAIL PROTECTED]> writes:
>> We had a database issue today that caused us to have to restore to
>> our most recent backup. We are using PITR so we have 3120 WAL files
>> that need to be applied to the database.
>> After 45 minutes, it has restored only 230 WAL files. At this rate,
>> it's going to take about 10 hours to restore our database.
>> Most of the time, the server is not using very much CPU time or I/O
>> time. So I'm wondering what can be done to speed up the process?
>
> That seems a bit odd --- should be eating one or the other, one would
> think.  Try strace'ing the recovery process to see what it's doing.
>
>> If there were something we could do to speed up the process, would it
>> be possible to kill the postgres process, tweak some parameter
>> somewhere and then start it up again? Or would we have to restore our
>> base backup again and start over?
>
> You could start it up again, but it'd want to read through all the WAL
> it's already looked at, so I'd not recommend this until/unless you're
> pretty sure you've fixed the performance issue.  Right at the moment,
> I think this is a golden opportunity to study the performance of WAL
> recovery --- it's not something we've tried to optimize particularly.
>
>   regards, tom lane
>
> ---(end of 
> broadcast)---
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>    choose an index scan if your joining column's datatypes do not
>    match
>



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly









Re: [PERFORM] Recovery will take 10 hours

2006-04-20 Thread Tom Lane
Brendan Duddridge <[EMAIL PROTECTED]> writes:
> Do you mean do a kill -QUIT on the postgres process in order to  
> generate a stack trace?

Not at all!  I'm talking about tracing the kernel calls it's making.
Depending on your platform, the tool for this is called strace,
ktrace, truss, or maybe even just trace.  With strace you'd do
something like

strace -p PID-of-process 2>outfile
... wait 30 sec or so ...
control-C

Not sure about the APIs for the others but they're probably roughly
similar ... read the man page ...

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Recovery will take 10 hours

2006-04-20 Thread Brendan Duddridge

Hi Tom,

Do you mean do a kill -QUIT on the postgres process in order to  
generate a stack trace?


Will that affect the currently running process in any bad way? And  
where would the output go? stdout?


Thanks,


Brendan Duddridge | CTO | 403-277-5591 x24 |  [EMAIL PROTECTED]

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Apr 20, 2006, at 2:17 PM, Tom Lane wrote:


Brendan Duddridge <[EMAIL PROTECTED]> writes:

We had a database issue today that caused us to have to restore to
our most recent backup. We are using PITR so we have 3120 WAL files
that need to be applied to the database.
After 45 minutes, it has restored only 230 WAL files. At this rate,
it's going to take about 10 hours to restore our database.
Most of the time, the server is not using very much CPU time or I/O
time. So I'm wondering what can be done to speed up the process?


That seems a bit odd --- should be eating one or the other, one would
think.  Try strace'ing the recovery process to see what it's doing.


If there were something we could do to speed up the process, would it
be possible to kill the postgres process, tweak some parameter
somewhere and then start it up again? Or would we have to restore our
base backup again and start over?


You could start it up again, but it'd want to read through all the WAL
it's already looked at, so I'd not recommend this until/unless you're
pretty sure you've fixed the performance issue.  Right at the moment,
I think this is a golden opportunity to study the performance of WAL
recovery --- it's not something we've tried to optimize particularly.

regards, tom lane

---(end of  
broadcast)---

TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match





---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Recovery will take 10 hours

2006-04-20 Thread Tom Lane
Brendan Duddridge <[EMAIL PROTECTED]> writes:
> We had a database issue today that caused us to have to restore to  
> our most recent backup. We are using PITR so we have 3120 WAL files  
> that need to be applied to the database.
> After 45 minutes, it has restored only 230 WAL files. At this rate,  
> it's going to take about 10 hours to restore our database.
> Most of the time, the server is not using very much CPU time or I/O  
> time. So I'm wondering what can be done to speed up the process?

That seems a bit odd --- should be eating one or the other, one would
think.  Try strace'ing the recovery process to see what it's doing.

> If there were something we could do to speed up the process, would it  
> be possible to kill the postgres process, tweak some parameter  
> somewhere and then start it up again? Or would we have to restore our  
> base backup again and start over?

You could start it up again, but it'd want to read through all the WAL
it's already looked at, so I'd not recommend this until/unless you're
pretty sure you've fixed the performance issue.  Right at the moment,
I think this is a golden opportunity to study the performance of WAL
recovery --- it's not something we've tried to optimize particularly.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


[PERFORM] Recovery will take 10 hours

2006-04-20 Thread Brendan Duddridge

Hi,

We had a database issue today that caused us to have to restore to  
our most recent backup. We are using PITR so we have 3120 WAL files  
that need to be applied to the database.


After 45 minutes, it has restored only 230 WAL files. At this rate,  
it's going to take about 10 hours to restore our database.


Most of the time, the server is not using very much CPU time or I/O  
time. So I'm wondering what can be done to speed up the process?


The database is about 20 GB. The WAL files are compressed with gzip  
to about 4 MB. Expanded, the WAL files would take 48 GB.


We are using PostgreSQL 8.1.3 on OS X Server 10.4.6 connected to an  
XServe RAID. The pg_xlog is on its own separate RAID and so are the  
table spaces.


Here's a representative sample of doing iostat:

hulk1:/Library/PostgreSQL admin$ iostat 5
  disk1   disk2   disk0   cpu
  KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s  us sy id
19.31 101  1.91  14.39  51  0.71  37.37   4  0.13  15 10 76
  8.00  21  0.16   0.00   0  0.00  90.22   2  0.16   0  2 98
  8.00  32  0.25   0.00   0  0.00   0.00   0  0.00   0  1 98
  8.00  76  0.60   0.00   0  0.00   0.00   0  0.00   0  1 99
  8.00 587  4.59  1024.00   4  4.00   0.00   0  0.00   4  7 88
  8.00 675  5.27  956.27   6  5.60   0.00   0  0.00   6  6 88
11.32 1705 18.84   5.70   1  0.01  16.36   7  0.12   1  6 93
  8.00  79  0.62  1024.00   3  3.20   0.00   0  0.00   2  2 96
  8.00  68  0.53   0.00   0  0.00   0.00   0  0.00   0  2 98
  8.00  76  0.59   0.00   0  0.00   0.00   0  0.00   0  1 99
  8.02  89  0.69   0.00   0  0.00   0.00   0  0.00   1  1 98
  8.00 572  4.47  911.11   4  3.20   0.00   0  0.00   5  5 91
13.53 1227 16.21  781.55   4  3.21  12.14   2  0.03   3  6 90
  8.00  54  0.42   0.00   0  0.00  90.22   2  0.16   1  1 98
  8.00  68  0.53   0.00   0  0.00   0.00   0  0.00   0  1 99
  8.00 461  3.60  1024.00   3  3.20   0.00   0  0.00   3  6 91
  8.00 671  5.24  964.24   7  6.40   0.00   0  0.00   6  8 86
  7.99 248  1.94   0.00   0  0.00   0.00   0  0.00   1  3 96
15.06 1050 15.44  911.11   4  3.20  12.12   3  0.03   2  5 93
19.84 176  3.41   5.70   1  0.01   0.00   0  0.00   0  1 99


disk1 is the RAID volume that has the table spaces on it. disk2 is  
the pg_xlog and disk0 is the boot disk.


So you can see the CPU is idle much of the time and the IO only  
occurs in short bursts. Each line in the iostat results is 5 seconds  
apart.


If there were something we could do to speed up the process, would it  
be possible to kill the postgres process, tweak some parameter  
somewhere and then start it up again? Or would we have to restore our  
base backup again and start over?


How can I make this go faster?


Thanks,


Brendan Duddridge | CTO | 403-277-5591 x24 |  [EMAIL PROTECTED]

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings