Re: [HACKERS] V4 of PITR performance improvement for 8.4

2009-03-10 Thread Heikki Linnakangas

Fujii Masao wrote:

Hi,

On Mon, Mar 9, 2009 at 10:20 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

Thanks. This patch seems to be missing the new readahead.c file. I grabbed
that from the previous patch version.


Oh, sorry for the mistake. I changed one of Suzuki-san's patches
to be rebased to HEAD again (readahead-20090310.patch).
The other (addShBufCheck-20090120.patch) is not changed.

Comment:
we might reach consistent recovery state *before* redoing the safe
starting point, because readahead slightly delays the actual redo.
Is this safe? 


No. If you haven't replayed all the WAL records up to the safe starting 
point, the database isn't consistent yet. The distinction doesn't matter 
in practice without Hot Standby, though.



If not, the readahead queue should be flushed before
reaching that state?


Yes. Or you could move the reporting that you've reached the consistent 
recovery state into RedoRecords, when you reach the min safe starting point.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] V4 of PITR performance improvement for 8.4

2009-03-10 Thread Koichi Suzuki
Hi,

2009/3/10 Heikki Linnakangas heikki.linnakan...@enterprisedb.com:
 Fujii Masao wrote:

 Hi,

 On Mon, Mar 9, 2009 at 10:20 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:

 Thanks. This patch seems to be missing the new readahead.c file. I
 grabbed
 that from the previous patch version.

 Oh, sorry for the mistake. I changed one of Suzuki-san's patches
 to be rebased to HEAD again (readahead-20090310.patch).
 The other (addShBufCheck-20090120.patch) is not changed.

 Comment:
 we might reach consistent recovery state *before* redoing the safe
 starting point, because readahead slightly delays the actual redo.
 Is this safe?

 No. If you haven't replayed all the WAL records up to the safe starting
 point, the database isn't consistent yet. The distinction doesn't matter in
 practice without Hot Standby, though.

 If not, the readahead queue should be flushed before
 reaching that state?

 Yes. Or you could move the reporting that you've reached the consistent
 recovery state into RedoRecords, when you reach the min safe starting point.

This arrangement can be done with Hot Standby and Sync.Rep, right?

So far, it sounds that we need to add a code to handle if malloc()
fails (OOM).   In this case, possible way could be to skip whole
readahead, although the rest of the recovery may fail because of the
memory shortage.


 --
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com




-- 
--
Koichi Suzuki

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] V4 of PITR performance improvement for 8.4

2009-03-09 Thread Heikki Linnakangas

Fujii Masao wrote:

On Tue, Mar 3, 2009 at 1:47 PM, Fujii Masao masao.fu...@gmail.com wrote:

Hi Suzuki-san,

On Thu, Feb 26, 2009 at 5:03 AM, Koichi Suzuki koichi@gmail.com wrote:

My reply to Gregory's comment didn't have any objections.   I believe,
as I posted to Wiki page, latest posted patch is okay and waiting for
review.

One of your latest patches doesn't match with HEAD, so I updated it.


Oops! I failed in attaching the patch. This is second try.


Thanks. This patch seems to be missing the new readahead.c file. I 
grabbed that from the previous patch version.


It's annoying that we have to write *_readahead functions for each and 
every record type. In most record types, we already pass the information 
about the pages involved to XLogInsert, for full page writes. If we 
could change the WAL format so that that information is stored in WAL 
even when a full page image is taken, we could do the read-ahead for 
every WAL record type in a single function. Getting the API right needs 
some thinking, but that would be a lot nicer approach in the long run.


I agree with Itagaki-san's earlier comment that we should find a way to 
avoid the ReadAheadHasRoom calls 
(http://archives.postgresql.org/message-id/20090114101659.89cd.52131...@oss.ntt.co.jp). 
Let's just leave enough slack in the queue, so that it never fills up. 
And if the unthinkable happens and it does fill up, we can just throw 
away the readahead requests that don't fit; they're just hints anyway.


The API for ReadAheadAddEntry should include ForkNumber. Although all 
the readahead calls included in the patch all access the main fork, 
that's really just an omission that probably should be fixed even though 
the FSM and visibility map should become cached pretty quickly for any 
active tables.


effective_io_concurrency setting is ignored. If it's set to 1, we should 
disable prefetching altogether for the sake of both robustness (let you 
recover even if there's a bug in the readahead code) and performance 
(avoid useless posix_fadvise calls, sorting etc. if there's only one 
spindle).


The bursty queuing behavior feels pretty strange to me, though I guess 
it works pretty well in practice. I would've assumed a FIFO queue of WAL 
records, with some kind of a cache of recently issued posix_fadvise() calls.


As the patch stands, it's not limited to archive recovery. The code in 
readahead.c seems to assume that the readahead queue will always be 
flushed between xlog segment switch, but that is not enforced in xlog.c.


malloc() can return NULL on out of memory. Need to check that, or use 
palloc() instead.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] V4 of PITR performance improvement for 8.4

2009-03-04 Thread Koichi Suzuki
Appreciate for your kind help!

2009/3/3 Fujii Masao masao.fu...@gmail.com:
 On Tue, Mar 3, 2009 at 1:47 PM, Fujii Masao masao.fu...@gmail.com wrote:
 Hi Suzuki-san,

 On Thu, Feb 26, 2009 at 5:03 AM, Koichi Suzuki koichi@gmail.com wrote:
 My reply to Gregory's comment didn't have any objections.   I believe,
 as I posted to Wiki page, latest posted patch is okay and waiting for
 review.

 One of your latest patches doesn't match with HEAD, so I updated it.

 Oops! I failed in attaching the patch. This is second try.

 Regards,

 --
 Fujii Masao
 NIPPON TELEGRAPH AND TELEPHONE CORPORATION
 NTT Open Source Software Center




-- 
--
Koichi Suzuki

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] V4 of PITR performance improvement for 8.4

2009-03-02 Thread Fujii Masao
Hi Suzuki-san,

On Thu, Feb 26, 2009 at 5:03 AM, Koichi Suzuki koichi@gmail.com wrote:
 My reply to Gregory's comment didn't have any objections.   I believe,
 as I posted to Wiki page, latest posted patch is okay and waiting for
 review.

One of your latest patches doesn't match with HEAD, so I updated it.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] V4 of PITR performance improvement for 8.4

2009-03-02 Thread Fujii Masao
On Tue, Mar 3, 2009 at 1:47 PM, Fujii Masao masao.fu...@gmail.com wrote:
 Hi Suzuki-san,

 On Thu, Feb 26, 2009 at 5:03 AM, Koichi Suzuki koichi@gmail.com wrote:
 My reply to Gregory's comment didn't have any objections.   I believe,
 as I posted to Wiki page, latest posted patch is okay and waiting for
 review.

 One of your latest patches doesn't match with HEAD, so I updated it.

Oops! I failed in attaching the patch. This is second try.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


readahead-20090303.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] V4 of PITR performance improvement for 8.4

2009-02-25 Thread Koichi Suzuki
Hi,

My reply to Gregory's comment didn't have any objections.   I believe,
as I posted to Wiki page, latest posted patch is okay and waiting for
review.

2009/2/24 Robert Haas robertmh...@gmail.com:
 On Sun, Jan 25, 2009 at 7:15 AM, Gregory Stark st...@enterprisedb.com wrote:
 Koichi Suzuki koichi@gmail.com writes:

 Please find enclosed 2nd patch of pg_readahead which include a patch
 to bufer manager to skip prefetch of pages already in shared buffer.

 I'm a bit confused by this comment. PrefetchBuffer already checks if the page
 is in shared buffers.

 What is tricky to avoid is prefetching the same page twice -- since the first
 prefetch doesn't actually put it in shared buffers there's no way to avoid
 prefetching it again unless you keep some kind of hash of recently prefetched
 buffers.

 For the index scan case I'm debating about whether to add such a cache
 directly to PrefetchBuffer -- in which case it would remember if some other
 scan prefetched the same buffer -- or to keep it in the index scan code.

 Has this issue been resolved?  Does this patch need more review?
 Because if so, I'm guessing it needs to happen RSN.

 ...Robert




-- 
--
Koichi Suzuki

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] V4 of PITR performance improvement for 8.4

2009-02-24 Thread Robert Haas
On Sun, Jan 25, 2009 at 7:15 AM, Gregory Stark st...@enterprisedb.com wrote:
 Koichi Suzuki koichi@gmail.com writes:

 Please find enclosed 2nd patch of pg_readahead which include a patch
 to bufer manager to skip prefetch of pages already in shared buffer.

 I'm a bit confused by this comment. PrefetchBuffer already checks if the page
 is in shared buffers.

 What is tricky to avoid is prefetching the same page twice -- since the first
 prefetch doesn't actually put it in shared buffers there's no way to avoid
 prefetching it again unless you keep some kind of hash of recently prefetched
 buffers.

 For the index scan case I'm debating about whether to add such a cache
 directly to PrefetchBuffer -- in which case it would remember if some other
 scan prefetched the same buffer -- or to keep it in the index scan code.

Has this issue been resolved?  Does this patch need more review?
Because if so, I'm guessing it needs to happen RSN.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] V4 of PITR performance improvement for 8.4

2009-01-28 Thread Koichi Suzuki
Hi,

This is also a reply to your latest post.I'm replying to the older
one because I need original text.

2009/1/26 Koichi Suzuki koichi@gmail.com:
 Hi,

 It's simply because we should not refer to system catalog during the recovery.

This is the reason why I didn't use PrefetchBuffer().   Prefetch
buffer goes to the system catalog which we should not read during the
recovery.
I added a code to prefetch needed page without referring to the system catalog.

As you pointed out, it has nothing to do with the prefetch itself.


 2009/1/25 Gregory Stark st...@enterprisedb.com:

 Koichi Suzuki koichi@gmail.com writes:

 Please find enclosed 2nd patch of pg_readahead which include a patch
 to bufer manager to skip prefetch of pages already in shared buffer.

 I'm a bit confused by this comment. PrefetchBuffer already checks if the page
 is in shared buffers.

 What is tricky to avoid is prefetching the same page twice -- since the first
 prefetch doesn't actually put it in shared buffers there's no way to avoid
 prefetching it again unless you keep some kind of hash of recently prefetched
 buffers.

 For the index scan case I'm debating about whether to add such a cache
 directly to PrefetchBuffer -- in which case it would remember if some other
 scan prefetched the same buffer -- or to keep it in the index scan code.

I also think this could be additional feature of prefetch.   On the
other hand, if some page is not on the shared buffer and on kernel's
cache, whichi should be the case we should avoid pfefetch,
posix_fadvise() will not read from the disk and the duration for this
call will be very very small.   So for the time being, I think this
can be acceptable.


 --
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning




 --
 --
 Koichi Suzuki

-- 
--
Koichi Suzuki

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] V4 of PITR performance improvement for 8.4

2009-01-26 Thread Gregory Stark

Koichi Suzuki koichi@gmail.com writes:

 It's simply because we should not refer to system catalog during the recovery.

I don't understand how this is connected to anything to do with prefetching?


-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] V4 of PITR performance improvement for 8.4

2009-01-25 Thread Gregory Stark

Koichi Suzuki koichi@gmail.com writes:

 Please find enclosed 2nd patch of pg_readahead which include a patch
 to bufer manager to skip prefetch of pages already in shared buffer.

I'm a bit confused by this comment. PrefetchBuffer already checks if the page
is in shared buffers. 

What is tricky to avoid is prefetching the same page twice -- since the first
prefetch doesn't actually put it in shared buffers there's no way to avoid
prefetching it again unless you keep some kind of hash of recently prefetched
buffers.

For the index scan case I'm debating about whether to add such a cache
directly to PrefetchBuffer -- in which case it would remember if some other
scan prefetched the same buffer -- or to keep it in the index scan code.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] V4 of PITR performance improvement for 8.4

2009-01-25 Thread Koichi Suzuki
Hi,

It's simply because we should not refer to system catalog during the recovery.

2009/1/25 Gregory Stark st...@enterprisedb.com:

 Koichi Suzuki koichi@gmail.com writes:

 Please find enclosed 2nd patch of pg_readahead which include a patch
 to bufer manager to skip prefetch of pages already in shared buffer.

 I'm a bit confused by this comment. PrefetchBuffer already checks if the page
 is in shared buffers.

 What is tricky to avoid is prefetching the same page twice -- since the first
 prefetch doesn't actually put it in shared buffers there's no way to avoid
 prefetching it again unless you keep some kind of hash of recently prefetched
 buffers.

 For the index scan case I'm debating about whether to add such a cache
 directly to PrefetchBuffer -- in which case it would remember if some other
 scan prefetched the same buffer -- or to keep it in the index scan code.

 --
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning




-- 
--
Koichi Suzuki

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] V4 of PITR performance improvement for 8.4

2009-01-13 Thread Koichi Suzuki
Sorry, I didn't attatch the patch file.  This is the second try.

2009/1/12 Koichi Suzuki koichi@gmail.com:
 This is V4 patch to improve file read during startup for review.

 Basic algorithm is same as V3 but works with Gregory's fadvise patch.

 http://archives.postgresql.org/pgsql-hackers/2009-01/msg00026.php

 This patc also include additional patch for posix_fadvise to skip
 prefetch of pages which does not exist.

 We still need a patch to work this with syncronous replication, which
 will be posted by the end of this week.

 Regards;

 --
 --
 Koichi Suzuki




-- 
--
Koichi Suzuki


readahead-20090109.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] V4 of PITR performance improvement for 8.4

2009-01-13 Thread ITAGAKI Takahiro

Koichi Suzuki koichi@gmail.com wrote:

  This patc also include additional patch for posix_fadvise to skip
  prefetch of pages which does not exist.

I reviewed your patch and found items to be fixed and improved.

- You should not claim your copyrights.
There are your copyright claims in two files newly added, but they
should be included by PostgreSQL Global Development Group.
| * Portions Copyright (c) 2008, Nippon Telegraph and Telephone Corporation
| * Portions Copyright (c) 1996-2004, PostgreSQL Global Development Group

- Avoid pending wal records on continuous REDO.
In archive recovery, ReadRecord() waits until the next segment comes.
Should we avoid pending WAL records during the waiting?
RedoRecords() might be needed before calling RestoreArchivedFile().
It would be better to redo records and restore archived files
in parallel, but we will need to replace system() to other system
because system() blocks until the process finishes.

- Should check the buffer pool before calling smgrprefetch.
If the page is in the shared buffer pool, we don't need to
prefetch the buffer from disks. I think it is good to add
PrefetchBufferWithoutRelcache() like ReadBufferWithoutRelcache().

- Is possible to remove calls of ReadAheadHasRoom()?
If we checks rooms of readahead-queue, xxx_readahead functions
don't need to call ReadAheadHasRoom. The maximum readaheads are
at most 3, so we'll flush the queue when the rooms of the queue
are less than 3. And if the queue is full unexpectedly, we can
ignore the added items because it is only hint.

- Should avoid large static variables.
| [src/backend/access/transam/xlog]
| + static char RecordQueueBuf[XLogSegSize * 2];
| [readahead.c]
| + static XLogReadAhead ReadAheadQueue[ReadAheadQueueSize];
I think it is a bad idea to place a large object as a static variable.
It should be a pointer and allocated with palloc() in runtime.
Also those variable are only used by startup process.

- Merge readahead.h to xlog.h or something.
It export just a few functions and the functionality belongs
xlog and recovery.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] V4 of PITR performance improvement for 8.4

2009-01-12 Thread Koichi Suzuki
This is V4 patch to improve file read during startup for review.

Basic algorithm is same as V3 but works with Gregory's fadvise patch.

http://archives.postgresql.org/pgsql-hackers/2009-01/msg00026.php

This patc also include additional patch for posix_fadvise to skip
prefetch of pages which does not exist.

We still need a patch to work this with syncronous replication, which
will be posted by the end of this week.

Regards;

-- 
--
Koichi Suzuki

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers