Re: [PATCHES] [HACKERS] Full page writes improvement, code update
I really appreciate the modification. I also agree that XLOG_NOOP is a good way to keep the XLOG format consistent. I'll continue writing code to produce incremental log records from the full page writes, as well as to maintain the CRC, XLOG_NOOP and other XLOG locations. I also found that you've added information to btree split log records, which enables producing the corresponding incremental logs from the full page writes.

2007/5/21, Tom Lane <[EMAIL PROTECTED]>:

Koichi Suzuki <[EMAIL PROTECTED]> writes:
> As replied to "Patch queue triage" by Tom, here's a simplified patch to
> mark WAL records as "compressable", with no increase in WAL itself.
> Compression/decompression commands will be posted separately to
> pgFoundry for further review.

Applied with some minor modifications. I didn't like the idea of suppressing the sanity check on WAL record length; I think that's fairly important. Instead, I added a provision for an XLOG_NOOP WAL record type that can be used to fill in the extra space.

The way I envision that working is that the compressor removes backup blocks and converts each compressible WAL record to have the same contents and length it would've had if written without backup blocks. Then, it inserts an XLOG_NOOP record with length set to indicate the amount of extra space that needs to be chewed up -- but in the compressed version of the WAL file, XLOG_NOOP's "data area" is not actually stored. The decompressor need only scan the file looking for XLOG_NOOP and insert the requisite number of zero bytes (and maybe recompute the XLOG_NOOP's CRC, depending on whether you want it to be valid for the short-format record in the compressed file). There will also be some games to be played for WAL page boundaries, but you had to do that anyway.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq

--
Koichi Suzuki
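Tom's XLOG_NOOP scheme above can be sketched roughly as follows. The struct and function names here are illustrative stand-ins, not the actual PostgreSQL record layout: the point is only that the compressed file stores just a header whose length field says how big the no-op claims to be, and the decompressor pads with zeros so every later record keeps its original byte offset.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical no-op record header; only this part is stored compressed. */
typedef struct NoopRecord
{
    uint32_t xl_tot_len;    /* space the record claims in the restored WAL */
} NoopRecord;

/*
 * Compressor side: given the bytes saved by stripping backup blocks from a
 * record, emit a no-op whose claimed length covers exactly that hole.
 */
static NoopRecord
make_noop(uint32_t saved_bytes)
{
    NoopRecord rec;

    rec.xl_tot_len = saved_bytes;
    return rec;
}

/*
 * Decompressor side: write the header followed by zero bytes up to the
 * claimed length, so subsequent records land at their original offsets.
 * Returns the number of bytes written, or 0 if the buffer is too small.
 */
static uint32_t
restore_noop(const NoopRecord *rec, char *out, uint32_t outlen)
{
    if (rec->xl_tot_len < sizeof(NoopRecord) || rec->xl_tot_len > outlen)
        return 0;
    memcpy(out, rec, sizeof(NoopRecord));
    memset(out + sizeof(NoopRecord), 0,
           rec->xl_tot_len - (uint32_t) sizeof(NoopRecord));
    return rec->xl_tot_len;
}
```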
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Hi,

As replied to "Patch queue triage" by Tom, here's a simplified patch to mark WAL records as "compressable", with no increase in WAL itself. Compression/decompression commands will be posted separately to pgFoundry for further review.

---

As suggested by Tom, I agree that WAL should not include "both" the full page write and the incremental (logical) log. I began to examine the WAL record format to see if the incremental log can be made from the full page writes. It will be okay even before 8.4, if a simplified patch to the core is accepted. I will post the simplified patch to the core as follows:

1. Mark the flag to indicate that the WAL record is compressable from full page writes to incremental log. This flag will be set if
   a) it is not written during the hot backup, and
   b) the archive command is active, and
   c) the WAL record contains full page writes, and
   d) full_page_writes=on.
   No logical log will be written to WAL in this case.

2. During recovery, the xl_tot_len check will be skipped for compressed WAL records.

Please note that no new GUC is needed in this patch. With this patch, compress/decompress can be developed outside the core. I'd be very grateful if this patch can be considered again.

Best Regards;

--
Koichi Suzuki

diff -cr pgsql_org/src/backend/access/transam/xlog.c pgsql/src/backend/access/transam/xlog.c
*** pgsql_org/src/backend/access/transam/xlog.c	2007-05-02 15:56:38.0 +0900
--- pgsql/src/backend/access/transam/xlog.c	2007-05-07 16:30:38.0 +0900
***************
*** 837,842 ****
--- 837,854 ----
  		return RecPtr;
  	}

+ 	/*
+ 	 * If online backup is not in progress and WAL archiving is active, mark
+ 	 * backup blocks removable if any.
+ 	 * This mark will be referenced during archiving to remove needless backup
+ 	 * blocks in the record and compress WAL segment files.
+ 	 */
+ 	if (XLogArchivingActive() && fullPageWrites &&
+ 		(info & XLR_BKP_BLOCK_MASK) && !Insert->forcePageWrites)
+ 	{
+ 		info |= XLR_BKP_REMOVABLE;
+ 	}
+
  	/* Insert record header */
  	record = (XLogRecord *) Insert->currpos;

***************
*** 2738,2750 ****
  		blk += blen;
  	}
! 	/* Check that xl_tot_len agrees with our calculation */
! 	if (blk != (char *) record + record->xl_tot_len)
  	{
! 		ereport(emode,
! 				(errmsg("incorrect total length in record at %X/%X",
! 						recptr.xlogid, recptr.xrecoff)));
! 		return false;
  	}

  	/* Finally include the record header */
--- 2750,2778 ----
  		blk += blen;
  	}

! 	/*
! 	 * If the physical log has not been removed, check the length to see
! 	 * the following:
! 	 * - no physical log existed originally,
! 	 * - the WAL record was not removable because it was generated during
! 	 *   the online backup,
! 	 * - it cannot be removed because the physical log spanned
! 	 *   two segments.
! 	 * The reason why we skip the length check on physical log removal is
! 	 * that the flag XLR_SET_BKB_BLOCK(0..2) is reset to zero and it prevents
! 	 * the above loop from advancing blk to the end of the record.
! 	 */
! 	if (!(record->xl_info & XLR_BKP_REMOVABLE) ||
! 		record->xl_info & XLR_BKP_BLOCK_MASK)
  	{
! 		/* Check that xl_tot_len agrees with our calculation */
! 		if (blk != (char *) record + record->xl_tot_len)
! 		{
! 			ereport(emode,
! 					(errmsg("incorrect total length in record at %X/%X",
! 							recptr.xlogid, recptr.xrecoff)));
! 			return false;
! 		}
  	}

  	/* Finally include the record header */

Only in pgsql/src/backend/access/transam: xlog.c.orig

diff -cr pgsql_org/src/include/access/xlog.h pgsql/src/include/access/xlog.h
*** pgsql_org/src/include/access/xlog.h	2007-01-06 07:19:51.0 +0900
--- pgsql/src/include/access/xlog.h	2007-05-07 16:30:38.0 +0900
***************
*** 66,73 ****
  /*
   * If we backed up any disk blocks with the XLOG record, we use flag bits in
   * xl_info to signal it.  We support backup of up to 3 disk blocks per XLOG
! * record.  (Could support 4 if we cared to dedicate all the xl_info bits for
! * this purpose; currently bit 0 of xl_info is unused and available.)
   */
  #define XLR_BKP_BLOCK_MASK	0x0E	/* all info bits used for bkp blocks */
  #define XLR_MAX_BKP_BLOCKS	3
--- 66,74 ----
  /*
   * If we backed up any disk blocks with the XLOG record, we use flag bits in
   * xl_i
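The flag logic in the patch above implies a simple archiving-time test in the compressor. This sketch assumes a value of 0x01 for XLR_BKP_REMOVABLE (the diff is truncated before its definition; the xlog.h comment only says bit 0 of xl_info is unused and available, so the exact value is an assumption here):

```c
#include <assert.h>
#include <stdint.h>

/* Mask values mirror the patch above; XLR_BKP_REMOVABLE's value is assumed. */
#define XLR_BKP_BLOCK_MASK  0x0E    /* bits flagging backup blocks 1..3 */
#define XLR_BKP_REMOVABLE   0x01    /* assumed: the previously unused bit 0 */

/*
 * An archiving-time compressor may strip backup blocks only when the record
 * was marked removable at insert time and actually carries backup blocks.
 */
static int
can_strip_backup_blocks(uint8_t xl_info)
{
    return (xl_info & XLR_BKP_REMOVABLE) != 0 &&
           (xl_info & XLR_BKP_BLOCK_MASK) != 0;
}
```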
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Josh,

Josh Berkus wrote:

Koichi, Andreas,

1) To deal with partial/inconsistent writes to the data file at crash recovery, we need full page writes at the first modification to pages after each checkpoint. It consumes much of the WAL space.

We need to find a way around this someday. Other DBs don't do this; it may be because they're less durable, or because they fixed the problem.

Maybe both. Fixing the problem may need some means to detect partial/inconsistent writes to the data files, which may need additional CPU resources.

I don't think there should be only one setting. It depends on how the database is operated. Leaving wal_add_optimization_info = off by default does not bring any change in WAL and archive log handling. I understand some people may not be happy with an additional 3% or so increase in WAL size, especially people who don't need the archive log at all. So I prefer to leave the default off.

Except that, is there any reason to turn this off if we are archiving? Maybe it should just be slaved to archive_command ... if we're not using PITR, it's off; if we are, it's on.

Hmm, this sounds workable. On the other hand, existing users who are happy with the current archiving might not like to change the current archive command to pg_compresslog, or the archive log size will increase a bit. I'd like to hear some more on this.

1) Is there any throughput benefit for platforms with fast CPUs but constrained I/O (e.g. 2-drive webservers)? Any penalty for servers with plentiful I/O?

I've only run benchmarks with the archive process running, because wal_add_optimization_info=on does not make sense if we don't archive WAL. In this situation, total I/O decreases because writes to the archive log decrease. Because of the 3% or so increase in WAL size, there will be an increase in WAL writes, but the decrease in archive writes makes up for it.

Yeah, I was just looking for a way to make this a performance feature. I see now that it can't be. ;-)

As to the performance feature, I tested the patch against 8.3HEAD.
With pgbench, the throughput was as follows:

Case 1: archiver: cp command, wal_add_optimization_info = off, full_page_writes = on
Case 2: archiver: pg_compresslog, wal_add_optimization_info = on, full_page_writes = on

DB size: 1.65GB, total transactions: 1,000,000

Throughput:
Case 1: 632.69 TPS
Case 2: 653.10 TPS ... 3% gain.

Archive log size:
Case 1: 1.92GB
Case 2: 0.57GB (about 30% of Case 1). Before compression, the size was 1.92GB.

Because this is based on the WAL segment file size, there can be up to 16MB of error in the measurement. If we account for this, the increase in WAL I/O will be less than 1%.

3) How is this better than command-line compression for log-shipping? e.g. why do we need it in the database?

I don't fully understand what command-line compression means. Simon suggested that this patch can be used with log-shipping, and I agree. If we compare it with gzip or other general-purpose compression, the compression ratio, CPU usage and I/O of pg_compresslog are all quite a bit better than those of gzip.

OK, that answered my question.

This is why I don't like Josh's suggested name of wal_compressable either. WAL is compressable either way; only pg_compresslog would need to be more complex if you don't turn off the full page optimization. I think a good name would tell that you are turning off an optimization (thus my wal_fullpage_optimization on/off).

Well, as a PG hacker I find the name wal_fullpage_optimization quite baffling, and I think our general user base will find it even more so. Now that I have Koichi's explanation of the problem, I vote for simply slaving this to the PITR settings and not having a separate option at all.

Could I have a more specific suggestion on this?

Regards;

--
Koichi Suzuki
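For what it's worth, the quoted percentages can be rechecked from the raw pgbench numbers above:

```c
#include <assert.h>

/* Relative throughput gain, in percent, of new_tps over base_tps. */
static double
throughput_gain_pct(double base_tps, double new_tps)
{
    return (new_tps - base_tps) / base_tps * 100.0;
}

/* Compressed archive size as a percentage of the uncompressed size. */
static double
archive_ratio_pct(double base_gb, double compressed_gb)
{
    return compressed_gb / base_gb * 100.0;
}
```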
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Hi,

Zeugswetter Andreas ADI SD wrote:

I don't insist on the name and the default of the GUC parameter. I'm afraid wal_fullpage_optimization = on (default) makes for some confusion because the default behavior becomes a bit different on WAL itself.

Seems my wal_fullpage_optimization is not a good name if it caused misinterpretation already :-(

Amount of WAL after a 60min. run of the DBT-2 benchmark:
wal_add_optimization_info = off (default)   3.13GB
  how about wal_fullpage_optimization = on (default)

The meaning of wal_fullpage_optimization = on (default) would be the same as your wal_add_optimization_info = off (default). (Reversed name, reversed meaning of the boolean value.) It would be there to *turn off* the (default) WAL full-page optimization. For your pg_compresslog it would need to be set to off. "add_optimization_info" sounded like added info about/for some optimization, which it is not. We turn off an optimization with the flag for the benefit of an easier pg_compresslog implementation.

For pg_compresslog to remove full page writes, we need wal_add_optimization_info=on.

As already said, I would decouple this setting from the part that sets the "removable full page" flag in WAL and makes recovery able to skip dummy records. That I would do unconditionally.

Andreas

--
Koichi Suzuki
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Hi,

Sorry, because of so many comments/questions, I'll reply inline.

Josh Berkus wrote:

Hackers,

Writing lots of additional code simply to remove a parameter that *might* be mis-interpreted doesn't sound useful to me, especially when bugs may leak in that way. My take is that this is simple and useful *and* we have it now; other ways don't yet exist, nor will they in time for 8.3.

How about naming the parameter wal_compressable? That would indicate pretty clearly that the parameter is intended to be used with wal_compress and nothing else.

Hmm, it sounds nicer.

However, I do agree with Andreas that anything which adds to WAL volume, even 3%, seems like going in the wrong direction. We already have higher log output than any comparable database (higher than InnoDB by 3x) and we should be looking for output to trim as well as compress. So the relevant question is whether the patch in its current form provides enough benefit to make it worthwhile for 8.3, or whether we should wait for 8.4.

Questions:

Before answering the questions below, I'd like to say that archive log optimization has to address different points of view than the current (up to 8.2) settings:

1) To deal with partial/inconsistent writes to the data file at crash recovery, we need full page writes at the first modification to pages after each checkpoint. It consumes much of the WAL space.

2) 1) is not necessary for archive recovery (PITR), and full page writes can be removed for this purpose. However, we need full page writes during hot backup to deal with partial writes by backup commands. This is implemented in 8.2.

3) To maintain the chance of crash recovery and reduce the amount of archive log, removal of unnecessary full page writes from archive logs is a good choice. To do this, we need both the logical log and the full page writes in WAL.

I don't think there should be only one setting. It depends on how the database is operated.
Leaving wal_add_optimization_info = off by default does not bring any change in WAL and archive log handling. I understand some people may not be happy with an additional 3% or so increase in WAL size, especially people who don't need the archive log at all. So I prefer to leave the default off.

For users, I think this is simple enough:

1) For people happy with the 8.2 settings: no change is needed to move to 8.3, and there's really no change.

2) For people who need to reduce the archive log size but would like to leave full page writes in WAL (to maintain the chance of crash recovery):
   a) Add the GUC parameter wal_add_optimization_info=on
   b) Change the archive command from "cp" to "pg_compresslog"
   c) Change the restore command from "cp" to "pg_decompresslog"
   Archive logs can then be stored and restored as in older releases.

1) Is there any throughput benefit for platforms with fast CPUs but constrained I/O (e.g. 2-drive webservers)? Any penalty for servers with plentiful I/O?

I've only run benchmarks with the archive process running, because wal_add_optimization_info=on does not make sense if we don't archive WAL. In this situation, total I/O decreases because writes to the archive log decrease. Because of the 3% or so increase in WAL size, there will be an increase in WAL writes, but the decrease in archive writes makes up for it.

2) Will this patch make attempts to reduce WAL volume in the future significantly harder?

Yes, I'd like to continue to work on reducing the WAL size. It's still an issue when the database becomes several hundreds of gigabytes in size. Anyway, I think WAL size reduction has to be done in XLogInsert() or XLogWrite(). We need much more discussion on this. The issue will be how to maintain the chance of crash recovery despite inconsistent writes (with full_page_writes=off, we have to give that up). On the other hand, we have to keep examining each WAL record.

3) How is this better than command-line compression for log-shipping? e.g. why do we need it in the database?
I don't fully understand what command-line compression means. Simon suggested that this patch can be used with log-shipping, and I agree. If we compare it with gzip or other general-purpose compression, the compression ratio, CPU usage and I/O of pg_compresslog are all quite a bit better than those of gzip. Please let me know if you intended something different.

Regards;

--
Koichi Suzuki
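For readers wanting the concrete settings behind steps (a)-(c) above, they might look roughly like this; the archive directory path here is made up for illustration, and the exact pg_compresslog/pg_decompresslog argument order is as described in the thread, not verified against the tool:

```ini
# postgresql.conf -- step (a): mark removable full page writes in WAL
wal_add_optimization_info = on

# step (b): compress at archive time instead of a plain cp
archive_command = 'pg_compresslog %p /mnt/archive/%f'

# recovery.conf -- step (c): decompress on restore instead of a plain cp
restore_command = 'pg_decompresslog /mnt/archive/%f %p'
```

As with any archive_command, %p is the path of the segment to archive and %f its file name.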
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Hi,

I don't insist on the name and the default of the GUC parameter. I'm afraid wal_fullpage_optimization = on (default) makes for some confusion because the default behavior becomes a bit different on WAL itself. I'd like to have some more opinions on this.

Zeugswetter Andreas ADI SD wrote:

With the DBT-2 benchmark, I've already compared the amount of WAL. The result was as follows:

Amount of WAL after a 60min. run of the DBT-2 benchmark:
wal_add_optimization_info = off (default)   3.13GB
  how about wal_fullpage_optimization = on (default)
wal_add_optimization_info = on (new case)   3.17GB
  -> can be optimized to 0.31GB by pg_compresslog.

So the difference will be around a couple of percent. I think this is a very good figure. For information:
DB size: 12.35GB (120WH)
Checkpoint timeout: 60min.
A checkpoint occurred only once in the run.

Unfortunately I think DBT-2 is not a good benchmark to test the disabled WAL optimization. The test should contain some larger rows (maybe some updates on large toasted values), and maybe more frequent checkpoints. Actually, the poor ratio between full pages and normal WAL content in this benchmark is strange to begin with. Tom fixed a bug recently, and it would be nice to see the new ratio.

Have you read Tom's comment on not really having to be able to reconstruct all record types from the full page image? I think that sounded very promising (e.g. start out with only heap insert/update). Then:
- we would not need the WAL optimization switch (the full page flag would always be added, depending only on backup)
- pg_compresslog would only remove such "full page" images where it knows how to reconstruct a "normal" WAL record from them
- with time and effort pg_compresslog would be able to compress [nearly] all record types' full page images (no change in the backend)

I don't think replacing LSNs works fine. For full recovery to the current time, we need both the archive log and WAL. Replacing LSNs will make the archive log LSNs inconsistent with WAL's LSNs and the recovery will not work.
WAL recovery would have had to be modified (decoupling the LSN from the WAL position during recovery). An "archive log" would have been a valid WAL (with appropriate LSN advance records).

Reconstruction to regular WAL is proposed as pg_decompresslog. We should be careful enough not to make redo routines confused by the dummy full page writes, as Simon suggested. So far, it works fine.

Yes, Tom didn't like "LSN replacing" either. I withdraw my concern regarding pg_decompresslog. Your work in this area is extremely valuable and I hope my comments are not discouraging.

Thank you

Andreas

--
Koichi Suzuki
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Here's only a part of the reply I should give, but as to I/O error checking ...

Here's a list of system calls and other external function/library calls used in the pg_lesslog patch series, together with how the current patch checks each error and how the current PostgreSQL source handles similar calls:

1. No error check is done

1-1. fileno()
fileno() is called against stdin and stdout from pg_compresslog.c and pg_decompresslog.c. They are intended to be invoked from a shell, so stdin and stdout are both available. A fileno() error occurs only if the invoker of pg_compresslog or pg_decompresslog closes stdin and/or stdout before executing them. I found similar fileno() usage in pg_dump/pg_backup_archive.c and postmaster/syslogger.c. I don't think this is an issue.

1-2. fflush()
fflush() is called against stdout within a debug routine, debug.c. Such usage can also be found in bin/initdb.c, bin/scripts/createdb.c, bin/psql/common.c and more. I don't think this is an issue either.

1-3. printf() and fprintf()
It is common practice not to check these for errors. We can find such calls in many of the existing source files.

1-4. strerror()
It is checked that the system call returned an error before calling strerror(). Similar code can be found elsewhere in the PostgreSQL source too.

2. Error check is done

All the following function calls have their return values checked:
open(), close(), fstat(), read(), write()

3. Functions that do not return errors

The following functions do not return errors, so no error check is needed:
exit(), memcpy(), memset(), strcmp()

I hope this helps.

Regards;

Tom Lane wrote:
> "Simon Riggs" <[EMAIL PROTECTED]> writes:
>> Writing lots of additional code simply to remove a parameter that
>> *might* be mis-interpreted doesn't sound useful to me, especially when
>> bugs may leak in that way. My take is that this is simple and useful
>> *and* we have it now; other ways don't yet exist, nor will they in time
>> for 8.3.
> The potential for misusing the switch is only one small part of the
> argument; the larger part is that this has been done in the wrong way
> and will cost performance unnecessarily. The fact that it's ready
> now is not something that I think should drive our choices.
>
> I believe that it would be possible to make the needed core-server
> changes in time for 8.3, and then to work on compress/decompress
> on its own time scale and publish it on pgfoundry; with the hope
> that it would be merged to contrib or core in 8.4. Frankly the
> compress/decompress code needs work anyway before it could be
> merged (eg, I noted a distinct lack of I/O error checking).
>
> regards, tom lane

--
Koichi Suzuki
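The kind of I/O error checking Tom asks for might look like the following sketch: a write() wrapper that retries short writes and interrupted calls, and reports failures via strerror() instead of ignoring them. This is illustrative only, not code from the patch:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Write the whole buffer, retrying short writes and EINTR; report any
 * real error via strerror().  Returns 0 on success, -1 on failure.
 */
static int
write_all(int fd, const char *buf, size_t len)
{
    while (len > 0)
    {
        ssize_t n = write(fd, buf, len);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before writing; retry */
            fprintf(stderr, "write failed: %s\n", strerror(errno));
            return -1;
        }
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}
```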
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Hi,

I agree that pg_compresslog should be aware of all the WAL records' details so that it can optimize the archive log safely. In my patch, I've examined 8.2's WAL records to make pg_compresslog/pg_decompresslog safe. I also agree that further pg_compresslog maintenance needs to examine changes in the WAL record format. Because the number of such format changes will be limited, I think the amount of work will be reasonable.

Regards;

Simon Riggs wrote:

On Fri, 2007-04-13 at 10:36 -0400, Tom Lane wrote:

"Zeugswetter Andreas ADI SD" <[EMAIL PROTECTED]> writes:

But you also turn off the optimization that avoids writing regular WAL records when the info is already contained in a full-page image (increasing the uncompressed size of WAL). It was that part I questioned.

I think it's right to question it, certainly.

That's what bothers me about this patch, too. It will be increasing the cost of writing WAL (more data -> more CRC computation and more I/O, not to mention more contention for the WAL locks), which translates directly to a server slowdown.

I don't really understand this concern. Koichi-san has included a parameter setting that would prevent any change at all in the way WAL is written. If you don't want this slight increase in WAL, don't enable it. If you do enable it, you'll also presumably be compressing the xlog too, which works much better than gzip while using less CPU. So overall it saves more than it costs, ISTM, and costs nothing at all if you choose not to use it.

The main arguments that I could see against Andreas' alternative are:

1. Some WAL record types are arranged in a way that actually would not permit the reconstruction of the short form from the long form, because they throw away too much data when the full-page image is substituted. An example that's fresh in my mind is that the current format of the btree page split WAL record discards newitemoff in that case, so you couldn't identify the inserted item in the page image.
Now this is only saving two bytes in what's usually going to be a darn large record anyway, and it complicates the code to do it, so I wouldn't cry if we changed btree split to include newitemoff always. But there might be some other cases where more data is involved. In any case, someone would have to look through every single WAL record type to determine whether reconstruction is possible, and fix it if not.

2. The compresslog utility would have to have specific knowledge about every compressible WAL record type, to know how to convert it to the short format. That means an ongoing maintenance commitment there. I don't think this is unacceptable, simply because we need only teach it about a few common record types, not everything under the sun --- anything it doesn't know how to fix, just leave alone, and if it's an uncommon record type it really doesn't matter. (I guess that means that we don't really have to do #1 for every last record type, either.)

So I don't think either of these is a showstopper. Doing it this way would certainly make the patch more acceptable, since the argument that it might hurt rather than help performance in some cases would go away.

Yeah, it's additional code paths, but it sounds like Koichi-san and colleagues are going to be trail-blazing any bugs there and will be around to fix any more that emerge.

What about disconnecting the WAL LSN from the physical WAL record position during replay? Add simple short WAL records in pg_compresslog like: advance LSN by 8192 bytes.

I don't care for that, as it pretty much destroys some of the more important sanity checks that xlog replay does. The page boundaries need to match the records contained in them. So I think we do need to have pg_decompresslog insert dummy WAL entries to fill up the space saved by omitting full pages.

Agreed. I don't want to start touching something that works so well. We've been thinking about doing this for at least 3 years now, so I don't see any reason to baulk at it now.
I'm happy with Koichi-san's patch as-is, assuming further extensive testing will be carried out on it during beta.

--
Koichi Suzuki
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Sorry I was very late to find this.

With the DBT-2 benchmark, I've already compared the amount of WAL. The result was as follows:

Amount of WAL after a 60min. run of the DBT-2 benchmark:
wal_add_optimization_info = off (default)   3.13GB
wal_add_optimization_info = on (new case)   3.17GB
  -> can be optimized to 0.31GB by pg_compresslog.

So the difference will be around a couple of percent. I think this is a very good figure. For information:
DB size: 12.35GB (120WH)
Checkpoint timeout: 60min.
A checkpoint occurred only once in the run.

I don't think replacing LSNs works fine. For full recovery to the current time, we need both the archive log and WAL. Replacing LSNs will make the archive log LSNs inconsistent with WAL's LSNs and the recovery will not work.

Reconstruction to regular WAL is proposed as pg_decompresslog. We should be careful enough not to make redo routines confused by the dummy full page writes, as Simon suggested. So far, it works fine.

Regards;

Zeugswetter Andreas ADI SD wrote:

Yup, this is a good summary.

You say you need to remove the optimization that avoids the logging of a new tuple because the full page image exists. I think we must already have the info in WAL about which tuple inside the full page image is new (the one for which we avoided the WAL entry). How about this: leave the current WAL as it is and only add the not-removable flag to full pages. pg_compresslog then replaces the full page image with a record for the one tuple that is changed. I tend to think it is not worth the increased complexity only to save bytes in the uncompressed WAL, though.

It is essentially what my patch proposes. My patch includes a flag on full page writes which "can be" removed.

Ok, a flag that marks full page images that can be removed is perfect. But you also turn off the optimization that avoids writing regular WAL records when the info is already contained in a full-page image (increasing the uncompressed size of WAL). It was that part I questioned.
As already stated, maybe I should not have, because it would be too complex to reconstruct a regular WAL record from the full-page image. But that code would also be needed for WAL-based partial replication, so if it were too complicated we would eventually want a switch to turn off the optimization anyway (at least for heap page changes).

Another point about pg_decompresslog: why do you need a pg_decompresslog? Imho pg_compresslog should already do the replacing of the full page with the dummy entry. Then pg_decompresslog could be a simple gunzip, or whatever compression was used, but no logic.

Just removing full page writes does not work. If we shift the rest of the WAL, then the LSNs become inconsistent in the compressed archive logs which pg_compresslog produces. For recovery, we have to restore the LSNs of the original WAL. pg_decompresslog restores removed full page writes as dummy records so that recovery redo functions won't be confused.

Ah sorry, I needed some pgsql/src/backend/access/transam/README reading. The LSN is the physical position of a record in WAL. Thus your dummy record size is equal to what you cut out of the original record.

What about disconnecting the WAL LSN from the physical WAL record position during replay? Add simple short WAL records in pg_compresslog like: advance LSN by 8192 bytes.

Andreas

--
Koichi Suzuki
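A small sketch of the LSN-consistency point being made here: since the LSN is a byte position in WAL, the dummy record emitted by pg_decompresslog must be exactly as large as what pg_compresslog cut out, or every record after it shifts to the wrong LSN. The function and numbers below are hypothetical illustrations, not the actual tools' logic:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Position of the byte after a record once it has passed through
 * compression (stripped_bytes removed) and decompression (dummy_bytes
 * padded back in).  Only when dummy_bytes == stripped_bytes does the
 * next record land at its original LSN.
 */
static uint64_t
lsn_after_restore(uint64_t lsn_before, uint32_t record_len,
                  uint32_t stripped_bytes, uint32_t dummy_bytes)
{
    return lsn_before + (record_len - stripped_bytes) + dummy_bytes;
}
```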
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Hi,

Sorry, inline reply.

Zeugswetter Andreas ADI SD wrote:

Yup, this is a good summary.

You say you need to remove the optimization that avoids the logging of a new tuple because the full page image exists. I think we must already have the info in WAL about which tuple inside the full page image is new (the one for which we avoided the WAL entry). How about this: leave the current WAL as it is and only add the not-removable flag to full pages. pg_compresslog then replaces the full page image with a record for the one tuple that is changed. I tend to think it is not worth the increased complexity only to save bytes in the uncompressed WAL, though.

It is essentially what my patch proposes. My patch includes a flag on full page writes which "can be" removed.

Another point about pg_decompresslog: why do you need a pg_decompresslog? Imho pg_compresslog should already do the replacing of the full page with the dummy entry. Then pg_decompresslog could be a simple gunzip, or whatever compression was used, but no logic.

Just removing full page writes does not work. If we shift the rest of the WAL, then the LSNs become inconsistent in the compressed archive logs which pg_compresslog produces. For recovery, we have to restore the LSNs of the original WAL. pg_decompresslog restores removed full page writes as dummy records so that recovery redo functions won't be confused.

Regards;

Andreas

--
Koichi Suzuki
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
I don't fully understand what "transaction log" means. If it means "archived WAL", the current (8.2) code handles WAL as follows:

1) If full_page_writes=off, then no full page writes will be written to WAL, except for those during online backup (between pg_start_backup and pg_stop_backup). The WAL size will be considerably smaller, but it cannot recover from partial/inconsistent writes to the database files. We have to go back to the online backup and apply all the archive logs.

2) If full_page_writes=on, then full page writes will be written at the first update of a page after each checkpoint, plus the full page writes of 1). Because we have no means (in 8.2) to optimize the WAL so far, all we can do is copy the WAL or gzip it at archive time.

If we'd like to keep a good chance of recovery after a crash, 8.2 provides only method 2), leaving the archive log size considerably large. My proposal maintains the chance of crash recovery as in the case of full_page_writes=on, and reduces the size of the archived log as in the case of full_page_writes=off.

Regards;

Hannu Krosing wrote:

On Tue, 2007-04-10 at 18:17, Joshua D. Drake wrote:

In terms of idle time for gzip and other commands to archive WAL offline, no difference in the environment was given other than the command to archive. My guess is that because the user time is very large in gzip, there is more chance for the scheduler to give resources to other processes. In the case of cp, idle time is more than 30 times longer than user time. pg_compresslog uses seven times longer idle time than user time. On the other hand, gzip uses less idle time than user time. Considering the total amount of user time, I think it's a reasonable measure.

Again, in my proposal, the issue is not to increase run-time performance. The issue is to decrease the size of the archive log to save storage.

Considering the relatively little amount of storage a transaction log takes, it would seem to me that the performance angle is more appropriate.
As I understand it, it's not about the transaction log but about the write-ahead log, and the amount of data in WAL can become very important once you have to keep standby servers in different physical locations (cities, countries or continents) where channel throughput and cost come into play. With simple cp (scp/rsync) the amount of WAL data needing to be copied is about 10x more than the data collected by trigger-based solutions (Slony/pgQ). With pg_compresslog, WAL shipping seems to involve roughly the same amount of data and thus becomes a viable alternative again.

Is it more efficient in other ways besides the negligible tps difference? Possibly more efficient memory usage? Better restore times for a crashed system?

I think that TPS is affected more by the number of writes than by the size of each block written, so there is probably not that much to gain in TPS, except perhaps from better disk cache usage. For me pg_compresslog seems to be a winner even if it just does not degrade performance.

-- Koichi Suzuki

---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
The numbers below were taken with 8.2 code, not 8.3 code, so I don't think the measurements are affected by a bug that exists only in the 8.3 code.

Tom Lane wrote:
> Koichi Suzuki <[EMAIL PROTECTED]> writes:
>> For more information, when checkpoint interval is one hour, the amount
>> of the archived log size was as follows:
>> cp: 3.1GB
>> gzip: 1.5GB
>> pg_compresslog: 0.3GB
>
> The notion that 90% of the WAL could be backup blocks even at very long
> checkpoint intervals struck me as excessive, so I went looking for a
> reason, and I may have found one. There has been a bug in CVS HEAD
> since Feb 8 causing every btree page split record to include a backup
> block whether needed or not. If these numbers were taken with recent
> 8.3 code, please retest with current HEAD.
>
> regards, tom lane

-- Koichi Suzuki
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Hi,

In the case below, we ran the DBT-2 benchmark for one hour to get the measurements. Checkpoints occurred three times (the checkpoint interval was 20 min). For more information, when the checkpoint interval is one hour, the amount of archived log was as follows:

cp: 3.1GB
gzip: 1.5GB
pg_compresslog: 0.3GB

For both cases, the database size was 12.7GB, relatively small. As pointed out, if we never run a checkpoint, the value for cp will approach that for pg_compresslog, but that is not practical.

The point here is: if we collect archive logs with cp and the average workload is a quarter of full power, cp archiving will produce about 0.8GB of archive log per hour (for the DBT-2 case; of course the size depends on the nature of the transactions). If we run the database all day, the archive log will grow as large as the database itself; after one week it will be seven times the size of the database. This is the point. In production, such a large archive log raises storage cost. The purpose of the proposal is not to improve performance, but to decrease the size of the archive log to save storage, while preserving the same chance of crash recovery as full_page_writes=on.

Because of DBT-2's nature, it is not meaningful to compare throughput (the database size determines the number of transactions to run). Instead, I compared throughput using pgbench. The measurements are: cp: 570tps, gzip: 558tps, pg_compresslog: 574tps; the difference is negligible.

In terms of idle time for gzip and other commands archiving WAL offline, there was no difference in the environment other than the archiving command. My guess is that because user time is very large in gzip, the scheduler has more chances to give resources to other processes. In the case of cp, idle time is more than 30 times longer than user time; pg_compresslog uses seven times more idle time than user time; gzip, on the other hand, uses less idle time than user time. Considering the total amount of user time, I think it's a reasonable measure.

Again, the point of my proposal is not to increase run-time performance; it is to decrease the size of the archive log to save storage.

Regards;

Tom Lane wrote:
> Koichi Suzuki <[EMAIL PROTECTED]> writes:
>> My proposal is to remove unnecessary full page writes (they are needed
>> in crash recovery from inconsistent or partial writes) when we copy WAL
>> to archive log and rebuild them as dummies when we restore from archive
>> log.
>> ...
>> Benchmark: DBT-2
>> Database size: 120WH (12.3GB)
>> Total WAL size: 4.2GB (after 60min. run)
>> Elapsed time:
>>   cp: 120.6sec
>>   gzip: 590.0sec
>>   pg_compresslog: 79.4sec
>> Resultant archive log size:
>>   cp: 4.2GB
>>   gzip: 2.2GB
>>   pg_compresslog: 0.3GB
>> Resource consumption:
>>   cp:   user: 0.5sec   system: 15.8sec  idle: 16.9sec  I/O wait: 87.7sec
>>   gzip: user: 286.2sec system: 8.6sec   idle: 260.5sec I/O wait: 36.0sec
>>   pg_compresslog:
>>         user: 7.9sec   system: 5.5sec   idle: 37.8sec  I/O wait: 28.4sec
>
> What checkpoint settings were used to make this comparison? I'm
> wondering whether much of the same benefit can't be bought at zero cost
> by increasing the checkpoint interval, because that translates directly
> to a reduction in the number of full-page images inserted into WAL.
>
> Also, how much was the database run itself slowed down by the increased
> volume of WAL (due to duplicated information)? It seems rather
> pointless to me to measure only the archiving effort without any
> consideration for the impact on the database server proper.
>
> regards, tom lane
>
> PS: there's something fishy about the gzip numbers ... why all the idle
> time?

-- Koichi Suzuki
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Hi,

I agree to put the patch into core and the others (pg_compresslog and pg_decompresslog) into contrib/lesslog. I will prepare separate materials to go to core and contrib. As for the patches, we have tested against pgbench, DBT-2 and our proprietary benchmarks, and they looked to run correctly.

Regards;

Simon Riggs wrote:
On Tue, 2007-04-03 at 19:45 +0900, Koichi Suzuki wrote:
Bruce Momjian wrote: Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches
Thank you very much for including it. Attached is an update of the patch according to Simon Riggs's comment about the GUC name.

The patch comes with its own "install kit", which is great to review (many thanks), but makes it hard to determine where you think the code should go when committed. My guess based on your patch:
- the patch gets applied to core :-)
- pg_compresslog *and* pg_decompresslog go to a contrib directory called contrib/lesslog?

Can you please produce a combined patch that does all of the above, plus edits the contrib Makefile to add all of those, as well as editing the README so it doesn't mention the patch, just the contrib executables? The patch looks correct to me now. I haven't tested it yet, but will be doing so in the last week of April, which is when I'll be doing docs for this and other stuff, since time is pressing now.

-- Koichi Suzuki
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Bruce Momjian wrote: Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches

Thank you very much for including it. Attached is an update of the patch according to Simon Riggs's comment about the GUC name. Regards;

-- Koichi Suzuki

20070403_pg_lesslog.tgz Description: Binary data
Re: [PATCHES] [HACKERS] Full page writes improvement, code update again.
Here's the third revision of the WAL archival optimization patch. The GUC parameter name was changed to wal_add_optimization_info. Regards;

-- Koichi Suzuki

20070403_pg_lesslog.tar.gz Description: application/gzip
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Tom Lane wrote:
> "Simon Riggs" <[EMAIL PROTECTED]> writes:
>> Any page written during a backup has a backup block that would not be
>> removable by Koichi's tool, so yes, you'd still be safe.
>
> How does it know not to do that?
>
> regards, tom lane

XLogInsert() already has logic to determine whether the WAL record being inserted falls between pg_start_backup and pg_stop_backup. Currently it is used to decide whether full page writes can be omitted when full_page_writes=off. We can use this to mark WAL records. We have one unused bit in the WAL record header, the last bit of xl_info, where the upper four bits indicate the resource manager and three of the remaining bits indicate the number of full page writes included in the record. In my proposal, this unused bit is used to mark that the full page writes must not be removed during offline optimization by pg_compresslog.

Regards;

-- Koichi Suzuki
Re: [PATCHES] Full page writes improvement, code update
Simon; Tom;

This is Koichi. Your question was how to determine which WAL records are generated between pg_start_backup and pg_stop_backup, and here's an answer. XLogInsert() already has logic to determine whether the WAL record being inserted falls between pg_start_backup and pg_stop_backup. Currently it is used to decide whether full page writes can be omitted when full_page_writes=off. We can use this to mark WAL records. We have one unused bit in the WAL record header, the last bit of xl_info, where the upper four bits indicate the resource manager and three of the remaining bits indicate the number of full page writes included in the record. So in my proposal, this unused bit is used to mark that the full page writes must not be removed during offline optimization by pg_compresslog.

Sorry, I didn't have mailing list capability from home and have just completed my subscription from home. I had to create a new thread to continue my post; sorry for the confusion. Please refer to the original thread for this discussion.

Best Regards;

-- Koichi Suzuki
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Hi,

Here's a patch reflecting some of Simon's comments:
1) Removed an elog call in a critical section.
2) Changed the names of the commands to pg_compresslog and pg_decompresslog.
3) Changed the diff option used to make the patch.

-- Koichi Suzuki

pg_lesslog.tgz Description: Binary data
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Josh;

I'd like to explain again what the term "compression" in my proposal means, and to show the resource consumption comparison with cp and gzip.

My proposal is to remove unnecessary full page writes (they are needed in crash recovery from inconsistent or partial writes) when we copy WAL to the archive log, and to rebuild them as dummies when we restore from the archive log. The dummy is needed to maintain LSNs. So it is very different from general-purpose compression such as gzip, although pg_compresslog compresses the archive log as a result.

As to CPU and I/O consumption, I've already evaluated as follows: 1) collect all the WAL segments, 2) copy them by different means (cp, pg_compresslog and gzip), and compare the elapsed time as well as other resource consumption.

Benchmark: DBT-2
Database size: 120WH (12.3GB)
Total WAL size: 4.2GB (after 60min. run)
Elapsed time:
  cp: 120.6sec
  gzip: 590.0sec
  pg_compresslog: 79.4sec
Resultant archive log size:
  cp: 4.2GB
  gzip: 2.2GB
  pg_compresslog: 0.3GB
Resource consumption:
  cp:   user: 0.5sec   system: 15.8sec  idle: 16.9sec  I/O wait: 87.7sec
  gzip: user: 286.2sec system: 8.6sec   idle: 260.5sec I/O wait: 36.0sec
  pg_compresslog:
        user: 7.9sec   system: 5.5sec   idle: 37.8sec  I/O wait: 28.4sec

Because the resultant log size is considerably smaller than with cp or gzip, pg_compresslog needs much less I/O, and because the logic is much simpler than gzip's, it does not consume much CPU. The term "compress" may not be appropriate; we may call this "log optimization" instead. So I don't see any reason why this (at least the optimization "mark" in each log record) can't be integrated.

Simon Riggs wrote:
On Thu, 2007-03-29 at 11:45 -0700, Josh Berkus wrote:
OK, different question: Why would anyone ever set full_page_compress = off?

The only reason I can see is if compression costs us CPU but gains RAM & I/O. I can think of a lot of applications ... benchmarks included ... which are CPU-bound but not RAM or I/O bound. For those applications, compression is a bad tradeoff. If, however, CPU used for compression is made up elsewhere through smaller file processing, then I'd agree that we don't need a switch.

As I wrote in reply to Simon's comment, I am concerned about only one thing. Without a switch, because both the full page writes and the corresponding logical log are included in WAL, WAL size will increase slightly (maybe about five percent or so). If everybody is happy with this, we don't need a switch.

Koichi-san has explained things for me now. I misunderstood what the parameter did and, reading your post, ISTM you have as well. I do hope Koichi-san will alter the name to allow everybody to understand what it does.

Here are some candidates:
full_page_writes_optimize
full_page_writes_mark: means it marks each full_page_write as "needed in crash recovery", "needed in archive recovery" and so on.
I don't insist on these names. It would be very helpful if you have any suggestions that reflect what it really means.

Regards;

-- Koichi Suzuki
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Hi,

Here's some feedback on the comments:

Simon Riggs wrote:
On Wed, 2007-03-28 at 10:54 +0900, Koichi Suzuki wrote:
As written below, full page writes can be categorized as follows:
1) Needed for crash recovery: the first page update after each checkpoint. This has to be kept in WAL.
2) Needed for archive recovery: page updates between pg_start_backup and pg_stop_backup. These have to be kept in the archive log.
3) For a log-shipping slave such as pg_standby: no full page writes are needed for this purpose.
My proposal deals with 2). So if we mark each full_page_write, I'd rather mark when it is needed. We still need only one bit, because case 3) does not need any mark.

I'm very happy with this proposal, though I do still have some points in detailed areas. If you accept that 1 & 2 are valid goals, then 1 & 3 or 1, 2 & 3 are also valid goals, ISTM. i.e. you might choose to use full_page_writes on the primary and yet would like to see optimised data transfer to the standby server. In that case, you would need the mark.

Yes, I need the mark. In my proposal, only unmarked full page writes, which were written as the first update after a checkpoint, are to be removed offline (by pg_compresslog).

- Not sure why we need "full_page_compress", why not just mark them always? That harms noone. (Did someone else ask for that? If so, keep it)

No, no one asked to have a separate option. There'll be no bad influence in doing so. So if we mark each full_page_write, I'd rather mark when it is needed. We still need only one bit, because case 3) does not need any mark.

OK, different question: Why would anyone ever set full_page_compress = off? Why have a parameter that does so little? ISTM this is:
i) one more thing to get wrong
ii) cheaper to mark the block when appropriate than to perform the if() test each time. That can be done only in the path where backup blocks are present.
iii) If we mark the blocks every time, it allows us to do an offline WAL compression.
If the blocks aren't marked, that option is lost. The bit is useful information, so we should have it in all cases.

Not only full page writes are written as WAL records. In my proposal, both the full page write and the corresponding logical log are written in a WAL record, which will make the WAL slightly bigger (five percent or so). If full_page_compress = off, only the full page write will be written in a WAL record. I thought someone would not be happy with this size growth. I agree to make this mandatory if everybody is happy with the extra logical log in WAL records that carry full page writes. I'd like to have your opinion.

- OTOH I'd like to see an explicit parameter set during recovery since you're asking the main recovery path to act differently in case a single bit is set/unset. If you are using that form of recovery, we should say so explicitly, to keep everybody else safe.

The only thing I had to do is create "dummy" full page writes to maintain LSNs. Full page writes are omitted in the archive log, but we have to keep the LSNs the same as those in the original WAL. In this case, recovery has to read the logical log, not the "dummy" full page writes. On the other hand, if both a logical log and a "real" full page write are found in a log record, recovery has to use the "real" full page write.

I apologise for not understanding your reply; perhaps my original request was unclear. In recovery.conf, I'd like to see a parameter such as
dummy_backup_blocks = off (default) | on
to explicitly indicate to the recovery process that backup blocks are present, yet they are garbage and should be ignored. Having garbage data within the system is potentially dangerous and I want to be told by the user that they were expecting that and it's OK to ignore that data. Otherwise I want to throw informative errors. Maybe it seems OK now, but the next change to the system may have unintended consequences and it may not be us making the change.
"It's OK, the Alien will never escape from the lab" is the starting premise for many good sci-fi horrors and I want to watch them, not be in one myself. :-) We can call it other things, of course. e.g.
ignore_dummy_blocks
decompressed_blocks
apply_backup_blocks

So far, we don't need any modification to the recovery and redo functions. They ignore the dummies and apply the logical logs. Also, if there are both a full page write and a logical log, the current recovery selects the full page write to apply. I agree to introduce this option if the 8.3 code introduces any conflict with the current behavior. Or we could introduce this option for future safety. Do you think we should introduce this option? If it should be introduced now, what we should do is check this option when a dummy full page write appears.

Yes, I believe so. As pg_standby does not include any chance to meet partial writes of pa
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Simon;

Thanks a lot for your comments and advice. I'd like to give some feedback.

Simon Riggs wrote:
On Tue, 2007-03-27 at 11:52 +0900, Koichi Suzuki wrote:
Here's an update of a code to improve full page writes as proposed in
http://archives.postgresql.org/pgsql-hackers/2007-01/msg01491.php and
http://archives.postgresql.org/pgsql-patches/2007-01/msg00607.php
The update includes some modifications for error handling in the archiver and restoration command. In the previous threads, I posted several evaluations and showed that we can keep all the full page writes needed for full XLOG crash recovery, while compressing the archive log considerably better than gzip, with less CPU consumption. I've found no further objections to this proposal but would still like to hear comments/opinions/advice.

Koichi-san,

Looks interesting. I like the small amount of code to do this. A few thoughts:

- Not sure why we need "full_page_compress", why not just mark them always? That harms noone. (Did someone else ask for that? If so, keep it)

No, no one asked to have a separate option. There'll be no bad influence in doing so. As written below, full page writes can be categorized as follows:
1) Needed for crash recovery: the first page update after each checkpoint. This has to be kept in WAL.
2) Needed for archive recovery: page updates between pg_start_backup and pg_stop_backup. These have to be kept in the archive log.
3) For a log-shipping slave such as pg_standby: no full page writes are needed for this purpose.
My proposal deals with 2). So if we mark each full_page_write, I'd rather mark when it is needed. We still need only one bit, because case 3) does not need any mark.

- OTOH I'd like to see an explicit parameter set during recovery since you're asking the main recovery path to act differently in case a single bit is set/unset. If you are using that form of recovery, we should say so explicitly, to keep everybody else safe.
The only thing I had to do is create "dummy" full page writes to maintain LSNs. Full page writes are omitted in the archive log, but we have to keep the LSNs the same as those in the original WAL. In this case, recovery has to read the logical log, not the "dummy" full page writes. On the other hand, if both a logical log and a "real" full page write are found in a log record, recovery has to use the "real" full page write.

- I'd rather mark just the nonremovable blocks. But no real difference

It sounds nicer. According to the full page write categories above, we can mark full page writes as needed for "crash recovery" or for "archive recovery". Please give some feedback on the above full page write categories.

- We definitely don't want a normal elog in an XLogInsert critical section, especially at DEBUG1 level

I agree. I'll remove the elog calls from critical sections.

- diff -c format is easier and the standard

I'll change the diff option.

- pg_logarchive and pg_logrestore should be called by a name that reflects what they actually do. Possibly pg_compresslog and pg_decompresslog etc.. I've not reviewed those programs, but:

I wasn't careful to choose command names that reflect what they actually do. I'll change the command names.

- Not sure why we have to compress away page headers. Touch as little as you can has always been my thinking with recovery code.

Eliminating page headers gives a slightly better compression rate, a couple of percent. To make the code simpler, I'll drop this compression.

- I'm very uncomfortable with touching the LSN. Maybe I misunderstand?

The reason why we need pg_logrestore (or pg_decompresslog) is to keep the LSNs the same as those in the original WAL. For this purpose, we need the "dummy" full page writes. I don't like to touch LSNs either, and this is the reason why pg_decompresslog needs some extra work, which eliminates many other changes in the recovery code.

- Have you thought about how pg_standby would integrate with this option? Can you please?

Yes, I believe so.
As pg_standby does not include any chance to meet partial writes of pages, I believe you can omit all the full page writes. Of course, as Tom Lane suggested in
http://archives.postgresql.org/pgsql-hackers/2007-02/msg00034.php
removing full page writes loses the chance to recover from partial/inconsistent writes after a crash of pg_standby. In that case, we have to import a backup and the archive logs (with the full page writes taken during the backup) to recover. (We have to import them when the file system crashes anyway.) If that's okay, I believe pg_compresslog/pg_decompresslog can be integrated with pg_standby. Maybe we can work together to include pg_compresslog/pg_decompresslog in pg_standby.

- I'll do some docs for this after Freeze, if you'd like. I have some other changes to make there, so I can do this at the same time.

I'll be
[PATCHES] Full page writes improvement, code update
Hi,

Here's an update of the code to improve full page writes, as proposed in
http://archives.postgresql.org/pgsql-hackers/2007-01/msg01491.php and
http://archives.postgresql.org/pgsql-patches/2007-01/msg00607.php

The update includes some modifications for error handling in the archiver and restoration command. In the previous threads, I posted several evaluations and showed that we can keep all the full page writes needed for full XLOG crash recovery, while compressing the archive log considerably better than gzip, with less CPU consumption. I've found no further objections to this proposal but would still like to hear comments/opinions/advice.

Regards;

-- Koichi Suzuki

pg_lesslog.tgz Description: Binary data
Re: [HACKERS] [PATCHES] Full page writes improvement
Full_page_compress is not intended for use with a PITR slave, but for the case where we keep both an online backup and archive logs for archive recovery, which is a very popular PostgreSQL operation now. I've just posted my evaluation of the patch as a reply in another thread on the same proposal (sorry, I created a new thread because the old one seemed not good). It compares log compression with the gzip case. Also, our proposal can be combined with gzip. Its overall overhead is slightly less than just copying WAL using cat. As a result, my proposal does not involve serious overhead. Please refer to the thread "Archive log compression keeping physical log available in the crash recovery". I'd appreciate further opinions/comments on this, and would like more suggestions on which evaluations would be useful. I've posted two (archive and restore) commands and a small patch. The two commands can be treated as contrib, and the patch itself works if WAL is simply copied to the archive directory.

Regards;
Koichi Suzuki

Tom Lane wrote:
> Koichi Suzuki <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> Doesn't this break crash recovery on PITR slaves?
>
>> Compressed archive log contains the same data as full_page_writes off
>> case. So the influence to PITR slaves is the same as full_page_writes off.
>
> Right. So what is the use-case for running your primary database with
> full_page_writes on and the slaves with it off? It doesn't seem like
> a very sensible combination to me.
>
> Also, it seems to me that some significant performance hit would be
> taken by having to grovel through the log files to remove and re-add the
> full-page data. Plus you are actually writing *more* WAL data out of
> the primary, not less, because you have to save both the full-page
> images and the per-tuple data they normally replace. Do you have
> numbers showing that there's actually any meaningful savings overall?
>
> regards, tom lane

-- Koichi Suzuki
Re: [HACKERS] [PATCHES] Full page writes improvement
Tom Lane wrote:
> Koichi Suzuki <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> Doesn't this break crash recovery on PITR slaves?
>
>> Compressed archive log contains the same data as full_page_writes off
>> case. So the influence to PITR slaves is the same as full_page_writes off.
>
> Right. So what is the use-case for running your primary database with
> full_page_writes on and the slaves with it off? It doesn't seem like
> a very sensible combination to me.
>
> Also, it seems to me that some significant performance hit would be
> taken by having to grovel through the log files to remove and re-add the
> full-page data. Plus you are actually writing *more* WAL data out of
> the primary, not less, because you have to save both the full-page
> images and the per-tuple data they normally replace. Do you have
> numbers showing that there's actually any meaningful savings overall?
>
> regards, tom lane

Yes, I have some evaluations showing that we are writing less and using less resources overall. Please give me a couple of days to translate them. In the case of a PITR slave, because archive logs are read within a short period, the amount of archive log may not be an issue. In the case where an online backup and archive logs must be kept for a (relatively) long period, the archive log size is a major issue.

K.Suzuki

-- Koichi Suzuki
Re: [HACKERS] [PATCHES] Full page writes improvement
Tom Lane wrote:
> Koichi Suzuki <[EMAIL PROTECTED]> writes:
>> Here's an idea and a patch for full page writes improvement.
>
>> Idea:
>> (1) keep full page writes for ordinary WAL, make them available during
>> the crash recovery, -> recovery from inconsistent pages which can be
>> made at the crash,
>> (2) Remove them from the archive log except for those written during
>> online backup (between pg_start_backup and pg_stop_backup) -> small size
>> archive log.
>
> Doesn't this break crash recovery on PITR slaves?
>
> regards, tom lane

The compressed archive log contains the same data as in the full_page_writes=off case, so the influence on PITR slaves is the same as with full_page_writes=off.

K.Suzuki

-- Koichi Suzuki
[PATCHES] Full page writes improvement
Here's an idea and a patch for full page writes improvement.

Idea:
(1) Keep full page writes in ordinary WAL and make them available during crash recovery -> recovery from inconsistent pages which can be produced by the crash.
(2) Remove them from the archive log, except for those written during online backup (between pg_start_backup and pg_stop_backup) -> small archive log.

Implementation:
(1) Mark WAL records whose full page writes can be removed,
(2) Remove full page writes from the marked WAL records in the archive command, and
(3) Restore the removed full page writes to keep LSNs consistent.

Included are a patch for this as well as the archive and restore command sources. The patch is very small and I hope it can be included in 8.3.

-- Koichi Suzuki

pg_lesslog.tar.gz Description: application/gzip
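For context, archive and restore commands like the ones posted are wired into the server through archive_command and restore_command. The invocation and paths below are hypothetical, just to show where such commands would sit in the standard continuous-archiving setup (%p and %f are the standard placeholders for the WAL file path and file name):

```shell
# postgresql.conf (archiving side; directory is illustrative)
archive_command = 'pg_compresslog %p /mnt/server/archivedir/%f'

# recovery.conf (restore side; directory is illustrative)
restore_command = 'pg_decompresslog /mnt/server/archivedir/%f %p'
```

The point of the design is that the server itself is almost untouched: compression and decompression happen entirely in the archive/restore hooks.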
Re: [PATCHES] A couple of patches for PostgreSQL 64bit support
Thanks a lot for the port to CVS. I agree that we need more benckmark efforts to clarify real outcome of "more than 2GB" memory. Please let me spend some more for this. I will post benchmark results. As long as I see from pgbench, it looks more memory gets more throuput. Maybe big SQL against big dataset is another example to show the effect. I also agree that we need much more study to show the effect of 64bit TID (and perhaps CID). Based on the patch I posted, I'll continue my effort and also post the results for discussion. Best Regards; Tom Lane wrote: > Koichi Suzuki <[EMAIL PROTECTED]> writes: > >>Here're a couple of patches for PostgreSQL 64bit support. There're just >>two extension to 64bit, size of shared memory and transaction ID. > > > I've applied the part of this that seemed reasonably noncontroversial, > namely the fixes to do shared memory size calculation in size_t > arithmetic instead of int arithmetic. (While at it, I extended the Size > convention to all the shared memory sizing routines, not just buffers, > and added code to detect overflows in the calculations. That way we > don't need a "64 bit" configure switch.) While I still remain > unconvinced that there's any real-world need for more than 2Gb of > shared_buffers, this change certainly makes the code more robust against > configuration errors, and it has essentially zero cost (except maybe a > few more milliseconds during postmaster start). > > On the other hand, I think the 64-bit XID idea needs considerably more > work before being proposed again. That would incur serious costs due > to the expansion of tuple headers, and there's no evidence that the > distributed cost would be bought back by avoiding one vacuum pass every > billion transactions. (Your description of the patch claimed that > vacuuming requires shutting down the database, which is simply wrong.) 
> Also, as previously noted, you can't just whack the size of XID around > without considering side-effects on CID, OID, Datum, etc. > > regards, tom lane > -- --- Koichi Suzuki Open Source Engineering Department, NTT DATA Intellilink Corporation Phone: +81-3-5566-9628 WWW: http://www.intellilink.co.jp --
Re: [PATCHES] A couple of patches for PostgreSQL 64bit support
Mark, I've not looked at CVS in detail. I began this work against 8.0.1 and continued through 8.0.2 to 8.0.3. It was not a great deal of work. The patch is rather straightforward, and I would appreciate it if you tried porting it against CVS. Mark Wong wrote: Hi, I grabbed the patches to try, but I was wondering if it would be more interesting to try them against CVS rather than 8.0.3 (and if it would be easy to port :)? Mark -- --- Koichi Suzuki Open Source Engineering Department, NTT DATA Intellilink Corporation Phone: +81-3-5566-9628 WWW: http://www.intellilink.co.jp --
Re: [PATCHES] A couple of patches for PostgreSQL 64bit support
> I asked originally for some experimental evidence showing any value > in having more than 2Gb of shared buffers. In the absence of any > convincing demonstration, I'm not very inclined to worry about whether > we can handle wider-than-int shared memory size. Hi, Attached is a result of pgbench with the 64bit-patched PostgreSQL (base is 8.0.1). The benchmark machine is a dual Opteron (1.4GHz, 1MB cache each) with 8GB of memory and a 120GB IDE hard disk. Koichi Suzuki wrote: >> I have some experimental data about this extension. I will gather it >> and post it, hopefully this weekend. ----------- Koichi Suzuki Open Source Engineering Department, NTT DATA Intellilink Corporation Phone: +81-3-5566-9628 WWW: http://www.intellilink.co.jp -- 64-bit-pgbench20050712.pdf Description: Adobe PDF document
Re: [PATCHES] A couple of patches for PostgreSQL 64bit support
I have some experimental data about this extension. I will gather it and post it, hopefully this weekend. Tom Lane wrote: > Koichi Suzuki <[EMAIL PROTECTED]> writes: > >>Here're a couple of patches for PostgreSQL 64bit support. There're just >>two extension to 64bit, size of shared memory and transaction ID. > > > I asked originally for some experimental evidence showing any value > in having more than 2Gb of shared buffers. In the absence of any > convincing demonstration, I'm not very inclined to worry about whether > we can handle wider-than-int shared memory size. > > As for the XID change, I don't think this patch accurately reflects the > size of the impact. There are a lot of things that in practice need to > be the same size as XID (CID, most obviously, but I suspect also OID). > And again, some demonstration of the performance impact would be > appropriate. Here, not only do you have to prove that widening XID > isn't a big performance hit in itself, but you also have to convince > us that it's a win compared to the existing approach of vacuuming at > least every billion transactions. > > regards, tom lane > -- --- Koichi Suzuki Open Source Engineering Department, NTT DATA Intellilink Corporation Phone: +81-3-5566-9628 WWW: http://www.intellilink.co.jp --
[PATCHES] A couple of patches for PostgreSQL 64bit support
Hi all, I have posted a couple of patches regarding 64bit environment support to the PATCHES ml. They expand the size of shared memory to 64bit space and extend XID to 64bit. Please take a look. -- --- Koichi Suzuki Open Source Engineering Department, NTT DATA Intellilink Corporation Phone: +81-3-5566-9628 WWW: http://www.intellilink.co.jp --
[PATCHES] A couple of patches for PostgreSQL 64bit support
Hi all, Here're a couple of patches for PostgreSQL 64bit support. There're just two extensions to 64bit: the size of shared memory and the transaction ID. Please take a look at overview.txt for this proposal and the patches, based upon 8.0.3. Any discussion is welcome. -- ------- Koichi Suzuki Open Source Engineering Department, NTT DATA Intellilink Corporation Phone: +81-3-5566-9628 WWW: http://www.intellilink.co.jp --

Proposal: 64bit Extension in PostgreSQL 8.0.x
July 7th, 2005
Koichi Suzuki (NTT DATA Intellilink)

1. Background and Purpose

64-bit CPUs are getting more and more popular among Intel-based CPUs, such as EM64T, AMD64 and IA64. Servers based upon such CPUs can provide much more memory: tens of gigabytes in a node, typically 16 gigabytes or so. Obviously, 32bit-based Linux and its applications can run on such a machine. However, from each process' point of view, the size of available memory is limited by the process user space, typically 1GB for the kernel and 3GB for each process. PostgreSQL's kernel uses shared memory to hold shared data, and much more memory should be made available as PostgreSQL handles bigger databases. For this purpose, we need to extend PostgreSQL to a 64-bit program and make shared memory management handle shared memory beyond the 32bit limitation.

On the other hand, PostgreSQL is being used in mission-critical enterprise systems. Such applications need long periods of operation without stopping the service. Currently, PostgreSQL's transaction ID is limited to a 32bit integer, and when it is about to run out, we have to stop database operation and run vacuum freeze to reuse older transaction ID values. A vacuum freeze operation scans the whole database, and with the bigger database sizes now in operation, it is going to take longer. Because PostgreSQL is being used in busier systems, transaction IDs tend to run out earlier.
To provide longer continuous operation, we need to make the transaction ID 64bit-based.

2. How can we do this?

These two extensions are basically very simple: we can locate the relevant definitions and change them into 64-bit based ones. However, there is much more to be done. We have to find all the lines which deal with these values and modify them so that there is no loss of calculation precision. Detailed results will be given later.

3. Environment

Our code will assume the following: 1) The PostgreSQL server (i.e. postmaster and postgres processes) runs only on 64-bit CPU based server machines, that is, EM64T, AMD64 and IA64. 2) Both 64bit (EM64T, AMD64 and IA64) and 32bit (IA32) CPUs are allowed as clients. 3) Currently, we support only the Linux 2.6.x kernel.

4. Specification changes

Due to these two changes, PostgreSQL's spec will change as follows:

4.1 Shared memory size

Shared memory size is specified by the shared_buffers entry in the postgresql.conf file. In a 32-bit environment, it is limited to INTMAX/BLOCKSIZE. The new limitation is INTMAX/2. This value specifies the number of blocks, so the actual memory which can be specified by this parameter will be (INTMAX/2)*BLOCKSIZE. Typical BLOCKSIZE is 8KB, so the typical maximum shared memory will be 8TB. This is beyond the limitation of Linux user space for EM64T (512GB) and should be sufficient. The reason why the shared memory size specification is limited to INTMAX/2 is as follows: in 8.0.x buffer management, (buffer_number)*2 is used to produce buffer IDs, and this whole subsystem is still based on 32bit calculation. We'd like to keep the change as small as possible, and having this limit does not reduce the maximum actual shared memory size in practice.

4.2 Transaction ID (XID)

The type of the transaction ID is pre-defined in the catalog, and there is no way for users to redefine it.
In this extension, transaction IDs are handled as follows: 1) In the catalog, transaction IDs (such as XMIN and XMAX) are given the type "XID", as in the 32-bit environment. 2) If an application in a 32-bit environment tries to read a value of the type "XID", it has to read the value as an "unsigned long long" value. 3) In a 64-bit environment, the value of "XID" can be handled as an "unsigned long" value, depending on the compiler. 4) The length of the type "XID" is stored in the catalog "pg_type".

4.3 Configure

We have added two configuration options: --enable-64bit-shared-memory enables 64bit shared memory by defining USE_64BIT_SHARED. --enable-64bit-transaction-id enables 64-bit transact