Re: [PATCHES] Dead Space Map version 3 (simplified)

2007-04-20 Thread Heikki Linnakangas

ITAGAKI Takahiro wrote:

Attached is an updated DSM patch. I've left the core function of DSM only
and dropped other complicated features in this release.


We discussed it a long time ago already, but I really wished the DSM 
wouldn't need a fixed size shared memory area. It's one more thing the 
DBA needs to tune manually. It also means we need to have an algorithm 
for deciding what to keep in the DSM and what to leave out. And I don't 
see a good way to extend the current approach to implement the 
index-only-scans that we've been talking about, and the same goes for 
recovery. :(


The way you update the DSM is quite interesting. When a page is dirtied, 
the BM_DSM_DIRTY flag is set in the buffer descriptor. The corresponding 
bit in the DSM is set lazily in FlushBuffer whenever BM_DSM_DIRTY is 
set. That's a clever way to avoid contention on updates. But does it 
work for tables that have a small hot part that's updated very 
frequently? That's exactly the scenario where the DSM is the most 
useful. Hot pages stay in the buffer cache because they're frequently 
accessed, which means that FlushBuffer isn't getting called for them and 
the bits in the DSM aren't getting set until checkpoint. This could lead 
to unnecessary bloating of the hot part. A straightforward fix would be 
to scan the buffer cache for buffers marked with BM_DSM_DIRTY to update 
the DSM before starting the vacuum scan.


It might not be a problem in practice, but it bothers me that the DSM 
isn't 100% accurate. You end up having a page with dead tuples on it 
marked as non-dirty in the DSM at least when a page is vacuumed but 
there's some RECENTLY_DEAD tuples on it that become dead later on. There 
might be other scenarios as well.


If I'm reading the code correctly, DSM makes no attempt to keep the 
chunks ordered by block number. If that's the case, vacuum needs to be 
modified because it currently relies on the fact that blocks are scanned 
and the dead tuple list is therefore populated in order.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] [HACKERS] parser dilemma

2007-04-20 Thread Andrew Dunstan

Zoltan Boszormenyi wrote:

On the other hand, marking GENERATED as %right
solves this issue. I hope it's an acceptable solution.



If anything I should have thought it would be marked %nonassoc.

cheers

andrew


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-20 Thread Simon Riggs
On Fri, 2007-04-20 at 10:16 +0200, Zeugswetter Andreas ADI SD wrote:

> Your work in this area is extremely valuable and I hope my comments are
> not discouraging.

I think its too late in the day to make the changes suggested by
yourself and Tom. They make the patch more invasive and more likely to
error, plus we don't have much time. This really means the patch is
likely to be rejected at the 11th hour when what we have essentially
works...

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-20 Thread Koichi Suzuki
Here's only a part of the reply I should do, but as to I/O error
checking ...

Here's a list of system calls and other external function/library calls
used in pg_lesslog patch series, together with how current patch checks
each errors and how current postgresql source handles the similar calls:


1. No error check is done
1-1. fileno()
 fileno() is called against stdin and stdout from pg_compresslog.c
and pg_decompresslog.c.  They are intended to be invoked from a shell
and so stdin and stdout are both available.  fileno() error occurs only
if invoker of pg_compresslog or pg_decompresslog closes stdin and/or
stdout before the invoker executes them.   I found similar fileno()
usage in pg_dump/pg_backup_archive.c and postmaster/syslogger.c.   I
don't think this is an issue.

1-2. fflush()
fflush() is called against stdout within a debug routine, debug.c.
Such usage can also be found in bin/initdb.c, bin/scripts/createdb.c,
bin/psql/common.c and more.   I don't think this is an issue either.

1-3. printf() and fprintf()
 It is common practice not to check the error.   We can find such
calls in many of existing source codes.

1-4. strerror()
 It is checked that system call returns error before calling
strerror.   Similar code can be found in other PostgreSQL source too.

--
2. Error check is done
All the following function calls are associated with return value check.
open(), close(), fstat(), read(), write()

---
3. Functions do not return error
The following functin will not return errors, so no error check is needed.
exit(), memcopy(), memset(), strcmp()


I hope this helps.

Regards;

Tom Lane wrote:
> "Simon Riggs" <[EMAIL PROTECTED]> writes:
>> Writing lots of additional code simply to remove a parameter that
>> *might* be mis-interpreted doesn't sound useful to me, especially when
>> bugs may leak in that way. My take is that this is simple and useful
>> *and* we have it now; other ways don't yet exist, nor will they in time
>> for 8.3.
> 
> The potential for misusing the switch is only one small part of the
> argument; the larger part is that this has been done in the wrong way
> and will cost performance unnecessarily.  The fact that it's ready
> now is not something that I think should drive our choices.
> 
> I believe that it would be possible to make the needed core-server
> changes in time for 8.3, and then to work on compress/decompress
> on its own time scale and publish it on pgfoundry; with the hope
> that it would be merged to contrib or core in 8.4.  Frankly the
> compress/decompress code needs work anyway before it could be
> merged (eg, I noted a distinct lack of I/O error checking).
> 
>   regards, tom lane
> 


-- 
-
Koichi Suzuki

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-20 Thread Koichi Suzuki

Hi,

I agree that pg_compresslog should be aware of all the WAL records' 
details so that it can optimize archive log safely.   In my patch, I've 
examined 8.2's WAL records to make pg_compresslog/pg_decompresslog safe.


Also I agree further pg_compresslog maintenance needs to examine changes 
in WAL record format.   Because the number of such format will be 
limited, I think the amount of work will be reasonable enough.


Regards;

Simon Riggs wrote:

On Fri, 2007-04-13 at 10:36 -0400, Tom Lane wrote:

"Zeugswetter Andreas ADI SD" <[EMAIL PROTECTED]> writes:

But you also turn off the optimization that avoids writing regular
WAL records when the info is already contained in a full-page image
(increasing the uncompressed size of WAL).
It was that part I questioned.


I think its right to question it, certainly.


That's what bothers me about this patch, too.  It will be increasing
the cost of writing WAL (more data -> more CRC computation and more
I/O, not to mention more contention for the WAL locks) which translates
directly to a server slowdown.


I don't really understand this concern. Koichi-san has included a
parameter setting that would prevent any change at all in the way WAL is
written. If you don't want this slight increase in WAL, don't enable it.
If you do enable it, you'll also presumably be compressing the xlog too,
which works much better than gzip using less CPU. So overall it saves
more than it costs, ISTM, and nothing at all if you choose not to use
it.


The main arguments that I could see against Andreas' alternative are:

1. Some WAL record types are arranged in a way that actually would not
permit the reconstruction of the short form from the long form, because
they throw away too much data when the full-page image is substituted.
An example that's fresh in my mind is that the current format of the
btree page split WAL record discards newitemoff in that case, so you
couldn't identify the inserted item in the page image.  Now this is only
saving two bytes in what's usually going to be a darn large record
anyway, and it complicates the code to do it, so I wouldn't cry if we
changed btree split to include newitemoff always.  But there might be
some other cases where more data is involved.  In any case, someone
would have to look through every single WAL record type to determine
whether reconstruction is possible and fix it if not.

2. The compresslog utility would have to have specific knowledge about
every compressible WAL record type, to know how to convert it to the
short format.  That means an ongoing maintenance commitment there.
I don't think this is unacceptable, simply because we need only teach
it about a few common record types, not everything under the sun ---
anything it doesn't know how to fix, just leave alone, and if it's an
uncommon record type it really doesn't matter.  (I guess that means
that we don't really have to do #1 for every last record type, either.)

So I don't think either of these is a showstopper.  Doing it this way
would certainly make the patch more acceptable, since the argument that
it might hurt rather than help performance in some cases would go away.


Yeh, its additional code paths, but it sounds like Koichi-san and
colleagues are going to be trail blazing any bugs there and will be
around to fix any more that emerge.


What about disconnecting WAL LSN from physical WAL record position
during replay ?
Add simple short WAL records in pg_compresslog like: advance LSN by 8192
bytes.

I don't care for that, as it pretty much destroys some of the more
important sanity checks that xlog replay does.  The page boundaries
need to match the records contained in them.  So I think we do need
to have pg_decompresslog insert dummy WAL entries to fill up the
space saved by omitting full pages.


Agreed. I don't want to start touching something that works so well.


We've been thinking about doing this for at least 3 years now, so I
don't see any reason to baulk at it now. I'm happy with Koichi-san's
patch as-is, assuming further extensive testing will be carried out on
it during beta.




--
-
Koichi Suzuki

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PATCHES] [HACKERS] parser dilemma

2007-04-20 Thread Zoltan Boszormenyi

Zoltan Boszormenyi írta:

Martijn van Oosterhout írta:

On Thu, Apr 19, 2007 at 11:19:40AM +0200, Zoltan Boszormenyi wrote:
 

The problem comes from cases like

colname coltype DEFAULT 5! GENERATED ...

Since b_expr allows postfix operators, it takes one more token of
lookahead than we have to tell if the default expression is "5!"
or "5!GENERATED ...".
  


ISTM that as long as:

 colname coltype DEFAULT (5!) GENERATED ...

works I don't see why it would be a problem to require the parentheses
in this case. Postfis operators are not going to be that common here I
think.

Have a nice day,
  


You mean like this one?

*** gram.y.old  2007-04-20 09:23:16.0 +0200
--- gram.y  2007-04-20 09:25:34.0 +0200
***
*** 7550,7557 
   { $$ = (Node *) makeA_Expr(AEXPR_OP, 
$2, $1, $3, @2); }
   | qual_Op 
b_expr%prec Op
   { $$ = (Node *) makeA_Expr(AEXPR_OP, 
$1, NULL, $2, @1); }
!   | b_expr 
qual_Op%prec POSTFIXOP
!   { $$ = (Node *) makeA_Expr(AEXPR_OP, 
$2, $1, NULL, @2); }
   | b_expr IS DISTINCT FROM b_expr
%prec IS

   {
   $$ = (Node *) 
makeSimpleA_Expr(AEXPR_DISTINCT, "=", $1, $5, @2);

--- 7550,7557 
   { $$ = (Node *) makeA_Expr(AEXPR_OP, 
$2, $1, $3, @2); }
   | qual_Op 
b_expr%prec Op
   { $$ = (Node *) makeA_Expr(AEXPR_OP, 
$1, NULL, $2, @1); }
!   | '(' b_expr qual_Op
')' %prec POSTFIXOP
!   { $$ = (Node *) makeA_Expr(AEXPR_OP, 
$3, $2, NULL, @3); }
   | b_expr IS DISTINCT FROM b_expr
%prec IS

   {
   $$ = (Node *) 
makeSimpleA_Expr(AEXPR_DISTINCT, "=", $1, $5, @2);



This change alone brings 13 reduce/reduce conflicts.


On the other hand, marking GENERATED as %right
solves this issue. I hope it's an acceptable solution.

Best regards,
Zoltán Böszörményi

--
--
Zoltán Böszörményi
Cybertec Geschwinde & Schönig GmbH
http://www.postgresql.at/



psql-serial-42.diff.gz
Description: Unix tar archive

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-20 Thread Zeugswetter Andreas ADI SD

> With DBT-2 benchmark, I've already compared the amount of WAL.   The 
> result was as follows:
> 
> Amount of WAL after 60min. run of DBT-2 benchmark 
> wal_add_optimization_info = off (default) 3.13GB

how about wal_fullpage_optimization = on (default)
 
> wal_add_optimization_info = on (new case) 3.17GB -> can be 
> optimized to 0.31GB by pg_compresslog.
> 
> So the difference will be around a couple of percents.   I think this
is 
> very good figure.
> 
> For information,
> DB Size: 12.35GB (120WH)
> Checkpoint timeout: 60min.  Checkpoint occured only once in the run.

Unfortunately I think DBT-2 is not a good benchmark to test the disabled
wal optimization.
The test should contain some larger rows (maybe some updates on large
toasted values), and maybe more frequent checkpoints. Actually the poor
ratio between full pages and normal WAL content in this benchmark is
strange to begin with.
Tom fixed a bug recently, and it would be nice to see the new ratio. 

Have you read Tom's comment on not really having to be able to
reconstruct all record types from the full page image ? I think that
sounded very promising (e.g. start out with only heap insert/update). 

Then:
- we would not need the wal optimization switch (the full page flag
would always be added depending only on backup)
- pg_compresslog would only remove such "full page" images where it
knows how to reconstruct a "normal" WAL record from
- with time and effort pg_compresslog would be able to compress [nearly]
all record types's full images (no change in backend)

> I don't think replacing LSN works fine.  For full recovery to 
> the current time, we need both archive log and WAL.  
> Replacing LSN will make archive log LSN inconsistent with 
> WAL's LSN and the recovery will not work.

WAL recovery would have had to be modified (decouple LSN from WAL
position during recovery).
An "archive log" would have been a valid WAL (with appropriate LSN
advance records). 
 
> Reconstruction to regular WAL is proposed as 
> pg_decompresslog.  We should be careful enough not to make 
> redo routines confused with the dummy full page writes, as 
> Simon suggested.  So far, it works fine.

Yes, Tom didn't like "LSN replacing" eighter. I withdraw my concern
regarding pg_decompresslog.

Your work in this area is extremely valuable and I hope my comments are
not discouraging.

Thank you
Andreas

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PATCHES] HOT Patch - Ready for review

2007-04-20 Thread Pavan Deolasee

On 4/19/07, Heikki Linnakangas <[EMAIL PROTECTED]> wrote:



What's the purpose of the "HeapScanHintPagePrune" mechanism in index
builds? I lost track of the discussion on create index, is the it
necessary for correctness?



Its not required strictly for correctness, but it helps us prune the
HOT-chains
while index building. During index build, if we skip a tuple which is
RECENTLY_DEAD, existing transactions can not use the index for queries.
Pruning the HOT-chains reduces the possibility of finding such tuples
while building the index.


A comment in IndexBuildHeapScan explaining

why it's done would be nice.



I would do that.


In any case a PG_TRY/CATCH block should be

used to make sure it's turned off after an unsuccessful index build.



Oh thanks. Would do that too

I would wait for other review comments before submitting a fresh patch.
I hope thats ok.

Thanks,
Pavan
--

EnterpriseDB http://www.enterprisedb.com


[PATCHES] actualised forgotten Magnus's patch for plpgsql MOVE statement

2007-04-20 Thread Pavel Stehule

Hello

I refreshed Magnus's patch 
http://archives.postgresql.org/pgsql-patches/2007-02/msg00275.php from 
februar.


Regards

Pavel Stehule

p.s. scrollable cursors in plpgsql need little work still. I forgot for 
nonstandard (postgresql extension) direction forward all, forward n, 
backward n. Forward all propably hasn't sense.


_
Najdete si svou lasku a nove pratele na Match.com. http://www.msn.cz/
*** ./doc/src/sgml/plpgsql.sgml.orig	2007-04-20 09:01:50.0 +0200
--- ./doc/src/sgml/plpgsql.sgml	2007-04-20 09:09:04.0 +0200
***
*** 1524,1529 
--- 1524,1536 


 
+ A MOVE statement sets FOUND
+ true if is success, false if is out of table.
+
+   
+ 
+   
+
  A FOR statement sets FOUND true
  if it iterates one or more times, else false.  This applies to
  all three variants of the FOR statement (integer
***
*** 2567,2572 
--- 2574,2624 
   
  
  
+  MOVE
+ 
+ 
+ MOVE  direction FROM  cursor;
+ 
+ 
+ 
+ MOVE repositions a cursor without retrieving any data. MOVE works 
+ exactly like the FETCH  command, except it only positions the 
+ cursor and does not return rows. As with SELECT
+  INTO, the special variable FOUND can
+  be checked to see whether a cursor was repositioned or not.
+ 
+ 
+ 
+  The direction clause can be any of the
+  variants allowed in the SQL  command except the ones that can fetch
+  more than one row; namely, it can be
+  NEXT,
+  PRIOR,
+  FIRST,
+  LAST,
+  ABSOLUTE count,
+  RELATIVE count,
+  FORWARD, or
+  BACKWARD.
+  Omitting direction is the same
+  as specifying NEXT.
+  direction values that require moving
+  backward are likely to fail unless the cursor was declared or opened
+  with the SCROLL option.
+ 
+ 
+ 
+  Examples:
+ 
+ MOVE curs1;
+ MOVE LAST FROM curs3;
+ MOVE RELATIVE -2 FROM curs4;
+ 
+
+  
+ 
+ 
   CLOSE
  
  
*** ./src/pl/plpgsql/src/gram.y.orig	2007-04-19 19:15:28.0 +0200
--- ./src/pl/plpgsql/src/gram.y	2007-04-19 19:39:36.0 +0200
***
*** 125,131 
  %type 	stmt_assign stmt_if stmt_loop stmt_while stmt_exit
  %type 	stmt_return stmt_raise stmt_execsql stmt_execsql_insert
  %type 	stmt_dynexecute stmt_for stmt_perform stmt_getdiag
! %type 	stmt_open stmt_fetch stmt_close stmt_null
  
  %type 	proc_exceptions
  %type  exception_sect
--- 125,131 
  %type 	stmt_assign stmt_if stmt_loop stmt_while stmt_exit
  %type 	stmt_return stmt_raise stmt_execsql stmt_execsql_insert
  %type 	stmt_dynexecute stmt_for stmt_perform stmt_getdiag
! %type 	stmt_open stmt_fetch stmt_move stmt_close stmt_null
  
  %type 	proc_exceptions
  %type  exception_sect
***
*** 179,184 
--- 179,185 
  %token	K_IS
  %token	K_LOG
  %token	K_LOOP
+ %token	K_MOVE
  %token	K_NEXT
  %token	K_NOSCROLL
  %token	K_NOT
***
*** 635,640 
--- 636,643 
  		{ $$ = $1; }
  | stmt_fetch
  		{ $$ = $1; }
+ | stmt_move
+ 		{ $$ = $1; }
  | stmt_close
  		{ $$ = $1; }
  | stmt_null
***
*** 1478,1483 
--- 1481,1499 
  		fetch->rec		= rec;
  		fetch->row		= row;
  		fetch->curvar	= $4->varno;
+ 		fetch->is_move	= false;
+ 
+ 		$$ = (PLpgSQL_stmt *)fetch;
+ 	}
+ ;
+ 
+ stmt_move		: K_MOVE lno opt_fetch_direction cursor_variable ';'
+ 	{
+ 		PLpgSQL_stmt_fetch *fetch = $3;
+ 
+ 		fetch->lineno = $2;
+ 		fetch->curvar	= $4->varno;
+ 		fetch->is_move	= true;
  
  		$$ = (PLpgSQL_stmt *)fetch;
  	}
*** ./src/pl/plpgsql/src/pl_exec.c.orig	2007-04-20 09:24:27.0 +0200
--- ./src/pl/plpgsql/src/pl_exec.c	2007-04-20 09:25:14.0 +0200
***
*** 3112,3118 
  	return PLPGSQL_RC_OK;
  }
  
- 
  /* --
   * exec_stmt_fetch			Fetch from a cursor into a target
   * --
--- 3112,3117 
***
*** 3164,3208 
  	}
  
  	/* --
! 	 * Determine if we fetch into a record or a row
! 	 * --
! 	 */
! 	if (stmt->rec != NULL)
! 		rec = (PLpgSQL_rec *) (estate->datums[stmt->rec->recno]);
! 	else if (stmt->row != NULL)
! 		row = (PLpgSQL_row *) (estate->datums[stmt->row->rowno]);
! 	else
! 		elog(ERROR, "unsupported target");
! 
! 	/* --
! 	 * Fetch 1 tuple from the cursor
  	 * --
  	 */
! 	SPI_scroll_cursor_fetch(portal, stmt->direction, how_many);
! 	tuptab = SPI_tuptable;
! 	n = SPI_processed;
! 
! 	/* --
! 	 * Set the target and the global FOUND variable appropriately.
! 	 * --
! 	 */
! 	if (n == 0)
  	{
! 		exec_move_row(estate, rec, row, NULL, tuptab->tupdesc);
! 		exec_set_found(estate, false);
  	}
  	else
  	{
! 		exec_move_row(estate, rec