Re: [PATCHES] still alive?

2008-09-17 Thread Heikki Linnakangas

Bernd Helmle wrote:
--On Thursday, September 11, 2008 15:39:01 +0300 Peter Eisentraut 
[EMAIL PROTECTED] wrote:

Anyone who thinks the patches list should remain as separate from
hackers, shout now (with rationale)!


Seems I've missed something; what's supposed to hold patches now, then?


Just send them to pgsql-hackers.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

2008-09-08 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:

Also, it would be nice to use B-M(-H) for LIKE as well.


Right offhand, that seems impossible, at least in patterns with %.
Or were you thinking of trying to separate out the fixed substrings
of a pattern and search for them with BMH?


Yep, something like that. Even if it only handled the special case of 
'%foobar%', that would be nice, because that's a pretty common special case.
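
For illustration, here is a minimal sketch of a Boyer-Moore-Horspool substring
search over raw bytes, roughly the kind of fast path a '%foobar%' special case
could use. This is hypothetical standalone code, not taken from the patch:

#include <stddef.h>
#include <string.h>

/* Return a pointer to the first occurrence of needle in haystack, or NULL. */
static const char *
bmh_search(const char *haystack, size_t hlen, const char *needle, size_t nlen)
{
	size_t		skip[256];
	size_t		i;

	if (nlen == 0 || hlen < nlen)
		return NULL;

	/* By default, shift by the whole needle length. */
	for (i = 0; i < 256; i++)
		skip[i] = nlen;
	/* Characters that occur in the needle allow only a smaller shift. */
	for (i = 0; i < nlen - 1; i++)
		skip[(unsigned char) needle[i]] = nlen - 1 - i;

	for (i = 0; i + nlen <= hlen;
		 i += skip[(unsigned char) haystack[i + nlen - 1]])
	{
		if (memcmp(haystack + i, needle, nlen) == 0)
			return haystack + i;
	}
	return NULL;
}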



Anyway, it's not material for this patch, since it'd involve pretty
fundamental redesign of the LIKE code.


Yep.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] WIP Join Removal

2008-09-02 Thread Heikki Linnakangas

Simon Riggs wrote:

On Tue, 2008-09-02 at 13:20 +0300, Heikki Linnakangas wrote:

Simon Riggs wrote:

It turns out that a join like this

select a.col2
from a left outer join b on a.col1 = b.col1
where b.col2 = 1;

can be cheaper if we don't remove the join, when there is an index on
a.col1 and b.col2, because the presence of b allows the values returned
from b to be used for an index scan on a.
Umm, you *can't* remove that join. 


Yes, you can. The presence or absence of rows in b is not important to
the result of the query because of the left outer join.

I spent nearly a whole day going down that dead end also.


Oh. What does the query look like after removing the join, then?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] WIP Join Removal

2008-09-02 Thread Heikki Linnakangas

Simon Riggs wrote:

select a.col2
from a left outer join b on a.col1 = b.col1
where b.col2 = 1;

is logically equivalent to 


select a.col2
from a;


No, it's not:

postgres=# CREATE TABLE a (col1 int4, col2 int4);
CREATE TABLE
postgres=# CREATE TABLE b (col1 int4, col2 int4);
CREATE TABLE
postgres=# INSERT INTO a VALUES (1,1);
INSERT 0 1
postgres=# select a.col2 from a;
 col2
------
    1
(1 row)

postgres=# select a.col2 from a left outer join b on a.col1 = b.col1
where b.col2 = 1;
 col2
------
(0 rows)

But anyway, Greg's example looks valid, and proves the point that 
removing a join isn't always a win.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] WIP Join Removal

2008-09-02 Thread Heikki Linnakangas

Gregory Stark wrote:

I wonder if it would be more worthwhile to remove them and have a subsequent
phase where we look for possible joins to *add*. So even if the user writes
select * from invoices where customer_id=? the planner might be able to
discover that it can find those records quicker by scanning customer, finding
the matching company_id,customer_id and then using an index to look them up
in invoices.


Yeah, that would be cool. The question is whether it's worth the 
additional overhead in the planner, compared to the gain in the rare case 
that it's applicable. That's always the thing with planner tricks like 
this. I think we'll eventually need some sort of tuning knob to control 
how hard the planner tries to apply different optimizations like that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] WIP Join Removal

2008-09-02 Thread Heikki Linnakangas

Simon Riggs wrote:

It seems I wrote my original tests using AND instead of WHERE and
hadn't noticed the distinction. Thanks for helping me catch that error.


Ah, yeah, that's a big difference. Proving correctness is hard, but to 
refute something you need just one test case that fails ;-).


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] fixing bug in combocid.c

2008-09-01 Thread Heikki Linnakangas

Karl Schnaitter wrote:

This patch is for a bug in backend/utils/time/combocid.c.

In HeapTupleHeaderAdjustCmax, the code makes the incorrect assumption 
that the raw command id corresponds to cmin, when the raw command id can 
in fact be a combo cid. It needs to be fixed to call 
HeapTupleHeaderGetCmin to get cmin. That's all there is to my patch, but 
read on for more details.


Yep. Patch committed.

Thanks for the report!


This is my first patch; I hope I did it right!


You did well :-). No need to tar or gzip small patches like that, though.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [HACKERS][PATCHES] odd output in restore mode

2008-07-29 Thread Heikki Linnakangas

Andrew Dunstan wrote:



Greg Smith wrote:

On Wed, 23 Jul 2008, Kevin Grittner wrote:


In our scripts we handle this by copying to a temp directory on the
same mount point as the archive directory and doing a mv to the
archive location when the copy is successfully completed. I think
that this even works on Windows. Could that just be documented as a
strong recommendation for the archive script?


This is exactly what I always do. I think the way cp is shown in the 
examples promotes what's really a bad practice for lots of reasons, 
this particular problem being just one of them.


I've been working on an improved archive_command shell script that I 
expect to submit for comments and potential inclusion in the 
documentation as a better base for other people to build on. This is 
one of the options for how it can operate. It would be painful but not 
impossible to convert a subset of that script to run under Windows as 
well, at least enough to cover this particular issue.


A Perl script using the (standard) File::Copy module along with the 
builtin function rename() should be moderately portable. It would be 
nice not to have to maintain two scripts.


It's also not very nice to require a Perl installation on Windows, just 
for a replacement of Copy. Would a simple .bat script work?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [HACKERS][PATCHES] odd output in restore mode

2008-07-28 Thread Heikki Linnakangas

Andrew Dunstan wrote:

Kevin Grittner wrote:
Heikki Linnakangas [EMAIL PROTECTED] wrote: 

We really need a more reliable way of detecting that a file has been
fully copied. 
 
In our scripts we handle this by copying to a temp directory on the

same mount point as the archive directory and doing a mv to the
archive location when the copy is successfully completed.  I think
that this even works on Windows.  Could that just be documented as a
strong recommendation for the archive script?


Needs testing at least. If it does in fact work then we can just adjust 
the docs and be done 


Yeah.

- or maybe provide a .bat file or perl script that 
would work as an archive_command on Windows.


We're not talking about archive_command. We're talking about the thing 
that copies files to the directory that pg_standby polls.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [HACKERS][PATCHES] odd output in restore mode

2008-07-28 Thread Heikki Linnakangas

Andrew Dunstan wrote:



Heikki Linnakangas wrote:

Andrew Dunstan wrote:


- or maybe provide a .bat file or perl script that would work as an 
archive_command on Windows.


We're not talking about archive_command. We're talking about the thing 
that copies files to the directory that pg_standby polls.


Er, that's what the archive_command is. Look at the pg_standby docs and 
you'll see that that's where we're currently recommending use of windows 
copy. Perhaps you're confusing this with the restore_command?


Oh, right. I was thinking that archive_command copies the files to an 
archive location, and there's yet another process copying files from 
there to the directory pg_standby polls. But indeed in the simple 
configuration, archive_command is the command that we're interested in.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [HACKERS][PATCHES] odd output in restore mode

2008-07-23 Thread Heikki Linnakangas

Simon Riggs wrote:

On Tue, 2008-07-22 at 17:19 -0700, Martin Zaun wrote:

8. Unresolved question of implementing now/later a cp replacement


The patch implements what's been agreed. 


I'm not rewriting cp, for reasons already discussed.

Not a comment to you Martin, but it's fairly clear that I'm not
maintaining this correctly for Windows. I've never claimed to have
tested this on Windows, and only included Windows related items as
requested by others. I need to make it clear that I'm not going to
maintain it at all, for Windows. If others wish to report Windows issues
then they can suggest appropriate fixes and test them also.


Hmm. I just realized that replacing the cp command within pg_standby 
won't help at all. The problem is with the command that copies the files 
*to* the archivelocation that pg_standby polls, not with the copy 
pg_standby does from archivelocation to pg_xlog. And we don't have much 
control over that.


We really need a more reliable way of detecting that a file has been 
fully copied. One simple improvement would be to check the xlp_magic 
field of the last page, though it still wouldn't be bullet-proof.
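
As a rough sketch of that idea (with made-up constants and a simplified header
layout, not the real definitions from xlog_internal.h), the check could look
something like this:

#include <stdio.h>
#include <stdint.h>

#define XLOG_BLCKSZ		8192				/* assumed xlog page size */
#define XLOG_SEG_SIZE	(16 * 1024 * 1024)	/* assumed segment size */
#define XLP_MAGIC		0xD062				/* placeholder magic value */

/* Return 1 if the last page of the segment starts with the expected magic. */
static int
last_page_looks_complete(const char *path)
{
	FILE	   *f = fopen(path, "rb");
	uint16_t	magic = 0;

	if (f == NULL)
		return 0;
	if (fseek(f, XLOG_SEG_SIZE - XLOG_BLCKSZ, SEEK_SET) == 0)
		fread(&magic, sizeof(magic), 1, f);
	fclose(f);
	return magic == XLP_MAGIC;
}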


Do the commands that preallocate the space keep the file exclusively 
locked during the copy? If they do, shouldn't we get an error in trying 
to run the restore copy command, and retry after the 1s sleep in 
RestoreWALFileForRecovery? Though if the archive location is a samba 
mount or something, I guess we can't rely on Windows-style exclusive 
locking.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] page macros cleanup (ver 04)

2008-07-09 Thread Heikki Linnakangas

Zdenek Kotala wrote:

Pavan Deolasee napsal(a):

On Fri, Jul 4, 2008 at 4:25 PM, Heikki Linnakangas
[EMAIL PROTECTED] wrote:


No, there's an itemsz = MAXALIGN(itemsz) call before the check against
HashMaxItemSize.



Ah, right. Still, should we not just MAXALIGN_DOWN the max size to
reflect the practical limit on the itemsz? It's more academic
though, so not a big deal.


Finally, I use the following formula:

#define HashMaxItemSize(page) \
	MAXALIGN_DOWN(PageGetPageSize(page) - \
				  (SizeOfPageHeaderData + sizeof(ItemIdData)) - \
				  MAXALIGN(sizeof(HashPageOpaqueData)))


I did not replace PageGetPageSize(page), because the other *MaxItemSize 
macros have the same interface.


Ok, fair enough.

There's another academic discrepancy between these two hunks:


*** src/backend/access/hash/hashpage.c  12 May 2008 00:00:44 -  1.75
--- src/backend/access/hash/hashpage.c  9 Jul 2008 11:30:09 -
***
*** 407,413 
  	for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
  	{
  		if ((1 << i) <= (metap->hashm_bsize -
! 						 (MAXALIGN(sizeof(PageHeaderData)) +
  						  MAXALIGN(sizeof(HashPageOpaqueData)))))
  			break;
  	}
--- 407,413 
  	for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
  	{
  		if ((1 << i) <= (metap->hashm_bsize -
! 						 (MAXALIGN(SizeOfPageHeaderData) +
  						  MAXALIGN(sizeof(HashPageOpaqueData)))))
  			break;
  	}


and


*** src/include/access/hash.h   19 Jun 2008 00:46:05 -  1.88
--- src/include/access/hash.h   9 Jul 2008 11:30:10 -
***
*** 192,198 
  #define BMPG_SHIFT(metap)	((metap)->hashm_bmshift)
  #define BMPG_MASK(metap)	(BMPGSZ_BIT(metap) - 1)
  #define HashPageGetBitmap(pg) \
! 	((uint32 *) (((char *) (pg)) + MAXALIGN(sizeof(PageHeaderData))))
  
  /*
   * The number of bits in an ovflpage bitmap word.
--- 191,197 
  #define BMPG_SHIFT(metap)	((metap)->hashm_bmshift)
  #define BMPG_MASK(metap)	(BMPGSZ_BIT(metap) - 1)
  #define HashPageGetBitmap(pg) \
! 	((uint32 *) (((char *) (pg)) + MAXALIGN(SizeOfPageHeaderData + sizeof(ItemIdData))))
  
  /*
   * The number of bits in an ovflpage bitmap word.


I admit I don't understand what that bitmap is, it looks like we're 
assuming it can take up all the space between page header and the 
special portion, without any line pointers, but in HashPageGetBitmap, 
we're reserving space for one pointer. It looks like the actual size of 
the bitmap is only the largest power of 2 below the maximum size, so 
that won't be an issue in practice. That macro is actually doing the 
same thing as PageGetContents, so I switched to using that. As that 
moves the data slightly on those bitmap pages, I guess we'll need a 
catversion bump.


Attached is an updated patch. I also fixed some whitespace and comments 
that were no longer valid. How does it look to you now?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: [PATCHES] Bug fix for pg_standby keepfiles calculation

2008-07-08 Thread Heikki Linnakangas

Simon Riggs wrote:

Fix minor bug in pg_standby, noted by Ferenc Felhoffer


Applied, thanks.

I couldn't find a bug report from Ferenc in the archives. Did he contact 
you personally?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


[PATCHES] Multi-column GIN

2008-07-04 Thread Heikki Linnakangas

Dumb question:

What's the benefit of a multi-column GIN index over multiple 
single-column GIN indexes?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Relation forks FSM rewrite patches

2008-07-04 Thread Heikki Linnakangas

Simon Riggs wrote:

On Fri, 2008-06-27 at 19:36 +0300, Heikki Linnakangas wrote:

Here's an updated version of the relation forks patch, and an 
incremental FSM rewrite patch on top of that. The relation forks patch 
is ready for review. The FSM implementation is more work-in-progress 
still, but I'd like to get some review on that as well, with the goal of 
doing more performance testing and committing it after the commit fest.


Hmmm, a 6000 line patch with no visible documentation, readme, nor any
discussion on -hackers that I'm aware of.


There is a readme in the patch, and there certainly has been discussion 
on -hackers, see:


http://archives.postgresql.org/pgsql-hackers/2008-04/msg00415.php

where the current design is discussed for the first time.

User documentation is still to be done. There isn't anything to tune, or 
anything that requires manual maintenance, so there isn't much to 
document, though, except perhaps a chapter in the Internals part.


Here's an updated version of the patch. Changes:
- one bug has been fixed (I assumed that it's always safe to call 
rightchild(x) on a parent node, but that was not always true for the 
rightmost branch of the tree)

- there's some new debugging code.
- if we can't find a page with enough free space in search_avail after 
starting from scratch 1000 times, give up. That shouldn't happen, but 
see below.


There is still a nasty bug somewhere, probably related to locking :-(. I 
ran DBT-2 with this, and after about 1h a FSM lookup goes into an 
endless loop, hogging all CPU. I suspect that there's a bug somewhere so 
that a change to the root node of a lower level FSM page isn't 
propagated to the upper level FSM page for some reason. oprofile shows 
that the endless loop happens in search_avail. This is why I added the 
code to give up after 1000 tries, hoping to get some debugging output 
the next time that happens.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] page macros cleanup

2008-07-04 Thread Heikki Linnakangas

Pavan Deolasee wrote:

On Fri, Jul 4, 2008 at 1:01 PM, Zdenek Kotala [EMAIL PROTECTED] wrote:


Good catch. I got lost in basic arithmetic. What I see now is that the original
definition counts sizeof(ItemIdData) twice and, on the other side, does not take
care of MAXALIGN correctly. I think the correct formula is:

#define HashMaxItemSize(page) \
   (PageGetPageSize(page) - \
 ( MAXALIGN(SizeOfPageHeaderData + sizeof(ItemIdData))+ \
   MAXALIGN(sizeof(HashPageOpaqueData)) \
 )\
)

What do you think?


Yes. I think that's the correct way.


Doesn't look right to me. There's no padding after the first line 
pointer, hence the first MAXALIGN shouldn't be there.


BTW, looking at hashinsert.c where it's used, we're actually passing a 
pointer to the meta page to HashMaxItemSize(). So the PageGetPageSize() 
call on that is quite bogus, since it's not the meta page that the tuple 
is going to be inserted into. It's academic, because all pages are the 
same size anyway, but doesn't look right. I think I'd go with BLCKSZ 
instead.


I think this is the way it should be:

#define HashMaxItemSize \
(BLCKSZ - \
 SizeOfPageHeaderData - \
 MAXALIGN(sizeof(HashPageOpaqueData)) - \
 sizeof(ItemIdData))

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] page macros cleanup

2008-07-04 Thread Heikki Linnakangas

Pavan Deolasee wrote:

On Fri, Jul 4, 2008 at 3:37 PM, Heikki Linnakangas
[EMAIL PROTECTED] wrote:


I think this is the way it should be:

#define HashMaxItemSize \
   (BLCKSZ - \
SizeOfPageHeaderData - \
MAXALIGN(sizeof(HashPageOpaqueData)) - \
sizeof(ItemIdData))



I am wondering if this would fail for a corner case if HashMaxItemSize
happened to be unaligned. For example, if (itemsz <= HashMaxItemSize &&
MAXALIGN(itemsz) > HashMaxItemSize), PageAddItem() would later fail with
a not-so-obvious error. Should we just MAXALIGN_DOWN the HashMaxItemSize?


No, there's an itemsz = MAXALIGN(itemsz) call before the check against 
HashMaxItemSize.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] page macros cleanup

2008-07-04 Thread Heikki Linnakangas

Zdenek Kotala wrote:

In my opinion, the first place where a tuple can be placed is:

MAXALIGN(SizeOfPageHeaderData + sizeof(ItemIdData))


Yeah, but we don't enforce that directly. We enforce it by MAXALIGNing 
size in PageAddItem, and with the rule that special-size is MAXALIGNed.


To put it another way, it's possible that there's an unaligned amount of 
free space on a page, as returned by PageGetExactFreeSpace. But since 
item size is always MAXALIGNed, it's impossible to use it all.
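
A small illustration of that point, assuming 8-byte maximum alignment (the
real macros in c.h are defined differently):

#define MAXIMUM_ALIGNOF		8
#define MAXALIGN(LEN)		(((LEN) + MAXIMUM_ALIGNOF - 1) & ~(MAXIMUM_ALIGNOF - 1))
#define MAXALIGN_DOWN(LEN)	((LEN) & ~(MAXIMUM_ALIGNOF - 1))

/*
 * A page might report 27 bytes of exact free space, but since item sizes
 * are always MAXALIGNed, only MAXALIGN_DOWN(27) == 24 of those bytes are
 * actually usable: a 20-byte item is rounded up to MAXALIGN(20) == 24,
 * and a 25-byte item would need 32.
 */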



Come to think of it, why do we require MAXALIGN alignment of tuples? I 
must be missing something, but AFAICS the widest fields in 
HeapTupleHeaderData are 4 bytes wide, so it should be possible to get 
away with 4-byte alignment.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Relation forks FSM rewrite patches

2008-07-04 Thread Heikki Linnakangas

Simon Riggs wrote:

On Fri, 2008-07-04 at 12:27 +0300, Heikki Linnakangas wrote:

Simon Riggs wrote:

On Fri, 2008-06-27 at 19:36 +0300, Heikki Linnakangas wrote:

Here's an updated version of the relation forks patch, and an 
incremental FSM rewrite patch on top of that. The relation forks patch 
is ready for review. The FSM implementation is more work-in-progress 
still, but I'd like to get some review on that as well, with the goal of 
doing more performance testing and committing it after the commit fest.

Hmmm, a 6000 line patch with no visible documentation, readme, nor any
discussion on -hackers that I'm aware of.
There is a readme in the patch, and there certainly has been discussion 
on -hackers, see:


http://archives.postgresql.org/pgsql-hackers/2008-04/msg00415.php

where the current design is discussed for the first time.


OK, I see the readme now. Thanks.

Minor point, but the readme needs to draw a clear distinction between FSM
pages and data pages, which confused me when I read it the first time.


Ok, thanks.

I had trouble finding distinctive terms for the tree within a page, and 
the tree of FSM blocks, with root at block 0.



Contention on FSM pages concerns me. Every change has the potential to
bubble up to the top, which would cause a significant problem. I'd like
to find a way to minimise the number of bubble-up operations, otherwise
there will be no material difference between using a single whole-FSM
lock like we do now and this new scheme. Note that in this new scheme
the path length to the bottom of the tree is considerably longer and can
stall waiting on I/O - so contention in the FSM is a big issue. (It
always has been in databases as long as I can remember).


There's already some mitigating factors:

1. You only need to bubble up to an upper level FSM page if the amount 
at the top of the leaf FSM page changes. Keep in mind that one FSM page 
holds FSM information on ~4000 heap pages, so you don't need to bubble 
up if there's any page within that 4000 page range that has as much or 
more space than the page you're updating.


2. We only update the FSM when we try to insert a tuple and find that it 
doesn't fit. That reduces the amount of FSM updates dramatically when 
you're doing bulk inserts. (This is the same as in the current 
implementation, though I'm not sure it's optimal anymore.)


I haven't been able to come up with a simple test case that shows any 
meaningful performance difference between the current and this new 
implementation. Got any ideas? I fear that we're going overboard trying 
to avoid contention that simple isn't there, but it's hard to argue 
without a concrete test case to analyze..



The FSM tree as proposed has exact values.


Not quite. Free space is tracked in BLCKSZ/256 (=32 with default BLCKSZ) 
increments, so that we only need one byte per heap page.
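
In other words, something along these lines (an illustrative sketch, not the
actual code from the patch):

#define BLCKSZ 8192

/* One byte per heap page: free space is tracked in BLCKSZ/256-byte steps. */
static unsigned char
fsm_space_to_cat(int avail_bytes)
{
	/* avail_bytes is at most BLCKSZ minus the page header, so this fits */
	return (unsigned char) (avail_bytes / (BLCKSZ / 256));
}

/* Lower bound, in bytes, of the free space represented by a category. */
static int
fsm_cat_to_space(unsigned char cat)
{
	return cat * (BLCKSZ / 256);
}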



What if we bubbled up based
upon only significant tree changes, rather than exact changes?


Hmm. So an update would only ever update the lowest-level FSM page, and 
leave a mismatch between the value at the root node of a lower level FSM 
page, and the corresponding value at the lead node of its parent. The 
mismatch would then need to be fixed by the next search that traverses 
down that path, and finds that there's not enough space there after all.


That works when the amount of free space on a page is decremented. VACUUM, 
which increments it, would still need to bubble up the change, because if 
the value at the upper level is not fixed, no search might ever traverse 
down that path, and the value would never be fixed.


That would solve the I/O while holding lock issue. (not that I'm too 
worried about it, though)



So perhaps we can perform bubble-up only when the change in
free space is greater than avg row size or the remaining space has
dropped to less than 2*avg row size, where exact values begin to matter
more. 


One idea is to make the mapping from amount of free space in bytes to 
the 1-byte FSM category non-linear. For example, use just one category 
for > 2000 bytes free, on the assumption that anything larger than that 
is toasted anyway, and divide the 255 categories evenly across the 
remaining 2000 bytes.
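
A sketch of what that non-linear mapping could look like (illustrative only,
not from the patch):

/*
 * Everything with more than 2000 bytes free lands in the top category;
 * the remaining 255 categories are spread evenly over 0..2000 bytes.
 */
static unsigned char
fsm_space_to_cat_nonlinear(int avail_bytes)
{
	if (avail_bytes >= 2000)
		return 255;
	return (unsigned char) ((avail_bytes * 255) / 2000);
}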


I would like to move away from using the average row width in the 
calculation if possible. We use it in the current implementation, but if 
we want to keep doing it, we'll need to track it within the FSM like 
the current implementation does, adding complexity and contention issues 
of its own. Or reach into the statistics from the FSM, but I don't like 
that idea much either.



Also, as FSM map becomes empty we will see more and more bubble up
operations reaching top parts of the tree. We need a way to avoid
contention from growing over time.


Yeah, that's something to watch out for.


I'm not happy about the FSM being WAL logged for minor changes (new
pages, yes). The high number of changes will cause extra traffic where
we don't want it. This will accentuate

Re: [PATCHES] Relation forks FSM rewrite patches

2008-07-04 Thread Heikki Linnakangas

Alvaro Herrera wrote:

I wonder if we could instead make vacuum write a new FSM tree from
scratch, rather than updating it piecemeal.


I'd like to move away from that, as we'll hopefully get some sort of 
partial vacuum capability soon.


We might want to have some sort of bulk update operation, though, so 
that you could refresh FSM information for say 100 pages, or one FSM 
page, at a time. That would reduce the WAL traffic of a VACUUM 
significantly, though I'm not sure how significant that is to begin with.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] EXPLAIN progress info

2008-07-03 Thread Heikki Linnakangas
I like this idea in general. I'm envisioning a cool explain display in 
pgAdmin that's updated live, as the query is executed, showing how many 
tuples a seq scan at the bottom, and an aggregate above it, has 
processed. Drool.


Currently the interface is that you open a new connection, and signal 
the backend using the same mechanism as a query cancel. That's fine for 
the interactive psql use case, but seems really heavy-weight for the 
pgAdmin UI I'm envisioning. You'd have to poll the server, opening a new 
connection each time. Any ideas on that? How about a GUC to send the 
information automatically every n seconds, for example?


Other than that, this one seems to be the most serious issue:

Tom Lane wrote:

Gregory Stark [EMAIL PROTECTED] writes:
There are downsides: 


Insurmountable ones at that.  This one already makes it a non-starter:


a) the overhead of counting rows and loops is there for every query execution,
even if you don't do explain analyze.


I wouldn't be surprised if the overhead of the counters turns out to be 
a non-issue, but we'd have to see some testing of that. InstrEndLoop is 
the function we're worried about, right?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] page macros cleanup

2008-07-03 Thread Heikki Linnakangas

Just one quick note:

Zdenek Kotala wrote:

*** pgsql.orig.da8c485e0e2a/src/backend/access/gist/gistutil.c  pá črn 13 18:00:35 2008
--- pgsql.orig/src/backend/access/gist/gistutil.c   pá črn 13 18:00:35 2008
***
*** 592,598 
  	/*
  	 * Additionally check that the special area looks sane.
  	 */
! 	if (((PageHeader) (page))->pd_special !=
  		(BLCKSZ - MAXALIGN(sizeof(GISTPageOpaqueData))))
  		ereport(ERROR,
  				(errcode(ERRCODE_INDEX_CORRUPTED),
--- 592,598 
  	/*
  	 * Additionally check that the special area looks sane.
  	 */
! 	if (PageGetSpecialPointer(page) - page !=
  		(BLCKSZ - MAXALIGN(sizeof(GISTPageOpaqueData))))
  		ereport(ERROR,
  				(errcode(ERRCODE_INDEX_CORRUPTED),


Should probably use PageGetSpecialSize here. Much simpler, and doesn't 
assume that the special area is always at the end of page (not that I 
see us changing that anytime soon).


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] extend VacAttrStats to allow stavalues of different types

2008-07-01 Thread Heikki Linnakangas

Jan Urbański wrote:

Tom Lane wrote:

I think the correct solution is to initialize the fields to match the
column type before calling the typanalyze function.  Then you don't
break compatibility for existing typanalyze functions.  It's also less
code, since the standard typanalyze functions can rely on those preset
values.


Right. Updated patch attached.


Thanks, committed.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] [HACKERS] odd output in restore mode

2008-07-01 Thread Heikki Linnakangas

Simon Riggs wrote:

Patch implements

* recommendation to use GnuWin32 cp on Windows
* provide holdtime delay, default 0 (on all platforms)
* default stays same on Windows=copy to ensure people upgrading don't
get stung


This seems pretty kludgey to me. I wouldn't want to install GnuWin32 
utilities on a production system just for the cp command, and I don't 
know how I would tune holdtime properly for using copy. And it seems 
risky to have defaults that are known to not work reliably.


How about implementing a replacement function for cp ourselves? It 
seems pretty trivial to do. We could use that on Unixes as well, which 
would keep the differences between Win32 and other platforms smaller, 
and thus ensure the codepath gets more testing.
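
For example, a minimal sketch of such a replacement (hypothetical code, error
handling abbreviated): copy to a temporary name on the destination file system
and rename() to the final name only once the copy has completed, so the reader
never sees a half-written file.

#include <stdio.h>

static int
copy_then_rename(const char *src, const char *dst, const char *dsttmp)
{
	FILE	   *in = fopen(src, "rb");
	FILE	   *out = fopen(dsttmp, "wb");
	char		buf[8192];
	size_t		n;

	if (in == NULL || out == NULL)
		return -1;
	while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
	{
		if (fwrite(buf, 1, n, out) != n)
			return -1;
	}
	fclose(in);
	if (fclose(out) != 0)
		return -1;
	/* The file appears under its final name only after it is fully written. */
	return rename(dsttmp, dst);
}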


(Sorry for jumping into the discussion so late, I didn't follow this 
thread earlier, and just read it now in the archives while looking at 
the patch.)


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Better formatting of functions in pg_dump

2008-07-01 Thread Heikki Linnakangas

Tom Lane wrote:

Greg Sabino Mullane [EMAIL PROTECTED] writes:

Why the random switching between newline-before and newline-after
styles?  Please be consistent.



I thought they were all after. On second glance, they still seem
all after?


Oh, my mistake, I had failed to see that the patch was getting rid of
newline-before style in this function.  I think you might have gone
a bit overboard on adding whitespace, but the previous objection is
nonsense, sorry.


Yeah, I like the idea of moving the metadata stuff before the function 
body, but the whitespace is a bit too much. You can fit 
"LANGUAGE plpgsql IMMUTABLE STRICT SECURITY DEFINER COST 10" on one line 
without wrapping on an 80 col terminal. And we don't try to 
guarantee any specific width anyway; you can get very long lines if the 
function has a lot of arguments, for example.


I applied this simpler patch that just moves the metadata stuff before 
the function body, leaving the whitespace as is (in newline-before style).


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** src/bin/pg_dump/pg_dump.c
--- src/bin/pg_dump/pg_dump.c
***
*** 6775,6788  dumpFunc(Archive *fout, FuncInfo *finfo)
  	rettypename = getFormattedTypeName(finfo->prorettype, zeroAsOpaque);
  
  	appendPQExpBuffer(q, "CREATE FUNCTION %s ", funcsig);
! 	appendPQExpBuffer(q, "RETURNS %s%s\n%s\nLANGUAGE %s",
  					  (proretset[0] == 't') ? "SETOF " : "",
! 					  rettypename,
! 					  asPart->data,
! 					  fmtId(lanname));
! 
  	free(rettypename);
  
  	if (provolatile[0] != PROVOLATILE_VOLATILE)
  	{
  		if (provolatile[0] == PROVOLATILE_IMMUTABLE)
--- 6775,6786 
  	rettypename = getFormattedTypeName(finfo->prorettype, zeroAsOpaque);
  
  	appendPQExpBuffer(q, "CREATE FUNCTION %s ", funcsig);
! 	appendPQExpBuffer(q, "RETURNS %s%s",
  					  (proretset[0] == 't') ? "SETOF " : "",
! 					  rettypename);
  	free(rettypename);
  
+ 	appendPQExpBuffer(q, "\nLANGUAGE %s", fmtId(lanname));
  	if (provolatile[0] != PROVOLATILE_VOLATILE)
  	{
  		if (provolatile[0] == PROVOLATILE_IMMUTABLE)
***
*** 6850,6856  dumpFunc(Archive *fout, FuncInfo *finfo)
  			appendStringLiteralAH(q, pos, fout);
  	}
  
! 	appendPQExpBuffer(q, ";\n");
  
  	ArchiveEntry(fout, finfo->dobj.catId, finfo->dobj.dumpId,
  				 funcsig_tag,
--- 6848,6854 
  			appendStringLiteralAH(q, pos, fout);
  	}
  
! 	appendPQExpBuffer(q, "\n%s;\n", asPart->data);
  
  	ArchiveEntry(fout, finfo->dobj.catId, finfo->dobj.dumpId,
  				 funcsig_tag,

-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES][UPDATED] A GUC variable to replace PGBE_ACTIVITY_SIZE

2008-07-01 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
Another simple optimization occurred to me while looking at this: we 
should skip the memcpy/strcpy altogether if the BackendActivity slot is 
not in use. That seems like a good bet, you usually don't try to max out 
max_connections.


Huh?  How could we be assigning to a slot that is not in use?


Before the patch, we loop through the shared PgBackendStatus slots 
(there are MaxBackends of them), and issue a memcpy for each to copy it 
to our local slot. After that, we check if it was actually in use.


After the patch, we still loop through the shared slots, but only issue 
the memcpy for slots that are in use.


Hope this answers your question, I'm not sure what you meant...

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] [HACKERS] Hint Bits and Write I/O

2008-06-27 Thread Heikki Linnakangas

Gregory Stark wrote:

I'm also a bit concerned that *how many hint bits* isn't enough information to
determine how important it is to write out the page. 


Agreed, that doesn't seem like a very good metric to me either.


Or how many *unhinted* xmin/xmax
values were found? If HTSV can hint xmin for a tuple but finds xmax still in
progress perhaps that's a good sign it's not worth dirtying the page?


I like that thought.

Overall, I feel that we should never dirty when setting a hint bit, just 
set the separate buffer flag to indicate that hint bits have been set. 
The decision to dirty and write out, or not, should be delayed until 
we're about to write/replace the buffer. That is, in bgwriter.


How about this strategy:

1. First of all, before writing a dirty buffer, scan all tuples on the 
page and set all hint bits that can be set. This will hopefully save us 
from having to dirty the page again in the future, when another tuple on 
the page is accessed. This has been proposed before, and IIRC Tom has 
argued that it's a modularity violation for bgwriter to access the 
contents of pages like that, but I'm sure we can find a way to do it safely.


2. When bgwriter encounters a page that's marked as hint bits dirty, 
write it only if *all* hint bits on the page have been, or can be, set. 
Dirtying a page before that point doesn't seem worthwhile, as the next 
access to the tuple that doesn't have all the hint bits set will have to 
dirty the page again.



Actually, I'd like to see some benchmarks on an even simpler strategy:
just never dirty a page just because a hint bit has been set. It might 
work surprisingly well in practice: If a database is I/O bound, we don't 
care about the extra CPU work or lock congestion of checking the clog. 
If it's CPU bound, the active pages that matter are in the buffer cache, 
and so are the hint bits for those pages.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] A GUC variable to replace PGBE_ACTIVITY_SIZE

2008-06-24 Thread Heikki Linnakangas

I'd like to see some quick testing of this thing mentioned in the comments:


/*
 * XXX if pgstat_track_activity_query_size is really large,
 * it might be best to use strcpy not memcpy for copying the
 * activity string?
 */


If we make it a GUC, there will be people running with really large 
pgstat_track_activity_query_size settings. In fact I wouldn't be 
surprised if people routinely cranked it up to the 10-100 KB range, just 
in case.


It's going to be platform-dependent, for sure, but some quick 
micro-benchmarks of where the break-even point between memcpy and strcpy 
lies would be nice.
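
Something as simple as this would do as a starting point (a hypothetical test
harness: a short query string sitting in a large activity buffer, which is the
case the comment worries about):

#include <stdio.h>
#include <string.h>
#include <time.h>

int
main(void)
{
	static char src[100 * 1024];	/* large pgstat_track_activity_query_size */
	static char dst[100 * 1024];
	clock_t		t0;
	int			i;

	memset(src, 'x', 100);
	src[100] = '\0';				/* but the query itself is short */

	t0 = clock();
	for (i = 0; i < 100000; i++)
		memcpy(dst, src, sizeof(src));	/* always copies the whole buffer */
	printf("memcpy: %.3f s\n", (double) (clock() - t0) / CLOCKS_PER_SEC);

	t0 = clock();
	for (i = 0; i < 100000; i++)
		strcpy(dst, src);				/* stops at the terminator */
	printf("strcpy: %.3f s\n", (double) (clock() - t0) / CLOCKS_PER_SEC);

	return 0;
}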


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Refactoring xlogutils.c

2008-06-11 Thread Heikki Linnakangas

Teodor Sigaev wrote:
- For the routines that need a fake, or lightweight, Relation struct 
anyway, there's a new function called 
XLogFakeRelcacheEntry(RelFileNode), that returns a palloc'd Relation 
struct.


Is that fake structure applicable for ReadBuffer()?


Yes.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Refactoring xlogutils.c

2008-06-11 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:

Attached is an updated version of my patch to refactor the
XLogOpenRelation/XLogReadBuffer interface, in preparation for the 
relation forks patch, and subsequently the FSM rewrite patch.


The code motion in md.c looks fairly bogus; was that a copy-and-paste
error?


This one?


*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***
*** 208,216  mdcreate(SMgrRelation reln, bool isRedo)
  	char	   *path;
  	File		fd;
  
- 	if (isRedo && reln->md_fd != NULL)
- 		return;					/* created and opened already... */
- 
  	Assert(reln->md_fd == NULL);
  
  	path = relpath(reln->smgr_rnode);

--- 208,213 


That's intentional. That check has been moved to smgrcreate(), so that 
it's done before calling TablespaceCreateDbSpace(). The reason for that 
is that smgrcreate() is now called for every XLogReadBuffer() 
invocation, so we want to make it exit quickly if there's nothing to do.


On second look though, I think we should actually leave that check in 
mdcreate(). Even though it'd be dead code since we do the same check in 
smgrcreate already, removing it changes the semantics of mdcreate(): 
it'd no longer be acceptable to call mdcreate() if the file is open already.



Otherwise it looks pretty sane, but I have one stylistic gripe:
I'm dubious about your willingness to assume that pfree() is enough for
getting rid of a fake relcache entry.  Relcache entries are complex
beasts and it may not continue to work to do that.  It's also possible
that we'd have additional resources attached to them someday.  So
I think it'd be worth having a FreeFakeRelcacheEntry() routine to
localize the knowledge of how to get rid of them.


Yeah, I guess you're right.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] minor ts_type.h comment fix

2008-06-10 Thread Heikki Linnakangas

Jan Urbański wrote:

These should read TSQuery, not TSVector, no?


Yep. Applied.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


[PATCHES] \timing on/off

2008-06-10 Thread Heikki Linnakangas
I started to look at David Fetter's patch to enable \timing on/off, in 
addition to toggling the mode with just \timing.


I gather that the conclusion of the thread Making sure \timing is on, 
starting at: 
http://archives.postgresql.org/pgsql-general/2008-05/msg00324.php


was that we should leave \H and \a alone for now, because it's not clear 
what \H off would do. And \a is already documented as deprecated; we 
might want to do that for \H as well.


Here's a patch, based on David's patch, that turns timing into a \pset 
variable, and makes \timing an alias for \pset timing. This makes 
\timing behave the same as \x.


I also moved the help line from the External section into 
Formatting. I don't think \timing is external in the same sense as 
\cd or \! are. Rather, it controls the output, like \x or \pset options.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] \timing on/off

2008-06-10 Thread Heikki Linnakangas

And here is the patch I forgot to attach.

Heikki Linnakangas wrote:
I started to look at David Fetter's patch to enable \timing on/off, in 
addition to toggling the mode with just \timing.


I gather that the conclusion of the thread Making sure \timing is on, 
starting at: 
http://archives.postgresql.org/pgsql-general/2008-05/msg00324.php


was that we should leave \H and \a alone for now, because it's not clear 
what \H off would do. And \a is already documented as deprecated; we 
might want to do that for \H as well.


Here's a patch, based on David's patch, that turns timing into a \pset 
variable, and makes \timing an alias for \pset timing. This makes 
\timing behave the same as \x.


I also moved the help line from the External section into 
Formatting. I don't think \timing is external in the same sense as 
\cd or \! are. Rather, it controls the output, like \x or \pset options.





--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: doc/src/sgml/ref/psql-ref.sgml
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/ref/psql-ref.sgml,v
retrieving revision 1.207
diff -c -r1.207 psql-ref.sgml
*** doc/src/sgml/ref/psql-ref.sgml	1 Jun 2008 16:23:08 -	1.207
--- doc/src/sgml/ref/psql-ref.sgml	9 Jun 2008 12:19:35 -
***
*** 1721,1726 
--- 1721,1739 
      </para>
      </listitem>
      </varlistentry>
+ 
+   <varlistentry>
+   <term><literal>timing</literal></term>
+   <listitem>
+   <para>
+   You can specify an optional second argument, if it is provided it
+   may be either <literal>on</literal> or <literal>off</literal> which
+   will enable or disable display of how long each SQL statement takes,
+   in milliseconds. If the second argument is not provided then we will
+   toggle between on and off.
+   </para>
+   </listitem>
+   </varlistentry>
  </variablelist>
  </para>
  
***
*** 1734,1740 
  <para>
  There are various shortcut commands for <command>\pset</command>. See
  <command>\a</command>, <command>\C</command>, <command>\H</command>,
! <command>\t</command>, <command>\T</command>, and <command>\x</command>.
  </para>
  </tip>
  
--- 1747,1754 
  <para>
  There are various shortcut commands for <command>\pset</command>. See
  <command>\a</command>, <command>\C</command>, <command>\H</command>,
! <command>\t</command>, <command>\T</command>, <command>\x</command>,
! and <command>\timing</command>.
  </para>
  </tip>
  
***
*** 1864,1870 
   <term><literal>\timing</literal></term>
    <listitem>
    <para>
!  Toggles a display of how long each SQL statement takes, in milliseconds.
   </para>
   </listitem>
  </varlistentry>
--- 1878,1885 
   <term><literal>\timing</literal></term>
    <listitem>
    <para>
!  Toggles a display of how long each SQL statement takes. As such it
!  is equivalent to <literal>\pset timing</literal>.
   </para>
   </listitem>
  </varlistentry>
Index: src/bin/psql/command.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/bin/psql/command.c,v
retrieving revision 1.189
diff -c -r1.189 command.c
*** src/bin/psql/command.c	14 May 2008 19:10:29 -	1.189
--- src/bin/psql/command.c	9 Jun 2008 10:56:25 -
***
*** 290,301 
  		char	   *opt = psql_scan_slash_option(scan_state,
   OT_WHOLE_LINE, NULL, false);
  
! 		if (pset.timing)
  			INSTR_TIME_SET_CURRENT(before);
  
  		success = do_copy(opt);
  
! 		if (pset.timing && success)
  		{
  			INSTR_TIME_SET_CURRENT(after);
  			INSTR_TIME_SUBTRACT(after, before);
--- 290,301 
  		char	   *opt = psql_scan_slash_option(scan_state,
   OT_WHOLE_LINE, NULL, false);
  
! 		if (pset.popt.timing)
  			INSTR_TIME_SET_CURRENT(before);
  
  		success = do_copy(opt);
  
! 		if (pset.popt.timing && success)
  		{
  			INSTR_TIME_SET_CURRENT(after);
  			INSTR_TIME_SUBTRACT(after, before);
***
*** 884,897 
  	/* \timing -- toggle timing of queries */
  	else if (strcmp(cmd, "timing") == 0)
  	{
! 		pset.timing = !pset.timing;
! 		if (!pset.quiet)
! 		{
! 			if (pset.timing)
! 				puts(_("Timing is on."));
! 			else
! 				puts(_("Timing is off."));
! 		}
  	}
  
  	/* \unset */
--- 884,894 
  	/* \timing -- toggle timing of queries */
  	else if (strcmp(cmd, "timing") == 0)
  	{
! 		char	   *opt = psql_scan_slash_option(scan_state,
!  OT_NORMAL, NULL, true);
! 
! 		success = do_pset("timing", opt, &pset.popt, pset.quiet);
! 		free(opt);
  	}
  
  	/* \unset */
***
*** 1739,1744 
--- 1736,1752 
  			printf(_("Target width for \"wrapped\" format is %d.\n"), popt->topt.columns);
  	}
  
+ 	/* set timing mode */
+ 	else if (strcmp(param, "timing") == 0

Re: [PATCHES] \timing on/off

2008-06-10 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
Here's a patch, based on David's patch, that turns timing into a \pset 
variable, and makes \timing an alias for \pset timing. This makes 
\timing behave the same as \x.


This seems a bit random to me.  AFAIR all of the \pset properties are
associated with formatting of table output.  \timing doesn't belong.


Hmm, yes, so it seems. This patch 
http://archives.postgresql.org/pgsql-general/2008-05/msg00427.php from 
David seems the most appropriate thing to do then.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] GIN improvements

2008-06-06 Thread Heikki Linnakangas

Teodor Sigaev wrote:



2) fast insert into GIN

New version:
http://www.sigaev.ru/misc/fast_insert_gin-0.6.gz

Changes:
- added option FASTUPDATE=(1|t|true|on|enable|0|f|false|disable) for
  CREATE/ALTER command for GIN indexes


I think we should try to make it automatic. I'm not sure how.

How about having a constant sized fastupdate buffer, of say 100 rows 
or a fixed number of pages, and when that becomes full, the next 
inserter will have to pay the price of updating the index and flushing 
the buffer. To keep that overhead out of the main codepath, we could 
make autovacuum to flush the buffers periodically.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] extend VacAttrStats to allow stavalues of different types

2008-06-02 Thread Heikki Linnakangas

Jan Urbański wrote:

Following the conclusion here:
http://archives.postgresql.org/pgsql-hackers/2008-05/msg00273.php
here's a patch that extends VacAttrStats to allow typanalyze functions 
to store statistic values of different types than the underlying column.


Looks good to me at first glance.

About this comment:


+	 * XXX or maybe fall back on attrtype-> stuff when these are NULL? That way
+	 * we won't break other people's custom typanalyze functions. Not sure if
+	 * any exist, though.


I tried to google for a user defined data type with a custom typanalyze 
function but didn't find anything, so I don't think it's an issue. It's 
a bit nasty, though, because if one exists, it will compile and run just 
fine, until you run ANALYZE. In general it's better to break an old API 
obviously rather than silently, so that the old code doesn't even compile.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Map forks (WIP)

2008-05-20 Thread Heikki Linnakangas

Tom Lane wrote:

One thing I did *not* like was changing the FSM API to refer to Relation
rather than RelFileNode --- I don't believe that's a good idea at all.
In particular, consider what happens during TRUNCATE or CLUSTER: it's
not very clear how you'll tell the versions of the relation apart.
If you want to push the FSM API up to use SMgrRelation instead of
RelFileNode, that'd be okay, but not Relation.  (Essentially the
distinction I'm trying to preserve here is logical vs physical
relation.)


Oh really? I'm quite fond of the new API. From a philosophical point of
view, in the new world order, the FSM is an integral part of a relation, 
not something tacked on the physical layer. TRUNCATE and CLUSTER will 
need to truncate and truncate+recreate the FSM file, respectively. The 
FSM fork is on an equal footing with the main fork: when TRUNCATE swaps 
the relation file, a new FSM fork is created as well, and there's no way 
or need to access the old file anymore. When a relation is moved to 
another tablespace, the FSM fork is moved as well, and while the 
RelFileNode changes at that point, the logical Relation is the same.


Besides, Relation contains a bunch of very handy fields. pgstat_info in 
particular, which is needed if we want to collect pgstat information 
about FSM, and I think we will. I might also want add a field like 
rd_amcache there, for the FSM: I'm thinking of implementing something 
like the fastroot thing we have in b-tree, and we might need some other 
per-relation information there as well.



The XLogOpenRelationWithFork stuff needs to be re-thought also,
as again this is blurring the question of what's a logical and
what's a physical relation --- and if forknum isn't part of the
relation ID, that API is wrong either way.  I'm not sure about
a good solution in this area, but I wonder if the right answer
might be to make the XLOG replay stuff use SMgrRelations instead
of bogus Relations.  IIRC the replay code design predates the
existence of SMgrRelation, so maybe we need a fresh look there.


Agreed, I'm not happy with that part either. I tried to do just what you 
suggest, make XLOG replay stuff deal with SMgrRelations instead of the 
lightweight relcache, and it did look good until I got to refactoring 
btree_xlog_cleanup() (GIN/GiST has the same problem, IIRC). 
btree_xlog_cleanup() uses the same functions as the normal-operation 
code to insert pointers to parent pages, which operates on Relation. 
That started to become really hairy to solve without completely 
bastardizing the normal code paths.


Hmm. One idea would be to still provide a function to create a fake 
RelationData struct from SMgrRelation, which the redo function can call 
in that kind of situation.



(On closer look, XLogOpenRelationWithFork seems unused anyway


That's just because FSM WAL-logging hasn't been implemented yet.


One really trivial thing that grated on me was

+ /*
+  * In a few places we need to loop through 0..MAX_FORKS to discover which
+  * forks exists, so we should try to keep this number small.
+  */
+ #define MAX_FORKS (FSM_FORKNUM + 1)

I think you should either call it MAX_FORK (equal to the last fork
number) or NUM_FORKS (equal to last fork number plus one).  As is,
it's just confusing.  


Agreed, will fix.


And the comment is flat out wrong for the current usage.


What's described in the comment is done in ATExecSetTableSpace. I grant 
you that there are many other usages for it. I'll work on the comment.



BTW, it would probably be a good idea to try to get the fork access
API committed before you work on FSM.  Whenever you can break a
big patch into successive sections, it's a good idea, IMHO.  I don't
think there's any doubt that we are going to go in this direction,
so I see no objection to committing fork-based API revisions in advance
of having any real use for them.


Yep. I'll develop them together for now, but will separate them when the 
fork stuff is ripe for committing.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] create or replace language

2008-05-15 Thread Heikki Linnakangas

Andreas 'ads' Scherbaum wrote:

Attached is another version of the patch (still missing documentation),
which changes the language owner on update (the owner can still be
changed in pg_pltemplate).


The other CREATE OR REPLACE commands don't change the owner, so CREATE 
OR REPLACE LANGUAGE shouldn't do that either.



So do we want to replace any data (in my opinion only the validator is
left) at all or just skip any error message?


I think you should be able to change handler and validator functions, 
and the trusted flag. Or is there a reason to not allow that?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Verified fix for Bug 4137

2008-05-09 Thread Heikki Linnakangas

Simon Riggs wrote:

if (restartWALFileName)
{
+   /*
+* Don't do cleanup if the restartWALFileName provided
+* is later than the xlog file requested. This is an error
+* and we must not remove these files from archive.
+* This shouldn't happen, but better safe than sorry.
+*/
+   if (strcmp(restartWALFileName, nextWALFileName) > 0)
+   return false;
+ 
  		strcpy(exclusiveCleanupFileName, restartWALFileName);

return true;
}


I committed this sanity check into pg_standby, even though it really 
shouldn't happen. It just occurred to me, though, that the most likely 
reason for it to happen is that the user has screwed up his restore_command 
line, mixing up the %p and %r arguments. Should we make that an error 
instead of just not doing the cleanup?
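
A rough sketch of what the hard-error variant might look like in pg_standby.c 
(the message wording and exit code here are made up, not from any patch):

	if (strcmp(restartWALFileName, nextWALFileName) > 0)
	{
		fprintf(stderr,
				"pg_standby: %%r (%s) is later than the requested file %%f (%s); "
				"check that the restore_command arguments are not swapped\n",
				restartWALFileName, nextWALFileName);
		exit(2);
	}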


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Verified fix for Bug 4137

2008-05-09 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
Ok, that'll work. Committed, thanks. I changed the sanity check that 
xlogfname > restore point into an Assert, though, because that's a sign 
that something's wrong.


Shouldn't that Assert allow the equality case?


Yes. Thanks.

Hmm. I can't actually think of a scenario where that would happen, but 
it was definitely an oversight on my part.
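
In other words, with the equality case allowed, the check presumably ends up 
as something like this (a sketch, not the committed line):

	Assert(strcmp(restartWALFileName, nextWALFileName) <= 0);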


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Verified fix for Bug 4137

2008-05-06 Thread Heikki Linnakangas

Simon Riggs wrote:

The problem was that at the very start of archive recovery the %r
parameter in restore_command could be set to a filename later than the
currently requested filename (%f). This could lead to early truncation
of the archived WAL files and would cause warm standby replication to
fail soon afterwards, in certain specific circumstances.

Fix applied to both core server in generating correct %r filenames and
also to pg_standby to prevent acceptance of out-of-sequence filenames.


So the core problem is that we use ControlFile->checkPointCopy.redo in 
RestoreArchivedFile to determine the safe truncation point, but when 
there's a backup label file, that's still coming from the pg_control file, 
which is wrong.


The patch fixes that by determining the safe truncation point as 
Min(checkPointCopy.redo, xlogfname), where xlogfname is the xlog file 
being restored. That depends on the assumption that everything before 
the first file we (try to) restore is safe to truncate. IOW, we never 
try to restore file B first, and then A, where A < B.


I'm not totally convinced that's a safe assumption. As an example, 
consider doing an archive recovery, but without a backup label, and the 
latest checkpoint record is broken. We would try to read the latest 
(broken) checkpoint record first, and call RestoreArchivedFile to get 
the xlog file containing that. But because that record is broken, we 
fall back to using the previous checkpoint, and will need the xlog file 
where the previous checkpoint record is in.


That's a pretty contrived example, but the point is that assumption 
feels fragile. At the very least it should be noted explicitly in the 
comments. A less fragile approach would be to use something dummy, like 
 as the truncation point until we've 
successfully read the checkpoint/restartpoint record and started the replay.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Verified fix for Bug 4137

2008-05-06 Thread Heikki Linnakangas

Simon Riggs wrote:

Falling back to the secondary checkpoint implies we have a corrupted or
absent WAL file, so making recovery startup work correctly won't avoid
the need to re-run the base backup. We'll end with an unrecoverable
error in either case, so it doesn't seem worth attempting to improve
this in the way you suggest.


That's true whenever you have to fall back to a secondary checkpoint, 
but we still try to get the database up. One could argue that we 
shouldn't, of course.


Anyway, the point is that the patch relies on a non-obvious assumption. 
Even if the secondary checkpoint issue is a non-issue, it's not obvious 
(to me at least) that there aren't other similar scenarios. And someone 
might inadvertently break the assumption in a future patch, because it's 
not an obvious one; calling ReadRecord looks very innocent. We shouldn't 
introduce an assumption like that when we don't have to.



I think we should completely prevent access to secondary checkpoints
during archive recovery, because if the primary checkpoint isn't present
or is corrupt we aren't ever going to get past it to get to the
pg_stop_backup() point. Hence an archive recovery can never be valid in
that case. I'll do a separate patch for that because they are unrelated
issues.


Well, we already don't use the secondary checkpoint if a backup label 
file is present. And you can take a base backup without 
pg_start_backup()/pg_stop_backup() if you shut down the system first (a 
cold backup).


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Verified fix for Bug 4137

2008-05-06 Thread Heikki Linnakangas

Simon Riggs wrote:

On Tue, 2008-05-06 at 17:52 +0100, Heikki Linnakangas wrote:
I didn't suggest that alphabetical sorting property is a new assumption; 
it sure isn't. The new assumption is that you never call ReadRecord() 
for record 0002 before you call it for record 0001 (before initializing 
the checkPointCopy field from the checkpoint record, anyway).


I've not done anything with the ordering of calls to ReadRecord(). There
is no such assumption introduced here.


Hmm, I see that I was inaccurate when I said the patch introduces that 
assumption, sorry about the confusion. It's more like the code is 
currently completely oblivious of the issue, and the patch makes it 
better but doesn't fully fix it.


The code in CVS is broken, as we now know, because it assumes that we 
can delete all files < checkPointCopy.redo, which is not true when 
checkPointCopy.redo has not yet been initialized from the backup label 
file, and points to a location that's ahead of the real safe truncation 
point. The patch makes it better, and the assumption is now that it's 
safe to truncate everything < min(checkPointCopy.redo, xlog file we're 
reading). That still seems too fragile to me, because we assume that 
after you've asked for xlog record X, we never need anything earlier 
than that.


In fact, what will happen if the checkpoint record's redo pointer points 
to an earlier xlog file:


1. The location of the checkpoint record is read by read_backup_label(). 
Let's say that it's 0005.
2. ReadCheckpointRecord() is called for 0005. The restore command is 
called because that xlog file is not present. The safe truncation point 
is determined to be 0005, as that's what we're reading. Everything 
before that is truncated.
3. The redo pointer in the checkpoint record points to 0003. That's 
where we should start the recovery. Oops :-(


I haven't tested this, so am I missing something that makes that not 
happen?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] plpgsql CASE statement - last version

2008-05-02 Thread Heikki Linnakangas

Pavel Stehule wrote:

Hello

I found some bugs when I used base_lexer, so I returned back own
lexer. It's only little bit longer, but simpler.


Hmm. I don't like having to lex the expressions again. It just doesn't 
feel right.


How about taking a completely different strategy, and implement the 
CASE-WHEN construct fully natively in plpgsql, instead of trying to 
convert it to a single SQL CASE-WHEN expression? It's not a very good 
match anyway; you have to do tricks to convert the comma-separated lists 
of WHEN expressions to WHEN-THEN clauses, and you can't use the 
THEN-clauses as is, but you have to replace them with offsets into the 
array. I think implementing the logic in pl_exec.c would lead to cleaner 
code.


FWIW, the current approach gives pretty cryptic CONTEXT information in 
error messages as well. For example, this pretty simple case-when example:


postgres=# create or replace FUNCTION case_test(int)
returns text as $$
begin
  case $1
when 1 then
  return 'one';
when 'invalid' then
  return 'two';
when 3,4,5 then
  return 'three, four or five';
  end case;
end;
$$ language plpgsql immutable;
CREATE FUNCTION

gives this pretty hard-to-understand error message:

postgres=# SELECT case_test(1);
ERROR:  invalid input syntax for integer: "invalid"
CONTEXT:  SQL statement "SELECT CASE   $1  WHEN  1 THEN  1  WHEN 
'invalid' THEN  2  WHEN  3 THEN  3  WHEN 4 THEN  3  WHEN 5 THEN  3  END "
PL/pgSQL function "case_test" line 2 at unknown

BTW, what does PL/SQL or PSM say about the type-compatibility of the 
CASE and the WHENs? We're very lenient in assignments, how should this 
behave?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] GUC parameter cursors_tuple_fraction

2008-05-02 Thread Heikki Linnakangas

Simon Riggs wrote:

* We've said here http://www.postgresql.org/docs/faqs.TODO.html that we
Don't want hints. If that's what we really think, then this patch must
surely be rejected because its a hint... That isn't my view. I *now*
think we do need hints of various kinds. 


cursors_tuple_fraction or OPTIMIZE FOR xxx ROWS isn't the kind of hint 
we've said no to in the past. We don't want hints that work around 
planner deficiencies, for example where we get the row count of a node 
completely wrong. This is different. This is about telling the planner how 
the application is going to use the result set. It's relevant even assuming 
that the planner got the estimates spot on. Which plan is best 
depends on whether the application can start processing the data as it 
comes in, or whether it's loading it all into memory first, for example.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Patch to add a feature to pg_standby

2008-04-30 Thread Heikki Linnakangas

[EMAIL PROTECTED] wrote:

When using pg_standby to remain in recovery mode on a warm standby system,
if there is a need to perform other actions in coordination with recovery
actions, the -x auxiliary command option implemented by this patch
enables that coordination.  I considered adding the ability to override the
restoreCommand, however by keeping this separate and optional it is
possible to force retries of the auxiliary command until successful and
still utilize pg_usleep instead of looping within an external script or
command.  And the previous behavior of pg_standby remains unchanged (other
than debug logging and documenting the option in usage) if the new option
is omitted.

I added this feature to help with synchronization of a content repository
consisting of a PostgreSQL db for meta-information and a separate file
store for content.
The auxiliary command enables forcing an rsync of the file store that is at
least as current as the found WAL segment file's db changes, and prevents
recovery of that WAL file unless the rsync can be performed successfully.

(See attached file: pg_standby.c.diff)


This could be implemented by a pass-through restore_command that 
calls pg_standby and performs the custom action when pg_standby returns 
successfully. Or perhaps you could run the custom action in whatever 
script is copying the files to the archive directory.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] pg_postmaster_reload_time() patch

2008-04-30 Thread Heikki Linnakangas

Gurjeet Singh wrote:

On Wed, Apr 30, 2008 at 9:53 AM, George Gensure [EMAIL PROTECTED] wrote:


I've done a quick write up for reload time reporting from the
administration TODO.  I was a little paranoid with the locking, but
didn't want problems to occur with signals on the postmaster and the
read side.



IMHO, the function should return NULL if the postmaster never reloaded; that
is, it was started, but never signaled to reload.


I think it's useful to get the server startup time if the configuration 
has never been reloaded. That's when the configuration was loaded, if no 
reload has been triggered since. Perhaps the function should be named to 
reflect that better. pg_configuration_load_time() ?
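
A hypothetical sketch of what such a function could boil down to; the 
variable name ConfigLoadTime is invented here and stands in for whatever 
timestamp the postmaster ends up keeping (set at startup, updated on reload):

	Datum
	pg_configuration_load_time(PG_FUNCTION_ARGS)
	{
		/* ConfigLoadTime: assumed global holding the last config (re)load time */
		PG_RETURN_TIMESTAMPTZ(ConfigLoadTime);
	}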


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Partial match in GIN

2008-04-09 Thread Heikki Linnakangas

Alvaro Herrera wrote:

Well, LIKE %foo% is supposed to match foo unanchored, but with a btree
(or two btrees) you can only get 'foo' anchored to either end of the
string (or both).


Ah, true.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] EXPLAIN progress info

2008-04-09 Thread Heikki Linnakangas

Tom Lane wrote:

Gregory Stark [EMAIL PROTECTED] writes:
There are downsides: 


Insurmountable ones at that.  This one already makes it a non-starter:


a) the overhead of counting rows and loops is there for every query execution,
even if you don't do explain analyze.


and you are also engaging in a flight of fantasy about what the
client-side code might be able to handle.  Particularly if it's buried
inside, say, httpd or some huge Java app.  Yeah, you could possibly make
it work for the case that the problem query was manually executed in
psql, but that doesn't cover very much real-world territory.


I think there's two different use cases here. The one that Greg's 
proposal would be good for is a GUI, like pgAdmin. It would be cool to 
see how a query progresses through the EXPLAIN tree when you run it from 
the query tool. That would be great for visualizing the executor; a 
great teaching tool.


But I agree it's no good for use by a DBA to monitor a live system 
running a real-world application. For that we do need something else.



You'd be far more likely to get somewhere with a design that involves
looking from another session to see if anything's happening.  In the
case of queries that are making database changes, pgstattuple is
certainly a usable option.  For SELECT-only queries, I agree it's
harder, but it's still possible.  I seem to recall some discussion of
including a low-overhead progress counter of some kind in the
pg_stat_activity state exposed by a backend.  The number of rows so far
processed by execMain.c in the current query might do for the
definition.


Yeah, something like this would be better for monitoring a live system.

The number of rows processed by execMain.c would only count the number 
of rows processed by the top node of the tree, right? For a query that 
for example performs a gigantic sort, that would be 0 until the sort is 
done, which is not good. It's hard to come up with a single counter 
that's representative :-(.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


[PATCHES] Indexam API changes

2008-04-08 Thread Heikki Linnakangas
There's a bunch of mails in the patch queue about the indexam API, so we 
need to discuss that.


The first question is: do we want to refactor the bitmap index scan API, 
if we don't have any immediate use for it? Namely, since we don't have 
anyone actively working on the bitmap index patch nor the git patch.


There was also discussion on adding support for candidate matches, 
mainly for GIT, but GiST could possibly also take advantage of that.


If people think it's worth it, I can fix the bit-rot in the patch and 
work on it, but personally I don't think it is.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Partial match in GIN

2008-04-08 Thread Heikki Linnakangas

Teodor Sigaev wrote:
Looking at the patch, you require that the TIDBitmap fits in work_mem 
in non-lossy format. I don't think that's acceptable, it can easily 
exceed work_mem if you search for some very common word. Failing to 
execute a valid query is not good.
But way is better than nothing. In really, that way was chosen to have 
fast merge of (potentially) hundreds of sorted lists of ItemPointers. 
Other ways is much slower.


How about forcing the use of a bitmap index scan, and modifying the indexam 
API so that GIN could return a lossy bitmap, and let the bitmap heap 
scan do the rechecking?


I don't think the storage size of tsquery matters much, so whatever is 
the best solution in terms of code readability etc.
That was about tsqueryesend/recv format? not a storage on disk. We don't 
require compatibility of binary format of db's files, but I have some 
doubts about binary dump.


We generally don't make any promises about cross-version compatibility 
of binary dumps, though it would be nice not to break it if it's not too 
much effort.


Hmm. match_special_index_operator() already checks that the index's 
opfamily is pattern_ops, or text_ops with C-locale. Are you reusing 
the same operator families for wildspeed? Doesn't it then also get 
confused if you do a WHERE textcol  'foo' query by hand?

No, wildspeed use the same operator ~~
match_special_index_operator() isn't called at all: in 
match_clause_to_indexcol() function is_indexable_operator() is called 
before match_special_index_operator() and returns true.


expand_indexqual_opclause() sees that operation is a OID_TEXT_LIKE_OP 
and calls prefix_quals() which fails because it wishes only several 
Btree opfamilies.


Oh, I see. So this assumption mentioned in the comment there:

/*
 * LIKE and regex operators are not members of any index opfamily,
 * so if we find one in an indexqual list we can assume that it
 * was accepted by match_special_index_operator().
 */

is no longer true with wildspeed. So we do need to check that in 
expand_indexqual_opclause() then.


NOTICE 2: it seems to me, that similar technique could be implemented 
for ordinary BTree to eliminate hack around LIKE support.
LIKE expression. I wonder what the size and performance of that would 
be like, in comparison to the proposed GIN solution?


GIN speeds up '%foo%' too - which is impossible for btree. But I don't 
like a hack around LIKE support in BTree. This support uses outflank 
ways missing regular one.


You could satisfy '%foo%' using a regular and a reverse B-tree index, 
and a bitmap AND. Which is interestingly similar to the way you proposed 
to use a TIDBitmap within GIN.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Partial match in GIN

2008-04-08 Thread Heikki Linnakangas

Alvaro Herrera wrote:

Heikki Linnakangas wrote:

Teodor Sigaev wrote:


GIN speeds up '%foo%' too - which is impossible for btree. But I don't  
like a hack around LIKE support in BTree. This support uses outflank  
ways missing regular one.
You could satisfy '%foo%' using a regular and a reverse B-tree index,  
and a bitmap AND. Which is interestingly similar to the way you proposed  
to use a TIDBitmap within GIN.


Huh, can you?  I can see doing col LIKE 'foo%' OR reverse(col) LIKE
reverse('%foo') with two btree indexes, but not a true %foo% ...


That should be AND, not OR..

Hmm. It is the same as far as I can see. Am I missing something?

You couldn't support more complex patterns directly, like 'foo%bar%foo', 
but you could still split that into 'foo%' AND '%bar%' AND '%foo', and 
recheck the matches against the original pattern


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Improve shutdown during online backup

2008-04-07 Thread Heikki Linnakangas

Albe Laurenz wrote:

Moreover, if Shutdown == SmartShutdown, new connections won't be accepted,
and nobody can connect and call pg_stop_backup().
So even if I'd add a check for
(pmState == PM_WAIT_BACKENDS)  !BackupInProgress() somewhere in the
ServerLoop(), it wouldn't do much good, because the only way for somebody
to cancel online backup mode would be to manually remove the file.


Good point.


So the only reasonable thing to do on smart shutdown during an online
backup is to have the shutdown request fail, right? The only alternative being
that a smart shutdown request should interrupt online backup mode.


Or we can add another state, PM_WAIT_BACKUP, before PM_WAIT_BACKENDS, 
that allows new connections, and waits until the backup ends.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Expose checkpoint start/finish times into SQL.

2008-04-05 Thread Heikki Linnakangas

Robert Treat wrote:
1) Alert if checkpointing stops occurring within a reasonable time frame (note 
there are failure cases and normal use cases where this might occur)  (also 
note I'll agree, this isn't common, but the results are pretty disastrous if 
it does happen)


What are the normal use cases where this would occur? I can't think of any.

Wrt. failure cases, there are a million things that can go wrong in a 
system, and only a few of them will show up as checkpoints no longer 
happening, so I'm not excited about adding a facility to 
monitor just that.


More hooks for monitoring purposes in general would be nice, and I would 
like to see them exposed as SQL functions, but I'd like to see a much 
bigger proposal for that.


2) Can be graphed over time (using rrdtool and others) for trending checkpoint 
activity


Hmm. You'd need the historical data to do that properly. In particular, 
if two checkpoints happen within one polling interval, you'd miss that. 
log_checkpoints=on, in CSV output, seems like a better approach for that.



3) Ease monitoring of checkpoints across pitr setups


How is that different from monitoring in a non-pitr setup?

4) Help determine if your checkpoints are being timeout driven or segment 
driven, or if you need to look at those settings 


Seems like the log_checkpoints output is more useful for that, as it 
directly tells you what triggered the checkpoint.


5) Determine the number of log files that will need to be replayed in the 
event of a crash


If you have a requirement for that, just set checkpoint_segments 
accordingly. And again, you can get the information in the logs by 
log_checkpoints=on already.


6) Helps give an indication on if you should enter a manual checkpoint before 
issuing a pg_start_backup call 


If you want the backup to begin immediately, just do a manual checkpoint 
unconditionally. It'll finish quickly if there hasn't been much activity 
since the last one. We talked about adding a new variant of 
pg_start_backup() that does that automatically (or rather, just hurries 
the current checkpoint) in the 8.3 cycle, but didn't do it in the end. 
Perhaps we should add that, after all?
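
A sketch of roughly what such a fast variant might do differently, assuming 
the existing call is RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT) 
(illustration only, not a patch):

	/* hurry the checkpoint instead of spreading it out over time */
	RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT | CHECKPOINT_IMMEDIATE);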


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Partial match in GIN

2008-04-05 Thread Heikki Linnakangas

Teodor Sigaev wrote:

For each
matched entry all corresponding ItemPointers are collected in TIDBitmap
structure to effective merge ItemPointers from different entries. Patch
introduces following changes in interface:


Looking at the patch, you require that the TIDBitmap fits in work_mem in 
non-lossy format. I don't think that's acceptable, it can easily exceed 
work_mem if you search for some very common word. Failing to execute a 
valid query is not good.


Here there is a unclean issue: now tsquery has new flag to label prefix 
search and cstring representation has backward compatibility, but 
external binary hasn't it now. Now, extra byte is used for storage of 
this flag. In other hand, there 4 unused bits in external binary 
representation (in byte stores weights of lexeme), so it's possible to 
use one of them to store this flag. What are opinions?


I don't think the storage size of tsquery matters much, so whatever is 
the best solution in terms of code readability etc.


NOTICE 1: current index support of LIKE believes that only BTree can 
speed up LIKE and becomes confused with this module with error 
'unexpected opfamily' in
prefix_quals(). For this reason, partial match patch adds small check 
before

calling expand_indexqual_opclause().


Hmm. match_special_index_operator() already checks that the index's 
opfamily is pattern_ops, or text_ops with C-locale. Are you reusing the 
same operator families for wildspeed? Doesn't it then also get confused 
if you do a WHERE textcol < 'foo' query by hand?


NOTICE 2: it seems to me, that similar technique could be implemented 
for ordinary BTree to eliminate hack around LIKE support.


Yep, if you created a b-tree index on the reversed key, that could be 
used for LIKE '%foo' expressions. And if you had that in addition to a 
regular b-tree index, you could use those two indexes together for any 
LIKE expression. I wonder what the size and performance of that would be 
like, in comparison to the proposed GIN solution?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Re: [BUGS] BUG #4070: Join more then ~15 tables let postgreSQL produces wrong data

2008-04-03 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
On second thought, expanding AttrNumber to int32, wholesale, might not 
be a good idea,


No, it wouldn't.  For one thing it'd be a protocol break --- column
numbers are int16 --- 


I wasn't planning to change that.


and for another, we'd have terrible performance
problems with such wide rows.


Yes, we probably would :-). Though if there's any nasty O(n^2) behavior 
left in there, we should look at optimizing it anyway to speed up more 
reasonably sized queries, in the range of a few hundred columns.



 Actually rows are supposed to be limited
to ~1600 columns, anyway, because of HeapTupleHeader limitations.


The trick is that that limitation doesn't apply to the intermediate 
virtual tuples we move around in the executor. Those are just arrays of 
Datums, and can have more than MaxTupleAttributeNumber attributes, as 
long as you project away enough attributes, bringing it below that 
limit, before returning it to the client or materializing it into a 
HeapTuple or MinimalTuple in the executor.



Apparently you've found a path where that restriction isn't enforced
correctly, but I haven't seen the referenced message yet ...


Enforcing the limit for virtual tuples as well, and checking for the 
limit in the planner is one option, but it would cripple the ability to 
join extremely wide tables. For example, if you had 10 tables with 200 
columns each, you couldn't join them together even for the purposes of 
COUNT(*). Granted, that's not a very common thing to do (this is the 
first time this bug has been reported, after all), but I'd prefer to keep the 
capability if possible.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Re: [BUGS] BUG #4070: Join more then ~15 tables let postgreSQL produces wrong data

2008-04-03 Thread Heikki Linnakangas

Tom Lane wrote:

Alvaro Herrera [EMAIL PROTECTED] writes:

Tom Lane wrote:

I still haven't seen the actual bug description come by here, and the
pgsql-bugs archive hasn't got it either.



(apparently some mails on that thread are missing ...)


That's what I meant.  Heikki is quoting himself from a message that
hasn't appeared anywhere public, and he must have had at least one
message from the OP that hasn't appeared either.  So the rest of us
are still mostly in the dark about the problem.


Hmm, strange. Looks like my mail client decided to sent that mail to 
pgsql-bugs-owner@ instead of pgsql-bugs@ for some reasone. Here's the 
missing mail:


Ceschia, Marcello wrote:
 In query query_not_working all values from column 136_119 has the 
value of the first column.


 Using the splitted query (working_version) it works.

 I hope this data will help to find the bug.

Thanks.

Oh, the query actually gives an assertion failure on an 
assertion-enabled build, so this is clearly a bug:


TRAP: FailedAssertion("!(attnum > 0 && attnum <= 
list_length(rte->joinaliasvars))", File: "parse_relation.c", Line: 1697)


gdb tells that attnum is -31393 at that point. That's because 
get_rte_attribute_type() takes an AttrNumber, which is int16, and 
make_var() is trying to pass 34143, so it overflows.


It seems we should extend AttrNumber to int32; we don't use AttrNumber 
in any of the on-disk structs. Though you still couldn't have more than 
MaxHeapAttributeNumber (1600) attributes in a table or 
MaxTupleAttributeNumber (1664) in a result set or intermediate tuples, 
like the output of a sort node, at least you could join ridiculously 
wide tables like that as long as you project out enough columns.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Expose checkpoint start/finish times into SQL.

2008-04-03 Thread Heikki Linnakangas

Theo Schlossnagle wrote:

First whack at exposing the start and finish checkpoint times into SQL.


Why is that useful?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Ordered Append WIP patch v1

2008-03-31 Thread Heikki Linnakangas

Gregory Stark wrote:

Here's the WIP patch I described on -hackers to implemented ordered append
nodes.


I think it would be better to create a completely separate executor node 
for this, rather than tack this feature onto the regular Append node. The 
two don't seem to have much in common. Looking at AppendState, for 
example, as_firstplan and as_lastplan are not interesting for the 
MergeAppend, and likewise MergeAppend needs a whole bunch of fields not 
interesting to regular Append. The fact that the first statement in 
ExecAppend is if (!node->as_is_ordered), with zero lines of shared 
code between them, also suggests that it would be a good idea to keep 
them separate.


As you point out in the comments, it's quite confusing to have indexes 
into the slots array and the heap array. I still can't get my head 
around that. Would it help to define a struct along the lines of:


struct {
  TupleTableSlot *slot;
  PlanState *plan;
};

and store pointers to those in the heap?


1) I still haven't completely figured out what to do with equivalence classes.
   My hack of just stuffing all the append subrel vars into there seems to
   work fine. I need to understand what's going on to see if there's really a
   problem with it or not.


I don't understand that well enough to comment... I presume the FIXME in 
the patch is related to this?



4) I haven't handled mark/restore or random access. I think they could be
   handled and they'll probably be worth the complexity but I'm not sure.


For Mark/restore, I suppose you would mark/restore all subnodes, and 
store/restore the heap. For reversing direction, you would reverse the 
heap. Doesn't sound too hard, but it's probably better to make it work 
without them at first before adding any bells and whistles...



5) Is it considering too many paths? Are there some which maybe aren't worth
   considering? For example, maybe it only makes sense to take best start_cost
   paths since if that plan doesn't dominate then the best total_cost plan is
   likely to be the sequential scans + unordered append + sort.


I don't understand this paragraph. Are you referring to this comment in 
the patch:



/* If we can't find any plans with the right order just add the
 * cheapest total plan to both paths, we'll have to sort it
 * anyways. Perhaps consider not generating a startup path for such
 * pathkeys since it'll be unlikely to be the cheapest startup
 * cost. */


Having to sort one or two of the sub-plans might not be expensive 
at all. For example, having to sort the empty parent relation of a 
partitioned table is essentially free. It's probably never cheaper to 
use the Merge Append if you need to sort *all* of the children, but if 
you can avoid sorting even one of them, it might very well be worthwhile.



6) I haven't looked at setops yet but this is particularly attractive for the
   UNION (as opposed to UNION ALL) case.


Yes. And DISTINCT.


7) I copied/adapted a bunch of bits from tuplesort to maintain the heap and do
   the scankey comparisons. I could refactor that code back into tuplesort but
   it would mean twisting around tuplesort quite a bit. Essentially it would
   mean introducing a new type of tuplesort which would start off in
   FINAL_MERGE state only it would have to behave differently since we don't
   want it prefetching lots of records like FINAL_MERGE does, I don't think.


Doesn't seem worthwhile. The heap management part of the patch is quite 
short.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] create language ... if not exists

2008-03-29 Thread Heikki Linnakangas

Andreas 'ads' Scherbaum wrote:

The attached patch for HEAD extends the CREATE LANGUAGE statement by an
IF NOT EXISTS option which in effect changes the raised error into a
notice.

Before i continue working on this patch i would like to know if this
extension has a chance to go into PG and what other changes i should
apply (beside the missing documentation).


The way we've solved this problem for other CREATE commands is to add an 
OR REPLACE option, instead of IF NOT EXISTS. We should do the same here.


Regarding the patch itself: You define rule opt_if_not_exists, but 
never use it. And you add a new rule for CREATE LANGUAGE ... HANDLER 
..., but forgot IF_P NOT EXISTS from the end of that. Looks like you 
couldn't decide which approach to take, and ended up doing a little bit 
of both ;-).


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] create language ... if not exists

2008-03-29 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
The way we've solved this problem for other CREATE commands is to add 
OR REPLACE option, instead of IF NOT EXISTS. We should do the same here.


If we're willing to consider a solution that is specific to CREATE
LANGUAGE (as opposed to implementing IF NOT EXISTS across-the-board,
which might happen someday) what I'd suggest is just incorporating
the behavior directly into the abbreviated (no parameters) form of
CREATE LANGUAGE.  If the language already exists and has the same
properties specified in pg_pltemplate, don't raise an error.  Give
a notice maybe.


Why not implement OR REPLACE like for other things? Still seems the 
most consistent behavior to me.


You might want to get the error if the language already exists, which 
your proposal wouldn't allow. And it wouldn't help with languages 
without a pg_pltemplate entry.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] [BUGS] Incomplete docs for restore_command for hotstandby

2008-03-28 Thread Heikki Linnakangas

Simon Riggs wrote:

On Mon, 2008-02-25 at 17:56 +0600, Markus Bertheau wrote:

The FIXME of course needs replacement by someone in the know.


Doc patch edited to include all of Markus' points, tidy up some related
text and fix typos.

Good to apply to HEAD.


Committed to HEAD with minor fixes.

What's our policy wrt. back-patching doc changes? This seems applicable 
to older versions as well, but do we do that?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] Consistent \d commands in psql

2008-03-26 Thread Heikki Linnakangas

Greg Sabino Mullane wrote:

Attached is an updated version of my psql patch that makes the \d
backslash commands perform in an intuitive, consistent way.
Specifically, the following objects will be treated as first class
citizens (as tables and indexes currently are) by showing all the
non-system objects by default and requiring a S to see the system
ones.

aggregates
conversions
comments
domains
operators
functions
types

Currently, there is no way to view all the non-system functions in a
database using backslash commands, as you can with \dt, unless all of
the functions happen to be in a single schema (\df myschema.). With
this patch, it would be as simple as \df, and the current behavior
would be done with \dfS.


Yes, that seems like a good idea. \df in particular has been too noisy
to be usable. Not sure about conversions and domains; I doubt anyone 
creates custom conversions in practice, and there are no system domains in 
a standard installation.


Does anyone want to argue that there's a backward-compatibility problem 
with changing \df? I don't think there is; you shouldn't be using psql 
backslash commands in an application.



This patch also adds a few new things to the tab-completion table, such
as comments and conversions.


There's a bunch of merge conflicts in the diff.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] \password in psql help

2008-03-26 Thread Heikki Linnakangas

Magnus Hagander wrote:

+   fprintf(output, _(  \\password [USERNAME]\n
+ securely change the password for 
a user\n));


I would leave out the word securely. Unless you want to provide 
another command for changing it insecurely ;-). What does it mean, anyway?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-12 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
Elsewhere in our codebase where we use arrays that are enlarged as 
needed, we keep track of the allocated size and the used size of the 
array separately, and only call repalloc when the array fills up, and 
repalloc a larger than necessary array when it does. I chose to just 
call repalloc every time instead, as repalloc is smart enough to fall 
out quickly if the chunk the allocation was made in is already larger 
than the new size. There might be some gain avoiding the repeated 
repalloc calls, but I doubt it's worth the code complexity, and calling 
repalloc with a larger than necessary size can actually force it to 
unnecessarily allocate a new, larger chunk instead of reusing the old 
one. Thoughts on that?


Seems like a pretty bad idea to me, as the behavior you're counting on
only applies to chunks up to 8K or thereabouts. 


Oh, you're right. Though I'm sure libc realloc has all kinds of smarts 
as well, it does seem better to not rely too much on that.



In a situation where
you are subcommitting lots of XIDs one at a time, this is likely to have
quite awful behavior (or at least, you're at the mercy of the local
malloc library as to how bad it is).  I'd go with the same
double-it-each-time-needed approach we use elsewhere.


Yep, patch attached. I also changed xactGetCommittedChildren to return 
the original array instead of copying it, as Alvaro suggested.
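
For readers following along, the double-when-full growth looks roughly like 
this (a simplified sketch, not the attached patch verbatim; new_nxids stands 
for the number of XIDs being merged up from the committing child):

	if (s->parent->nChildXids + new_nxids > s->parent->maxChildXids)
	{
		int		newmax = Max(s->parent->maxChildXids * 2,
							 s->parent->nChildXids + new_nxids);

		if (s->parent->childXids == NULL)
			s->parent->childXids = (TransactionId *)
				MemoryContextAlloc(TopTransactionContext,
								   newmax * sizeof(TransactionId));
		else
			s->parent->childXids = (TransactionId *)
				repalloc(s->parent->childXids,
						 newmax * sizeof(TransactionId));

		s->parent->maxChildXids = newmax;
	}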


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/transam/twophase.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/twophase.c,v
retrieving revision 1.39
diff -c -r1.39 twophase.c
*** src/backend/access/transam/twophase.c	1 Jan 2008 19:45:48 -	1.39
--- src/backend/access/transam/twophase.c	12 Mar 2008 12:45:13 -
***
*** 827,833 
  		save_state_data(children, hdr.nsubxacts * sizeof(TransactionId));
  		/* While we have the child-xact data, stuff it in the gxact too */
  		GXactLoadSubxactData(gxact, hdr.nsubxacts, children);
- 		pfree(children);
  	}
  	if (hdr.ncommitrels  0)
  	{
--- 827,832 
Index: src/backend/access/transam/xact.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xact.c,v
retrieving revision 1.258
diff -c -r1.258 xact.c
*** src/backend/access/transam/xact.c	4 Mar 2008 19:54:06 -	1.258
--- src/backend/access/transam/xact.c	12 Mar 2008 13:41:10 -
***
*** 130,136 
  	int			gucNestLevel;	/* GUC context nesting depth */
  	MemoryContext curTransactionContext;		/* my xact-lifetime context */
  	ResourceOwner curTransactionOwner;	/* my query resources */
! 	List	   *childXids;		/* subcommitted child XIDs */
  	Oid			prevUser;		/* previous CurrentUserId setting */
  	bool		prevSecDefCxt;	/* previous SecurityDefinerContext setting */
  	bool		prevXactReadOnly;		/* entry-time xact r/o state */
--- 130,138 
  	int			gucNestLevel;	/* GUC context nesting depth */
  	MemoryContext curTransactionContext;		/* my xact-lifetime context */
  	ResourceOwner curTransactionOwner;	/* my query resources */
! 	TransactionId	*childXids;	/* subcommitted child XIDs, in XID order */
! 	int			nChildXids;		/* # of subcommitted child XIDs */
! 	int			maxChildXids;	/* allocated size of childXids */
  	Oid			prevUser;		/* previous CurrentUserId setting */
  	bool		prevSecDefCxt;	/* previous SecurityDefinerContext setting */
  	bool		prevXactReadOnly;		/* entry-time xact r/o state */
***
*** 156,162 
  	0,			/* GUC context nesting depth */
  	NULL,		/* cur transaction context */
  	NULL,		/* cur transaction resource owner */
! 	NIL,		/* subcommitted child Xids */
  	InvalidOid,	/* previous CurrentUserId setting */
  	false,		/* previous SecurityDefinerContext setting */
  	false,		/* entry-time xact r/o state */
--- 158,166 
  	0,			/* GUC context nesting depth */
  	NULL,		/* cur transaction context */
  	NULL,		/* cur transaction resource owner */
! 	NULL,		/* subcommitted child Xids */
! 	0,			/* # of subcommitted child Xids */
! 	0,			/* allocated size of childXids */
  	InvalidOid,	/* previous CurrentUserId setting */
  	false,		/* previous SecurityDefinerContext setting */
  	false,		/* entry-time xact r/o state */
***
*** 559,565 
  	 */
  	for (s = CurrentTransactionState; s != NULL; s = s->parent)
  	{
! 		ListCell   *cell;
  
  		if (s->state == TRANS_ABORT)
  			continue;
--- 563,569 
  	 */
  	for (s = CurrentTransactionState; s != NULL; s = s->parent)
  	{
! 		int low, high;
  
  		if (s->state == TRANS_ABORT)
  			continue;
***
*** 567,576 
  			continue;			/* it can't have any child XIDs either */
  		if (TransactionIdEquals(xid, s->transactionId))
  			return true;
! 		foreach

Re: [PATCHES] [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-12 Thread Heikki Linnakangas

Pavan Deolasee wrote:

Wait. Subtransaction ids are local to a transaction and always start from 1.
See this:

/*
 * reinitialize within-transaction counters
 */
s->subTransactionId = TopSubTransactionId;
currentSubTransactionId = TopSubTransactionId;


No, we're not talking about SubTransactionIds. We're talking about real 
XIDs assigned to subtransactions. Subtransactions are assigned regular 
XIDs as well, just like top-level transactions.


I can see where you were coming from with your suggestion now :-).

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-12 Thread Heikki Linnakangas

Pavan Deolasee wrote:

On Wed, Mar 12, 2008 at 9:27 PM, Tom Lane [EMAIL PROTECTED] wrote:


 I didn't like it; it seemed overly complicated (consider dealing with
 XID wraparound),


We are talking about subtransactions here. I don't think we support
subtransaction wrap-around, do we ?


Imagine that you start a transaction just before transaction 
wrap-around, so that the top level XID is 2^31-10. Then you start 20 
subtransactions. What XIDs will they get? Now how would you map those to 
a bitmap?


It's certainly possible; you could index the bitmap by the offset from the 
top-level transaction's XID, for example. But it does get a bit complicated.



and it would have problems with a slow transaction
 generating a sparse set of subtransaction XIDs.


I agree thats the worst case. But is that common ? Thats what I
was thinking when I proposed the alternate solution. I thought that can
happen only if most of the subtransactions abort, which again I thought
is not a normal case. But frankly I am not sure if my assumption is correct.


It's not that common to have hundreds of thousands of subtransactions to 
begin with..



I think getting rid of
 the linear search will be enough to fix the performance problem.


I wonder if a skewed binary search would help more ? For example,
if we know that the range of xids stored in the array is 1 to 1000 and
if we are searching for a number closer to 1000, we can break the
array into large,small parts instead of equal parts and then
search.


Possibly, but I doubt it's worth the trouble. The simple binary search 
solved the performance problem well enough. In the test case of the OP, 
with 30 subtransactions, with the patch, there was no longer any 
measurable difference whether you ran the SELECT COUNT(*) in the same 
transaction as the INSERTs or after a COMMIT.
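
For reference, the search itself is just a plain binary search over the 
sorted childXids array, along these lines (a simplified sketch using the 
fields the patch adds, not the committed code verbatim):

	static bool
	childXidIsSubcommitted(TransactionState s, TransactionId xid)
	{
		int		low = 0;
		int		high = s->nChildXids - 1;

		while (low <= high)
		{
			int				middle = low + (high - low) / 2;
			TransactionId	probe = s->childXids[middle];

			if (TransactionIdEquals(probe, xid))
				return true;
			else if (TransactionIdPrecedes(probe, xid))
				low = middle + 1;
			else
				high = middle - 1;
		}
		return false;
	}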



Well, may be I making simple things complicated ;-)


I think so :-).

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-11 Thread Heikki Linnakangas

(moved to pgsql-patches, as there's a patch attached)

Tom Lane wrote:

Getting rid of the linked-list representation would be a win in a couple
of ways --- we'd not need the bogus list of XIDs support in pg_list.h,
and xactGetCommittedChildren would go away.  OTOH AtSubCommit_childXids
would get more expensive.


I couldn't let this case go, so I wrote a patch. I replaced the linked 
list with an array that's enlarged at AtSubCommit_childXids when necessary.


I couldn't measure any performance hit from the extra work of enlarging 
and memcpying the array. I tried two test cases:


1. Insert one row per subtransaction, and commit the subtransaction 
after each insert. This is like the OP's case.


printf("CREATE TABLE foo (id int4);\n");
printf("BEGIN;\n");
for(i=1; i <= 10; i++)
{
printf("SAVEPOINT sp%d;\n", i);
printf("INSERT INTO foo VALUES (1);\n");
printf("RELEASE SAVEPOINT sp%d;\n", i);
}
printf("COMMIT;\n");

2. Insert one row per subtransaction, but leave the subtransaction in 
not-committed state


printf("CREATE TABLE foo (id int4, t text);\n");
printf("BEGIN;\n");
for(i=1; i <= 10; i++)
{
printf("SAVEPOINT sp%d;\n", i);
printf("INSERT INTO foo VALUES (1, 'f');\n");
}
printf("COMMIT;\n");

Test case 1 is not bad, because we just keep appending to the parent 
array one at a time. Test case 2 might become slower as the number of 
subtransactions increases, because at the commit of each subtransaction you 
need to enlarge the parent array and copy all the already-committed 
child XIDs to it. However, you hit the limit on the amount of shared memory 
required for holding the per-xid locks before that becomes a problem, 
and the performance becomes dominated by other things anyway (notably 
LockReassignCurrentOwner).


I initially thought that using a single palloc'd array to hold all the 
XIDs would introduce a new limit on the number of committed 
subtransactions, thanks to MaxAllocSize, but that's not the case. 
Without the patch, we actually allocate an array like that anyway in 
xactGetCommittedChildren.


Elsewhere in our codebase where we use arrays that are enlarged as 
needed, we keep track of the allocated size and the used size of the 
array separately, and only call repalloc when the array fills up, and 
repalloc a larger than necessary array when it does. I chose to just 
call repalloc every time instead, as repalloc is smart enough to fall 
out quickly if the chunk the allocation was made in is already larger 
than the new size. There might be some gain avoiding the repeated 
repalloc calls, but I doubt it's worth the code complexity, and calling 
repalloc with a larger than necessary size can actually force it to 
unnecessarily allocate a new, larger chunk instead of reusing the old 
one. Thoughts on that?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/transam/xact.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xact.c,v
retrieving revision 1.258
diff -c -r1.258 xact.c
*** src/backend/access/transam/xact.c	4 Mar 2008 19:54:06 -	1.258
--- src/backend/access/transam/xact.c	11 Mar 2008 12:31:28 -
***
*** 130,136 
  	int			gucNestLevel;	/* GUC context nesting depth */
  	MemoryContext curTransactionContext;		/* my xact-lifetime context */
  	ResourceOwner curTransactionOwner;	/* my query resources */
! 	List	   *childXids;		/* subcommitted child XIDs */
  	Oid			prevUser;		/* previous CurrentUserId setting */
  	bool		prevSecDefCxt;	/* previous SecurityDefinerContext setting */
  	bool		prevXactReadOnly;		/* entry-time xact r/o state */
--- 130,137 
  	int			gucNestLevel;	/* GUC context nesting depth */
  	MemoryContext curTransactionContext;		/* my xact-lifetime context */
  	ResourceOwner curTransactionOwner;	/* my query resources */
! 	TransactionId	*childXids;	/* subcommitted child XIDs, in XID order */
! 	int			nChildXids;		/* # of subcommitted child XIDs */
  	Oid			prevUser;		/* previous CurrentUserId setting */
  	bool		prevSecDefCxt;	/* previous SecurityDefinerContext setting */
  	bool		prevXactReadOnly;		/* entry-time xact r/o state */
***
*** 156,162 
  	0,			/* GUC context nesting depth */
  	NULL,		/* cur transaction context */
  	NULL,		/* cur transaction resource owner */
! 	NIL,		/* subcommitted child Xids */
  	InvalidOid,	/* previous CurrentUserId setting */
  	false,		/* previous SecurityDefinerContext setting */
  	false,		/* entry-time xact r/o state */
--- 157,164 
  	0,			/* GUC context nesting depth */
  	NULL,		/* cur transaction context */
  	NULL,		/* cur transaction resource owner */
! 	NULL,		/* subcommitted child Xids */
! 	0,			/* # of subcommitted child Xids */
  	InvalidOid

Re: [PATCHES] Bulk Insert tuning

2008-03-11 Thread Heikki Linnakangas

Gavin Sherry wrote:

On Tue, Feb 26, 2008 at 02:43:51PM +, Simon Riggs wrote:

Following patch implements a simple mechanism to keep a buffer pinned
while we are bulk loading.


CK Tan and I worked on something similar but the problem we discovered
was self locking. Consider a primary key: we insert a tuple into a
buffer and do not release the exclusive lock. The btree code fetches the
buffer and tries to share lock it, but we've already exclusive locked
it. Oops. The performance improvement, though, makes it worth seeing if
there is a solution.


That's not a problem if you only keep the buffer pinned, not locked. In 
my experience, pinning is the more expensive part, with the lookup into 
the buffer lookup table. Locking isn't free either, of course, but just 
avoiding the pin+unpin would help a lot.
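
A toy sketch of that idea (not Simon's patch; in a real implementation the 
cached buffer would live in some per-copy state rather than a static, and 
the caller would still lock/unlock the buffer around each insertion):

	static Buffer last_buf = InvalidBuffer;

	static Buffer
	get_bulk_insert_buffer(Relation rel, BlockNumber blkno)
	{
		if (BufferIsValid(last_buf) &&
			BufferGetBlockNumber(last_buf) == blkno)
			return last_buf;			/* still pinned from the previous insert */

		if (BufferIsValid(last_buf))
			ReleaseBuffer(last_buf);	/* drop the old pin */

		last_buf = ReadBuffer(rel, blkno);	/* pins the new target buffer */
		return last_buf;
	}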


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] TransactionIdIsInProgress() cache

2008-03-11 Thread Heikki Linnakangas

Alvaro Herrera wrote:

I didn't check whether your transformation is correct, but if so then it
can be changed like this and save the extra XidDidCommit call:

xvac_committed = TransactionIdDidCommit(xvac);
if (xvac_committed)
{
/* committed */
}
else if (!TransactionIdIsInProgress(xvac))
{
   if (xvac_committed)
   {
  /* committed */
   }
   else
   {
  /* aborted */
   }
}
else
{
/* in-progress */
}


Nope, that's not good. Per comments in tqual.c, you have to call 
TransactionIdIsInProgress *before* TransactionIdDidCommit, to avoid this 
race condition:


1. Xact A inserts a record
2. Xact B does TransactionIdDidCommit, which returns false because A is 
still in progress

3. Xact A commits
4. Xact B does TransactionIdIsInProgress to see if A is still in 
progress. It returns false. We conclude that A aborted, while it 
actually committed.


My proposal was basically to add an extra TransactionIdDidCommit call 
before the TransactionIdIsInProgress call, in the hope that it returns 
true and we can skip the rest.
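
Spelled out, the fast path would look roughly like this (sketch only):

	if (TransactionIdDidCommit(xvac))
	{
		/* committed: no need for the expensive in-progress check */
	}
	else if (TransactionIdIsInProgress(xvac))
	{
		/* still in progress */
	}
	else if (TransactionIdDidCommit(xvac))
	{
		/* committed between the two checks above */
	}
	else
	{
		/* aborted (or crashed) */
	}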


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] [PERFORM] Very slow (2 tuples/second) sequentialscan after bulk insert; speed returns to ~500 tuples/second aftercommit

2008-03-11 Thread Heikki Linnakangas

Alvaro Herrera wrote:

Heikki Linnakangas wrote:

I couldn't let this case go, so I wrote a patch. I replaced the linked  
list with an array that's enlarged at AtSubCommit_childXids when 
necessary.


Do you still need to palloc the return value from
xactGetCommittedChildren?  Perhaps you can save the palloc/memcpy/pfree
and just return the pointer to the array already in memory?


Yeah, good point. The callers just need to be modified not to pfree it.
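
Something like this, i.e. hand out the transaction state's own array (a
sketch only; the field names follow the "subcommitted child Xids" fields
in the patch and may not match exactly):

/*
 * Return the internal childXids array directly.  The caller must treat
 * it as read-only and must not pfree it.
 */
int
xactGetCommittedChildren(TransactionId **ptr)
{
	TransactionState s = CurrentTransactionState;

	*ptr = s->childXids;		/* no palloc/memcpy here any more */
	return s->nChildXids;
}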

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] CopyReadLineText optimization

2008-03-10 Thread Heikki Linnakangas

Andrew Dunstan wrote:
Another question that occurred to me - did you try using strpbrk() to 
look for the next interesting character rather than your homegrown 
searcher gadget? If so, how did that perform?


It looks like strpbrk() performs poorly:

unpatched:
 testname |  min duration
--+-
 all  | 00:00:08.099656
 1/2  | 00:00:06.734241
 1/4  | 00:00:06.016946
 1/8  | 00:00:05.622122
 1/16 | 00:00:05.304252
 none | 00:00:05.155755
(6 rows)

strpbrk:

 testname |  min duration
--+-
 all  | 00:00:22.980997
 1/2  | 00:00:13.724834
 1/4  | 00:00:08.980246
 1/8  | 00:00:06.582357
 1/16 | 00:00:05.291485
 none | 00:00:06.239468
(6 rows)

memchr:

 testname |  min duration
--+-
 all  | 00:00:13.684479
 1/2  | 00:00:09.509213
 1/4  | 00:00:06.921959
 1/8  | 00:00:05.654761
 1/16 | 00:00:04.719027
 none | 00:00:03.718361
(6 rows)

Attached is the test script and patches I used, if someone wants to test 
this on another platform.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/commands/copy.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.295
diff -c -r1.295 copy.c
*** src/backend/commands/copy.c	1 Jan 2008 19:45:48 -	1.295
--- src/backend/commands/copy.c	5 Mar 2008 14:44:58 -
***
*** 67,72 
--- 67,275 
  	EOL_CRNL
  } EolType;
  
+ 
+ /* Fast byte searcher functions *
+  *
+  * CopyReadLine needs to search for newlines, as well as quoting and escape
+  * characters to determine which newlines are real and which ones are escaped.
+  * Doing that in the naive way, looping through the input byte by byte and
+  * comparing against the interesting characters, can be slow.
+  *
+  * To speed that up, we provide a searcher interface, that can be used to
+  * search a piece of memory for up to 4 different bytes simultaneously. It's
+  * similar to memchr, but allows searching for multiple chars at the same 
+  * time.
+  * 
+  * To start a search on a new block of memory, call init_searcher. Then call
+  * next_searcher repeatedly to return all the occurrances of the bytes being
+  * searched for.
+  *
+  * The searcher implementation uses memchr to search for the bytes, and keeps
+  * track of where the next occurrance of each is. Using memchr allows us
+  * to take advantage of any platform-specific optimizations.
+  */
+ 
+ /*
+  * Struct used to store state of the current search between calls to 
+  * next_searcher. Initialized by init_search.
+  */
+ typedef struct {
+ 	/* Each element in c corresponds the element in s. These are sorted
+ 	 * by the pointers (ptr) */
+ 	int n;		/* elements used in the arrays */
+ 	char c[4];	/* bytes we're looking for */
+ 	char *ptr[4]; /* pointers to next occurances of the bytes */
+ 	char *end;	/* end of block being searched. Last byte to search is end-1 */
+ } searcher;
+ 
+ /* Functions internal to searcher */
+ 
+ #define swap_searcher_entries(searcher, a, b) do { \
+ 	char tmpc = (searcher)->c[a]; \
+ 	char *tmpptr = (searcher)->ptr[a]; \
+ 	(searcher)->c[a] = (searcher)->c[b]; \
+ 	(searcher)->ptr[a] = (searcher)->ptr[b]; \
+ 	(searcher)->c[b] = tmpc; \
+ 	(searcher)->ptr[b] = tmpptr; \
+ } while(0)
+ 
+ static void
+ sort_searcher(searcher *s)
+ {
+ 	switch (s->n)
+ 	{
+ 		case 0:
+ 		case 1:
+ 			/* Nothing to do */
+ 			break;
+ 		case 2:
+ 			if (s->ptr[0] > s->ptr[1])
+ 				swap_searcher_entries(s, 0, 1);
+ 			break;
+ 		case 3:
+ 			if (s->ptr[0] > s->ptr[2])
+ 				swap_searcher_entries(s, 0, 2);
+ 			if (s->ptr[0] > s->ptr[1])
+ 				swap_searcher_entries(s, 0, 1);
+ 			if (s->ptr[1] > s->ptr[2])
+ 				swap_searcher_entries(s, 1, 2);
+ 			break;
+ 		case 4:
+ 			if (s->ptr[0] > s->ptr[1])
+ 				swap_searcher_entries(s, 0, 1);
+ 			if (s->ptr[2] > s->ptr[3])
+ 				swap_searcher_entries(s, 2, 3);
+ 			if (s->ptr[1] > s->ptr[2])
+ 				swap_searcher_entries(s, 2, 3);
+ 			if (s->ptr[0] > s->ptr[1])
+ 				swap_searcher_entries(s, 0, 1);
+ 			if (s->ptr[2] > s->ptr[3])
+ 				swap_searcher_entries(s, 2, 3);
+ 			break;
+ 	}
+ }
+ 
+ /* Remove the topmost element */
+ static void
+ remove_top(searcher *searcher)
+ {
+ 	int i;
+ 
+ 	searcher->n--;
+ 
+ 	/* Move up the remaining items */
+ 	for (i = 0; i < searcher->n; i++)
+ 	{
+ 		searcher->c[i] = searcher->c[i + 1];
+ 		searcher->ptr[i] = searcher->ptr[i + 1];
+ 	}
+ }
+ 
+ /*
+  * The first element in the list has been replaced by the caller.
+  * Move it to the right place in the list, to keep it sorted.
+  */
+ static void
+ sift_down(searcher *searcher)
+ {
+ 	if (searcher->n < 2 || searcher->ptr[0] <= searcher->ptr[1])
+ 		return;
+ 	swap_searcher_entries(searcher, 0, 1);
+ 
+ 	if (searcher->n < 3 || searcher->ptr[1] <= searcher->ptr[2])
+ 		return;
+ 	swap_searcher_entries(searcher, 1, 2);
+ 
+ 	if (searcher->n < 4 || searcher->ptr[2

Re: [PATCHES] CopyReadLineText optimization

2008-03-08 Thread Heikki Linnakangas

Andrew Dunstan wrote:



Heikki Linnakangas wrote:

Andrew Dunstan wrote:
I'm still a bit worried about applying it unless it gets some 
adaptive behaviour or something so that we don't cause any serious 
performance regressions in some cases.


I'll try to come up with something. At the most conservative end, we 
could fall back to the current method on the first escape, quote or 
backslash character.


Also, could we perhaps benefit from inlining some calls, or is your 
compiler doing that anyway?


gcc does inline all static functions that are only called from one 
site,  and small functions, using some heuristic. I don't think more 
aggressive inlining would help.




Another question that occurred to me - did you try using strpbrk() to 
look for the next interesting character rather than your homegrown 
searcher gadget? If so, how did that perform?


I haven't tried that. There's a small difference: strpbrk stops at '\0'. 
But come to think of it, I guess it doesn't matter. Will test...


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] CopyReadAttributesCSV optimization

2008-03-08 Thread Heikki Linnakangas

Andrew Dunstan wrote:



Heikki Linnakangas wrote:
Here's a patch to speed up CopyReadAttributesCSV. On the test case 
I've been playing with, loading the TPC-H partsupp table, about 20% 
CopyReadAttributesCSV (inlined into DoCopy, DoCopy itself is 
insignificant):




[snip]


The trick is to split the loop in CopyReadAttributesCSV into two 
parts, inside quotes, and outside quotes, saving some instructions in 
both parts.


Your mileage may vary, but I'm quite happy with this. I haven't tested 
it much yet, but I wouldn't expect it to be a loss in any interesting 
scenario. The code also doesn't look much worse after the patch, 
perhaps even better.


  


This looks sane enough, and worked for me in testing, so I'm going to 
apply it shortly. I'll probably add a comment or two about how the loops 
interact.


Thanks.

FWIW, I did some more performance testing, with input consisting of a 
lot of quotes, and it seems the performance gain holds even then. I was 
not able to find an input where the new version performs worse.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] CopyReadLineText optimization

2008-03-06 Thread Heikki Linnakangas

Andrew Dunstan wrote:

Heikki Linnakangas wrote:

Another update attached: It occurred to me that the memchr approach is
only safe for server encodings, where the non-first bytes of a 
multi-byte character always have the hi-bit set.




We currently make the following assumption in the code:

* These four characters, and the CSV escape and quote characters, are
* assumed the same in frontend and backend encodings.
*

The four characters are the carriage return, line feed, backslash and dot.

I think the requirement might well actually be somewhat stronger than 
that: i.e. that none of these will appear as a non-first byte in any 
multi-byte client encoding character. If that's right, then we should be 
able to write CopyReadLineText without bothering about multi-byte chars. 
If it's not right then I suspect we have some cases that can fail now 
anyway.


No, we don't require that, and we do handle it correctly. We use 
pg_encoding_mblen to determine the length of each character in 
CopyReadLineText when the encoding is a client-only encoding, and only 
look at the first byte of each character. In CopyReadAttributesText, 
where we have a similar loop, we've already transformed the input to 
server encoding.
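
For reference, the shape of that loop is roughly this (a condensed
sketch, not the exact copy.c code; buf, buf_end and encoding are assumed
to be set up by the caller):

	char	   *p = buf;

	while (p < buf_end)
	{
		char		c = *p;

		/* Only the first byte of each character is ever inspected. */
		if (c == '\\' || c == '\r' || c == '\n')
		{
			/* handle escape / end of line here */
		}

		/* Advance by the full character length in the client encoding. */
		p += pg_encoding_mblen(encoding, p);
	}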


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://mail.postgresql.org/mj/mj_wwwusr?domain=postgresql.orgextra=pgsql-patches


Re: [PATCHES] CopyReadLineText optimization

2008-03-06 Thread Heikki Linnakangas

Andrew Dunstan wrote:
I'm still a bit worried about applying it unless it gets some adaptive 
behaviour or something so that we don't cause any serious performance 
regressions in some cases.


I'll try to come up with something. At the most conservative end, we 
could fall back to the current method on the first escape, quote or 
backslash character.


Also, could we perhaps benefit from inlining 
some calls, or is your compiler doing that anyway?


gcc does inline all static functions that are only called from one site, 
 and small functions, using some heuristic. I don't think more 
aggressive inlining would help.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://mail.postgresql.org/mj/mj_wwwusr?domain=postgresql.orgextra=pgsql-patches


Re: [PATCHES] CopyReadLineText optimization

2008-03-06 Thread Heikki Linnakangas

Andrew Dunstan wrote:

Heikki Linnakangas wrote:

Andrew Dunstan wrote:
I'm still a bit worried about applying it unless it gets some 
adaptive behaviour or something so that we don't cause any serious 
performance regressions in some cases.


I'll try to come up with something. At the most conservative end, we 
could fall back to the current method on the first escape, quote or 
backslash character.


That's far too conservative, I think. Somewhere a bit short of your 
observed breakeven point seems about right.


The problem is, you don't know how many stop characters there is until 
you've counted them.


We could fall back after X such characters, or only start using memchr 
after seeing 8 consecutive non-stop characters. Whatever we choose, the 
heuristic needs to be very simple and fast to check, otherwise we just 
introduce more overhead trying to decide which method to use.
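
As a strawman, the "8 consecutive non-stop characters" variant could look
something like this (purely illustrative, not part of the patch):

	int			plain_run = 0;
	char	   *p;

	for (p = buf; p < buf_end; p++)
	{
		if (*p == '\n' || *p == '\r' || *p == '\\')
		{
			/* process the stop character ... */
			plain_run = 0;
		}
		else if (++plain_run >= 8)
		{
			/*
			 * Long run of plain bytes: let memchr skip ahead to the next
			 * newline.  A real version would have to consider '\r' and
			 * '\\' here as well, like the searcher does.
			 */
			char	   *next = memchr(p, '\n', buf_end - p);

			p = next ? next - 1 : buf_end - 1;	/* loop's p++ lands on it */
			plain_run = 0;
		}
	}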


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://mail.postgresql.org/mj/mj_wwwusr?domain=postgresql.orgextra=pgsql-patches


Re: [PATCHES] WIP: guc enums

2008-03-05 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:

Magnus Hagander wrote:

On my platform (linux x86) it works fine when I just cast this to (int *),
but I'm unsure if that's going to be safe on other platforms. I had some
indication that it's probably not?


No, I don't think that's safe. Some googleing (*) suggests that the 
compiler is free to choose any integer type for an enum.


Yeah, it's absolutely not safe.

What I'd suggest is declaring the actual variable as int.  You can still
use an enum typedef to declare the values, and just avert your eyes
when you have to cast the enum to int or vice versa.  (This is legal per
C spec, so you won't introduce any portability issues when you do it.)


That's pretty much the same as int variable and #defined constants. You 
lose compiler checks, like assigning from one enum type to another, and 
the enumeration value ‘FOOBAR’ not handled in switch warning.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://mail.postgresql.org/mj/mj_wwwusr?domain=postgresql.orgextra=pgsql-patches


Re: [PATCHES] WIP: guc enums

2008-03-05 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:

Tom Lane wrote:

What I'd suggest is declaring the actual variable as int.  You can still
use an enum typedef to declare the values, and just avert your eyes
when you have to cast the enum to int or vice versa.  (This is legal per
C spec, so you won't introduce any portability issues when you do it.)


That's pretty much the same as int variable and #defined constants. You 
lose compiler checks, like assigning from one enum type to another, and 
the enumeration value ‘FOOBAR’ not handled in switch warning.


Well, you can at least get the latter if you cast explicitly:

switch ((MyEnum) myvariable) ...

We do this in several places already where the underlying variable isn't
declared as the enum for one reason or another.  Also, local variables
can be declared as the enum type to get a little more safety.


True, I guess that's not too bad then. Just have to remember to do that.

Regarding the places where we already do that, I could find just three:
src/backend/utils/adt/lockfuncs.c:  switch ((LockTagType) lock->tag.locktag_type)
src/backend/storage/lmgr/lmgr.c:    switch ((LockTagType) tag->locktag_type)
src/backend/regex/regc_locale.c:switch ((enum classes) index)

The first and 2nd are really the same, and it seems legitimate. At a quick 
glance, I couldn't figure out why index is an int variable, and not an 
enum, but that code comes from tcl.



In any case, the alternative being suggested of keeping the variables as
strings throws away *every* possible code-level advantage of having an
enum variable classification.


Oh no, I didn't suggest keeping the variables as strings, that's 
madness. I suggested keeping the variables as enums, and defining 
setter functions for them, similar to the assign hooks we have now, 
but the setter function wouldn't have to do anything else than assign an 
int to the enum variable. The setter function would be just a 
replacement for *((int *)variable) = X.
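
For example, something as small as this (all names invented for the
sketch):

static void
set_my_enum_guc(int newval)
{
	/* the GUC machinery has already mapped the string to an int */
	my_enum_guc = (MyEnumType) newval;
}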


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://mail.postgresql.org/mj/mj_wwwusr?domain=postgresql.orgextra=pgsql-patches


Re: [PATCHES] CopyReadLineText optimization

2008-03-05 Thread Heikki Linnakangas

Heikki Linnakangas wrote:

Heikki Linnakangas wrote:
Attached is a patch that modifies CopyReadLineText so that it uses 
memchr to speed up the scan. The nice thing about memchr is that we 
can take advantage of any clever optimizations that might be in libc 
or compiler.


Here's an updated version of the patch. The principle is the same, but 
the same optimization is now used for CSV input as well, and there's 
more comments.


Another update attached: It occurred to me that the memchr approach is 
only safe for server encodings, where the non-first bytes of a 
multi-byte character always have the hi-bit set.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/commands/copy.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.295
diff -c -r1.295 copy.c
*** src/backend/commands/copy.c	1 Jan 2008 19:45:48 -	1.295
--- src/backend/commands/copy.c	5 Mar 2008 14:44:58 -
***
*** 67,72 
--- 67,275 
  	EOL_CRNL
  } EolType;
  
+ 
+ /* Fast byte searcher functions *
+  *
+  * CopyReadLine needs to search for newlines, as well as quoting and escape
+  * characters to determine which newlines are real and which ones are escaped.
+  * Doing that in the naive way, looping through the input byte by byte and
+  * comparing against the interesting characters, can be slow.
+  *
+  * To speed that up, we provide a searcher interface, that can be used to
+  * search a piece of memory for up to 4 different bytes simultaneously. It's
+  * similar to memchr, but allows searching for multiple chars at the same 
+  * time.
+  * 
+  * To start a search on a new block of memory, call init_searcher. Then call
+  * next_searcher repeatedly to return all the occurrances of the bytes being
+  * searched for.
+  *
+  * The searcher implementation uses memchr to search for the bytes, and keeps
+  * track of where the next occurrance of each is. Using memchr allows us
+  * to take advantage of any platform-specific optimizations.
+  */
+ 
+ /*
+  * Struct used to store state of the current search between calls to 
+  * next_searcher. Initialized by init_search.
+  */
+ typedef struct {
+ 	/* Each element in c corresponds the element in s. These are sorted
+ 	 * by the pointers (ptr) */
+ 	int n;		/* elements used in the arrays */
+ 	char c[4];	/* bytes we're looking for */
+ 	char *ptr[4]; /* pointers to next occurances of the bytes */
+ 	char *end;	/* end of block being searched. Last byte to search is end-1 */
+ } searcher;
+ 
+ /* Functions internal to searcher */
+ 
+ #define swap_searcher_entries(searcher, a, b) do { \
+ 	char tmpc = (searcher)->c[a]; \
+ 	char *tmpptr = (searcher)->ptr[a]; \
+ 	(searcher)->c[a] = (searcher)->c[b]; \
+ 	(searcher)->ptr[a] = (searcher)->ptr[b]; \
+ 	(searcher)->c[b] = tmpc; \
+ 	(searcher)->ptr[b] = tmpptr; \
+ } while(0)
+ 
+ static void
+ sort_searcher(searcher *s)
+ {
+ 	switch (s->n)
+ 	{
+ 		case 0:
+ 		case 1:
+ 			/* Nothing to do */
+ 			break;
+ 		case 2:
+ 			if (s->ptr[0] > s->ptr[1])
+ 				swap_searcher_entries(s, 0, 1);
+ 			break;
+ 		case 3:
+ 			if (s->ptr[0] > s->ptr[2])
+ 				swap_searcher_entries(s, 0, 2);
+ 			if (s->ptr[0] > s->ptr[1])
+ 				swap_searcher_entries(s, 0, 1);
+ 			if (s->ptr[1] > s->ptr[2])
+ 				swap_searcher_entries(s, 1, 2);
+ 			break;
+ 		case 4:
+ 			if (s->ptr[0] > s->ptr[1])
+ 				swap_searcher_entries(s, 0, 1);
+ 			if (s->ptr[2] > s->ptr[3])
+ 				swap_searcher_entries(s, 2, 3);
+ 			if (s->ptr[1] > s->ptr[2])
+ 				swap_searcher_entries(s, 2, 3);
+ 			if (s->ptr[0] > s->ptr[1])
+ 				swap_searcher_entries(s, 0, 1);
+ 			if (s->ptr[2] > s->ptr[3])
+ 				swap_searcher_entries(s, 2, 3);
+ 			break;
+ 	}
+ }
+ 
+ /* Remove the topmost element */
+ static void
+ remove_top(searcher *searcher)
+ {
+ 	int i;
+ 
+ 	searcher->n--;
+ 
+ 	/* Move up the remaining items */
+ 	for (i = 0; i < searcher->n; i++)
+ 	{
+ 		searcher->c[i] = searcher->c[i + 1];
+ 		searcher->ptr[i] = searcher->ptr[i + 1];
+ 	}
+ }
+ 
+ /*
+  * The first element in the list has been replaced by the caller.
+  * Move it to the right place in the list, to keep it sorted.
+  */
+ static void
+ sift_down(searcher *searcher)
+ {
+ 	if (searcher->n < 2 || searcher->ptr[0] <= searcher->ptr[1])
+ 		return;
+ 	swap_searcher_entries(searcher, 0, 1);
+ 
+ 	if (searcher->n < 3 || searcher->ptr[1] <= searcher->ptr[2])
+ 		return;
+ 	swap_searcher_entries(searcher, 1, 2);
+ 
+ 	if (searcher->n < 4 || searcher->ptr[2] <= searcher->ptr[3])
+ 		return;
+ 	swap_searcher_entries(searcher, 2, 3);
+ }
+ 
+ /*
+  * Starts a new search in a block of memory.
+  *
+  *  searcher	- for storing state between calls. Opaque to caller, 
+  *  init_searcher will initialize this.
+  *  str			- block of memory to search
+  *  len			- length of str
+  *  c			- bytes to search for, max 4.
+  *  n			- number of bytes in c
+  */
+ static void
+ init_searcher(searcher *searcher, char *str, int

Re: [PATCHES] WIP: guc enums

2008-03-04 Thread Heikki Linnakangas

Magnus Hagander wrote:

The patch only converts a couple of the potential enum variables to the new
type, mainly as a proof of concept. But already I hit the problem twice -
the variable that holds the value of the guc enum is a C enum itself, which
gives a compiler warning when I pass a pointer to it for
config_enum.variable. (in this case, Log_error_verbosity and log_statement
are enums and have the problem, but client_min_messages, log_min_messages
and log_min_error_statement are already int and don't have it)

On my platform (linux x86) it works fine when I just cast this to (int *),
but I'm unsure if that's going to be safe on other platforms. I had some
indication that it's probably not?


No, I don't think that's safe. Some googleing (*) suggests that the 
compiler is free to choose any integer type for an enum. If you do 
*((int *)enumptr) = x, and the compiler chose a 16-bit type for the 
enum, you overwrite some unrelated piece of memory.
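
A contrived example of the hazard (nothing here is from the patch):

typedef enum { V_TERSE, V_DEFAULT, V_VERBOSE } Verbosity;

static Verbosity verbosity;		/* compiler may pick a type narrower than int */

static void
set_int_guc(int *variable, int newval)
{
	*variable = newval;			/* always writes sizeof(int) bytes */
}

static void
example(void)
{
	/*
	 * Compiles with the cast, but if sizeof(Verbosity) < sizeof(int) the
	 * assignment scribbles past the end of the variable.
	 */
	set_int_guc((int *) &verbosity, V_VERBOSE);
}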



And if not, the only way I know to do it is to change the C level enums to
be an int and use #define:s instead of what's there now. If that's
required, is that an acceptable change in order to implement this? If not,
any better ideas on how to do it?


Yuck :-(.

We could keep using the assignment hooks. But they could be a lot 
simpler than they are nowadays, if the string -> int conversion was done 
by the GUC machinery:


static const char *
assign_client_min_messages(int newval, bool doit, GucSource source)
{
client_min_messages = newval;
}

Another idea would be to cajole the compiler to choose a type of the same 
length as int, by adding a dummy enum value to the enum, like:


enum client_min_messages {
  DEBUG,
  INFO,
  ...,
  DUMMY = INT_MAX
}

Though I guess it might in theory choose something even wider, and the 
*((int *)enumptr) = x would fail to set all the bits of the enum variable.


BTW, shouldn't be using malloc in config_enum_get_options...

(*): http://david.tribble.com/text/cdiffs.htm#C99-enum-type

and what I believe to be the current C99 standard, see 6.7.2.2 
Enumeration specifiers:


http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1256.pdf

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your Subscription:
http://mail.postgresql.org/mj/mj_wwwusr?domain=postgresql.orgextra=pgsql-patches


Re: [PATCHES] CopyReadLineText optimization

2008-02-29 Thread Heikki Linnakangas

Heikki Linnakangas wrote:
Attached is a patch that modifies CopyReadLineText so that it uses 
memchr to speed up the scan. The nice thing about memchr is that we can 
take advantage of any clever optimizations that might be in libc or 
compiler.


Here's an updated version of the patch. The principle is the same, but 
the same optimization is now used for CSV input as well, and there's 
more comments.


I still need to do more benchmarking. I mentioned a ~5% speedup on the 
test I ran earlier, which was a load of the lineitem table from TPC-H. 
It looks like with cheaper data types the gain can be much bigger; 
here's an oprofile from loading the TPC-H partsupp table,


Before:

samples  %image name   symbol name
5146 25.7635  postgres CopyReadLine
4089 20.4716  postgres DoCopy
1449  7.2544  reiserfs (no symbols)
1369  6.8539  postgres pg_verify_mbstr_len
1013  5.0716  libc-2.7.so  memcpy
749   3.7499  libc-2.7.so  strtod_l_internal
598   2.9939  postgres heap_formtuple
548   2.7436  libc-2.7.so  strtol_l_internal
403   2.0176  libc-2.7.so  memset
309   1.5470  libc-2.7.so  strlen
208   1.0414  postgres AllocSetAlloc
...

After:

samples  %image name   symbol name
4165 25.7879  postgres DoCopy
1574  9.7455  postgres pg_verify_mbstr_len
1520  9.4112  reiserfs (no symbols)
1005  6.2225  libc-2.7.so  memchr
986   6.1049  libc-2.7.so  memcpy
632   3.9131  libc-2.7.so  strtod_l_internal
589   3.6468  postgres heap_formtuple
546   3.3806  libc-2.7.so  strtol_l_internal
386   2.3899  libc-2.7.so  memset
366   2.2661  postgres CopyReadLine
287   1.7770  libc-2.7.so  strlen
215   1.3312  postgres LWLockAcquire
208   1.2878  postgres hash_any
176   1.0897  postgres LWLockRelease
161   0.9968  postgres InputFunctionCall
157   0.9721  postgres AllocSetAlloc
...

Profile shows that with the patch, ~8.5% of the CPU time is spent in 
CopyReadLine+memchr, vs. 25.5% before. That's a quite significant speedup.


I still need to test the worst-case performance, with input that has a 
lot of escapes. It would be interesting to hear reports with this patch 
from people on different platforms. These results are from my laptop 
with 32-bit Intel CPU, running Linux. There could be big differences in 
the memchr implementations.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/commands/copy.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.295
diff -c -r1.295 copy.c
*** src/backend/commands/copy.c	1 Jan 2008 19:45:48 -	1.295
--- src/backend/commands/copy.c	29 Feb 2008 17:43:53 -
***
*** 67,72 
--- 67,275 
  	EOL_CRNL
  } EolType;
  
+ 
+ /* Fast byte searcher functions *
+  *
+  * CopyReadLine needs to search for newlines, as well as quoting and escape
+  * characters to determine which newlines are real and which ones are escaped.
+  * Doing that in the naive way, looping through the input byte by byte and
+  * comparing against the interesting characters, can be slow.
+  *
+  * To speed that up, we provide a searcher interface, that can be used to
+  * search a piece of memory for up to 4 different bytes simultaneously. It's
+  * similar to memchr, but allows searching for multiple chars at the same 
+  * time.
+  * 
+  * To start a search on a new block of memory, call init_searcher. Then call
+  * next_searcher repeatedly to return all the occurrances of the bytes being
+  * searched for.
+  *
+  * The searcher implementation uses memchr to search for the bytes, and keeps
+  * track of where the next occurrance of each is. Using memchr allows us
+  * to take advantage of any platform-specific optimizations.
+  */
+ 
+ /*
+  * Struct used to store state of the current search between calls to 
+  * next_searcher. Initialized by init_search.
+  */
+ typedef struct {
+ 	/* Each element in c corresponds the element in s. These are sorted
+ 	 * by the pointers (ptr) */
+ 	int n;		/* elements used in the arrays */
+ 	char c[4];	/* bytes we're looking for */
+ 	char *ptr[4]; /* pointers to next occurances of the bytes */
+ 	char *end;	/* end of block being searched. Last byte to search is end-1 */
+ } searcher;
+ 
+ /* Functions internal to searcher */
+ 
+ #define swap_searcher_entries(searcher, a, b) do { \
+ 	char tmpc = (searcher)->c[a]; \
+ 	char *tmpptr = (searcher)->ptr[a]; \
+ 	(searcher)->c

[PATCHES] CopyReadAttributesCSV optimization

2008-02-29 Thread Heikki Linnakangas
Here's a patch to speed up CopyReadAttributesCSV. On the test case I've 
been playing with, loading the TPC-H partsupp table, about 20% 
CopyReadAttributesCSV (inlined into DoCopy, DoCopy itself is insignificant):


samples  %image name   symbol name
8136 25.8360  postgres CopyReadLine
6350 20.1645  postgres DoCopy
2181  6.9258  postgres pg_verify_mbstr_len
2157  6.8496  reiserfs (no symbols)
1668  5.2968  libc-2.7.so  memcpy
1142  3.6264  libc-2.7.so  strtod_l_internal
951   3.0199  postgres heap_formtuple
904   2.8707  libc-2.7.so  strtol_l_internal
619   1.9656  libc-2.7.so  memset
442   1.4036  libc-2.7.so  strlen
341   1.0828  postgres hash_any
329   1.0447  postgres pg_atoi
300   0.9527  postgres AllocSetAlloc

With this patch, the usage of that function goes down to ~13%

samples  %image name   symbol name
7191 28.7778  postgres CopyReadLine
3257 13.0343  postgres DoCopy
2127  8.5121  reiserfs (no symbols)
1914  7.6597  postgres pg_verify_mbstr_len
1413  5.6547  libc-2.7.so  memcpy
920   3.6818  libc-2.7.so  strtod_l_internal
784   3.1375  libc-2.7.so  strtol_l_internal
745   2.9814  postgres heap_formtuple
508   2.0330  libc-2.7.so  memset
398   1.5928  libc-2.7.so  strlen
315   1.2606  postgres hash_any
255   1.0205  postgres AllocSetAlloc

The trick is to split the loop in CopyReadAttributesCSV into two parts, 
inside quotes, and outside quotes, saving some instructions in both 
parts.
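
Schematically, the parsing ends up shaped like this (a stripped-down
sketch that ignores escapes and the CSV corner cases, just to show the
two-loop structure):

	for (;;)
	{
		/* outside quotes */
		for (;;)
		{
			if (cur_ptr >= line_end_ptr)
				goto endfield;
			c = *cur_ptr++;
			if (c == delimc)		/* field ends at the delimiter */
			{
				found_delim = true;
				goto endfield;
			}
			if (c == quotec)		/* switch to the in-quote loop */
				break;
			*output_ptr++ = c;		/* plain byte: nothing else to test */
		}

		/* inside quotes */
		for (;;)
		{
			if (cur_ptr >= line_end_ptr)
				goto endfield;		/* real code errors out here:
									 * unterminated CSV quoted field */
			c = *cur_ptr++;
			if (c == quotec)		/* back to the out-of-quote loop */
				break;
			*output_ptr++ = c;
		}
	}
endfield:
	/* terminate the attribute value, check the NULL marker, etc. */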


Your mileage may vary, but I'm quite happy with this. I haven't tested 
it much yet, but I wouldn't expect it to be a loss in any interesting 
scenario. The code also doesn't look much worse after the patch, perhaps 
even better.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/commands/copy.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.295
diff -c -r1.295 copy.c
*** src/backend/commands/copy.c	1 Jan 2008 19:45:48 -	1.295
--- src/backend/commands/copy.c	29 Feb 2008 20:57:09 -
***
*** 2913,2919 
  	for (;;)
  	{
  		bool		found_delim = false;
- 		bool		in_quote = false;
  		bool		saw_quote = false;
  		char	   *start_ptr;
  		char	   *end_ptr;
--- 3146,3151 
***
*** 2934,3000 
  		{
  			char		c;
  
! 			end_ptr = cur_ptr;
! 			if (cur_ptr >= line_end_ptr)
! break;
! 			c = *cur_ptr++;
! 			/* unquoted field delimiter */
! 			if (c == delimc && !in_quote)
  			{
! found_delim = true;
! break;
! 			}
! 			/* start of quoted field (or part of field) */
! 			if (c == quotec && !in_quote)
! 			{
! saw_quote = true;
! in_quote = true;
! continue;
  			}
! 			/* escape within a quoted field */
! 			if (c == escapec && in_quote)
  			{
! /*
!  * peek at the next char if available, and escape it if it is
!  * an escape char or a quote char
!  */
! 				if (cur_ptr < line_end_ptr)
! {
! 	char		nextc = *cur_ptr;
  
! 	if (nextc == escapec || nextc == quotec)
  	{
! 		*output_ptr++ = nextc;
! 		cur_ptr++;
! 		continue;
  	}
  }
! 			}
  
! 			/*
! 			 * end of quoted field. Must do this test after testing for escape
! 			 * in case quote char and escape char are the same (which is the
! 			 * common case).
! 			 */
! 			if (c == quotec && in_quote)
! 			{
! in_quote = false;
! continue;
  			}
- 
- 			/* Add c to output string */
- 			*output_ptr++ = c;
  		}
  
  		/* Terminate attribute value in output area */
  		*output_ptr++ = '\0';
  
- 		/* Shouldn't still be in quote mode */
- 		if (in_quote)
- 			ereport(ERROR,
- 	(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
- 	 errmsg(unterminated CSV quoted field)));
- 
  		/* Check whether raw input matched null marker */
  		input_len = end_ptr - start_ptr;
  		if (!saw_quote && input_len == cstate->null_print_len &&
--- 3166,3241 
  		{
  			char		c;
  
! 			/* Not in quote */
! 			for (;;)
  			{
! end_ptr = cur_ptr;
! 				if (cur_ptr >= line_end_ptr)
! 	goto endfield;
! c = *cur_ptr++;
! /* unquoted field delimiter */
! if (c == delimc)
! {
! 	found_delim = true;
! 	goto endfield;
! }
! /* start of quoted field (or part of field) */
! if (c == quotec)
! {
! 	saw_quote = true;
! 	break;
! }
! /* Add c to output string */
! *output_ptr++ = c;
  			}
! 
! 			/* In quote */
! 			for (;;)
  			{
! end_ptr = cur_ptr

Re: [PATCHES] Fix for initdb failures on Vista

2008-02-25 Thread Heikki Linnakangas

Dave Page wrote:

The attached patch fixes problems reported primarily on Vista, but
also on some Windows 2003 and XP installations in which initdb reports
that it cannot find postgres.exe.


A couple of minor nitpicks:

Regarding the AddUserToDaclCleanup helper function, I would suggest 
putting all the cleanups at the end of AddUserToDacl, jump to the 
cleanup section with a goto. That's a commonly used pattern to do it. 
One problem with the Cleanup function is that if you need to add more 
cleanup code (probably not likely in this case, though), you need to 
modify the function signature and all callers.


The comment in AddUserToDacl says This is required on Windows machines 
running some of Microsoft's latest security patches on XP/2K3, and on 
Vista/Longhorn boxes. The security patches we're talking about are not 
going to be the latest for very long; might want to rephrase that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[PATCHES] CopyReadLineText optimization

2008-02-23 Thread Heikki Linnakangas
The purpose of CopyReadLineText is to scan the input buffer, and find 
the next newline, taking into account any escape characters. It 
currently operates in a loop, one byte at a time, searching for LF, CR, 
or a backslash. That's a bit slow: I've been running oprofile on COPY, 
and I've seen CopyReadLine to take around ~10% of the CPU time, and 
Joshua Drake just posted a very similar profile to hackers.


Attached is a patch that modifies CopyReadLineText so that it uses 
memchr to speed up the scan. The nice thing about memchr is that we can 
take advantage of any clever optimizations that might be in libc or 
compiler.


In the tests I've been running, it roughly halves the time spent in 
CopyReadLine (including the new memchr calls), thus reducing the total 
CPU overhead by ~5%. I'm planning to run more tests with data that has 
backslashes and with different width tables to see what the worst-case 
and best-case performance is like. Also, it doesn't work for CSV format 
at the moment; that needs to be fixed.


5% isn't exactly breathtaking, but it's a start. I tried the same trick 
to CopyReadAttributesText, but unfortunately it doesn't seem to help 
there because you need to stop the efficient word-at-a-time scan that 
memchr does (at least with glibc, YMMV) whenever there's a column 
separator, while in CopyReadLineText you get to process the whole line 
in one call, assuming there's no backslashes.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/commands/copy.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.295
diff -c -r1.295 copy.c
*** src/backend/commands/copy.c	1 Jan 2008 19:45:48 -	1.295
--- src/backend/commands/copy.c	24 Feb 2008 00:40:09 -
***
*** 67,72 
--- 67,77 
  	EOL_CRNL
  } EolType;
  
+ typedef struct {
+ 	char c;		/* byte we're looking for */
+ 	char *s;	/* pointer to next occurance of this byte */
+ } searchbyte;
+ 
  /*
   * This struct contains all the state variables used throughout a COPY
   * operation. For simplicity, we use the same struct for all variants of COPY,
***
*** 159,164 
--- 164,171 
  	char	   *raw_buf;
  	int			raw_buf_index;	/* next byte to process */
  	int			raw_buf_len;	/* total # of bytes stored */
+ 
+ 	searchbyte	sb[3];			/* pointers to next interesting characters in raw_buf */
  } CopyStateData;
  
  typedef CopyStateData *CopyState;
***
*** 170,175 
--- 177,283 
  	CopyState	cstate;			/* CopyStateData for the command */
  } DR_copy;
  
+ #define swap_sb(a,b) do { \
+ 	searchbyte c = (a); \
+ 	(a) = (b); \
+ 	(b) = (c); \
+ } while(0)
+ 
+ /* is a < b? NULLs are last */
+ #define cmp_sb(a,b) ((a).s < (b).s && (a).s != NULL)
+ 
+ static void
+ sort_sb(searchbyte *sb)
+ {
+ 	if (cmp_sb(sb[2], sb[1]))
+ 		swap_sb(sb[1], sb[2]);
+ 	if (cmp_sb(sb[1], sb[0]))
+ 		swap_sb(sb[0], sb[1]);
+ 	if (cmp_sb(sb[2], sb[1]))
+ 		swap_sb(sb[1], sb[2]);
+ }
+ 
+ /*
+  * Starts a new search in a string
+  */
+ static void
+ init_interesting(searchbyte *sb, char *str, int len)
+ {
+ 	sb[0].c = '\n';
+ 	sb[1].c = '\r';
+ 	sb[2].c = '\\';
+ 	sb[0].s = memchr(str, '\n', len);
+ 	sb[1].s = memchr(str, '\r', len);
+ 	sb[2].s = memchr(str, '\\', len);
+ 	sort_sb(sb);
+ }
+ 
+ /*
+  * Look for the next interesting character 
+  *
+  * sb - array of three searchbytes.
+  * ss - string to search
+  * len_left - number of bytes left in ss to search
+  *
+  * Returns the offset of the next interesting location, relative to 'ss'
+  */
+ static int
+ next_interesting(searchbyte *sb, char *ss, int len_left)
+ {
+ 	char *s;
+ 
+ 	if (sb[0].s == NULL)
+ 		return 0;
+ 
+ 	/*
+ 	 * The caller skipped over the next pointer we had memorized, so we have to
+ 	 * search for the one after that.
+ 	 */
+ 	if (ss > sb[0].s)
+ 	{
+ 		sb[0].s = memchr(ss, sb[0].c, len_left);
+ 		if (sb[0].s == NULL)
+ 			return 0; /* we're done */
+ 		if (sb[1].s != NULL)
+ 		{
+ 			if (ss > sb[1].s)
+ 			{
+ /*
+  * The caller skipped over the 2nd pointer we had memorized
+  * as well. Have to search for that as well
+  */
+ sb[1].s = memchr(ss, sb[1].c, len_left);
+ 				if (sb[2].s != NULL && ss > sb[2].s)
+ {
+ 	/* Caller skipped over 3rd pointer as well... */
+ 	sb[2].s = memchr(ss, sb[2].c, len_left);
+ }
+ 			}
+ 			sort_sb(sb);
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Return the next interesting location we had memorized, and search
+ 	 * for the next occurance of that byte.
+ 	 */
+ 	s = sb[0].s;
+ 
+ 	sb[0].s = memchr(s+1, sb[0].c, (ss+len_left) - (s + 1));
+ 
+ 	/*
+ 	 * Bubble down the next location into the three-element list we have,
+ 	 */
+ 	if (cmp_sb(sb[1], sb[0]))
+ 	{
+ 		swap_sb(sb[0], sb[1]);
+ 		if (cmp_sb(sb[2], sb[1]))
+ 			swap_sb(sb[1], sb[2]);
+ 	}
+ 
+ 	return s - ss;
+ }
+ 
  
  /*
   * These macros centralize code used to process line_buf

Re: [PATCHES] tzcode update

2008-02-20 Thread Heikki Linnakangas

Tom Lane wrote:

Magnus Hagander [EMAIL PROTECTED] writes:

On Mon, Feb 18, 2008 at 04:32:58PM +0100, Bjorn Munch wrote:

Ouch!  This fails on our Solaris builds, because we build with the
Solaris timezone files.  And these apparently don't work beyond 2038
and don't know that Finland plans to have DST also in the summer of
2050...

I haven't studied this in depth but I'm afraid the answer is fix your
own timezone files?



Or use the ones shipping with pg ;-)


Yeah.  I included the far-future cases in the committed patch
specifically to make it obvious if you were using tz files that lacked
post-2038 support.


Agreed, we want to give people a notice.

What we could do is to add comments to the expected output file, that 
would show up in regression.diffs, that explain what the failure means.



I can't imagine that fixing this isn't pretty darn high on the Solaris
to-do list, anyway.


Yep, and PostgreSQL 8.4 won't be released for quite some time.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [PATCHES] tzcode update

2008-02-14 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
Looking closer, I don't understand how that change was supposed to do 
anything.


The point of that patch is to avoid an off-by-one result for years BC.
The direction of rounding in integer division with a negative numerator
is undefined in C (or at least used to be --- did C99 tighten this up?).


Oh, I see. In that case we're good. The corresponding new code in tzcode 
 actually looks like this:



! static int
! leaps_thru_end_of(const int y)
! {
!   return (y >= 0) ? (y / 4 - y / 100 + y / 400) :
!   -(leaps_thru_end_of(-(y + 1)) + 1);
! }
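
As a concrete example of the BC off-by-one (assuming the usual
truncate-towards-zero integer division): for y = -1 the plain formula
gives -1/4 - (-1)/100 + (-1)/400 = 0, whereas the +4800 form gives
4799/4 - 4799/100 + 4799/400 = 1199 - 47 + 11 = 1163, i.e. -1 once the
constant 1164 is subtracted. The +1164 itself cancels when two
LEAPS_THRU_END_OF() values are subtracted, but that rounding difference
for negative years does not.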



--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PATCHES] tzcode update

2008-02-13 Thread Heikki Linnakangas

I just noticed that I had accidentally reverted this change in the patch:


/*
 * Note: the point of adding 4800 is to ensure we make the same
 * assumptions as Postgres' Julian-date routines about the placement of
 * leap years in centuries BC, at least back to 4713BC which is as far 
as
 * we'll go. This is effectively extending Gregorian timekeeping into
 * pre-Gregorian centuries, which is a tad bogus but it conforms to the
 * SQL spec...
 */
#define LEAPS_THRU_END_OF(y)(((y) + 4800) / 4 - ((y) + 4800) / 100 + ((y) + 
4800) / 400)


vs original in tzcode2003e:


#define LEAPS_THRU_END_OF(y)((y) / 4 - (y) / 100 + (y) / 400)


Looking closer, I don't understand how that change was supposed to do 
anything. If I'm doing my math right, our +4800 version of the code can 
be written as: y/4 - y/100 + y/400 + 1164. The only place this macro 
is used is this:



days -= ((int64) (newy - y)) * DAYSPERNYEAR +
LEAPS_THRU_END_OF(newy - 1) -
LEAPS_THRU_END_OF(y - 1);


AFAICS, the constant +1164 is always cancelled out, making the 
exercise of adding 4800 a waste of time.


Am I missing something?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[PATCHES] tzcode update

2008-02-11 Thread Heikki Linnakangas
We included the public domain timezone library by Arthur David Olson 
back in 2004 into our source tree, but we haven't kept it up to date 
with the upstream changes since.


We've made a number of small changes to our version of the library, 
including:

- formatting changes, mostly thanks to pgindent
- widened time_t to 8 bytes
- we only include the parts of the library that we need

which means that we can't just take the new library version and drop it 
in the src/timezone directory. We can debate whether those changes were 
a good idea or not, given that they make the merging harder, but my take 
is that it's not that bad given how seldom and little the upstream code 
changes.


I was not able to find anything like release notes that would list the 
differences between tzcode2003e, which I believe is the version that we 
included back then, and the latest version tzcode2007k. So I just took a 
diff between those, and deduced from there what has changed.


The main difference between those version seems to be support for 64-bit 
time_t, including

  - extended data file format, with 64-bit time values
  - extrapolation of DST transitions in a 400 year cycle, for values 
not directly covered by the transition table.


In addition to that, they've added public domain notice to the top of 
ialloc.c, and some other cosmetic changes. We already had such a notice 
in our version of the file, but the original did not.


I merged the upstream changes, mostly manually, trying to be faithful to 
the original diff to make future merges easier. But I also made some 
whitespace/comment changes, like we've done before to our version of the 
code: no double-stars in comments, braces on lines of their own. 
Attached is the resulting patch against Postgres CVS HEAD.


In addition to manually merging the patch, I had to teach 
pg_next_dst_boundary how to extrapolate the DST transitions. The code 
there is copy-pasted from localsub, so I copy-pasted that change from 
there as well. Also, they now use a binary search instead of linear 
search in localsub, and I copied that change to pg_next_dst_bounday as 
well (that is why I started looking at this in the first place, 
http://archives.postgresql.org/pgsql-patches/2007-09/msg00291.php)


I don't really know how to test this. It passes the regression tests, 
after the fixes to pg_dst_next_boundary, but I was expecting there to be 
some failures now that we support timezones for timestamps outside the 
32-bit time_t range. Apparently our tests don't cover that?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


pg-tzcode2007k.patch.gz
Description: GNU Zip compressed data

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PATCHES] Endless recovery

2008-02-11 Thread Heikki Linnakangas

Hans-Juergen Schoenig wrote:

this is he last info which was issued ...

nothing in between ...

during the rm_cleanup() nothing was logged into the logs. this is the 
last log from today dawn:


[2008-02-11 03:45:16 CET ]LOG: lost parent for block 8558565
[2008-02-11 03:45:16 CET ]LOG: index 1663/16384/16578435 needs VACUUM or 
REINDEX to finish crash recovery
[2008-02-11 03:45:16 CET ]DETAIL: Incomplete insertion detected during 
crash replay.

[2008-02-11 03:47:54 CET ]LOG: database system is ready
[2008-02-11 03:47:54 CET ]LOG: transaction ID wrap limit is 1073742476, 
limited by database blids


that's where it finished, nothing else was logged between the redo 
done and the last log messages


I bet you've bumped into a bug in gist redo code, the cleanup phase 
shouldn't take long. It's just for completing any incomplete splits by 
inserting pointers to upper-level pages, and there shouldn't be more 
than a few of those active at any point in time.


It looks like there's been quite a few changes to gistlog.c that haven't 
been back-patched. This one in particular might be relevant here:



revision 1.15
date: 2006-04-03 17:45:50 +0100;  author: tgl;  state: Exp;  lines: +15 -13;
Fix thinko in gistRedoPageUpdateRecord: if XLR_BKP_BLOCK_1 is set, we
don't have anything to do to the page, but we still have to adjust the
incomplete_inserts list that we're maintaining in memory.


Teodor, what do you think?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PATCHES] Endless recovery

2008-02-11 Thread Heikki Linnakangas

Hans-Juergen Schoenig wrote:

Last week we have seen a problem with some horribly configured machine.
The disk filled up (bad FSM ;) ) and once this happened the sysadmin killed the 
system (-9).
After two days PostgreSQL has still not started up and they tried to restart it 
again and again, making sure that the consistency check was started over and over 
again (thus causing more and more downtime).
 From the admin point of view there was no way to find out whether the machine 
was actually dead or still recovering.


Here is a small patch which issues a log message indicating that the recovery 
process can take ages.

Maybe this can prevent some admins from interrupting the recovery process.


Wait, are you saying that the time was spent in the rm_cleanup phase? 
That sounds unbelievable. Surely the time was spent in the redo phase, no?



In our case, the recovery process took 3.5 days !!


That's a ridiculously long time. Was this a normal recovery, not a PITR 
archive recovery? Any idea why the recovery took so long? Given the max. 
checkpoint timeout of 1h, I would expect that the recovery would take a 
maximum of few hours even with an extremely write-heavy workload.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


[PATCHES] DDL in EDB-SPL

2007-12-12 Thread Heikki Linnakangas
While looking at the package function precedence problem, I bumped into 
another unrelated bug:


According to a quick Google search, Oracle doesn't accept DDL in PL/SQL; 
you have to use EXECUTE IMMEDIATE to do that. Trying to run DDL in the 
edb-spl fails with a bizarre error message. For example, for CREATE 
TABLE footable (full test case attached):

ERROR:  syntax error at or near footable
LINE 1: CREATE footable2 (id integer)

So the TABLE token seems to be stripped away somewhere. This begs the 
question of what happens with CREATE TEMPORARY TABLE. Lo and behold, it 
does what you might've guessed, kind of. TEMPORARY is stripped away, 
leaving just CREATE TABLE tablename. However, we've set the package 
namespace as the special namespace, and that's the current default 
creation namespace. Therefore the table gets created inside the package 
namespace.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
DROP PACKAGE foopack;

CREATE PACKAGE foopack
IS
  foovar integer;
  PROCEDURE fooproc;
END;

CREATE PACKAGE BODY foopack
IS

  PROCEDURE fooproc IS
  BEGIN
CREATE TEMPORARY TABLE footable2 (id integer);
  END;
END;

exec foopack.fooproc;

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PATCHES] DDL in EDB-SPL

2007-12-12 Thread Heikki Linnakangas

Pavel Stehule wrote:

Wrong address :)


Shit! I knew this was going to happen eventually, because when I start 
to type pa... in the address bar, [EMAIL PROTECTED] is the first 
hit...


Sorry for the noise, please ignore...

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PATCHES] Proposed patch to disallow password=foo in databasename parameter

2007-12-11 Thread Heikki Linnakangas

Alvaro Herrera wrote:

Magnus Hagander wrote:

On Mon, Dec 10, 2007 at 10:47:19PM -0500, Tom Lane wrote:



If we want to prevent it for psql, we should actually prevent it *in* psql,
not in libpq. There are an infinite number of scenarios where it's
perfectly safe to put the password there... If we want to do it share, we
should add a function like PQSanitizeConnectionString() that will remove
it, that can be called from those client apps that may be exposing it.

There are also platforms that don't show the full commandline to other
users - or even other processes - that aren't affected, of course.


One idea is to have psql hide the password on the ps status.  That way
it becomes less of a security issue.  It would still be a problem on
certain operating systems, but at least several common platforms would
be covered.


There would still be race condition. It would still be visible until 
psql hides it. In a way that would be even worse, because it wouldn't be 
obvious to an administrator that there's a problem because the password 
wouldn't be visible in ps output, but hackers know about stuff like that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PATCHES] hashlittle(), hashbig(), hashword() and endianness

2007-11-15 Thread Heikki Linnakangas

Alex Vinokur wrote:

On Nov 15, 10:40 am, Alex Vinokur [EMAIL PROTECTED]
wrote:
[snip]

I have some question concerning Bob Jenkins' functions
hashword(uint32_t*, size_t), hashlittle(uint8_t*, size_t) and
hashbig(uint8_t*, size_t) in lookup3.c.

Let k1 by a key: uint8_t* k1; strlen(k1)%sizeof(uint32_t) == 0.

1. hashlittle(k1) produces the same value on Little-Endian and Big-
Endian machines.
   Let hashlittle(k1) be == L1.

2. hashbig(k1) produces the same value on Little-Endian and Big-Endian
machines.
   Let hashbig(k1) be == B1.

  L1 != B1

3. hashword((uint32_t*)k1) produces
* L1 on LittleEndian machine and
* B1 on BigEndian machine.


===

-
The question is: is it possible to change hashword() to get
* L1 on Little-Endian machine and
* B1 on Big-Endian machine
   ?


Sorry, it should be as follows:

Is it possible to create two new hash functions on basis of
hashword():
   i)  hashword_little () that produces L1 on Little-Endian and Big-
Endian machines;
   ii) hashword_big ()that produces B1 on Little-Endian and Big-
Endian machines
   ?


Why?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PATCHES] Miscalculation in IsCheckpointOnSchedule()

2007-11-14 Thread Heikki Linnakangas

Tom Lane wrote:

ITAGAKI Takahiro [EMAIL PROTECTED] writes:

-((double) (int32) (recptr.xrecoff - 
ckpt_start_recptr.xrecoff)) / XLogSegSize) /
+((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) 
/ XLogSegSize) /


Surely this makes matters worse, not better.  What happens near a segment
boundary crossing?


Hmm. There seems to be another little bug in there. XLogSegsPerFile is 
defined as 0xffffffff/XLogSegSize, which is 255 with default settings. 
It should be 256. That leads to negative elapsed_xlogs estimates at xlog 
file boundaries. XLogCheckpointNeeded suffers from it too; the number of 
segments consumed since last checkpoint is off by one per each xlogid, 
so we trigger the checkpoint a little bit too late.


Otherwise, the patch looks good to me. I also tested the calculation 
with a little C program (attached), and it seems to work on segment and 
xlog file boundaries just fine.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
#include <stdio.h>
#include <stdlib.h>

typedef unsigned char uint8;	/* == 8 bits */
typedef unsigned short uint16;	/* == 16 bits */
typedef unsigned int uint32;	/* == 32 bits */

typedef signed char int8;		/* == 8 bits */
typedef signed short int16;		/* == 16 bits */
typedef signed int int32;		/* == 32 bits */

#define XLOG_SEG_SIZE	(16*1024*1024)
#define XLogSegSize		((uint32) XLOG_SEG_SIZE)
#define XLogSegsPerFile (((uint32) 0xffffffff) / XLogSegSize)

int main(int argc, char **argv)
{
  double elapsed_xlogs;
  uint32 cur_xrecoff, cur_xlogid;
  uint32 ckpt_xrecoff, ckpt_xlogid;
  double a, b;

  cur_xlogid = strtoul(argv[1], NULL, 10);
  cur_xrecoff = strtoul(argv[2], NULL, 10);

  ckpt_xlogid = strtoul(argv[3], NULL, 10);
  ckpt_xrecoff = strtoul(argv[4], NULL, 10);

  //a = (cur_xlogid - ckpt_xlogid) * XLogSegsPerFile;

  //b = (((double) cur_xrecoff) - ((double) ckpt_xrecoff)) / ((double )XLogSegSize);

  //a = ((double) (int32) (cur_xlogid - ckpt_xlogid)) * XLogSegsPerFile;
  //b  = ((double) (int32) (cur_xrecoff - ckpt_xrecoff)) / XLogSegSize;
  a = (cur_xlogid - ckpt_xlogid) * XLogSegsPerFile;
  b = ((double) cur_xrecoff - (double) ckpt_xrecoff) / XLogSegSize;

  elapsed_xlogs = a + b;


  printf("XLogSegsPerFile: %d\n", XLogSegsPerFile);
  printf("xrecoff: %u %f\n", ckpt_xrecoff, (double) ckpt_xrecoff);
  printf("a: %f\n", a);
  printf("b: %f\n", b);
  printf("elapsed_xlogs = %f\n", elapsed_xlogs);
}


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] Miscalculation in IsCheckpointOnSchedule()

2007-11-14 Thread Heikki Linnakangas

Heikki Linnakangas wrote:

Tom Lane wrote:

ITAGAKI Takahiro [EMAIL PROTECTED] writes:
- ((double) (int32) (recptr.xrecoff - 
ckpt_start_recptr.xrecoff)) / XLogSegSize) /
+ ((double) recptr.xrecoff - (double) 
ckpt_start_recptr.xrecoff) / XLogSegSize) /


Surely this makes matters worse, not better.  What happens near a segment
boundary crossing?


Hmm. There seems to be another little bug in there. XLogSegsPerFile is 
defined as 0xffffffff/XLogSegSize, which is 255 with default settings. 
It should be 256. That leads to negative elapsed_xlogs estimates at xlog 
file boundaries. XLogCheckpointNeeded suffers from it too; the number of 
segments consumed since last checkpoint is off by one per each xlogid, 
so we trigger the checkpoint a little bit too late.


I'll take that back. We intentionally don't use the last possible 
segment of each xlog file, to avoid overflows. Sorry for the noise.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-09 Thread Heikki Linnakangas

Jan Urbański wrote:

The solution I came up with was simple: write a dictionary, that does
only one thing: looks up the lexeme in a stopwords file and either
discards it or returns NULL.

Doesn't the simple dictionary handle this?


I don't think so. The 'simple' dictionary discards stopwords, but
accepts any other lexemes. So if use {'simple', 'pl_ispell'} for my
config, I'll get rid of the stopwords, but I won't get any lexemes
stemmed by ispell. Every lexeme that's not a stopword will produce the
very same lexeme (this is how I think the 'simple' dictionary works).

My dictionary does basically the same thing as the 'simple' dictionary,
but it returns NULL instead of the original lexeme in case the lexeme is
not found in the stopwords file.


In the long term, what we really need a more flexible way to chain 
dictionaries. In this case, you would first check against one stopword 
list, eliminating 'od', then check the ispell dictionary, and then check 
another stopword list without 'od'.


I suggested that a while ago 
(http://archives.postgresql.org/pgsql-hackers/2007-08/msg01036.php). 
Hopefully Oleg or someone else gets around restructuring the 
dictionaries in a future release.


I wonder if you could hack the ispell dictionary file to treat oda 
specially?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PATCHES] mingw autoconf again

2007-10-29 Thread Heikki Linnakangas
Magnus Hagander wrote:
 So I was fixing my MingW environment to test and fix that issue with the
 functions missing. In doing so, I came across the error previously
 discussed that gettimeofday() is now suddenly defined in the latest
 versions of mingw, but wasn't before. 

Is it just missing from header files, or does it really not exist?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PATCHES] Autovacuum cancellation

2007-10-26 Thread Heikki Linnakangas
Alvaro Herrera wrote:
 /*
  * Look for a blocking autovacuum. There will only ever
  * be one, since the autovacuum workers are careful
  * not to operate concurrently on the same table. 
  */

I think that's a bit inaccurate. You could have multiple autovacuum
workers operating on different tables participating in a deadlock. The
reason that can't happen is that autovacuum never holds a lock while
waiting for another.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend

