date:20060308

Re: [HACKERS] Coverity Open Source Defect Scan of PostgreSQL

2006-03-08 Thread Martijn van Oosterhout

On Tue, Mar 07, 2006 at 05:39:18PM -0500, Tom Lane wrote:
 Martijn van Oosterhout kleptog@svana.org writes:
  #ifdef STATIC_ANALYSIS
  #define ereport(elevel, rest)  \
  (errstart(elevel, __FILE__, __LINE__, PG_FUNCNAME_MACRO) ? \
   (errfinish rest) : (void) 0), (elevel = ERROR ? exit(0) : 0)
  #else
  /* Normal def */
  #endif
 
 Hmm, neat idea ... though I wonder whether either gcc or Coverity's tool
 is smart enough to draw the right conclusions from a conditional exit()
 call ...

Well, remember this is a macro so the conditional is known at compile
time and the optimiser should see that the exit is unconditional. A
quick test with the attached program shows that gcc does correctly
determine that the last few lines are unreachable and are optimised out
entirely (with -Wunreachable-code which is not the default).

I tried to create an empty static inline function with
attribute((noreturn)) to optimise out the call to exit(), but gcc
merely points out the function does actually return and proceeds to
assume that the rest of main() is also reachable.

Another possibility would be to create two versions of errfinish, one
marked (noreturn), and use a conditional on elevel to decide which to
use. However, then you get issues with multiple evaluation of macro
arguments...

gcc 3.3.5
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.
#include stdlib.h
#include stdio.h

typedef enum {
   NOTICE,
   WARN,
   ERROR,
   FATAL
} ErrorLevel;

#define errortest(elevel,emessage) \
 doerr(elevel,emessage), (elevel = ERROR ? exit(0) : 0)
 
void doerr( ErrorLevel level, char *msg )
{
  printf( Error: %s\n, msg );
}

static inline void  __attribute__((noreturn)) fake_exit() { }

int main()
{
  int i=0;
  errortest( NOTICE, test1 );
  i=1;
  errortest( FATAL, test2 );
  i=2;
  return 0;
}


signature.asc
Description: Digital signature

Re: [HACKERS] FW: PGBuildfarm member snake Branch HEAD Status changed from OK to PLCheck failure

2006-03-08 Thread Michael Glaesemann



On Mar 8, 2006, at 13:01 , Bruce Momjian wrote:


I have updated the /contrib and PL regression tests for
escape_string_warning.  I can't test all the regressions but the build
farm will tell us soon enough.


I see 'em turning green :) Thanks, Bruce.

Michael Glaesemann
grzm myrealbox com




---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq

Re: [HACKERS] FW: PGBuildfarm member snake Branch HEAD Status changed

2006-03-08 Thread Bruce Momjian

Michael Glaesemann wrote:
 
 On Mar 8, 2006, at 13:01 , Bruce Momjian wrote:
 
  I have updated the /contrib and PL regression tests for
  escape_string_warning.  I can't test all the regressions but the build
  farm will tell us soon enough.
 
 I see 'em turning green :) Thanks, Bruce.

Great.  Just fixed another pltcl problem so hopefully that is all.

-- 
  Bruce Momjian   http://candle.pha.pa.us
  SRA OSS, Inc.   http://www.sraoss.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend

[HACKERS] 8.2 hold queue

2006-03-08 Thread Bruce Momjian

I have applied all the patches in the patch queue, and am starting to
look at the patches_hold queue, which are patches submitted after the
feature freeze.

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

-- 
  Bruce Momjian   http://candle.pha.pa.us
  SRA OSS, Inc.   http://www.sraoss.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] [SQL] Interval subtracting

2006-03-08 Thread Markus Schaber

Hi, Scott,

Scott Marlowe wrote:

But it isn't '-2 months, -1 day'.  I think what you are saying is what I
am saying, that we should make the signs consistent.
 Pretty much.  It just seems wrong to have different signs in what is
 essentially a single unit.
 
 We don't say 42 degrees, -12 minutes when measuring arc, do we?  Then
 again, maybe some folks do.  It just seems wrong to me.

But we say quarter to twelve, at least in some areas on this planet.

The problem is that months have different lengths. '2 months - 1 day'
can be '1 month 27 days', '1 month 28 days', '1 month 29 days' or '1
month 30 days', depending on the timestamp we apply the interval.

Markus
-- 
Markus Schaber | Logical TrackingTracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] [PATCHES] Inherited Constraints

2006-03-08 Thread Hannu Krosing

Ühel kenal päeval, E, 2006-03-06 kell 17:25, kirjutas Bruce Momjian:
 Hannu Krosing wrote:
  ?hel kenal p?eval, E, 2006-03-06 kell 12:12, kirjutas Bruce Momjian:
   Added to TODO:
   
 o Prevent parent tables from altering or dropping constraints
   like CHECK that are inherited by child tables
   
   Dropping constraints should only be possible with CASCADE.
   
   and we already have this in TODO:
   
   o %Prevent child tables from altering or dropping constraints
 like CHECK that were inherited from the parent table
   
   so I think we now have all the failure cases documented.
  
  If you want to be consistent, then ALTER TABLE ONLY ADD CONSTRAINT  .. 
  should also be forbidden, so you can't create non-inherited constraints
 
 I don't have a problem with creating ONLY constraints on parents and
 children.  We just don't want them to be removed/modified if they are
 shared.

Well, when you delete a constraint from child, the constraint becomes an
ONLY constraint on parent. If you allow ONLY constraints on parents,
then why disallow dropping them from childs ?

IIRC the original complaint about being able to drop constraints from
children was that inherited tables not being bound by constraints on
parents was unexpected/broken.

I.E when you have 

CREATE TABLE T(i int check (i0));

then you would be really surprised by getting -1 out from that table.

---
Hannu



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Simon Riggs

On Tue, 2006-03-07 at 18:14 -0500, Tom Lane wrote:
 BTW, I was just looking over Knuth's discussion of sorting again, and
 realized that there is still something more that could be done within
 the existing sort framework.  We currently use standard polyphase merge
 (his Algorithm 5.4.2D), which IIRC I chose because it was simple and
 for relatively small numbers of tapes T it was about as good as anything
 else.  Knuth spends a great deal of energy on minimizing tape rewind
 time which of course is of no interest to us, and I had supposed that
 all of his more-complex algorithms were really only of interest if you
 needed to consider rewind time.  However, now that we've changed the
 code to prefer large numbers of tapes, it's not at all clear that
 Algorithm D is still the right one to use.  In particular I'm looking at
 cascade merge, Algorithm 5.4.3C, which appears to use significantly
 fewer passes when T is large.  

Ah! Well spotted. Yeh, looks like it will improve performance a good
deal. So, yes, definitely a TODO item. 

 Do you want to try that?

The Cascade Merge re-writes the way logical tapes are selected and how
the runs are merged. It doesn't seem to do anything at all about the
run-forming, which would still use heapsort. So the only effect is when
we have more runs than tapes, so for the limits of where we would
begin noticing any benefit would be:
work_mem= 1 GB  benefit at 8 TB
work_mem= 256MB benefit at 0.5 TB
work_mem= 8MB   benefit at 256 MB
work_mem= 1MB   benefit at 12 MB (min 7 tapes).
(based upon runs on average twice size of memory, and each logical tape
requiring 256KB memory, i.e. min(work_mem/4, 6) * work_mem * 2, which
for work_mem  2 MB gives 0.5 * work_mem^2)

Which means the benefit we get is when we have for some reason been
unable to give the sort enough space, or not set parameters correctly.
So, still a concern...but makes me think about 2 other issues first:

1. Earlier we had some results that showed that the heapsorts got slower
when work_mem was higher and that concerns me most of all right now.
It's possible you'll have reduced that considerably with the
pull-out-the-first-attr patch. I'll look into some test results to show
that has gone away. We also have Nyberg et al telling us that as of 1994
they established that heapsort would always be slower than qsort, as a
result of CPU cache locality improvements. An improvement here would
effect all sorts  work_mem.

2. Improvement in the way we do overall memory allocation, so we would
not have the problem of undersetting work_mem that we currently
experience. If we solved this problem we would have faster sorts in
*all* cases, not just extremely large ones. Dynamically setting work_mem
higher when possible would be very useful. I've looked at this a few
times and have some suggestions, but perhaps its worth asking for ideas
in this area?

Best Regards, Simon Riggs



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

[HACKERS] Running out of disk space during query

2006-03-08 Thread Stephen Frost

Greetings,

* Simon Riggs ([EMAIL PROTECTED]) wrote:
 work_mem= 1 GBbenefit at 8 TB
 work_mem= 256MB benefit at 0.5 TB
 (based upon runs on average twice size of memory, and each logical tape
 requiring 256KB memory, i.e. min(work_mem/4, 6) * work_mem * 2, which
 for work_mem  2 MB gives 0.5 * work_mem^2)

Seeing this reminded me of an issue I ran into recently.  In 8.1 on a
database that's only 16G, I ran a query that chewed up all the available
disk space (about 250G, yes, 0.25TB) on the partition and then failed.
Of course, this took many hours on a rather speedy box (and the disk
array is a pretty nice IBM SAN so it's not exactly a slacker either) and
produced nothing for me.

I'd like to think it's often the case that Postgres has some idea what
the total disk space usage of a given query is going to be prior to
actually running the whole query and just seeing how much space it took
at the highest point.  If this can be done with some confidence then
it'd be neat if Postgres could either check if there's enough disk space
available and if not bail (I know, difficult to do cross-platform and
there's tablespaces and whatnot to consider) OR if there was a parameter
along the lines of max_temp_disk_space which would fail the query if
that would be exceeded by the query.  The latter could even be two GUC
variables, one administrator set and unchangable by the user ('hard'
limit) and one settable by the user with a sane default ('soft' limit)
and perhaps a HINT which indicates how to change it in the error
message when the limit is hit.

I suppose I could put quotas in place or something but I don't really
have a problem with the database as a whole using up a bunch of disk
space (hence why it's got alot of room to grow into), I just would have
liked a this will chew up more disk space than you have and then fail
message instead of what ended up happening for this query.

Thanks!

Stephen


signature.asc
Description: Digital signature

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Tom Lane

Simon Riggs [EMAIL PROTECTED] writes:
 1. Earlier we had some results that showed that the heapsorts got slower
 when work_mem was higher and that concerns me most of all right now.

Fair enough, but that's completely independent of the merge algorithm.
(I don't think the Nyberg results necessarily apply to our situation
anyway, as we are not sorting arrays of integers, and hence the cache
effects are far weaker for us.  I don't mind trying alternate sort
algorithms, but I'm not going to believe an improvement in advance of
direct evidence in our own environment.)

 2. Improvement in the way we do overall memory allocation, so we would
 not have the problem of undersetting work_mem that we currently
 experience. If we solved this problem we would have faster sorts in
 *all* cases, not just extremely large ones. Dynamically setting work_mem
 higher when possible would be very useful.

I think this would be extremely dangerous, as it would encourage
processes to take more than their fair share of available resources.
Also, to the extent that you believe the problem is insufficient L2
cache, it seems increasing work_mem to many times the size of L2 will
always be counterproductive.  (Certainly there is no value in increasing
work_mem until we are in a regime where it consistently improves
performance significantly, which it seems we aren't yet.)

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] pg_freespacemap question

2006-03-08 Thread Tom Lane

Mark Kirkwood [EMAIL PROTECTED] writes:
 Good points! I had not noticed this test case. Probably NULL is better 

 Would setting it to 'BLCKSZ - (fixed index header stuff)' be better, 

No, I don't think so, because that will just make it harder to recognize
what's what (remember that BLCKSZ isn't really a constant, and the index
overhead is not the same for all AMs either).  The point here is that
for indexes the FSM tracks whole-page availability, not the amount of
free space within pages.  So I think NULL is a reasonable representation
of that.  Using NULL will make it easy to filter the results if you want
to see only heap-page data or only index-page data, whereas it will be
very hard to do that if the view adopts an ultimately-artificial
convention about the amount of available space on an index page.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Luke Lonergan

Tom,

On 3/8/06 7:21 AM, Tom Lane [EMAIL PROTECTED] wrote:

 Simon Riggs [EMAIL PROTECTED] writes:
 1. Earlier we had some results that showed that the heapsorts got slower
 when work_mem was higher and that concerns me most of all right now.
 
 Fair enough, but that's completely independent of the merge algorithm.
 (I don't think the Nyberg results necessarily apply to our situation
 anyway, as we are not sorting arrays of integers, and hence the cache
 effects are far weaker for us.  I don't mind trying alternate sort

Even with the indirection, we should investigate alternative approaches that
others have demonstrated to be superior WRT L2 cache use.

A major commercial database currently performs external sorts of various
fields 4 times faster, and commonly uses more than 256MB of sort memory in
one example case to do it.

 I think this would be extremely dangerous, as it would encourage
 processes to take more than their fair share of available resources.

I agree - in fact, we currently have no structured concept of fair share of
available resources, nor a way to share them.

I think the answer to this should involve the use of statement queuing and
resource queues.
 
 Also, to the extent that you believe the problem is insufficient L2
 cache, it seems increasing work_mem to many times the size of L2 will
 always be counterproductive.  (Certainly there is no value in increasing
 work_mem until we are in a regime where it consistently improves
 performance significantly, which it seems we aren't yet.)

Not if you cache block, the optimization that operates on a block of memory
one L2 block in size at a time.

- Luke 



---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] [PATCHES] Inherited Constraints

2006-03-08 Thread Stephan Szabo

On Wed, 8 Mar 2006, Hannu Krosing wrote:

 ??hel kenal p??eval, E, 2006-03-06 kell 17:25, kirjutas Bruce Momjian:
  Hannu Krosing wrote:
   ?hel kenal p?eval, E, 2006-03-06 kell 12:12, kirjutas Bruce Momjian:
Added to TODO:
   
  o Prevent parent tables from altering or dropping constraints
like CHECK that are inherited by child tables
   
Dropping constraints should only be possible with CASCADE.
   
and we already have this in TODO:
   
o %Prevent child tables from altering or dropping constraints
  like CHECK that were inherited from the parent table
   
so I think we now have all the failure cases documented.
  
   If you want to be consistent, then ALTER TABLE ONLY ADD CONSTRAINT  ..
   should also be forbidden, so you can't create non-inherited constraints
 
  I don't have a problem with creating ONLY constraints on parents and
  children.  We just don't want them to be removed/modified if they are
  shared.

 Well, when you delete a constraint from child, the constraint becomes an
 ONLY constraint on parent.

Only if there's a single child, otherwise you have a partially-ONLY
constraint unless you made it ONLY constraints on the parent and all other
children (but then removing the parent constraint wouldn't remove it from
the other children presumably).

 If you allow ONLY constraints on parents, then why disallow dropping
 them from childs ?

I agree with this in any case.  I think both are fairly broken.

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] problem with large maintenance_work_mem settings and

2006-03-08 Thread Stefan Kaltenbrunner


Tom Lane wrote:

I wrote:


Stefan Kaltenbrunner [EMAIL PROTECTED] writes:


samples  %symbol name
24915704 96.2170  ltsReleaseBlock




We probably need to tweak things so this doesn't get called during the
final merge pass.  Looking at it now.



I've committed a fix for this into CVS HEAD --- please try it out.


just tried that with CVS HEAD (includes the second fix too):


CREATE INDEX on a 1,8B row table (5 int columns - index created on the 
first row about 300M distinct values):


before: 11h 51min
after: 3h 11min(!)



Stefan

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] problem with large maintenance_work_mem settings and

2006-03-08 Thread Tom Lane

Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 CREATE INDEX on a 1,8B row table (5 int columns - index created on the 
 first row about 300M distinct values):

 before: 11h 51min
 after: 3h 11min(!)

Cool.  Does it seem to be I/O bound now?  Would you be willing to do it
over with oprofile turned on?

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend

[HACKERS] Add switches for DELIMITER and NULL in pg_dump COPY

2006-03-08 Thread David Fetter

Folks,

From the earlier discussion, it appears that there is a variety of
opinions on what the COPY delimiter should be in pg_dump.  This patch
allows people to set it and the NULL string.  Thanks to Gavin Sherry
for help with the pointers :)

I didn't patch pg_dumpall, but it would be trivial if there's a use
case.

Cheers,
D
-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 415 235 3778

Remember to vote!
Index: doc/src/sgml/ref/pg_dump.sgml
===
RCS file: /projects/cvsroot/pgsql/doc/src/sgml/ref/pg_dump.sgml,v
retrieving revision 1.81
diff -c -r1.81 pg_dump.sgml
*** doc/src/sgml/ref/pg_dump.sgml   1 Nov 2005 21:09:50 -   1.81
--- doc/src/sgml/ref/pg_dump.sgml   6 Mar 2006 07:32:05 -
***
*** 163,168 
--- 163,208 
   /varlistentry
  
   varlistentry
+   termoption--copy-delimiter=replaceable 
class=parameterdelimiter/replaceable/option/term
+   listitem
+para
+ Use replaceable class=parameterdelimiter/replaceable
+ instead of the default tab character in commandCOPY/command 
statements. 
+/para
+   /listitem
+  /varlistentry
+ 
+  varlistentry
+   termoption--copy-null=replaceable 
class=parameterstring_for_nulls/replaceable/option/term
+   listitem
+para
+ Use  replaceable class=parameterstring_for_nulls/replaceable 
instead of the
+ default \N in commandCOPY/command statements.
+/para
+   /listitem
+  /varlistentry
+ 
+  varlistentry
+   termoption--copy-delimiter=replaceable 
class=parameterdelimiter/replaceable/option/term
+   listitem
+para
+ Use replaceable class=parameterdelimiter/replaceable
+ instead of the default tab character in commandCOPY/command 
statements. 
+/para
+   /listitem
+  /varlistentry
+ 
+  varlistentry
+   termoption--copy-null=replaceable 
class=parameterstring_for_nulls/replaceable/option/term
+   listitem
+para
+ Use  replaceable class=parameterstring_for_nulls/replaceable 
instead of the
+ default \N in commandCOPY/command statements.
+/para
+   /listitem
+  /varlistentry
+ 
+  varlistentry
termoption-d/option/term
termoption--inserts/option/term
listitem
Index: src/bin/pg_dump/pg_dump.c
===
RCS file: /projects/cvsroot/pgsql/src/bin/pg_dump/pg_dump.c,v
retrieving revision 1.433
diff -c -r1.433 pg_dump.c
*** src/bin/pg_dump/pg_dump.c   5 Mar 2006 15:58:50 -   1.433
--- src/bin/pg_dump/pg_dump.c   6 Mar 2006 07:32:12 -
***
*** 114,119 
--- 114,125 
  /* flag to turn on/off dollar quoting */
  static intdisable_dollar_quoting = 0;
  
+ /* Things used when caller invokes COPY options. */
+ #define ARG_COPY_DELIMITER 2
+ #define ARG_COPY_NULL 3
+ char *copy_delimiter = \t;
+ char *copy_null;
+ 
  
  static void help(const char *progname);
  static NamespaceInfo *findNamespace(Oid nsoid, Oid objoid);
***
*** 181,186 
--- 187,193 
 ExecStatusType expected);
  
  
+ 
  int
  main(int argc, char **argv)
  {
***
*** 211,217 
char   *outputSuperuser = NULL;
  
RestoreOptions *ropt;
! 
static struct option long_options[] = {
{data-only, no_argument, NULL, 'a'},
{blobs, no_argument, NULL, 'b'},
--- 218,224 
char   *outputSuperuser = NULL;
  
RestoreOptions *ropt;
!   
static struct option long_options[] = {
{data-only, no_argument, NULL, 'a'},
{blobs, no_argument, NULL, 'b'},
***
*** 249,254 
--- 256,269 
{disable-dollar-quoting, no_argument, 
disable_dollar_quoting, 1},
{disable-triggers, no_argument, disable_triggers, 1},
{use-set-session-authorization, no_argument, 
use_setsessauth, 1},
+   
+   /*
+* The following options don't have an equivalent short option
+* letter, and are not available as -X long-name.  Just use
+* the long form.
+*/
+   {copy-delimiter, required_argument, NULL, ARG_COPY_DELIMITER},
+   {copy-null, required_argument, NULL, ARG_COPY_NULL},
  
{NULL, 0, NULL, 0}
};
***
*** 418,423 
--- 433,460 
break;
/* This covers the long options equivalent to 
-X xxx. */
  
+   case ARG_COPY_DELIMITER:
+   if ( strlen(optarg) != 1)
+   {
+   fprintf(stderr, _(In %s, 
copy-delimiter must be exactly one byte long, not %d.\n),
+

Re: [HACKERS] [SQL] Interval subtracting

2006-03-08 Thread Scott Marlowe

On Wed, 2006-03-08 at 06:07, Markus Schaber wrote:
 Hi, Scott,
 
 Scott Marlowe wrote:
 
 But it isn't '-2 months, -1 day'.  I think what you are saying is what I
 am saying, that we should make the signs consistent.
  Pretty much.  It just seems wrong to have different signs in what is
  essentially a single unit.
  
  We don't say 42 degrees, -12 minutes when measuring arc, do we?  Then
  again, maybe some folks do.  It just seems wrong to me.
 
 But we say quarter to twelve, at least in some areas on this planet.
 
 The problem is that months have different lengths. '2 months - 1 day'
 can be '1 month 27 days', '1 month 28 days', '1 month 29 days' or '1
 month 30 days', depending on the timestamp we apply the interval.

I made this point before.  In the military they say 1145 or 2345 instead
of quarter to twelve, because 1: there are two quarter to twelves a
day, and 2: It's easy to get it confused.  

For same reasons, i.e. a need for precision, I find it hard to accept
the idea of mixing positive and negative units in the same interval. 
The plus or minus sign should be outside of the interval.

Then, it's quite certain what you mean.  If you say 

select '2006-06-12'::date - interval '1 month 2 days' 

there is no ambiguity.  If you say:

select '2006-06-12'::date + interval '-1 month -2 days'

do you mean (1 month - 2 days) subtracted from the date, or 
do you mean to subtract 1 month, then 2 days from the date?

Putting the + or - outside the interval seems to make the most sense to
me.  Allowing them inside makes no sense to me.  And colloquialisms
aren't really a good reason.  :)

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] Running out of disk space during query

2006-03-08 Thread Tom Lane

Stephen Frost [EMAIL PROTECTED] writes:
 I suppose I could put quotas in place or something but I don't really
 have a problem with the database as a whole using up a bunch of disk
 space (hence why it's got alot of room to grow into), I just would have
 liked a this will chew up more disk space than you have and then fail
 message instead of what ended up happening for this query.

I've got the same problem with this that I do with the recently-proposed
patch to fail queries with estimated cost  X --- to wit, I think it
will result in a net *reduction* in system reliability not an improvement.
Any such feature changes the planner estimates from mere heuristics into
a gating factor that will make queries fail entirely.  And they are
really not good enough to put that kind of trust into.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] Add switches for DELIMITER and NULL in pg_dump COPY

2006-03-08 Thread Tom Lane

David Fetter [EMAIL PROTECTED] writes:
 From the earlier discussion, it appears that there is a variety of
 opinions on what the COPY delimiter should be in pg_dump.  This patch
 allows people to set it and the NULL string.

Did anyone provide a convincing use case for this?  It's of zero value
from the perspective of pg_dump itself; the only possible argument is
that it makes it easier for program-foo to parse the output of pg_dump.
But I don't see any programs around to parse arbitrary SQL scripts,
especially not the pretty-PG-specific scripts that pg_dump emits.

I think it much more likely that people needing this sort of thing would
be using something like psql -c 'copy foo to stdout', so as to get the
data without any added overhead.

So this seems like mere creeping featurism to me.  pg_dump has too many
switches already.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Jim C. Nasby

On Wed, Mar 08, 2006 at 07:28:16AM -0800, Luke Lonergan wrote:
  I think this would be extremely dangerous, as it would encourage
  processes to take more than their fair share of available resources.
 
 I agree - in fact, we currently have no structured concept of fair share of
 available resources, nor a way to share them.

A concept it would be great to add at some point, both for memory and
IO. But that's another discussion entirely.

 I think the answer to this should involve the use of statement queuing and
 resource queues.

Something else to consider is reducing the amount of memory used when we
have to fail to a tape sort, because at that point we'll be
substantially slower. So, for example, allow in-memory sorts to use up
to 1GB, because it shouldn't take a long period of time to read that
data in, and the sort will then be extremely fast. That means that the
sort would be using that amount of memory for a short period of time. If
we do have to fail to disk, cut back to 128MB, because having 8x that
certainly won't make the sort run anywhere close to 8x faster. The trick
would be releasing memory that a sort we thought could fit in memory but
couldn't. It would also be good to start estimating which sorts should
fit in memory and which won't before we start (AFAIK the current code
assumes we'll fit in memory until it runs out).
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] Add switches for DELIMITER and NULL in pg_dump COPY

2006-03-08 Thread David Fetter

On Wed, Mar 08, 2006 at 11:03:00AM -0500, Tom Lane wrote:
 David Fetter [EMAIL PROTECTED] writes:
  From the earlier discussion, it appears that there is a variety of
  opinions on what the COPY delimiter should be in pg_dump.  This
  patch allows people to set it and the NULL string.
 
 Did anyone provide a convincing use case for this?

I've had one so far, and it was enough to cause me to make a special
patched version of pg_dump.  To get some idea of how drastic that was,
consider that I think it's generally bad practice to compile from
source because it can take you too far off the generally supported
software map.  The case I had was making a database with a schema and
initial data whose dump output gets checked into a source code
management system.  Those initial data sets, which can change--for
example when the corresponding ISO codes do--may be in many different
tables, so the easiest way to do this is to make the dump file as easy
as possible to edit.

 It's of zero value from the perspective of pg_dump itself; the only
 possible argument is that it makes it easier for program-foo to
 parse the output of pg_dump.  But I don't see any programs around to
 parse arbitrary SQL scripts, especially not the pretty-PG-specific
 scripts that pg_dump emits.

It's less about program-foo parsing than about multi-table data
management, as above.  However, I'm sure that there are people who
will find other uses for it.

 I think it much more likely that people needing this sort of thing would
 be using something like psql -c 'copy foo to stdout', so as to get the
 data without any added overhead.

The one-table-at-a-time approach is quite error-prone for large
numbers of tables and/or large data sets.

 So this seems like mere creeping featurism to me.  pg_dump has too
 many switches already.

I've been careful to see to it that only people who use the switches
are affected by it.  I am also volunteering to do ongoing maintenance
of this feature. :)

Cheers,
D
-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 415 235 3778

Remember to vote!

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] [SQL] Interval subtracting

2006-03-08 Thread Tom Lane

Scott Marlowe [EMAIL PROTECTED] writes:
 For same reasons, i.e. a need for precision, I find it hard to accept
 the idea of mixing positive and negative units in the same interval. 

The semantics are perfectly well defined, so I don't buy this.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [HACKERS] [PATCHES] Add switches for DELIMITER and NULL in pg_dump COPY

2006-03-08 Thread David Fetter

On Wed, Mar 08, 2006 at 11:10:04AM -0500, Neil Conway wrote:
 On Wed, 2006-03-08 at 07:47 -0800, David Fetter wrote:
  From the earlier discussion, it appears that there is a variety of
  opinions on what the COPY delimiter should be in pg_dump.  This patch
  allows people to set it and the NULL string.
 
 I'm still not convinced there is a reasonable use-case for this feature.
 I can't recall: did the previous discussion conclude that we actually
 want this functionality?

The previous discussion showed that there is a wide diversity of
opinions on what The Right Delimiter and The Right NULL String(TM)
are.

Thanks for the tips. :)  I'll make a revised patch this evening, time
permitting.

Cheers,
D
-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 415 235 3778

Remember to vote!

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Tom Lane

Jim C. Nasby [EMAIL PROTECTED] writes:
 If we do have to fail to disk, cut back to 128MB, because having 8x that
 certainly won't make the sort run anywhere close to 8x faster.

Not sure that follows.  In particular, the entire point of the recent
changes has been to extend the range in which we can use a single merge
pass --- that is, write the data once as N sorted runs, then merge them
in a single read pass.  As soon as you have to do an actual merge-back-
to-disk pass, your total I/O volume doubles, so there is definitely a
considerable gain if that can be avoided.  And a larger work_mem
translates directly to fewer/longer sorted runs.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [HACKERS] Running out of disk space during query

2006-03-08 Thread Stephen Frost

Tom,

* Tom Lane ([EMAIL PROTECTED]) wrote:
 I've got the same problem with this that I do with the recently-proposed
 patch to fail queries with estimated cost  X --- to wit, I think it
 will result in a net *reduction* in system reliability not an improvement.
 Any such feature changes the planner estimates from mere heuristics into
 a gating factor that will make queries fail entirely.  And they are
 really not good enough to put that kind of trust into.

Perhaps instead then have the system fail the query once it's gone
beyond some configurable limit on temporary disk usage?  The query still
would have run for a while but it wouldn't have run the partition out of
space and would have come back faster at least.

Comparing this to work_mem- do we do something like this there?  I don't
think we do, which means we're trusting the planner's estimate to get
the memory size estimate right and that can end up being way off
resulting in queries taking up well beyond what work_mem would normally
allow them...  I recall alot of discussion but don't recall if anything
was actually done to resolve that issue either.

It seems to me we should probably: not trust the planner's estimates and
therefore implement checks to fail things once we've gone well beyond
what we expected to use.  If we've done this for work_mem then using
whatever we did there for a 'temporary disk space limit' would at least
make me happy.  If we havn't then perhaps we should do something for 
both.

Thanks!

Stephen


signature.asc
Description: Digital signature

Re: [HACKERS] Add switches for DELIMITER and NULL in pg_dump COPY

2006-03-08 Thread Tom Lane

David Fetter [EMAIL PROTECTED] writes:
 On Wed, Mar 08, 2006 at 11:03:00AM -0500, Tom Lane wrote:
 Did anyone provide a convincing use case for this?

 I've had one so far, and it was enough to cause me to make a special
 patched version of pg_dump.  To get some idea of how drastic that was,
 consider that I think it's generally bad practice to compile from
 source because it can take you too far off the generally supported
 software map.  The case I had was making a database with a schema and
 initial data whose dump output gets checked into a source code
 management system.

So?  Don't tell me your SCMS can't handle tabs.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] [PATCHES] Add switches for DELIMITER and NULL in pg_dump COPY

2006-03-08 Thread Neil Conway

On Wed, 2006-03-08 at 07:47 -0800, David Fetter wrote:
 From the earlier discussion, it appears that there is a variety of
 opinions on what the COPY delimiter should be in pg_dump.  This patch
 allows people to set it and the NULL string.

I'm still not convinced there is a reasonable use-case for this feature.
I can't recall: did the previous discussion conclude that we actually
want this functionality?

 *** src/bin/pg_dump/pg_dump.c   5 Mar 2006 15:58:50 -   1.433
 --- src/bin/pg_dump/pg_dump.c   6 Mar 2006 07:32:12 -
 ***
 *** 114,119 
 --- 114,125 
   /* flag to turn on/off dollar quoting */
   static intdisable_dollar_quoting = 0;
   
 + /* Things used when caller invokes COPY options. */
 + #define ARG_COPY_DELIMITER 2
 + #define ARG_COPY_NULL 3
 + char *copy_delimiter = \t;
 + char *copy_null;
 + 

The variables should be declared static.
  
   static void help(const char *progname);
   static NamespaceInfo *findNamespace(Oid nsoid, Oid objoid);
 ***
 *** 181,186 
 --- 187,193 
  ExecStatusType expected);
   
   
 + 
   int
   main(int argc, char **argv)
   {
 ***
 *** 211,217 
 char   *outputSuperuser = NULL;
   
 RestoreOptions *ropt;
 ! 
 static struct option long_options[] = {
 {data-only, no_argument, NULL, 'a'},
 {blobs, no_argument, NULL, 'b'},
 --- 218,224 
 char   *outputSuperuser = NULL;
   
 RestoreOptions *ropt;
 !   
 static struct option long_options[] = {
 {data-only, no_argument, NULL, 'a'},
 {blobs, no_argument, NULL, 'b'},

Please review patches and eliminate content-free hunks like these before
submitting.

 ***
 *** 427,432 
 --- 464,479 
 }
 }
   
 +   if (copy_null == NULL)
 +   copy_null = malloc(3);
 +   strcpy(copy_null, \\N);

You're missing some braces.

 +   if (strstr(copy_null, copy_delimiter))
 +   {
 +   fprintf(stderr, _(In %s, the NULL AS string cannot
 contain the COPY delimiter.\n), progname);
 +   exit(1);
 +   }

I'm not sure as to whether you should be using write_msg() or fprintf()
here, but we should probably pick one and be consistent. Also ISTM we
should be to refactor the code to use exit_nicely() anyway, provided
that g_conn is initialized to NULL before we have connected to the DB.

 ***
 *** 702,707 
 --- 749,756 
 use SESSION
 AUTHORIZATION commands instead of\n
 OWNER TO commands
 \n));
   
 +   printf(_(  --copy-delimiter string to use as column
 DELIMITER in COPY statements\n));

Capitalizing DELIMITER here is not good style, IMHO: it is just a
normal word.

 *** 844,849 
 --- 893,904 
 int ret;
 char   *copybuf;
 const char *column_list;
 +   char *local_copy_delimiter;
 +   char *local_copy_null;
 +   local_copy_delimiter = malloc(2*strlen(copy_delimiter)+1);
 +   PQescapeString (local_copy_delimiter, copy_delimiter,
 2*strlen(copy_delimiter)+1);
 +   local_copy_null = malloc(2*strlen(copy_null)+1);
 +   PQescapeString (local_copy_null, copy_null,
 2*strlen(copy_null)+1);

Spacing: spaces around operands to mathematical operators, no spaces
before the parameter list to a function call.

You should also fix this compiler warning:

[...]/pg_dump.c:440: warning: format '%d' expects type 'int', but
argument 4 has type 'size_t'

-Neil



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Add switches for DELIMITER and NULL in pg_dump COPY

2006-03-08 Thread David Fetter

On Wed, Mar 08, 2006 at 11:26:00AM -0500, Tom Lane wrote:
 David Fetter [EMAIL PROTECTED] writes:
  On Wed, Mar 08, 2006 at 11:03:00AM -0500, Tom Lane wrote:
  Did anyone provide a convincing use case for this?
 
  I've had one so far, and it was enough to cause me to make a
  special patched version of pg_dump.  To get some idea of how
  drastic that was, consider that I think it's generally bad
  practice to compile from source because it can take you too far
  off the generally supported software map.  The case I had was
  making a database with a schema and initial data whose dump output
  gets checked into a source code management system.
 
 So?  Don't tell me your SCMS can't handle tabs.

Not everybody's editor/mailer/whatever does this right, and it makes
things fragile.  Another way to do this is to change the delimter to a
printable character like '|', but that raises hackles, too.

Cheers,
D
-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 415 235 3778

Remember to vote!

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] Add switches for DELIMITER and NULL in pg_dump COPY

2006-03-08 Thread Greg Stark


David Fetter [EMAIL PROTECTED] writes:

 Not everybody's editor/mailer/whatever does this right, and it makes
 things fragile.  Another way to do this is to change the delimter to a
 printable character like '|', but that raises hackles, too.

Frankly if you're passing you data through an editor/mailer/whatever you don't
trust then your setup is already fragile. At least if you're using tabs then
you find out about these problems. Tiptoeing around the untrustworthy process
just means that it'll fail randomly (and unpredictably) when other characters
appear in the data that the software doesn't handle.

There are certainly cases where you'll need to do this to interface with other
(amateurish) software. But pg_dump isn't for that at all. Even COPY isn't a
general purpose data formatter. To interface with other software not using a
standard format you're going to have to pass the data through Perl or
something like that anyways.

-- 
greg


---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] Running out of disk space during query

2006-03-08 Thread Simon Riggs

On Wed, 2006-03-08 at 08:33 -0500, Stephen Frost wrote:
 Greetings,
 
 * Simon Riggs ([EMAIL PROTECTED]) wrote:
  work_mem= 1 GB  benefit at 8 TB
  work_mem= 256MB benefit at 0.5 TB
  (based upon runs on average twice size of memory, and each logical tape
  requiring 256KB memory, i.e. min(work_mem/4, 6) * work_mem * 2, which
  for work_mem  2 MB gives 0.5 * work_mem^2)
 
 Seeing this reminded me of an issue I ran into recently.  In 8.1 on a
 database that's only 16G, I ran a query that chewed up all the available
 disk space (about 250G, yes, 0.25TB) on the partition and then failed.
 Of course, this took many hours on a rather speedy box (and the disk
 array is a pretty nice IBM SAN so it's not exactly a slacker either) and
 produced nothing for me.
 
 I'd like to think it's often the case that Postgres has some idea what
 the total disk space usage of a given query is going to be prior to
 actually running the whole query and just seeing how much space it took
 at the highest point.  If this can be done with some confidence then
 it'd be neat if Postgres could either check if there's enough disk space
 available and if not bail (I know, difficult to do cross-platform and
 there's tablespaces and whatnot to consider) OR if there was a parameter
 along the lines of max_temp_disk_space which would fail the query if
 that would be exceeded by the query.  The latter could even be two GUC
 variables, one administrator set and unchangable by the user ('hard'
 limit) and one settable by the user with a sane default ('soft' limit)
 and perhaps a HINT which indicates how to change it in the error
 message when the limit is hit.
 
 I suppose I could put quotas in place or something but I don't really
 have a problem with the database as a whole using up a bunch of disk
 space (hence why it's got alot of room to grow into), I just would have
 liked a this will chew up more disk space than you have and then fail
 message instead of what ended up happening for this query.

We can do work_space and maintenance_work_space fairly easily. We
know how much we are writing, so we don't need to ask the OS how much it
has left, just compare against the parameter and assume that it has been
set correctly by the admin.

Personally, I would rather abort a large sort before we ran for many
hours and then hit those limits. That was the purpose of the
statement_cost_limit parameter mentioned just recently.

Top-down space allocation is essentially the same problem as top-down
memory allocation. In both memory and tempspace we have a hard limit
that if we go beyond, bad things happen. ISTM that we would like to
logically allocate these resources from central pool(s) and then reclaim
or return that allocation when you're done with it. In both cases the
actual physical allocation would be made by the individual backend. It's
fairly easy to track overall space, but its somewhat harder to force a
single query to work within a single allocation since multiple steps
might well want to allocate the same work_mem and have been optimized to
expect they will get that size of allocation...

Best Regards, Simon Riggs



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Jim C. Nasby

On Wed, Mar 08, 2006 at 11:20:50AM -0500, Tom Lane wrote:
 Jim C. Nasby [EMAIL PROTECTED] writes:
  If we do have to fail to disk, cut back to 128MB, because having 8x that
  certainly won't make the sort run anywhere close to 8x faster.
 
 Not sure that follows.  In particular, the entire point of the recent
 changes has been to extend the range in which we can use a single merge
 pass --- that is, write the data once as N sorted runs, then merge them
 in a single read pass.  As soon as you have to do an actual merge-back-
 to-disk pass, your total I/O volume doubles, so there is definitely a
 considerable gain if that can be avoided.  And a larger work_mem
 translates directly to fewer/longer sorted runs.

But do fewer/longer sorted runs translate into not merging back to disk?
I thought that was controlled by if we had to be able to rewind the
result set.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] problem with large maintenance_work_mem settings and

2006-03-08 Thread Simon Riggs

On Wed, 2006-03-08 at 10:45 -0500, Tom Lane wrote:
 Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
  CREATE INDEX on a 1,8B row table (5 int columns - index created on the 
  first row about 300M distinct values):
 
  before: 11h 51min
  after: 3h 11min(!)
 
 Cool.  Does it seem to be I/O bound now?  Would you be willing to do it
 over with oprofile turned on?

Very.

Any chance of trying it with different maintenance_work_mem settings?

Did you try this with trace_sort=on? If so could we get the logs for
that?

[Results welcome from other hackers...particularly with regard to
queries with sort steps in rather than CREATE INDEX.]

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Simon Riggs

On Wed, 2006-03-08 at 10:21 -0500, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  1. Earlier we had some results that showed that the heapsorts got slower
  when work_mem was higher and that concerns me most of all right now.
 
 Fair enough, but that's completely independent of the merge algorithm.
 (I don't think the Nyberg results necessarily apply to our situation
 anyway, as we are not sorting arrays of integers, and hence the cache
 effects are far weaker for us.  I don't mind trying alternate sort
 algorithms, but I'm not going to believe an improvement in advance of
 direct evidence in our own environment.)

Of course, this would be prototyped first...and I agree about possible
variability of those results for us.

  2. Improvement in the way we do overall memory allocation, so we would
  not have the problem of undersetting work_mem that we currently
  experience. If we solved this problem we would have faster sorts in
  *all* cases, not just extremely large ones. Dynamically setting work_mem
  higher when possible would be very useful.
 
 I think this would be extremely dangerous, as it would encourage
 processes to take more than their fair share of available resources.

Fair share is the objective. I was trying to describe the general case
so we could discuss a solution that would allow a dynamic approach
rather than the static one we have now.

Want to handle these cases: How much to allocate, when...
A. we have predicted number of users 
B. we have a busy system - more than predicted number of users
C. we have a quiet system - less than predicted number of users

In B/C we have to be careful that we don't under/overallocate resources
only to find the situation changes immediately afterwards.

In many cases the static allocation is actually essential since you may
be more interested in guaranteeing a conservative run time rather than
seeking to produce occasional/unpredictable bursts of speed. But in many
cases people want to have certain tasks go faster when its quiet and go
slower when its not.

 Also, to the extent that you believe the problem is insufficient L2
 cache, it seems increasing work_mem to many times the size of L2 will
 always be counterproductive.  

Sorry to confuse: (1) and (2) were completely separate, so no intended
interaction between L2 cache and memory.

 (Certainly there is no value in increasing
 work_mem until we are in a regime where it consistently improves
 performance significantly, which it seems we aren't yet.)

Very much agreed.

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] Status of TODO: postgresql.conf: reset to default when

2006-03-08 Thread Joachim Wieland

On Sun, Mar 05, 2006 at 01:23:10PM -0500, Bruce Momjian wrote:
  Allow commenting of variables in postgresql.conf to restore them to 
  defaults
  Currently, if a variable is commented out, it keeps the previous
  uncommented value until a server restarted.

  I take that to apply to the configuration re-read at SIGHUP?

 Right, and it is something we have needed to fix for a while.

I'm working on this one.


Joachim

-- 
Joachim Wieland  [EMAIL PROTECTED]
C/ Usandizaga 12 1°B   ICQ: 37225940
20002 Donostia / San Sebastian (Spain) GPG key available

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Luke Lonergan

Jim,

On 3/8/06 9:49 AM, Jim C. Nasby [EMAIL PROTECTED] wrote:

 On Wed, Mar 08, 2006 at 11:20:50AM -0500, Tom Lane wrote:

 Not sure that follows.  In particular, the entire point of the recent
 changes has been to extend the range in which we can use a single merge
 pass --- that is, write the data once as N sorted runs, then merge them
 in a single read pass.  As soon as you have to do an actual merge-back-
 to-disk pass, your total I/O volume doubles, so there is definitely a
 considerable gain if that can be avoided.  And a larger work_mem
 translates directly to fewer/longer sorted runs.
 
 But do fewer/longer sorted runs translate into not merging back to disk?
 I thought that was controlled by if we had to be able to rewind the
 result set.

In the *tape* algorithm, there is an intermediate abstraction in the merging
called tapes (!) that are used to store intermediate merge results.  Simon's
work implemented more tapes, which asymptotically approaches a single merge
pass as the number of tapes approaches the number of runs.

The Replacement Selection algorithm generally will produce about 1/2 the
number of runs that a simpler partial sort algorithm would, and the more
memory it uses, the fewer runs there are, and with fewer runs, fewer tapes
are required to avoid more passes on the merge.

This whole tape abstraction is something that I believe is unique to
Postgres among modern databases, and we have found that by removing it
entirely along with logtape.c, we remove 2000 lines of useless code that
only complicates our optimization problem.

- Luke 



---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] Problemas with gram.y

2006-03-08 Thread etmorelli

Tom, 

sorry, but the address that you wrote tells that there isn´t any patch to 
apply. Is this patch Itagaki's one? How could I pick it? 

By the way, don´t worry about the whole idea. It's an experiment that shall 
be improved in the future, I hope. 

Best regards, 

Eduardo Morelli 



 
 Tom Lane wrote: 


Um, are you aware that a patch for that was already submitted? 
http://momjian.postgresql.org/cgi-bin/pgpatches 
 
I find the whole idea pretty ugly myself. 
 
 regards, tom lane 





---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] Coverity Open Source Defect Scan of PostgreSQL

2006-03-08 Thread Ben Chelf




Martijn van Oosterhout wrote:

On Tue, Mar 07, 2006 at 05:39:18PM -0500, Tom Lane wrote:


Martijn van Oosterhout kleptog@svana.org writes:


#ifdef STATIC_ANALYSIS
#define ereport(elevel, rest)  \
   (errstart(elevel, __FILE__, __LINE__, PG_FUNCNAME_MACRO) ? \
(errfinish rest) : (void) 0), (elevel = ERROR ? exit(0) : 0)
#else
/* Normal def */
#endif


Hmm, neat idea ... though I wonder whether either gcc or Coverity's tool
is smart enough to draw the right conclusions from a conditional exit()
call ...




As for Coverity, if the elevel that's passed to the ereport is really a 
constant, the above #ifdef should absolutely do the trick for us so we 
know to stop analyzing on that path...Let me know if it doesn't actually 
do that ;)


-ben


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Tom Lane

Jim C. Nasby [EMAIL PROTECTED] writes:
 But do fewer/longer sorted runs translate into not merging back to disk?
 I thought that was controlled by if we had to be able to rewind the
 result set.

A plain SELECT ... ORDER BY doesn't assume that anymore.  It is still
required for some cases such as the input to a merge join, but the
on-the-fly-final-merge code is going to be used a lot more in 8.2 than
it was before.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Dann Corbit

I do not clearly understand the sorting code in PostgreSQL.  If I did
have a good grasp of it, I would take a go at improving it.

Here are some suggestions of things that I know work really, really
well:

#1.  Two pass merge (none of that silly poly-tape merge goo)

#2.  Load ONLY the keys that are to be sorted into memory.  Use a
pointer exchange sort, and do not move the physical rows of data at all.

I am pretty sure from this thread that PostgreSQL is not doing #1, and I
have no idea if it is doing #2.

A useful trick:
Since merge is mentioned, I should say something else about merge joins.
If you do not have room to load the sorted keys for bsearch, load every
kth key (where k is computed by sizeof merge_ram / sizeof key_data).
Then, when you have found the block the thing you are looking for by the
kth key bsearch, bsearch that block.

Now, maybe PostrgeSQL already uses tricks better than these.  I don't
know.  But if they prove helpful suggestions I will be glad of it.

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
 [EMAIL PROTECTED] On Behalf Of Tom Lane
 Sent: Wednesday, March 08, 2006 12:32 PM
 To: Jim C. Nasby
 Cc: Luke Lonergan; Simon Riggs; pgsql-hackers@postgreSQL.org
 Subject: Re: [HACKERS] Merge algorithms for large numbers of tapes
 
 Jim C. Nasby [EMAIL PROTECTED] writes:
  But do fewer/longer sorted runs translate into not merging back to
disk?
  I thought that was controlled by if we had to be able to rewind the
  result set.
 
 A plain SELECT ... ORDER BY doesn't assume that anymore.  It is still
 required for some cases such as the input to a merge join, but the
 on-the-fly-final-merge code is going to be used a lot more in 8.2 than
 it was before.
 
   regards, tom lane
 
 ---(end of
broadcast)---
 TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that
your
message can get through to the mailing list cleanly

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] Problemas with gram.y

2006-03-08 Thread Bruce Momjian

etmorelli wrote:
 Tom, 
 
 sorry, but the address that you wrote tells that there isn?t any patch to 
 apply. Is this patch Itagaki's one? How could I pick it? 
 
 By the way, don?t worry about the whole idea. It's an experiment that shall 
 be improved in the future, I hope. 
 
 Best regards, 
 
 Eduardo Morelli 
 
 
 
  
  Tom Lane wrote: 
 
 
 Um, are you aware that a patch for that was already submitted? 
 http://momjian.postgresql.org/cgi-bin/pgpatches 
  
 I find the whole idea pretty ugly myself. 

The patch has moved to:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

-- 
  Bruce Momjian   http://candle.pha.pa.us
  SRA OSS, Inc.   http://www.sraoss.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Luke Lonergan

Dann,

On 3/8/06 12:39 PM, Dann Corbit [EMAIL PROTECTED] wrote:

 Here are some suggestions of things that I know work really, really
 well:

Can you point to an example?  That might help move the discussion along.

The reason to interject about the tape goo in this discussion is that we
seem to be spending a lot of time optimizing around the tape goo without
tackling the overall structure of the external sort.  I think we'll just end
up having to replace all of this goo when we really get around to fixing the
problem.

Add to this that other commercial databases external sort in 1/4 the time or
better on the same hardware with the same CPU/memory resources using a
2-pass external sort.
 
 #1.  Two pass merge (none of that silly poly-tape merge goo)

Voice of reason here.  It's what the other database systems do.
 
 #2.  Load ONLY the keys that are to be sorted into memory.  Use a
 pointer exchange sort, and do not move the physical rows of data at all.

Sounds right.  Example of this in practice?
 
 I am pretty sure from this thread that PostgreSQL is not doing #1, and I
 have no idea if it is doing #2.

Yep.  Even Knuth says that the tape goo is only interesting from a
historical perspective and may not be relevant in an era of disk drives.

- Luke



---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] [PATCHES] Add switches for DELIMITER and NULL in pg_dump COPY

2006-03-08 Thread Neil Conway

On Wed, 2006-03-08 at 08:20 -0800, David Fetter wrote:
 The previous discussion showed that there is a wide diversity of
 opinions on what The Right Delimiter and The Right NULL String(TM)
 are.

Barring a more convincing justification for why we need this feature,
I'm inclined to side with Tom: pg_dump has enough obscure options as it
is, and I can't imagine very many people needing this functionality.

-Neil



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Dann Corbit

 -Original Message-
 From: Luke Lonergan [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, March 08, 2006 1:52 PM
 To: Dann Corbit; Tom Lane; Jim C. Nasby
 Cc: Simon Riggs; pgsql-hackers@postgreSQL.org
 Subject: Re: [HACKERS] Merge algorithms for large numbers of tapes

 Dann,

 On 3/8/06 12:39 PM, Dann Corbit [EMAIL PROTECTED] wrote:

  Here are some suggestions of things that I know work really, really
  well:

 Can you point to an example?  That might help move the discussion
along.

I wrote all of the sorting and merging stuff for CONNX Solutions
http://www.connx.com

I have carefully benched all of this stuff and (at least for our system)
the ideas I propose work well.  Of course, every system is different and
the only way to know if it is an improvement is to try it in place.

 The reason to interject about the tape goo in this discussion is that
we
 seem to be spending a lot of time optimizing around the tape goo
without
 tackling the overall structure of the external sort.  I think we'll
just
 end
 up having to replace all of this goo when we really get around to
fixing
 the
 problem.

I suggest trying several alternatives and benching them with real world
queries and especially with the open database benchmark suite.

 Add to this that other commercial databases external sort in 1/4 the
time
 or
 better on the same hardware with the same CPU/memory resources using a
 2-pass external sort.

Our sort merge is so fast that I can join two tables on a column with no
index faster than on a database that has a unique clustered index on the
column.  Benchmarked against Oracle, SQL*Server, and several others.

If you check our ORDER BY on a large table with no index, you will see
that it is competitive with the best commercial systems.

If you are interested, you could get an eval of CONNX and try it
yourself (eval is free for some number of days, I don't remember what).

   #1.  Two pass merge (none of that silly poly-tape merge goo)

 Voice of reason here.  It's what the other database systems do.

  #2.  Load ONLY the keys that are to be sorted into memory.  Use a
  pointer exchange sort, and do not move the physical rows of data at
all.

 Sounds right.  Example of this in practice?

It is what we use here.  It is the only way to fly.  This is well known,
and if you read a few articles from the ACM, you will see that it has
been known for decades.

  I am pretty sure from this thread that PostgreSQL is not doing #1,
and I
  have no idea if it is doing #2.

 Yep.  Even Knuth says that the tape goo is only interesting from a
 historical perspective and may not be relevant in an era of disk
drives.

 - Luke

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] Coverity Open Source Defect Scan of PostgreSQL

2006-03-08 Thread Josh Berkus

Folks,
 As for Coverity, if the elevel that's passed to the ereport is really a
 constant, the above #ifdef should absolutely do the trick for us so we
 know to stop analyzing on that path...Let me know if it doesn't actually
 do that ;)

Um, I think the answer is to train Coverity, not change our code to avoid 
the false-positives.  I know that Coverity is sophisticated enough to, for 
example, be programed to understand that elog(ERROR) does not continue.  

Actually, I thougth that Neil/eDB did this with their copy.  Is there any 
way to get a copy of that training configuration?

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Coverity Open Source Defect Scan of PostgreSQL

2006-03-08 Thread Jonah H. Harris

On 3/8/06, Josh Berkus josh@agliodbs.com wrote:
Actually, I thougth that Neil/eDB did this with their copy.Is there anyway to get a copy of that training configuration?
I think we have a backup of it somewhere. I'll look into it. 
-- Jonah H. Harris, Database Internals ArchitectEnterpriseDB Corporation732.331.1324

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Dann Corbit

There are some articles here that are worth reading if you want to sort
fast:

http://research.microsoft.com/barc/SortBenchmark/

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
 [EMAIL PROTECTED] On Behalf Of Dann Corbit
 Sent: Wednesday, March 08, 2006 1:59 PM
 To: Luke Lonergan; Tom Lane; Jim C. Nasby
 Cc: Simon Riggs; pgsql-hackers@postgreSQL.org
 Subject: Re: [HACKERS] Merge algorithms for large numbers of tapes

  -Original Message-
  From: Luke Lonergan [mailto:[EMAIL PROTECTED]
  Sent: Wednesday, March 08, 2006 1:52 PM
  To: Dann Corbit; Tom Lane; Jim C. Nasby
  Cc: Simon Riggs; pgsql-hackers@postgreSQL.org
  Subject: Re: [HACKERS] Merge algorithms for large numbers of tapes

  Dann,

  On 3/8/06 12:39 PM, Dann Corbit [EMAIL PROTECTED] wrote:

   Here are some suggestions of things that I know work really,
really
   well:

  Can you point to an example?  That might help move the discussion
 along.

 I wrote all of the sorting and merging stuff for CONNX Solutions
 http://www.connx.com

 I have carefully benched all of this stuff and (at least for our
system)
 the ideas I propose work well.  Of course, every system is different
and
 the only way to know if it is an improvement is to try it in place.

  The reason to interject about the tape goo in this discussion is
that
 we
  seem to be spending a lot of time optimizing around the tape goo
 without
  tackling the overall structure of the external sort.  I think we'll
 just
  end
  up having to replace all of this goo when we really get around to
 fixing
  the
  problem.

 I suggest trying several alternatives and benching them with real
world
 queries and especially with the open database benchmark suite.

  Add to this that other commercial databases external sort in 1/4 the
 time
  or
  better on the same hardware with the same CPU/memory resources using
a
  2-pass external sort.

 Our sort merge is so fast that I can join two tables on a column with
no
 index faster than on a database that has a unique clustered index on
the
 column.  Benchmarked against Oracle, SQL*Server, and several others.

 If you check our ORDER BY on a large table with no index, you will see
 that it is competitive with the best commercial systems.

 If you are interested, you could get an eval of CONNX and try it
 yourself (eval is free for some number of days, I don't remember
what).

#1.  Two pass merge (none of that silly poly-tape merge goo)

  Voice of reason here.  It's what the other database systems do.

   #2.  Load ONLY the keys that are to be sorted into memory.  Use a
   pointer exchange sort, and do not move the physical rows of data
at
 all.

  Sounds right.  Example of this in practice?

 It is what we use here.  It is the only way to fly.  This is well
known,
 and if you read a few articles from the ACM, you will see that it has
 been known for decades.

   I am pretty sure from this thread that PostgreSQL is not doing #1,
 and I
   have no idea if it is doing #2.

  Yep.  Even Knuth says that the tape goo is only interesting from a
  historical perspective and may not be relevant in an era of disk
 drives.

  - Luke

 ---(end of
broadcast)---
 TIP 5: don't forget to increase your free space map settings

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on include file, per author

2006-03-08 Thread Alvaro Herrera

Bruce Momjian wrote:
 Log Message:
 ---
 Remove Christof Petig copyright on include file, per author request.

Huh, I thought what he actually told was that the file was released
under BSD license.  Maybe I missed it, but I didn't see him asking to
remove the copyright.

We certainly have copyrights attributed to individual people.  Jan Wieck
has his name on the PL/Tcl and PL/pgSQL files, for example.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on include file, per author

2006-03-08 Thread Alvaro Herrera

Alvaro Herrera wrote:
 Bruce Momjian wrote:
  Log Message:
  ---
  Remove Christof Petig copyright on include file, per author request.
 
 Huh, I thought what he actually told was that the file was released
 under BSD license.  Maybe I missed it, but I didn't see him asking to
 remove the copyright.

Also, now that I see the actual diff, I see you only removed the
$PostgreSQL$ tag from ecpg_informix.h.  I assume this was unintended.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on include file,

2006-03-08 Thread Bruce Momjian

Alvaro Herrera wrote:
 Bruce Momjian wrote:
  Log Message:
  ---
  Remove Christof Petig copyright on include file, per author request.
 
 Huh, I thought what he actually told was that the file was released
 under BSD license.  Maybe I missed it, but I didn't see him asking to
 remove the copyright.
 
 We certainly have copyrights attributed to individual people.  Jan Wieck
 has his name on the PL/Tcl and PL/pgSQL files, for example.

We should not have individual copyrights to individuals in our source
tree.  If Jan's is in there, it should be removed too (with his
approval).  The only copyright holder should be PostgreSQL Global
Development Group.

Jan, would you fix that?

-- 
  Bruce Momjian   http://candle.pha.pa.us
  SRA OSS, Inc.   http://www.sraoss.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on

2006-03-08 Thread Bruce Momjian

Alvaro Herrera wrote:
 Alvaro Herrera wrote:
  Bruce Momjian wrote:
   Log Message:
   ---
   Remove Christof Petig copyright on include file, per author request.
  
  Huh, I thought what he actually told was that the file was released
  under BSD license.  Maybe I missed it, but I didn't see him asking to
  remove the copyright.
 
 Also, now that I see the actual diff, I see you only removed the
 $PostgreSQL$ tag from ecpg_informix.h.  I assume this was unintended.
 

Intended.  None of the other include files in that directory had it. 
Either all should, or none.  Should we add it to all of them?

-- 
  Bruce Momjian   http://candle.pha.pa.us
  SRA OSS, Inc.   http://www.sraoss.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on include file, per author

2006-03-08 Thread Alvaro Herrera

Bruce Momjian wrote:
 Alvaro Herrera wrote:
  Bruce Momjian wrote:
   Log Message:
   ---
   Remove Christof Petig copyright on include file, per author request.
  
  Huh, I thought what he actually told was that the file was released
  under BSD license.  Maybe I missed it, but I didn't see him asking to
  remove the copyright.
  
  We certainly have copyrights attributed to individual people.  Jan Wieck
  has his name on the PL/Tcl and PL/pgSQL files, for example.
 
 We should not have individual copyrights to individuals in our source
 tree.  If Jan's is in there, it should be removed too (with his
 approval).  The only copyright holder should be PostgreSQL Global
 Development Group.

Why give all our code to a organisation that doesn't exist?

On the $PostgreSQL$ issue, I think it's best to have the tag everywhere,
so if these files were the only ones that had it, I'd rather add them to
the rest ...

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on

2006-03-08 Thread Tom Lane

Bruce Momjian pgman@candle.pha.pa.us writes:
 Intended.  None of the other include files in that directory had it. 
 Either all should, or none.  Should we add it to all of them?

All.  There's no good excuse for not having a version tag on every
source-code file.  The contrib stuff is pretty bad about this, but
hey, it's just contrib ;-).  ecpg is part of the main tree and it
ought to be up to project standards.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Tom Lane

Dann Corbit [EMAIL PROTECTED] writes:
 Here are some suggestions of things that I know work really, really
 well:
 #1.  Two pass merge (none of that silly poly-tape merge goo)

This amounts to an assumption that you have infinite work_mem, in which
case you hardly need an external sort at all.  If your work_mem is in
fact finite, then at some point you need more than two passes.  I'm not
really interested in ripping out support for sort operations that are
much larger than work_mem.

 #2.  Load ONLY the keys that are to be sorted into memory.  Use a
 pointer exchange sort, and do not move the physical rows of data at all.

This suggestion isn't a whole lot better; in general the rows to be
sorted don't exist until we compute them, and so proposing that we
don't load them until later is pretty much irrelevant.  Also, in
a lot of common cases the keys to be sorted are the bulk of the data
anyway.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Dann Corbit

 -Original Message-
 From: Tom Lane [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, March 08, 2006 3:17 PM
 To: Dann Corbit
 Cc: Jim C. Nasby; Luke Lonergan; Simon Riggs;
pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Merge algorithms for large numbers of tapes

 Dann Corbit [EMAIL PROTECTED] writes:
  Here are some suggestions of things that I know work really, really
  well:
  #1.  Two pass merge (none of that silly poly-tape merge goo)

 This amounts to an assumption that you have infinite work_mem, in
which
 case you hardly need an external sort at all.  If your work_mem is in
 fact finite, then at some point you need more than two passes.  I'm
not
 really interested in ripping out support for sort operations that are
 much larger than work_mem.

No it does not.  I have explained this before.  You can have one million
files and merge them all into a final output with a single pass.  It
does not matter how big they are or how much memory you have.

The idea is very simple.  Each subfile has its top record inserted into
a priority queue of file handles (or whatever else you want to use --
temp tables, you name it). When you extract the smallest record from the
queue, the priority changes and that file handle gets moved to a new
place in the queue.  You keep pulling records from the queue until the
entire queue is empty.

The outline is like this:
1. Sort chunks
2. Write chunks
3. Insert top record of chunks into priority queue
4. Extract records from queue, writing them to final output
5. Repeat step 4 until queue is empty.

  #2.  Load ONLY the keys that are to be sorted into memory.  Use a
  pointer exchange sort, and do not move the physical rows of data at
all.

 This suggestion isn't a whole lot better; in general the rows to be
 sorted don't exist until we compute them, and so proposing that we
 don't load them until later is pretty much irrelevant.  Also, in
 a lot of common cases the keys to be sorted are the bulk of the data
 anyway.

This suggestion is in addition to suggestion 1.  They are not even
related except that both suggestions make the sort run a lot faster.

I think I did not explain it clearly enough.  Suppose that you have a
set of rows you need to sort.  Instead of loading the whole row into
memory, just load the columns (or parts of columns) that are being
sorted.  I hope that it is more clear now.

   regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Coverity Open Source Defect Scan of PostgreSQL

2006-03-08 Thread Greg Stark

Ben Chelf [EMAIL PROTECTED] writes:

 #ifdef STATIC_ANALYSIS
 #define ereport(elevel, rest)  \
 (errstart(elevel, __FILE__, __LINE__, PG_FUNCNAME_MACRO) ? \
  (errfinish rest) : (void) 0), (elevel = ERROR ? exit(0) : 0)
 #else
 /* Normal def */
 #endif
 
 As for Coverity, if the elevel that's passed to the ereport is really a
 constant, the above #ifdef should absolutely do the trick for us so we know to
 stop analyzing on that path...Let me know if it doesn't actually do that ;)

If you're willing to require elevel to always be a constant then why not just
tack on the (elevel = ERROR ? exit(0) : 0) onto the end of the regular
definition of ereport instead of having an ifdef?

Incidentally, if it's not guaranteed to be a constant then the definition
above is wrong because it's missing parentheses around elevel at both
occurrences.

-- 
greg


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Greg Stark


Luke Lonergan [EMAIL PROTECTED] writes:

  I am pretty sure from this thread that PostgreSQL is not doing #1, and I
  have no idea if it is doing #2.
 
 Yep.  Even Knuth says that the tape goo is only interesting from a
 historical perspective and may not be relevant in an era of disk drives.

As the size of the data grows larger the behaviour of hard drives looks more
and more like tapes. The biggest factor controlling the speed of i/o
operations is how many seeks are required to complete them. Effectively
rewinds are still the problem it's just that the cost of rewinds becomes
constant regardless of how long the tape is.

That's one thing that gives me pause about the current approach of using more
tapes. It seems like ideally the user would create a temporary work space on
each spindle and the database would arrange to use no more than that number of
tapes. Then each merge operation would involve only sequential access for both
reads and writes.

-- 
greg


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Dann Corbit

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, March 08, 2006 3:56 PM
 To: Luke Lonergan
 Cc: Dann Corbit; Tom Lane; Jim C. Nasby; Simon Riggs; pgsql-
 [EMAIL PROTECTED]
 Subject: Re: [HACKERS] Merge algorithms for large numbers of tapes

 Luke Lonergan [EMAIL PROTECTED] writes:

   I am pretty sure from this thread that PostgreSQL is not doing #1,
and
 I
   have no idea if it is doing #2.

  Yep.  Even Knuth says that the tape goo is only interesting from a
  historical perspective and may not be relevant in an era of disk
drives.

 As the size of the data grows larger the behaviour of hard drives
looks
 more
 and more like tapes. The biggest factor controlling the speed of i/o
 operations is how many seeks are required to complete them.
Effectively
 rewinds are still the problem it's just that the cost of rewinds
becomes
 constant regardless of how long the tape is.

 That's one thing that gives me pause about the current approach of
using
 more
 tapes. It seems like ideally the user would create a temporary work
space
 on
 each spindle and the database would arrange to use no more than that
 number of
 tapes. Then each merge operation would involve only sequential access
for
 both
 reads and writes.

If the chief concern is in the number of subfiles created, replacement
selection doubles the length of the subfiles while consuming no more
memory.
{The big-O of the algorithm sucks, though}

It is certainly worth testing several cases.

It is not a bad idea to enable more than one method of performing an
operation.

In the ideal case, you would have specific information about drives,
spindles, rates for seek, transfer, etc.

It all depends on how much effort you want to throw at it.

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Jim C. Nasby

On Wed, Mar 08, 2006 at 10:49:16AM -0800, Luke Lonergan wrote:
 Jim,
 
 On 3/8/06 9:49 AM, Jim C. Nasby [EMAIL PROTECTED] wrote:
 
  On Wed, Mar 08, 2006 at 11:20:50AM -0500, Tom Lane wrote:
 
  Not sure that follows.  In particular, the entire point of the recent
  changes has been to extend the range in which we can use a single merge
  pass --- that is, write the data once as N sorted runs, then merge them
  in a single read pass.  As soon as you have to do an actual merge-back-
  to-disk pass, your total I/O volume doubles, so there is definitely a
  considerable gain if that can be avoided.  And a larger work_mem
  translates directly to fewer/longer sorted runs.
  
  But do fewer/longer sorted runs translate into not merging back to disk?
  I thought that was controlled by if we had to be able to rewind the
  result set.
 
 In the *tape* algorithm, there is an intermediate abstraction in the merging
 called tapes (!) that are used to store intermediate merge results.  Simon's
 work implemented more tapes, which asymptotically approaches a single merge
 pass as the number of tapes approaches the number of runs.
 
 The Replacement Selection algorithm generally will produce about 1/2 the
 number of runs that a simpler partial sort algorithm would, and the more
 memory it uses, the fewer runs there are, and with fewer runs, fewer tapes
 are required to avoid more passes on the merge.
 
 This whole tape abstraction is something that I believe is unique to
 Postgres among modern databases, and we have found that by removing it
 entirely along with logtape.c, we remove 2000 lines of useless code that
 only complicates our optimization problem.

Oh, geez, I think I get it now. I was thinking that we did something
like sort a chunk, write it to disk, repeat until all data processed and
then just read from the stuff on disk in order, switching between files
as needed. But of course that would suck horribly if we were actually
using tapes. Like others have said, surely there's got to be a much
better way to go about things with more modern hardware. If there is,
then hopefully the possibility exists of returning memory back to the
pool if it's not going to be as useful as it would be to a sort that
would fit in-memory.

As an example, in my hypothetical algorithm that sorts one chunk at a
time and then bounces between chunks when reading the data back out, it
would probably be better to have fewer, larger chunks than many more
small ones. But the difference between 256M chunks and 1GB chunks
probably wouldn't be that big a difference, certainly not a 4x
improvement. So it makes sense to go with the smaller chunks if it means
that other sorts would be able to operate entirely in-memory. In an
ideal world, this allocation could even by dynamic, based on what else
was happening on the machine.

But I'll take any incremental improvement I can get right now. :) Just
having the ability to set a more aggressive work_mem without worrying
about causing a swap storm would be a huge improvement over the current
situation. Being able to cut back on memory use when we fall back to
disk would be icing on the cake. :)
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Jonah H. Harris

An interesting read at http://www.vldb.org/conf/1997/P376.PDFOn 3/8/06, Dann Corbit 
[EMAIL PROTECTED] wrote: -Original Message- From: 
[EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] Sent: Wednesday, March 08, 2006 3:56 PM To: Luke Lonergan Cc: Dann Corbit; Tom Lane; Jim C. Nasby; Simon Riggs; pgsql-
 [EMAIL PROTECTED] Subject: Re: [HACKERS] Merge algorithms for large numbers of tapes Luke Lonergan 
[EMAIL PROTECTED] writes:   I am pretty sure from this thread that PostgreSQL is not doing #1,and I   have no idea if it is doing #2.   Yep.Even Knuth says that the tape goo is only interesting from a
  historical perspective and may not be relevant in an era of diskdrives. As the size of the data grows larger the behaviour of hard driveslooks more and more like tapes. The biggest factor controlling the speed of i/o
 operations is how many seeks are required to complete them.Effectively rewinds are still the problem it's just that the cost of rewindsbecomes constant regardless of how long the tape is.
 That's one thing that gives me pause about the current approach ofusing more tapes. It seems like ideally the user would create a temporary workspace on each spindle and the database would arrange to use no more than that
 number of tapes. Then each merge operation would involve only sequential accessfor both reads and writes.If the chief concern is in the number of subfiles created, replacement
selection doubles the length of the subfiles while consuming no morememory.{The big-O of the algorithm sucks, though}It is certainly worth testing several cases.It is not a bad idea to enable more than one method of performing an
operation.In the ideal case, you would have specific information about drives,spindles, rates for seek, transfer, etc.It all depends on how much effort you want to throw at it.---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq-- Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation732.331.1324

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Jim C. Nasby

On Wed, Mar 08, 2006 at 03:35:53PM -0800, Dann Corbit wrote:
 I think I did not explain it clearly enough.  Suppose that you have a
 set of rows you need to sort.  Instead of loading the whole row into
 memory, just load the columns (or parts of columns) that are being
 sorted.  I hope that it is more clear now.

The issue is that there is a non-trivial amount of overhead in going
back to disk to get the raw data, and then you have to parse that into a
valid in-memory tuple. A worst-case scenario is if you're sorting all
the data that you've been asked to retrieve, ie:

SELECT a, b, c ... ORDER BY b, a, c;

That case is almost guaranteed to take longer if you try and do it with
just pointers.

But there is the other case:

SELECT a, b, c, big_honking_text_field ... ORDER BY a, b, c;

In this example it's entirely possible that leaving the big_honking
field out of the actual sorting would be a big win. Especially if your
temporary space was on a different set of spindles.

Regarding your suggestion of testing different kinds of sorts, that's
certainly a good idea if it can be done without a huge amount of work
coding each one up. Ultimately, it might make the most sense to support
multiple sort algorithms (at least for now) and let the planner decide
which one to use. That would at least get us a lot more real-world data
than any other method would.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Dann Corbit

 -Original Message-
 From: Jim C. Nasby [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, March 08, 2006 5:44 PM
 To: Dann Corbit
 Cc: Tom Lane; Luke Lonergan; Simon Riggs; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Merge algorithms for large numbers of tapes

 On Wed, Mar 08, 2006 at 03:35:53PM -0800, Dann Corbit wrote:
  I think I did not explain it clearly enough.  Suppose that you have
a
  set of rows you need to sort.  Instead of loading the whole row into
  memory, just load the columns (or parts of columns) that are being
  sorted.  I hope that it is more clear now.

 The issue is that there is a non-trivial amount of overhead in going
 back to disk to get the raw data, and then you have to parse that into
a
 valid in-memory tuple. A worst-case scenario is if you're sorting all
 the data that you've been asked to retrieve, ie:

 SELECT a, b, c ... ORDER BY b, a, c;

 That case is almost guaranteed to take longer if you try and do it
with
 just pointers.

 But there is the other case:

 SELECT a, b, c, big_honking_text_field ... ORDER BY a, b, c;

 In this example it's entirely possible that leaving the big_honking
 field out of the actual sorting would be a big win. Especially if your
 temporary space was on a different set of spindles.

 Regarding your suggestion of testing different kinds of sorts, that's
 certainly a good idea if it can be done without a huge amount of work
 coding each one up. Ultimately, it might make the most sense to
support
 multiple sort algorithms (at least for now) and let the planner decide
 which one to use. That would at least get us a lot more real-world
data
 than any other method would.

I typically do it something like this:

MSD_Radix_Sort_Hunks()
{
// We might have to bail for many reasons :
// Early part of the key may be identical for all rows
// We may not have a binning algorithm for this data type
// We may also only partially sort with MSD Radix sort
If (Set_Is_Too_Small_Or_Otherwise_Bail()) 
{
Introspective_Sort_Hunks();
}
Else
MSD_Radix_Alg(); // Cookie cutter of data stream into sorted hunks
}

Introspective_Sort_Hunks()
{
If (Set_Is_Too_Small_Or_Otherwise_Bail())
{
Ford_Johnson_Variant(); // Near optimal sort of very small sets
}
Else
Introspective_Alg();// Cookie cutter of data stream into sorted hunks
}

Queue_based_hunk_merge();

Now, you might have a merge that makes choices on entry similar to the
way that my sorts make choices on entry.

You will notice that my sorts decide internally on what algorithm to
perform.  Certainly, this is a simple approach that can generalize in
many ways.

 --
 Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
 Pervasive Software  http://pervasive.comwork: 512-231-6117
 vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Jim C. Nasby

On Wed, Mar 08, 2006 at 06:55:59PM -0500, Greg Stark wrote:
 
 Luke Lonergan [EMAIL PROTECTED] writes:
 
   I am pretty sure from this thread that PostgreSQL is not doing #1, and I
   have no idea if it is doing #2.
  
  Yep.  Even Knuth says that the tape goo is only interesting from a
  historical perspective and may not be relevant in an era of disk drives.
 
 As the size of the data grows larger the behaviour of hard drives looks more
 and more like tapes. The biggest factor controlling the speed of i/o
 operations is how many seeks are required to complete them. Effectively
 rewinds are still the problem it's just that the cost of rewinds becomes
 constant regardless of how long the tape is.

But it will take a whole lot of those rewinds to equal the amount of
time required by an additional pass through the data. I'll venture a
guess that as long as you've got enough memory to still read chunks back
in 8k blocks  that it won't be possible for a multi-pass sort to
out-perform a one-pass sort. Especially if you also had the ability to
do pre-fetching (not something to fuss with now, but certainly a
possibility in the future).
 
In any case, what we really need is at least good models backed by good
drive performance data. And we really should have that anyway so that we
can improve upon our cost estimator functions. I'm betting that what
that will show us is that no single sort method is going to work best
for all cases. IE: I'd bet that if your data set is sufficiently larger
than available memory that you'll actually be better off with a
multi-pass approach over a single/two pass approach.

 That's one thing that gives me pause about the current approach of using more
 tapes. It seems like ideally the user would create a temporary work space on
 each spindle and the database would arrange to use no more than that number of
 tapes. Then each merge operation would involve only sequential access for both
 reads and writes.

For that to be of any use, wouldn't you need to use only as many tapes
as spindles/2? Otherwise you're still trying to read and write from the
same set of drives, which means you're probably doing a lot of seeking.
Or do the tape algorithms re-write data as they read it?
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly

Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on include file, per author

2006-03-08 Thread Jim C. Nasby

On Wed, Mar 08, 2006 at 07:37:14PM -0300, Alvaro Herrera wrote:
 On the $PostgreSQL$ issue, I think it's best to have the tag everywhere,
 so if these files were the only ones that had it, I'd rather add them to
 the rest ...

+1. I'd also argue that it should be in contrib as well.

I can submit a patch for contrib or the entire tree if desired.
Convention is that it goes near the top of the file, correct?
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] [PATCHES] Add switches for DELIMITER and NULL in pg_dump COPY

2006-03-08 Thread Jim C. Nasby

On Wed, Mar 08, 2006 at 04:57:52PM -0500, Neil Conway wrote:
 On Wed, 2006-03-08 at 08:20 -0800, David Fetter wrote:
  The previous discussion showed that there is a wide diversity of
  opinions on what The Right Delimiter and The Right NULL String(TM)
  are.
 
 Barring a more convincing justification for why we need this feature,
 I'm inclined to side with Tom: pg_dump has enough obscure options as it
 is, and I can't imagine very many people needing this functionality.

Given all the different requests that come in for pg_dump and copy,
maybe it makes sense for Someone Who Cares to start a pgFoundry project
(or maybe extend the import/export project that's already there).
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Andrew Dunstan


Dann Corbit wrote:


I do not clearly understand the sorting code in PostgreSQL.  If I did
have a good grasp of it, I would take a go at improving it.

 



Show me the code (and the benchmarks).

Seriously. We see regular discussions on this and similar topics, but I 
haven't seen a patch that anyone has proven is an unequivocal 
improvement. that I can recall.


cheers

andrew



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match

Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on include file, per author

2006-03-08 Thread Alvaro Herrera

Jim C. Nasby wrote:

 +1. I'd also argue that it should be in contrib as well.
 
 I can submit a patch for contrib or the entire tree if desired.
 Convention is that it goes near the top of the file, correct?

Correct.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on include file, per author

2006-03-08 Thread Jim C. Nasby

On Wed, Mar 08, 2006 at 11:27:23PM -0300, Alvaro Herrera wrote:
 Jim C. Nasby wrote:
 
  +1. I'd also argue that it should be in contrib as well.
  
  I can submit a patch for contrib or the entire tree if desired.
  Convention is that it goes near the top of the file, correct?
 
 Correct.

K, I'll work on everything but contrib tonight.

What do people think about contrib? Yea or nay for enforcing
$PostgreSQL$?
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-08 Thread Greg Stark


Jim C. Nasby [EMAIL PROTECTED] writes:

 On Wed, Mar 08, 2006 at 06:55:59PM -0500, Greg Stark wrote:
  
  Luke Lonergan [EMAIL PROTECTED] writes:
  
I am pretty sure from this thread that PostgreSQL is not doing #1, and I
have no idea if it is doing #2.
   
   Yep.  Even Knuth says that the tape goo is only interesting from a
   historical perspective and may not be relevant in an era of disk drives.
  
  As the size of the data grows larger the behaviour of hard drives looks more
  and more like tapes. The biggest factor controlling the speed of i/o
  operations is how many seeks are required to complete them. Effectively
  rewinds are still the problem it's just that the cost of rewinds becomes
  constant regardless of how long the tape is.
 
 But it will take a whole lot of those rewinds to equal the amount of
 time required by an additional pass through the data. I'll venture a
 guess that as long as you've got enough memory to still read chunks back
 in 8k blocks  that it won't be possible for a multi-pass sort to
 out-perform a one-pass sort. 

Well that's clearly a bit overoptimistic. If we believe the random page cost
of 4 then having more tapes than you have spindles would impose a penalty
equal to having four times as many passes. 

(And that's *with* the 8k block size. And with the kernel performing pre-fetch
already too.)

 For that to be of any use, wouldn't you need to use only as many tapes
 as spindles/2? Otherwise you're still trying to read and write from the
 same set of drives, which means you're probably doing a lot of seeking.
 Or do the tape algorithms re-write data as they read it?

Well, spindles-1. I was thinking as many tapes as you have spindles *in total*,
ie, including the output tape. You only have one output tape for each n-way
merge though.

-- 
greg


---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright

2006-03-08 Thread Marc G. Fournier


On Wed, 8 Mar 2006, Jim C. Nasby wrote:


On Wed, Mar 08, 2006 at 11:27:23PM -0300, Alvaro Herrera wrote:

Jim C. Nasby wrote:


+1. I'd also argue that it should be in contrib as well.

I can submit a patch for contrib or the entire tree if desired.
Convention is that it goes near the top of the file, correct?


Correct.


K, I'll work on everything but contrib tonight.

What do people think about contrib? Yea or nay for enforcing
$PostgreSQL$?


Any code in our source tree should conform to the same version tagging, so 
Yah ...



Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
Email: [EMAIL PROTECTED]   Yahoo!: yscrappy  ICQ: 7615664

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on include file, per author

2006-03-08 Thread Tom Lane

Jim C. Nasby [EMAIL PROTECTED] writes:
 Convention is that it goes near the top of the file, correct?

It goes at the bottom of the file header block comment ... if there
isn't a block comment at the start of the file, there should be.  We
have a thousand or so examples to follow ;-)

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] pg_freespacemap question

2006-03-08 Thread Mark Kirkwood


Tom Lane wrote:

Mark Kirkwood [EMAIL PROTECTED] writes:

Good points! I had not noticed this test case. Probably NULL is better 



Would setting it to 'BLCKSZ - (fixed index header stuff)' be better, 



No, I don't think so, because that will just make it harder to recognize
what's what (remember that BLCKSZ isn't really a constant, and the index
overhead is not the same for all AMs either).  The point here is that
for indexes the FSM tracks whole-page availability, not the amount of
free space within pages.  So I think NULL is a reasonable representation
of that.  Using NULL will make it easy to filter the results if you want
to see only heap-page data or only index-page data, whereas it will be
very hard to do that if the view adopts an ultimately-artificial
convention about the amount of available space on an index page.



Right - after suggesting it I realized that coding the different index
overhead for each possible AM would have been ... difficult :-). A patch
is attached to implement the NULL free bytes and other recommendations:

1/ Index free bytes set to NULL
2/ Comment added to the README briefly mentioning the index business
3/ Columns reordered more logically
4/ 'Blockid' column removed
5/ Free bytes column renamed to just 'bytes' instead of 'blockfreebytes'

Now 5/ was only hinted at, but seemed worth doing while I was there
(hopefully I haven't made it too terse now).

cheers

Mark


Index: pg_freespacemap.c
===
RCS file: /projects/cvsroot/pgsql/contrib/pg_freespacemap/pg_freespacemap.c,v
retrieving revision 1.2
diff -c -r1.2 pg_freespacemap.c
*** pg_freespacemap.c   14 Feb 2006 15:03:59 -  1.2
--- pg_freespacemap.c   9 Mar 2006 03:38:10 -
***
*** 12,18 
  #include storage/freespace.h
  #include utils/relcache.h
  
! #define   NUM_FREESPACE_PAGES_ELEM6
  
  #if defined(WIN32) || defined(__CYGWIN__)
  /* Need DLLIMPORT for some things that are not so marked in main headers */
--- 12,18 
  #include storage/freespace.h
  #include utils/relcache.h
  
! #define   NUM_FREESPACE_PAGES_ELEM5
  
  #if defined(WIN32) || defined(__CYGWIN__)
  /* Need DLLIMPORT for some things that are not so marked in main headers */
***
*** 29,40 
  typedef struct
  {
  
-   uint32  blockid;
-   uint32  relfilenode;
uint32  reltablespace;
uint32  reldatabase;
uint32  relblocknumber;
!   uint32  blockfreebytes;
  
  } FreeSpacePagesRec;
  
--- 29,40 
  typedef struct
  {
  
uint32  reltablespace;
uint32  reldatabase;
+   uint32  relfilenode;
uint32  relblocknumber;
!   uint32  bytes;
!   boolisindex;
  
  } FreeSpacePagesRec;
  
***
*** 91,107 
  
/* Construct a tuple to return. */
tupledesc = CreateTemplateTupleDesc(NUM_FREESPACE_PAGES_ELEM, 
false);
!   TupleDescInitEntry(tupledesc, (AttrNumber) 1, blockid,
!  INT4OID, -1, 0);
!   TupleDescInitEntry(tupledesc, (AttrNumber) 2, relfilenode,
   OIDOID, -1, 0);
!   TupleDescInitEntry(tupledesc, (AttrNumber) 3, reltablespace,
   OIDOID, -1, 0);
!   TupleDescInitEntry(tupledesc, (AttrNumber) 4, reldatabase,
   OIDOID, -1, 0);
!   TupleDescInitEntry(tupledesc, (AttrNumber) 5, relblocknumber,
   INT8OID, -1, 0);
!   TupleDescInitEntry(tupledesc, (AttrNumber) 6, blockfreebytes,
   INT4OID, -1, 0);
  
/* Generate attribute metadata needed later to produce tuples */
--- 91,105 
  
/* Construct a tuple to return. */
tupledesc = CreateTemplateTupleDesc(NUM_FREESPACE_PAGES_ELEM, 
false);
!   TupleDescInitEntry(tupledesc, (AttrNumber) 1, reltablespace,
   OIDOID, -1, 0);
!   TupleDescInitEntry(tupledesc, (AttrNumber) 2, reldatabase,
   OIDOID, -1, 0);
!   TupleDescInitEntry(tupledesc, (AttrNumber) 3, relfilenode,
   OIDOID, -1, 0);
!   TupleDescInitEntry(tupledesc, (AttrNumber) 4, relblocknumber,
   INT8OID, -1, 0);
!   TupleDescInitEntry(tupledesc,

Re: [HACKERS] Automatic free space map filling

2006-03-08 Thread ITAGAKI Takahiro

Zeugswetter Andreas DCP SD [EMAIL PROTECTED] wrote:

 Ok, we cannot reuse a dead tuple. Maybe we can reuse the space of a dead
 tuple by reducing the tuple to it's header info.

I was just working about your idea. In my work, bgwriter truncates
dead tuples and leaves only their headers. I'll send a concept patch
to PATCHES.

We must take super-exclusive-lock of pages before vacuum. Bgwriter tries to
take exclusive-lock before it writes a page, and does vacuum only if the lock
is super-exclusive. Otherwise, it gives up and writes normally. This is an
optimistic way, but I assume the possibility is high because the most pages
written by bgwriter are least recently used (LRU).

Also, I changed bgwriter_lru_maxpages to be adjusted automatically, because
backends won't do vacuum not to disturb main transaction processing,
so bgwriter should write most of the dirty pages.


There are much room for discussion on this idea.
Comments are welcome.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [HACKERS] problem with large maintenance_work_mem settings and

2006-03-08 Thread Stefan Kaltenbrunner

Tom Lane wrote:
 Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 
CREATE INDEX on a 1,8B row table (5 int columns - index created on the 
first row about 300M distinct values):
 
 
before: 11h 51min
after: 3h 11min(!)
 
 
 Cool.  Does it seem to be I/O bound now?  Would you be willing to do it
 over with oprofile turned on?

while it now does a fair amount of IO during the whole operation, it is
not yet IObound afaiks.

profile:

samples  %symbol name
103520432 47.9018  inlineApplySortFunction
33382738 15.4471  comparetup_index
25296438 11.7054  tuplesort_heap_siftup
10089122  4.6685  btint4cmp
8395676   3.8849  LogicalTapeRead
2873556   1.3297  tuplesort_heap_insert
2796545   1.2940  tuplesort_gettuple_common
2752345   1.2736  AllocSetFree
2233889   1.0337  IndexBuildHeapScan
2035265   0.9418  heapgettup
1571035   0.7270  LWLockAcquire
1498800   0.6935  readtup_index
1213587   0.5616  index_form_tuple
1097172   0.5077  AllocSetAlloc
1056964   0.4891  heap_fill_tuple
1041172   0.4818  btbuildCallback
9900050.4581  LWLockRelease
8976620.4154  slot_deform_tuple
8585270.3973  LogicalTapeWrite
8068490.3734  PageAddItem
7641360.3536  LockBuffer

trace_sort:

LOG:  begin index sort: unique = f, workMem = 2048000, randomAccess = f
LOG:  switching to external sort with 7315 tapes: CPU 4.07s/13.70u sec
elapsed 17.79 sec
LOG:  finished writing run 1 to tape 0: CPU 240.07s/3926.66u sec elapsed
4498.49 sec
LOG:  performsort starting: CPU 535.66s/8138.92u sec elapsed 9435.11 sec
LOG:  finished writing final run 2 to tape 1: CPU 538.54s/8242.23u sec
elapsed 9541.55 sec
LOG:  performsort done (except final merge): CPU 539.39s/8254.83u sec
elapsed 9559.75 sec
LOG:  external sort ended, 4398827 disk blocks used: CPU
768.38s/10027.39u sec elapsed 11884.63 sec


Stefan

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

72 matches

Mail list logo