[PATCHES] sound extract century/millennium date_part

2004-04-05 Thread Fabien COELHO

Dear patchers,

Please find a small patch to fix the brain-damaged century and
millennium date part implementation in PostgreSQL, both in the code and
the documentation, so that it conforms to the official definition. If you
do not agree with the official definition, please send your complaint to
[EMAIL PROTECTED]. I'm not responsible for it ;-)

With the previous version, the centuries and millennia had the wrong
number and started in the wrong year. Moreover, century number 0, which
does not exist in reality, lasted 200 years, and millennium number 0
lasted 2000 years.

If you want PostgreSQL to have its own definition of century and
millennium that does not conform to the one society uses, just give
them another name. I would suggest pgCENTURY and pgMILLENNIUM ;-)

IMO, if someone uses these fields, it means that PostgreSQL is being used
for historical data, so it makes sense to follow the historical
definition. Also, if I just want the year divided by 100 or 1000, I can
compute that quite easily myself.
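
To make the intended mapping concrete, here is a tiny standalone sketch
(mine, for illustration only; it is not the patch itself) of the
historical numbering:

#include <stdio.h>

/* Historical century numbering: century c > 0 covers years
 * (c-1)*100+1 .. c*100; century c < 0 covers c*100 .. (c+1)*100-1;
 * there is no century number 0. */
static int
century_of(int year)
{
    if (year > 0)
        return (year + 99) / 100;
    else
        return -((99 - year) / 100);
}

int
main(void)
{
    /* 2000 is still the 20th century; 2001 opens the 21st. */
    printf("%d %d %d\n", century_of(2000), century_of(2001), century_of(-50));
    /* prints: 20 21 -1 */
    return 0;
}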

Have a nice day,

-- 
Fabien Coelho - [EMAIL PROTECTED]

*** ./doc/src/sgml/func.sgml.orig	Wed Mar 31 08:58:31 2004
--- ./doc/src/sgml/func.sgml	Sun Apr  4 10:23:00 2004
***************
*** 4948,4965 ****
      <term><literal>century</literal></term>
      <listitem>
       <para>
!       The year field divided by 100
       </para>
  
  <screen>
! SELECT EXTRACT(CENTURY FROM TIMESTAMP '2001-02-16 20:38:40');
  <lineannotation>Result: </lineannotation><computeroutput>20</computeroutput>
  </screen>
  
       <para>
!       Note that the result for the century field is simply the year field
!       divided by 100, and not the conventional definition which puts most
!       years in the 1900's in the twentieth century.
       </para>
      </listitem>
     </varlistentry>
--- 4948,4978 ----
      <term><literal>century</literal></term>
      <listitem>
       <para>
!       The historical definition of a century.
       </para>
  
  <screen>
! SELECT EXTRACT(CENTURY FROM TIMESTAMP '2000-12-16 12:21:13');
  <lineannotation>Result: </lineannotation><computeroutput>20</computeroutput>
+ SELECT EXTRACT(CENTURY FROM TIMESTAMP '2001-02-16 20:38:40');
+ <lineannotation>Result: </lineannotation><computeroutput>21</computeroutput>
  </screen>
  
       <para>
!       An historical century is a period of 100 years.
!       The first century starts at 0001-01-01 00:00:00 AD, although
!       they did not know it at the time. This definition applies to all
!       Gregorian calendar countries. There is no century number 0,
!       you go from -1 to 1.
! 
!       If you disagree with this, please write your complaint to:
!       Pope, Cathedral Saint-Peter of Roma, Vatican.
!      </para>
! 
!      <para>
!       Compatibility: if you want the previous postgres version of century,
!       just divide the year by 100. Note that with this definition,
!       century number 0 lasts 200 years.
       </para>
      </listitem>
     </varlistentry>
***************
*** 5083,5100 ****
      <term><literal>millennium</literal></term>
      <listitem>
       <para>
!       The year field divided by 1000
       </para>
  
  <screen>
  SELECT EXTRACT(MILLENNIUM FROM TIMESTAMP '2001-02-16 20:38:40');
! <lineannotation>Result: </lineannotation><computeroutput>2</computeroutput>
  </screen>
  
       <para>
!       Note that the result for the millennium field is simply the year field
!       divided by 1000, and not the conventional definition which puts
!       years in the 1900's in the second millennium.
       </para>
      </listitem>
     </varlistentry>
--- 5096,5112 ----
      <term><literal>millennium</literal></term>
      <listitem>
       <para>
!       The conventional historical millennium.
       </para>
  
  <screen>
  SELECT EXTRACT(MILLENNIUM FROM TIMESTAMP '2001-02-16 20:38:40');
! <lineannotation>Result: </lineannotation><computeroutput>3</computeroutput>
  </screen>
  
       <para>
!       Years in the 1900's are in the second millennium.
!       The third millennium starts January 1, 2001.
       </para>
      </listitem>
     </varlistentry>
*** ./src/backend/utils/adt/timestamp.c.orig	Wed Mar 31 08:58:40 2004
--- ./src/backend/utils/adt/timestamp.c	Sun Apr  4 10:45:59 2004
***************
*** 3273,3283 ****
  			break;
  
  		case DTK_CENTURY:
! 			result = (tm->tm_year / 100);
  			break;
  
  		case DTK_MILLENNIUM:
! 			result = (tm->tm_year / 1000);
  			break;
  
  		case DTK_JULIAN:
--- 3273,3295 ----
  			break;
  
  		case DTK_CENTURY:
! 			/* centuries AD, c > 0: year in [ (c-1)*100+1 : c*100       ]
! 			 * centuries BC, c < 0: year in [ c*100       : (c+1)*100-1 ]
! 			 * there is no number 0 century.
! 			 */

Re: [PATCHES] Translation Updates for 7.3: pg_dump-ru.po.gz;psql-ru.po.gz

2004-04-05 Thread Peter Eisentraut
On Monday, March 29, 2004 03:35, Serguei Mokhov wrote:
 This should complete translation updates for 7.3.
 Please install.

Done.



Re: [PATCHES] MSFT compiler fixes + misc

2004-04-05 Thread Magnus Hagander
  A thought about this - how about converting pgpiperead() and
  pgpipewrite() into functions instead of macros (on win32 - still 
  redefining them on != win32), mimicking the behaviour of read() and 
  write()?
 
 And #def'ing them to be read + write under win32? Don't want 
 to change every instance of read/write.

No.
#def'ing them to be read + write under non-win32. As is done today.


It would change (in port.h):

#ifndef WIN32
#define pgpipe(a)		pipe(a)
#define piperead(a,b,c)		read(a,b,c)
#define pipewrite(a,b,c)	write(a,b,c)
#else
extern int pgpipe(int handles[2]);
#define piperead(a,b,c)		recv(a,b,c,0)
#define pipewrite(a,b,c)	send(a,b,c,0)
#endif

to

#ifndef WIN32
#define pgpipe(a)		pipe(a)
#define piperead(a,b,c)		read(a,b,c)
#define pipewrite(a,b,c)	write(a,b,c)
#else
extern int pgpipe(int handles[2]);
extern int piperead(int s, char *buf, int len);
extern int pipewrite(int s, char *buf, int len);
#endif

And then put piperead() and pipewrite() along with pgpipe() in the C
file.
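
To be concrete, here is roughly what I picture the win32 side looking
like (an untested sketch; mapping WSAECONNRESET to a zero return is my
assumption of what mimicking read() means, not committed code):

#ifdef WIN32
#include <winsock2.h>

/* Sketch: make the pipe substitute behave like Unix read()/write(),
 * so callers need no win32-specific EOF handling. */
int
piperead(int s, char *buf, int len)
{
    int ret = recv(s, buf, len, 0);

    /* A closed peer surfaces as WSAECONNRESET on a socket; a Unix
     * pipe would just return 0, so fold that case into EOF. */
    if (ret < 0 && WSAGetLastError() == WSAECONNRESET)
        ret = 0;
    return ret;
}

int
pipewrite(int s, char *buf, int len)
{
    return send(s, buf, len, 0);
}
#endif

With something like that, callers shouldn't need any win32-specific
EOF checks.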


  Then we could do away with the #ifdefs at the points where it's
  used, and just expect the normal Unix behaviour?
 
 I don't see that we do have any #ifdefs where it's used.

pgstat.c, inside the main loop.
#ifdef WIN32
if (WSAGetLastError() == WSAECONNRESET) /* EOF on the pipe! (win32
socket based implementation) */
{
pipeEOF = true;
 break;
}
#endif

That's where I noticed it. That might be the only place, though - I
haven't checked further. If it's the only case, then it's not such a big
deal.
But it might come back to bite us in the future - since it currently
does not actually implement the expected behaviour of a pipe.


//Magnus


Re: [PATCHES] Translation Updates for 7.3: libpq-ru.po.gz;pg_controldata-ru.po.gz;pg_resetxlog-ru.po.gz;postgres-ru.po.gz

2004-04-05 Thread Peter Eisentraut
On Saturday, March 27, 2004 22:40, Serguei Mokhov wrote:
 Please include these updates for 7.3

Done.




Re: [PATCHES] Translation Updates for 7.5: initdb-ru.po.gz;libpq-ru.po.gz;psql-ru.po.gz

2004-04-05 Thread Peter Eisentraut
On Saturday, March 27, 2004 22:18, Serguei Mokhov wrote:
 These are for the HEAD:

 - updated libpq and psql for 7.5
 - initial cut on initdb

Installed.




Re: [PATCHES] hint infrastructure setup (v3)

2004-04-05 Thread Fabien COELHO

Dear Tom,

First, thanks for the discussion about my hint infrastructure patch
submission. Whatever the opinion, it helps.


 We'd have to write our own version of bison's verbose-error code anyway,
 because the canned code doesn't support localization --- it uses
 hardwired strings for "expecting" and so on.  But it looks possibly
 doable to me.

I also played a little bit with the parser over the weekend. I found the
same issues that you noted in your mail, tried more options, and found
other problems. I'm sending here the full mail I wrote.

Although I agree that it is doable, I have stronger reservations than you
do. Also, I do not find changing gram.c a lot an appealing solution.

The problems I found are:


(1) The raw information is not available.

The automaton stack keeps states, which are not directly linked to rules
and terminals. The terminals are not available; they must be kept
separately if you want them. This can be done in yylex().

The internal state, stack, token... are local variables within yyparse().
As a result, they are not accessible from yyerror. I haven't found any
available hook, so you have to hack gram.c to get this information. It
is a one-line hack, but it modifies the generated code. Or you have to
produce a new hacked gram.c, which is what you suggest.


(2) The follow set is hard to compute.

The information about the syntax is there, obviously! However, from a
given automaton state, the full set of allowed tokens is not directly
available. Only some of them are, so the result is:

DROP ? ;
ERROR:  syntax error at or near ? at character 6
HINT:  hint status is state=422 n=796 char=575 type=320=Op stack= 0 16 422
   tokens= 88:DROP 320:Op
   expected= 150:LANGUAGE?

Other tokens may be found, but you have either:

- to *simulate* the automaton forward through reductions to check whether
  the token is finally accepted. Not really beautiful, but that can be
  done. It is just the bison code without the reduction production rules;-)

- to move backwards before doing the above: if some reductions were
  performed because of the submitted token and finally resulted in the
  error, the state that led to the error may not be the highest available
  one, so other allowed tokens may also be missed. We would need the last
  state before any reduction.
  Another small hack in the generated gram.c would be needed to get it.

However,


(3) the computed follow set is often useless, as you noted.

The reason for that is that keywords are accepted in place of identifiers.
The use of rules: reserved_keyword, func_name_keyword, col_name_keyword,
unreserved_keyword as possible column/function/... identifiers is IMHO the
main reason for the size of the generated automaton.

If you drop that, everything becomes much simpler, because it greatly
reduces the number of arcs in the automaton. In particular, anything
which is just an identifier should not be given a name (say, BOOLEAN_P or
SMALLINT should not be tokens, but could rather be recognized as a special
case of identifier from within the parser, maybe with some simple
post-filtering).

As you noted, for things like SHOW 123, the follow set basically
includes all keywords although you can have SHOW ALL or SHOW name.

So, as you suggested, you can either say ident as a simplification, but
you miss ALL which is meaningful, or you say all keywords, which is
useless.

An alternative is to know that you're within a SHOW something, that is,
you somehow reparse the code from within the syntax error :-( It is
necessary if you want to say something more interesting than ident, say
option name. You may also get this information by simulating the
automaton forward, or by noticing that some of the states currently on
the stack can only lead to the reduction of a ShowStmt...

Well, on the positive side, it is sometimes right, say:

REVOKE ? ;
ERROR:  syntax error, unexpected Op at or near ? at character 8
HINT:  hint status is state=481 n=1297 char=575 type=320=Op
  stack= 0 31 481
  tokens= 234:REVOKE 320:Op
  expected= 10:ALL 59:CREATE 80:DELETE_P 98:EXECUTE 136:INSERT 224:REFERENCES
239:RULE 244:SELECT 268:TEMP 270:TEMPORARY 279:TRIGGER 292:UPDATE
293:USAGE?


(4) The available automaton data are inverted with respect to what would
be needed... It would be nice to know whether we're heading for a create
statement or something like that, but the information is hidden down the
automaton. Basically, the automaton would need to be inverted to compute
this information. We would need the initial information in gram.y, which
was inverted to build the automaton. So the code somehow needs to undo
what bison compilation has done.


(5) Anything that can be done would be hardwired to one version of bison.
There are a lot of assumptions in the code and data structures, and any
newer/older version with some different internal representation would
basically break any code that relies on that. So postgres would not be
bison-portable :-( I don't think 

Re: [PATCHES] hint infrastructure setup (v3)

2004-04-05 Thread Fabien COELHO

 As another teaching aid idea, is there something that would auto-format
 queries so they are more readable?

Not that I know of. But I guess it must exist in some interface,
maybe with Oracle or DB2.

My emacs does not seem to do that. It just puts colors on keywords.

If I had to put it somewhere, I would try to enhance emacs sql-mode.
But I don't much like lisp programming.

-- 
Fabien Coelho - [EMAIL PROTECTED]



Re: [PATCHES] hint infrastructure setup (v3)

2004-04-05 Thread Bruce Momjian

As another teaching aid idea, is there something that would auto-format
queries so they are more readable?

---

Fabien COELHO wrote:
 [...]
-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [PATCHES] hint infrastructure setup (v3)

2004-04-05 Thread Tom Lane
Fabien COELHO [EMAIL PROTECTED] writes:
 If you can think of a reasonable way to stop treating them as separate
 tokens inside the grammar without altering the user-visible behavior,
 I'm certainly interested.

 I join a small proof-of-concept patch to drop some tokens out of the
 parser.

I believe these were treated this way *specifically* because of the
keyword-is-not-an-identifier issue.  SQL99 calls out most of these
as being keywords:

 SQL defines predefined data types named by the following key
 words: CHARACTER, CHARACTER VARYING, CHARACTER LARGE OBJECT,
 BINARY LARGE OBJECT, BIT, BIT VARYING, NUMERIC, DECIMAL, INTEGER,
 SMALLINT, FLOAT, REAL, DOUBLE PRECISION, BOOLEAN, DATE, TIME,
 TIMESTAMP, and INTERVAL. These names are used in the type

and if we don't treat them as keywords then we will have a couple of
problems.  One is case-conversion issues in locales where the standard
downcasing is not an extension of ASCII (Turkish is the only one I know
of offhand).  Another is that depending on where you put the renaming
that this patch removes without replacing :-(, it would be possible for
the renaming transformation to get applied to user-defined types with
similar names, or for user-defined types to unexpectedly shadow system
definitions.  The former would be surprising and the latter would
violate the spec.  Check the archives; I'm sure this was discussed in
the 7.3 development cycle and we concluded that treating these names
as keywords was the only practical solution.

regards, tom lane



Re: [PATCHES] hint infrastructure setup (v3)

2004-04-05 Thread Andrew Dunstan
Tom Lane wrote:

> Fabien COELHO <[EMAIL PROTECTED]> writes:
>> (b) write a new recursive descendant parser, and drop gram.y

er, that's recursive descent :-)

> Been there, done that, not impressed with the idea.  There's a reason
> why people invented parser generators...

Well, unless you are a serious masochist, writing a parser of the LR
family (including LALR(1), such as yacc/bison produce) by hand is very
difficult and error-prone. That is not the case with LL-type parsers.
Also, there are parser generators for this family, both table-driven and
recursive descent. The table-driven variety especially can have quite
well-tuned error reporting and recovery. (Among the reasons I know this
is that I actually wrote such a beast around 13 years ago.)

That is not to say that I think we should move from the present setup.
Familiarity of the developer community with the tool used suggests we
should not, quite apart from any other reasons. Changing to, say, an RD
parser would be a massive and probably error-prone change, and the
benefit just doesn't seem worth it.

In fact, considering this thread I'm not sure any of the suggested steps
will achieve Fabien's goal. ISTM that a smarter "training wheels"
frontend, perhaps with some modest backend improvements, is likely to
have better results. (Oh, you found an error near there - now what do I
suggest belongs there?)

cheers

andrew


Re: [PATCHES] hint infrastructure setup (v3)

2004-04-05 Thread Fabien COELHO

Dear Tom,

 No, just redefine yyerror as a macro that passes additional parameters.

Ok!  That's just the kind of hack-hook I was looking for!!
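
For the record, I understand the trick like this (a sketch only;
yyerror_with_state() is a hypothetical helper of mine, and yystate and
yychar are bison's internals, so this is tied to the generated skeleton):

/* In the grammar prologue: because the macro is expanded inside
 * yyparse(), it can see variables such as yystate and yychar that a
 * plain yyerror() function cannot. */
#define yyerror(msg)  yyerror_with_state((msg), yystate, yychar)

extern void yyerror_with_state(const char *msg, int state, int lookahead);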


  The terminals are not available; they must be kept separately if you
  want them. This can be done in yylex().

 We don't need them. Any already-shifted tokens are to the left of where
 the error is, no?

Yes and no. I agree that you don't need them for computing the follow set.
However it would help a lot to generate nicer hints. I'm looking to help
the user, and the follow set is just part of the help.

Think of the typical SELECT foo, bla, FROM xxx (it looks stupid on a
one-line query, but not so on a very large query), which is much easier
to detect and hint about because FROM comes just after a comma.

The rule part is also a real issue. There is no easy way to translate
states into rules, to know whether we're somewhere in a ShowStmt or
AlterTableStmt...

If you're after a comma in state 1232, should you just say IDENT... I'd
rather say user name or table name if I can... Otherwise the hint
stays at a low parser level. Good hints require knowing the current
context, and it is not easy to get that deep in the automaton.


 Yeah, I had come to the same conclusion --- state moves made without
 consuming input would need to be backed out if we wanted to report the
 complete follow set.  I haven't yet looked to see how to do that.

Well, following your previous suggestion, one can redefine the YYLEX
macro to save the current state each time a token is required.
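
Something like this minimal sketch (last_state_before_shift is a
hypothetical variable of mine, not existing code, and again it leans on
yyparse()'s internals):

/* In the grammar prologue: YYLEX is what the generated automaton
 * invokes to fetch each token, and the macro is expanded inside
 * yyparse(), so it can snapshot the automaton state first. */
static int last_state_before_shift;

#define YYLEX  (last_state_before_shift = yystate, yylex())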


  As you noted, for things like SHOW 123, the follow set basically
  includes all keywords although you can have SHOW ALL or SHOW name.
  So, as you suggested, you can either say ident as a simplification, but

 You're ignoring the distinction between classes of keywords.  I would
 not recommend treating reserved_keywords as a subclass of ident.

Ok, I agree that it can help a little bit in this very case, because
ALL is reserved. I did not make this distinction when I played with bison.


  (5) anything that can be done would be hardwired to one version of bison.

 I think this argument is completely without merit.

Possible;-)

However I don't like writing hacks that depend on other people's future
behavior.


  (b) write a new recursive descendant parser, and drop gram.y

 Been there, done that, not impressed with the idea.  There's a reason
 why people invented parser generators...

Sure. It was just a list of options;-)


  A side effect of my inspection is that the automaton generated by bison
  could be simplified if a different tradeoff between the lexer, the parser
  and the post-processing were chosen. Namely, all tokens that are
  just identifiers should be dropped and processed differently.

 We are not going to reserve the keywords that are presently unreserved.

Reserving keywords would help, but I do not think it would be accepted,
because it is not the SQL philosophy. SQL is about 30 years old, as old
as the yacc ideas... but they did not care at the time;-) When you look
at the syntax, it was designed with a recursive parser in mind.

Part of my comment was to explain why the generated automaton is large.
SQL is a small language, but it has a big automaton in postgresql, and
this is IMHO the explanation.


 If you can think of a reasonable way to stop treating them as separate
 tokens inside the grammar without altering the user-visible behavior,
 I'm certainly interested.

I was thinking about type names especially, and maybe others.

I attach a small proof-of-concept patch to drop some tokens out of the
parser. As a result, 6 tokens, 6 rules and 9 states are removed,
and the automaton table is reduced by 438 entries (1.5%). Not too bad;-)
I think others could be dealt with similarly, or with more work.


Thanks for the discussion,

-- 
Fabien.

*** ./src/backend/parser/gram.y.orig	Mon Apr  5 12:06:42 2004
--- ./src/backend/parser/gram.y	Mon Apr  5 18:21:21 2004
***************
*** 336,343 ****
  	AGGREGATE ALL ALSO ALTER ANALYSE ANALYZE AND ANY ARRAY AS ASC
  	ASSERTION ASSIGNMENT AT AUTHORIZATION
  
! 	BACKWARD BEFORE BEGIN_P BETWEEN BIGINT BINARY BIT
! 	BOOLEAN_P BOTH BY
  
  	CACHE CALLED CASCADE CASE CAST CHAIN CHAR_P
  	CHARACTER CHARACTERISTICS CHECK CHECKPOINT CLASS CLOSE
--- 336,343 ----
  	AGGREGATE ALL ALSO ALTER ANALYSE ANALYZE AND ANY ARRAY AS ASC
  	ASSERTION ASSIGNMENT AT AUTHORIZATION
  
! 	BACKWARD BEFORE BEGIN_P BETWEEN BINARY BIT
! 	BOTH BY
  
  	CACHE CALLED CASCADE CASE CAST CHAIN CHAR_P
  	CHARACTER CHARACTERISTICS CHECK CHECKPOINT CLASS CLOSE
***************
*** 362,368 ****
  
  	ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IN_P INCLUDING INCREMENT
  	INDEX INHERITS INITIALLY INNER_P INOUT INPUT_P
! 	INSENSITIVE INSERT INSTEAD INT_P INTEGER INTERSECT
  	INTERVAL INTO INVOKER IS ISNULL ISOLATION
  
  	JOIN
--- 362,368 ----
  
  	ILIKE IMMEDIATE IMMUTABLE 
Re: [PATCHES] O(samplesize) tuple sampling, proof of concept

2004-04-05 Thread Tom Lane
Manfred Koizar [EMAIL PROTECTED] writes:
 The old sampling method is still there.  I have added a GUC variable
 sampling_method to make testing and benchmarking easier.

I wouldn't bother with a GUC variable for the production patch.

 Once a block is selected for inspection, all tuples of this
 block are accessed to get a good estimation of the live : dead row
 ratio.

Why are you bothering to compute the live:dead ratio?  The statistics
model doesn't have any place for a dead-tuples count, so there's no need
to think about anything but live tuples.  (This is a historical
artifact, perhaps, arising from the fact that originally analysis
happened after VACUUM FULL and so the number of dead tuples was
guaranteed to be zero at the time.  But still, I'm not sure how we'd
factor a dead-tuples count into the estimates if we had it.)

 Because I was afraid that method 1 might be too expensive in terms of
 CPU cycles, I implemented a small variation that skips tuples without
 checking them for visibility; this is sampling_method 2.

And?  Does it matter?  (I'd guess not in practice, as checking a tuple
for visibility is cheap if someone's already marked it good.)

 At 1000 pages there is only a difference of approximately 1%,
 at 3000 pages the difference has its maximum of ca. 15%.  We can sell
 this by pointing to the better quality of the statistics.

If that's as bad as it gets I think we are OK.  You should redo the test
with larger sample size though (try stats target = 100) to see if the
answer changes.

 Are there any spots in the documentation that
 should be updated?

AFAIR there is noplace that specifically needs to be changed.

 diff -ruN ../base/src/backend/commands/analyze.c src/backend/commands/analyze.c

I find -u diffs close to unreadable for reviewing purposes.  Please
submit diffs in -c format in future.

 + /*
 +  * If we ran out of tuples then we're done.  No sort is needed,
 +  * since they're already in order.
 +  *
 +  * Otherwise we need to sort the collected tuples by position
 +  * (itempointer).  I don't care for corner cases where the tuples
 +  * are already sorted.
 +  */
 + if (numrows == targrows)
 + qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);

AFAICS the rows will *always* be sorted already, and so this qsort is an
utter waste of cycles.  No?

regards, tom lane



Re: [PATCHES] O(samplesize) tuple sampling, proof of concept

2004-04-05 Thread Tom Lane
Manfred Koizar [EMAIL PROTECTED] writes:
 Because I was afraid that method 1 might be too expensive in terms of
 CPU cycles, I implemented a small variation that skips tuples without
 checking them for visibility; this is sampling_method 2.
 
 And?  Does it matter?

 There's a clearly visible difference for mid-size relations.  I'm not
 sure whether this can be attributed to visibility bit updating or other
 noise-contributing factors.

I think it would have to be visibility-bit updates.  Can you try it with
a pre-vacuumed relation, so that there is no slowdown for updates?

The update cost will have to be paid sooner or later by some xact, and
on the whole it's better that it be done by a maintenance xact than by
a foreground client; so even if there is a noticeable slowdown here,
that doesn't seem like a reason not to inspect all the tuples.

 I find -u diffs close to unreadable for reviewing purposes.  Please
 submit diffs in -c format in future.

 De gustibus non est disputandum :-)
 Fortunately this patch wouldn't look much different.  There is just a
 bunch of + lines.

Yeah, so I managed to read it anyway ;-).  It's the ones with
intricately intermixed - and + that I find difficult to follow.

 AFAICS the rows will *always* be sorted already, and so this qsort is an
 utter waste of cycles.  No?

 I don't think so.  We get the *blocks* in the correct order.  But tuples
 are still sampled by the Vitter method which starts to replace random
 tuples after the pool is filled.

Duh, you're right --- I was thinking that the old code doesn't need a
qsort, but it does.  This seems a tad annoying considering that we know
the tuples were inserted into the pool in increasing order.  I wonder if
it's worth using a more complex data structure to keep track of both
orders at once?  I suppose that could easily wind up costing more than
the qsort though ...

regards, tom lane



Re: [PATCHES] O(samplesize) tuple sampling, proof of concept

2004-04-05 Thread Manfred Koizar
On Mon, 05 Apr 2004 15:37:07 -0400, Tom Lane [EMAIL PROTECTED] wrote:
I wouldn't bother with a GUC variable for the production patch.

Among other things the GUC variable will be thrown out for the final
version.

 Once a block is selected for inspection, all tuples of this
 block are accessed to get a good estimation of the live : dead row
 ratio.

Why are you bothering to compute the live:dead ratio?

That was basically bad wording.  It should have been ... to get a good
estimation of live rows per page.  Counting dead rows turned out to be
trivial, so I just did it and included the number in the debug messages.
Then it happened to be useful for method 2.

 Because I was afraid that method 1 might be too expensive in terms of
 CPU cycles, I implemented a small variation that skips tuples without
 checking them for visibility; this is sampling_method 2.

And?  Does it matter?

There's a clearly visible difference for mid-size relations.  I'm not
sure whether this can be attributed to visibility bit updating or other
noise-contributing factors.

Method 2 gives a row count estimation error between 10 and 17% where
method 1 is less than 1% off.  (My updates generated dead tuples at very
evenly distributed locations by using conditions like WHERE mod(twenty,
7) = 0).

If that's as bad as it gets I think we are OK.  You should redo the test
with larger sample size though (try stats target = 100) to see if the
answer changes.

Will do.

I find -u diffs close to unreadable for reviewing purposes.  Please
submit diffs in -c format in future.

De gustibus non est disputandum :-)
Fortunately this patch wouldn't look much different.  There is just a
bunch of + lines.

AFAICS the rows will *always* be sorted already, and so this qsort is an
utter waste of cycles.  No?

I don't think so.  We get the *blocks* in the correct order.  But tuples
are still sampled by the Vitter method which starts to replace random
tuples after the pool is filled.
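
To illustrate why (a plain Algorithm R sketch of mine; the real code
uses Vitter's skip-ahead variant and samples heap tuples, not ints):

#include <stdlib.h>

/* Fill pool[0..targrows-1] with a uniform random sample of the nrows
 * input values.  Once the pool is full, each later value overwrites a
 * randomly chosen slot, so the pool's order no longer matches input
 * order -- hence the final qsort by position. */
static void
reservoir_sample(int *pool, int targrows, const int *input, int nrows)
{
    int i;

    for (i = 0; i < nrows; i++)
    {
        if (i < targrows)
            pool[i] = input[i];         /* filling phase: in input order */
        else
        {
            int j = rand() % (i + 1);   /* uniform in [0, i] */

            if (j < targrows)
                pool[j] = input[i];     /* replacement scrambles order */
        }
    }
}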

BTW, thanks for the paper!

Servus
 Manfred
