[PATCHES] sound extract century/millennium date_part
Dear patchers,

Please find a small patch to fix the brain-damaged century and millennium date part implementation in postgresql, both in the code and the documentation, so that it conforms to the official definition. If you do not agree with the official definition, please send your complaint to [EMAIL PROTECTED]. I'm not responsible for them;-)

With the previous version, centuries and millennia had the wrong number and started in the wrong year. Moreover, century number 0, which does not exist in reality, lasted 200 years. Also, millennium number 0 lasted 2000 years.

If you want postgresql to have its own definition of century and millennium that does not conform to the one used by society, just give them another name. I would suggest pgCENTURY and pgMILLENNIUM;-)

IMO, if someone uses these options, it means that postgresql is used for historical data, so it makes sense to have an historical definition. Also, if I just want to divide the year by 100 or 1000, I can do that quite easily.

Have a nice day,

-- Fabien Coelho - [EMAIL PROTECTED]

*** ./doc/src/sgml/func.sgml.orig	Wed Mar 31 08:58:31 2004
--- ./doc/src/sgml/func.sgml	Sun Apr  4 10:23:00 2004
***************
*** 4948,4965 ****
        <term><literal>century</literal></term>
        <listitem>
         <para>
!         The year field divided by 100
         </para>

  <screen>
! SELECT EXTRACT(CENTURY FROM TIMESTAMP '2001-02-16 20:38:40');
  <lineannotation>Result: </lineannotation><computeroutput>20</computeroutput>
  </screen>

         <para>
!         Note that the result for the century field is simply the year field
!         divided by 100, and not the conventional definition which puts most
!         years in the 1900's in the twentieth century.
         </para>
        </listitem>
       </varlistentry>
--- 4948,4978 ----
        <term><literal>century</literal></term>
        <listitem>
         <para>
!         The historical definition of a century.
         </para>

  <screen>
! SELECT EXTRACT(CENTURY FROM TIMESTAMP '2000-12-16 12:21:13');
  <lineannotation>Result: </lineannotation><computeroutput>20</computeroutput>
+ SELECT EXTRACT(CENTURY FROM TIMESTAMP '2001-02-16 20:38:40');
+ <lineannotation>Result: </lineannotation><computeroutput>21</computeroutput>
  </screen>

         <para>
!         An historical century is a period of 100 years.
!         The first century starts at 0001-01-01 00:00:00 AD, although
!         they did not know at the time. This definition applies to all
!         Gregorian calendar countries. There is no number 0 century,
!         you go from -1 to 1.
!
!         If you disagree with this, please write your complaint to:
!         Pope, Cathedral Saint-Peter of Roma, Vatican.
!        </para>
!
!       <para>
!         Compatibility: if you want the previous postgres version of century,
!         just divide the year by 100. Note that with this definition,
!         century number 0 lasts 200 years.
         </para>
        </listitem>
       </varlistentry>
***************
*** 5083,5100 ****
        <term><literal>millennium</literal></term>
        <listitem>
         <para>
!         The year field divided by 1000
         </para>

  <screen>
  SELECT EXTRACT(MILLENNIUM FROM TIMESTAMP '2001-02-16 20:38:40');
! <lineannotation>Result: </lineannotation><computeroutput>2</computeroutput>
  </screen>

         <para>
!         Note that the result for the millennium field is simply the year field
!         divided by 1000, and not the conventional definition which puts
!         years in the 1900's in the second millennium.
         </para>
        </listitem>
       </varlistentry>
--- 5096,5112 ----
        <term><literal>millennium</literal></term>
        <listitem>
         <para>
!         The conventional historical millennium.
         </para>

  <screen>
  SELECT EXTRACT(MILLENNIUM FROM TIMESTAMP '2001-02-16 20:38:40');
! <lineannotation>Result: </lineannotation><computeroutput>3</computeroutput>
  </screen>

         <para>
!         Years in the 1900's are in the second millennium.
!         The third millennium starts January 1, 2001.
         </para>
        </listitem>
       </varlistentry>

*** ./src/backend/utils/adt/timestamp.c.orig	Wed Mar 31 08:58:40 2004
--- ./src/backend/utils/adt/timestamp.c	Sun Apr  4 10:45:59 2004
***************
*** 3273,3283 ****
  			break;

  		case DTK_CENTURY:
! 			result = (tm->tm_year / 100);
  			break;

  		case DTK_MILLENNIUM:
! 			result = (tm->tm_year / 1000);
  			break;

  		case DTK_JULIAN:
--- 3273,3295 ----
  			break;

  		case DTK_CENTURY:
! 			/* centuries AD, c > 0: year in [ (c-1)*100+1 : c*100 ]
! 			 * centuries BC, c < 0: year in [ c*100 : (c+1)*100-1 ]
! 			 * there is no number 0 century.
! 			 */
Re: [PATCHES] Translation Updates for 7.3: pg_dump-ru.po.gz;psql-ru.po.gz
On Monday, 29 March 2004 03:35, Serguei Mokhov wrote:
> This should complete translation updates for 7.3. Please install.

Done.

---(end of broadcast)---
TIP 6: Have you searched our list archives?
http://archives.postgresql.org
Re: [PATCHES] MSFT compiler fixes + misc
> A thought about this - how about converting pgpiperead() and pgpipewrite()
> into functions instead of macros (on win32 - still redefining them on
> != win32), mimicking the behaviour of read() and write()?
>
> And #def'ing them to be read + write under win32? Don't want to change
> every instance of read/write.

No. #def'ing them to be read + write under non-win32. As is done today. It would change (in port.h):

#ifndef WIN32
#define pgpipe(a)		pipe(a)
#define piperead(a,b,c)		read(a,b,c)
#define pipewrite(a,b,c)	write(a,b,c)
#else
extern int pgpipe(int handles[2]);
#define piperead(a,b,c)		recv(a,b,c,0)
#define pipewrite(a,b,c)	send(a,b,c,0)
#endif

to

#ifndef WIN32
#define pgpipe(a)		pipe(a)
#define piperead(a,b,c)		read(a,b,c)
#define pipewrite(a,b,c)	write(a,b,c)
#else
extern int pgpipe(int handles[2]);
extern int piperead(a,b,c);
extern int pipewrite(a,b,c);
#endif

And then put piperead() and pipewrite() along with pgpipe() in the C file. (Naturally, arguments with the correct syntax, but you get the idea.) Then we could do away with the #ifdefs at the points where it's used, and just expect the normal Unix behaviour?

> I don't see that we do have any #ifdefs where it's used.

pgstat.c, inside the main loop:

#ifdef WIN32
	if (WSAGetLastError() == WSAECONNRESET)
		/* EOF on the pipe! (win32 socket based implementation) */
	{
		pipeEOF = true;
		break;
	}
#endif

There's where I noticed it. That might be the only place, though - haven't checked further. If it's the only case, then it's not such a big deal. But it might come back to bite us in the future - since it currently does not actually implement the expected behaviour of a pipe.

//Magnus
Re: [PATCHES] Translation Updates for 7.3: libpq-ru.po.gz;pg_controldata-ru.po.gz;pg_resetxlog-ru.po.gz;postgres-ru.po.gz
On Saturday, 27 March 2004 22:40, Serguei Mokhov wrote:
> Please include these updates for 7.3.

Done.
Re: [PATCHES] Translation Updates for 7.5: initdb-ru.po.gz;libpq-ru.po.gz;psql-ru.po.gz
On Saturday, 27 March 2004 22:18, Serguei Mokhov wrote:
> These are for the HEAD:
> - updated libpq and psql for 7.5
> - initial cut on initdb

Installed.
Re: [PATCHES] hint infrastructure setup (v3)
Dear Tom,

First, thanks for the discussion about my hint infrastructure patch submission. Whatever the opinion, it helps.

> We'd have to write our own version of bison's verbose-error code anyway,
> because the canned code doesn't support localization --- it uses
> hardwired strings for "expecting" and so on. But it looks possibly
> doable to me.

I also played a little bit with the parser over the weekend. I found the same issues that you noted in your mail, tried more options, and found other problems. I send here the full mail I wrote. Although I agree that it is doable, I have stronger reservations than yours. Also, I do not find it an appealing solution to change gram.c a lot.

The problems I found are:

(1) the raw information is not available.

The automaton stack keeps states, which are not directly linked to rules and terminals. The terminals are not available; they must be kept separately if you want them. This can be done in yylex(). The internal state, stack, token... are local variables within yyparse(). As a result, they are not accessible from yyerror. I haven't found any available hook, so you have to hack gram.c to get this information. It is a 1-line hack, but it modifies the generated code. Or you have to produce a new hacked gram.c, which is what you suggest.

(2) it is hard to compute the follow set.

The information about the syntax is there, obviously! However, from a given automaton state, the actually allowed tokens are not directly available. Only some are directly available, so the result is:

DROP ? ;
ERROR: syntax error at or near "?" at character 6
HINT: hint status is state=422 n=796 char=575 type=320=Op
      stack= 0 16 422 tokens= 88:DROP 320:Op expected= 150:LANGUAGE?

Other tokens may be found, but you have either:

- to *simulate* the automaton forward through reductions to check whether it is finally accepted. Not really beautiful, but that can be done.
It is just the bison code without the reduction production rules;-)

- to move backwards before doing the above: if some reductions were performed because of the submitted token and finally resulted in the error, the state that led to the error may not be the highest available one, so other allowed tokens may also be missed. We would need the last state before any reduction. Another small hack in the generated gram.c would be needed to get it.

However,

(3) the computed follow set is often useless, as you noted.

The reason for that is that keywords are accepted in place of identifiers. The use of the rules reserved_keyword, func_name_keyword, col_name_keyword, unreserved_keyword as possible column/function/... identifiers is IMHO the main reason for the size of the generated automaton. If you drop that, everything would be much simpler, because it would greatly reduce the number of arcs in the automaton. In particular, anything which is just an identifier should not be given a name (say, BOOLEAN_P or SMALLINT should not be tokens, but could rather be recognized as a special case of identifier from within the parser, maybe with some simple post-filtering).

As you noted, for things like "SHOW 123", the follow set basically includes all keywords although you can have SHOW ALL or SHOW name. So, as you suggested, you can either say "ident" as a simplification, but you miss ALL which is meaningful, or you say all keywords, which is useless. An alternative is to know that you're within a "SHOW something", that is you somehow reparse the code from within the syntax error:-( It is necessary if you want to say something more interesting than "ident", say "option name". You may also get this information by simulating the automaton forward, or by noticing that some of the current on-stack states can only lead to the reduction of a ShowStmt...

Well, on the positive side, it is sometimes right, say:

REVOKE ? ;
ERROR: syntax error, unexpected Op at or near "?"
at character 8
HINT: hint status is state=481 n=1297 char=575 type=320=Op
      stack= 0 31 481 tokens= 234:REVOKE 320:Op
      expected= 10:ALL 59:CREATE 80:DELETE_P 98:EXECUTE 136:INSERT
      224:REFERENCES 239:RULE 244:SELECT 268:TEMP 270:TEMPORARY
      279:TRIGGER 292:UPDATE 293:USAGE?

(4) the available automaton data are inverted with respect to what would be needed...

It would be nice to know whether we're heading for a create statement or something like that, but the information is hidden down in the automaton. Basically, the automaton would need to be inverted to compute this information. We would need the initial information in gram.y, which was inverted to build the automaton. So the code somehow needs to undo what bison compilation has done.

(5) anything that can be done would be hardwired to one version of bison.

There are a lot of assumptions in the code and data structures, and any newer/older version with some different internal representation would basically break any code that relies on that. So postgres would not be bison-portable:-(

I don't think
Re: [PATCHES] hint infrastructure setup (v3)
> As another teaching aid idea, is there something that would auto-format
> queries so they are more readable?

Not that I know of. But I guess it must exist in some interface, maybe with Oracle or DB2. My emacs does not really seem to do that; it just puts colors on keywords. If I had to put it somewhere, I would try to enhance emacs sql-mode. But I don't like lisp programming much.

-- Fabien Coelho - [EMAIL PROTECTED]
Re: [PATCHES] hint infrastructure setup (v3)
As another teaching aid idea, is there something that would auto-format queries so they are more readable?

---

Fabien COELHO wrote:

Dear Tom,

> No, just redefine yyerror as a macro that passes additional parameters.

Ok! That's just the kind of the hack-hook I was looking for!!

> > The terminals are not available, they must be kept separately if you
> > want them. This can be done in yylex().
>
> We don't need them. Any already-shifted tokens are to the left of where
> the error is, no?

Yes and no. I agree that you don't need them for computing the follow set. However, it would help a lot to generate nicer hints. I'm looking to help the user, and the follow set is just part of the help. Think of the typical "SELECT foo, bla, FROM xxx" (it looks stupid on a one-line query, but not so on a very large query), which is quite easy to detect and hint about because FROM is just after a comma.

The rule part is also a real issue. There is no easy way to translate states into rules, to know whether we're somewhere in a ShowStmt or AlterTableStmt... If you're after a comma in state 1232, should you just say IDENT... I'd rather say "user name" or "table name" if I can... Otherwise the hint stays at a low parser level. Good hints require knowing about the current context, and it is not easy to get that deep in the automaton.

> Yeah, I had come to the same conclusion --- state moves made without
> consuming input would need to be backed out if we wanted to report the
> complete follow set. I haven't yet looked to see how to do that.

Well, following your previous suggestion, one can redefine the YYLEX macro to save the current state each time a token is required.

> > As you noted, for things like SHOW 123, the follow set basically
> > includes all keywords although you can have SHOW ALL or SHOW name. So,
> > as you suggested, you can either say "ident" as a simplification, but
>
> You're ignoring the distinction between classes of keywords. I would not
> recommend treating reserved_keywords as a subclass of ident.

Ok, I agree that it can help a little bit in this very case because ALL is reserved. I did not make this distinction when I played with bison.

> > (5) anything that can be done would be hardwired to one version of
> > bison.
>
> I think this argument is completely without merit.

Possible;-) However, I don't like writing hacks that depend on other people's future behavior.

> > (b) write a new recursive descendant parser, and drop gram.y
>
> Been there, done that, not impressed with the idea. There's a reason why
> people invented parser generators...

Sure. It was just a list of options;-)

A side effect of my inspection is that the automaton generated by bison could be simplified if a different tradeoff between the lexer, the parser and the post-processing were chosen. Namely, all tokens that are just identifiers should be dropped and processed differently.

> We are not going to reserve the keywords that are presently unreserved.

Reserving keywords would help, but I would not think it could be accepted, because it is not the SQL philosophy. SQL is about 30 years old, as old as yacc ideas... but they did not care at the time;-) When you look at the syntax, it was designed with a recursive parser in mind. Part of my comment was to explain why the generated automaton is large. SQL is a small language, but it has a big automaton in postgresql, and this is IMHO the explanation.

> If you can think of a reasonable way to stop treating them as separate
> tokens inside the grammar without altering the user-visible behavior,
> I'm certainly interested.

I was thinking about type names especially, and maybe others. I join a small proof-of-concept patch to drop some tokens out of the parser. As a result, 6 tokens, 6 rules and 9 states are removed, and the automaton table is reduced by 438 entries (1.5%). Not too bad;-) I think others could be dealt with similarly, or with more work.

Thanks for the discussion,

-- Fabien.

Content-Description: [ Attachment, skipping... ]

--
Bruce Momjian | http://candle.pha.pa.us
[EMAIL PROTECTED] | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Re: [PATCHES] hint infrastructure setup (v3)
Fabien COELHO [EMAIL PROTECTED] writes:
> > If you can think of a reasonable way to stop treating them as separate
> > tokens inside the grammar without altering the user-visible behavior,
> > I'm certainly interested.
>
> I join a small proof-of-concept patch to drop some tokens out of the
> parser.

I believe these were treated this way *specifically* because of the keyword-is-not-an-identifier issue. SQL99 calls out most of these as being keywords:

    SQL defines predefined data types named by the following key words:
    CHARACTER, CHARACTER VARYING, CHARACTER LARGE OBJECT, BINARY LARGE
    OBJECT, BIT, BIT VARYING, NUMERIC, DECIMAL, INTEGER, SMALLINT, FLOAT,
    REAL, DOUBLE PRECISION, BOOLEAN, DATE, TIME, TIMESTAMP, and INTERVAL.

These names are used in the type and if we don't treat them as keywords then we will have a couple of problems. One is case-conversion issues in locales where the standard downcasing is not an extension of ASCII (Turkish is the only one I know of offhand). Another is that, depending on where you put the renaming that this patch removes without replacing :-(, it would be possible for the renaming transformation to get applied to user-defined types with similar names, or for user-defined types to unexpectedly shadow system definitions. The former would be surprising and the latter would violate the spec.

Check the archives; I'm sure this was discussed in the 7.3 development cycle and we concluded that treating these names as keywords was the only practical solution.

regards, tom lane
Re: [PATCHES] hint infrastructure setup (v3)
Tom Lane wrote:
> Fabien COELHO [EMAIL PROTECTED] writes:
> > (b) write a new recursive descendant parser, and drop gram.y

er, that's recursive *descent* :-)

> Been there, done that, not impressed with the idea. There's a reason why
> people invented parser generators...

Well, unless you are a serious masochist, cutting a parser of the LR family (including LALR(1), such as yacc/bison produce) by hand is very difficult and error prone. That is not the case with LL type parsers. Also, there are parser generators for this family, both table driven and recursive descent. The table driven variety especially can have quite well tuned error reporting and recovery. (Among the reasons I know this is that I actually wrote such a beast around 13 years ago.)

That is not to say that I think we should move from the present setup. Familiarity of the developer community with the tool used suggests we should not, quite apart from any other reasons. Changing to, say, an RD parser would be a massive and probably error-prone change, and the benefit just doesn't seem worth it. In fact, considering this thread, I'm not sure any of the suggested steps will achieve Fabien's goal. ISTM that a smarter "training wheels" frontend, perhaps with some modest backend improvements, is likely to have better results. (Oh, you found an error near there - now what do I suggest belongs there?)

cheers

andrew
Re: [PATCHES] hint infrastructure setup (v3)
Thanks for the discussion,

-- Fabien.

*** ./src/backend/parser/gram.y.orig	Mon Apr  5 12:06:42 2004
--- ./src/backend/parser/gram.y	Mon Apr  5 18:21:21 2004
***************
*** 336,343 ****
  	AGGREGATE ALL ALSO ALTER ANALYSE ANALYZE AND ANY ARRAY AS ASC
  	ASSERTION ASSIGNMENT AT AUTHORIZATION
! 	BACKWARD BEFORE BEGIN_P BETWEEN BIGINT BINARY BIT
! 	BOOLEAN_P BOTH BY
  	CACHE CALLED CASCADE CASE CAST CHAIN CHAR_P
  	CHARACTER CHARACTERISTICS CHECK CHECKPOINT CLASS CLOSE
--- 336,343 ----
  	AGGREGATE ALL ALSO ALTER ANALYSE ANALYZE AND ANY ARRAY AS ASC
  	ASSERTION ASSIGNMENT AT AUTHORIZATION
! 	BACKWARD BEFORE BEGIN_P BETWEEN BINARY BIT
! 	BOTH BY
  	CACHE CALLED CASCADE CASE CAST CHAIN CHAR_P
  	CHARACTER CHARACTERISTICS CHECK CHECKPOINT CLASS CLOSE
***************
*** 362,368 ****
  	ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IN_P INCLUDING INCREMENT
  	INDEX INHERITS INITIALLY INNER_P INOUT INPUT_P
! 	INSENSITIVE INSERT INSTEAD INT_P INTEGER INTERSECT
  	INTERVAL INTO INVOKER IS ISNULL ISOLATION JOIN
--- 362,368 ----
  	ILIKE IMMEDIATE IMMUTABLE
Re: [PATCHES] O(samplesize) tuple sampling, proof of concept
Manfred Koizar [EMAIL PROTECTED] writes:
> The old sampling method is still there. I have added a GUC variable
> sampling_method to make testing and benchmarking easier.

I wouldn't bother with a GUC variable for the production patch.

> Once a block is selected for inspection, all tuples of this block are
> accessed to get a good estimation of the live : dead row ratio.

Why are you bothering to compute the live:dead ratio? The statistics model doesn't have any place for a dead-tuples count, so there's no need to think about anything but live tuples. (This is a historical artifact, perhaps, arising from the fact that originally analysis happened after VACUUM FULL and so the number of dead tuples was guaranteed to be zero at the time. But still, I'm not sure how we'd factor a dead-tuples count into the estimates if we had it.)

> Because I was afraid that method 1 might be too expensive in terms of
> CPU cycles, I implemented a small variation that skips tuples without
> checking them for visibility; this is sampling_method 2.

And? Does it matter? (I'd guess not in practice, as checking a tuple for visibility is cheap if someone's already marked it good.)

> At 1000 pages there is only a difference of approximately 1%, at 3000
> pages the difference has its maximum of ca. 15%. We can sell this by
> pointing to the better quality of the statistics.

If that's as bad as it gets I think we are OK. You should redo the test with a larger sample size though (try stats target = 100) to see if the answer changes.

> Are there any spots in the documentation that should be updated?

AFAIR there is no place that specifically needs to be changed.

> diff -ruN ../base/src/backend/commands/analyze.c src/backend/commands/analyze.c

I find -u diffs close to unreadable for reviewing purposes. Please submit diffs in -c format in future.

> +	/*
> +	 * If we ran out of tuples then we're done. No sort is needed,
> +	 * since they're already in order.
> +	 *
> +	 * Otherwise we need to sort the collected tuples by position
> +	 * (itempointer). I don't care for corner cases where the tuples
> +	 * are already sorted.
> +	 */
> +	if (numrows == targrows)
> +		qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);

AFAICS the rows will *always* be sorted already, and so this qsort is an utter waste of cycles. No?

regards, tom lane
Re: [PATCHES] O(samplesize) tuple sampling, proof of concept
Manfred Koizar [EMAIL PROTECTED] writes:
> > > Because I was afraid that method 1 might be too expensive in terms of
> > > CPU cycles, I implemented a small variation that skips tuples without
> > > checking them for visibility; this is sampling_method 2.
> >
> > And? Does it matter?
>
> There's a clearly visible difference for mid-size relations. I'm not
> sure whether this can be attributed to visibility bit updating or other
> noise-contributing factors.

I think it would have to be visibility-bit updates. Can you try it with a pre-vacuumed relation, so that there is no slowdown for updates? The update cost will have to be paid sooner or later by some xact, and on the whole it's better that it be done by a maintenance xact than by a foreground client; so even if there is a noticeable slowdown here, that doesn't seem like a reason not to inspect all the tuples.

> > I find -u diffs close to unreadable for reviewing purposes. Please
> > submit diffs in -c format in future.
>
> De gustibus non est disputandum :-) Fortunately this patch wouldn't look
> much different. There is just a bunch of + lines.

Yeah, so I managed to read it anyway ;-). It's the ones with intricately intermixed - and + that I find difficult to follow.

> > AFAICS the rows will *always* be sorted already, and so this qsort is
> > an utter waste of cycles. No?
>
> I don't think so. We get the *blocks* in the correct order. But tuples
> are still sampled by the Vitter method which starts to replace random
> tuples after the pool is filled.

Duh, you're right --- I was thinking that the old code doesn't need a qsort, but it does. This seems a tad annoying considering that we know the tuples were inserted into the pool in increasing order. I wonder if it's worth using a more complex data structure to keep track of both orders at once? I suppose that could easily wind up costing more than the qsort though ...

regards, tom lane
Re: [PATCHES] O(samplesize) tuple sampling, proof of concept
On Mon, 05 Apr 2004 15:37:07 -0400, Tom Lane [EMAIL PROTECTED] wrote:
> I wouldn't bother with a GUC variable for the production patch.

Among other things, the GUC variable will be thrown out for the final version.

> > Once a block is selected for inspection, all tuples of this block are
> > accessed to get a good estimation of the live : dead row ratio.
>
> Why are you bothering to compute the live:dead ratio?

That was basically bad wording. It should have been "... to get a good estimation of live rows per page". Counting dead rows turned out to be trivial, so I just did it and included the number in the debug messages. Then it happened to be useful for method 2.

> > Because I was afraid that method 1 might be too expensive in terms of
> > CPU cycles, I implemented a small variation that skips tuples without
> > checking them for visibility; this is sampling_method 2.
>
> And? Does it matter?

There's a clearly visible difference for mid-size relations. I'm not sure whether this can be attributed to visibility bit updating or other noise-contributing factors. Method 2 gives a row count estimation error between 10 and 17%, where method 1 is less than 1% off. (My updates generated dead tuples at very evenly distributed locations by using conditions like WHERE mod(twenty, 7) = 0.)

> If that's as bad as it gets I think we are OK. You should redo the test
> with larger sample size though (try stats target = 100) to see if the
> answer changes.

Will do.

> I find -u diffs close to unreadable for reviewing purposes. Please
> submit diffs in -c format in future.

De gustibus non est disputandum :-) Fortunately this patch wouldn't look much different. There is just a bunch of + lines.

> AFAICS the rows will *always* be sorted already, and so this qsort is an
> utter waste of cycles. No?

I don't think so. We get the *blocks* in the correct order. But tuples are still sampled by the Vitter method, which starts to replace random tuples after the pool is filled.

BTW, thanks for the paper!

Servus
Manfred