Re: [HACKERS] Coverity Open Source Defect Scan of PostgreSQL
On Tue, Mar 07, 2006 at 05:39:18PM -0500, Tom Lane wrote: Martijn van Oosterhout kleptog@svana.org writes: #ifdef STATIC_ANALYSIS #define ereport(elevel, rest) \ (errstart(elevel, __FILE__, __LINE__, PG_FUNCNAME_MACRO) ? \ (errfinish rest) : (void) 0), (elevel = ERROR ? exit(0) : 0) #else /* Normal def */ #endif Hmm, neat idea ... though I wonder whether either gcc or Coverity's tool is smart enough to draw the right conclusions from a conditional exit() call ... Well, remember this is a macro so the conditional is known at compile time and the optimiser should see that the exit is unconditional. A quick test with the attached program shows that gcc does correctly determine that the last few lines are unreachable and are optimised out entirely (with -Wunreachable-code which is not the default). I tried to create an empty static inline function with attribute((noreturn)) to optimise out the call to exit(), but gcc merely points out the function does actually return and proceeds to assume that the rest of main() is also reachable. Another possibility would be to create two versions of errfinish, one marked (noreturn), and use a conditional on elevel to decide which to use. However, then you get issues with multiple evaluation of macro arguments... gcc 3.3.5 -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them. #include stdlib.h #include stdio.h typedef enum { NOTICE, WARN, ERROR, FATAL } ErrorLevel; #define errortest(elevel,emessage) \ doerr(elevel,emessage), (elevel = ERROR ? exit(0) : 0) void doerr( ErrorLevel level, char *msg ) { printf( Error: %s\n, msg ); } static inline void __attribute__((noreturn)) fake_exit() { } int main() { int i=0; errortest( NOTICE, test1 ); i=1; errortest( FATAL, test2 ); i=2; return 0; } signature.asc Description: Digital signature
Re: [HACKERS] FW: PGBuildfarm member snake Branch HEAD Status changed from OK to PLCheck failure
On Mar 8, 2006, at 13:01 , Bruce Momjian wrote: I have updated the /contrib and PL regression tests for escape_string_warning. I can't test all the regressions but the build farm will tell us soon enough. I see 'em turning green :) Thanks, Bruce. Michael Glaesemann grzm myrealbox com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] FW: PGBuildfarm member snake Branch HEAD Status changed
Michael Glaesemann wrote: On Mar 8, 2006, at 13:01 , Bruce Momjian wrote: I have updated the /contrib and PL regression tests for escape_string_warning. I can't test all the regressions but the build farm will tell us soon enough. I see 'em turning green :) Thanks, Bruce. Great. Just fixed another pltcl problem so hopefully that is all. -- Bruce Momjian http://candle.pha.pa.us SRA OSS, Inc. http://www.sraoss.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 6: explain analyze is your friend
[HACKERS] 8.2 hold queue
I have applied all the patches in the patch queue, and am starting to look at the patches_hold queue, which are patches submitted after the feature freeze. http://momjian.postgresql.org/cgi-bin/pgpatches_hold -- Bruce Momjian http://candle.pha.pa.us SRA OSS, Inc. http://www.sraoss.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [SQL] Interval subtracting
Hi, Scott, Scott Marlowe wrote: But it isn't '-2 months, -1 day'. I think what you are saying is what I am saying, that we should make the signs consistent. Pretty much. It just seems wrong to have different signs in what is essentially a single unit. We don't say 42 degrees, -12 minutes when measuring arc, do we? Then again, maybe some folks do. It just seems wrong to me. But we say quarter to twelve, at least in some areas on this planet. The problem is that months have different lengths. '2 months - 1 day' can be '1 month 27 days', '1 month 28 days', '1 month 29 days' or '1 month 30 days', depending on the timestamp we apply the interval. Markus -- Markus Schaber | Logical TrackingTracing International AG Dipl. Inf. | Software Development GIS Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] [PATCHES] Inherited Constraints
Ühel kenal päeval, E, 2006-03-06 kell 17:25, kirjutas Bruce Momjian: Hannu Krosing wrote: ?hel kenal p?eval, E, 2006-03-06 kell 12:12, kirjutas Bruce Momjian: Added to TODO: o Prevent parent tables from altering or dropping constraints like CHECK that are inherited by child tables Dropping constraints should only be possible with CASCADE. and we already have this in TODO: o %Prevent child tables from altering or dropping constraints like CHECK that were inherited from the parent table so I think we now have all the failure cases documented. If you want to be consistent, then ALTER TABLE ONLY ADD CONSTRAINT .. should also be forbidden, so you can't create non-inherited constraints I don't have a problem with creating ONLY constraints on parents and children. We just don't want them to be removed/modified if they are shared. Well, when you delete a constraint from child, the constraint becomes an ONLY constraint on parent. If you allow ONLY constraints on parents, then why disallow dropping them from childs ? IIRC the original complaint about being able to drop constraints from children was that inherited tables not being bound by constraints on parents was unexpected/broken. I.E when you have CREATE TABLE T(i int check (i0)); then you would be really surprised by getting -1 out from that table. --- Hannu ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Merge algorithms for large numbers of tapes
On Tue, 2006-03-07 at 18:14 -0500, Tom Lane wrote: BTW, I was just looking over Knuth's discussion of sorting again, and realized that there is still something more that could be done within the existing sort framework. We currently use standard polyphase merge (his Algorithm 5.4.2D), which IIRC I chose because it was simple and for relatively small numbers of tapes T it was about as good as anything else. Knuth spends a great deal of energy on minimizing tape rewind time which of course is of no interest to us, and I had supposed that all of his more-complex algorithms were really only of interest if you needed to consider rewind time. However, now that we've changed the code to prefer large numbers of tapes, it's not at all clear that Algorithm D is still the right one to use. In particular I'm looking at cascade merge, Algorithm 5.4.3C, which appears to use significantly fewer passes when T is large. Ah! Well spotted. Yeh, looks like it will improve performance a good deal. So, yes, definitely a TODO item. Do you want to try that? The Cascade Merge re-writes the way logical tapes are selected and how the runs are merged. It doesn't seem to do anything at all about the run-forming, which would still use heapsort. So the only effect is when we have more runs than tapes, so for the limits of where we would begin noticing any benefit would be: work_mem= 1 GB benefit at 8 TB work_mem= 256MB benefit at 0.5 TB work_mem= 8MB benefit at 256 MB work_mem= 1MB benefit at 12 MB (min 7 tapes). (based upon runs on average twice size of memory, and each logical tape requiring 256KB memory, i.e. min(work_mem/4, 6) * work_mem * 2, which for work_mem 2 MB gives 0.5 * work_mem^2) Which means the benefit we get is when we have for some reason been unable to give the sort enough space, or not set parameters correctly. So, still a concern...but makes me think about 2 other issues first: 1. Earlier we had some results that showed that the heapsorts got slower when work_mem was higher and that concerns me most of all right now. It's possible you'll have reduced that considerably with the pull-out-the-first-attr patch. I'll look into some test results to show that has gone away. We also have Nyberg et al telling us that as of 1994 they established that heapsort would always be slower than qsort, as a result of CPU cache locality improvements. An improvement here would effect all sorts work_mem. 2. Improvement in the way we do overall memory allocation, so we would not have the problem of undersetting work_mem that we currently experience. If we solved this problem we would have faster sorts in *all* cases, not just extremely large ones. Dynamically setting work_mem higher when possible would be very useful. I've looked at this a few times and have some suggestions, but perhaps its worth asking for ideas in this area? Best Regards, Simon Riggs ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[HACKERS] Running out of disk space during query
Greetings, * Simon Riggs ([EMAIL PROTECTED]) wrote: work_mem= 1 GBbenefit at 8 TB work_mem= 256MB benefit at 0.5 TB (based upon runs on average twice size of memory, and each logical tape requiring 256KB memory, i.e. min(work_mem/4, 6) * work_mem * 2, which for work_mem 2 MB gives 0.5 * work_mem^2) Seeing this reminded me of an issue I ran into recently. In 8.1 on a database that's only 16G, I ran a query that chewed up all the available disk space (about 250G, yes, 0.25TB) on the partition and then failed. Of course, this took many hours on a rather speedy box (and the disk array is a pretty nice IBM SAN so it's not exactly a slacker either) and produced nothing for me. I'd like to think it's often the case that Postgres has some idea what the total disk space usage of a given query is going to be prior to actually running the whole query and just seeing how much space it took at the highest point. If this can be done with some confidence then it'd be neat if Postgres could either check if there's enough disk space available and if not bail (I know, difficult to do cross-platform and there's tablespaces and whatnot to consider) OR if there was a parameter along the lines of max_temp_disk_space which would fail the query if that would be exceeded by the query. The latter could even be two GUC variables, one administrator set and unchangable by the user ('hard' limit) and one settable by the user with a sane default ('soft' limit) and perhaps a HINT which indicates how to change it in the error message when the limit is hit. I suppose I could put quotas in place or something but I don't really have a problem with the database as a whole using up a bunch of disk space (hence why it's got alot of room to grow into), I just would have liked a this will chew up more disk space than you have and then fail message instead of what ended up happening for this query. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] Merge algorithms for large numbers of tapes
Simon Riggs [EMAIL PROTECTED] writes: 1. Earlier we had some results that showed that the heapsorts got slower when work_mem was higher and that concerns me most of all right now. Fair enough, but that's completely independent of the merge algorithm. (I don't think the Nyberg results necessarily apply to our situation anyway, as we are not sorting arrays of integers, and hence the cache effects are far weaker for us. I don't mind trying alternate sort algorithms, but I'm not going to believe an improvement in advance of direct evidence in our own environment.) 2. Improvement in the way we do overall memory allocation, so we would not have the problem of undersetting work_mem that we currently experience. If we solved this problem we would have faster sorts in *all* cases, not just extremely large ones. Dynamically setting work_mem higher when possible would be very useful. I think this would be extremely dangerous, as it would encourage processes to take more than their fair share of available resources. Also, to the extent that you believe the problem is insufficient L2 cache, it seems increasing work_mem to many times the size of L2 will always be counterproductive. (Certainly there is no value in increasing work_mem until we are in a regime where it consistently improves performance significantly, which it seems we aren't yet.) regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] pg_freespacemap question
Mark Kirkwood [EMAIL PROTECTED] writes: Good points! I had not noticed this test case. Probably NULL is better Would setting it to 'BLCKSZ - (fixed index header stuff)' be better, No, I don't think so, because that will just make it harder to recognize what's what (remember that BLCKSZ isn't really a constant, and the index overhead is not the same for all AMs either). The point here is that for indexes the FSM tracks whole-page availability, not the amount of free space within pages. So I think NULL is a reasonable representation of that. Using NULL will make it easy to filter the results if you want to see only heap-page data or only index-page data, whereas it will be very hard to do that if the view adopts an ultimately-artificial convention about the amount of available space on an index page. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Merge algorithms for large numbers of tapes
Tom, On 3/8/06 7:21 AM, Tom Lane [EMAIL PROTECTED] wrote: Simon Riggs [EMAIL PROTECTED] writes: 1. Earlier we had some results that showed that the heapsorts got slower when work_mem was higher and that concerns me most of all right now. Fair enough, but that's completely independent of the merge algorithm. (I don't think the Nyberg results necessarily apply to our situation anyway, as we are not sorting arrays of integers, and hence the cache effects are far weaker for us. I don't mind trying alternate sort Even with the indirection, we should investigate alternative approaches that others have demonstrated to be superior WRT L2 cache use. A major commercial database currently performs external sorts of various fields 4 times faster, and commonly uses more than 256MB of sort memory in one example case to do it. I think this would be extremely dangerous, as it would encourage processes to take more than their fair share of available resources. I agree - in fact, we currently have no structured concept of fair share of available resources, nor a way to share them. I think the answer to this should involve the use of statement queuing and resource queues. Also, to the extent that you believe the problem is insufficient L2 cache, it seems increasing work_mem to many times the size of L2 will always be counterproductive. (Certainly there is no value in increasing work_mem until we are in a regime where it consistently improves performance significantly, which it seems we aren't yet.) Not if you cache block, the optimization that operates on a block of memory one L2 block in size at a time. - Luke ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [PATCHES] Inherited Constraints
On Wed, 8 Mar 2006, Hannu Krosing wrote: ??hel kenal p??eval, E, 2006-03-06 kell 17:25, kirjutas Bruce Momjian: Hannu Krosing wrote: ?hel kenal p?eval, E, 2006-03-06 kell 12:12, kirjutas Bruce Momjian: Added to TODO: o Prevent parent tables from altering or dropping constraints like CHECK that are inherited by child tables Dropping constraints should only be possible with CASCADE. and we already have this in TODO: o %Prevent child tables from altering or dropping constraints like CHECK that were inherited from the parent table so I think we now have all the failure cases documented. If you want to be consistent, then ALTER TABLE ONLY ADD CONSTRAINT .. should also be forbidden, so you can't create non-inherited constraints I don't have a problem with creating ONLY constraints on parents and children. We just don't want them to be removed/modified if they are shared. Well, when you delete a constraint from child, the constraint becomes an ONLY constraint on parent. Only if there's a single child, otherwise you have a partially-ONLY constraint unless you made it ONLY constraints on the parent and all other children (but then removing the parent constraint wouldn't remove it from the other children presumably). If you allow ONLY constraints on parents, then why disallow dropping them from childs ? I agree with this in any case. I think both are fairly broken. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] problem with large maintenance_work_mem settings and
Tom Lane wrote: I wrote: Stefan Kaltenbrunner [EMAIL PROTECTED] writes: samples %symbol name 24915704 96.2170 ltsReleaseBlock We probably need to tweak things so this doesn't get called during the final merge pass. Looking at it now. I've committed a fix for this into CVS HEAD --- please try it out. just tried that with CVS HEAD (includes the second fix too): CREATE INDEX on a 1,8B row table (5 int columns - index created on the first row about 300M distinct values): before: 11h 51min after: 3h 11min(!) Stefan ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] problem with large maintenance_work_mem settings and
Stefan Kaltenbrunner [EMAIL PROTECTED] writes: CREATE INDEX on a 1,8B row table (5 int columns - index created on the first row about 300M distinct values): before: 11h 51min after: 3h 11min(!) Cool. Does it seem to be I/O bound now? Would you be willing to do it over with oprofile turned on? regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
[HACKERS] Add switches for DELIMITER and NULL in pg_dump COPY
Folks, From the earlier discussion, it appears that there is a variety of opinions on what the COPY delimiter should be in pg_dump. This patch allows people to set it and the NULL string. Thanks to Gavin Sherry for help with the pointers :) I didn't patch pg_dumpall, but it would be trivial if there's a use case. Cheers, D -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 415 235 3778 Remember to vote! Index: doc/src/sgml/ref/pg_dump.sgml === RCS file: /projects/cvsroot/pgsql/doc/src/sgml/ref/pg_dump.sgml,v retrieving revision 1.81 diff -c -r1.81 pg_dump.sgml *** doc/src/sgml/ref/pg_dump.sgml 1 Nov 2005 21:09:50 - 1.81 --- doc/src/sgml/ref/pg_dump.sgml 6 Mar 2006 07:32:05 - *** *** 163,168 --- 163,208 /varlistentry varlistentry + termoption--copy-delimiter=replaceable class=parameterdelimiter/replaceable/option/term + listitem +para + Use replaceable class=parameterdelimiter/replaceable + instead of the default tab character in commandCOPY/command statements. +/para + /listitem + /varlistentry + + varlistentry + termoption--copy-null=replaceable class=parameterstring_for_nulls/replaceable/option/term + listitem +para + Use replaceable class=parameterstring_for_nulls/replaceable instead of the + default \N in commandCOPY/command statements. +/para + /listitem + /varlistentry + + varlistentry + termoption--copy-delimiter=replaceable class=parameterdelimiter/replaceable/option/term + listitem +para + Use replaceable class=parameterdelimiter/replaceable + instead of the default tab character in commandCOPY/command statements. +/para + /listitem + /varlistentry + + varlistentry + termoption--copy-null=replaceable class=parameterstring_for_nulls/replaceable/option/term + listitem +para + Use replaceable class=parameterstring_for_nulls/replaceable instead of the + default \N in commandCOPY/command statements. +/para + /listitem + /varlistentry + + varlistentry termoption-d/option/term termoption--inserts/option/term listitem Index: src/bin/pg_dump/pg_dump.c === RCS file: /projects/cvsroot/pgsql/src/bin/pg_dump/pg_dump.c,v retrieving revision 1.433 diff -c -r1.433 pg_dump.c *** src/bin/pg_dump/pg_dump.c 5 Mar 2006 15:58:50 - 1.433 --- src/bin/pg_dump/pg_dump.c 6 Mar 2006 07:32:12 - *** *** 114,119 --- 114,125 /* flag to turn on/off dollar quoting */ static intdisable_dollar_quoting = 0; + /* Things used when caller invokes COPY options. */ + #define ARG_COPY_DELIMITER 2 + #define ARG_COPY_NULL 3 + char *copy_delimiter = \t; + char *copy_null; + static void help(const char *progname); static NamespaceInfo *findNamespace(Oid nsoid, Oid objoid); *** *** 181,186 --- 187,193 ExecStatusType expected); + int main(int argc, char **argv) { *** *** 211,217 char *outputSuperuser = NULL; RestoreOptions *ropt; ! static struct option long_options[] = { {data-only, no_argument, NULL, 'a'}, {blobs, no_argument, NULL, 'b'}, --- 218,224 char *outputSuperuser = NULL; RestoreOptions *ropt; ! static struct option long_options[] = { {data-only, no_argument, NULL, 'a'}, {blobs, no_argument, NULL, 'b'}, *** *** 249,254 --- 256,269 {disable-dollar-quoting, no_argument, disable_dollar_quoting, 1}, {disable-triggers, no_argument, disable_triggers, 1}, {use-set-session-authorization, no_argument, use_setsessauth, 1}, + + /* +* The following options don't have an equivalent short option +* letter, and are not available as -X long-name. Just use +* the long form. +*/ + {copy-delimiter, required_argument, NULL, ARG_COPY_DELIMITER}, + {copy-null, required_argument, NULL, ARG_COPY_NULL}, {NULL, 0, NULL, 0} }; *** *** 418,423 --- 433,460 break; /* This covers the long options equivalent to -X xxx. */ + case ARG_COPY_DELIMITER: + if ( strlen(optarg) != 1) + { + fprintf(stderr, _(In %s, copy-delimiter must be exactly one byte long, not %d.\n), +
Re: [HACKERS] [SQL] Interval subtracting
On Wed, 2006-03-08 at 06:07, Markus Schaber wrote: Hi, Scott, Scott Marlowe wrote: But it isn't '-2 months, -1 day'. I think what you are saying is what I am saying, that we should make the signs consistent. Pretty much. It just seems wrong to have different signs in what is essentially a single unit. We don't say 42 degrees, -12 minutes when measuring arc, do we? Then again, maybe some folks do. It just seems wrong to me. But we say quarter to twelve, at least in some areas on this planet. The problem is that months have different lengths. '2 months - 1 day' can be '1 month 27 days', '1 month 28 days', '1 month 29 days' or '1 month 30 days', depending on the timestamp we apply the interval. I made this point before. In the military they say 1145 or 2345 instead of quarter to twelve, because 1: there are two quarter to twelves a day, and 2: It's easy to get it confused. For same reasons, i.e. a need for precision, I find it hard to accept the idea of mixing positive and negative units in the same interval. The plus or minus sign should be outside of the interval. Then, it's quite certain what you mean. If you say select '2006-06-12'::date - interval '1 month 2 days' there is no ambiguity. If you say: select '2006-06-12'::date + interval '-1 month -2 days' do you mean (1 month - 2 days) subtracted from the date, or do you mean to subtract 1 month, then 2 days from the date? Putting the + or - outside the interval seems to make the most sense to me. Allowing them inside makes no sense to me. And colloquialisms aren't really a good reason. :) ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Running out of disk space during query
Stephen Frost [EMAIL PROTECTED] writes: I suppose I could put quotas in place or something but I don't really have a problem with the database as a whole using up a bunch of disk space (hence why it's got alot of room to grow into), I just would have liked a this will chew up more disk space than you have and then fail message instead of what ended up happening for this query. I've got the same problem with this that I do with the recently-proposed patch to fail queries with estimated cost X --- to wit, I think it will result in a net *reduction* in system reliability not an improvement. Any such feature changes the planner estimates from mere heuristics into a gating factor that will make queries fail entirely. And they are really not good enough to put that kind of trust into. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Add switches for DELIMITER and NULL in pg_dump COPY
David Fetter [EMAIL PROTECTED] writes: From the earlier discussion, it appears that there is a variety of opinions on what the COPY delimiter should be in pg_dump. This patch allows people to set it and the NULL string. Did anyone provide a convincing use case for this? It's of zero value from the perspective of pg_dump itself; the only possible argument is that it makes it easier for program-foo to parse the output of pg_dump. But I don't see any programs around to parse arbitrary SQL scripts, especially not the pretty-PG-specific scripts that pg_dump emits. I think it much more likely that people needing this sort of thing would be using something like psql -c 'copy foo to stdout', so as to get the data without any added overhead. So this seems like mere creeping featurism to me. pg_dump has too many switches already. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Merge algorithms for large numbers of tapes
On Wed, Mar 08, 2006 at 07:28:16AM -0800, Luke Lonergan wrote: I think this would be extremely dangerous, as it would encourage processes to take more than their fair share of available resources. I agree - in fact, we currently have no structured concept of fair share of available resources, nor a way to share them. A concept it would be great to add at some point, both for memory and IO. But that's another discussion entirely. I think the answer to this should involve the use of statement queuing and resource queues. Something else to consider is reducing the amount of memory used when we have to fail to a tape sort, because at that point we'll be substantially slower. So, for example, allow in-memory sorts to use up to 1GB, because it shouldn't take a long period of time to read that data in, and the sort will then be extremely fast. That means that the sort would be using that amount of memory for a short period of time. If we do have to fail to disk, cut back to 128MB, because having 8x that certainly won't make the sort run anywhere close to 8x faster. The trick would be releasing memory that a sort we thought could fit in memory but couldn't. It would also be good to start estimating which sorts should fit in memory and which won't before we start (AFAIK the current code assumes we'll fit in memory until it runs out). -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Add switches for DELIMITER and NULL in pg_dump COPY
On Wed, Mar 08, 2006 at 11:03:00AM -0500, Tom Lane wrote: David Fetter [EMAIL PROTECTED] writes: From the earlier discussion, it appears that there is a variety of opinions on what the COPY delimiter should be in pg_dump. This patch allows people to set it and the NULL string. Did anyone provide a convincing use case for this? I've had one so far, and it was enough to cause me to make a special patched version of pg_dump. To get some idea of how drastic that was, consider that I think it's generally bad practice to compile from source because it can take you too far off the generally supported software map. The case I had was making a database with a schema and initial data whose dump output gets checked into a source code management system. Those initial data sets, which can change--for example when the corresponding ISO codes do--may be in many different tables, so the easiest way to do this is to make the dump file as easy as possible to edit. It's of zero value from the perspective of pg_dump itself; the only possible argument is that it makes it easier for program-foo to parse the output of pg_dump. But I don't see any programs around to parse arbitrary SQL scripts, especially not the pretty-PG-specific scripts that pg_dump emits. It's less about program-foo parsing than about multi-table data management, as above. However, I'm sure that there are people who will find other uses for it. I think it much more likely that people needing this sort of thing would be using something like psql -c 'copy foo to stdout', so as to get the data without any added overhead. The one-table-at-a-time approach is quite error-prone for large numbers of tables and/or large data sets. So this seems like mere creeping featurism to me. pg_dump has too many switches already. I've been careful to see to it that only people who use the switches are affected by it. I am also volunteering to do ongoing maintenance of this feature. :) Cheers, D -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 415 235 3778 Remember to vote! ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [SQL] Interval subtracting
Scott Marlowe [EMAIL PROTECTED] writes: For same reasons, i.e. a need for precision, I find it hard to accept the idea of mixing positive and negative units in the same interval. The semantics are perfectly well defined, so I don't buy this. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [PATCHES] Add switches for DELIMITER and NULL in pg_dump COPY
On Wed, Mar 08, 2006 at 11:10:04AM -0500, Neil Conway wrote: On Wed, 2006-03-08 at 07:47 -0800, David Fetter wrote: From the earlier discussion, it appears that there is a variety of opinions on what the COPY delimiter should be in pg_dump. This patch allows people to set it and the NULL string. I'm still not convinced there is a reasonable use-case for this feature. I can't recall: did the previous discussion conclude that we actually want this functionality? The previous discussion showed that there is a wide diversity of opinions on what The Right Delimiter and The Right NULL String(TM) are. Thanks for the tips. :) I'll make a revised patch this evening, time permitting. Cheers, D -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 415 235 3778 Remember to vote! ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Merge algorithms for large numbers of tapes
Jim C. Nasby [EMAIL PROTECTED] writes: If we do have to fail to disk, cut back to 128MB, because having 8x that certainly won't make the sort run anywhere close to 8x faster. Not sure that follows. In particular, the entire point of the recent changes has been to extend the range in which we can use a single merge pass --- that is, write the data once as N sorted runs, then merge them in a single read pass. As soon as you have to do an actual merge-back- to-disk pass, your total I/O volume doubles, so there is definitely a considerable gain if that can be avoided. And a larger work_mem translates directly to fewer/longer sorted runs. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Running out of disk space during query
Tom, * Tom Lane ([EMAIL PROTECTED]) wrote: I've got the same problem with this that I do with the recently-proposed patch to fail queries with estimated cost X --- to wit, I think it will result in a net *reduction* in system reliability not an improvement. Any such feature changes the planner estimates from mere heuristics into a gating factor that will make queries fail entirely. And they are really not good enough to put that kind of trust into. Perhaps instead then have the system fail the query once it's gone beyond some configurable limit on temporary disk usage? The query still would have run for a while but it wouldn't have run the partition out of space and would have come back faster at least. Comparing this to work_mem- do we do something like this there? I don't think we do, which means we're trusting the planner's estimate to get the memory size estimate right and that can end up being way off resulting in queries taking up well beyond what work_mem would normally allow them... I recall alot of discussion but don't recall if anything was actually done to resolve that issue either. It seems to me we should probably: not trust the planner's estimates and therefore implement checks to fail things once we've gone well beyond what we expected to use. If we've done this for work_mem then using whatever we did there for a 'temporary disk space limit' would at least make me happy. If we havn't then perhaps we should do something for both. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] Add switches for DELIMITER and NULL in pg_dump COPY
David Fetter [EMAIL PROTECTED] writes: On Wed, Mar 08, 2006 at 11:03:00AM -0500, Tom Lane wrote: Did anyone provide a convincing use case for this? I've had one so far, and it was enough to cause me to make a special patched version of pg_dump. To get some idea of how drastic that was, consider that I think it's generally bad practice to compile from source because it can take you too far off the generally supported software map. The case I had was making a database with a schema and initial data whose dump output gets checked into a source code management system. So? Don't tell me your SCMS can't handle tabs. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [PATCHES] Add switches for DELIMITER and NULL in pg_dump COPY
On Wed, 2006-03-08 at 07:47 -0800, David Fetter wrote: From the earlier discussion, it appears that there is a variety of opinions on what the COPY delimiter should be in pg_dump. This patch allows people to set it and the NULL string. I'm still not convinced there is a reasonable use-case for this feature. I can't recall: did the previous discussion conclude that we actually want this functionality? *** src/bin/pg_dump/pg_dump.c 5 Mar 2006 15:58:50 - 1.433 --- src/bin/pg_dump/pg_dump.c 6 Mar 2006 07:32:12 - *** *** 114,119 --- 114,125 /* flag to turn on/off dollar quoting */ static intdisable_dollar_quoting = 0; + /* Things used when caller invokes COPY options. */ + #define ARG_COPY_DELIMITER 2 + #define ARG_COPY_NULL 3 + char *copy_delimiter = \t; + char *copy_null; + The variables should be declared static. static void help(const char *progname); static NamespaceInfo *findNamespace(Oid nsoid, Oid objoid); *** *** 181,186 --- 187,193 ExecStatusType expected); + int main(int argc, char **argv) { *** *** 211,217 char *outputSuperuser = NULL; RestoreOptions *ropt; ! static struct option long_options[] = { {data-only, no_argument, NULL, 'a'}, {blobs, no_argument, NULL, 'b'}, --- 218,224 char *outputSuperuser = NULL; RestoreOptions *ropt; ! static struct option long_options[] = { {data-only, no_argument, NULL, 'a'}, {blobs, no_argument, NULL, 'b'}, Please review patches and eliminate content-free hunks like these before submitting. *** *** 427,432 --- 464,479 } } + if (copy_null == NULL) + copy_null = malloc(3); + strcpy(copy_null, \\N); You're missing some braces. + if (strstr(copy_null, copy_delimiter)) + { + fprintf(stderr, _(In %s, the NULL AS string cannot contain the COPY delimiter.\n), progname); + exit(1); + } I'm not sure as to whether you should be using write_msg() or fprintf() here, but we should probably pick one and be consistent. Also ISTM we should be to refactor the code to use exit_nicely() anyway, provided that g_conn is initialized to NULL before we have connected to the DB. *** *** 702,707 --- 749,756 use SESSION AUTHORIZATION commands instead of\n OWNER TO commands \n)); + printf(_( --copy-delimiter string to use as column DELIMITER in COPY statements\n)); Capitalizing DELIMITER here is not good style, IMHO: it is just a normal word. *** 844,849 --- 893,904 int ret; char *copybuf; const char *column_list; + char *local_copy_delimiter; + char *local_copy_null; + local_copy_delimiter = malloc(2*strlen(copy_delimiter)+1); + PQescapeString (local_copy_delimiter, copy_delimiter, 2*strlen(copy_delimiter)+1); + local_copy_null = malloc(2*strlen(copy_null)+1); + PQescapeString (local_copy_null, copy_null, 2*strlen(copy_null)+1); Spacing: spaces around operands to mathematical operators, no spaces before the parameter list to a function call. You should also fix this compiler warning: [...]/pg_dump.c:440: warning: format '%d' expects type 'int', but argument 4 has type 'size_t' -Neil ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Add switches for DELIMITER and NULL in pg_dump COPY
On Wed, Mar 08, 2006 at 11:26:00AM -0500, Tom Lane wrote: David Fetter [EMAIL PROTECTED] writes: On Wed, Mar 08, 2006 at 11:03:00AM -0500, Tom Lane wrote: Did anyone provide a convincing use case for this? I've had one so far, and it was enough to cause me to make a special patched version of pg_dump. To get some idea of how drastic that was, consider that I think it's generally bad practice to compile from source because it can take you too far off the generally supported software map. The case I had was making a database with a schema and initial data whose dump output gets checked into a source code management system. So? Don't tell me your SCMS can't handle tabs. Not everybody's editor/mailer/whatever does this right, and it makes things fragile. Another way to do this is to change the delimter to a printable character like '|', but that raises hackles, too. Cheers, D -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 415 235 3778 Remember to vote! ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Add switches for DELIMITER and NULL in pg_dump COPY
David Fetter [EMAIL PROTECTED] writes: Not everybody's editor/mailer/whatever does this right, and it makes things fragile. Another way to do this is to change the delimter to a printable character like '|', but that raises hackles, too. Frankly if you're passing you data through an editor/mailer/whatever you don't trust then your setup is already fragile. At least if you're using tabs then you find out about these problems. Tiptoeing around the untrustworthy process just means that it'll fail randomly (and unpredictably) when other characters appear in the data that the software doesn't handle. There are certainly cases where you'll need to do this to interface with other (amateurish) software. But pg_dump isn't for that at all. Even COPY isn't a general purpose data formatter. To interface with other software not using a standard format you're going to have to pass the data through Perl or something like that anyways. -- greg ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Running out of disk space during query
On Wed, 2006-03-08 at 08:33 -0500, Stephen Frost wrote: Greetings, * Simon Riggs ([EMAIL PROTECTED]) wrote: work_mem= 1 GB benefit at 8 TB work_mem= 256MB benefit at 0.5 TB (based upon runs on average twice size of memory, and each logical tape requiring 256KB memory, i.e. min(work_mem/4, 6) * work_mem * 2, which for work_mem 2 MB gives 0.5 * work_mem^2) Seeing this reminded me of an issue I ran into recently. In 8.1 on a database that's only 16G, I ran a query that chewed up all the available disk space (about 250G, yes, 0.25TB) on the partition and then failed. Of course, this took many hours on a rather speedy box (and the disk array is a pretty nice IBM SAN so it's not exactly a slacker either) and produced nothing for me. I'd like to think it's often the case that Postgres has some idea what the total disk space usage of a given query is going to be prior to actually running the whole query and just seeing how much space it took at the highest point. If this can be done with some confidence then it'd be neat if Postgres could either check if there's enough disk space available and if not bail (I know, difficult to do cross-platform and there's tablespaces and whatnot to consider) OR if there was a parameter along the lines of max_temp_disk_space which would fail the query if that would be exceeded by the query. The latter could even be two GUC variables, one administrator set and unchangable by the user ('hard' limit) and one settable by the user with a sane default ('soft' limit) and perhaps a HINT which indicates how to change it in the error message when the limit is hit. I suppose I could put quotas in place or something but I don't really have a problem with the database as a whole using up a bunch of disk space (hence why it's got alot of room to grow into), I just would have liked a this will chew up more disk space than you have and then fail message instead of what ended up happening for this query. We can do work_space and maintenance_work_space fairly easily. We know how much we are writing, so we don't need to ask the OS how much it has left, just compare against the parameter and assume that it has been set correctly by the admin. Personally, I would rather abort a large sort before we ran for many hours and then hit those limits. That was the purpose of the statement_cost_limit parameter mentioned just recently. Top-down space allocation is essentially the same problem as top-down memory allocation. In both memory and tempspace we have a hard limit that if we go beyond, bad things happen. ISTM that we would like to logically allocate these resources from central pool(s) and then reclaim or return that allocation when you're done with it. In both cases the actual physical allocation would be made by the individual backend. It's fairly easy to track overall space, but its somewhat harder to force a single query to work within a single allocation since multiple steps might well want to allocate the same work_mem and have been optimized to expect they will get that size of allocation... Best Regards, Simon Riggs ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Merge algorithms for large numbers of tapes
On Wed, Mar 08, 2006 at 11:20:50AM -0500, Tom Lane wrote: Jim C. Nasby [EMAIL PROTECTED] writes: If we do have to fail to disk, cut back to 128MB, because having 8x that certainly won't make the sort run anywhere close to 8x faster. Not sure that follows. In particular, the entire point of the recent changes has been to extend the range in which we can use a single merge pass --- that is, write the data once as N sorted runs, then merge them in a single read pass. As soon as you have to do an actual merge-back- to-disk pass, your total I/O volume doubles, so there is definitely a considerable gain if that can be avoided. And a larger work_mem translates directly to fewer/longer sorted runs. But do fewer/longer sorted runs translate into not merging back to disk? I thought that was controlled by if we had to be able to rewind the result set. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] problem with large maintenance_work_mem settings and
On Wed, 2006-03-08 at 10:45 -0500, Tom Lane wrote: Stefan Kaltenbrunner [EMAIL PROTECTED] writes: CREATE INDEX on a 1,8B row table (5 int columns - index created on the first row about 300M distinct values): before: 11h 51min after: 3h 11min(!) Cool. Does it seem to be I/O bound now? Would you be willing to do it over with oprofile turned on? Very. Any chance of trying it with different maintenance_work_mem settings? Did you try this with trace_sort=on? If so could we get the logs for that? [Results welcome from other hackers...particularly with regard to queries with sort steps in rather than CREATE INDEX.] Best Regards, Simon Riggs ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Merge algorithms for large numbers of tapes
On Wed, 2006-03-08 at 10:21 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: 1. Earlier we had some results that showed that the heapsorts got slower when work_mem was higher and that concerns me most of all right now. Fair enough, but that's completely independent of the merge algorithm. (I don't think the Nyberg results necessarily apply to our situation anyway, as we are not sorting arrays of integers, and hence the cache effects are far weaker for us. I don't mind trying alternate sort algorithms, but I'm not going to believe an improvement in advance of direct evidence in our own environment.) Of course, this would be prototyped first...and I agree about possible variability of those results for us. 2. Improvement in the way we do overall memory allocation, so we would not have the problem of undersetting work_mem that we currently experience. If we solved this problem we would have faster sorts in *all* cases, not just extremely large ones. Dynamically setting work_mem higher when possible would be very useful. I think this would be extremely dangerous, as it would encourage processes to take more than their fair share of available resources. Fair share is the objective. I was trying to describe the general case so we could discuss a solution that would allow a dynamic approach rather than the static one we have now. Want to handle these cases: How much to allocate, when... A. we have predicted number of users B. we have a busy system - more than predicted number of users C. we have a quiet system - less than predicted number of users In B/C we have to be careful that we don't under/overallocate resources only to find the situation changes immediately afterwards. In many cases the static allocation is actually essential since you may be more interested in guaranteeing a conservative run time rather than seeking to produce occasional/unpredictable bursts of speed. But in many cases people want to have certain tasks go faster when its quiet and go slower when its not. Also, to the extent that you believe the problem is insufficient L2 cache, it seems increasing work_mem to many times the size of L2 will always be counterproductive. Sorry to confuse: (1) and (2) were completely separate, so no intended interaction between L2 cache and memory. (Certainly there is no value in increasing work_mem until we are in a regime where it consistently improves performance significantly, which it seems we aren't yet.) Very much agreed. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Status of TODO: postgresql.conf: reset to default when
On Sun, Mar 05, 2006 at 01:23:10PM -0500, Bruce Momjian wrote: Allow commenting of variables in postgresql.conf to restore them to defaults Currently, if a variable is commented out, it keeps the previous uncommented value until a server restarted. I take that to apply to the configuration re-read at SIGHUP? Right, and it is something we have needed to fix for a while. I'm working on this one. Joachim -- Joachim Wieland [EMAIL PROTECTED] C/ Usandizaga 12 1°B ICQ: 37225940 20002 Donostia / San Sebastian (Spain) GPG key available ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Merge algorithms for large numbers of tapes
Jim, On 3/8/06 9:49 AM, Jim C. Nasby [EMAIL PROTECTED] wrote: On Wed, Mar 08, 2006 at 11:20:50AM -0500, Tom Lane wrote: Not sure that follows. In particular, the entire point of the recent changes has been to extend the range in which we can use a single merge pass --- that is, write the data once as N sorted runs, then merge them in a single read pass. As soon as you have to do an actual merge-back- to-disk pass, your total I/O volume doubles, so there is definitely a considerable gain if that can be avoided. And a larger work_mem translates directly to fewer/longer sorted runs. But do fewer/longer sorted runs translate into not merging back to disk? I thought that was controlled by if we had to be able to rewind the result set. In the *tape* algorithm, there is an intermediate abstraction in the merging called tapes (!) that are used to store intermediate merge results. Simon's work implemented more tapes, which asymptotically approaches a single merge pass as the number of tapes approaches the number of runs. The Replacement Selection algorithm generally will produce about 1/2 the number of runs that a simpler partial sort algorithm would, and the more memory it uses, the fewer runs there are, and with fewer runs, fewer tapes are required to avoid more passes on the merge. This whole tape abstraction is something that I believe is unique to Postgres among modern databases, and we have found that by removing it entirely along with logtape.c, we remove 2000 lines of useless code that only complicates our optimization problem. - Luke ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Problemas with gram.y
Tom, sorry, but the address that you wrote tells that there isn´t any patch to apply. Is this patch Itagaki's one? How could I pick it? By the way, don´t worry about the whole idea. It's an experiment that shall be improved in the future, I hope. Best regards, Eduardo Morelli Tom Lane wrote: Um, are you aware that a patch for that was already submitted? http://momjian.postgresql.org/cgi-bin/pgpatches I find the whole idea pretty ugly myself. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Coverity Open Source Defect Scan of PostgreSQL
Martijn van Oosterhout wrote: On Tue, Mar 07, 2006 at 05:39:18PM -0500, Tom Lane wrote: Martijn van Oosterhout kleptog@svana.org writes: #ifdef STATIC_ANALYSIS #define ereport(elevel, rest) \ (errstart(elevel, __FILE__, __LINE__, PG_FUNCNAME_MACRO) ? \ (errfinish rest) : (void) 0), (elevel = ERROR ? exit(0) : 0) #else /* Normal def */ #endif Hmm, neat idea ... though I wonder whether either gcc or Coverity's tool is smart enough to draw the right conclusions from a conditional exit() call ... As for Coverity, if the elevel that's passed to the ereport is really a constant, the above #ifdef should absolutely do the trick for us so we know to stop analyzing on that path...Let me know if it doesn't actually do that ;) -ben ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Merge algorithms for large numbers of tapes
Jim C. Nasby [EMAIL PROTECTED] writes: But do fewer/longer sorted runs translate into not merging back to disk? I thought that was controlled by if we had to be able to rewind the result set. A plain SELECT ... ORDER BY doesn't assume that anymore. It is still required for some cases such as the input to a merge join, but the on-the-fly-final-merge code is going to be used a lot more in 8.2 than it was before. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Merge algorithms for large numbers of tapes
I do not clearly understand the sorting code in PostgreSQL. If I did have a good grasp of it, I would take a go at improving it. Here are some suggestions of things that I know work really, really well: #1. Two pass merge (none of that silly poly-tape merge goo) #2. Load ONLY the keys that are to be sorted into memory. Use a pointer exchange sort, and do not move the physical rows of data at all. I am pretty sure from this thread that PostgreSQL is not doing #1, and I have no idea if it is doing #2. A useful trick: Since merge is mentioned, I should say something else about merge joins. If you do not have room to load the sorted keys for bsearch, load every kth key (where k is computed by sizeof merge_ram / sizeof key_data). Then, when you have found the block the thing you are looking for by the kth key bsearch, bsearch that block. Now, maybe PostrgeSQL already uses tricks better than these. I don't know. But if they prove helpful suggestions I will be glad of it. -Original Message- From: [EMAIL PROTECTED] [mailto:pgsql-hackers- [EMAIL PROTECTED] On Behalf Of Tom Lane Sent: Wednesday, March 08, 2006 12:32 PM To: Jim C. Nasby Cc: Luke Lonergan; Simon Riggs; pgsql-hackers@postgreSQL.org Subject: Re: [HACKERS] Merge algorithms for large numbers of tapes Jim C. Nasby [EMAIL PROTECTED] writes: But do fewer/longer sorted runs translate into not merging back to disk? I thought that was controlled by if we had to be able to rewind the result set. A plain SELECT ... ORDER BY doesn't assume that anymore. It is still required for some cases such as the input to a merge join, but the on-the-fly-final-merge code is going to be used a lot more in 8.2 than it was before. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Problemas with gram.y
etmorelli wrote: Tom, sorry, but the address that you wrote tells that there isn?t any patch to apply. Is this patch Itagaki's one? How could I pick it? By the way, don?t worry about the whole idea. It's an experiment that shall be improved in the future, I hope. Best regards, Eduardo Morelli Tom Lane wrote: Um, are you aware that a patch for that was already submitted? http://momjian.postgresql.org/cgi-bin/pgpatches I find the whole idea pretty ugly myself. The patch has moved to: http://momjian.postgresql.org/cgi-bin/pgpatches_hold -- Bruce Momjian http://candle.pha.pa.us SRA OSS, Inc. http://www.sraoss.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Merge algorithms for large numbers of tapes
Dann, On 3/8/06 12:39 PM, Dann Corbit [EMAIL PROTECTED] wrote: Here are some suggestions of things that I know work really, really well: Can you point to an example? That might help move the discussion along. The reason to interject about the tape goo in this discussion is that we seem to be spending a lot of time optimizing around the tape goo without tackling the overall structure of the external sort. I think we'll just end up having to replace all of this goo when we really get around to fixing the problem. Add to this that other commercial databases external sort in 1/4 the time or better on the same hardware with the same CPU/memory resources using a 2-pass external sort. #1. Two pass merge (none of that silly poly-tape merge goo) Voice of reason here. It's what the other database systems do. #2. Load ONLY the keys that are to be sorted into memory. Use a pointer exchange sort, and do not move the physical rows of data at all. Sounds right. Example of this in practice? I am pretty sure from this thread that PostgreSQL is not doing #1, and I have no idea if it is doing #2. Yep. Even Knuth says that the tape goo is only interesting from a historical perspective and may not be relevant in an era of disk drives. - Luke ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [PATCHES] Add switches for DELIMITER and NULL in pg_dump COPY
On Wed, 2006-03-08 at 08:20 -0800, David Fetter wrote: The previous discussion showed that there is a wide diversity of opinions on what The Right Delimiter and The Right NULL String(TM) are. Barring a more convincing justification for why we need this feature, I'm inclined to side with Tom: pg_dump has enough obscure options as it is, and I can't imagine very many people needing this functionality. -Neil ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Merge algorithms for large numbers of tapes
-Original Message- From: Luke Lonergan [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 08, 2006 1:52 PM To: Dann Corbit; Tom Lane; Jim C. Nasby Cc: Simon Riggs; pgsql-hackers@postgreSQL.org Subject: Re: [HACKERS] Merge algorithms for large numbers of tapes Dann, On 3/8/06 12:39 PM, Dann Corbit [EMAIL PROTECTED] wrote: Here are some suggestions of things that I know work really, really well: Can you point to an example? That might help move the discussion along. I wrote all of the sorting and merging stuff for CONNX Solutions http://www.connx.com I have carefully benched all of this stuff and (at least for our system) the ideas I propose work well. Of course, every system is different and the only way to know if it is an improvement is to try it in place. The reason to interject about the tape goo in this discussion is that we seem to be spending a lot of time optimizing around the tape goo without tackling the overall structure of the external sort. I think we'll just end up having to replace all of this goo when we really get around to fixing the problem. I suggest trying several alternatives and benching them with real world queries and especially with the open database benchmark suite. Add to this that other commercial databases external sort in 1/4 the time or better on the same hardware with the same CPU/memory resources using a 2-pass external sort. Our sort merge is so fast that I can join two tables on a column with no index faster than on a database that has a unique clustered index on the column. Benchmarked against Oracle, SQL*Server, and several others. If you check our ORDER BY on a large table with no index, you will see that it is competitive with the best commercial systems. If you are interested, you could get an eval of CONNX and try it yourself (eval is free for some number of days, I don't remember what). #1. Two pass merge (none of that silly poly-tape merge goo) Voice of reason here. It's what the other database systems do. #2. Load ONLY the keys that are to be sorted into memory. Use a pointer exchange sort, and do not move the physical rows of data at all. Sounds right. Example of this in practice? It is what we use here. It is the only way to fly. This is well known, and if you read a few articles from the ACM, you will see that it has been known for decades. I am pretty sure from this thread that PostgreSQL is not doing #1, and I have no idea if it is doing #2. Yep. Even Knuth says that the tape goo is only interesting from a historical perspective and may not be relevant in an era of disk drives. - Luke ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Coverity Open Source Defect Scan of PostgreSQL
Folks, As for Coverity, if the elevel that's passed to the ereport is really a constant, the above #ifdef should absolutely do the trick for us so we know to stop analyzing on that path...Let me know if it doesn't actually do that ;) Um, I think the answer is to train Coverity, not change our code to avoid the false-positives. I know that Coverity is sophisticated enough to, for example, be programed to understand that elog(ERROR) does not continue. Actually, I thougth that Neil/eDB did this with their copy. Is there any way to get a copy of that training configuration? -- --Josh Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Coverity Open Source Defect Scan of PostgreSQL
On 3/8/06, Josh Berkus josh@agliodbs.com wrote: Actually, I thougth that Neil/eDB did this with their copy.Is there anyway to get a copy of that training configuration? I think we have a backup of it somewhere. I'll look into it. -- Jonah H. Harris, Database Internals ArchitectEnterpriseDB Corporation732.331.1324
Re: [HACKERS] Merge algorithms for large numbers of tapes
There are some articles here that are worth reading if you want to sort fast: http://research.microsoft.com/barc/SortBenchmark/ -Original Message- From: [EMAIL PROTECTED] [mailto:pgsql-hackers- [EMAIL PROTECTED] On Behalf Of Dann Corbit Sent: Wednesday, March 08, 2006 1:59 PM To: Luke Lonergan; Tom Lane; Jim C. Nasby Cc: Simon Riggs; pgsql-hackers@postgreSQL.org Subject: Re: [HACKERS] Merge algorithms for large numbers of tapes -Original Message- From: Luke Lonergan [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 08, 2006 1:52 PM To: Dann Corbit; Tom Lane; Jim C. Nasby Cc: Simon Riggs; pgsql-hackers@postgreSQL.org Subject: Re: [HACKERS] Merge algorithms for large numbers of tapes Dann, On 3/8/06 12:39 PM, Dann Corbit [EMAIL PROTECTED] wrote: Here are some suggestions of things that I know work really, really well: Can you point to an example? That might help move the discussion along. I wrote all of the sorting and merging stuff for CONNX Solutions http://www.connx.com I have carefully benched all of this stuff and (at least for our system) the ideas I propose work well. Of course, every system is different and the only way to know if it is an improvement is to try it in place. The reason to interject about the tape goo in this discussion is that we seem to be spending a lot of time optimizing around the tape goo without tackling the overall structure of the external sort. I think we'll just end up having to replace all of this goo when we really get around to fixing the problem. I suggest trying several alternatives and benching them with real world queries and especially with the open database benchmark suite. Add to this that other commercial databases external sort in 1/4 the time or better on the same hardware with the same CPU/memory resources using a 2-pass external sort. Our sort merge is so fast that I can join two tables on a column with no index faster than on a database that has a unique clustered index on the column. Benchmarked against Oracle, SQL*Server, and several others. If you check our ORDER BY on a large table with no index, you will see that it is competitive with the best commercial systems. If you are interested, you could get an eval of CONNX and try it yourself (eval is free for some number of days, I don't remember what). #1. Two pass merge (none of that silly poly-tape merge goo) Voice of reason here. It's what the other database systems do. #2. Load ONLY the keys that are to be sorted into memory. Use a pointer exchange sort, and do not move the physical rows of data at all. Sounds right. Example of this in practice? It is what we use here. It is the only way to fly. This is well known, and if you read a few articles from the ACM, you will see that it has been known for decades. I am pretty sure from this thread that PostgreSQL is not doing #1, and I have no idea if it is doing #2. Yep. Even Knuth says that the tape goo is only interesting from a historical perspective and may not be relevant in an era of disk drives. - Luke ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on include file, per author
Bruce Momjian wrote: Log Message: --- Remove Christof Petig copyright on include file, per author request. Huh, I thought what he actually told was that the file was released under BSD license. Maybe I missed it, but I didn't see him asking to remove the copyright. We certainly have copyrights attributed to individual people. Jan Wieck has his name on the PL/Tcl and PL/pgSQL files, for example. -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on include file, per author
Alvaro Herrera wrote: Bruce Momjian wrote: Log Message: --- Remove Christof Petig copyright on include file, per author request. Huh, I thought what he actually told was that the file was released under BSD license. Maybe I missed it, but I didn't see him asking to remove the copyright. Also, now that I see the actual diff, I see you only removed the $PostgreSQL$ tag from ecpg_informix.h. I assume this was unintended. -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on include file,
Alvaro Herrera wrote: Bruce Momjian wrote: Log Message: --- Remove Christof Petig copyright on include file, per author request. Huh, I thought what he actually told was that the file was released under BSD license. Maybe I missed it, but I didn't see him asking to remove the copyright. We certainly have copyrights attributed to individual people. Jan Wieck has his name on the PL/Tcl and PL/pgSQL files, for example. We should not have individual copyrights to individuals in our source tree. If Jan's is in there, it should be removed too (with his approval). The only copyright holder should be PostgreSQL Global Development Group. Jan, would you fix that? -- Bruce Momjian http://candle.pha.pa.us SRA OSS, Inc. http://www.sraoss.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on
Alvaro Herrera wrote: Alvaro Herrera wrote: Bruce Momjian wrote: Log Message: --- Remove Christof Petig copyright on include file, per author request. Huh, I thought what he actually told was that the file was released under BSD license. Maybe I missed it, but I didn't see him asking to remove the copyright. Also, now that I see the actual diff, I see you only removed the $PostgreSQL$ tag from ecpg_informix.h. I assume this was unintended. Intended. None of the other include files in that directory had it. Either all should, or none. Should we add it to all of them? -- Bruce Momjian http://candle.pha.pa.us SRA OSS, Inc. http://www.sraoss.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on include file, per author
Bruce Momjian wrote: Alvaro Herrera wrote: Bruce Momjian wrote: Log Message: --- Remove Christof Petig copyright on include file, per author request. Huh, I thought what he actually told was that the file was released under BSD license. Maybe I missed it, but I didn't see him asking to remove the copyright. We certainly have copyrights attributed to individual people. Jan Wieck has his name on the PL/Tcl and PL/pgSQL files, for example. We should not have individual copyrights to individuals in our source tree. If Jan's is in there, it should be removed too (with his approval). The only copyright holder should be PostgreSQL Global Development Group. Why give all our code to a organisation that doesn't exist? On the $PostgreSQL$ issue, I think it's best to have the tag everywhere, so if these files were the only ones that had it, I'd rather add them to the rest ... -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on
Bruce Momjian pgman@candle.pha.pa.us writes: Intended. None of the other include files in that directory had it. Either all should, or none. Should we add it to all of them? All. There's no good excuse for not having a version tag on every source-code file. The contrib stuff is pretty bad about this, but hey, it's just contrib ;-). ecpg is part of the main tree and it ought to be up to project standards. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Merge algorithms for large numbers of tapes
Dann Corbit [EMAIL PROTECTED] writes: Here are some suggestions of things that I know work really, really well: #1. Two pass merge (none of that silly poly-tape merge goo) This amounts to an assumption that you have infinite work_mem, in which case you hardly need an external sort at all. If your work_mem is in fact finite, then at some point you need more than two passes. I'm not really interested in ripping out support for sort operations that are much larger than work_mem. #2. Load ONLY the keys that are to be sorted into memory. Use a pointer exchange sort, and do not move the physical rows of data at all. This suggestion isn't a whole lot better; in general the rows to be sorted don't exist until we compute them, and so proposing that we don't load them until later is pretty much irrelevant. Also, in a lot of common cases the keys to be sorted are the bulk of the data anyway. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Merge algorithms for large numbers of tapes
-Original Message- From: Tom Lane [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 08, 2006 3:17 PM To: Dann Corbit Cc: Jim C. Nasby; Luke Lonergan; Simon Riggs; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Merge algorithms for large numbers of tapes Dann Corbit [EMAIL PROTECTED] writes: Here are some suggestions of things that I know work really, really well: #1. Two pass merge (none of that silly poly-tape merge goo) This amounts to an assumption that you have infinite work_mem, in which case you hardly need an external sort at all. If your work_mem is in fact finite, then at some point you need more than two passes. I'm not really interested in ripping out support for sort operations that are much larger than work_mem. No it does not. I have explained this before. You can have one million files and merge them all into a final output with a single pass. It does not matter how big they are or how much memory you have. The idea is very simple. Each subfile has its top record inserted into a priority queue of file handles (or whatever else you want to use -- temp tables, you name it). When you extract the smallest record from the queue, the priority changes and that file handle gets moved to a new place in the queue. You keep pulling records from the queue until the entire queue is empty. The outline is like this: 1. Sort chunks 2. Write chunks 3. Insert top record of chunks into priority queue 4. Extract records from queue, writing them to final output 5. Repeat step 4 until queue is empty. #2. Load ONLY the keys that are to be sorted into memory. Use a pointer exchange sort, and do not move the physical rows of data at all. This suggestion isn't a whole lot better; in general the rows to be sorted don't exist until we compute them, and so proposing that we don't load them until later is pretty much irrelevant. Also, in a lot of common cases the keys to be sorted are the bulk of the data anyway. This suggestion is in addition to suggestion 1. They are not even related except that both suggestions make the sort run a lot faster. I think I did not explain it clearly enough. Suppose that you have a set of rows you need to sort. Instead of loading the whole row into memory, just load the columns (or parts of columns) that are being sorted. I hope that it is more clear now. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Coverity Open Source Defect Scan of PostgreSQL
Ben Chelf [EMAIL PROTECTED] writes: #ifdef STATIC_ANALYSIS #define ereport(elevel, rest) \ (errstart(elevel, __FILE__, __LINE__, PG_FUNCNAME_MACRO) ? \ (errfinish rest) : (void) 0), (elevel = ERROR ? exit(0) : 0) #else /* Normal def */ #endif As for Coverity, if the elevel that's passed to the ereport is really a constant, the above #ifdef should absolutely do the trick for us so we know to stop analyzing on that path...Let me know if it doesn't actually do that ;) If you're willing to require elevel to always be a constant then why not just tack on the (elevel = ERROR ? exit(0) : 0) onto the end of the regular definition of ereport instead of having an ifdef? Incidentally, if it's not guaranteed to be a constant then the definition above is wrong because it's missing parentheses around elevel at both occurrences. -- greg ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Merge algorithms for large numbers of tapes
Luke Lonergan [EMAIL PROTECTED] writes: I am pretty sure from this thread that PostgreSQL is not doing #1, and I have no idea if it is doing #2. Yep. Even Knuth says that the tape goo is only interesting from a historical perspective and may not be relevant in an era of disk drives. As the size of the data grows larger the behaviour of hard drives looks more and more like tapes. The biggest factor controlling the speed of i/o operations is how many seeks are required to complete them. Effectively rewinds are still the problem it's just that the cost of rewinds becomes constant regardless of how long the tape is. That's one thing that gives me pause about the current approach of using more tapes. It seems like ideally the user would create a temporary work space on each spindle and the database would arrange to use no more than that number of tapes. Then each merge operation would involve only sequential access for both reads and writes. -- greg ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Merge algorithms for large numbers of tapes
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 08, 2006 3:56 PM To: Luke Lonergan Cc: Dann Corbit; Tom Lane; Jim C. Nasby; Simon Riggs; pgsql- [EMAIL PROTECTED] Subject: Re: [HACKERS] Merge algorithms for large numbers of tapes Luke Lonergan [EMAIL PROTECTED] writes: I am pretty sure from this thread that PostgreSQL is not doing #1, and I have no idea if it is doing #2. Yep. Even Knuth says that the tape goo is only interesting from a historical perspective and may not be relevant in an era of disk drives. As the size of the data grows larger the behaviour of hard drives looks more and more like tapes. The biggest factor controlling the speed of i/o operations is how many seeks are required to complete them. Effectively rewinds are still the problem it's just that the cost of rewinds becomes constant regardless of how long the tape is. That's one thing that gives me pause about the current approach of using more tapes. It seems like ideally the user would create a temporary work space on each spindle and the database would arrange to use no more than that number of tapes. Then each merge operation would involve only sequential access for both reads and writes. If the chief concern is in the number of subfiles created, replacement selection doubles the length of the subfiles while consuming no more memory. {The big-O of the algorithm sucks, though} It is certainly worth testing several cases. It is not a bad idea to enable more than one method of performing an operation. In the ideal case, you would have specific information about drives, spindles, rates for seek, transfer, etc. It all depends on how much effort you want to throw at it. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Merge algorithms for large numbers of tapes
On Wed, Mar 08, 2006 at 10:49:16AM -0800, Luke Lonergan wrote: Jim, On 3/8/06 9:49 AM, Jim C. Nasby [EMAIL PROTECTED] wrote: On Wed, Mar 08, 2006 at 11:20:50AM -0500, Tom Lane wrote: Not sure that follows. In particular, the entire point of the recent changes has been to extend the range in which we can use a single merge pass --- that is, write the data once as N sorted runs, then merge them in a single read pass. As soon as you have to do an actual merge-back- to-disk pass, your total I/O volume doubles, so there is definitely a considerable gain if that can be avoided. And a larger work_mem translates directly to fewer/longer sorted runs. But do fewer/longer sorted runs translate into not merging back to disk? I thought that was controlled by if we had to be able to rewind the result set. In the *tape* algorithm, there is an intermediate abstraction in the merging called tapes (!) that are used to store intermediate merge results. Simon's work implemented more tapes, which asymptotically approaches a single merge pass as the number of tapes approaches the number of runs. The Replacement Selection algorithm generally will produce about 1/2 the number of runs that a simpler partial sort algorithm would, and the more memory it uses, the fewer runs there are, and with fewer runs, fewer tapes are required to avoid more passes on the merge. This whole tape abstraction is something that I believe is unique to Postgres among modern databases, and we have found that by removing it entirely along with logtape.c, we remove 2000 lines of useless code that only complicates our optimization problem. Oh, geez, I think I get it now. I was thinking that we did something like sort a chunk, write it to disk, repeat until all data processed and then just read from the stuff on disk in order, switching between files as needed. But of course that would suck horribly if we were actually using tapes. Like others have said, surely there's got to be a much better way to go about things with more modern hardware. If there is, then hopefully the possibility exists of returning memory back to the pool if it's not going to be as useful as it would be to a sort that would fit in-memory. As an example, in my hypothetical algorithm that sorts one chunk at a time and then bounces between chunks when reading the data back out, it would probably be better to have fewer, larger chunks than many more small ones. But the difference between 256M chunks and 1GB chunks probably wouldn't be that big a difference, certainly not a 4x improvement. So it makes sense to go with the smaller chunks if it means that other sorts would be able to operate entirely in-memory. In an ideal world, this allocation could even by dynamic, based on what else was happening on the machine. But I'll take any incremental improvement I can get right now. :) Just having the ability to set a more aggressive work_mem without worrying about causing a swap storm would be a huge improvement over the current situation. Being able to cut back on memory use when we fall back to disk would be icing on the cake. :) -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Merge algorithms for large numbers of tapes
An interesting read at http://www.vldb.org/conf/1997/P376.PDFOn 3/8/06, Dann Corbit [EMAIL PROTECTED] wrote: -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] Sent: Wednesday, March 08, 2006 3:56 PM To: Luke Lonergan Cc: Dann Corbit; Tom Lane; Jim C. Nasby; Simon Riggs; pgsql- [EMAIL PROTECTED] Subject: Re: [HACKERS] Merge algorithms for large numbers of tapes Luke Lonergan [EMAIL PROTECTED] writes: I am pretty sure from this thread that PostgreSQL is not doing #1,and I have no idea if it is doing #2. Yep.Even Knuth says that the tape goo is only interesting from a historical perspective and may not be relevant in an era of diskdrives. As the size of the data grows larger the behaviour of hard driveslooks more and more like tapes. The biggest factor controlling the speed of i/o operations is how many seeks are required to complete them.Effectively rewinds are still the problem it's just that the cost of rewindsbecomes constant regardless of how long the tape is. That's one thing that gives me pause about the current approach ofusing more tapes. It seems like ideally the user would create a temporary workspace on each spindle and the database would arrange to use no more than that number of tapes. Then each merge operation would involve only sequential accessfor both reads and writes.If the chief concern is in the number of subfiles created, replacement selection doubles the length of the subfiles while consuming no morememory.{The big-O of the algorithm sucks, though}It is certainly worth testing several cases.It is not a bad idea to enable more than one method of performing an operation.In the ideal case, you would have specific information about drives,spindles, rates for seek, transfer, etc.It all depends on how much effort you want to throw at it.---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq-- Jonah H. Harris, Database Internals Architect EnterpriseDB Corporation732.331.1324
Re: [HACKERS] Merge algorithms for large numbers of tapes
On Wed, Mar 08, 2006 at 03:35:53PM -0800, Dann Corbit wrote: I think I did not explain it clearly enough. Suppose that you have a set of rows you need to sort. Instead of loading the whole row into memory, just load the columns (or parts of columns) that are being sorted. I hope that it is more clear now. The issue is that there is a non-trivial amount of overhead in going back to disk to get the raw data, and then you have to parse that into a valid in-memory tuple. A worst-case scenario is if you're sorting all the data that you've been asked to retrieve, ie: SELECT a, b, c ... ORDER BY b, a, c; That case is almost guaranteed to take longer if you try and do it with just pointers. But there is the other case: SELECT a, b, c, big_honking_text_field ... ORDER BY a, b, c; In this example it's entirely possible that leaving the big_honking field out of the actual sorting would be a big win. Especially if your temporary space was on a different set of spindles. Regarding your suggestion of testing different kinds of sorts, that's certainly a good idea if it can be done without a huge amount of work coding each one up. Ultimately, it might make the most sense to support multiple sort algorithms (at least for now) and let the planner decide which one to use. That would at least get us a lot more real-world data than any other method would. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Merge algorithms for large numbers of tapes
-Original Message- From: Jim C. Nasby [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 08, 2006 5:44 PM To: Dann Corbit Cc: Tom Lane; Luke Lonergan; Simon Riggs; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Merge algorithms for large numbers of tapes On Wed, Mar 08, 2006 at 03:35:53PM -0800, Dann Corbit wrote: I think I did not explain it clearly enough. Suppose that you have a set of rows you need to sort. Instead of loading the whole row into memory, just load the columns (or parts of columns) that are being sorted. I hope that it is more clear now. The issue is that there is a non-trivial amount of overhead in going back to disk to get the raw data, and then you have to parse that into a valid in-memory tuple. A worst-case scenario is if you're sorting all the data that you've been asked to retrieve, ie: SELECT a, b, c ... ORDER BY b, a, c; That case is almost guaranteed to take longer if you try and do it with just pointers. But there is the other case: SELECT a, b, c, big_honking_text_field ... ORDER BY a, b, c; In this example it's entirely possible that leaving the big_honking field out of the actual sorting would be a big win. Especially if your temporary space was on a different set of spindles. Regarding your suggestion of testing different kinds of sorts, that's certainly a good idea if it can be done without a huge amount of work coding each one up. Ultimately, it might make the most sense to support multiple sort algorithms (at least for now) and let the planner decide which one to use. That would at least get us a lot more real-world data than any other method would. I typically do it something like this: MSD_Radix_Sort_Hunks() { // We might have to bail for many reasons : // Early part of the key may be identical for all rows // We may not have a binning algorithm for this data type // We may also only partially sort with MSD Radix sort If (Set_Is_Too_Small_Or_Otherwise_Bail()) { Introspective_Sort_Hunks(); } Else MSD_Radix_Alg(); // Cookie cutter of data stream into sorted hunks } Introspective_Sort_Hunks() { If (Set_Is_Too_Small_Or_Otherwise_Bail()) { Ford_Johnson_Variant(); // Near optimal sort of very small sets } Else Introspective_Alg();// Cookie cutter of data stream into sorted hunks } Queue_based_hunk_merge(); Now, you might have a merge that makes choices on entry similar to the way that my sorts make choices on entry. You will notice that my sorts decide internally on what algorithm to perform. Certainly, this is a simple approach that can generalize in many ways. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Merge algorithms for large numbers of tapes
On Wed, Mar 08, 2006 at 06:55:59PM -0500, Greg Stark wrote: Luke Lonergan [EMAIL PROTECTED] writes: I am pretty sure from this thread that PostgreSQL is not doing #1, and I have no idea if it is doing #2. Yep. Even Knuth says that the tape goo is only interesting from a historical perspective and may not be relevant in an era of disk drives. As the size of the data grows larger the behaviour of hard drives looks more and more like tapes. The biggest factor controlling the speed of i/o operations is how many seeks are required to complete them. Effectively rewinds are still the problem it's just that the cost of rewinds becomes constant regardless of how long the tape is. But it will take a whole lot of those rewinds to equal the amount of time required by an additional pass through the data. I'll venture a guess that as long as you've got enough memory to still read chunks back in 8k blocks that it won't be possible for a multi-pass sort to out-perform a one-pass sort. Especially if you also had the ability to do pre-fetching (not something to fuss with now, but certainly a possibility in the future). In any case, what we really need is at least good models backed by good drive performance data. And we really should have that anyway so that we can improve upon our cost estimator functions. I'm betting that what that will show us is that no single sort method is going to work best for all cases. IE: I'd bet that if your data set is sufficiently larger than available memory that you'll actually be better off with a multi-pass approach over a single/two pass approach. That's one thing that gives me pause about the current approach of using more tapes. It seems like ideally the user would create a temporary work space on each spindle and the database would arrange to use no more than that number of tapes. Then each merge operation would involve only sequential access for both reads and writes. For that to be of any use, wouldn't you need to use only as many tapes as spindles/2? Otherwise you're still trying to read and write from the same set of drives, which means you're probably doing a lot of seeking. Or do the tape algorithms re-write data as they read it? -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on include file, per author
On Wed, Mar 08, 2006 at 07:37:14PM -0300, Alvaro Herrera wrote: On the $PostgreSQL$ issue, I think it's best to have the tag everywhere, so if these files were the only ones that had it, I'd rather add them to the rest ... +1. I'd also argue that it should be in contrib as well. I can submit a patch for contrib or the entire tree if desired. Convention is that it goes near the top of the file, correct? -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] [PATCHES] Add switches for DELIMITER and NULL in pg_dump COPY
On Wed, Mar 08, 2006 at 04:57:52PM -0500, Neil Conway wrote: On Wed, 2006-03-08 at 08:20 -0800, David Fetter wrote: The previous discussion showed that there is a wide diversity of opinions on what The Right Delimiter and The Right NULL String(TM) are. Barring a more convincing justification for why we need this feature, I'm inclined to side with Tom: pg_dump has enough obscure options as it is, and I can't imagine very many people needing this functionality. Given all the different requests that come in for pg_dump and copy, maybe it makes sense for Someone Who Cares to start a pgFoundry project (or maybe extend the import/export project that's already there). -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Merge algorithms for large numbers of tapes
Dann Corbit wrote: I do not clearly understand the sorting code in PostgreSQL. If I did have a good grasp of it, I would take a go at improving it. Show me the code (and the benchmarks). Seriously. We see regular discussions on this and similar topics, but I haven't seen a patch that anyone has proven is an unequivocal improvement. that I can recall. cheers andrew ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on include file, per author
Jim C. Nasby wrote: +1. I'd also argue that it should be in contrib as well. I can submit a patch for contrib or the entire tree if desired. Convention is that it goes near the top of the file, correct? Correct. -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on include file, per author
On Wed, Mar 08, 2006 at 11:27:23PM -0300, Alvaro Herrera wrote: Jim C. Nasby wrote: +1. I'd also argue that it should be in contrib as well. I can submit a patch for contrib or the entire tree if desired. Convention is that it goes near the top of the file, correct? Correct. K, I'll work on everything but contrib tonight. What do people think about contrib? Yea or nay for enforcing $PostgreSQL$? -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Merge algorithms for large numbers of tapes
Jim C. Nasby [EMAIL PROTECTED] writes: On Wed, Mar 08, 2006 at 06:55:59PM -0500, Greg Stark wrote: Luke Lonergan [EMAIL PROTECTED] writes: I am pretty sure from this thread that PostgreSQL is not doing #1, and I have no idea if it is doing #2. Yep. Even Knuth says that the tape goo is only interesting from a historical perspective and may not be relevant in an era of disk drives. As the size of the data grows larger the behaviour of hard drives looks more and more like tapes. The biggest factor controlling the speed of i/o operations is how many seeks are required to complete them. Effectively rewinds are still the problem it's just that the cost of rewinds becomes constant regardless of how long the tape is. But it will take a whole lot of those rewinds to equal the amount of time required by an additional pass through the data. I'll venture a guess that as long as you've got enough memory to still read chunks back in 8k blocks that it won't be possible for a multi-pass sort to out-perform a one-pass sort. Well that's clearly a bit overoptimistic. If we believe the random page cost of 4 then having more tapes than you have spindles would impose a penalty equal to having four times as many passes. (And that's *with* the 8k block size. And with the kernel performing pre-fetch already too.) For that to be of any use, wouldn't you need to use only as many tapes as spindles/2? Otherwise you're still trying to read and write from the same set of drives, which means you're probably doing a lot of seeking. Or do the tape algorithms re-write data as they read it? Well, spindles-1. I was thinking as many tapes as you have spindles *in total*, ie, including the output tape. You only have one output tape for each n-way merge though. -- greg ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright
On Wed, 8 Mar 2006, Jim C. Nasby wrote: On Wed, Mar 08, 2006 at 11:27:23PM -0300, Alvaro Herrera wrote: Jim C. Nasby wrote: +1. I'd also argue that it should be in contrib as well. I can submit a patch for contrib or the entire tree if desired. Convention is that it goes near the top of the file, correct? Correct. K, I'll work on everything but contrib tonight. What do people think about contrib? Yea or nay for enforcing $PostgreSQL$? Any code in our source tree should conform to the same version tagging, so Yah ... Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: [EMAIL PROTECTED] Yahoo!: yscrappy ICQ: 7615664 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on include file, per author
Jim C. Nasby [EMAIL PROTECTED] writes: Convention is that it goes near the top of the file, correct? It goes at the bottom of the file header block comment ... if there isn't a block comment at the start of the file, there should be. We have a thousand or so examples to follow ;-) regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] pg_freespacemap question
Tom Lane wrote: Mark Kirkwood [EMAIL PROTECTED] writes: Good points! I had not noticed this test case. Probably NULL is better Would setting it to 'BLCKSZ - (fixed index header stuff)' be better, No, I don't think so, because that will just make it harder to recognize what's what (remember that BLCKSZ isn't really a constant, and the index overhead is not the same for all AMs either). The point here is that for indexes the FSM tracks whole-page availability, not the amount of free space within pages. So I think NULL is a reasonable representation of that. Using NULL will make it easy to filter the results if you want to see only heap-page data or only index-page data, whereas it will be very hard to do that if the view adopts an ultimately-artificial convention about the amount of available space on an index page. Right - after suggesting it I realized that coding the different index overhead for each possible AM would have been ... difficult :-). A patch is attached to implement the NULL free bytes and other recommendations: 1/ Index free bytes set to NULL 2/ Comment added to the README briefly mentioning the index business 3/ Columns reordered more logically 4/ 'Blockid' column removed 5/ Free bytes column renamed to just 'bytes' instead of 'blockfreebytes' Now 5/ was only hinted at, but seemed worth doing while I was there (hopefully I haven't made it too terse now). cheers Mark Index: pg_freespacemap.c === RCS file: /projects/cvsroot/pgsql/contrib/pg_freespacemap/pg_freespacemap.c,v retrieving revision 1.2 diff -c -r1.2 pg_freespacemap.c *** pg_freespacemap.c 14 Feb 2006 15:03:59 - 1.2 --- pg_freespacemap.c 9 Mar 2006 03:38:10 - *** *** 12,18 #include storage/freespace.h #include utils/relcache.h ! #define NUM_FREESPACE_PAGES_ELEM6 #if defined(WIN32) || defined(__CYGWIN__) /* Need DLLIMPORT for some things that are not so marked in main headers */ --- 12,18 #include storage/freespace.h #include utils/relcache.h ! #define NUM_FREESPACE_PAGES_ELEM5 #if defined(WIN32) || defined(__CYGWIN__) /* Need DLLIMPORT for some things that are not so marked in main headers */ *** *** 29,40 typedef struct { - uint32 blockid; - uint32 relfilenode; uint32 reltablespace; uint32 reldatabase; uint32 relblocknumber; ! uint32 blockfreebytes; } FreeSpacePagesRec; --- 29,40 typedef struct { uint32 reltablespace; uint32 reldatabase; + uint32 relfilenode; uint32 relblocknumber; ! uint32 bytes; ! boolisindex; } FreeSpacePagesRec; *** *** 91,107 /* Construct a tuple to return. */ tupledesc = CreateTemplateTupleDesc(NUM_FREESPACE_PAGES_ELEM, false); ! TupleDescInitEntry(tupledesc, (AttrNumber) 1, blockid, ! INT4OID, -1, 0); ! TupleDescInitEntry(tupledesc, (AttrNumber) 2, relfilenode, OIDOID, -1, 0); ! TupleDescInitEntry(tupledesc, (AttrNumber) 3, reltablespace, OIDOID, -1, 0); ! TupleDescInitEntry(tupledesc, (AttrNumber) 4, reldatabase, OIDOID, -1, 0); ! TupleDescInitEntry(tupledesc, (AttrNumber) 5, relblocknumber, INT8OID, -1, 0); ! TupleDescInitEntry(tupledesc, (AttrNumber) 6, blockfreebytes, INT4OID, -1, 0); /* Generate attribute metadata needed later to produce tuples */ --- 91,105 /* Construct a tuple to return. */ tupledesc = CreateTemplateTupleDesc(NUM_FREESPACE_PAGES_ELEM, false); ! TupleDescInitEntry(tupledesc, (AttrNumber) 1, reltablespace, OIDOID, -1, 0); ! TupleDescInitEntry(tupledesc, (AttrNumber) 2, reldatabase, OIDOID, -1, 0); ! TupleDescInitEntry(tupledesc, (AttrNumber) 3, relfilenode, OIDOID, -1, 0); ! TupleDescInitEntry(tupledesc, (AttrNumber) 4, relblocknumber, INT8OID, -1, 0); ! TupleDescInitEntry(tupledesc,
Re: [HACKERS] Automatic free space map filling
Zeugswetter Andreas DCP SD [EMAIL PROTECTED] wrote: Ok, we cannot reuse a dead tuple. Maybe we can reuse the space of a dead tuple by reducing the tuple to it's header info. I was just working about your idea. In my work, bgwriter truncates dead tuples and leaves only their headers. I'll send a concept patch to PATCHES. We must take super-exclusive-lock of pages before vacuum. Bgwriter tries to take exclusive-lock before it writes a page, and does vacuum only if the lock is super-exclusive. Otherwise, it gives up and writes normally. This is an optimistic way, but I assume the possibility is high because the most pages written by bgwriter are least recently used (LRU). Also, I changed bgwriter_lru_maxpages to be adjusted automatically, because backends won't do vacuum not to disturb main transaction processing, so bgwriter should write most of the dirty pages. There are much room for discussion on this idea. Comments are welcome. --- ITAGAKI Takahiro NTT Cyber Space Laboratories ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] problem with large maintenance_work_mem settings and
Tom Lane wrote: Stefan Kaltenbrunner [EMAIL PROTECTED] writes: CREATE INDEX on a 1,8B row table (5 int columns - index created on the first row about 300M distinct values): before: 11h 51min after: 3h 11min(!) Cool. Does it seem to be I/O bound now? Would you be willing to do it over with oprofile turned on? while it now does a fair amount of IO during the whole operation, it is not yet IObound afaiks. profile: samples %symbol name 103520432 47.9018 inlineApplySortFunction 33382738 15.4471 comparetup_index 25296438 11.7054 tuplesort_heap_siftup 10089122 4.6685 btint4cmp 8395676 3.8849 LogicalTapeRead 2873556 1.3297 tuplesort_heap_insert 2796545 1.2940 tuplesort_gettuple_common 2752345 1.2736 AllocSetFree 2233889 1.0337 IndexBuildHeapScan 2035265 0.9418 heapgettup 1571035 0.7270 LWLockAcquire 1498800 0.6935 readtup_index 1213587 0.5616 index_form_tuple 1097172 0.5077 AllocSetAlloc 1056964 0.4891 heap_fill_tuple 1041172 0.4818 btbuildCallback 9900050.4581 LWLockRelease 8976620.4154 slot_deform_tuple 8585270.3973 LogicalTapeWrite 8068490.3734 PageAddItem 7641360.3536 LockBuffer trace_sort: LOG: begin index sort: unique = f, workMem = 2048000, randomAccess = f LOG: switching to external sort with 7315 tapes: CPU 4.07s/13.70u sec elapsed 17.79 sec LOG: finished writing run 1 to tape 0: CPU 240.07s/3926.66u sec elapsed 4498.49 sec LOG: performsort starting: CPU 535.66s/8138.92u sec elapsed 9435.11 sec LOG: finished writing final run 2 to tape 1: CPU 538.54s/8242.23u sec elapsed 9541.55 sec LOG: performsort done (except final merge): CPU 539.39s/8254.83u sec elapsed 9559.75 sec LOG: external sort ended, 4398827 disk blocks used: CPU 768.38s/10027.39u sec elapsed 11884.63 sec Stefan ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings