Re: [HACKERS] PG Killed by OOM Condition
It's not an easy decision. Linux isn't wrong. Solaris isn't wrong. Most people never hit these problems, and the people who do are just as likely to hit one problem as the other. The grass is always greener on the side of the fence that isn't hurting me right now, and all that. Cheers, mark

Thanks, a very informative reply. Do you have some references where I can learn more? I think that I've run into the OOM killer without a fork() being involved, but I could be wrong. Is it possible to be hit by the OOM killer if no applications use fork()? Regards, Jeff Davis ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] PG Killed by OOM Condition
On Mon, 3 Oct 2005, Jeff Davis wrote: involved, but I could be wrong. Is it possible to be hit by the OOM killer if no applications use fork()? Sure, whenever the system is out of memory and the OS can't find a free page, it kills a process. If you check the kernel log you can see whether the OOM killer has been doing some work. -- /Dennis Björklund ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Tuning current tuplesort external sort code for 8.2
On Tue, 2005-10-04 at 00:21 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: AFAICS the following opportunities exist, without changing any of the theoretical algorithms or the flexibility of definable datatypes: 1. tuplesort_heap_siftup and tuplesort_heap_insert make no attempt to cache the values of keys that have been obtained from *_getattr macros. The two routines navigate a tournament sort heap, so that on average 50% of comparisons use at least one immediately preceding tuple, and key values from that could be cached ready for the next call.

Hmm ... this seems interesting, but you also have to look at the potential downside: what is the cost of doing the caching?

Actually, nothing for the cache-1-tuple case. We just re-initialise the fcinfo based around whether we are caching 0, 1, or 2 fields. We keep two fcinfo structures, so that when we cache 0 fields we do not pollute the cache for the cache=1 case. tuplesort_heap_insert allows us to cache one value each time round the inner loop, always using fcinfo1 (cache=1). tuplesort_heap_siftup has two comparisons per iteration; it allows us to cache 0 values on the first call, so we use fcinfo0. We can always cache one value for the second call, exactly the same as the _heap_insert case, so we use fcinfo1. Maybe we can also reuse the winner from the first call as the second input to the second call. Ahem... it's easier to see when you look at the code. We need to work a bit harder to get the cache=2 case, but it's possible, though I take your point that it may not be worth it. I'll do the cache=1 case before attempting the cache=2 case, so we can measure the gain. We run _heap_insert and _heap_siftup once for each incoming tuple, so the saving should be good.

All of the remaining ideas relate to NULL handling. I can't get excited about this. Most sort scenarios involve few if any nulls.
The idea would be much less important anyway if you follow up on this next idea: One thought that comes to mind is that the current system structure encourages sorting on keys that are at the end of their tuples. For instance, SELECT foo, bar FROM ... ORDER BY baz will sort by the third column of the generated tuples, which is certainly the least efficient to access. Not sure what goes on in there. Are you saying heap sort cols are always at the end, or only in the cases you mention? What about where we're doing a GROUP BY, so all the sort columns are always part of the select list and frequently (by coding convention) at the front of the row? If the cols aren't always in resjunk we can come up with some performance test cases to check whether or not this is a problem, how big and where it starts to show itself. It'd be interesting to look into generating the working tuples with baz as the first column. I fear this is nontrivial, because there are a lot of places that expect resjunk columns to come last, but we should study it. It could be worth it to arrange the main select's attr list so that the sort keys always come first. I'd settle just for that if the second part is too much work. It sounds like hard work to optimise the case where the ORDER BY is on columns not mentioned in the select. The latter seems worth it for when we do a sort-merge based on join columns not mentioned in the SELECT, but not for optimizing sort-aggregate or one-table-select cases. (Note: this will do nada for Josh's original complaint about index build time, since btree index sorts will by definition use all the tuple's columns, but it seems like a significant issue for ordinary sorts.) Understood, but we do care about heap sorts too! Heap and index sorts have many dissimilarities, but both are important, so perhaps we should tune for each case separately. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] PG Killed by OOM Condition
On Mon, Oct 03, 2005 at 11:47:57PM -0700, Jeff Davis wrote: I think that I've run into the OOM killer without a fork() being involved, but I could be wrong. Is it possible to be hit by the OOM killer if no applications use fork()? fork() is the obvious overcommitter. If Netscape wants to spawn a new process, it first has to fork 50MB of memory, then free probably most of it because it execs some little plugin. If a process mmap()s a large block and then doesn't use it until later. Similar idea with brk(). If you run out of swap at the wrong moment... Recent versions are more clever about who to kill. Sometimes you just get unlucky... It's always killed the right process for me (a Mozilla derivative leaked masses of memory over a long period). -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them.
Re: [HACKERS] Tuning current tuplesort external sort code for 8.2
On Tue, Oct 04, 2005 at 12:37:51AM +0100, Simon Riggs wrote: On Mon, 2005-10-03 at 23:25 +0200, Martijn van Oosterhout wrote: Please note: if inlineApplySortFunction() is actually inlined (it isn't by default) Can you explain your last post some more. That's not what I get. The inline keyword is just a flag that you would like the compiler to inline it. GCC will decide itself. It has a limit on the size of functions that it will consider for inlining. Quoting the gcc manual: -finline-limit-N By default, gcc limits the size of functions that can be inlined. This flag allows the control of this limit for functions that are explicitly marked as inline (ie marked with the inline keyword or defined within the class definition in c++). N is the size of functions that can be inlined in number of pseudo instructions (not counting parameter handling). It goes on to say that the limit is 1 for gcc 2.95, but if you examine the manual for gcc 3.3 it has the limit at 600. So it's entirely possible that at the time the person wrote that code, it *was* being inlined, but it sure isn't on some versions of some compilers. I experimented and found that -finline-limit-1500 causes it to start inlining. The -Winline flag causes gcc to print a warning if you specified inline but gcc didn't inline it, for whatever reason. Hopefully this clears that up. -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/
Re: [HACKERS] [PERFORM] A Better External Sort?
On Mon, Oct 03, 2005 at 10:51:32PM +0100, Simon Riggs wrote: Basically, I recommend adding -Winline -finline-limit-1500 to the default build while we discuss other options. I add -Winline but get no warnings. Why would I use -finline-limit-1500? I'm interested, but uncertain as to what difference this makes. Surely using -O3 works fine? Different versions of gcc have different ideas of when a function can be inlined. From my reading of the documentation, this decision is independent of optimisation level. Maybe your gcc version has a limit higher than 1500 by default. -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/
Re: [HACKERS] Vacuum and Transactions
On Tue, 2005-10-04 at 00:26 -0400, Rod Taylor wrote: As I understand it vacuum operates outside of the regular transaction, and if you stop it (SIGTERM, or pg_cancel_backend()) some of the work it accomplished will be kept when it rolls back. For large structures with a ton of dead entries (of which I seem to have a case), running vacuum takes long enough that high-churn structures begin to experience difficulties. Is it reasonable to cancel and restart the vacuum process periodically (say every 12 hours) until it manages to complete the work? It takes about 2 hours to do the table scan, and should get in about 10 hours of index work each round.

That is what I've had to recommend in extreme cases, with some success. For a non-FULL VACUUM, all of the database changes it does will be kept, though that is not the only cost, as you indicate. However, you're right to question it since it does have some downsides, like not correctly updating statistics at the end of the run. I wouldn't try this with VACUUM FULL. In that case, I'd VACUUM first, then when all dead rows are gone go for the VACUUM FULL; but I would find another way round that, like a CTAS. The problem is that VACUUM doesn't emit enough messages for you to know when it gets to the end of each phase, so you've not much clue about how much of that 12 hours would be wasted. Though as you say, it seems likely that much of it is worthwhile in the situation you describe. The tipping point is when VACUUM finds more dead rows than fit within maintenance_work_mem/(size of row pointer). That's when we start to do multiple passes of each of the indexes. Maybe it would be good to have a VACUUM max-one-pass only command, to allow you to break big VACUUMs down into smaller chunks. Or perhaps we should have a trace_vacuum command as well to allow you to see where to cancel it? (Put notices in lazy_vacuum_index and lazy_vacuum_heap). Hope that helps.
Best Regards, Simon Riggs ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Tuning current tuplesort external sort code for 8.2
On Tue, Oct 04, 2005 at 11:55:58AM +0200, Martijn van Oosterhout wrote: It goes on to say that the limit is 1 for gcc 2.95, but if you examine the manual for gcc 3.3 it has the limit at 600. So it's entirely possible that at the time the person wrote that code, it *was* being inlined, but it sure isn't on some versions of some compilers. I experimented and found that -finline-limit-1500 causes it to start inlining.

From searching the web, it appears the inline limit was dropped from 1 to 600 between gcc 3.0.0 and 3.0.1 in response to complaints about gcc memory usage. Any function that could be inlined needed to be kept in memory in semicompiled form, and in C++, where lots of inlinable functions call each other, the memory usage blew up completely. The difference between -O2 and -O3 is that the latter will consider any function for inlining, even if you didn't ask. For C programs that basically means any function declared static. Also, the number is pseudo-instructions and their meaning can change from version to version. Since we're pretty much relying on gcc to inline for performance, I still think we should add -Winline by default so we can tell when it's not doing what we want. Hope this helps, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/
[HACKERS] New pg_config behavior
I have some doubts about the new pg_config behavior of printing "not recorded" when an item is not known. I doubt that this will cause automated users of this program to behave reasonably in error scenarios. I think a more reasonable behavior would be to use a non-zero exit status and possibly print a message to stderr and nothing to stdout. -- Peter Eisentraut http://developer.postgresql.org/~petere/
[HACKERS] memory bug debugging
Hello hackers, I'm fiddling around with the backend and have created a memory bug. I guess I'm overwriting a palloc'd chunk, but can't figure out where. The backend's report is: WARNING: problem in alloc set MessageContext: req size > alloc size for chunk 0x8383ca8 in block 0x83834a0. Investigating a little, I have found out that the chunk size is 8, but the requested size is 1912602626 (more than 1GB???). This certainly doesn't make sense. What tools or debugging options exist to find memory leaks / overwrites? I've read utils/mmgr/README, which leads me to the conclusion that I cannot use valgrind or similar tools. It would be very helpful to know what chunk it is, or better: where it got allocated. Thank you for your help Markus
Re: [HACKERS] Tuning current tuplesort external sort code for 8.2
On Tue, 2005-10-04 at 12:37 +0200, Martijn van Oosterhout wrote: Since we're pretty much relying on gcc to inline for performance, I still think we should add -Winline by default so we can tell when it's not doing what we want. Very much agree to this. Can we put that in the build for 8.1 please? People need to know we expect certain code to be inlined and they need warnings when this does not occur. Best Regards, Simon Riggs
Re: [HACKERS] [PERFORM] A Better External Sort?
On Tue, 2005-10-04 at 12:04 +0200, Martijn van Oosterhout wrote: On Mon, Oct 03, 2005 at 10:51:32PM +0100, Simon Riggs wrote: Basically, I recommend adding -Winline -finline-limit-1500 to the default build while we discuss other options. I add -Winline but get no warnings. Why would I use -finline-limit-1500? I'm interested, but uncertain as to what difference this makes. Surely using -O3 works fine? How did you determine the 1500 figure? Can you give some more info to surround that recommendation to allow everybody to evaluate it? Best Regards, Simon Riggs
Re: [HACKERS] memory bug debugging
On Tue, Oct 04, 2005 at 01:11:41PM +0200, Markus Schiltknecht wrote: Hello hackers, I'm fiddling around with the backend and have created a memory bug. I guess I'm overwriting a palloc'd chunk, but can't figure out where. There are some defines (MEMORY_CONTEXT_CHECKING and CLOBBER_FREED_MEMORY) which can help find leaks. Valgrind works too, with a bit of attention. I was going to add some directives to allow Valgrind to handle PostgreSQL's memory allocator, but haven't got there yet. -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/
Re: [HACKERS] [PERFORM] A Better External Sort?
On Tue, Oct 04, 2005 at 12:24:54PM +0100, Simon Riggs wrote: How did you determine the 1500 figure? Can you give some more info to surround that recommendation to allow everybody to evaluate it?

[EMAIL PROTECTED]:~/dl/cvs/pgsql-local/src/backend/utils/sort$ gcc -finline-limit-1000 -Winline -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wendif-labels -fno-strict-aliasing -g -I../../../../src/include -D_GNU_SOURCE -c -o tuplesort.o tuplesort.c
tuplesort.c: In function 'applySortFunction':
tuplesort.c:1833: warning: inlining failed in call to 'inlineApplySortFunction'
tuplesort.c:1906: warning: called from here
tuplesort.c: In function 'comparetup_heap':
tuplesort.c:1833: warning: inlining failed in call to 'inlineApplySortFunction'
tuplesort.c:1937: warning: called from here
tuplesort.c: In function 'comparetup_index':
tuplesort.c:1833: warning: inlining failed in call to 'inlineApplySortFunction'
tuplesort.c:2048: warning: called from here
tuplesort.c: In function 'comparetup_datum':
tuplesort.c:1833: warning: inlining failed in call to 'inlineApplySortFunction'
tuplesort.c:2167: warning: called from here

[EMAIL PROTECTED]:~/dl/cvs/pgsql-local/src/backend/utils/sort$ gcc -finline-limit-1500 -Winline -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wendif-labels -fno-strict-aliasing -g -I../../../../src/include -D_GNU_SOURCE -c -o tuplesort.o tuplesort.c

no warnings

A quick binary search puts the cutoff between 1200 and 1300. Given version variation I picked a nice round number, 1500. Ugh, that's for -O2; for -O3 and above it needs to be 4100 to work. Maybe we should go for 5000 or so. I'm using: gcc (GCC) 3.3.5 (Debian 1:3.3.5-13) Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/
Re: [HACKERS] New Point Releases Available
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Marc G. Fournier Sent: 04 October 2005 02:34 To: pgsql-hackers@postgresql.org Subject: [HACKERS] New Point Releases Available Just bundled up 7.3.11, 7.4.9 and 8.0.4 ... please look them over and make sure they look okay ... will announce late tomorrow to -announce ... Website docs updated, Windows version of 8.0.4 bundled and uploaded, ftp site symlinks updated. Might take a day or so for the latter two to hit all the mirrors of course... Regards, Dave
Re: [HACKERS] New pg_config behavior
Peter Eisentraut [EMAIL PROTECTED] writes: I have some doubts about the new pg_config behavior of printing "not recorded" when an item is not known. I doubt that this will cause automated users of this program to behave reasonably in error scenarios. I think a more reasonable behavior would be to use a non-zero exit status and possibly print a message to stderr and nothing to stdout. Good point, although in practice that code only gets compiled in builds with brain-dead Windows tools; I rather doubt that the resulting executable would ever get used as you foresee. But if you want to change it, OK with me. regards, tom lane
Re: [HACKERS] [PERFORM] A Better External Sort?
The constants related to inlining involve pcode, not actual assembly instructions, and are compiler-version dependent as well as subject to change without notice by the GNU folks... from: http://gcc.gnu.org/onlinedocs/gcc-3.3.5/gcc/Optimize-Options.html#Optimize-Options

-finline-limit=n By default, gcc limits the size of functions that can be inlined. This flag allows the control of this limit for functions that are explicitly marked as inline (i.e., marked with the inline keyword or defined within the class definition in c++). n is the size of functions that can be inlined in number of pseudo instructions (not counting parameter handling). The default value of n is 600. Increasing this value can result in more inlined code at the cost of compilation time and memory consumption. Decreasing usually makes the compilation faster and less code will be inlined (which presumably means slower programs). This option is particularly useful for programs that use inlining heavily such as those based on recursive templates with C++. Inlining is actually controlled by a number of parameters, which may be specified individually by using --param name=value. The -finline-limit=n option sets some of these parameters as follows:

  max-inline-insns is set to n.
  max-inline-insns-single is set to n/2.
  max-inline-insns-auto is set to n/2.
  min-inline-insns is set to 130 or n/4, whichever is smaller.
  max-inline-insns-rtl is set to n.

Using -finline-limit=600 thus results in the default settings for these parameters. See below for a documentation of the individual parameters controlling inlining. Note: pseudo instruction represents, in this particular context, an abstract measurement of a function's size. In no way does it represent a count of assembly instructions and as such its exact meaning might change from one release to another.

Further down it says...

--param name=value In some places, GCC uses various constants to control the amount of optimization that is done. For example, GCC will not inline functions that contain more than a certain number of instructions. You can control some of these constants on the command-line using the --param option. The names of specific parameters, and the meaning of the values, are tied to the internals of the compiler, and are subject to change without notice in future releases. In each case, the value is an integer. The allowable choices for name are given in the following table: snip

max-inline-insns-single Several parameters control the tree inliner used in gcc. This number sets the maximum number of instructions (counted in gcc's internal representation) in a single function that the tree inliner will consider for inlining. This only affects functions declared inline and methods implemented in a class declaration (C++). The default value is 300.

max-inline-insns-auto When you use -finline-functions (included in -O3), a lot of functions that would otherwise not be considered for inlining by the compiler will be investigated. To those functions, a different (more restrictive) limit compared to functions declared inline can be applied. The default value is 300.

max-inline-insns The tree inliner does decrease the allowable size for single functions to be inlined after we already inlined the number of instructions given here by repeated inlining. This number should be a factor of two or more larger than the single function limit. Higher numbers result in better runtime performance, but incur higher compile-time resource (CPU time, memory) requirements and result in larger binaries. Very high values are not advisable, as too large binaries may adversely affect runtime performance. The default value is 600.

max-inline-slope After exceeding the maximum number of inlined instructions by repeated inlining, a linear function is used to decrease the allowable size for single functions. The slope of that function is the negative reciprocal of the number specified here. The default value is 32.

min-inline-insns The repeated inlining is throttled more and more by the linear function after exceeding the limit. To avoid too much throttling, a minimum for this function is specified here to allow repeated inlining for very small functions even when a lot of repeated inlining already has been done. The default value is 130.

max-inline-insns-rtl For languages that use the RTL inliner (this happens at a later stage than tree inlining), you can set the maximum allowable size (counted in RTL instructions) for the RTL inliner with this parameter. The default value is 600.

-Original Message- From: Martijn van Oosterhout kleptog@svana.org Sent: Oct 4, 2005 8:24 AM To: Simon Riggs [EMAIL PROTECTED] Cc: Tom Lane [EMAIL PROTECTED], Ron Peacetree [EMAIL PROTECTED], pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [PERFORM] A Better External
Re: [HACKERS] memory bug debugging
Martijn van Oosterhout kleptog@svana.org writes: On Tue, Oct 04, 2005 at 01:11:41PM +0200, Markus Schiltknecht wrote: I'm fiddling around with the backend and have created a memory bug. I guess I'm overriding a palloced chunk, but can't figure out where. There are some defines (MEMORY_CONTEXT_CHECKING and CLOBBER_FREED_MEMORY) which can help find leaks. Those are on by default if you built with --enable-cassert (which you should surely always do when testing new backend code). Best idea that comes to mind is: (1) make sure you have a reproducible scenario in which the clobber always happens at the same physical location. (This shouldn't be too hard if you start a fresh backend for each trial.) (2) run it under gdb and set a hardware watchpoint at that location. You'll get some false matches from touches in the aset.c code, but hopefully not too many. The first breakpoint outside aset.c is your problem. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] \d on database with a lot of tables is slow
On Sat, Oct 01, 2005 at 02:00:12PM -0400, Tom Lane wrote: I wrote: It's presumably mostly in the pg_table_is_visible() calls. I did some profiling on a test case with a large number of tables, and noticed that a big part of the problem is that the catalog caches become entirely useless: almost every catcache lookup ends up going to the underlying tables. This is because MAXCCTUPLES in catcache.c is fixed at 5000, and that's not an adequate working set for this many tables. If you are willing to throw memory at the problem, you could try increasing MAXCCTUPLES (to say 50K or 100K) and see if that helps. Out of curiosity... does catcache cache all pg_* tables? Also, at what point would it be good to up NCCBUCKETS? -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: [HACKERS] [PERFORM] A Better External Sort?
Martijn van Oosterhout kleptog@svana.org writes: A quick binary search puts the cutoff between 1200 and 1300. Given version variation I picked a nice round number, 1500. Ugh, that's for -O2, for -O3 and above it needs to be 4100 to work. Maybe we should go for 5000 or so. I'm using: gcc (GCC) 3.3.5 (Debian 1:3.3.5-13) I don't know what the units of this number are, but it's apparently far too gcc-version-dependent to consider putting into our build scripts. Using gcc version 4.0.1 20050727 (current Fedora Core 4 compiler) on i386, and compiling tuplesort.c as you did, I find: -O2: warning goes away between 800 and 900 -O3: warning is always there (tried values up to 1000) (the latter behavior may indicate a bug, not sure). What's even more interesting is that the warning does not appear in either case if I omit -finline-limit --- so the default value is plenty. At least on this particular compiler, the proposed switch would be counterproductive. regards, tom lane
Re: [HACKERS] pg_dump versioning
On Mon, Oct 03, 2005 at 12:11:49AM -0400, Tom Lane wrote: Bruce Momjian pgman@candle.pha.pa.us writes: Jim C. Nasby wrote: Watching the discussion about how to handle nextval() and keep it dump compatible makes me wonder: would it be useful to encode database or dump version info into dumps? If we ever get to a case where we _need_ to use it, it would be good to have, just in case. The trouble is that it won't help you until years after you invest the work --- invariably, the version info you wish you had is for distinguishing between different *old* releases. True, but that will always be the case until someone adds it. And in this particular case, ISTM it shouldn't require much effort to add the info. Plus, it would certainly be useful for users to be able to examine a dump and find out what exact version of the database/pg_dump was used to produce it. This would also allow for pg_restore, or even a dump piped into psql, to throw a warning or error if being fed a dump version it might not be able to handle (i.e., feeding a newer dump to an older version). I'm not real excited about it myself. My own design experience says that version numbers aren't that hot as a way of determining does this data have the X nature?. If you can't test for the problem directly, you haven't thought hard enough. True, but at least if the info is made available now it could be used down the road if needed. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: [HACKERS] [PERFORM] A Better External Sort?
On Tue, 2005-10-04 at 16:30 +0200, Martijn van Oosterhout wrote: On Tue, Oct 04, 2005 at 10:06:24AM -0400, Tom Lane wrote: Martijn van Oosterhout kleptog@svana.org writes: I'm using: gcc (GCC) 3.3.5 (Debian 1:3.3.5-13) I don't know what the units of this number are, but it's apparently far too gcc-version-dependent to consider putting into our build scripts. Using gcc version 4.0.1 20050727 (current Fedora Core 4 compiler) on i386, and compiling tuplesort.c as you did, I find: -O2: warning goes away between 800 and 900 -O3: warning is always there (tried values up to 1000) (the latter behavior may indicate a bug, not sure). Fascinating. The fact that the warning goes away if you don't specify -finline-limit seems to indicate they've gotten smarter. Or a bug. We'd have to check the asm code to see if it's actually inlined or not. I've been using gcc 3.4 and saw no warning when using either -Winline or -O3 -Winline. Martijn, at the moment it sounds like this is a feature that we no longer need to support - even if we should have done for previous releases. Best Regards, Simon Riggs
Re: [HACKERS] [PERFORM] A Better External Sort?
Martijn van Oosterhout kleptog@svana.org writes: 1. Add -Winline so we can at least be aware of when it's (not) happening. Yeah, I agree with that part, just not with adding a fixed -finline-limit value. While on the subject of gcc warnings ... if I touch that code, I want to remove -Wold-style-definition from the default flags, too. It's causing much more clutter than it's worth, because all the flex files generate several such warnings. regards, tom lane
Re: [HACKERS] effective SELECT from child tables
Is there enough for a TODO here? On Mon, Oct 03, 2005 at 11:24:30PM -0400, Greg Stark wrote: Hannu Krosing [EMAIL PROTECTED] writes: On P, 2005-10-02 at 23:00 -0400, Tom Lane wrote: Here's another interesting case to think about: ALTER TABLE ADD foo integer DEFAULT 1 ... ALTER TABLE ALTER foo SET DEFAULT 2 You'll have to pay the table-traversal cost on one step or the other. The second, ALTER ... SET DEFAULT, would only set the default for newly inserted rows, not the ones which are missing the column due to the tuples being created before the column existed. Hm. So you're saying there are only ever exactly two types of defaults. The initial default that applies to all tuples that were created before the column was added. And the current default that only ever applies to newly created tuples. That does seem to cleanly close this hole. -- greg ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: [HACKERS] Vacuum and Transactions
Is it reasonable to cancel and restart the vacuum process periodically No. How big is that table, anyway? Are you trying a VACUUM FULL, or plain vacuum? It's only about 60GB in size but appears to have been missed by vacuum for nearly a month and probably has a large percentage of dead material (30% or so). I'm trying to run a plain vacuum and the Pg version is 8.0.3. Building indexes on this structure also takes a significant amount of time. I have maintenance_work_mem set to about 1GB in size. The catch is that there are some other very active structures (like pg_listener for Slony) which after a couple of hours without vacuuming will quickly have the DB at an unreasonably high load (low tens) which seems to all but halt the vacuum on the large structure. Rightfully the table should be partitioned by time, but I haven't quite figured out how to delete data from the old structure without increasing the vacuum time required. Another alternative I'm considering is to create new partial indexes for the active structures, drop all of the old full table indexes and run vacuum on that, then partition it. --
Re: [HACKERS] Inherited indexes.
On Sun, Oct 02, 2005 at 09:46:07PM -0400, Tom Lane wrote: Fredrik Olsson [EMAIL PROTECTED] writes: Allowing indexes to be inherited, so that unique constraints, foreign keys and such work properly with inheritance, has been on the todo for quite some time. I thought that most probably it is a very non-trivial thing, perhaps requiring completely rethinking how indexes are done. Yup, you're right. There are a couple of major problems, to my mind: 1. A cross-table index would need to store a table OID as well as the existing block/offset information in order to tell you what an entry is pointing at. An extra 4 bytes per index entry (8 bytes if MAXALIGN is 8) is a lot of overhead, so you'd not want to pay that all the time. Which means two index tuple header formats to support, which looks painful. How can that be handled cleanly and efficiently? Wouldn't it make more sense to use a smaller pointer into a table of the OIDs that the index covers? I don't know off-hand how much padding there currently is in index tuples, but hopefully this would allow bringing the space usage under control for common cases involving fewer than a few dozen tables. Another possibility is optimizing for the special case of indexing on a partitioning key. In this case, index values would be very localized to one table, so just storing the table info on each index page (or something similar) would work well. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: [HACKERS] [PERFORM] A Better External Sort?
On Tue, Oct 04, 2005 at 10:06:24AM -0400, Tom Lane wrote: Martijn van Oosterhout kleptog@svana.org writes: I'm using: gcc (GCC) 3.3.5 (Debian 1:3.3.5-13) I don't know what the units of this number are, but it's apparently far too gcc-version-dependent to consider putting into our build scripts. Using gcc version 4.0.1 20050727 (current Fedora Core 4 compiler) on i386, and compiling tuplesort.c as you did, I find: -O2: warning goes away between 800 and 900 -O3: warning is always there (tried values up to 1000) (the latter behavior may indicate a bug, not sure). Fascinating. The fact that the warning goes away if you don't specify -finline-limit seems to indicate they've gotten smarter. Or a bug. We'd have to check the asm code to see if it's actually inlined or not. Two options: 1. Add -Winline so we can at least be aware of when it's (not) happening. 2. If we can't get gcc to reliably inline, maybe we need to consider other options? In particular, move the isNull test statements out since they are the ones the optimiser can use to best effect. And if we put in -Winline, it would be visible to users while compiling so they can tweak their own build options (if they care). -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them.
Re: [HACKERS] Vacuum and Transactions
Rod Taylor [EMAIL PROTECTED] writes: As I understand it vacuum operates outside of the regular transaction It's a perfectly normal transaction. Is it reasonable to cancel and restart the vacuum process periodically No. How big is that table, anyway? Are you trying a VACUUM FULL, or plain vacuum? regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] effective SELECT from child tables
On Sat, Oct 01, 2005 at 07:59:11PM +0100, Simon Riggs wrote: On Sat, 2005-10-01 at 10:57 -0500, Jim C. Nasby wrote: To clarify, this is a hard-coded implementation of what I'm asking for: http://cvs.distributed.net/viewcvs.cgi/stats-sql/logdb/ In a nutshell:

    CREATE TABLE log_other ( project_id smallint NOT NULL ... )
    CREATE TABLE log_8 ( -- No project_id ... )
    CREATE TABLE log_24, log_25, log_5...
    CREATE VIEW log AS
        SELECT * FROM log_other
        UNION ALL SELECT 8 AS project_id, * FROM log_8
        ...

So the end result is that for cases where project_id is 5, 8, 24, or 25, the data will be stored in tables that don't have the project_id. If I were to use this on the main table for http://stats.distributed.net, which has ~130M rows, I would be able to save 130M*4 bytes (4 instead of 2 due to alignment), or 520MB. The logdb will have many times that number of rows, so the savings will be even larger. Note that this technique wouldn't help at all for something like date partitioning, because you have to store the date in the partitioned table. Jim, Your idea was noted before, actually; I mentioned it to show that I listen and take note of ideas from any source. For everybody, I would note that the current behaviour is exactly the way that List Partitioning works on other systems. The cost of this technique is only paid if you choose to partition on something that you would not otherwise have included in your table. In many cases, you'll choose a column that would have been in the table if you had created one big table, so the additional cost is zero. Well, the idea is to be more space efficient than if one big table was used. This is unique to this class of partitioning problems. In your example, I would expect to see project_id in a superclass table and so there would be no cost. Superclass table? The idea is neat, but IMHO the potential saving of this idea is not big enough for me to prioritise it very highly over other items at this time. Certainly. 
I only chimed in with a specific example so people could better understand what the idea was. I know it's on the list and might be addressed at some point. In the meantime it's not too horrible to hard-code a solution. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: [HACKERS] [PERFORM] A Better External Sort?
On Tue, Oct 04, 2005 at 03:56:53PM +0100, Simon Riggs wrote: I've been using gcc 3.4 and saw no warning when using either -Winline or -O3 -Winline. Ok, I've just installed 3.4 and verified that. I examined the asm code and gcc is inlining it. I concede; at this point just throw in -Winline and monitor the situation. As an aside, the *_getattr calls end up a bit suboptimal though. It's producing code like:

        cmp attlen, 4
        je $elsewhere1
        cmp attlen, 2
        je $elsewhere2
        ld byte
    here:
        --- much later ---
    elsewhere1:
        ld integer
        jmp $here
    elsewhere2:
        ld short
        jmp $here

No idea whether we want to go down the path of hinting to gcc which size will be the most common. Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/
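The branchy layout Martijn describes comes from a length dispatch of roughly the following shape. This is a simplified illustration of what a fetch_att-style access expands to, not the actual PostgreSQL macro: gcc keeps the fall-through (1-byte) case inline and moves the other cases out of line, producing the cmp/je/jmp pattern in the quoted asm.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified fixed-length attribute fetch: branch on attlen the way a
 * fetch_att-style macro does.  memcpy is used for the loads so the
 * sketch is alignment-safe; the compiler turns it into plain loads. */
static int64_t
fetch_att(const void *p, int attlen)
{
    switch (attlen)
    {
        case 4: { int32_t v; memcpy(&v, p, 4); return v; }
        case 2: { int16_t v; memcpy(&v, p, 2); return v; }
        default:{ int8_t  v; memcpy(&v, p, 1); return v; }
    }
}
```

Which case the compiler favors as the fall-through path is its own guess; without a hint it has no way to know which attlen dominates a given sort.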
Re: [HACKERS] Vacuum and Transactions
Rod Taylor [EMAIL PROTECTED] writes: The catch is that there are some other very active structures (like pg_listener for Slony) which after a couple of hours without vacuuming will quickly have the DB at an unreasonably high load (low tens) which seems to all but halt the vacuum on the large structure. Yeah. We desperately need to reimplement listen/notify :-( ... that code was never designed to handle high event rates. regards, tom lane
Re: [HACKERS] Inherited indexes.
Jim C. Nasby [EMAIL PROTECTED] writes: On Sun, Oct 02, 2005 at 09:46:07PM -0400, Tom Lane wrote: 1. A cross-table index would need to store a table OID as well as the existing block/offset information in order to tell you what an entry is pointing at. Wouldn't it make more sense to use a smaller pointer to a table of OIDs that that index covers? Smaller than what? Don't tell me you want to restrict how many tables a cross-table index can handle :-( In any case, the gain from doing that would be exactly zero because of alignment considerations: the size of an index tuple header really has to be a multiple of MAXALIGN. regards, tom lane
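Tom's alignment point can be made concrete: with MAXALIGN of 8, shrinking the table pointer from a 4-byte OID to, say, a 2-byte slot number saves nothing, because the header size is rounded up to a MAXALIGN boundary either way. A sketch of the rounding (the 8-byte base header size here is purely illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* Round len up to a multiple of ALIGNVAL, as PostgreSQL's MAXALIGN
 * macro does (ALIGNVAL = 8 assumed here for illustration). */
#define ALIGNVAL 8
#define MAXALIGN(len) \
    (((size_t) (len) + (ALIGNVAL - 1)) & ~((size_t) (ALIGNVAL - 1)))
```

With an assumed 8-byte base header, MAXALIGN(8 + 4) and MAXALIGN(8 + 2) both come out to 16: the 2-byte "smaller pointer" buys back nothing.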
Re: [HACKERS] Inherited indexes.
Another possibility is optimizing for the special case of indexing on a partitioning key. In this case, index values would be very localized to one table, so just storing the table info on each index page (or something similar) would work well. If you have the partitioning key in the index and the partitions don't overlap, it is better to create separate [unique] indexes on the subtables. Building separate indexes per partition is usually preferred because of: 1. performance of dropping a partition 2. smaller index for CE Only if you need an ORDER BY without a sort step that spans more than one partition do things usually get ugly. Imho the best solution would be a merge node, that merges the results of several index accesses to avoid a sort and still use separate indexes. Such a merge node could probably also detect the case where accessing partitions in a certain order still produces ordered results. Usually you will only want the one big unique index when the partitioning is not reflectable in the index keys, and then (also in other db's) such an index is usually a pain ... Andreas ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Inherited indexes.
On Tue, Oct 04, 2005 at 11:05:49AM -0400, Tom Lane wrote: Jim C. Nasby [EMAIL PROTECTED] writes: On Sun, Oct 02, 2005 at 09:46:07PM -0400, Tom Lane wrote: 1. A cross-table index would need to store a table OID as well as the existing block/offset information in order to tell you what an entry is pointing at. Wouldn't it make more sense to use a smaller pointer to a table of OIDs that that index covers? Smaller than what? Don't tell me you want to restrict how many tables a cross-table index can handle :-( In any case, the gain from doing that would be exactly zero because of alignment considerations: the size of an index tuple header really has to be a multiple of MAXALIGN. Hrm, I see that IndexTupleData doesn't have room 'left over' like HeapTupleData does. If it did, it would probably be a win to allow for indexes that are on less than X number of tables, where X is whatever value we can fit into the tuple header. Since this could be exceeded at any time, we'd also need a flag to indicate that a given tuple is for a table that's not in the lookup table and that an actual OID is stored. Given that that's not (currently) the case, it seems that the unused bit could be used to indicate if the tuple was for a table other than the one the index was originally created on. That would allow for adding a table to an existing index without re-writing the entire thing. It could also provide some speed improvement in cases where the table the index was defined on contained the majority of the data, but that's a pretty far-out corner case. Of course, this is all academic performance tuning until we actually have cross-table indexes. Does that 'just' leave locking as the issue? I think cross-table indexes are going to become a lot more important as our partitioning support increases, so it would be good if this could get done for 8.2 (I think it's on Simon's list right now). -- Jim C. Nasby, Sr. 
Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: [HACKERS] Inherited indexes.
Well, I never said unique, but you're correct, it's pretty undesirable to put a global index on your partitioning key. On Tue, Oct 04, 2005 at 06:16:21PM +0200, Zeugswetter Andreas DAZ SD wrote: Another possibility is optimizing for the special case of indexing on a partitioning key. In this case, index values would be very localized to one table, so just storing the table info on each index page (or something similar) would work well. If you have the partitioning key in the index and the partitions don't overlap, it is better to create separate [unique] indexes on the subtables. Building separate indexes per partition is usually preferred because of: 1. performance of dropping a partition 2. smaller index for CE Only if you need an ORDER BY without a sort step that spans more than one partition do things usually get ugly. Imho the best solution would be a merge node, that merges the results of several index accesses to avoid a sort and still use separate indexes. Such a merge node could probably also detect the case where accessing partitions in a certain order still produces ordered results. Usually you will only want the one big unique index when the partitioning is not reflectable in the index keys, and then (also in other db's) such an index is usually a pain ... Andreas -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: [HACKERS] [PERFORM] A Better External Sort?
On Tue, Oct 04, 2005 at 05:23:41PM +0200, Martijn van Oosterhout wrote: On Tue, Oct 04, 2005 at 03:56:53PM +0100, Simon Riggs wrote: I've been using gcc 3.4 and saw no warning when using either -Winline or -O3 -Winline. Ok, I've just installed 3.4 and verified that. I examined the asm code and gcc is inlining it. I concede, at this point just throw in -Winline and monitor the situation. As an aside, the *_getattr calls end up a bit suboptimal though. It's producing code like:

        cmp attlen, 4
        je $elsewhere1
        cmp attlen, 2
        je $elsewhere2
        ld byte
    here:
        --- much later ---
    elsewhere1:
        ld integer
        jmp $here
    elsewhere2:
        ld short
        jmp $here

No idea whether we want to go down the path of hinting to gcc which size will be the most common. If it will very frequently be one value, and not the other values, I don't see why we wouldn't want to hint? #ifdef it to expand to just the expression if not using GCC. It's important that we know that the value would almost always be a certain value, however, as GCC will try to make the path for the expected value as fast as possible, at the cost of an unexpected value being slower.

    __builtin_expect (long EXP, long C)

You may use `__builtin_expect' to provide the compiler with branch prediction information. In general, you should prefer to use actual profile feedback for this (`-fprofile-arcs'), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect. The return value is the value of EXP, which should be an integral expression. The value of C must be a compile-time constant. The semantics of the built-in are that it is expected that EXP == C. For example:

    if (__builtin_expect (x, 0))
      foo ();

would indicate that we do not expect to call `foo', since we expect `x' to be zero. 
Since you are limited to integral expressions for EXP, you should use constructions such as

    if (__builtin_expect (ptr != NULL, 1))
      error ();

when testing pointer or floating-point values. Cheers, mark -- [EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
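A hedged sketch of how the hint could be applied to the attlen dispatch, assuming 4-byte attributes are the common case. Whether that assumption holds for a given workload would need profiling, as the quoted manual itself warns; the likely() wrapper and fetch_att_hinted are illustrative names, not existing PostgreSQL code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Expand to a plain expression on non-GCC compilers, per the
 * suggestion in the mail above. */
#if defined(__GNUC__)
#define likely(x)   __builtin_expect(!!(x), 1)
#else
#define likely(x)   (x)
#endif

/* Attribute fetch with the 4-byte path hinted as the expected one,
 * so gcc keeps it on the fall-through (fast) path. */
static int64_t
fetch_att_hinted(const void *p, int attlen)
{
    if (likely(attlen == 4)) { int32_t v; memcpy(&v, p, 4); return v; }
    if (attlen == 2)         { int16_t v; memcpy(&v, p, 2); return v; }
    { int8_t v; memcpy(&v, p, 1); return v; }
}
```

The semantics are unchanged by the hint; only the code layout (and thus the cost of the uncommon lengths) differs.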
Re: [PERFORM] [HACKERS] Query in SQL statement
On Sat, Oct 01, 2005 at 12:51:08PM -0700, Roger Hand wrote: -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Jim C. Nasby Sent: Friday, September 30, 2005 4:49 PM Subject: Re: [PERFORM] [HACKERS] Query in SQL statement I suggest ditching the CamelCase and going with underline_separators. I'd also not use the bareword id, instead using bad_user_id. And I'd name the table bad_user. But that's just me. :) I converted a db from MS SQL, where tables and fields were CamelCase, and just lowercased the ddl to create the tables. So table and field names were all created in lowercase, but I didn't have to change any of the application code: the SELECT statements worked fine with mixed case.

    -- sample DDL
    CREATE TABLE testtable (
        fieldone int4
    )
    insert into TestTable (fieldone) values (11);

That will usually work (see Tom's reply), but fieldone is a heck of a lot harder to read than field_one. But like I said, these are the coding conventions I've found work well; YMMV. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: [HACKERS] Inherited indexes.
On Tue, 2005-10-04 at 18:16 +0200, Zeugswetter Andreas DAZ SD wrote: Another possibility is optimizing for the special case of indexing on a partitioning key. In this case, index values would be very localized to one table, so just storing the table info on each index page (or something similar) would work well. If you have the partitioning key in the index and the partitions don't overlap, it is better to create separate [unique] indexes on the subtables. Building separate indexes per partition is usually preferred because of: 1. performance of dropping a partition 2. smaller index for CE ... Imho the best solution would be a merge node, that merges the results of several index accesses to avoid a sort and still use separate indexes. Such a merge node could probably also detect the case where accessing partitions in a certain order still produces ordered results. Yes, that was my conclusion also. There are a number of intermediate steps along the way, so it will take some time to achieve it. Usually you will only want the one big unique index when the partitioning is not reflectable in the index keys, and then (also in other db's) such an index is usually a pain ... Agreed^2. The idea of a global index is a non-starter for all of the reasons that Tom gave and the main one: it's unusably huge. There's no point in partitioning a 1TB table if you then have to build a 500GB index on it. The tree would be so deep that each insert would require maybe 3 I/Os on index branch blocks before you got to the leaf. Insert performance would suck real bad, which is a blocker, since if you have a large table you almost certainly have a lot of data to load. If you don't have a big table you shouldn't be partitioning it anyway. Best Regards, Simon Riggs
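Simon's depth argument is simple arithmetic: a B-tree needs about log_fanout(N) levels for N entries, so shrinking the fanout (by fattening each entry with a table OID, say) adds levels, and each extra branch level is another I/O on the way down for every insert. A rough integer sketch; the entry counts and fanouts below are illustrative, not measured PostgreSQL figures:

```c
#include <assert.h>

/* Number of tree levels needed to address n leaf entries when each
 * node can point at `fanout` children: smallest d with fanout^d >= n. */
static int
btree_depth(unsigned long long n, unsigned long long fanout)
{
    int d = 0;
    unsigned long long capacity = 1;

    while (capacity < n)
    {
        capacity *= fanout;
        d++;
    }
    return d;
}
```

For 2 billion entries, a fanout of 300 needs 4 levels while a fanout of 30 needs 7: a 10x fanout loss costs three extra branch-block reads per descent.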
[HACKERS] Announcing Veil
This is to announce the first, Alpha, release of Veil, a database security add-on for Postgres. It allows databases to be created with efficient row-level access controls; different users have access to different subsets of data. Veil is hosted on pgfoundry at http://veil.projects.postgresql.org/ Documentation is here: http://veil.projects.postgresql.org/curdocs/index.html If there is sufficient interest, I would ultimately like Veil to be added to contrib - we will have to see how things pan out. I am announcing this to the postgres hackers list as I hope for review comments from a qualified community. All comments will be welcomed. If you would like somewhere controversial to start your review, take a look at veil_shmem.c which hooks into the postgres shared memory subsystem. Since I was unable to dynamically assign a LWLock using LWLockAssign (none available), I have fairly arbitrarily overloaded the use of existing LWLocks. When the flames die down perhaps we can discuss making a small number (one would be enough for me) of LWLocks available. Thank you. -- Marc Munro
Re: [HACKERS] Announcing Veil
Marc Munro [EMAIL PROTECTED] writes: Since I was unable to dynamically assign a LWLock using LWLockAssign (none available), I have fairly arbitrarily overloaded the use of existing LWLocks. When the flames die down perhaps we can discuss making a small number (one would be enough for me) of LWLocks available. Perhaps you missed the comment in NumLWLocks()? regards, tom lane
[HACKERS] prefix btree implementation
I am not sure if this idea was mentioned before. The basic prefix btree idea is quite straightforward, i.e., try to compress the key items within a data page by sharing the common prefix. Thus the fanout of the page is increased and the benefit is theoretically obvious. Consider the following cases:

1/ Multi-attribute index; an example page looks like this: [1, 1, 'abc'], [1, 1, 'bde'], ..., [1, 22, 'mmk'] - so [1, 1] or [1] is compressible

2/ Index on a string; an example page looks like this: ['aaab'], ['aaac'], ..., ['aabc'] - so ['aaa'] or ['aa'] is compressible

3/ Both 1 and 2; an example page looks like this: [1, 'abc', 1], [1, 'abc', 2], ..., [1, 'abk', 1] - so [1, 'abc'] or [1, 'ab'] is compressible

4/ And even this; an example page looks like this: [0x1234], [0x1235], ..., [0x2213] - so [0x] is compressible

So together, there are basically four types of possible sharing: column-wise (case 1), character-wise (case 2), column-character-wise (case 3), and byte-wise (case 4). To compress these prefixes, we have the following questions:

1/ What types of prefix compression shall we support? We can consider case 3 as a generalized case of 1 and 2, so at least we can implement cases 1, 2 and 3 now. Case 4 could be seen as a special case of case 2, so the framework should be extensible enough to add support for case 4 later. In short, we will implement cases 1, 2 and 3 now.

2/ When do we do prefix compression, and which index operations will be affected? To simplify things, we can do it when we do a bottom-up index rebuild. We won't try to compress the prefix when inserting new data. By this design, the index operations affected will include btree build and btree split (a new page has to inherit the prefix information).

3/ How do we represent each index tuple, and where do we store the common prefix? Index tuples involved in the compression have to remember how much has been shared, and the shared prefix should be stored within the same page. 
The basic idea is that we will use bit 13 of IndexTuple->t_info to indicate whether it shares a prefix or not, and the common prefix will be stored in the special area of the index page. The common prefix data will have the format {length|value}. Since PageHeader->pd_special is dynamic, the common prefix can have a different length on each page. Based on these changes, we will change index_getattr() and _bt_pageinit() accordingly.

4/ WAL support? We only affect btree build and btree split. For btree build, we don't worry too much, since the data is already page-wise xlogged. We will add prefix information to the btree split xlog record (xl_btree_split). Comments? Regards, Qingqing ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
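The page-level compression Qingqing describes boils down to factoring out the longest common prefix of the keys on a page and storing it once, in his proposal in the page's special space. A simplified character-wise sketch (case 2 above); page_common_prefix is an illustrative name, not part of the proposed patch:

```c
#include <assert.h>
#include <string.h>

/* Length of the longest common prefix shared by all keys on a "page";
 * this is the {length|value} prefix that would be stored once in the
 * page's special area instead of repeated in every index tuple. */
static size_t
page_common_prefix(const char **keys, int nkeys)
{
    size_t lcp;
    int i;

    if (nkeys == 0)
        return 0;
    lcp = strlen(keys[0]);
    for (i = 1; i < nkeys; i++)
    {
        size_t j = 0;

        while (j < lcp && keys[i][j] != '\0' && keys[i][j] == keys[0][j])
            j++;
        lcp = j;    /* prefix can only shrink as more keys are seen */
    }
    return lcp;
}
```

On the ['aaab'], ['aaac'], ..., ['aabc'] example page above, this yields 2: the page shares 'aa', and each tuple then stores only its 2-character suffix plus a shared-prefix flag.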