Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Fri, 13 Jan 2006 19:18:29 +, Simon Riggs [EMAIL PROTECTED] wrote: I enclose a patch for checking out block sampling. Can't comment on the merits of block sampling and your implementation thereof. Just some nitpicking:

|! * Row Sampling: As of May 2004, we use the Vitter algorithm to create

Linking the use of the Vitter algorithm to May 2004 is not quite appropriate. We introduced two-stage sampling at that time.

| *	a random sample of targrows rows (or less, if there are less in the
|! *	sample of blocks).  In this case, targblocks is always the same as
|! *	targrows, so we always read one row per block.

This is just wrong, unless you add "on average". Even then it is a bit misleading, because in most cases we *read* more tuples than we use.

| *	Although every row has an equal chance of ending up in the final
| *	sample, this sampling method is not perfect: not every possible
| *	sample has an equal chance of being selected.  For large relations
| *	the number of different blocks represented by the sample tends to be
|! *	too small.  In that case, block sampling should be used.

Is the last sentence a fact or personal opinion?

| *	block.  The previous sampling method put too much credence in the row
| *	density near the start of the table.

FYI, "previous" refers to the date mentioned above: previous == before May 2004 == before two-stage sampling.

|+	/* Assume that we will have rows at least 64 bytes wide
|+	 * Currently we're very unlikely to overflow availMem here
|+	 */
|+	if ((allocrows * sizeof(HeapTuple)) > (allowedMem >> 4))

This is a funny way of saying

	if (allocrows * (sizeof(HeapTuple) + 60) > allowedMem)

It doesn't match the comment above; and it is something different if the size of a pointer is not four bytes.

|-	if (bs.m > 0)
|+	if (bs.m > 0 )

Oops.

|+	ereport(DEBUG2,
|+		(errmsg("ANALYZE attr %u sample: n=%u nmultiple=%u f1=%u d=%u",
|+		stats->tupattnum,samplerows, nmultiple, f1, d)));
		                ^ missing space here and in some more places

I haven't been following the discussion too closely, but didn't you say that a block sampling algorithm somehow compensates for the fact that the sample is not random?

Servus Manfred
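To illustrate why the two tests above agree only on 32-bit platforms, here is a hedged sketch; allocrows, allowedMem and the constant 60 come from the quoted patch, everything else (function names, the HeapTuple stand-in) is made up for the example:

	#include <stddef.h>

	typedef void *HeapTuple;	/* stand-in; a HeapTuple is a pointer */

	/* the patch's test: pointer bytes vs. allowedMem/16 */
	int check_shifted(size_t allocrows, size_t allowedMem)
	{
		return (allocrows * sizeof(HeapTuple)) > (allowedMem >> 4);
	}

	/* the equivalent spelling: 64 bytes per row, stated directly */
	int check_explicit(size_t allocrows, size_t allowedMem)
	{
		return allocrows * (sizeof(HeapTuple) + 60) > allowedMem;
	}

With sizeof(HeapTuple) == 4 both compare 64 * allocrows against allowedMem (up to the rounding of the shift); with 8-byte pointers the first test silently assumes 128 bytes per row while the comment still says 64.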
Re: [HACKERS] Unsplitting btree index leaf pages
On Thu, 22 Dec 2005 10:40:24 -0500, Tom Lane [EMAIL PROTECTED] wrote: If you move items from one page to the other in the opposite direction from the way the scan is going, then it will miss those items. AFAIU the (PG implementation of the) L&Y method is designed to make scans immune against problems caused by items moving right within the same page and against page splits, i.e. items moving to a *new* right sibling. So making scans work with items moving to an *existing* right sibling doesn't seem out of reach. The code following this comment in _bt_restscan

	/*
	 * The item we're looking for moved right at least one page, so
	 * move right.  We are careful here to pin and read-lock the next
	 * non-dead page before releasing the current one.  This ensures
	 * that a concurrent btbulkdelete scan cannot pass our position
	 * --- if it did, it might be able to reach and delete our target
	 * item before we can find it again.
	 */

looks like it'd work for the case of page merging as well, as long as we are careful to move items always from left to right. BTW, if after having locked both pages we find that we have super-exclusive locks, i.e. nobody else has pins on these pages, we can reorganize much more aggressively. It might even be safe to move items to the left page. The parent page might need some special handling, though.

Servus Manfred
Re: [HACKERS] Re: Which qsort is used
On Thu, 22 Dec 2005 08:01:00 +0100, Martijn van Oosterhout kleptog@svana.org wrote: But where are you including the cost to check how many cells are already sorted? That would be O(H), right? Yes. I didn't mention it, because H <= N. This is where we come back to the issue that comparisons in PostgreSQL are expensive. So we agree that we should try to reduce the number of comparisons. How many comparisons does it take to sort 10 items? 1.5 million? Hmm, what are the chances you have 10 unordered items to sort and that the first 8% will already be in order. ISTM that that probability will be close enough to zero to not matter... If the items are totally unordered, the check is so cheap you won't even notice. OTOH in Tom's example ...

|What I think is much more probable in the Postgres environment
|is almost-but-not-quite-ordered inputs --- eg, a table that was
|perfectly ordered by key when filled, but some of the tuples have since
|been moved by UPDATEs.

... I'd not be surprised if H is 90% of N.

Servus Manfred
Re: [HACKERS] Re: Which qsort is used
On Sat, 17 Dec 2005 00:03:25 -0500, Tom Lane [EMAIL PROTECTED] wrote: I've still got a problem with these checks; I think they are a net waste of cycles on average. [...] and when they fail, those cycles are entirely wasted; you have not advanced the state of the sort at all. How can we make the initial check advance the state of the sort? One answer might be to exclude the sorted sequence at the start of the array from the qsort, and merge the two sorted lists as the final stage of the sort. Qsorting N elements costs O(N*ln N), so excluding H elements from the sort reduces the cost by at least O(H*ln N). The merge step costs O(N) plus some (<= 50%) more memory, unless someone knows a fast in-place merge. So depending on the constant factors involved there might be a usable solution. I've been playing with some numbers and assuming the constant factors to be equal for all the O()'s this method starts to pay off at

	    H | for N
	------+-------
	   20 |   100
	  130 |  1000
	 8000 | 10000

Servus Manfred
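For what it's worth, a minimal sketch of that idea in C (all names invented here, int keys instead of PostgreSQL's datatype comparators): find the presorted head in O(H), qsort only the tail, then merge.

	#include <stdlib.h>
	#include <string.h>

	static int cmp_int(const void *a, const void *b)
	{
		int x = *(const int *) a;
		int y = *(const int *) b;

		return (x > y) - (x < y);
	}

	void sort_with_presorted_head(int *a, size_t n)
	{
		size_t	h = 1, i, j, k;
		int	   *tmp;

		if (n < 2)
			return;
		while (h < n && a[h - 1] <= a[h])	/* O(H) scan for the sorted head */
			h++;
		if (h == n)
			return;							/* already sorted */
		qsort(a + h, n - h, sizeof(int), cmp_int);	/* sort only the tail */

		tmp = malloc(h * sizeof(int));		/* extra memory for the head */
		if (tmp == NULL)
		{
			qsort(a, n, sizeof(int), cmp_int);	/* fall back: sort everything */
			return;
		}
		memcpy(tmp, a, h * sizeof(int));
		i = 0;								/* next element of the head copy */
		j = h;								/* next element of the sorted tail */
		k = 0;								/* next output slot; k <= j always */
		while (i < h && j < n)
			a[k++] = (tmp[i] <= a[j]) ? tmp[i++] : a[j++];
		while (i < h)
			a[k++] = tmp[i++];
		free(tmp);
	}

Copying the smaller of the two runs instead of always the head would keep the extra memory at or below the 50% figure mentioned above.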
Re: [HACKERS] Shared locking in slru.c
On Wed, 30 Nov 2005 13:53:13 -0500, Tom Lane [EMAIL PROTECTED] wrote: The way the attached patch attacks this is for the shared-lock access case to simply set the page's LRU counter to zero, without bumping up the LRU counters of the other pages as the normal adjustment would do. If you still plan to do this, you might also want to revert the micro-optimisation introduced by the original SLRU patch:

| Apart from refactoring I made a little change to SlruRecentlyUsed,
| formerly ClogRecentlyUsed: It now skips incrementing lru_counts, if
| slotno is already the LRU slot, thus saving a few CPU cycles.

|+#define SlruRecentlyUsed(shared, slotno) \
|+	do { \
|+		if ((shared)->page_lru_count[slotno] != 0) { \
|+			int		iilru; \
|+			for (iilru = 0; iilru < NUM_CLOG_BUFFERS; iilru++) \
|+				(shared)->page_lru_count[iilru]++; \
|+			(shared)->page_lru_count[slotno] = 0; \
|+		} \
|+	} while (0)

Otherwise you could end up with a stable state of several pages having lru_count == 0.

Servus Manfred
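Reverting would mean unconditionally aging the other buffers, roughly like this (a sketch, not a tested patch):

	#define SlruRecentlyUsed(shared, slotno) \
		do { \
			int		iilru; \
			for (iilru = 0; iilru < NUM_CLOG_BUFFERS; iilru++) \
				(shared)->page_lru_count[iilru]++; \
			(shared)->page_lru_count[slotno] = 0; \
		} while (0)

so a shared-lock touch that merely zeroes one counter can no longer leave several pages sitting at lru_count == 0 indefinitely.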
Re: [HACKERS] Alternative variable length structure
On Thu, 08 Sep 2005 18:02:44 +0900, ITAGAKI Takahiro [EMAIL PROTECTED] wrote:

+ * The length of varlena2 is encoded as follows:
+ *
+ * |  First   | Trailing | Total |  Max    |
+ * |  byte    |  bytes   | bits  | length  |
+ * +----------+----------+-------+---------+
+ * | 0******* |    0     |   7   |    127  |
+ * | 10****** |    1     |  14   | 16K  -1 |
+ * | 110***** |    2     |  21   |  2M  -1 |
+ * | 1110**** |    3     |  28   | 256M -1 |
+ * | 1111---- |    4     |  32   |  4G  -1 |

With external and compression flags this would look like

 * | ec0***** |    0     |   5   |     31  |
 * | ec10**** |    1     |  12   |  4K  -1 |
 * | ec110*** |    2     |  19   | 512K -1 |
 * | ec1110** |    3     |  26   | 64M  -1 |
 * | ec1111-- |    4     |  32   |  4G  -1 |

Only a matter of taste, but I'd just waste one byte for sizes between 4K and 512K and thus get a shorter table:

 * | ec0***** |    0     |   5   |     31  |
 * | ec10**** |    1     |  12   |  4K  -1 |
 * | ec110*** |    3     |  27   | 128M -1 |
 * | ec111--- |    4     |  32   |  4G  -1 |

And you get back that one byte for sizes between 64M and 128M :-) Another possible tweak:

 * | 0******* |    0     |   7   |    127  |
 * | 1ec0**** |    1     |  12   |  4K  -1 |
 * | 1ec10*** |    3     |  27   | 128M -1 |
 * | 1ec11--- |    4     |  32   |  4G  -1 |

I.e. drop the flags for very short strings. If these flags are needed for a string of size 32 to 127, then use a two byte header.

Servus Manfred
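For illustration, decoding the header size from the first byte of the original proposal could look about like this (a sketch; the function name is invented):

	/* Number of header bytes, from the leading bits of the first byte. */
	static int varlena2_header_bytes(unsigned char first)
	{
		if ((first & 0x80) == 0)
			return 1;		/* 0*******: 7 length bits  */
		if ((first & 0x40) == 0)
			return 2;		/* 10******: 14 length bits */
		if ((first & 0x20) == 0)
			return 3;		/* 110*****: 21 length bits */
		if ((first & 0x10) == 0)
			return 4;		/* 1110****: 28 length bits */
		return 5;			/* 1111----: 32 length bits */
	}

The variants with e/c flags just shift the same ladder one or two bit positions to the right.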
Re: [HACKERS] Remove xmin and cmin from frozen tuples
On Fri, 2 Sep 2005 20:41:48 -0400 (EDT), Bruce Momjian pgman@candle.pha.pa.us wrote: Once I had a patch based on 7.4 that stored cmin and cmax in backend-local memory. Interesting idea, but how would you record the cmin/xmin values without requiring unlimited memory? That's exactly the reason for not sending it to -patches. Without spilling to disk this is just not ready for real life. The problem is that -- unlike other data structures that build up during a transaction, e.g. trigger queues -- cmin/cmax lookup requires random access, so we'd need some form of tree or hash. Unfortunately I never got beyond brainstorming :-( BTW, is there anything else that'd need spilling to disk during long transactions?

Servus Manfred
Re: [HACKERS] Remove xmin and cmin from frozen tuples
On Fri, 2 Sep 2005 15:51:15 -0400 (EDT), Bruce Momjian pgman@candle.pha.pa.us wrote: * Merge xmin/xmax/cmin/cmax back into three header fields And don't forget xvac, please. Before subtransactions, there used to be only three fields needed to store these four values. ... five values. This was possible because only the current transaction looks at the cmin/cmax values. Which is a reason to get rid of cmin/cmax in tuple headers entirely. Once I had a patch based on 7.4 that stored cmin and cmax in backend-local memory. It passed make check and some volume tests, but I felt it was not ready to be applied without any spill-to-disk mechanism. Development stalled when I tried to eliminate xvac as well, which would have required deep cuts into VACUUM code :-(

Servus Manfred
Re: [HACKERS] Must be owner to truncate?
On Wed, 24 Aug 2005 07:01:00 +0200, Andreas Seltenreich [EMAIL PROTECTED] wrote: However, a question arose quickly: According to the standard, revoking INSERT, UPDATE and DELETE after GRANT ALL PRIVILEGES would leave the relation read-only, but with the TRUNCATE privilege lying around, this would no longer be true for PostgreSQL. I'd say that the TRUNCATE privilege includes DELETE, so that REVOKE DELETE implicitly revokes TRUNCATE and GRANT TRUNCATE implicitly grants DELETE.

Servus Manfred
Re: [HACKERS] Cost of XLogInsert CRC calculations
On Tue, 31 May 2005 12:07:53 +0100, Mark Cave-Ayland [EMAIL PROTECTED] wrote: Perhaps Manfred can tell us the generator polynomial that was used to create the lookup tables?

	X^32 + X^26 + X^23 + X^22 + X^16 + X^12 + X^11 + X^10 +
	X^8 + X^7 + X^5 + X^4 + X^2 + X + 1

- http://www.opengroup.org/onlinepubs/009695399/utilities/cksum.html

Or google for 04c11db7.

Servus Manfred
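That polynomial corresponds to the constant 0x04C11DB7. A bit-at-a-time reference implementation, MSB-first as in the cksum description (the init/final conventions here are illustrative and not necessarily the ones used to build the PostgreSQL tables):

	#include <stddef.h>
	#include <stdint.h>

	uint32_t crc32_msb(const unsigned char *buf, size_t len)
	{
		uint32_t	crc = 0xFFFFFFFF;
		size_t		i;
		int			bit;

		for (i = 0; i < len; i++)
		{
			crc ^= (uint32_t) buf[i] << 24;		/* feed the next byte */
			for (bit = 0; bit < 8; bit++)		/* one polynomial step per bit */
				crc = (crc & 0x80000000) ? (crc << 1) ^ 0x04C11DB7
										 : (crc << 1);
		}
		return ~crc;
	}

The lookup-table variant simply precomputes the inner loop for all 256 byte values.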
Re: [HACKERS] WAL replay failure after file truncation(?)
On Wed, 25 May 2005 18:19:19 -0400, Tom Lane [EMAIL PROTECTED] wrote: but it keeps a list (hash table, file, whatever) of those blocks. [...] Is it sufficient to remember just the relation and the block number or do we need the contents as well? We don't *have* the contents ... that's exactly why it's panicking ... I meant the contents of the WAL record, not the original block contents. Anyway, I think it's not needed.

Servus Manfred
Re: [HACKERS] WAL replay failure after file truncation(?)
On Wed, 25 May 2005 11:02:11 -0400, Tom Lane [EMAIL PROTECTED] wrote: Plan B is for WAL replay to always be willing to extend the file to whatever record number is mentioned in the log, even though this may require inventing the contents of empty pages; we trust that their contents won't matter because they'll be truncated again later in the replay sequence. Another idea: WAL replay does not apply changes to nonexistent blocks, but it keeps a list (hash table, file, whatever) of those blocks. When a truncate WAL record is found, all entries for blocks affected by the truncation are removed from the list. Is it sufficient to remember just the relation and the block number or do we need the contents as well? If the list is non-empty at the end of WAL replay, this is evidence of a serious problem (file system corruption or Postgres bug).

Servus Manfred
Re: [HACKERS] Cost of XLogInsert CRC calculations
On Wed, 18 May 2005 13:50:22 +0200, I wrote:

	The most important figure is, that at MaxSpeed (/O2) 2x32 is almost
	twice as fast as CRC32 while only being marginally slower than CRC32.
	                 ^^^^^

Silly typo! That should have been: The most important figure is, that at MaxSpeed (/O2) 2x32 is almost twice as fast as CRC64 while only being marginally slower than CRC32.

Servus Manfred
Re: [HACKERS] Cost of XLogInsert CRC calculations
On Wed, 18 May 2005 01:12:26 -0400, Tom Lane [EMAIL PROTECTED] wrote: Wait, par for 32-bit CRCs? Or for 64-bit CRCs calculated using 32-bit ints? Right, the latter. We haven't actually tried to measure the cost of plain 32bit CRCs... although I seem to recall that when we originally decided to use 64bit, someone put up some benchmarks purporting to show that there wasn't much difference. That someone wasn't me (I wasn't around here at that time), but I have done a few tests today on 32 bit Intel with VC6:

	Optimization | CRC algorithms
	Settings     |   32    32a    32b   2x32     64    64a    64b
	-------------+------------------------------------------------
	Default      |  7.6    7.6    6.2    8.3    9.1    9.2    9.5
	MinSize      |  2.96   2.97   2.97   4.76   6.00   5.98   6.31
	MaxSpeed     |  2.92   2.92   2.97   3.13   6.32   6.33   6.22

32a and 32b are functionally equivalent variants of CRC32 where the crc is a plain uint32, not a struct with just one field. Same for 64a, 64b, and 64, respectively. The most important figure is, that at MaxSpeed (/O2) 2x32 is almost twice as fast as CRC32 while only being marginally slower than CRC32. In case anybody wants to repeat my tests or find any flaw therein, the source is attached.

Servus Manfred

crctest.tgz Description: Binary data
Re: [HACKERS] Bitmap scan cost model (was Re: bitmap scans, btree scans, and tid order)
On Tue, 17 May 2005 22:12:17 -0700, Jeffrey W. Baker [EMAIL PROTECTED] wrote: Incrementing random_page_cost from 4 (the default) to 5 causes the planner to make a better decision. We have such a low default random_page_cost primarily to mask other problems in the optimizer, two of which are

. multi-column index correlation

. interpolation between min_IO_Cost and max_IO_cost which approximates max_IO_cost too fast.

Servus Manfred
Re: [HACKERS] Cost of XLogInsert CRC calculations
On Mon, 16 May 2005 12:35:35 -0400, Tom Lane [EMAIL PROTECTED] wrote: Anyone want to try it with non-gcc compilers? MS VC++ 6.0 with various predefined optimizer settings:

	                              2x32        64
	Default (without any /O)  0.828125  0.906250
	MinSize (contains /O1)    0.468750  0.593750
	MaxSpeed (contains /O2)   0.312500  0.640625

Not that it really matters -- but at least this looks like another hint that letting the compiler emulate 64 bit operations on non 64 bit hardware is suboptimal.

Servus Manfred
Re: [HACKERS] BTW, if anyone wants to work on it...
On Tue, 03 May 2005 02:45:09 -0400, Tom Lane [EMAIL PROTECTED] wrote: I'm starting to think it'd be worth setting up a mechanism to handle such changes automatically. I've been using this skeleton for quite some time now. Magnus' psql ... | while read D might be more robust than my for db in `enumdatabases`

Servus Manfred

#!/bin/sh
#---------------------------------------------------------------------------
# pg_plug -- painless upgrade
#	PG 7.4 standard -> PG 7.4 with various enhancements
#
#	2003-10-04	mk	new
#	2003-10-14	mk	addpatchinfo
#	2003-11-01	mk	VIEW pg_patchinfo
#	2004-11-04	mk	renamed VIEW to pg_patches;
#			features selectable via command line (-F)
#	2004-12-14	mk	freeze
#---------------------------------------------------------------------------
ALLFEATURES="22 27"

#---------------------------------------------------------------------------
abortscript(){
	echo 1>&2
	echo "$1" 1>&2
	exit 1
}

setconnectable(){
	db=$1
	allow=$2
	$CMD template1 >/dev/null <<EOF
UPDATE pg_database SET datallowconn = $allow WHERE datname = '$db';
EOF
}

freeze(){
	db=$1
	$CMD $db >/dev/null <<EOF
VACUUM FREEZE;
EOF
}

# This does not work for database names with spaces or other funny
# characters.
enumdatabases(){
	echo "SELECT datname FROM pg_database;" \
	| $CMD template1 \
	| sed -e '/1: datname = /!d' -e 's/.*1: datname = "//' -e 's/".*//'
}

add22indexstat(){
	db=$1
	echo "adding \"pg_indexstat\", ignore error if it already exists"
	$CMD $db >/dev/null <<EOF
CREATE TABLE pg_catalog.pg_indexstat( \
	istindex oid NOT NULL, \
	istcorrel float4 NOT NULL \
) WITHOUT OIDS;
CREATE UNIQUE INDEX pg_indexstat_index_index \
	ON pg_indexstat(istindex);
EOF
}

add27patchinfo(){
	db=$1
	echo "adding \"pg_patchinfo()\", ignore error if it already exists"
	$CMD $db >/dev/null <<EOF
COPY pg_proc FROM stdin WITH OIDS;
2981	pg_patchinfo	11	1	12	f	f	t	t	i	0	2249	show_patchinfo	-	\N
\.
-- copy acl from version()
UPDATE pg_proc SET proacl=(SELECT proacl FROM pg_proc WHERE oid=89) WHERE oid=2981;
CREATE VIEW pg_patches AS SELECT * FROM pg_patchinfo() AS p(name text, version text, base text, descr text);
EOF
}

#---------------------------------------------------------------------------
CMDNAME=`basename $0`
BINDIR=bin
FEATURES=

while [ $# -gt 0 ]
do
	case $1 in
		--username|-U)
			POSTGRES_SUPERUSERNAME=$2
			shift;;
		--username=*)
			POSTGRES_SUPERUSERNAME=`echo $1 | sed 's/^--username=//'`
			;;
		-U*)
			POSTGRES_SUPERUSERNAME=`echo $1 | sed 's/^-U//'`
			;;
		--feature|-F)
			FEATURES="$FEATURES $2"
			shift;;
		--feature=*)
			F=`echo $1 | sed 's/^--feature=//'`
			FEATURES="$FEATURES $F"
			;;
		-F*)
			F=`echo $1 | sed 's/^-F//'`
			FEATURES="$FEATURES $F"
			;;
		# Data directory.  No default, unless the environment
		# variable PGDATA is set.
		--pgdata|-D)
			PGDATA=$2
			shift;;
		--pgdata=*)
			PGDATA=`echo $1 | sed 's/^--pgdata=//'`
			;;
		-D*)
			PGDATA=`echo $1 | sed 's/^-D//'`
			;;
		-*)
			echo "$CMDNAME: invalid option: $1"
			exit 1
			;;
		# Non-option argument specifies data directory
		*)
			PGDATA=$1
			;;
	esac
	shift
done

#---------------------------------------------------------------------------
if [ -z "$PGDATA" ]
then
	(
		echo "$CMDNAME: no data directory specified"
		echo "You must identify the directory where the data for this database system"
		echo "resides.  Do this with either the invocation option -D or the"
		echo "environment variable PGDATA."
	) 1>&2
	exit 1
fi

#---------------------------------------------------------------------------
PG_OPT=-O
PG_OPT="$PG_OPT -c search_path=pg_catalog"
PG_OPT="$PG_OPT -c exit_on_error=true"
#PG_OPT="$PG_OPT -c enable_indexstat=false"
PG_OPT="$PG_OPT -D $PGDATA"
CMD="$BINDIR/postgres $PG_OPT"

: ${FEATURES:=$ALLFEATURES}

# Enable connections to template0
setconnectable template0 true || abortscript "cannot enable connections to template0"

for db in `enumdatabases`
do
	echo "converting database $db ..."
	for f in $FEATURES
	do
		case $f in
			22)
				add22indexstat $db # || abortscript "cannot convert database $db
Re: [HACKERS] pgFoundry
On Mon, 16 May 2005 20:54:15 -0400 (EDT), Bruce Momjian pgman@candle.pha.pa.us wrote: I have modified the TODO HTML so the completed items are in italics. Isn't it a bit misleading to have those items on the TODO list at all? Shouldn't there be a separate list: DONE for the next release?

Servus Manfred
Re: Learning curves and such (was Re: [HACKERS] pgFoundry)
On Tue, 17 May 2005 14:45:00 -0300 (ADT), Marc G. Fournier [EMAIL PROTECTED] wrote: Also, how many 'bugs' have we seen go through the lists that someone hasn't jump'd on and fixed in a couple of days? Just imagine our marketing crew being able to say: According to our great bug tracking system NN serious bugs have been reported last year, 98% of which have been fixed within three days.

Servus Manfred
Re: [HACKERS] Learning curves and such
On Tue, 17 May 2005 14:29:49 -0700, Josh Berkus josh@agliodbs.com wrote: <grin> You're not going to win over many people on *this* list with marketing arguments. Yeah, that's the problem with *my* learning curve ...

Servus Manfred
Re: [HACKERS] Views, views, views! (long)
On Wed, 4 May 2005 21:37:40 -0700, Josh Berkus josh@agliodbs.com wrote: As stated above, these system views, once incorporated into a pg distribution, are likely to be with us *forever*. I don't think that this is doable. :-( You might want to put the system views into a version specific schema, say pg_views81. The next PG version will contain a new schema pg_views82 plus a version of 8.1 views that have been adapted to new features and catalog structures as far as possible without breaking compatibility. Ideally the views in pg_views81 and pg_views82 will look the same, but most likely there will be some differences. In PG 8.3 we will have schemas pg_views81, pg_views82, and pg_views83 ... Obviously it will get harder and harder to maintain older system view schemas with each new Postgres version. If in PG 8.7 it becomes clear that carrying on pg_views81 doesn't make sense any more, you simply drop it. By that time tool vendors should have had enough time to make their tools compatible with pg_views8x, for some x >= 2.

Servus Manfred
Re: [HACKERS] Best way to scan on-disk bitmaps
On Thu, 12 May 2005 17:40:06 -0400, Tom Lane [EMAIL PROTECTED] wrote: the planner believes that only consecutive columns in the index are usable --- that is, if you have quals for a and c but not for b, it will think that the condition for c isn't usable with the index. This is true for btree [...] It's not difficult to set up a test case where an index is used, but only with a=42 as an index condition, and c='foo' is applied as a filter condition. Now add a redundant condition on b

	... AND b BETWEEN minb AND maxb ...

and watch how c='foo' moves into the index condition. I did this test some time ago and I believe that adding the condition on b did not change the estimated cost, only the actual execution time.

Servus Manfred
Re: [HACKERS] We are not following the spec for HAVING without GROUP
On Fri, 11 Mar 2005 10:37:13 +1300, Mark Kirkwood [EMAIL PROTECTED] wrote: Firebird 1.5.1 FreeBSD 5.3 [correct results] Interbase 6.0:

	SQL> create table tab (col integer);
	SQL> select 1 from tab having 1=0;
	SQL> select 1 from tab having 1=1;
	0    ---  :-)
	SQL> insert into tab values(1);
	SQL> insert into tab values(2);
	SQL> select 1 from tab having 1=0;
	SQL> select 1 from tab having 1=1;
	1
	SQL>

Servus Manfred
Re: [HACKERS] Group-count estimation statistics
On Mon, 31 Jan 2005 14:40:08 -0500, Tom Lane [EMAIL PROTECTED] wrote: Manfred Koizar [EMAIL PROTECTED] writes: That's not what I meant. I tried to say that if we have a GROUP BY several columns and one of these columns alone has more than N/10 distinct values, there's no way to get less than that many groups. Oh, I see, you want a max calculation in there too. Seems reasonable. Any objections? Yes. :-( What I said is only true in the absence of any WHERE clause (or join). Otherwise the same cross-column correlation issues you tried to work around with the N/10 clamping might come back through the backdoor. I'm not sure whether coding for such a narrow use case is worth the trouble. Forget my idea.

Servus Manfred
Re: [HACKERS] Group-count estimation statistics
On Fri, 28 Jan 2005 10:53:33 -0500, Tom Lane [EMAIL PROTECTED] wrote: we should consider something like clamp to size of table / 10 instead. ... unless a *single* grouping column is estimated to have more than N/10 distinct values, which should be easy to check.

Servus Manfred
Re: [HACKERS] Refactoring
On Wed, 19 Jan 2005 18:57:48 +0100, I wrote: My first vacuum.c refactoring patch, rev 1.281 2004-06-08, added these comments in repair_frag():

	/*
	 * VACUUM FULL has an exclusive lock on the relation.  So
	 * normally no other transaction can have pending INSERTs or
	 * DELETEs in this relation.  A tuple is either
	 * (a) a tuple in a system catalog, inserted or deleted by
	 *     a not yet committed transaction or
	 * (b) dead (XMIN_INVALID or XMAX_COMMITTED) or
	 * (c) inserted by a committed xact (XMIN_COMMITTED) or
	 * (d) moved by the currently running VACUUM.
	 * In case (a) we wouldn't be in repair_frag() at all.
	 * In case (b) we cannot be here, because scan_heap() has
	 * already marked the item as unused, see continue above.
	 * Case (c) is what normally is to be expected.
	 * Case (d) is only possible, if a whole tuple chain has been
	 * moved while processing this or a higher numbered block.
	 */

It turns out that this comment is not quite correct. It is incomplete. Case (b) should be: known dead (XMIN_INVALID, or XMAX_COMMITTED and xmax is visible to all active transactions). And there is a fifth possibility: (e) deleted (XMAX_COMMITTED) but at least one active transaction does not see the deleting transaction. The patch seems to imply that case (e) is a subcase of (b), but effectively tuples in this state are treated more like (c).

Servus Manfred
Re: [HACKERS] Group-count estimation statistics
On Mon, 31 Jan 2005 11:20:31 -0500, Tom Lane [EMAIL PROTECTED] wrote: Already done that way.

	if (relvarcount > 1)
		clamp *= 0.1;

That's not what I meant. I tried to say that if we have a GROUP BY several columns and one of these columns alone has more than N/10 distinct values, there's no way to get less than that many groups.

Servus Manfred
Re: [HACKERS] Autotuning Group Commit
On Fri, 21 Jan 2005 23:52:51 +, Simon Riggs [EMAIL PROTECTED] wrote: Currently, we have group commit functionality via GUC parameters commit_delay and commit_siblings. And since 7.3 we have ganged WAL writes (c.f. the thread starting at http://archives.postgresql.org/pgsql-hackers/2002-10/msg00331.php) which IMHO is a better solution to the same problem. Maybe the code dealing with commit_xxx parameters should just be removed. Are you or is anybody else aware of benchmarks showing that group commit via commit_xxx is still useful?

Servus Manfred
Re: [HACKERS] Much Ado About COUNT(*)
On Mon, 24 Jan 2005 08:28:09 -0700, Jonah H. Harris [EMAIL PROTECTED] wrote:

	UPDATE pg_user_table_counts
	   SET rowcount = rowcount + 1
	 WHERE schemaname = this_schemaname
	   AND tablename = TG_RELNAME;

This might work for small single user applications. You'll have to keep an eye on dead tuples in pg_user_table_counts though. But as soon as there are several concurrent transactions doing both INSERTs and DELETEs, your solution will in the best case serialise access to test_tbl or it will break down because of deadlocks.

Servus Manfred
Re: [HACKERS] ARC patent
On Fri, 21 Jan 2005 02:31:40 +0200, Hannu Krosing [EMAIL PROTECTED] wrote: 2) Another simple, but nondeterministic, hack would be using randomness, i.e. 2.1) select a random buffer in LRU side half (or 30% or 60%) of the list for replacement. 2.2) don't move last accessed pages to top of LRU list immediately, just push them uphill some amount, either random, or perhaps 1/2 the way to top at each access. Sounds good, but how do you find the middle of a linked list? Or the other way round: Given a list element, how do you find out its position in a linked list? So the only approach that is easily implementable is 2.3) If a sequential scan hint flag is set, put the buffer into the free list at a random position.

Servus Manfred
Re: [HACKERS] Refactoring
[Sorry, Neil, for abusing your thread. Moving this discussion back to where it belongs.] On Tue, 18 Jan 2005 13:17:17 -0300, Alvaro Herrera [EMAIL PROTECTED] wrote: Hmm. I think this is a good idea on principle, but what happens in case a previous vacuum was interrupted? Is there a possibility that tuples belonging to that vacuum are still marked MOVED_OFF but are not in vacpage->offsets, for example? These bits are handled by an earlier phase of VACUUM, by HeapTupleSatisfiesVacuum() in scan_heap(). My first vacuum.c refactoring patch, rev 1.281 2004-06-08, added these comments in repair_frag():

	/*
	 * VACUUM FULL has an exclusive lock on the relation.  So
	 * normally no other transaction can have pending INSERTs or
	 * DELETEs in this relation.  A tuple is either
	 * (a) a tuple in a system catalog, inserted or deleted by
	 *     a not yet committed transaction or
	 * (b) dead (XMIN_INVALID or XMAX_COMMITTED) or
	 * (c) inserted by a committed xact (XMIN_COMMITTED) or
	 * (d) moved by the currently running VACUUM.
	 * In case (a) we wouldn't be in repair_frag() at all.
	 * In case (b) we cannot be here, because scan_heap() has
	 * already marked the item as unused, see continue above.
	 * Case (c) is what normally is to be expected.
	 * Case (d) is only possible, if a whole tuple chain has been
	 * moved while processing this or a higher numbered block.
	 */

	/*
	 * There cannot be another concurrently running VACUUM.  If
	 * the tuple had been moved in by a previous VACUUM, the
	 * visibility check would have set XMIN_COMMITTED.  If the
	 * tuple had been moved in by the currently running VACUUM,
	 * the loop would have been terminated.  We had
	 * elog(ERROR, ...) here, but as we are testing for a
	 * can't-happen condition, Assert() seems more appropriate.
	 */

	/*
	 * MOVED_OFF by another VACUUM would have caused the
	 * visibility check to set XMIN_COMMITTED or XMIN_INVALID.
	 */

This should answer your question, unless the comments are wrong. (BTW parts of that patch have been backed out by someone, so you won't find the second comment in current sources.) As for the problem whether the two code paths deal with the same set of tuples, read http://archives.postgresql.org/pgsql-hackers/2004-06/msg00410.php:

| [...] These assignments are
| preceded either by code that sets the appropriate infomask bits or by
| assertions that the bits are already set appropriately.

Servus Manfred
Re: [HACKERS] Much Ado About COUNT(*)
On Thu, 13 Jan 2005 00:39:56 -0500, Tom Lane [EMAIL PROTECTED] wrote: A would-be deleter of a tuple would have to go and clear the known good bits on all the tuple's index entries before it could commit. This would bring the tuple back into the uncertain status condition where backends would have to visit the heap to find out what's up. Eventually the state would become certain again (either dead to everyone or live to everyone) and one or the other hint bit could be set again. Last time we discussed this, didn't we come to the conclusion that resetting status bits is not a good idea because of possible race conditions? In a previous post you wrote:

| I think we still have one free bit in index tuple headers...

AFAICS we'd need two new bits: "visible to all" and "maybe dead". Writing the three status bits in the order (visible to all, maybe dead, known dead), a normal index tuple lifecycle would be

	000 -> 100 -> 110 -> 111

In states 000 and 110 the heap tuple has to be read to determine visibility. The transitions 000 -> 100 and 110 -> 111 happen as side effects of index scans. 100 -> 110 has to be done by the deleting transaction. This is the operation where the additional run time cost lies. One weakness of this approach is that once the index tuple state is 110 but the deleting transaction is aborted there is no easy way to reset the "maybe dead" bit. So we'd have to consult the heap for the rest of the tuple's lifetime.

Servus Manfred
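As a toy model of that lifecycle (the bit names and function are invented here):

	#define IND_VISIBLE_TO_ALL	0x4		/* "visible to all" */
	#define IND_MAYBE_DEAD		0x2		/* "maybe dead" */
	#define IND_KNOWN_DEAD		0x1		/* "known dead" */

	/* States 000 and 110 are the ones where the heap must be consulted. */
	static int needs_heap_visit(unsigned bits)
	{
		return bits == 0 ||
			   bits == (IND_VISIBLE_TO_ALL | IND_MAYBE_DEAD);
	}

In states 100 and 111 an index scan can decide visibility from the index entry alone.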
Re: [HACKERS] Shared row locking
On Wed, 29 Dec 2004 19:57:15 -0500, Tom Lane [EMAIL PROTECTED] wrote: Manfred Koizar [EMAIL PROTECTED] writes: I don't see too much of a difference between #1 (an on-disk structure buffered in shared memory) and #2 (a shared memory structure spilling to disk). If you stand back that far, maybe you can't see a difference ;-) ... Well, I tried to step back a bit to see the whole picture. You think I went too far this time? :-) but the proposal on the table was for an extremely special-purpose on-disk structure. I'd prefer to see first if we can solve a more general problem, namely the fixed size of the existing lock table. I haven't been digging into the code for some time (:-() -- but isn't this basically a key/value mapping with random access? My concern is that as soon as you start to push entries out of the memory structure you have to implement some kind of on-disk structure to support fast lookups, which sooner or later might lead to something that looks suspiciously like the existing hash or btree code. So the question is whether starting by making nbtree more flexible isn't the lower hanging fruit...

Servus Manfred
Re: [HACKERS] Shared row locking
On Thu, 30 Dec 2004 13:36:53 -0500, Tom Lane [EMAIL PROTECTED] wrote: Certainly not; indexes depend on locks, not vice versa. You'd not be able to do that without introducing an infinite recursion into the system design. Wouldn't you have to face the same sort of problems if you spill part of the lock table to disk? While you do I/O you have to hold some lock. In either case there has to be a special class of locks that are pinned in memory. In any case nbtree is much more heavyweight than we need for this Having functionality we don't need is not a showstopper ... unless heavyweight implies slow, which I have to admit may well be the case.

Servus Manfred
Re: [HACKERS] Shared row locking
On Thu, 16 Dec 2004 21:54:14 -0300, Alvaro Herrera [EMAIL PROTECTED] wrote: Else, it will have to wait, using XactLockTableWait, for the first transaction in the array that is still running. We can be sure that no one will try to share-lock the tuple while we check the btree because we hold an exclusive lock on the tuple's heap page's buffer. Do you really want to XactLockTableWait while holding an exclusive lock on a heap page? Or are you going to release the lock before entering the wait? Thanks to tablespaces, it's very easy to create special Relations that can be dealt with by standard buffer and storage manager, etc. I don't get how tablespaces would help. The btree idea: - does not need crash recovery. Maybe we could use a stripped down version of nbtree. This could cause a maintainability nightmare. It could be a nightmare if you simply duplicate and then modify the code. A better way would be to refactor nbtree to be able to handle btrees with different properties:

. shared/private
. logged/non-logged
. flexible value data type

- can't hold more than 300 or so simultaneous lockers (because of value length, limited to 1/3 of a page). I doubt this is a real problem. Or you store each lock as a separate index tuple. If this is expected to be a problem because of many repeating key values, nbtree should be enhanced to support duplicate key compression, which might be a good thing per se.

Servus Manfred
Re: [HACKERS] Shared row locking
On Mon, 20 Dec 2004 21:44:01 +0100, [EMAIL PROTECTED] wrote: Tom Lane [EMAIL PROTECTED] wrote on 20.12.2004, 19:34:21: #1 could have a pretty serious performance impact, too. For small numbers of FOR UPDATE locks (too few to force spill to disk) I would expect #2 to substantially beat #1. #1 essentially imposes the worst case performance at all times, whereas #2 degrades (at a currently unknown rate) when there are lots and lots of FOR UPDATE locks. I don't see too much of a difference between #1 (an on-disk structure buffered in shared memory) and #2 (a shared memory structure spilling to disk). As long as we can avoid unnecessary writes, #1 has the nice property that it automatically adapts to different usage patterns because it uses the normal shared buffer pool and cache replacement strategy. [My gut feeling would be against another permanent on-disk structure, since this is one more thing for a user to delete to save space etc...] Irrelevant, IMHO. Whoever manually deletes any file under PGDATA deserves whatever this may cause. I haven't seen many programs that use extended SELECT FOR UPDATE logic. AFAICS, SELECT FOR UPDATE is not the primary issue here, because it locks rows exclusively. This thread is about shared locks, which should be used for foreign key checks.

Servus Manfred
Re: [HACKERS] Bgwriter behavior
[I know I'm late and this has already been discussed by Richard, Tom, et al., but ...] On Tue, 21 Dec 2004 16:17:17 -0600, Jim C. Nasby [EMAIL PROTECTED] wrote: look at where the last page you wrote out has ended up in the LRU list since you last ran, and start scanning from there (by definition everything after that page would have to be clean). This is a bit oversimplified, because that page will be moved to the start of the list when it is accessed the next time.

	A = B = C = D = E = F = G = H = I = J = K = L = m = n = o = p = q
	                                                ^

would become

	M = A = B = C = D = E = F = G = H = I = J = K = L = n = o = p = q
	^

(a-z ... known to be clean, A-Z ... possibly dirty)

But with a bit of cooperation from the backends this could be made to work. Whenever a backend takes the page which is the start of the clean tail out of the list (most probably to insert it into another list or to re-insert it at the start of the same list) the clean tail pointer is advanced to the next list element, if any. So we would get

	M = A = B = C = D = E = F = G = H = I = J = K = L = n = o = p = q
	                                                    ^

As a little improvement the clean tail could be prevented from shrinking unnecessarily fast by moving the pointer to the previous list element if this is found to be clean:

	M = A = B = C = D = E = F = G = H = I = J = K = l = n = o = p = q
	                                                ^

Maybe this approach could serve both goals, (1) keeping a number of clean pages at the LRU end of the list and (2) writing out other dirty pages if there's not much to do near the end of the list. But ... On Tue, 21 Dec 2004 10:26:48 -0500, Tom Lane [EMAIL PROTECTED] wrote: Also, the cntxDirty mechanism allows a block to be dirtied without changing the ARC state at all. ... which might kill this proposal anyway.

Servus Manfred
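A sketch of the bookkeeping this would need (structure and names invented here, locking ignored):

	typedef struct BufElem
	{
		struct BufElem *prev, *next;
		int				dirty;
	} BufElem;

	typedef struct
	{
		BufElem	   *head, *tail;
		BufElem	   *clean_tail;		/* first element of the known-clean tail */
	} LruList;

	/* Backend cooperation: keep clean_tail valid when unlinking an element. */
	static void lru_unlink(LruList *lru, BufElem *e)
	{
		if (e == lru->clean_tail)
			lru->clean_tail = e->next;	/* advance to the next element */
		if (e->prev) e->prev->next = e->next; else lru->head = e->next;
		if (e->next) e->next->prev = e->prev; else lru->tail = e->prev;
		e->prev = e->next = NULL;
	}

	/* The "little improvement": absorb clean predecessors into the tail. */
	static void lru_grow_clean_tail(LruList *lru)
	{
		while (lru->clean_tail && lru->clean_tail->prev &&
			   !lru->clean_tail->prev->dirty)
			lru->clean_tail = lru->clean_tail->prev;
	}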
Re: [HACKERS] Updateable Views?
On Sat, 07 Aug 2004 10:24:34 -0400, Jan Wieck [EMAIL PROTECTED] wrote: I have not heard of updatable subselects yet. http://asktom.oracle.com/pls/ask/f?p=4950:8:6693556430011788783::NO::F4950_P8_DISPLAYID,F4950_P8_CRITERIA:273215737113,

| Here we update a join. [...]
| [EMAIL PROTECTED] update
|   2  ( select columnName, value
|   3      from name, lookup
|   4     where name.keyname = lookup.keyname
|   5       and lookup.otherColumn = :other_value )
|   6   set columnName = value
|   7  /

Google for "oracle delete statement syntax" or "oracle update statement syntax".

Servus Manfred
Re: [HACKERS] More vacuum.c refactoring
[Sorry for the late reply. I'm still struggling to catch up after vacation ...] On Fri, 9 Jul 2004 21:29:52 -0400 (EDT), Bruce Momjian [EMAIL PROTECTED] wrote: Where are we on this, 2x. :-) Here: Tom Lane wrote: Will study these comments later, but it's too late at night here...

Servus Manfred
Re: [HACKERS] Quick question regarding tablespaces
On Thu, 1 Jul 2004 22:55:56 -0400, Mike Rylander [EMAIL PROTECTED] wrote: I was thinking of purely tablespace-based random_page_cost, as that variable is tied to the access time of a particular filesystem. Strictly speaking we'd also need tablespace-based sequential_page_cost.

Servus Manfred
Re: [HACKERS] Trapping QUERY_CANCELED: yes, no, maybe?
On Sat, 31 Jul 2004 21:24:33 -0400, Tom Lane [EMAIL PROTECTED] wrote: Exactly. There's a proof-of-concept test at the bottom of regress/sql/plpgsql.sql, wherein a function gets control back from a query that would have run for an unreasonably long time. referring to

| -- we assume this will take longer than 1 second:
| select count(*) into x from tenk1 a, tenk1 b, tenk1 c;

On Thu, 29 Aug 2002 13:27:36 -0400, Tom Lane [EMAIL PROTECTED] wrote: You mean, that the test might fail on a system that takes more than ten seconds to INSERT or UPDATE a single row? I don't think this is a real problem. I don't like depending on a timeout *at all* in a regression test; the exact value of the timeout is not particularly relevant to my concern about it. Maybe

	SELECT sleep('0:0:2'::interval);

as used in regress/sql/stats.sql is a better way to ensure that the query takes longer than one second?

Servus Manfred
Re: [HACKERS] Trapping QUERY_CANCELED: yes, no, maybe?
On Fri, 06 Aug 2004 18:55:49 -0400, Tom Lane [EMAIL PROTECTED] wrote: You think there's a serious risk of failure there ;-) ? Not on my hardware...

Servus Manfred
Re: [HACKERS] Another unpleasant surprise using inheritance
On Fri, 11 Jun 2004 14:11:00 +0200, Darko Prenosil [EMAIL PROTECTED] wrote: I think I found a bug related to table inheritance (or at least very weird behavior). This is well known and there's a todo for it: # Allow inherited tables to inherit index, UNIQUE constraint, and primary key, foreign key [inheritance] See also http://momjian.postgresql.org/cgi-bin/pgtodo?inheritance.

Servus Manfred
[HACKERS] More vacuum.c refactoring
Near the end of repair_frag() in vacuum.c -- under the comment /* clean moved tuples from last page in Nvacpagelist list */ -- there is code that marks itemids as unused. Itemids affected are those referring to tuples that have been moved off the last page. This code is very similar to vacuum_page(). The major difference is that vacuum_page() uses vacpage->offsets while the code in repair_frag() looks for MOVED_OFF bits in tuple headers. AFAICS the tuples with the MOVED_OFF bit set are exactly those referenced by vacpage->offsets. The attached patch passes make check and make installcheck. Please apply unless I'm missing something.

Servus Manfred

diff -Ncr ../base/src/backend/commands/vacuum.c src/backend/commands/vacuum.c
*** ../base/src/backend/commands/vacuum.c	Wed Jun  2 21:46:59 2004
--- src/backend/commands/vacuum.c	Thu Jun 10 18:50:26 2004
***************
*** 2288,2355 ****
  			vacpage->offsets_free > 0)
  		{
  			Buffer		buf;
- 			Page		page;
- 			OffsetNumber unused[BLCKSZ / sizeof(OffsetNumber)];
- 			OffsetNumber offnum,
- 						maxoff;
- 			int			uncnt;
- 			int			num_tuples = 0;
  
  			buf = ReadBuffer(onerel, vacpage->blkno);
  			LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
! 			page = BufferGetPage(buf);
! 			maxoff = PageGetMaxOffsetNumber(page);
! 			for (offnum = FirstOffsetNumber;
! 				 offnum <= maxoff;
! 				 offnum = OffsetNumberNext(offnum))
! 			{
! 				ItemId		itemid = PageGetItemId(page, offnum);
! 				HeapTupleHeader htup;
! 
! 				if (!ItemIdIsUsed(itemid))
! 					continue;
! 				htup = (HeapTupleHeader) PageGetItem(page, itemid);
! 				if (htup->t_infomask & HEAP_XMIN_COMMITTED)
! 					continue;
! 
! 				/*
! 				 ** See comments in the walk-along-page loop above, why we
! 				 ** have Asserts here instead of if (...) elog(ERROR).
! 				 */
! 				Assert(!(htup->t_infomask & HEAP_MOVED_IN));
! 				Assert(htup->t_infomask & HEAP_MOVED_OFF);
! 				Assert(HeapTupleHeaderGetXvac(htup) == myXID);
! 
! 				itemid->lp_flags &= ~LP_USED;
! 				num_tuples++;
! 
! 			}
! 			Assert(vacpage->offsets_free == num_tuples);
! 
! 			START_CRIT_SECTION();
! 
! 			uncnt = PageRepairFragmentation(page, unused);
! 
! 			/* XLOG stuff */
! 			if (!onerel->rd_istemp)
! 			{
! 				XLogRecPtr	recptr;
! 
! 				recptr = log_heap_clean(onerel, buf, unused, uncnt);
! 				PageSetLSN(page, recptr);
! 				PageSetSUI(page, ThisStartUpID);
! 			}
! 			else
! 			{
! 				/*
! 				 * No XLOG record, but still need to flag that XID exists
! 				 * on disk
! 				 */
! 				MyXactMadeTempRelUpdate = true;
! 			}
! 
! 			END_CRIT_SECTION();
! 
  			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
  			WriteBuffer(buf);
  		}
--- 2288,2297 ----
  			vacpage->offsets_free > 0)
  		{
  			Buffer		buf;
  
  			buf = ReadBuffer(onerel, vacpage->blkno);
  			LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
! 			vacuum_page(onerel, buf, vacpage);
  			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
  			WriteBuffer(buf);
  		}
Re: [HACKERS] More vacuum.c refactoring
On Thu, 10 Jun 2004 17:19:22 -0400, Tom Lane [EMAIL PROTECTED] wrote: This does not make me comfortable. I understand you, honestly. Do I read between your lines that you didn't review my previous vacuum.c refactoring patch? Please do. It'd make *me* more comfortable. You *think* that two different bits of code are doing the same thing, I see three significant differences between the code in repair_frag() and vacuum_page().

1) vacuum_page() has

	Assert(vacpage->offsets_used == 0);

vacpage is the last VacPage that has been inserted into Nvacpagelist. It is allocated in line 1566, offsets_used is immediately set to 0 and never changed. So this Assert(...) doesn't hurt.

2) In vacuum_page() the lp_flags are set inside a critical section. This is no problem because the clear-used-flags loop does not elog(ERROR, ...). Please correct me if I'm wrong.

3) vacuum_page() uses vacpage->offsets to locate the itemids that are to be updated. If we can show that these are the same itemids that belong to the tuples that are found by inspecting the tuple headers, then the two code snippets are equivalent. The first hint that this is the case is

	Assert(vacpage->offsets_free == num_tuples);

So both spots expect to update the same number of itemids. What about the contents of the offsets[] array? Offset numbers are entered into this array by statements like

	vacpage->offsets[vacpage->offsets_free++] = offnum;

in or immediately after the walk-along-page loop. These assignments are preceded either by code that sets the appropriate infomask bits or by assertions that the bits are already set appropriately. The rest (from PageRepairFragmentation to END_CRIT_SECTION) is identical. so you want to hack up vacuum.c? This module is delicate code --- we've had tons of bugs there in the past But why is it so delicate? Not only because it handles difficult problems, but also because it is written in a not very maintenance-friendly way. Before I started refactoring the code repair_frag() had more than 1100 lines and (almost) all variables used anywhere in the function were declared at function level. We cannot declare a code freeze for a module just because it is so badly written that every change is likely to break it. Someday someone will *have* to change it. Better we break it today in an effort to make the code clearer. --- and no I have zero confidence that passing the regression tests proves anything Unfortunately you are right :-(

Servus Manfred
Re: [HACKERS] SELECT * FROM table LIMIT 1; is really slow
On Fri, 28 May 2004 14:47:01 -0400, Tom Lane [EMAIL PROTECTED] wrote: If putting back xmax is the price we must pay for nested transactions, then we *will* pay that price. Maybe not in this release, but it will inevitably happen. we = every Postgres user, even those that do not use subtransactions. price = 2% to 5% performance loss for databases where the working set is larger than main memory. Don't bother hollering veto ;-) Ok, I'll shut up till I have something concrete to support my opinion.

Servus Manfred
Re: [HACKERS] SELECT * FROM table LIMIT 1; is really slow
On Fri, 28 May 2004 21:59:53 -0400, Alvaro Herrera [EMAIL PROTECTED] wrote: So the assumption was that when we see that this has happened, the Cmin is no longer important (== every future command can already see the tuple) If it was that simple, we wouldn't have had to invent the CMIN_IS_CMAX flag. This has been discussed two years ago. Did you follow the link I posted last week? "Every future command" is not enough. You have to consider the current command and even commands started before this.

Servus Manfred
Re: [HACKERS] SELECT * FROM table LIMIT 1; is really slow
On Thu, 27 May 2004 16:50:24 -0400, Alvaro Herrera [EMAIL PROTECTED] wrote: Now you are on the subject, can I ask you to take a peek at what I did regarding tuple headers? I did read your patch, but I didn't understand it. :-( At first I thought I'd have to add back Xmax as a field on its own Veto! This would increase heap tuple header size == less tuples per page == more pages per table == more I/O == performance loss. is there a situation on which we should need to peek at Cmin after setting Xmax for a particular tuple? http://archives.postgresql.org/pgsql-hackers/2002-05/msg00090.php

Servus Manfred
Re: [HACKERS] SELECT * FROM table LIMIT 1; is really slow
On Wed, 26 May 2004 18:17:55 -0400, Tom Lane [EMAIL PROTECTED] wrote: (Some days I think we should dump VACUUM FULL, because it's optimized for a case that's no longer very interesting...) So we still have to stick with VACUUM FULL for some time, right? The next set of compatibility breakers I'm currently working on requires a change in VACUUM FULL behaviour. I would only move tuples that are visible to all running transactions. OTOH I wouldn't stop at the first unmovable tuple. With

	X	active tuple
	.	free space or dead tuple
	y	new tuple, not yet visible to a running transaction
	z	deleted tuple, still visible to a running transaction

the current implementation transforms this relation

	XXyX XzXX

into

	XzXX XXyX

The new implementation would leave it as

	XX.. ..y. .z..

If there are concurrent long-running transactions, the new VACUUM FULL wouldn't truncate the relation as aggressively as it does now. It could leave the relation with lots of free space near the end. This was absolutely unacceptable at the time when VACUUM FULL was designed. But now we could use lazy VACUUM as an excuse for VACUUM FULL not working so hard. After the transaction still seeing z terminates, VACUUM (without FULL) can truncate the relation to

	XX.. ..y.

and when y is updated the new version will be stored in a lower block and plain VACUUM can truncate the relation again:

	XXY.

AFAICS this would make vacuum.c much simpler (no more chain moves). Clearly this change alone doesn't have any merit. But would such a patch have any chance of being accepted, if it facilitates improvements in other areas?

Servus Manfred
Re: [HACKERS] SELECT * FROM table LIMIT 1; is really slow
On Thu, 27 May 2004 14:23:07 -0400, Tom Lane [EMAIL PROTECTED] wrote: and when y is updated the new version will be stored in a lower block Oh? What makes you think that? I see no guarantee of it. You're right, I see only a tendency, because the majority of free space is before the last block (obviously). But don't we try to store the new version on the same block as the old version? That'd weaken my argument a bit. I think you'd have to not move *any* updated tuples to be sure you don't need the chain-move mechanism. Yes, per definitionem (move only tuples that are visible to all). I'm disinclined to mess with VACUUM FULL without a clearer explanation of where you're headed. Have no fear. I won't change anything in the near term. As you were talking about the future of VACUUM FULL, I thought this might be a good opportunity to ask. The fact that you didn't outright reject the idea is good enough for now. I have no clear explanation at the moment, just some fuzzy ideas that are beginning to crystallise. I'm messing around with heap tuple headers again, and the Xvac field seems to get in the way, unless I can cut down the number of different scenarios where it is needed.

Servus Manfred
Re: [HACKERS] New horology failure
[resending...] On Sun, 23 May 2004 11:38:51 +0800, Christopher Kings-Lynne [EMAIL PROTECTED] wrote: I get this since Tom's commit.

	--- ./results/horology.out	Sun May 23 11:39:49 2004
	***************
	*** 1787,1796 ****
	!  | Sat Sep 22 18:19:20 2001 PDT | @ 34 years | Fri Sep 22 18:19:20 1967 PDT
	[...]
	--- 1787,1796 ----
	!  | Sat Sep 22 18:19:20 2001 PDT | @ 34 years | Fri Sep 22 18:19:20 1967 PST
	[...]

I got the same with snapshot-20040521 yesterday [i.e. 2004-05-22] afternoon when I ran make check. But only once. make installcheck passed all tests, and the failure didn't reappear when I tried make check again. I just got the failure again with make check after having configured with a new install directory. My guess is that horology needs some datafile from the install location.

Servus Manfred
Re: [HACKERS] zero-column table behavior
[resending...] On Sat, 22 May 2004 20:28:43 -0400, Neil Conway [EMAIL PROTECTED] wrote:

	-- Why is there a blank line before the -- that indicates the
	-- end of the result set?

"--" separates the header line from the *start* of the result set. The empty line is the header line, containing zero column headers.

	-- If the result set contains two rows, ISTM the psql output
	-- should emit either two or three blank lines before the --
	-- that indicates the end of the result set

One empty header line before "--" and then two empty data lines.

Servus Manfred
Re: [HACKERS] Multiple Xids in PGPROC?
On Tue, 04 May 2004 23:21:07 -0400, Tom Lane [EMAIL PROTECTED] wrote: I thought we had devised a solution that did not require expansible shared memory for this. Bruce, Manfred, do you recall how that went? AFAIR we did not discuss TransactionIdIsInProgress() specifically. Currently this function is special insofar as it does not consult pg_clog but loops over the PGPROC array. The current implementation is in sinval.c. The straightforward pg_clog lookup is still in transam.c, but has been deactivated:

	 * Now this func in shmem.c and gives quality answer by scanning
	 * PGPROC structures of all running backend. - vadim 11/26/96

What was the motivation for this change? Consistency or speed? With subtransactions we'd have to fall back to checking pg_clog (and pg_subtrans) in certain cases. There are lots of possible implementations. Here are some ideas (just brainstorming):

. We could first scan the PGPROC array. If the xid is an active main transaction, we're finished.

. If xid is older than RecentGlobalXmin, it cannot be active.

. We could include a small number of subtransaction xids in PGPROC.

. For additional subtransactions not fitting into this small array there could be minsubxid and maxsubxid fields in PGPROC. If the xid we are looking for is outside all these ranges, it cannot be an active subtransaction.

. If all these tests fail, we fall back to checking pg_clog.

Servus Manfred
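Stringing those ideas together, the check order might look about like this (pseudocode-level C; every function below except TransactionIdPrecedes is hypothetical, and the PostgreSQL types are assumed to be in scope):

	bool
	TransactionIdIsInProgress(TransactionId xid)
	{
		/* 1. active main transaction? -- the current PGPROC scan */
		if (xid_is_active_main_xact(xid))
			return true;
		/* 2. too old to be an active (sub)transaction? */
		if (TransactionIdPrecedes(xid, RecentGlobalXmin))
			return false;
		/* 3. in some backend's small in-PGPROC subxid array? */
		if (xid_in_proc_subxid_cache(xid))
			return true;
		/* 4. outside every backend's [minsubxid, maxsubxid] range? */
		if (!xid_in_any_proc_subxid_range(xid))
			return false;
		/* 5. last resort: pg_clog (and pg_subtrans) */
		return !clog_xid_is_finished(xid);
	}

The cheap tests come first, so the pg_clog fallback would only be reached for xids that really might be active subtransactions.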
Re: [HACKERS] Number of pages in a random sample
On Mon, 26 Apr 2004 08:08:16 -0700, Sailesh Krishnamurthy [EMAIL PROTECTED] wrote: A Bi-Level Bernoulli Scheme for Database Sampling Peter Haas, Christian Koenig (SIGMOD 2004) Does this apply to our problem? AFAIK with Bernoulli sampling you don't know the sample size in advance. Anyway, thanks for the hint. Unfortunately I couldn't find the document. Do you have a link? Servus Manfred ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] btbulkdelete
On Wed, 28 Apr 2004 00:08:48 -0400, Tom Lane [EMAIL PROTECTED] wrote: Is there a special reason for scanning the leaf pages in *logical* order, i.e. by following the opaque->btpo_next links? Yes. [..] interlocking between indexscans and deletions.

Thanks for refreshing my memory. This has been discussed two years ago, and I even participated in that discussion :-(

Servus Manfred

---(end of broadcast)---
TIP 8: explain analyze is your friend
Re: [HACKERS] btbulkdelete
On Mon, 26 Apr 2004 14:29:58 +0100, Simon Riggs [EMAIL PROTECTED] wrote: Now that FSM covers free btree index pages this access pattern might be highly nonsequential. I had considered implementing a mode where the index doesn't keep trying to reuse space that was freed by earlier deletes.

Or maybe an FSM function a la "Give me a free page near this one"?

Servus Manfred

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
[HACKERS] btbulkdelete
On -performance we have been discussing a configuration where a bulk delete run takes almost a day (and this is not due to crappy hardware or apparent misconfiguration). Unless I misinterpreted the numbers, btbulkdelete() processes 85 index pages per second, while lazy vacuum is able to clean up 620 heap pages per second.

Is there a special reason for scanning the leaf pages in *logical* order, i.e. by following the opaque->btpo_next links? Now that FSM covers free btree index pages this access pattern might be highly nonsequential. I'd expect the following scheme to be faster:

	for blknum = 1 to nblocks {
		read block blknum;
		if (block is a leaf) {
			process it;
		}
	}

As there is no free lunch this has the downside that it pollutes the cache with unneeded inner nodes and free pages. OTOH there are far fewer inner pages than leaf pages (even a balanced binary tree has more leaves than inner nodes), and if free pages become a problem it's time to re-index. Did I miss something else?

Servus Manfred

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
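The loop above in runnable miniature -- a standalone toy with invented stand-ins for the btree page format (no locking, no bufmgr), purely to illustrate the access pattern:

	#include <stdio.h>
	#include <stdbool.h>

	typedef struct
	{
		bool	is_leaf;
		int		ndead;			/* dead index items on this page */
	} ToyPage;

	static void bulkdelete_leaf(ToyPage *p)
	{
		p->ndead = 0;			/* pretend we removed the dead items */
	}

	int main(void)
	{
		/* block 0 is the metapage, block 1 the root in this toy layout */
		ToyPage pages[8] = {
			{false, 0}, {false, 0},
			{true, 3}, {true, 0}, {true, 5}, {false, 0}, {true, 1}, {true, 2}
		};

		/* one sequential pass in physical order; inner pages are read
		 * but skipped, which is the cache-pollution downside mentioned */
		for (int blkno = 1; blkno < 8; blkno++)
			if (pages[blkno].is_leaf)
				bulkdelete_leaf(&pages[blkno]);

		for (int blkno = 0; blkno < 8; blkno++)
			printf("block %d: %s, %d dead items left\n", blkno,
				   pages[blkno].is_leaf ? "leaf " : "inner",
				   pages[blkno].ndead);
		return 0;
	}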
[HACKERS] Number of pages in a random sample (was: query slows down with more accurate stats)
On Mon, 19 Apr 2004 12:00:10 -0400, Tom Lane [EMAIL PROTECTED] wrote: A possible compromise is to limit the number of pages sampled to something a bit larger than n, perhaps 2n or 3n. I don't have a feeling for the shape of the different-pages probability function; would this make a significant difference, or would it just waste cycles?

I would have replied earlier, if I had a good answer. What I have so far contains at least one, probably two flaws. Knowing not much more than the four basic arithmetic operations I was not able to improve my model. So I post what I have:

As usual we assume a constant number c of tuples per page. If we have a table of size B pages and want to collect a sample of n tuples, the number of possible samples is (again in OOo syntax)

	left( binom{cB}{n} right)

If we select an arbitrary page, the number of possible samples that do NOT contain any tuple from this page is

	left( binom {c (B-1)} {n} right)

Let's forget about our actual implementations of sampling methods and pretend we have a perfect random sampling method. So the probability Pnot(c, B, n) that a certain page is not represented in a random sample is

	left( binom {c (B-1)} {n} right) over left( binom{cB}{n} right)

which can be transformed into the more computing-friendly form

	prod from{i=0} to{n-1} {{cB-c - i} over {cB - i}}

Clearly the probability that a certain page *is* represented in a sample is

	Pyes(c, B, n) = 1 - Pnot(c, B, n)

The next step assumes that these probabilities are independent for different pages, which in reality they are not. We simply estimate the number of pages represented in a random sample as

	numPag(c, B, n) = B * Pyes(c, B, n)

Here are some results for n = 3000:

	  B \ c |   10 |  100 |  200
	--------+------+------+------
	    100 |  --- |  100 |  100
	   1000 |  972 |  953 |  951
	   2000 | 1606 | 1559 | 1556
	   3000 | 1954 | 1902 | 1899
	   6000 | 2408 | 2366 | 2363
	   9000 | 2588 | 2555 | 2553
	  20000 | 2805 | 2788 | 2787
	  30000 | 2869 | 2856 | 2856
	 100000 | 2960 | 2956 | 2956

This doesn't look to depend heavily on the number of tuples per page, which sort of justifies the assumption that c is constant.

In the next step I tried to estimate the number of pages that contain exactly 1, 2, ... tuples of the sample. My naive procedure works as follows (I'm not sure whether it is even valid as a rough approximation; constructive criticism is very welcome): For c=100, B=3000, n=3000 we expect 1902 pages to contain at least 1 tuple of the sample. There are 1098 more tuples than pages; these tuples lie somewhere in those 1902 pages from the first step. numPag(99, 1902, 1098) = 836 pages contain at least a second tuple. So the number of pages containing exactly 1 tuple is 1902 - 836 = 1066. Repeating these steps we get 611 pages with 2 tuples, 192 with 3, 30 with 4, and 3 pages with 5 tuples.

Here are some more numbers for c = 100 and n = 3000:

	      B | pages with 1, 2, ... tuples
	--------+------------------------------------------------------------
	    100 | 1 to 24 tuples: 0, then 1, 2, 4, 10, 18, 26, 24, 11, 4
	   1000 | 108, 201, 268, 229, 113, 29, 5
	   2000 | 616, 555, 292, 83, 12, 1
	   3000 | 1066, 611, 192, 30, 3
	   6000 | 1809, 484, 68, 5
	   9000 | 2146, 374, 32, 2
	  20000 | 2584, 196, 8
	  30000 | 2716, 138, 3
	 100000 | 2912, 44

A small C program to experimentally confirm or refute these calculations is attached. Its results are fairly compatible with above numbers, IMHO.
Servus Manfred

/*
** samsim.c - sampling simulator
*/
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

typedef int bool;

#define MAX_RANDOM_VALUE (0x7FFFFFFF)

static void
initrandom()
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	srandom(tv.tv_sec ^ tv.tv_usec);
}/*initrandom*/

/* Select a random value R uniformly distributed in 0 < R < 1 */
static double
random_fract(void)
{
	long		z;

	/* random() can produce endpoint values, try again if so */
	do
	{
		z = random();
	} while (z <= 0 || z >= MAX_RANDOM_VALUE);
	return (double) z / (double) MAX_RANDOM_VALUE;
}

/*
** data structure for (modified) Algorithm S from Knuth 3.4.2
*/
typedef struct
{
	long		N;				/* number of tuples, known in advance */
	int			n;				/* sample size */
	long		t;				/* current tuple number */
	int			m;				/* tuples selected so far */
} SamplerData;
typedef SamplerData *Sampler;

static void Sampler_Init(Sampler bs, long N, int samplesize);
static bool Sampler_HasMore(Sampler bs);
static long
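For checking the numPag() table without OOo or the simulator, the product formula is also just a few lines of C (this is my transcription of the formula above, not part of samsim.c):

	#include <stdio.h>

	/* Pnot(c,B,n) = prod_{i=0}^{n-1} (cB - c - i) / (cB - i),
	 * numPag(c,B,n) = B * (1 - Pnot); assumes c*B - c >= n. */
	static double numPag(double c, double B, long n)
	{
		double	cB = c * B;
		double	pnot = 1.0;

		for (long i = 0; i < n; i++)
			pnot *= (cB - c - i) / (cB - i);
		return B * (1.0 - pnot);
	}

	int main(void)
	{
		long	Bs[] = {1000, 3000, 20000, 100000};

		/* expect roughly 953, 1902, 2788, 2956 for c=100, n=3000 */
		for (int i = 0; i < 4; i++)
			printf("B=%6ld  numPag=%.0f\n", Bs[i], numPag(100, Bs[i], 3000));
		return 0;
	}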
[HACKERS] Tuple sampling
The proposed new sampling method (http://archives.postgresql.org/pgsql-hackers/2004-04/msg00036.php and http://archives.postgresql.org/pgsql-patches/2004-04/msg00045.php) basically incorporates two independent changes:

(1) Two-stage sampling: Stage one collects a random sample of pages, stage two collects a random sample of tuples out of these pages.

(2) Improved row count estimation: Once a page is read in, all tuples on this page are looked at to see whether they are active (visible, live) or dead. The estimated row count is calculated from the number of active tuples seen, the number of visited pages, and the total number of pages.

The preview patch implements three sampling methods: Method 0 is the original sampling method; method 1 uses both changes; and method 2 is sort of change (1) without change (2). Theoretically we could apply only the second change, if we come to the conclusion that we do not want two-stage sampling. But I don't have test results for this.

I'd like to present my *interpretation* of the results I got from testing various sampling methods. Whoever is interested in looking at the raw test results can get them from http://www.pivot.at/pg/SamplePerf2.sxc or (for those still without OOo) http://www.pivot.at/pg/SamplePerf2.html.

This set of tests measures only the sampling times; actually analysing the sample and storing the results in pg_statistic has been disabled. To eliminate as many sources of noise as possible, the following steps have been taken prior to each test run:

. SELECT count(*) FROM testtable; to make sure that all visibility bits in the tuple headers are set correctly,
. tar cf /dev/null something to fill the OS cache with unrelated stuff,
. SELECT count(*) FROM othertable; to fill the shared buffers with unrelated data,
. CHECKPOINT; to avoid a checkpoint happening during the test run.

The tests have been performed with two types of tables. The tables named x, x2, x3, x3d have initially been copied from tenk1 in the regression database. These tables have between 20 and 30 tuples per page. Tables of the other type -- named y, y3, y3d -- have much smaller tuples, ca. 150 per page.

Method 1 vs. method 2: With large tuples method 2 is sometimes faster, sometimes slower than method 1, never enough to worry about. This has been tested for statistics targets 10 and 100. Small tuples, statistics target 10 to 100: No reproducible difference up to 490 pages (80000 tuples). Starting at 1900 pages (320000 tuples) method 2 is consistently faster. The difference can easily be explained by hundreds of thousands or even millions of heap_fetch() calls. OTOH with method 2 we get row count estimation errors of up to 20% compared to less than 1% for method 1. So I conclude that change (2) is well worth the CPU cycles it costs. Method 0 produced estimation errors of up to 60%.

What about two-stage sampling? Comparing the speeds of method 0 and method 2 we get the following pattern which is basically the same for all statistics targets I tried. Sample size 3000 (30000): For tables smaller than 1000 (5000) pages all sampling methods access all or almost all pages and there is hardly any runtime difference. For table sizes between 1000 (5000) and 3000 (30000) the new method reads the whole table while the old method starts to skip pages. This doesn't result in a significant runtime difference, however. If the table size grows beyond 3000 (30000) pages, the number of page reads continues to grow only for the old method, but up to ca. 30000 (300000) pages the new method is not much faster (if at all).
For tables larger than 30000 (300000) pages two-stage sampling is reliably faster.

Other pros and cons of two-stage sampling

Pro and con: Cache pollution. Two-stage sampling reads the same number of pages as one-stage sampling for small tables, slightly more for medium sized tables (worst case is ca. 15%), and a little less to far less for large to very large tables (3000 vs. 20000 for 430000 pages with statistics target 10, 30000 vs. 98000 for the same table with statistics target 100).

Con: In Tom's words There's still a risk of not being able to collect N rows out of N blocks, if you are unfortunate enough to select a lot of wholly-empty pages. But that seems like a low-probability scenario; besides such a table would be so desperately in need of VACUUM FULL that the possible low quality of the stats hardly matters.

Con: The sample generated is not perfectly random (cf. discussion starting at http://archives.postgresql.org/pgsql-performance/2004-04/msg00196.php). If somebody has an idea how we could steer against that effect of collecting tuples from too few different pages, I'd be happy to implement it. If nothing pops up, I think that the current consensus is that we don't care.

Anything else?

Servus Manfred

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] [GENERAL] Large DB
On Fri, 02 Apr 2004 14:48:13 -0500, Tom Lane [EMAIL PROTECTED] wrote: Manfred Koizar [EMAIL PROTECTED] writes: What I have in mind is a kind of Double Vitter algorithm. [...] random sample of sample_size block numbers, and then to sample the rows out of this pool of blocks. That assumption is faulty, though --- consider wholly-empty pages. A bigger problem is that this makes the sampling quite nonuniform, because rows that are on relatively low-density pages would be more likely to become part of the final sample than rows that are on pages with lots of tuples.

This sounds like you are assuming that I want to take exactly one tuple out of each block of the block sample. This is not the case. In the second round I plan to apply the same (or a better) Vitter method as it is done now. The main difference is that blocks will be addressed indirectly through the array of block numbers obtained in the first round.

Thus for example your sample would tend to favor rows with wide values of variable-width columns and exclude narrower values. (I am not certain that the existing algorithm completely avoids this trap, but at least it tries.)

I'm reading 7.4 source code and I fail to see how it does this. If the relation starts with an atypical distribution of wide/narrow or dead/alive tuples, a wrong value for tuplesperpage is used for the rest of the sampling. Tuples immediately following one or more dead tuples have a better chance of being selected. This may be called as random as anything else and not favouring a special property. OTOH after long runs of dead tuples consecutive tuples are likely to be selected.

Your comment about nonuniformity above exactly describes the current algorithm: Once the initial sample is fetched and tuplesperpage is determined, targpos is computed without any further feedback. If targpos points to a sparsely populated area (with wide tuples or with many dead tuples) tuples in this area are more likely to get into the sample than tuples in densely populated areas (with many small active tuples).

I think that cutting down the number of blocks to be looked at does not affect these problems.

Servus Manfred

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] [GENERAL] Large DB
On Fri, 02 Apr 2004 18:06:12 -0500, Tom Lane [EMAIL PROTECTED] wrote: You should not need to use the Vitter algorithm for the block-level selection, since you can know the number of blocks in the table in advance. You can just use the traditional method of choosing each block or not with probability (k/K), where k = number of sample blocks still needed, K = number of blocks from here to the end.

Sounds reasonable. I have to play around a bit more to get a feeling where the Vitter method gets more efficient. (A toy version of the whole two-stage scheme follows below.)

You'd run the Vitter algorithm separately to decide whether to keep or discard each live row you find in the blocks you read.

You mean once a block is sampled we inspect it in any case? This was not the way I had planned to do it, but I'll keep this idea in mind.

Question: if the table size is less than N blocks, are you going to read every block or try to reduce the number of blocks sampled?

Don't know yet.

people are setting the stats target to 100 which means a sample size of 30000 --- how do the page-access counts look in that case?

	rel size | page reads
	---------+-----------
	     300 |    300
	    3000 |   3000
	    5000 |   4999
	     10K |   9.9K
	     30K |  25.8K
	    300K |    85K
	      1M |   120K
	     10M |   190K
	    100M |   260K
	      1G |   330K

This is exactly the table I posted before (for sample size 3000) with every entry multiplied by 10. Well, not quite exactly, but the differences are far behind the decimal point. So for our purposes, for a given relation size the number of pages accessed is proportional to the sample size.

Servus Manfred

---(end of broadcast)---
TIP 6: Have you searched our list archives? http://archives.postgresql.org
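A standalone toy of the two rounds as discussed: stage one picks block numbers with the k/K method Tom describes, stage two samples rows from just those blocks. Vitter's algorithm is replaced here by the simpler reservoir scheme, Algorithm R from Knuth 3.4.2 (same resulting distribution, less efficient), and the constant row density is of course only for the toy:

	#include <stdio.h>
	#include <stdlib.h>

	#define NBLOCKS 1000
	#define ROWS_PER_BLOCK 100
	#define SAMPLE 30

	int main(void)
	{
		int		blocks[SAMPLE], sample[SAMPLE];
		int		need = SAMPLE, seen = 0;

		srandom(42);

		/* Stage 1: select block b with probability k/K, where k = blocks
		 * still needed and K = blocks from here to the end. */
		for (int b = 0; b < NBLOCKS && need > 0; b++)
			if (random() % (NBLOCKS - b) < need)	/* P = need/(NBLOCKS-b) */
			{
				blocks[SAMPLE - need] = b;
				need--;
			}

		/* Stage 2: reservoir-sample SAMPLE rows out of the chosen blocks
		 * (Algorithm R: keep row number t with probability SAMPLE/t). */
		for (int i = 0; i < SAMPLE; i++)
			for (int r = 0; r < ROWS_PER_BLOCK; r++)
			{
				int		rowid = blocks[i] * ROWS_PER_BLOCK + r;

				if (seen < SAMPLE)
					sample[seen] = rowid;			/* fill the reservoir */
				else
				{
					long	j = random() % (seen + 1);

					if (j < SAMPLE)
						sample[j] = rowid;			/* replace a victim */
				}
				seen++;
			}

		for (int i = 0; i < SAMPLE; i++)
			printf("%d%c", sample[i], (i % 10 == 9) ? '\n' : ' ');
		return 0;
	}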
Re: [HACKERS] [GENERAL] Large DB
On Fri, 02 Apr 2004 19:57:47 -0500, Tom Lane [EMAIL PROTECTED] wrote: If you like I can send you the Vitter paper off-list (I have a PDF of it). The comments in the code are not really intended to teach someone what it's good for ... Yes, please. [Would have sent this off-list. But I'm blacklisted.] Servus Manfred ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] [GENERAL] Large DB
[time to move this to -hackers]

On Fri, 02 Apr 2004 11:16:21 -0500, Tom Lane [EMAIL PROTECTED] wrote: Manfred Koizar [EMAIL PROTECTED] writes: The first step, however, (acquire_sample_rows() in analyze.c) has to read more rows than finally end up in the sample. It visits less than O(nblocks) pages but certainly more than O(1). A vague feeling tries to tell me that the number of page reads is somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which grow like O(ln(n)). Good guess. Vitter's paper says the expected time to sample n rows from a table of size N is O(n * (1 + log(N/n))).

Well, for what I tried to find out my wild guess seems to be wrong. I don't doubt that Vitter's formula is correct, but it assumes that access to any tuple has the same cost. This does not apply to our problem, however. With 100 tuples per page, we access the first sample_size tuples at a cost of 0.01 sequential page reads per tuple. Later we use less and less tuples per page which results in higher per-tuple cost. Near the end of a large relation we can expect to access only one tuple per page and more and more pages are skipped, so that prefetching doesn't help any more.

Playing around with some real numbers (for 100 tuples/page and a sample size of 3000) I got:

	rel size | page reads
	---------+-----------
	      30 |     30
	     300 |    300     expectation is something like 299.9995
	     500 |    499
	      1K |    990
	      3K |   2.6K
	     30K |     8K
	    100K |    12K
	      1M |    19K
	     10M |    26K
	    100M |    33K

This growth rate is steeper than O(log(nblocks)).

I have an idea how this could be done with O(1) page reads. What I have in mind is a kind of Double Vitter algorithm. Whatever we do to get our sample of rows, in the end the sampled rows come from no more than sample_size different blocks. So my idea is to first create a random sample of sample_size block numbers, and then to sample the rows out of this pool of blocks. I have to think harder though, what to do about those 400 pages that are not accessed when the sample size is 3000 ...

The hard part is getting a genuinely random sample when we don't know N in advance. We do however know the table size in blocks, so if you're willing to make assumptions about constant tuple density you could do something different. (But the tuple density assumption is exactly the weak spot of what we've got, so I'm unconvinced that would be a big step forward.)

Starting the scan at some random blocks should help against the common case of unusual distribution of dead tuples near the start of the relation. And I plan to factor information about dead tuple hits into an increasingly better estimation of dead/live tuple ratio.

Servus Manfred

---(end of broadcast)---
TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] PostgreSQL block size vs. LVM2 stripe width
On Mon, 29 Mar 2004 08:50:42 -0800 (PST), [EMAIL PROTECTED] wrote: In this case, I've only done 1 per each combination. I've found the results for this test to be reproducible.

Pardon?

	Linux-2.6.3, LVM2           Stripe Width
	BLCKSZ (going down)   16 KB   32 KB   64 KB  128 KB  256 KB  512 KB
	 2 KB                  2617    2656    2652    2664    2667    2642
	 4 KB                  4393    4486    4577    4557    4511    4448
	 8 KB                  4337    4423    4471    4576    4111    3642
	16 KB                  4412    4495    4532    4536    2985    2312
	32 KB                  3705    3784    3886    3925    2936    2362

Does this mean that you first ran all tests with 8 KB, then with 4, 2, 16 and 32 KB BLCKSZ? If so, I suspect that you are measuring the effects of something different. Yes, that's correct, but why do you suspect that?

Gut feelings, hard to put into words. Let me try: Nobody really knows what the optimal BLCKSZ is. Most probably it depends on the application, OS, hardware, and other factors. 8 KB is believed to be a good general purpose BLCKSZ. I wouldn't be surprised if 8 KB turns out to be suboptimal in one or the other case (or even in most cases). But if so, I would expect it to be either too small or too large. In your tests, however, there are three configurations where 8 KB is slower than both 4 KB and 16 KB. Absent any explanation for this interesting effect, it is easier to mistrust your numbers.

If you run your tests in the opposite order, on the same hardware, in the same freshly formatted partitions, and you get the same results, that would be an argument in favour of their accuracy. Maybe we find out that those 1.5% are just noise.

Servus Manfred

---(end of broadcast)---
TIP 8: explain analyze is your friend
Re: [HACKERS] PostgreSQL block size vs. LVM2 stripe width
Mark, how often did you run your tests? Are the results reproducible?

On Fri, 26 Mar 2004 14:00:01 -0800 (PST), [EMAIL PROTECTED] wrote:

	Linux-2.6.3, LVM2      Stripe Width (going across)
	PostgreSQL BLCKSZ
	(going down)          16 KB   32 KB   64 KB  128 KB  256 KB  512 KB
	 2 KB                  2617    2656    2652    2664    2667    2642
	 4 KB                  4393    4486    4577    4557    4511    4448
	 8 KB                  4337    4423    4471    4576    4111    3642
	16 KB                  4412    4495    4532    4536    2985    2312
	32 KB                  3705    3784    3886    3925    2936    2362

Unless someone can present at least an idea of a theory why a BLCKSZ of 8 KB is at a local minimum (1 or 2% below the neighbouring values) for stripe widths up to 64 KB I'm not sure whether we can trust these numbers.

Before I hit the send button, I did a quick check of the link you provided. The links in the table contain the following test numbers:

	        16 KB  32 KB  64 KB  128 KB  256 KB  512 KB
	 2 KB      72     71     70      69      66      65
	 4 KB      64     63     62      61      60      58
	 8 KB      54     53     52      51      50      49
	16 KB      79     78     77      76      75      74
	32 KB      86     85     84      83      82      80

Does this mean that you first ran all tests with 8 KB, then with 4, 2, 16 and 32 KB BLCKSZ? If so, I suspect that you are measuring the effects of something different.

Servus Manfred

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] [PATCHES] log_line_info
On Tue, 09 Mar 2004 10:02:14 -0500, Andrew Dunstan [EMAIL PROTECTED] wrote: After this is applied (fingers crossed) and everyone is happy, I will submit a patch to remove log_timestamp, log_pid and (if we are agreed on it) log_source_port. Is there agreement on removing these 3 config vars? Please don't. Declare them obsolete for 7.5 and remove them in a later release. Servus Manfred ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Summary of Changes since last release (7.4.1)
Simon,

On Thu, 19 Feb 2004 00:05:15 -, Simon Riggs [EMAIL PROTECTED] wrote: POSTGRESQL: Summary of Changes since last release (7.4.1) -- 18 Feb 2004

this is getting long over time. If you plan to post it once a week, flagging items that are new or have changed might help keeping track of what's going on.

	- Index performance improved when scanning highly non-unique indices;
	! Index performance improved when scanning highly non-unique indices;
or
	- (updated) Index performance improved ...

	- ANALYZE will now collect statistics on expressional indexes, and make
	+ ANALYZE will now collect statistics on expressional ...
or
	- (new) ANALYZE will now collect statistics on expressional ...

Servus Manfred

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] [GENERAL] Transaction Question
On Sat, 6 Dec 2003 10:43:18 -0500 (EST), Bruce Momjian [EMAIL PROTECTED] wrote: Where are we on nested transactions. Is it something we can get for 7.5? I honestly don't know. I've been working on other things lately and have not heard from Alvaro for some time. Servus Manfred ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] [HACKERS] Index creation takes for ever
On Mon, 01 Dec 2003 13:32:10 -0500, Tom Lane [EMAIL PROTECTED] wrote: Manfred Koizar [EMAIL PROTECTED] writes: comparetup_index() compares two IndexTuples. The structure IndexTupleData consists basically of not much more than an ItemPointer, and the patch is not much more than adding a comparison of two ItemPointers. So how does the patch introduce a new low level implementation dependency? Because it sorts on tuple position, which is certainly about as low level as you can get. The patch affects only the sort during index creation. Mapping key values to tuple positions is the sole purpose of an index. The notion that an index should not care for tuple positions looks a bit strange to me. More to the point, though, no evidence has been provided that this is a good idea. The test script I posted with the patch shows that the patch produces more efficient b-tree indices when there are lots of duplicates. Servus Manfred ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PATCHES] [HACKERS] Index creation takes for ever
On Mon, 1 Dec 2003 00:02:54 -0500 (EST), Bruce Momjian [EMAIL PROTECTED] wrote: Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: And if it doesn't help index creation speed, at least the resulting index has better correlation.

... which has been shown by the example in the original message:

Result without patch:

	   ctid
	----------
	 (153,14)
	 (306,23)
	 (305,80)
	 (152,91)
	 (76,68)
	 (38,34)
	 (153,34)
	 (305,50)
	 (9,62)
	 (305,40)
	(10 rows)

Result with patch:

	  ctid
	--------
	 (0,5)
	 (0,10)
	 (0,15)
	 (0,20)
	 (0,25)
	 (0,30)
	 (0,35)
	 (0,40)
	 (0,45)
	 (0,50)
	(10 rows)

And I think we all agree, that better index correlation leads to better performance.

I think this is a *very* dubious idea. It introduces a low-level implementation dependency into our sort behavior

The patch modifies the static function comparetup_index() in tuplesort.c. The comment above this function says

	/*
	 * Routines specialized for IndexTuple case
	 *
	 * NOTE: actually, these are specialized for the btree case; [...]
	 */

comparetup_index() compares two IndexTuples. The structure IndexTupleData consists basically of not much more than an ItemPointer, and the patch is not much more than adding a comparison of two ItemPointers. So how does the patch introduce a new low level implementation dependency?

Roger --- patch removed.

Thanks. Could we agree on only removing the first five and a half lines of the comment in the patch?

Servus Manfred

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
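The tie-breaker being talked about is small enough to show in spirit. A condensed standalone sketch -- simplified types, not the actual patch text -- of comparing two ItemPointers, block number first, then offset:

	#include <stdio.h>

	typedef struct
	{
		unsigned	blkno;			/* heap block number */
		unsigned short offnum;		/* line pointer within the block */
	} ItemPointerData;

	/* Used only when the index key values compare equal, so that equal
	 * keys come out in heap order and the new index is well correlated. */
	static int itemptr_compare(const ItemPointerData *a,
							   const ItemPointerData *b)
	{
		if (a->blkno != b->blkno)
			return (a->blkno < b->blkno) ? -1 : 1;
		if (a->offnum != b->offnum)
			return (a->offnum < b->offnum) ? -1 : 1;
		return 0;
	}

	int main(void)
	{
		ItemPointerData x = {153, 14}, y = {153, 34}, z = {76, 68};

		printf("%d %d\n", itemptr_compare(&x, &y), itemptr_compare(&x, &z));
		return 0;	/* prints: -1 1 */
	}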
Re: [HACKERS] logical column position
On Wed, 19 Nov 2003 19:07:23 +0100, Andreas Pflug [EMAIL PROTECTED] wrote: is there any DB system out there that allows to reshuffle the column ordering? Firebird: ALTER TABLE tname ALTER COLUMN cname POSITION 7; Servus Manfred ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: Defaults for GUC variables (was Re: [HACKERS] pg_ctl reports succes when start fails)
On Mon, 27 Oct 2003 10:22:32 -0500, Tom Lane [EMAIL PROTECTED] wrote: The low-tech solution to this would be to stop listing the default values as commented-out entries, but just make them ordinary uncommented entries. Please not. How should we ask a newbie seeking assistance on one of the support lists to show his non-default config settings? Servus Manfred ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] Cannot dump/restore text value \N
On Sun, 05 Oct 2003 19:12:50 -0400, Tom Lane [EMAIL PROTECTED] wrote: it seems we have to compare the null representation string to the pre-debackslashing input.

Here is a patch that does this and adds a few regression tests.

(This is probably fairly easy to make happen in CVS tip, but it might be pretty painful in 7.3.) There haven't been too much changes in this area between 7.3 and 7.4. A patch against 7.3.4 will follow ...

Servus Manfred

diff -ruN ../base/src/backend/commands/copy.c src/backend/commands/copy.c
--- ../base/src/backend/commands/copy.c	2003-08-28 15:52:34.000000000 +0200
+++ src/backend/commands/copy.c	2003-10-08 10:43:02.000000000 +0200
@@ -90,7 +90,8 @@
 		 char *delim, char *null_print);
 static void CopyFrom(Relation rel, List *attnumlist, bool binary, bool oids,
 		 char *delim, char *null_print);
-static char *CopyReadAttribute(const char *delim, CopyReadResult *result);
+static char *CopyReadAttribute(const char *delim, const char *nullst,
+			   CopyReadResult *result, bool *isnull);
 static Datum CopyReadBinaryAttribute(int column_no, FmgrInfo *flinfo,
 						Oid typelem, bool *isnull);
 static void CopyAttributeOut(char *string, char *delim);
@@ -1361,7 +1362,7 @@
 
 		if (file_has_oids)
 		{
-			string = CopyReadAttribute(delim, &result);
+			string = CopyReadAttribute(delim, null_print, &result, &isnull);
 
 			if (result == END_OF_FILE && *string == '\0')
 			{
@@ -1370,7 +1371,7 @@
 				break;
 			}
 
-			if (strcmp(string, null_print) == 0)
+			if (isnull)
 				ereport(ERROR,
 						(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
 						 errmsg("null OID in COPY data")));
@@ -1403,7 +1404,7 @@
 						 errmsg("missing data for column \"%s\"",
 								NameStr(attr[m]->attname))));
 
-			string = CopyReadAttribute(delim, &result);
+			string = CopyReadAttribute(delim, null_print, &result, &isnull);
 
 			if (result == END_OF_FILE && *string == '\0' &&
 				cur == attnumlist && !file_has_oids)
@@ -1413,7 +1414,7 @@
 				break;			/* out of per-attr loop */
 			}
 
-			if (strcmp(string, null_print) == 0)
+			if (isnull)
 			{
 				/* we read an SQL NULL, no need to do anything */
 			}
@@ -1442,7 +1443,7 @@
 		{
 			if (attnumlist == NIL && !file_has_oids)
 			{
-				string = CopyReadAttribute(delim, &result);
+				string = CopyReadAttribute(delim, null_print, &result, &isnull);
 				if (result == NORMAL_ATTR || *string != '\0')
 					ereport(ERROR,
 							(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
@@ -1650,14 +1651,13 @@
  * END_OF_FILE:	EOF indicator
  * In all cases, the string read up to the terminator is returned.
  *
- * Note: This function does not care about SQL NULL values -- it
- * is the caller's responsibility to check if the returned string
- * matches what the user specified for the SQL NULL value.
- *
  * delim is the column delimiter string.
+ * nullst says how NULL values are represented.
+ * *isnull is set true if a null attribute, else false.
  */
 static char *
-CopyReadAttribute(const char *delim, CopyReadResult *result)
+CopyReadAttribute(const char *delim, const char *nullst,
+				  CopyReadResult *result, bool *isnull)
 {
 	int			c;
 	int			delimc = (unsigned char) delim[0];
@@ -1665,6 +1665,17 @@
 	unsigned char s[2];
 	char	   *cvt;
 	int			j;
+	bool		matchnull = true;
+	int			matchlen = 0;
+
+#define CHECK_MATCH(c) \
+	do { \
+		if (matchnull) \
+			if (c == nullst[matchlen]) \
+				++matchlen; \
+			else \
+				matchnull = false; \
+	} while (0)
 
 	s[1] = 0;
 
@@ -1733,6 +1744,7 @@
 	}
[HACKERS] Cannot dump/restore text value \N
To be clear, this is not about \N as the default external representation for NULL, I'm talking about a string consisting of the two characters backslash and uppercase-N.

	CREATE TABLE nonu (tx text NOT NULL);
	INSERT INTO nonu VALUES ('\\N');
	SELECT * FROM nonu;
	COPY nonu TO stdout;

This correctly gives:

	\\N

Now try to feed that back into the table:

	DELETE FROM nonu;
	COPY nonu FROM stdin;
	\\N
	\.

	ERROR: copy: line 1, CopyFrom: Fail to add null value in not null attribute tx
	lost synchronization with server, resetting connection

This happened with 7.3.4, while trying to restore a 1.3 GB dump :-(

	ERROR: copy: line 809051, CopyFrom: Fail to add null value in not null attribute text
	FATAL: Socket command type 0 unknown

The bug is still in 7.4Beta3; didn't test with Beta 4 yet.

Servus Manfred

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Cannot dump/restore text value \N
I have solved my restore problem by editing (the relevant part of) the dump (:%s/^IN^I/^IN ^I/), a one-off solution <g>

Anyway, thanks for your investigation.

On Sun, 05 Oct 2003 19:12:50 -0400, Tom Lane [EMAIL PROTECTED] wrote: it seems we have to compare the null representation string to the pre-debackslashing input. Sounds reasonable, IMHO. I wonder whether it would break any existing apps though.

Couldn't be worse than silently converting valid non-null values to NULL ...

Servus Manfred

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
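The failure mode in miniature, as a standalone demo. debackslash() here is a toy that handles only doubled backslashes, just enough to show why the comparison has to happen on the raw input:

	#include <stdio.h>
	#include <string.h>

	/* Toy de-escaping: collapse "\\" to "\", pass everything else. */
	static void debackslash(const char *in, char *out)
	{
		while (*in)
		{
			if (in[0] == '\\' && in[1] == '\\')
			{
				*out++ = '\\';
				in += 2;
			}
			else
				*out++ = *in++;
		}
		*out = '\0';
	}

	int main(void)
	{
		const char *null_print = "\\N";		/* the \N null marker */
		const char *field = "\\\\N";		/* dumped form of the value \N */
		char		buf[16];

		debackslash(field, buf);
		/* wrong: after de-backslashing, \\N has collapsed to \N */
		printf("compare after:  %s\n",
			   strcmp(buf, null_print) == 0 ? "NULL" : "value");
		/* right: the raw input \\N is distinct from \N */
		printf("compare before: %s\n",
			   strcmp(field, null_print) == 0 ? "NULL" : "value");
		return 0;
	}

Comparing after de-backslashing classifies the legitimate two-character value as NULL, which is exactly what blew up on the NOT NULL column during the restore.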
Re: [HACKERS] ADD FOREIGN KEY
On Tue, 30 Sep 2003 08:00:07 -0400, Christopher Browne [EMAIL PROTECTED] wrote: I would be pretty game for a near-single-user-mode approach that would turn off some of the usual functionality that we knew we didn't need because the data source was an already-committed-and-FK-checked set of data.

Single user mode is a good idea, IMHO. But it should only make sure that there is not more than one user connected to the database (or to the postmaster). Everything else should depend on special GUC variables that are only settable in single user mode:

	db=> SET disable-fk-verification = true;
	ERROR:  disable-fk-verification can only be set in single user mode
	db=> SET SINGLE USER MODE ON;
	ERROR:  permission denied
	HINT:  Must be superuser or owner of database db.
	db=> \c - dbo
	You are now connected as new user dbo.
	db=> SET SINGLE USER MODE ON;
	ERROR:  cannot enter single user mode
	HINT:  You are not the only user connected to database db.

	-- after other users have logged out ...
	db=> SET SINGLE USER MODE ON;
	SET
	db=> SET disable-fk-verification = true;
	SET

Single user mode would also help in several cases where now a standalone backend is required ...

Servus Manfred

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] [PERFORM] osdl-dbt3 run results - puzzled by the execution
On Fri, 19 Sep 2003 11:35:35 -0700, Jenny Zhang [EMAIL PROTECTED] wrote: I posted more results as you requested:

Unfortunately they only confirm what I suspected earlier:

	2) - Index Scan using i_ps_suppkey on partsupp
	       (cost=0.00..323.16 rows=80 width=34)
	       (actual time=0.16..2.98 rows=80 loops=380)
	     ctr=108.44

the planner does not account for additional index scans hitting pages in the cache that have been brought in by preceding scans. This is a known problem

	PF1 = estimated number of page fetches for one loop ~ 320
	L   = estimated number of loops ~ 400
	P   = number of pages in relation ~ 21000

Cutting down the number of heap page fetches if PF1 * L > P and P < effective_cache_size seems like an obvious improvement, but I was not able to figure out where to make this change. Maybe it belongs into costsize.c near

	run_cost += outer_path_rows *
		(inner_path->total_cost - inner_path->startup_cost) * joininfactor;

in cost_nestloop() or it should be pushed into the index cost estimation functions. Hackers? (A back-of-the-envelope version of the idea follows below.)

For now you have to keep lying about effective_cache_size to make the planner overestimate merge joins to compensate for the planner's overestimation of nested loops. Sorry for having no better answer.

Servus Manfred

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
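What "cutting down the number of heap page fetches" could look like numerically -- my own back-of-the-envelope sketch, not planner code. It assumes every loop hits PF1 random pages out of P and that P fits in the cache, so the total is the expected number of *distinct* pages touched rather than PF1 * L:

	#include <stdio.h>
	#include <math.h>

	/* Expected distinct pages touched by L scans of PF1 random pages
	 * each, out of P total: P * (1 - (1 - PF1/P)^L). */
	static double repeated_scan_fetches(double PF1, double L, double P)
	{
		return P * (1.0 - pow(1.0 - PF1 / P, L));
	}

	int main(void)
	{
		/* numbers from the plan above: PF1 ~ 320, L ~ 400, P ~ 21000 */
		printf("naive:  %.0f page fetches\n", 320.0 * 400.0);
		printf("capped: %.0f page fetches\n",
			   repeated_scan_fetches(320, 400, 21000));
		return 0;
	}

(Compile with -lm.) The naive estimate charges 128000 fetches, while the distinct-page estimate saturates just below P = 21000, which is much closer to what the actual runtimes suggest.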
Re: [HACKERS] [SQL] plpgsql doesn't coerce boolean expressions to boolean
On Mon, 08 Sep 2003 11:40:32 -0400, Tom Lane [EMAIL PROTECTED] wrote: 4. Use the parser's coerce_to_boolean procedure, so that nonbooleans will be accepted in exactly the same cases where they'd be accepted in a boolean-requiring SQL construct (such as CASE). (By default, none are, so this isn't really different from #2. But people could create casts to boolean to override this behavior in a controlled fashion.)

I vote for 4. And - being fully aware of similar proposals having failed miserably - I propose to proceed as follows:

If the current behaviour is considered a bug, let i=4, else let i=5.

In 7.i: Create a new GUC variable plpgsql_strict_boolean (silly name, I know) in the VERSION/PLATFORM COMPATIBILITY section of postgresql.conf. Make the new behaviour dependent on this variable. Default plpgsql_strict_boolean to false. Place a warning into the release notes and maybe into the plpgsql documentation.

In 7.j, j>i: Change the default value of plpgsql_strict_boolean to true. Issue WARNINGs or NOTICEs as appropriate. Update documentation.

In 7.k, k>j: Remove old behaviour and GUC variable. Update documentation.

Servus Manfred

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Index creation takes for ever
On Mon, 8 Sep 2003 11:31:05 +0200, Zeugswetter Andreas SB SD [EMAIL PROTECTED] wrote: As Tom mentioned, we might not want to keep the tid's in order after the index is created because he wants the most recent tid's first, so the expired ones migrate to the end. But on average this argument only holds true for unique indexes, no ? Is there any code that stops the heap lookup after the visible tuple is found ? At least in an index with more rows per key you will fetch all heaps after the first one anyway to get at the next row. This is better done in heap order, no ? Yes, yes, yes, and yes. Seems we all agree on that; the patch has been queued for 7.5. Servus Manfred ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Index creation takes for ever
On Sun, 7 Sep 2003 11:43:42 -0400 (EDT), Bruce Momjian [EMAIL PROTECTED] wrote: I assume this completes this TODO: * Order duplicate index entries by tid for faster heap lookups I don't think so, because the patch does nothing to keep the sort order once the index is initially created. If you want to post it now, [...] I did already post it. It's only the last page or so of the original message. The link in that message points to a testing aid which is not part of what I would like to see committed. Servus Manfred ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] [GENERAL] Buglist
On Fri, 22 Aug 2003 16:27:53 +0530, Shridhar Daithankar [EMAIL PROTECTED] wrote: What does FSM does then? FSM = Free Space Map. VACUUM writes information into the FSM, INSERTs consult the FSM to find pages with free space for new tuples. I was under impression that FSM stores page pointers and vacuum work on FSM information only. In that case, it wouldn't have to waste time to find out which pages to clean. This has been discussed yesterday here and on -hackers (Decent VACUUM (was: Buglist)). We were talking about inventing a second data structure: RSM. Servus Manfred ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] [GENERAL] Buglist
On Fri, 22 Aug 2003 10:45:50 -0400, Tom Lane [EMAIL PROTECTED] wrote: One big question mark in my mind about these partial vacuum proposals is whether they'd still allow adequate FSM information to be maintained. If VACUUM isn't looking at most of the pages, there's no very good way to acquire info about where there's free space.

VACUUM has accurate information about the pages it just visited. Free space information for pages not touched by VACUUM is still in the FSM, unless free space on a page is too low to be interesting. VACUUM has to merge these two lists and throw away entries with little free space if running out of room. Thus we might end up with new almost full pages in the FSM while there are pages with more free space lying around that a previous VACUUM failed to register because there was more free space at that time.

Considering that
. FSM is lossy per definitionem
. we are targeting at relations with large passive areas
. decent VACUUM shall not replace lazy VACUUM
I see no problem here.

Future advice could be: VACUUM DECENT every hour, VACUUM daily, VACUUM FULL once a year, where the first two could be scheduled by autovacuum ...

Servus Manfred

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [GENERAL] Buglist
On Fri, 22 Aug 2003 12:18:02 -0400, Jan Wieck [EMAIL PROTECTED] wrote: Okay, my proposal would be to have a VACUUM mode where it tells the buffer manager to only return a page if it is already in memory But how can it know? Yes, we know exactly what we have in PG shared buffers. OTOH we keep telling people that they should configure moderate values for shared_buffers because the OS is better at caching. Your CACHEONLY VACUUM wouldn't catch those pages that are in the OS cache but not in the shared buffers, although they are retrievable at almost the same low cost. We should not try to avoid _any_ physical disk access. It's good enough to avoid useless reads. Hence my proposal for a reclaimable space list ... Servus Manfred ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] [SQL] SELECT IN Still Broken in 7.4b
On Thu, 21 Aug 2003 16:42:20 -0400, Tom Lane [EMAIL PROTECTED] wrote: The point is that given WHERE a = 1 OR b = 1 you could create a plan that first indexscans on a, then indexscans on b --- but you mustn't return any tuples in the second scan that you already returned in the first. IndexNext solves this by evaluating the prior-scan index conditions to see if they are true. WHERE a = 1 OR b = 2 and WHERE a = 1 OR a = 2 are totally different things. In the latter case you don't have to check prior conditions because the conditions are mutually exclusive. Is this reasonably easy to find out at plan creation time? Yes, I changed your example to make my point clear, because WHERE a = 1 OR a = 1 has its own set of problems. Servus Manfred ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [GENERAL] [HACKERS] [pgsql-advocacy] Need concrete Why Postgres not MySQL bullet list
On Thu, 21 Aug 2003 15:05:52 +0200, I wrote: Just wondering, what other databases has transactable DDLs? Firebird. Stop! I withdraw that statement. I must have mis-read some feature list :-( Tests with InterBase 6 showed that you can change metadata within a transaction, but when you ROLLBACK, metadata changes persist. Servus Manfred ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
[HACKERS] Decent VACUUM (was: Buglist)
On Wed, 20 Aug 2003 15:39:26 -0400, Tom Lane [EMAIL PROTECTED] wrote: But I think the real point here is that there's no reason to think that doing tuple deletion on-the-fly in foreground transactions is superior to doing it in background with a vacuum process. You're taking what should be noncritical maintenance work and moving it into the critical paths of your foreground applications. Not only that, but you're probably doing more total work per tuple --- VACUUM batches its work in more ways than just the index cleanup aspect, IIRC. Yes, I sign that, 100%. That doesn't mean that we couldn't do any better. AFAICS Vivek's problem is that it is hard enough to hold a good part of the working set in the cache, and still his disks are saturated. Now a VACUUM not only adds one more process to disk I/O contention, but also makes sure that the working set pages are *not* in memory which leads to higher I/O rates after the VACUUM. I can imagine several use cases where only a small part of a large relation is subject to DELETEs/UPDATEs. Maybe Vivek's application falls into this category. If we teach VACUUM to not read pages that don't contain any dead tuples, this could be a significant improvement. I'm envisioning a data structure (reclaimable space map, RSM) similar to the FSM. Whenever a backend encounters a dead tuple it inserts a reference to its page into the RSM. Dead tuple detection is no problem, it is already implemented for marking dead index tuples. VACUUM, when run in a new mode (decent), only checks pages that are listed in the RSM. To get full advantage of not doing unnecessary page reads, we'll also need to redesign the index bulk delete routines. The autovaccum daemon will watch the RSM and when the number of entries is above a configurable threshold, it will start a VACUUM DECENT ... Servus Manfred ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Decent VACUUM (was: Buglist)
[ still brainstorming ... ] On Thu, 21 Aug 2003 17:16:50 -0400, Tom Lane [EMAIL PROTECTED] wrote: Whenever a backend encounters a dead tuple it inserts a reference to its page into the RSM. This assumes that backends will visit dead tuples with significant probability. I doubt that assumption is tenable; Good point. What about: Whenever a backend *deletes* a tuple it inserts a reference to its page into the RSM? Then an entry in the RSM doesn't necessarily mean that the referenced page has reclaimable space, but it would still be valueable information. Servus Manfred ---(end of broadcast)--- TIP 8: explain analyze is your friend
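What "whenever a backend deletes a tuple it inserts a reference to its page into the RSM" might look like at its most primitive -- a standalone toy with a fixed-size, deliberately lossy table (all names invented; a real RSM would live in shared memory and need locking and a smarter collision policy):

	#include <stdio.h>
	#include <stdbool.h>

	#define RSM_SIZE 1024

	static unsigned rsm_block[RSM_SIZE];
	static bool		rsm_used[RSM_SIZE];

	/* Called at DELETE time: remember that this block will have
	 * reclaimable space once the deleting transaction commits. */
	static void rsm_record(unsigned blkno)
	{
		unsigned	slot = blkno % RSM_SIZE;

		/* a collision overwrites the old entry -- lossy on purpose,
		 * like the FSM; a full-scan VACUUM still catches what we lose */
		rsm_block[slot] = blkno;
		rsm_used[slot] = true;
	}

	int main(void)
	{
		rsm_record(42);		/* a backend deleted a tuple on block 42 */
		rsm_record(42);		/* duplicates collapse into one entry */
		rsm_record(7);

		for (int i = 0; i < RSM_SIZE; i++)
			if (rsm_used[i])
				printf("VACUUM DECENT would visit block %u\n", rsm_block[i]);
		return 0;
	}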
Re: [HACKERS] Decent VACUUM (was: Buglist)
On Thu, 21 Aug 2003 17:56:02 -0400, Tom Lane [EMAIL PROTECTED] wrote: Conceivably it could be a win, though, if you could do frequent vacuum decents and only a full-scan vacuum once in awhile (once a day maybe). That's what I had in mind; similar to the current situation where you can avoid expensive VACUUM FULL by doing lazy VACUUM frequently enough. Servus Manfred ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] [pgsql-advocacy] Need concrete Why Postgres not MySQL bullet list
On Thu, 21 Aug 2003 14:45:03 +0530, Shridhar Daithankar [EMAIL PROTECTED] wrote: Just wondering, what other databases has transactable DDLs? Firebird. Servus Manfred ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Correlation in cost_index()
On Fri, 8 Aug 2003 16:53:48 -0700, Sean Chittenden [EMAIL PROTECTED] wrote: the problem with your patch was that it picked an index less often than the current code when there was low correlation.

Maybe bit rot? What version did you apply the patch against? Here is a new version for Postgres 7.3.4: http://www.pivot.at/pg/16d-correlation_734.diff

The only difference to the previous version is that

	for (nKeys = 1; index->indexkeys[nKeys] != 0; nKeys++)

is now replaced with

	for (nKeys = 1; nKeys < index->ncolumns; nKeys++)

Don't know whether the former just worked by chance when I tested the 7.3.2 version :-(. Tests with 7.4Beta1 showed that index correlation comes out too low with the old loop termination condition. Anyway, the latter version seems more robust.

In my tests the new index_cost_algorithms (1, 2, 3, 4) gave consistently lower cost estimates than the old method (set index_cost_algorithm = 0), except of course for correlations of 1.0 or 0.0, because in these border cases you get always min_IO_cost or max_IO_cost, respectively. Care to re-evaluate?

BTW, there's a version of the patch for 7.4Beta1 (http://www.pivot.at/pg/16d-correlation_74b1.diff) which also applies cleanly against cvs snapshot from 2003-08-17.

Servus Manfred

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] Again on index correlation
Recent discussion of index cost estimation ([HACKERS] Correlation in cost_index() ca. two weeks ago) has lead to the conclusion that the column correlation calculated by VACUUM does not always help when we want to find out how well index access order corresponds to physical tuple position. Most problems arise when dealing with multi-column or functional indices. Another case is when there are many equal values in the first index column.

What we really want is not column correlation, but index correlation which turns out to be surprisingly easy to calculate. There's no need to look up comparison functions and handle different datatypes; there's no need to look at the key values at all: Just read the index items in index order, sort them by heap page, and compute the Spearman Rho function. This even works for non-btree indices. Try it yourself:

. download http://www.pivot.at/pg/contrib_icorrel.tgz
. unpack
. make
. make install
. psql
. \i path/to/share/postgresql/contrib/icorrel.sql
. SELECT icorrel('myindex'::regclass);

This should work with 7.3.x and 7.4Beta.

How could the planner make use of index correlation? Here (http://www.pivot.at/pg/22-IndexCorrel_74b1.diff) is an experimental patch that introduces a new system table pg_indexstat. If the GUC variable enable_indexstat is set to on, genericcostestimate tries to get index correlation from pg_indexstat; if there is none, btcostestimate falls back to the old method.

Compatibility: Although there is a new catalog table, initdb is not required. A patched postmaster still works with an old cluster, even without creating pg_catalog.pg_indexstat. Before you can make use of pg_indexstat you have to create it via a standalone backend:

	$ bin/postgres -D data -O template1
	backend> CREATE TABLE pg_catalog.pg_indexstat( \
	             istindex oid NOT NULL, \
	             istcorrel float4 NOT NULL) WITHOUT OIDS;
	backend> CREATE UNIQUE INDEX pg_indexstat_index_index \
	             ON pg_indexstat(istindex);

Repeat this for each database. Usage example (using Sean's data):

	psql testdb
	testdb=# \d rucc
	             Table public.rucc
	     Column     |           Type           | Modifiers
	----------------+--------------------------+-----------
	 user_id        | integer                  | not null
	 category_id    | integer                  | not null
	 img_bytes      | bigint                   | not null
	 img_hits       | integer                  | not null
	 html_bytes     | bigint                   | not null
	 html_hits      | integer                  | not null
	 unknown_bytes  | bigint                   | not null
	 unknown_hits   | integer                  | not null
	 utc_date       | timestamp with time zone | not null
	 time_interval  | interval                 | not null
	Indexes: rucc_htmlbytes_idx btree (html_bytes),
	         rucc_id_date_idx btree (user_id, utc_date)

	testdb=# SELECT 'rucc_id_date_idx'::regclass::oid;
	   oid
	---------
	 1281422
	(1 row)

	testdb=# set enable_seqscan = off;
	SET
	testdb=# set enable_indexstat = on;
	SET
	testdb=# INSERT INTO pg_indexstat VALUES (1281422, 0.0001);
	INSERT 0 1
	testdb=# EXPLAIN SELECT * FROM rucc WHERE user_id < 1000;
	                                      QUERY PLAN
	--------------------------------------------------------------------------------------
	 Index Scan using rucc_id_date_idx on rucc  (cost=0.00..634342.50 rows=139802 width=64)
	   Index Cond: (user_id < 1000)
	(2 rows)

	testdb=# UPDATE pg_indexstat SET istcorrel=0.1 WHERE istindex=1281422;
	testdb=# EXPLAIN SELECT ...

	 istcorrel |   cost
	-----------+-----------
	    0.0001 | 634342.50
	       0.1 | 514678.48
	       0.2 | 407497.85
	       0.5 | 161612.89
	       0.9 |  10299.07
	       1.0 |   3994.32

Actually the table is clustered on rucc_id_date_idx, so index correlation is 1.0, but there is no way to know that, when we only have the column correlations for user_id (1.0) and utc_date (0.59). The current code guesses the index correlation to be 0.5 which gives a cost estimate that is far too high.
For comparison: seq scan estimated cost ~ 21000, actual ~ 11500; index scan actual ~ 4000.

If you are going to test this patch, please be aware that I created it on top of another one of my experimental patches (http://www.pivot.at/pg/16d-correlation_74b1.diff). If you don't want to apply this one, one hunk of the IndexCorrel patch will fail in selfuncs.c. Should be no problem to apply it manually. And those who are still experimenting with 7.3.4 performance can use http://www.pivot.at/pg/16d-correlation_734.diff and http://www.pivot.at/pg/22-IndexCorrel_74b1.diff.

ToDo:
.. Move get_index_correlation from contrib into the backend.
.. ANALYZE table computes index correlation for all indexes.
.. New command ANALYZE index?
.. System cache invalidation? syscache.c: reloidattr = Anum_pg_indexstat_istindex?
.. Dependency?
.. Remove GUC variable enable_indexstat
.. Remove old method in
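For the curious, the Spearman Rho computation mentioned above is nothing exotic; a standalone toy version follows (this is my re-derivation, not the contrib code, and it ignores ties, i.e. equal block numbers, which a real implementation has to handle):

	#include <stdio.h>
	#include <stdlib.h>

	static const unsigned *blocks_for_cmp;

	static int cmp_by_block(const void *a, const void *b)
	{
		unsigned	x = blocks_for_cmp[*(const int *) a];
		unsigned	y = blocks_for_cmp[*(const int *) b];

		return (x > y) - (x < y);
	}

	/* heap_blocks[i] = heap block of the i-th item in *index* order;
	 * rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), d = rank difference */
	static double spearman_rho(const unsigned *heap_blocks, int n)
	{
		int	   *idx = malloc(n * sizeof(int));
		double	d2 = 0.0;

		for (int i = 0; i < n; i++)
			idx[i] = i;
		blocks_for_cmp = heap_blocks;
		qsort(idx, n, sizeof(int), cmp_by_block);	/* heap-order ranks */

		for (int rank = 0; rank < n; rank++)
		{
			double	d = idx[rank] - rank;	/* index pos minus heap rank */

			d2 += d * d;
		}
		free(idx);
		return 1.0 - 6.0 * d2 / ((double) n * ((double) n * n - 1));
	}

	int main(void)
	{
		unsigned	clustered[] = {10, 20, 30, 40, 50, 60, 70, 80};
		unsigned	shuffled[]  = {30, 10, 80, 40, 20, 70, 50, 60};

		printf("clustered: %.3f\n", spearman_rho(clustered, 8));	/* 1.000 */
		printf("shuffled:  %.3f\n", spearman_rho(shuffled, 8));	/* 0.429 */
		return 0;
	}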
Re: [HACKERS] Correlation in cost_index()
On Fri, 8 Aug 2003 11:06:56 -0700, Sean Chittenden [EMAIL PROTECTED] wrote: [...] it'd seem as though an avg depth of nodes in index * tuples_fetched * (random_io_cost * indexCorrelation) would be closer than where we are now...

Index depth does not belong here because we walk down the index only once per index scan, not once per tuple. It might be part of the startup cost. The rest of your formula doesn't seem right, too, because you get higher costs for better correlation. Did you mean random_io_cost * (1 - abs(indexCorrelation))?

FWIW, for small effective_cache_size max_IO_cost is almost equal to tuples_fetched * random_page_cost. So your formula (with the corrections presumed above) boils down to ignoring effective_cache_size and linear interpolation between 0 and max_IO_cost.

It's very possible that cost_index() is wrong, but it seems as though after some testing as if PostgreSQL _overly_ _favors_ the use of indexes:

Was this an unpatched backend? What were the values of effective_cache_size and random_page_cost?

	# SET enable_seqscan = true; SET enable_indexscan = true;
	SET
	SET
	# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc
	  WHERE utc_date < '2002-10-01'::TIMESTAMP WITH TIME ZONE;
	INFO:  cost_seqscan: run_cost: 21472.687500 startup_cost: 0.00
	INFO:  cost_index: run_cost: 21154.308116 startup_cost: 0.00
	       indexCorrelation: 0.999729
	                                  QUERY PLAN
	-------------------------------------------------------------------------------
	 Index Scan using report_user_cat_count_utc_date_id_idx on
	     report_user_cat_count rucc
	     (cost=0.00..21154.31 rows=705954 width=64)
	     (actual time=91.36..6625.79 rows=704840 loops=1)
	   Index Cond: (utc_date < '2002-10-01 00:00:00-07'::timestamp with time zone)
	 Total runtime: 11292.68 msec
	(3 rows)

actual time=91.36..6625.79 but Total runtime: 11292.68 msec! Where did those 4.7 seconds go?

	# SET enable_seqscan = true; SET enable_indexscan = false;
	SET
	SET
	# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc
	  WHERE utc_date < '2002-10-01'::TIMESTAMP WITH TIME ZONE;
	INFO:  cost_seqscan: run_cost: 21472.687500 startup_cost: 0.00
	INFO:  cost_index: run_cost: 21154.308116 startup_cost: 1.00
	       indexCorrelation: 0.999729
	                                  QUERY PLAN
	-------------------------------------------------------------------------------
	 Seq Scan on report_user_cat_count rucc
	     (cost=0.00..21472.69 rows=705954 width=64)
	     (actual time=1091.45..7441.19 rows=704840 loops=1)
	   Filter: (utc_date < '2002-10-01 00:00:00-07'::timestamp with time zone)
	 Total runtime: 10506.44 msec
	(3 rows)

Same here: actual time=1091.45..7441.19 but Total runtime: 10506.44 msec - more than 3 seconds lost. When we ignore total runtime and look at actual time we get

	             seq     idx
	estimated  21473   21154
	actual      7441    6626

This doesn't look too bad, IMHO. BTW, I believe that with your example (single-column index, almost perfect correlation, index cond selects almost all tuples) all interpolation methods give an index cost estimation that is very close to seq scan cost, and the actual runtimes show that this is correct.

Which I find surprising and humorous given the popular belief is, mine included, contrary to those results.

How many tuples are in report_user_cat_count? What are the stats for report_user_cat_count.utc_date?

I can say with pretty high confidence that the patch to use a geometric mean isn't correct after having done real world testing as its break even point is vastly incorrect and only uses an index when there are less than 9,000 rows to fetch, a far cry from the 490K break even I found while testing.

Could you elaborate, please. The intention of my patch was to favour index scans more than the current implementation. If it does not, you have found a bug in my patch.
Did you test the other interpolation methods?

What I did find interesting, however, was that it does work better at determining the use of multi-column indexes,

Yes, because it computes the correlation for a two-column index as

	correlation_of_first_index_column * 0.95

instead of

	correlation_of_first_index_column / 2

but I think that's because the geometric mean pessimizes the value of indexCorrelation, which gets pretty skewed when using a multi-column index.

I don't understand this.

	# CREATE INDEX report_user_cat_count_utc_date_user_id_idx
	    ON report_user_cat_count (user_id, utc_date);
	# CLUSTER report_user_cat_count_utc_date_user_id_idx
	    ON report_user_cat_count;
	# ANALYZE report_user_cat_count;
	# SET enable_seqscan = true; SET enable_indexscan = true;
	SET
	SET
	# EXPLAIN ANALYZE SELECT * FROM
Re: [HACKERS] Correlation in cost_index()
On Fri, 08 Aug 2003 18:25:41 -0400, Tom Lane [EMAIL PROTECTED] wrote: Two examples: [...]

One more example:

	 X | Y
	---+---
	 A | A
	 a | B
	 A | C
	 b | A
	 B | B
	 b | C
	 C | A
	 c | B
	 C | C

Correlation for column X is something less than 1.0, OTOH correlation for an index on upper(X) is 1.0.

I don't really see a way to do this without actually examining the multi-column ordering relationship during ANALYZE.

So did we reach consensus to add a TODO item?

* Compute index correlation on CREATE INDEX and ANALYZE, use it for index scan cost estimation

Servus Manfred

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Correlation in cost_index()
On Thu, 7 Aug 2003 13:44:19 -0700, Sean Chittenden [EMAIL PROTECTED] wrote: The indexCorrelation^2 algorithm was only a quick hack with no theory behind it :-(. I've wanted to find some better method to put in there, but have not had any time to research the problem. Could we quick hack it to a geometric mean instead since a mean seemed to yield better results than indexCorrelation^2? Linear interpolation on (1-indexCorrelation)^2 (algorithm 3 in http://members.aon.at/pivot/pg/16-correlation-732.diff) is almost as good as geometric interpolation (algorithm 4 in the patch, proposal 3 in this thread), and its computation is much cheaper because it does not call exp() and log(). Download http://members.aon.at/pivot/pg/cost_index.sxc and play around with your own numbers to get a feeling. (1-indexCorrelation)^2 suffers from the same lack of theory behind it as indexCorrelation^2. But the results look much more plausible. Well, at least to me ;-) Servus Manfred ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
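To make the candidate curves easy to eyeball side by side, here is a small standalone program. The formulas are my reconstruction from this thread (the case numbers mimic the patch's index_cost_algorithm settings), so treat them as illustration rather than the patch's exact code:

	#include <stdio.h>
	#include <math.h>

	/* Interpolate between min_IO and max_IO as a function of the
	 * index correlation c; w is the weight given to min_IO. */
	static double interp(int algorithm, double c, double min_IO, double max_IO)
	{
		double	a = fabs(c);
		double	w;

		switch (algorithm)
		{
			case 0:					/* current code: corr^2 */
				w = a * a;
				break;
			case 3:					/* linear on (1 - corr)^2 */
				w = 1.0 - (1.0 - a) * (1.0 - a);
				break;
			case 4:					/* geometric interpolation */
				return exp(a * log(min_IO) + (1.0 - a) * log(max_IO));
			default:				/* plain linear */
				w = a;
				break;
		}
		return w * min_IO + (1.0 - w) * max_IO;
	}

	int main(void)
	{
		for (double c = 0.0; c <= 1.001; c += 0.25)
			printf("corr=%.2f  corr^2=%8.0f  (1-c)^2=%8.0f  geom=%8.0f\n",
				   c,
				   interp(0, c, 100, 100000),
				   interp(3, c, 100, 100000),
				   interp(4, c, 100, 100000));
		return 0;
	}

(Compile with -lm.) At mid-range correlations the corr^2 rule stays close to max_IO_cost, while the other two drop off much faster -- which is the "more plausible results" argument made above.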
Re: [HACKERS] Correlation in cost_index()
On Fri, 8 Aug 2003 16:53:48 -0700, Sean Chittenden [EMAIL PROTECTED] wrote:
> # SHOW effective_cache_size ;
>  effective_cache_size
> ----------------------
>  4456
> (1 row)

Only 35 MB (4456 pages of 8 kB)?  Are you testing on such a small machine?

> The stats are attached bzip2 compressed.

Nothing was attached.  Did you upload it to your web site?

> I can say with pretty high confidence that the patch to use a geometric
> mean isn't correct ... the problem with your patch was that it picked
> an index less often than the current code when there was low
> correlation.

In cost_index.sxc I get lower estimates for *all* proposed new interpolation methods.  Either my C code doesn't implement the same calculations as the spreadsheet, or ...

> I manually applied bits of it [...]

... could this explain the unexpected behaviour?

I'm currently downloading your dump.  Can you post the query you mentioned above?

Servus
 Manfred
Re: [HACKERS] Proof-of-concept for initdb-time shared_buffers selection
On Fri, 04 Jul 2003 15:29:37 -0400, Tom Lane [EMAIL PROTECTED] wrote:
> The attached patch shows how initdb can dynamically determine
> reasonable shared_buffers and max_connections settings that will work
> on the current machine.

Can't this be done on postmaster startup?  I think of two GUC variables where there is only one today: min_shared_buffers and max_shared_buffers.  If allocation for the max_ values fails, the numbers are decreased in a loop of, say, 10 steps until allocation succeeds, or even fails at the min_ values.  The actual values chosen are reported as a NOTICE and can be inspected as readonly GUC variables.

This would make life easier for the folks trying to come up with default .conf files, e.g.

	min_shared_buffers = 64
	max_shared_buffers = 2000

could cover a fairly large range of low-level to mid-level machines.  A paranoid dba, who doesn't want the postmaster to do unpredictable things on startup, can always set min_xxx == max_xxx to get the current behaviour.

Servus
 Manfred
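A minimal sketch of the proposed startup loop, under the stated assumptions: the GUC names min_shared_buffers/max_shared_buffers come from the proposal above, and try_create_shared_memory() is a hypothetical stand-in for the real shared-memory allocation call, not an actual PostgreSQL function.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the actual shared-memory allocation. */
extern bool try_create_shared_memory(int nbuffers);

/*
 * Try max_shared_buffers first, then back off toward
 * min_shared_buffers in 10 steps, taking the first size the OS
 * grants.  Returns the number of buffers allocated, or -1 if even
 * the minimum fails.
 */
static int
choose_shared_buffers(int min_shared_buffers, int max_shared_buffers)
{
	for (int i = 0; i <= 10; i++)
	{
		int			n = max_shared_buffers
			- (max_shared_buffers - min_shared_buffers) * i / 10;

		if (try_create_shared_memory(n))
		{
			if (n < max_shared_buffers)
				fprintf(stderr,
						"NOTICE: shared_buffers reduced to %d\n", n);
			return n;	/* would also be exposed as a readonly GUC */
		}
	}
	return -1;			/* even min_shared_buffers failed */
}

Setting min_xxx == max_xxx makes the loop degenerate to a single attempt, which is exactly the "current behaviour" escape hatch mentioned above.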
[HACKERS] 'out of tree' compile (was: Two weeks to feature freeze)
On Thu, 26 Jun 2003 22:55:45 -0400, Tom Lane [EMAIL PROTECTED] wrote:
>> I've not tried, but if PostgreSQL can do an 'out of tree' compile it
>> could make it much easier.
>
> Yes it can, following the usual procedure for autoconfiscated trees:
> just invoke configure from an empty directory, eg
>
> 	mkdir build
> 	cd build
> 	../pgsql/configure

I do this all the time.  Works pretty well.  The only minor annoyance is that some files are created in the source tree:

	./src/backend/bootstrap/bootparse.c
	./src/backend/bootstrap/bootscanner.c
	./src/backend/bootstrap/bootstrap_tokens.h
	./src/backend/parser/gram.c
	./src/backend/parser/parse.h
	./src/backend/parser/scan.c
	./src/backend/utils/misc/guc-file.c
	./src/bin/psql/sql_help.h
	./src/interfaces/ecpg/preproc/pgc.c
	./src/interfaces/ecpg/preproc/preproc.c
	./src/interfaces/ecpg/preproc/preproc.h
	./src/pl/plpgsql/src/pl.tab.h
	./src/pl/plpgsql/src/pl_gram.c
	./src/pl/plpgsql/src/pl_scan.c

This didn't itch enough to make me scratch ;-)

Servus
 Manfred
Re: [HACKERS] MARKED_FOR_UPDATE && XMAX_COMMITTED == XMAX_INVALID ?
On Wed, 11 Jun 2003 09:05:33 -0400, Tom Lane [EMAIL PROTECTED] wrote:
>> If a transaction marks a tuple for update and later commits without
>> actually having updated the tuple, [...] can we simply set the
>> HEAP_XMAX_INVALID hint bit of the tuple?
>
> AFAICS this is a reasonable thing to do.

Thanks for the confirmation.  Here's a patch which also contains some more noncritical changes to tqual.c:

. make code more readable by introducing local variables for xvac
. no longer two separate branches for aborted and crashed; the actions
  were the same in all cases

Servus
 Manfred

diff -rcN ../base/src/backend/utils/time/tqual.c src/backend/utils/time/tqual.c
*** ../base/src/backend/utils/time/tqual.c	Thu May  8 21:45:55 2003
--- src/backend/utils/time/tqual.c	Thu Jun 12 12:10:31 2003
***************
*** 76,86 ****
  	if (tuple->t_infomask & HEAP_MOVED_OFF)
  	{
! 		if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXvac(tuple)))
  			return false;
! 		if (!TransactionIdIsInProgress(HeapTupleHeaderGetXvac(tuple)))
  		{
! 			if (TransactionIdDidCommit(HeapTupleHeaderGetXvac(tuple)))
  			{
  				tuple->t_infomask |= HEAP_XMIN_INVALID;
  				return false;
--- 76,87 ----
  	if (tuple->t_infomask & HEAP_MOVED_OFF)
  	{
! 		TransactionId	xvac = HeapTupleHeaderGetXvac(tuple);
! 		if (TransactionIdIsCurrentTransactionId(xvac))
  			return false;
! 		if (!TransactionIdIsInProgress(xvac))
  		{
! 			if (TransactionIdDidCommit(xvac))
  			{
  				tuple->t_infomask |= HEAP_XMIN_INVALID;
  				return false;
***************
*** 90,100 ****
  	}
  	else if (tuple->t_infomask & HEAP_MOVED_IN)
  	{
! 		if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXvac(tuple)))
  		{
! 			if (TransactionIdIsInProgress(HeapTupleHeaderGetXvac(tuple)))
  				return false;
! 			if (TransactionIdDidCommit(HeapTupleHeaderGetXvac(tuple)))
  				tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  			else
  			{
--- 91,102 ----
  	}
  	else if (tuple->t_infomask & HEAP_MOVED_IN)
  	{
! 		TransactionId	xvac = HeapTupleHeaderGetXvac(tuple);
! 		if (!TransactionIdIsCurrentTransactionId(xvac))
  		{
! 			if (TransactionIdIsInProgress(xvac))
  				return false;
! 			if (TransactionIdDidCommit(xvac))
  				tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  			else
  			{
***************
*** 152,162 ****
  	}
  
  	/* xmax transaction committed */
- 	tuple->t_infomask |= HEAP_XMAX_COMMITTED;
  
  	if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
  		return true;
  
  	return false;
  }
--- 154,167 ----
  	}
  
  	/* xmax transaction committed */
  
  	if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
+ 	{
+ 		tuple->t_infomask |= HEAP_XMAX_INVALID;
  		return true;
+ 	}
  
+ 	tuple->t_infomask |= HEAP_XMAX_COMMITTED;
  	return false;
  }
***************
*** 212,222 ****
  	if (tuple->t_infomask & HEAP_MOVED_OFF)
  	{
! 		if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXvac(tuple)))
  			return false;
! 		if (!TransactionIdIsInProgress(HeapTupleHeaderGetXvac(tuple)))
  		{
! 			if (TransactionIdDidCommit(HeapTupleHeaderGetXvac(tuple)))
  			{
  				tuple->t_infomask |= HEAP_XMIN_INVALID;
  				return false;
--- 217,228 ----
  	if (tuple->t_infomask & HEAP_MOVED_OFF)
  	{
! 		TransactionId	xvac = HeapTupleHeaderGetXvac(tuple);
! 		if (TransactionIdIsCurrentTransactionId(xvac))
  			return false;
! 		if (!TransactionIdIsInProgress(xvac))
  		{
! 			if (TransactionIdDidCommit(xvac))
  			{
[HACKERS] MARKED_FOR_UPDATE && XMAX_COMMITTED == XMAX_INVALID ?
If a transaction marks a tuple for update and later commits without actually having updated the tuple, do we still need the information that the tuple has once been reserved for an update, or can we simply set the HEAP_XMAX_INVALID hint bit of the tuple?

In other words, is this snippet from a patch I'm working on a valid modification to HeapTupleSatisfiesVacuum in tqual.c?

  		{
  			if (TransactionIdIsInProgress(HeapTupleHeaderGetXmax(tuple)))
  				return HEAPTUPLE_LIVE;
- 			if (TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple)))
- 				tuple->t_infomask |= HEAP_XMAX_COMMITTED;
- 			else
- 				/* it's either aborted or crashed */
- 				tuple->t_infomask |= HEAP_XMAX_INVALID;
+ 			/*
+ 			 * We don't really care whether xmax did commit, abort or
+ 			 * crash.  We know that xmax did mark the tuple for update,
+ 			 * but it did not and will never actually update it.
+ 			 */
+ 			tuple->t_infomask |= HEAP_XMAX_INVALID;
  		}
  		return HEAPTUPLE_LIVE;

There are a few more places in tqual.c which could be simplified like that.

Servus
 Manfred
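To show the effect of this simplification outside the diff context, here is a small self-contained sketch of the hint-bit decision.  The struct and the flag values are simplified stand-ins for illustration, not the real HeapTupleHeader layout or infomask bits.

#include <stdint.h>

/* Simplified stand-ins for the real infomask flag bits. */
#define HEAP_XMAX_COMMITTED		0x0100
#define HEAP_XMAX_INVALID		0x0200
#define HEAP_MARKED_FOR_UPDATE	0x0400

typedef struct
{
	uint16_t	t_infomask;
} FakeTupleHeader;

/*
 * After we learn that xmax committed: if the tuple was only marked
 * for update (never actually updated), the committed locker leaves
 * no visible trace, so we may hint xmax as invalid instead of
 * committed.  This is the idea behind the change above.
 */
static void
hint_committed_xmax(FakeTupleHeader *tuple)
{
	if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
		tuple->t_infomask |= HEAP_XMAX_INVALID;
	else
		tuple->t_infomask |= HEAP_XMAX_COMMITTED;
}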
[HACKERS] Separate build directory
When I configure and make in a separate build directory tree, I get

	createdb.c:15: dumputils.h: No such file or directory

and

	print.c:9: common.h: No such file or directory

in src/bin/scripts.  I don't know whether the attached change to the Makefile is the preferred way to fix this; at least it works for me.

BTW, make creates these files in the source tree:

	./src/backend/bootstrap/bootparse.c
	./src/backend/bootstrap/bootscanner.c
	./src/backend/bootstrap/bootstrap_tokens.h
	./src/backend/parser/gram.c
	./src/backend/parser/parse.h
	./src/backend/parser/scan.c
	./src/backend/utils/misc/guc-file.c
	./src/bin/psql/sql_help.h
	./src/interfaces/ecpg/preproc/pgc.c
	./src/interfaces/ecpg/preproc/preproc.c
	./src/interfaces/ecpg/preproc/preproc.h
	./src/pl/plpgsql/src/pl.tab.h
	./src/pl/plpgsql/src/pl_gram.c
	./src/pl/plpgsql/src/pl_scan.c

Shouldn't they better be created in the build tree?

Servus
 Manfred

diff -ruN ../base/src/bin/scripts/Makefile src/bin/scripts/Makefile
--- ../base/src/bin/scripts/Makefile	2003-04-04 15:45:47.0 +0200
+++ src/bin/scripts/Makefile	2003-04-05 14:42:15.0 +0200
@@ -16,7 +16,7 @@
 
 SCRIPTS := vacuumdb clusterdb
 PROGRAMS = createdb createlang createuser dropdb droplang dropuser
 
-override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+override CPPFLAGS := -I$(libpq_srcdir) -I. -I$(top_srcdir)/src/bin/scripts $(CPPFLAGS)
 
 all: submake-libpq submake-backend $(PROGRAMS)