Re: [HACKERS] Using multi-row technique with COPY
On Sun, Nov 27, 2005 at 05:45:31PM -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: COPY FROM can read in sufficient rows until it has a whole block worth of data, then get a new block and write it all with one pair of BufferLock calls. Comments? I don't see any way to do this without horrible modularity violations. The COPY code has no business going anywhere near individual buffers; for that matter, it doesn't even really know what a block worth of data is, since the tuples it's dealing with aren't toasted yet. I wonder if you could get the same effect by having COPY copy into a temporary table (thus no WAL, no bufferlocks) and then every hundred rows do a SELECT INTO. Mind you, that just moves the modularity violation elsewhere, since really only the bufmgr knows about the size of buffers. Whatever happened to that idea to build an entire datafile with COPY or some external tool and simply copy it into place and update the catalog? Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them.
Re: [HACKERS] Using multi-row technique with COPY
On Mon, 2005-11-28 at 09:40 +0100, Martijn van Oosterhout wrote: On Sun, Nov 27, 2005 at 05:45:31PM -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: COPY FROM can read in sufficient rows until it has a whole block worth of data, then get a new block and write it all with one pair of BufferLock calls. Comments? Whatever happened to that idea to build an entire datafile with COPY or some external tool and simply copy it into place and update the catalog? What's wrong with tuning the server to do this? Zapping the catalog as a normal operation is the wrong approach if you want a robust system. All actions on the catalog must be under tight control. Most other RDBMSs support a fast-path loader, but all of them include strong hooks into the main server to maintain the catalog correctly. That is one approach, but it requires creation of an external API - which seems more work, plus a security risk. Copying data in a block at a time is the basic technique all use. I never discuss implementing features that other RDBMSs have for any other reason than that a similar use case exists for both. There are many features where PostgreSQL is already ahead. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS][OT] Doubt
http://acronymfinder.com/ Cheers, Csaba. On Fri, 2005-11-25 at 19:24, Gustavo Tonini wrote: What is ISTM? Sorry, Gustavo. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Using multi-row technique with COPY
On Sun, 2005-11-27 at 17:45 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: COPY FROM can read in sufficient rows until it has a whole block worth of data, then get a new block and write it all with one pair of BufferLock calls. Comments? I don't see any way to do this without horrible modularity violations. The COPY code has no business going anywhere near individual buffers; for that matter, it doesn't even really know what a block worth of data is, since the tuples it's dealing with aren't toasted yet. I've taken on board your comments about modularity issues from earlier. [I've not included anything on unique indexes, notice] I was expecting to buffer this in the heap access method with a new call, say, heap_bulk_insert() rather than have all that code hanging around in COPY. A lower level routine RelationGetBufferForTupleArray can handle the actual grunt. It can work, without ugliness. We'd need to handle a buffer bigger than a single tuple anyway, so you keep adding tuples until the last one tips over the edge, which then gets saved for the next block. Heap access method knows about blocks. We could reasonably do a test for would-be-toasted within those routines. I should have said that this wouldn't apply if any of the tuples require toasting, which of course has to be a dynamic test. COPY would only need to know whether it was invoking the normal or the bulk mode, which is reasonable, since it knows about indexes, triggers etc. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Using multi-row technique with COPY
Simon Riggs wrote: On Sun, 2005-11-27 at 17:45 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: COPY FROM can read in sufficient rows until it has a whole block worth of data, then get a new block and write it all with one pair of BufferLock calls. Comments? I don't see any way to do this without horrible modularity violations. The COPY code has no business going anywhere near individual buffers; for that matter, it doesn't even really know what a block worth of data is, since the tuples it's dealing with aren't toasted yet. I've taken on board your comments about modularity issues from earlier. [I've not included anything on unique indexes, notice] I don't see why we couldn't have an additional index access method entry point to insert multiple rows in one call. For a unique index, we could return an array of entries that violated the condition (if we were to implement such a feature for single-entry insertion.) -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Using multi-row technique with COPY
On Mon, 2005-11-28 at 00:56 +, Simon Riggs wrote: On Sun, 2005-11-27 at 17:45 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: COPY FROM can read in sufficient rows until it has a whole block worth of data, then get a new block and write it all with one pair of BufferLock calls. Comments? I don't see any way to do this without horrible modularity violations. The COPY code has no business going anywhere near individual buffers; for that matter, it doesn't even really know what a block worth of data is, since the tuples it's dealing with aren't toasted yet. I've taken on board your comments about modularity issues from earlier. [I've not included anything on unique indexes, notice] I was expecting to buffer this in the heap access method with a new call, say, heap_bulk_insert() rather than have all that code hanging around in COPY. A lower level routine RelationGetBufferForTupleArray can handle the actual grunt. It can work, without ugliness. We'd need to handle a buffer bigger than a single tuple anyway, so you keep adding tuples until the last one tips over the edge, which then gets saved for the next block. Heap access method knows about blocks. We could reasonably do a test for would-be-toasted within those routines. I should have said that this wouldn't apply if any of the tuples require toasting, which of course has to be a dynamic test. If we had a buffer big enough (say 10-100x the page size), then we would not actually need to test toasting. We can just pass the big buffer to heap_bulk_insert() which inserts the whole buffer in as big chunks as needed to fill the free space on pages (with single page lock). -- Hannu ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Using multi-row technique with COPY
On Mon, Nov 28, 2005 at 09:39:36AM +, Simon Riggs wrote: On Mon, 2005-11-28 at 09:40 +0100, Martijn van Oosterhout wrote: Whatever happened to that idea to build an entire datafile with COPY or some external tool and simply copy it into place and update the catalog? What's wrong with tuning the server to do this? It was just a question. I remember it being mentioned but didn't hear if it had been rejected or just not implemented. I was wondering if maybe we can approach this another way. Whenever you create a new block by fetching with P_NEW you pretty much know that no other backends are going to be interested in this block. Doesn't this mean you can simply hold this block exclusively until the end of the transaction? Hence you can avoid all the locking on this block. If this achieves the same reduction in block locking then it would be active in far more places, such as SELECT INTO or any large copy (I think index creation is already special-cased). Once a block has been written once it would revert to standard locking. Is this feasible? Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them.
[HACKERS] Getting different number of results when using hashjoin on/off
I've a problem that might be a bug in the core system (hashjoins) or with ltree using a GiST index, but I fail miserably to produce a useful testcase (using 8.1; it worked in 8.0): A query produces wrong (=0) results; when a different plan is enforced, I get a merge-join plan that looks similar but produces the correct result (=16 rows). I can post a queryplan, but cannot post the data itself since it's confidential (though I might be able to randomize some data and construct a self-contained case, but this would take quite some time). The working case is: set enable_hashjoin to off;

 Seq Scan on foo1  (cost=0.00..423583.57 rows=10810 width=4) (actual time=675.422..706.815 rows=16 loops=1)
   Filter: (subplan)
   SubPlan
     ->  Merge Join  (cost=19.49..19.55 rows=1 width=0) (actual time=0.028..0.028 rows=0 loops=21619)
           Merge Cond: ("outer".str_id = "inner".id)
           ->  Sort  (cost=6.49..6.50 rows=5 width=4) (actual time=0.023..0.023 rows=0 loops=21619)
                 Sort Key: bz.str_id
                 ->  Bitmap Heap Scan on foo2 bz  (cost=2.02..6.43 rows=5 width=4) (actual time=0.012..0.012 rows=0 loops=21619)
                       Recheck Cond: (bid = $0)
                       ->  Bitmap Index Scan on foo2_bid_key1  (cost=0.00..2.02 rows=5 width=0) (actual time=0.009..0.009 rows=0 loops=21619)
                             Index Cond: (bid = $0)
           ->  Sort  (cost=13.00..13.01 rows=6 width=4) (actual time=0.002..0.003 rows=1 loops=136)
                 Sort Key: str.id
                 ->  Bitmap Heap Scan on structure str  (cost=2.02..12.92 rows=6 width=4) (actual time=0.095..0.097 rows=1 loops=1)
                       Recheck Cond: (path ~ '142.2330445.2330598.2330676.*'::lquery)
                       ->  Bitmap Index Scan on str_uk4  (cost=0.00..2.02 rows=6 width=0) (actual time=0.086..0.086 rows=1 loops=1)
                             Index Cond: (path ~ '142.2330445.2330598.2330676.*'::lquery)
 Total runtime: 707.019 ms

16 rows...
The failing case is: set enable_hashjoin to on;

 Seq Scan on foo1  (cost=0.00..421679.00 rows=10810 width=4) (actual time=154.663..154.663 rows=0 loops=1)
   Filter: (subplan)
   SubPlan
     ->  Hash Join  (cost=8.47..19.46 rows=1 width=0) (actual time=0.004..0.004 rows=0 loops=21619)
           Hash Cond: ("outer".id = "inner".str_id)
           ->  Bitmap Heap Scan on structure str  (cost=2.02..12.92 rows=6 width=4) (actual time=0.100..30.095 rows=1 loops=1)
                 Recheck Cond: (path ~ '142.2330445.2330598.2330676.*'::lquery)
                 ->  Bitmap Index Scan on str_uk4  (cost=0.00..2.02 rows=6 width=0) (actual time=0.090..0.090 rows=1 loops=1)
                       Index Cond: (path ~ '142.2330445.2330598.2330676.*'::lquery)
           ->  Hash  (cost=6.43..6.43 rows=5 width=4) (actual time=0.032..0.032 rows=0 loops=1)
                 ->  Bitmap Heap Scan on foo2 bz  (cost=2.02..6.43 rows=5 width=4) (actual time=0.025..0.025 rows=0 loops=1)
                       Recheck Cond: (bid = $0)
                       ->  Bitmap Index Scan on foo2_bid_key1  (cost=0.00..2.02 rows=5 width=0) (actual time=0.021..0.021 rows=0 loops=1)
                             Index Cond: (bid = $0)
 Total runtime: 154.862 ms

No rows. The query itself is quite simple:

select foo1.id from foo1
 where foo1.datloesch is null
   and exists (select 1 from foo2 bz, structure str
                where bz.bid = foo1.id
                  and str.id = bz.str_id
                  and str.path ~ '*.2330676.*');

The path field is an ltree column with a GiST index on it. Any ideas what I could try to track this down? Best regards, Mario Weilguni ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Getting different number of results when using hashjoin
The path field is an ltree column, with an GIST index on it. Something to do with bitmap indexscans on lossy indexes? Chris ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Getting different number of results when using hashjoin on/off
On Monday, 28 November 2005 at 14:12, Christopher Kings-Lynne wrote: The path field is an ltree column, with a GiST index on it. Something to do with bitmap indexscans on lossy indexes? Chris I doubt that; set enable_bitmapscan to off produces the wrong result as well. Best regards Mario ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
[HACKERS] What's wrong with the lists?
Hi, Is there something weird going on with the lists? I ask because lately there has been spam showing up rather frequently, something that happened only very occasionally before. Has there been a change from human moderation to robotic approval of messages? Also, my last auto-mails to pgsql-committers have been held for moderator approval. Maybe I should subscribe-nomail the @postgresql.org account to the list so that it doesn't happen ... but I wonder why it isn't working like it was before? Not meant to criticize anyone -- I'd just like to adjust my config if there's a need to. -- Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4 The easiest way to resolve [trivial code guidelines disputes] is to fire one or both of the people involved. (Damian Conway) ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
[HACKERS] Allow an alias for the target table in UPDATE
I am interested in the following item in TODO. o Allow an alias to be provided for the target table in UPDATE/DELETE This is not SQL-spec but many DBMSs allow it. I think this functionality is useful for migration from other RDBMSs. However, the SQL92 spec does not allow an alias for the target table in UPDATE. Is it still a TODO? If it is, I will try it. --- Atsushi Ogawa ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Using multi-row technique with COPY
Alvaro Herrera [EMAIL PROTECTED] writes: I don't see why couldn't have an additional index access method entry point to insert multiple rows on one call. I think Simon was mainly on about the idea of inserting multiple *heap* entries with one call, ie, only one cycle of locking a heap buffer. It's not apparent to me that there'd be any noticeable win from batching index insertions, because even if the heap entries are all on one page, it's unlikely their index entries would be. I'm a bit dubious about the whole thing from a concurrency standpoint, too: holding locks longer is not a great thing in general. The recent patch to avoid repeated locking cycles during a read seems OK, because (1) checking visibility of a tuple is ordinarily pretty cheap, and (2) it's only a shared lock so other people can be doing the same thing concurrently. However, heap insertion can make neither of those claims. You're talking about an exclusive lock, and you're talking about holding it while copying data around and then making multiple WAL entries. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
[HACKERS] Ipv6 and Postgresql 8.0.3
PLZ REPLY Hello there, I run Postgresql 8.0.3 on Tru64 Unix m/c. I have included the ipv6 auth. line in my pg_hba.conf file(::1/128) I keep getting error msgs from postmaster everytime I try to connect. Going by previous posts on the topic am unable to conclude. Does this mean pg 8.0.3 doesn't support ipv6 client auth. ?? Or is there a patch somewhere ?? Plz reply.Thanks in advance. -- Rajesh R SORRY FOR THE CROSS POST last time. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Ipv6 and Postgresql 8.0.3
First, do not name the mailing list in both the To: and Cc: lines - that's just redundant. Second, this is the wrong list to ask this question, as you were told last time. Ask on the pgsql-general list. cheers andrew R, Rajesh (STSD) wrote: PLZ REPLY Hello there, I run Postgresql 8.0.3 on Tru64 Unix m/c. I have included the ipv6 auth. line in my pg_hba.conf file(::1/128) I keep getting error msgs from postmaster everytime I try to connect. Going by previous posts on the topic am unable to conclude. Does this mean pg 8.0.3 doesn't support ipv6 client auth. ?? Or is there a patch somewhere ?? Plz reply.Thanks in advance. -- Rajesh R SORRY FOR THE CROSS POST last time. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Ipv6 and Postgresql 8.0.3
Andrew Dunstan [EMAIL PROTECTED] writes: First, do not name the mailing list in both the To: and Cc: lines - that's just redundant. Second, this is the wrong list to ask this question, as you were told last time. Ask on the pgsql-general list. And third, show us the darn error messages ... the reason you weren't answered any of the previous times is that you still haven't offered enough information to let anyone help you. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Getting different number of results when using hashjoin on/off
Mario Weilguni [EMAIL PROTECTED] writes: The failing case is: ...

   SubPlan
     ->  Hash Join  (cost=8.47..19.46 rows=1 width=0) (actual time=0.004..0.004 rows=0 loops=21619)
           Hash Cond: ("outer".id = "inner".str_id)
           ->  Bitmap Heap Scan on structure str  (cost=2.02..12.92 rows=6 width=4) (actual time=0.100..30.095 rows=1 loops=1)
                 Recheck Cond: (path ~ '142.2330445.2330598.2330676.*'::lquery)
                 ->  Bitmap Index Scan on str_uk4  (cost=0.00..2.02 rows=6 width=0) (actual time=0.090..0.090 rows=1 loops=1)
                       Index Cond: (path ~ '142.2330445.2330598.2330676.*'::lquery)
           ->  Hash  (cost=6.43..6.43 rows=5 width=4) (actual time=0.032..0.032 rows=0 loops=1)
                 ->  Bitmap Heap Scan on foo2 bz  (cost=2.02..6.43 rows=5 width=4) (actual time=0.025..0.025 rows=0 loops=1)
                       Recheck Cond: (bid = $0)
                       ->  Bitmap Index Scan on foo2_bid_key1  (cost=0.00..2.02 rows=5 width=0) (actual time=0.021..0.021 rows=0 loops=1)
                             Index Cond: (bid = $0)

Hmm, I wonder why the hash join's input nodes are showing loops=1 ... the hash depends on the subplan parameter $0 so it needs to be re-evaluated each time through. It looks like that's not happening. Do you have the corresponding results from 8.0 --- if so, what do the loop counts look like? regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] Help: 8.0.3 Vacuum of an empty table never completes ...
G'day folks. We have a production database running 8.0.3 which gets fully pg_dump'd and vacuum analyze'd hourly by cron. Something strange happened to us on the 5AM Friday Nov. 25th cron run -- the

/usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose > $DATE/vacuum.log

step in our cron procedure never completed. Strange, since no known event of note happened on Friday -- we were all out of the office past Wed. for the American Thanksgiving holiday. Anyway, running the vacuum line by hand shows it getting stuck -- it processes the majority of our tables, then just stops, and the backend postmaster just stops accumulating CPU time. Comparing the logs further with when it did complete, it seems that one table in particular (at least) seems afflicted:

social=# vacuum verbose analyze agency.swlog_client;

hangs up forever -- have to control-C the client. Likewise without 'analyze'. pg_dump'ing the entire database works (phew!) and upon restoring on a backup box, said table can be vacuum'd:

social=# vacuum verbose analyze agency.swlog_client;
INFO:  vacuuming "agency.swlog_client"
INFO:  index "swlog_client_pkey" now contains 0 row versions in 1 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "swlog_client": found 0 removable, 0 nonremovable row versions in 0 pages
DETAIL:  0 dead row versions cannot be removed yet. There were 0 unused item pointers. 0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  analyzing "agency.swlog_client"
INFO:  "swlog_client": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows
VACUUM

That's right -- a completely empty table -- which is what we actually expect. How should we proceed such that we can learn from this as well as get our entire database vacuuming again successfully? Running on Linux 2.6.8-24.18-smp (SuSE 9.2).
No juicy filesystem-related messages in dmesg nor /var/log/messages. 11% disk used on the postgres-related partition. The table in question is defined as:

social=# \d agency.swlog_client
   Table "agency.swlog_client"
 Column |  Type  | Modifiers
--------+--------+-----------
 swlog  | bigint | not null
 client | bigint | not null
Indexes:
    "swlog_client_pkey" PRIMARY KEY, btree (swlog, client)
Foreign-key constraints:
    "$2" FOREIGN KEY (client) REFERENCES agency.client(id)
    "$1" FOREIGN KEY (swlog) REFERENCES agency.swlog(id)

And the two fk'd tables:

social=# select count(*) from agency.client;
 count
-------
     0
(1 row)

social=# select count(*) from agency.swlog;
 count
-------
    69

We doubt that there could be any strange oddball extremely long-running transaction in any of those related tables gumming up this table. Finally, the only potentially interesting event database-wise happened on Wed. Nov. 23rd -- we SIGHUP'd the postmaster to have it learn a higher value for work_mem (10240, up from the default of 1024). But the hourly crons went great for the subsequent two days. maintenance_work_mem is still at the default of 16384. Many thanks in advance! James James Robinson Socialserve.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Help: 8.0.3 Vacuum of an empty table never completes ...
James Robinson [EMAIL PROTECTED] writes: Comparing the logs further with when it did complete, it seems that one table in particular (at least) seems afflicted: social=# vacuum verbose analyze agency.swlog_client; hangs up forever -- have to control-c the client. Likewise for w/o 'analyze'. Given that it's not eating CPU time, one would guess that it's blocked waiting for a lock. Can you find any relevant locks in pg_locks? regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
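A hedged sketch of such a pg_locks query (using the 8.0-era columns -- relation, pid, mode, granted -- as they appear later in this thread): it pairs each ungranted request with the granted locks on the same relation, so you can see which backend is holding what the waiter wants:

```sql
-- Who is waiting on whom?  Join ungranted lock requests against granted
-- locks on the same relation; h.pid is the holder, w.pid the waiter.
SELECT w.pid  AS waiting_pid,  w.mode AS waiting_mode,
       h.pid  AS holding_pid,  h.mode AS holding_mode,
       w.relation
  FROM pg_locks w
  JOIN pg_locks h ON h.relation = w.relation
                 AND h.granted
                 AND h.pid <> w.pid
 WHERE NOT w.granted;
```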
[HACKERS] Anonymous CVS working?
Any problems with CVS or anonymous CVS since the work last evening? Anonymous CVS hasn't given me the following commit yet; it's been almost twelve hours since it was made: http://archives.postgresql.org/pgsql-committers/2005-11/msg00553.php -- Michael Fuhr ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Getting different number of results when using hashjoin on/off
Mario Weilguni [EMAIL PROTECTED] writes: Yes. This is from an 8.0.3 (with slightly older and different data, resulting in only 9 rows, but the rest is the same): Yeah, that looks more reasonable. I tried to reproduce this, without any luck:

regression=# explain analyze select count(*) from tenk1 a
  where exists (select 1 from tenk1 b, tenk1 c
                 where b.unique1 = c.unique2 and b.hundred in (4,5)
                   and c.hundred = a.hundred);

 Aggregate  (cost=3879742.37..3879742.38 rows=1 width=0) (actual time=46579.077..46579.082 rows=1 loops=1)
   ->  Seq Scan on tenk1 a  (cost=0.00..3879729.87 rows=5000 width=0) (actual time=5.129..46528.208 rows=8500 loops=1)
         Filter: (subplan)
         SubPlan
           ->  Hash Join  (cost=229.20..546.66 rows=2 width=0) (actual time=4.569..4.569 rows=1 loops=1)
                 Hash Cond: ("outer".unique1 = "inner".unique2)
                 ->  Bitmap Heap Scan on tenk1 b  (cost=4.69..321.15 rows=196 width=4) (actual time=0.947..1.698 rows=90 loops=1)
                       Recheck Cond: ((hundred = 4) OR (hundred = 5))
                       ->  BitmapOr  (cost=4.69..4.69 rows=197 width=0) (actual time=0.544..0.544 rows=0 loops=1)
                             ->  Bitmap Index Scan on tenk1_hundred  (cost=0.00..2.34 rows=98 width=0) (actual time=0.271..0.271 rows=100 loops=1)
                                   Index Cond: (hundred = 4)
                             ->  Bitmap Index Scan on tenk1_hundred  (cost=0.00..2.34 rows=98 width=0) (actual time=0.262..0.262 rows=100 loops=1)
                                   Index Cond: (hundred = 5)
                 ->  Hash  (cost=224.26..224.26 rows=100 width=4) (actual time=2.370..2.370 rows=100 loops=1)
                       ->  Bitmap Heap Scan on tenk1 c  (cost=2.35..224.26 rows=100 width=4) (actual time=0.492..1.616 rows=100 loops=1)
                             Recheck Cond: (hundred = $0)
                             ->  Bitmap Index Scan on tenk1_hundred  (cost=0.00..2.35 rows=100 width=0) (actual time=0.278..0.278 rows=100 loops=1)
                                   Index Cond: (hundred = $0)
 Total runtime: 46584.654 ms
(19 rows)

(I'm not bothering with setting up an ltree index, since the question of what index is being used shouldn't affect hashjoin's decision to rescan or not.) That's using 8.1 branch CVS tip, but there aren't any related bug fixes since 8.1 release. We did have several bug fixes in the hash join code during the 8.1 beta cycle though ... is it possible you are really running an 8.1 beta and not 8.1.0? regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Help: 8.0.3 Vacuum of an empty table never completes ...
As fate would have it, the vacuumdb frontend and backend which were initially afflicted are still in existence:

sscadmin 19236 19235 0 Nov25 ? 00:00:00 /usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose
postgres 19244  3596 0 Nov25 ? 00:00:02 postgres: postgres social [local] VACUUM

pid 19244. And here's pg_locks:

social=# select * from pg_locks;
 relation | database | transaction |  pid  |           mode           | granted
----------+----------+-------------+-------+--------------------------+---------
          |          |    38790657 | 19244 | ExclusiveLock            | t
  6586066 |  6585892 |             | 28406 | ShareUpdateExclusiveLock | f
          |          |    39097312 | 28861 | ExclusiveLock            | t
          |          |    39089744 | 28756 | ExclusiveLock            | t
  6586066 |  6585892 |             | 28756 | ShareUpdateExclusiveLock | f
  6586066 |  6585892 |             | 19244 | ShareUpdateExclusiveLock | t
  6586066 |  6585892 |             | 19244 | ShareUpdateExclusiveLock | t
  8417138 |  6585892 |             | 19244 | ShareUpdateExclusiveLock | t
    16839 |  6585892 |             | 28861 | AccessShareLock          | t
          |          |    39063661 | 28560 | ExclusiveLock            | t
          |          |    39056736 | 28406 | ExclusiveLock            | t
  6586066 |  6585892 |             | 28560 | ShareUpdateExclusiveLock | f
(12 rows)

  pid
-------
 19244
 28406
 28560
 28756
 28861
(5 rows)

Of those 5 pids:

19244 -- vacuum backend initially afflicted -- status in argv: 'postgres: postgres social [local] VACUUM'
28406 -- a 10AM-today vacuum started up by cron this morning after I got things half-way working again early in the diagnosis of this situation. args: 'postgres: postgres social [local] VACUUM waiting'
28560 -- a 10:16-today by-hand vacuum session further in diagnosis land. args: 'postgres: postgres social [local] VACUUM waiting'
28756 -- 11AM cron'd process. Yes, I see a quickly mounting issue here. args: 'postgres: postgres social [local] VACUUM waiting'
28861 -- production servicing backend, now back in idle state [not 'idle in transaction', just regular idle].
On Nov 28, 2005, at 11:09 AM, Tom Lane wrote: James Robinson [EMAIL PROTECTED] writes: Comparing the logs further with when it did complete, it seems that one table in particular (at least) seems afflicted: social=# vacuum verbose analyze agency.swlog_client; hangs up forever -- have to control-c the client. Likewise for w/o 'analyze'. Given that it's not eating CPU time, one would guess that it's blocked waiting for a lock. Can you find any relevant locks in pg_locks? regards, tom lane James Robinson Socialserve.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Help: 8.0.3 Vacuum of an empty table never completes ...
James Robinson [EMAIL PROTECTED] writes: As fate would have it, the vacuumdb frontend and backend which were initially afflicted are still in existence: OK, so pid 19244 isn't blocked on any lmgr lock; else we'd see an entry with granted = f for it in pg_locks. It could be blocked on a lower-level lock though. Can you attach to that PID with gdb and get a stack trace? Something like (as postgres user):

    gdb /path/to/postgres-executable 19244
    (gdb) bt
    (gdb) quit
    ok to detach? yes

regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Getting different number of results when using hashjoin on/off
Thanks Tom for you quick answer! No, I'm using 8.1.0, and tried it on different machines, always the same results. SELECT version(); PostgreSQL 8.1.0 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.4 20040623 (Gentoo Hardened Linux 3.3.4-r1, ssp-3.3.2-2, pie-8.7.6) Best regards, Mario Weilguni icomedias - Digitale Kommunikation Mario Weilguni, Forschung und Entwicklung [EMAIL PROTECTED], http://www.icomedias.com/ icomedias Österreich Systemhaus GmbH: 8020 Graz, Entenplatz 1 Tel: +43 (316) 721.671-272, Fax: -103 icomedias Deutschland Systemhaus GmbH: 10969 Berlin, Alexandrinenstraße 2-3 Tel: +49 (30) 695.399-272, Fax: -103 -Original Message- From: Tom Lane [mailto:[EMAIL PROTECTED] Sent: Monday, November 28, 2005 5:20 PM To: Mario Weilguni Cc: Mario Weilguni; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Getting different number of results when using hashjoin on/off Mario Weilguni [EMAIL PROTECTED] writes: Yes. This is from a 8.0.3 (with slightly older and different data, resulting in only 9 rows, but the rest is the same): Yeah, that looks more reasonable. 
We did have several bug fixes in the hash join code during the 8.1 beta cycle though ... is it possible you are really running an 8.1 beta and not 8.1.0?
regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Help: 8.0.3 Vacuum of an empty table never completes ...
Here ya go -- BTW -- your guys support is the _best_. But you know that already: [EMAIL PROTECTED]:/home/sscadmin gdb /usr/local/pgsql/bin/postgres 19244 GNU gdb 6.2.1 Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type show copying to see the conditions. There is absolutely no warranty for GDB. Type show warranty for details. This GDB was configured as i586-suse-linux...Using host libthread_db library /lib/tls/libthread_db.so.1. Attaching to program: /usr/local/pgsql/bin/postgres, process 19244 Reading symbols from /lib/libz.so.1...done. Loaded symbols for /lib/libz.so.1 Reading symbols from /lib/libreadline.so.5...done. Loaded symbols for /lib/libreadline.so.5 Reading symbols from /lib/libcrypt.so.1...done. Loaded symbols for /lib/libcrypt.so.1 Reading symbols from /lib/libresolv.so.2...done. Loaded symbols for /lib/libresolv.so.2 Reading symbols from /lib/libnsl.so.1...done. Loaded symbols for /lib/libnsl.so.1 Reading symbols from /lib/libdl.so.2...done. Loaded symbols for /lib/libdl.so.2 Reading symbols from /lib/tls/libm.so.6...done. Loaded symbols for /lib/tls/libm.so.6 Reading symbols from /lib/tls/libc.so.6...done. Loaded symbols for /lib/tls/libc.so.6 Reading symbols from /lib/libncurses.so.5...done. Loaded symbols for /lib/libncurses.so.5 Reading symbols from /lib/ld-linux.so.2...done. Loaded symbols for /lib/ld-linux.so.2 Reading symbols from /lib/libnss_compat.so.2...done. Loaded symbols for /lib/libnss_compat.so.2 Reading symbols from /lib/libnss_nis.so.2...done. Loaded symbols for /lib/libnss_nis.so.2 Reading symbols from /lib/libnss_files.so.2...done. Loaded symbols for /lib/libnss_files.so.2 0xe410 in ?? () (gdb) bt #0 0xe410 in ?? () #1 0xbfffd508 in ?? () #2 0x082aef97 in PqSendBuffer () #3 0xbfffd4f0 in ?? 
() #4 0xb7ec03e1 in send () from /lib/tls/libc.so.6 #5 0x08137d27 in secure_write () #6 0x0813c2a7 in internal_flush () #7 0x0813c4ff in pq_flush () #8 0x0820bfec in EmitErrorReport () #9 0x0820b5ac in errfinish () #10 0x0811d0a8 in lazy_vacuum_rel () #11 0x0811ac5a in vacuum_rel () #12 0x0811bb93 in vacuum () #13 0x0819c84d in PortalRunUtility () #14 0x0819d9b8 in PortalRun () #15 0x0819b221 in PostgresMain () #16 0x0816ffa9 in ServerLoop () #17 0x08170de9 in PostmasterMain () #18 0x0813e5e5 in main () (gdb) quit James Robinson Socialserve.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Help: 8.0.3 Vacuum of an empty table never completes ...
James Robinson [EMAIL PROTECTED] writes: (gdb) bt #0 0xe410 in ?? () #1 0xbfffd508 in ?? () #2 0x082aef97 in PqSendBuffer () #3 0xbfffd4f0 in ?? () #4 0xb7ec03e1 in send () from /lib/tls/libc.so.6 #5 0x08137d27 in secure_write () #6 0x0813c2a7 in internal_flush () #7 0x0813c4ff in pq_flush () #8 0x0820bfec in EmitErrorReport () #9 0x0820b5ac in errfinish () #10 0x0811d0a8 in lazy_vacuum_rel () #11 0x0811ac5a in vacuum_rel () #12 0x0811bb93 in vacuum () #13 0x0819c84d in PortalRunUtility () #14 0x0819d9b8 in PortalRun () #15 0x0819b221 in PostgresMain () #16 0x0816ffa9 in ServerLoop () #17 0x08170de9 in PostmasterMain () #18 0x0813e5e5 in main () (gdb) quit Hmm, what this says is that the backend is blocked trying to send an error or notice message to the client. We can't tell anything about what the message was, but that's not so relevant ... the focus now has to shift to the network or client side, ie, why in the world is the backend waiting on the client to accept a message? Can you get a similar backtrace from the vacuumdb process? (Obviously, give gdb the vacuumdb executable not the postgres one.) regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Help: 8.0.3 Vacuum of an empty table never completes ...
On Nov 28, 2005, at 11:38 AM, Tom Lane wrote: Can you get a similar backtrace from the vacuumdb process? (Obviously, give gdb the vacuumdb executable not the postgres one.)

OK:

(gdb) bt
#0  0xe410 in ?? ()
#1  0xbfffe4f8 in ?? ()
#2  0x0030 in ?? ()
#3  0x08057b68 in ?? ()
#4  0xb7e98533 in __write_nocancel () from /lib/tls/libc.so.6
#5  0xb7e4aae6 in _IO_new_file_write () from /lib/tls/libc.so.6
#6  0xb7e4a7e5 in new_do_write () from /lib/tls/libc.so.6
#7  0xb7e4aa63 in _IO_new_file_xsputn () from /lib/tls/libc.so.6
#8  0xb7e413a2 in fputs () from /lib/tls/libc.so.6
#9  0xb7fd8f99 in defaultNoticeProcessor () from /usr/local/pgsql/lib/libpq.so.4
#10 0xb7fd8fe5 in defaultNoticeReceiver () from /usr/local/pgsql/lib/libpq.so.4
#11 0xb7fe2d34 in pqGetErrorNotice3 () from /usr/local/pgsql/lib/libpq.so.4
#12 0xb7fe3921 in pqParseInput3 () from /usr/local/pgsql/lib/libpq.so.4
#13 0xb7fdb174 in parseInput () from /usr/local/pgsql/lib/libpq.so.4
#14 0xb7fdca99 in PQgetResult () from /usr/local/pgsql/lib/libpq.so.4
#15 0xb7fdcc4b in PQexecFinish () from /usr/local/pgsql/lib/libpq.so.4
#16 0x0804942c in vacuum_one_database ()
#17 0x080497a1 in main ()

Things to know which could possibly be of use: this cron is kicked off on the backup database box, and the vacuumdb is run via ssh to the primary box. The primary box is running the vacuumdb operation with --analyze --verbose, with the output being streamed to a logfile on the backup box.

Lemme guess: __write_nocancel calls syscall write, and 0x0030 might well be the syscall entry point? Something gumming up the networking or sshd itself could have stopped up the output queues, and the back-up propagated all the way down to this level? If so, only dummies backup / vacuum direct to remote?

James Robinson Socialserve.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Help: 8.0.3 Vacuum of an empty table never completes ...
James Robinson [EMAIL PROTECTED] writes: On Nov 28, 2005, at 11:38 AM, Tom Lane wrote: Can you get a similar backtrace from the vacuumdb process? OK: (gdb) bt #0 0xe410 in ?? () #1 0xbfffe4f8 in ?? () #2 0x0030 in ?? () #3 0x08057b68 in ?? () #4 0xb7e98533 in __write_nocancel () from /lib/tls/libc.so.6 #5 0xb7e4aae6 in _IO_new_file_write () from /lib/tls/libc.so.6 #6 0xb7e4a7e5 in new_do_write () from /lib/tls/libc.so.6 #7 0xb7e4aa63 in _IO_new_file_xsputn () from /lib/tls/libc.so.6 #8 0xb7e413a2 in fputs () from /lib/tls/libc.so.6 #9 0xb7fd8f99 in defaultNoticeProcessor () from /usr/local/pgsql/lib/ libpq.so.4 #10 0xb7fd8fe5 in defaultNoticeReceiver () from /usr/local/pgsql/lib/ libpq.so.4 #11 0xb7fe2d34 in pqGetErrorNotice3 () from /usr/local/pgsql/lib/ libpq.so.4 #12 0xb7fe3921 in pqParseInput3 () from /usr/local/pgsql/lib/libpq.so.4 #13 0xb7fdb174 in parseInput () from /usr/local/pgsql/lib/libpq.so.4 #14 0xb7fdca99 in PQgetResult () from /usr/local/pgsql/lib/libpq.so.4 #15 0xb7fdcc4b in PQexecFinish () from /usr/local/pgsql/lib/libpq.so.4 #16 0x0804942c in vacuum_one_database () #17 0x080497a1 in main () OK, so evidently the backend is sending NOTICE messages, and the vacuumdb is blocked trying to copy those messages to stderr. Things to know which could possibly be of use. This cron is kicked off on the backup database box, and the vacuumdb is run via ssh to the primary box. The primary box is running the vacuumdb operation with --analyze --verbose, with the output being streamed to a logfile on the backup box. Lemme guess __write_nocancel calls syscall write, and 0x0030 might could well be the syscall entry point? Something gumming up the networking or sshd itself could have stopped up the ouput queues, and the backups populated all the way down to this level? That's what it looks like: the output queue from the vacuumdb has stopped up somehow. Your next move is to look at the state of sshd and whatever is running at the client end of the ssh tunnel. 
regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
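[Editor's note] The failure mode Tom describes -- a synchronous notice callback wedging the whole libpq read loop -- can be modeled in a few lines. This is an illustrative Python sketch with invented names, not the libpq API: a bounded sink stands in for the blocked stderr, and the read loop stands in for PQgetResult/pqParseInput3.

```python
from collections import deque

class BoundedSink:
    """Stand-in for a stderr whose downstream (ssh, logfile) has stalled."""
    def __init__(self, capacity):
        self.buf = deque()
        self.capacity = capacity
    def write(self, data):
        if len(self.buf) >= self.capacity:
            return False          # a real fputs() would block right here
        self.buf.append(data)
        return True

def read_loop(messages, sink):
    # Process protocol messages in order; stall as soon as the notice
    # sink is full -- the position of fputs() in frame #8 of the trace.
    done = 0
    for kind, payload in messages:
        if kind == "notice" and not sink.write(payload):
            break
        done += 1
    return done

msgs = [("notice", f"INFO: vacuuming table {i}") for i in range(10)] + [("result", "ok")]
sink = BoundedSink(capacity=3)
done = read_loop(msgs, sink)
print(done)  # 3: the loop stalled long before the query result arrived
```

Because the callback is synchronous, no later message -- including the CommandComplete the client is waiting for -- can be parsed once the sink jams, which is why both ends of the connection appear hung.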
Re: [HACKERS] Getting different number of results when using hashjoin on/off
Mario Weilguni [EMAIL PROTECTED] writes: No, I'm using 8.1.0, and tried it on different machines, always the same results.

I see it, I think: the recent changes to avoid work when one or the other side of the hash join is empty would exit the hash join leaving a state that confused ExecReScanHashJoin() into thinking it didn't have to do anything. Try the attached patch.

regards, tom lane

Index: src/backend/executor/nodeHashjoin.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/executor/nodeHashjoin.c,v
retrieving revision 1.75.2.1
diff -c -r1.75.2.1 nodeHashjoin.c
*** src/backend/executor/nodeHashjoin.c	22 Nov 2005 18:23:09 -0000	1.75.2.1
--- src/backend/executor/nodeHashjoin.c	28 Nov 2005 17:04:43 -0000
***************
*** 152,163 ****
  	 * outer join, we can quit without scanning the outer relation.
  	 */
  	if (hashtable->totalTuples == 0 && node->js.jointype != JOIN_LEFT)
- 	{
- 		ExecHashTableDestroy(hashtable);
- 		node->hj_HashTable = NULL;
- 		node->hj_FirstOuterTupleSlot = NULL;
  		return NULL;
- 	}

  	/*
  	 * need to remember whether nbatch has increased since we began
--- 152,158 ----
***************
*** 487,493 ****
  	{
  		ExecHashTableDestroy(node->hj_HashTable);
  		node->hj_HashTable = NULL;
- 		node->hj_FirstOuterTupleSlot = NULL;
  	}

  	/*
--- 482,487 ----
***************
*** 805,841 ****
  ExecReScanHashJoin(HashJoinState *node, ExprContext *exprCtxt)
  {
  	/*
- 	 * If we haven't yet built the hash table then we can just return; nothing
- 	 * done yet, so nothing to undo.
- 	 */
- 	if (node->hj_HashTable == NULL)
- 		return;
-
- 	/*
  	 * In a multi-batch join, we currently have to do rescans the hard way,
  	 * primarily because batch temp files may have already been released. But
  	 * if it's a single-batch join, and there is no parameter change for the
  	 * inner subnode, then we can just re-use the existing hash table without
  	 * rebuilding it.
  	 */
! 	if (node->hj_HashTable->nbatch == 1 &&
! 		((PlanState *) node)->righttree->chgParam == NULL)
! 	{
! 		/* okay to reuse the hash table; needn't rescan inner, either */
! 	}
! 	else
  	{
! 		/* must destroy and rebuild hash table */
! 		ExecHashTableDestroy(node->hj_HashTable);
! 		node->hj_HashTable = NULL;
! 		node->hj_FirstOuterTupleSlot = NULL;
! 		/*
! 		 * if chgParam of subnode is not null then plan will be re-scanned by
! 		 * first ExecProcNode.
! 		 */
! 		if (((PlanState *) node)->righttree->chgParam == NULL)
! 			ExecReScan(((PlanState *) node)->righttree, exprCtxt);
  	}

  	/* Always reset intra-tuple state */
--- 799,830 ----
  ExecReScanHashJoin(HashJoinState *node, ExprContext *exprCtxt)
  {
  	/*
  	 * In a multi-batch join, we currently have to do rescans the hard way,
  	 * primarily because batch temp files may have already been released. But
  	 * if it's a single-batch join, and there is no parameter change for the
  	 * inner subnode, then we can just re-use the existing hash table without
  	 * rebuilding it.
  	 */
! 	if (node->hj_HashTable != NULL)
  	{
! 		if (node->hj_HashTable->nbatch == 1 &&
! 			((PlanState *) node)->righttree->chgParam == NULL)
! 		{
! 			/* okay to reuse the hash table; needn't rescan inner, either */
! 		}
! 		else
! 		{
! 			/* must destroy and rebuild hash table */
! 			ExecHashTableDestroy(node->hj_HashTable);
! 			node->hj_HashTable = NULL;
! 			/*
! 			 * if chgParam of subnode is not null then plan will be re-scanned
! 			 * by first ExecProcNode.
! 			 */
! 			if (((PlanState *) node)->righttree->chgParam == NULL)
! 				ExecReScan(((PlanState *) node)->righttree, exprCtxt);
! 		}
  	}

  	/* Always reset intra-tuple state */
***************
*** 847,852 ****
--- 836,842 ----
  	node->js.ps.ps_TupFromTlist = false;
  	node->hj_NeedNewOuter = true;
  	node->hj_MatchedOuter = false;
+ 	node->hj_FirstOuterTupleSlot = NULL;

  	/*
  	 * if chgParam of subnode is not null then plan will be re-scanned by
---(end of broadcast)--- TIP 4: Have you searched our list archives?
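[Editor's note] The shape of the bug and of the fix can be modeled outside the executor. This is an illustrative Python toy -- the attribute names mirror the C fields, but nothing here is PostgreSQL code: once the empty-inner shortcut started destroying the hash table mid-scan, the old "no hash table, nothing to undo" early return in rescan also skipped resetting the per-tuple flags.

```python
class HashJoinState:
    """Toy stand-in for the executor node state (not PostgreSQL code)."""
    def __init__(self):
        self.hash_table = None       # hj_HashTable: gone after the old
                                     # empty-inner shortcut destroyed it
        self.need_new_outer = False  # hj_NeedNewOuter, mid-scan value
        self.matched_outer = True    # hj_MatchedOuter, mid-scan value

def rescan_pre_patch(node):
    # Old logic: "no hash table, nothing to undo" -- wrong, because the
    # shortcut above left hash_table == None with stale per-tuple state.
    if node.hash_table is None:
        return
    node.need_new_outer = True
    node.matched_outer = False

def rescan_patched(node):
    # New logic: only the reuse-or-rebuild decision is guarded; the
    # "always reset intra-tuple state" part runs unconditionally.
    if node.hash_table is not None:
        pass  # decide here whether to reuse or rebuild the hash table
    node.need_new_outer = True
    node.matched_outer = False

stale, fixed = HashJoinState(), HashJoinState()
rescan_pre_patch(stale)
rescan_patched(fixed)
print(stale.need_new_outer, fixed.need_new_outer)  # False True
```

The stale node answers the next scan with leftover per-tuple state, which is exactly the "confused ExecReScanHashJoin()" Tom describes.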
Re: [HACKERS] Anonymous CVS working?
Michael Fuhr [EMAIL PROTECTED] writes: Any problems with CVS or anonymous CVS since the work last evening? The master CVS is fine, but I agree that the anonymous mirror doesn't seem to be tracking it ... regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Anonymous CVS working?
Tom Lane wrote: Michael Fuhr [EMAIL PROTECTED] writes: Any problems with CVS or anonymous CVS since the work last evening? The master CVS is fine, but I agree that the anonymous mirror doesn't seem to be tracking it ... Apparently the CVSup server is down too. I wonder if the anon CVS copy is taken from CVSup. -- Alvaro Herrera  http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Getting different number of results when using hashjoin on/off
Hello Tom, thanks for the quick response. I've tried the patch, but it did not work as expected. When I set enable_hashjoin to off, everything works as expected, but with hashjoin on I do not even get results anymore: CPU goes up to 100%, and after 3 minutes I cancelled the query (it normally would take ~100-500 milliseconds). I will check the patch on a different machine again and inform you of the results.

Best regards, Mario Weilguni

-----Original Message-----
From: Tom Lane [mailto:[EMAIL PROTECTED]
Sent: Monday, November 28, 2005 6:09 PM
To: Mario Weilguni
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Getting different number of results when using hashjoin on/off

Mario Weilguni [EMAIL PROTECTED] writes: No, I'm using 8.1.0, and tried it on different machines, always the same results. I see it, I think: the recent changes to avoid work when one or the other side of the hash join is empty would exit the hash join leaving a state that confused ExecReScanHashJoin() into thinking it didn't have to do anything. Try the attached patch. regards, tom lane [quoted patch to src/backend/executor/nodeHashjoin.c omitted]
Re: [HACKERS] Help: 8.0.3 Vacuum of an empty table never completes ...
On Nov 28, 2005, at 12:00 PM, Tom Lane wrote: Your next move is to look at the state of sshd and whatever is running at the client end of the ssh tunnel. backtrace of the sshd doesn't look good: (gdb) bt #0 0xe410 in ?? () #1 0xbfffdb48 in ?? () #2 0x080a1e28 in ?? () #3 0x080a1e78 in ?? () #4 0xb7d379fd in ___newselect_nocancel () from /lib/tls/libc.so.6 #5 0x08054d64 in ?? () #6 0x000a in ?? () #7 0x080a1e78 in ?? () #8 0x080a1e28 in ?? () #9 0x in ?? () #10 0xbfffdb30 in ?? () #11 0x in ?? () #12 0xbfffdb48 in ?? () #13 0x0806c796 in ?? () #14 0x080a9d3c in ?? () #15 0x0001 in ?? () #16 0xbfffdb64 in ?? () #17 0x08054c3d in ?? () #18 0x0019 in ?? () #19 0x000acda0 in ?? () #20 0x080a9d3c in ?? () #21 0x in ?? () #22 0xbfffdb6c in ?? () #23 0x in ?? () #24 0xbfffdb78 in ?? () ---Type return to continue, or q return to quit--- #25 0x08055632 in ?? () #26 0xbfffdb6c in ?? () #27 0x in ?? () #28 0x080a1e78 in ?? () #29 0x08098ee8 in ?? () #30 0x080a1e78 in ?? () #31 0x080a1e28 in ?? () #32 0x0009 in ?? () #33 0x0004 in ?? () #34 0x0001 in ?? () #35 0x0001 in ?? () #36 0xbfffdbb8 in ?? () #37 0x0805b816 in ?? () #38 0x08098ee8 in ?? () #39 0x080a2e10 in ?? () #40 0x0007 in ?? () #41 0x08098ee8 in ?? () #42 0x08080fd2 in _IO_stdin_used () #43 0x08098ee8 in ?? () #44 0xbfffdbb8 in ?? () #45 0x080574a3 in ?? () #46 0x in ?? () #47 0x08098ee8 in ?? () #48 0x08098ee8 in ?? () #49 0x08098f30 in ?? () ---Type return to continue, or q return to quit--- #50 0x08080fd2 in _IO_stdin_used () #51 0x08098ee8 in ?? () #52 0xbfffeb98 in ?? () #53 0x0804fc90 in ?? () #54 0x08098ee8 in ?? () #55 0x08098f74 in ?? () #56 0x08098f30 in ?? () #57 0xbfffe110 in ?? () #58 0xbfffe110 in ?? () #59 0x0808014a in _IO_stdin_used () #60 0xb7ffad95 in malloc () from /lib/ld-linux.so.2 Previous frame inner to this frame (corrupt stack?) The client-side ssh is worse -- 507 frames before it reports '(corrupt stack?)'. 
At this moment in time, should we kill off the offending processes from Nov 25 -- starting from the client-most side all the way to the vacuumdb process on the production server? The other vacuums would probably then complete happily, and we'd be cool again, eh?

I suppose we're darn lucky the process got ultimately gummed up on a table that sees no traffic at all, eh? The lock that vacuum has taken out on it would prevent at least some things happening to the table in question -- possibly even new inserts or updates?

Could this potentially be alleviated in the future by a little code reordering in vacuumdb or postmaster -- completely finishing work on the current table before emitting output, either postmaster -> vacuumdb client, or possibly vacuumdb client -> whatever stdout is directed to -- so as to get gummed up only in a state where no locks are being held? Or would that uglify the code too much and/or people would find that additional buffering a damnable offense?

James Robinson Socialserve.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Getting different number of results when using hashjoin on/off
Yes. This is from a 8.0.3 (with slightly older and different data, resulting in only 9 rows, but the rest is the same):

 Index Scan using ben_uk3 on foo1 ben  (cost=0.00..73867.23 rows=863 width=27) (actual time=38.591..501.839 rows=9 loops=1)
   Filter: (subplan)
   SubPlan
     ->  Hash Join  (cost=14.25..42.53 rows=1 width=0) (actual time=0.284..0.284 rows=0 loops=1725)
           Hash Cond: ("outer".id = "inner".str_id)
           ->  Index Scan using str_uk4 on structure str  (cost=0.00..27.91 rows=13 width=4) (actual time=0.765..4.043 rows=1 loops=112)
                 Index Cond: (path ~ '*.2330676.*'::lquery)
           ->  Hash  (cost=14.23..14.23 rows=10 width=4) (actual time=0.012..0.012 rows=0 loops=1725)
                 ->  Index Scan using foo2_ben_id_key1 on foo2 bz  (cost=0.00..14.23 rows=10 width=4) (actual time=0.008..0.009 rows=1 loops=1725)
                       Index Cond: (ben_id = $0)
 Total runtime: 501.980 ms

Best regards

P.s. sorry for the stupid quoting, I've to use Outlook

Mario Weilguni [EMAIL PROTECTED] writes: The failing case is: ...

         SubPlan
           ->  Hash Join  (cost=8.47..19.46 rows=1 width=0) (actual time=0.004..0.004 rows=0 loops=21619)
                 Hash Cond: ("outer".id = "inner".str_id)
                 ->  Bitmap Heap Scan on structure str  (cost=2.02..12.92 rows=6 width=4) (actual time=0.100..30.095 rows=1 loops=1)
                       Recheck Cond: (path ~ '142.2330445.2330598.2330676.*'::lquery)
                       ->  Bitmap Index Scan on str_uk4  (cost=0.00..2.02 rows=6 width=0) (actual time=0.090..0.090 rows=1 loops=1)
                             Index Cond: (path ~ '142.2330445.2330598.2330676.*'::lquery)
                 ->  Hash  (cost=6.43..6.43 rows=5 width=4) (actual time=0.032..0.032 rows=0 loops=1)
                       ->  Bitmap Heap Scan on foo2 bz  (cost=2.02..6.43 rows=5 width=4) (actual time=0.025..0.025 rows=0 loops=1)
                             Recheck Cond: (bid = $0)
                             ->  Bitmap Index Scan on foo2_bid_key1  (cost=0.00..2.02 rows=5 width=0) (actual time=0.021..0.021 rows=0 loops=1)
                                   Index Cond: (bid = $0)

Hmm, I wonder why the hash join's input nodes are showing loops=1 ... the hash depends on the subplan parameter $0 so it needs to be re-evaluated each time through. It looks like that's not happening. Do you have the corresponding results from 8.0 --- if so, what do the loop counts look like?

---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Help: 8.0.3 Vacuum of an empty table never completes ...
James Robinson [EMAIL PROTECTED] writes: backtrace of the sshd doesn't look good: Stripped executable :-( ... you won't get much info there. What of the client at the far end of the ssh connection? You should probably assume that the blockage is there, rather than in a commonly used bit of software like ssh. At this moment in time, should we kill off the offending processes from Nov 25 -- starting from client-most side all the way to the vacuumdb process on the production server. The other vacuums would probably then complete happily, and we'd be cool again, eh? If you just want to get out of it, killing the vacuumdb should be the least dangerous way to get out of the problem. I'd suggest taking a little more time to try to find out what's stuck though. I suppose we're darn lucky the process got ultimately gummed up on a table that sees no traffic at all to it, eh? The lock that vacuum has taken out on it would prevent at least some things happening to the table in question -- possibly even new inserts or updates? No, since it's a lazy vacuum it's not going to block either reads or writes. Just other vacuums and DDL changes on the table. Could this potentially be alleviated in the future by a little code reordering in vacuumdb or postmaster by completing working on the current table completely before emitting output, Wouldn't help. I would imagine that by the time we got to this state, the backend side is quite a few tables past the point where the end client stopped accepting output. You'd normally expect there to be several Kb worth of buffering in the network channel, and the backend isn't going to be blocked till that's *all* used up. BTW, I suppose this was a VACUUM VERBOSE? Without the verbose output, you'd likely not have seen any problem ... regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
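[Editor's note] Tom's "several Kb worth of buffering in the network channel" is easy to measure on a plain pipe; the sketch below is POSIX-only Python (socket and ssh channel buffers add more capacity on top of this). It sets the write end non-blocking and counts how many bytes fit before a writer would block:

```python
import fcntl
import os

# How much data fits in the kernel's pipe buffer before a writer blocks?
r, w = os.pipe()
flags = fcntl.fcntl(w, fcntl.F_GETFL)
fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)

capacity = 0
try:
    while True:
        capacity += os.write(w, b"x" * 4096)
except BlockingIOError:
    pass  # the next write would have blocked a blocking writer here

print("pipe capacity:", capacity)  # typically 65536 on Linux
os.close(r)
os.close(w)
```

This is why the backend only wedges after many tables' worth of VERBOSE output: everything up to this capacity (per hop: backend socket, sshd, ssh, the final logfile pipe) is absorbed silently before anyone blocks.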
Re: [HACKERS] Help: 8.0.3 Vacuum of an empty table never completes ...
James Robinson [EMAIL PROTECTED] writes: Given the other culprits in play are bash running a straightforward shellscript line with redirected output to a simple file on a non- full filesystem, I'm leaning more towards the odds that something related to the sshd + tcp/ip + ssh client portion of things went crazy. Yeah, could be. Anyway it doesn't seem like we can learn much more today. You might as well just zing the vacuumdb process and let things get back to normal. If it happens again, we'd have reason to dig deeper. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Anonymous CVS working?
cvsup back up and running ... On Mon, 28 Nov 2005, Alvaro Herrera wrote: Tom Lane wrote: Michael Fuhr [EMAIL PROTECTED] writes: Any problems with CVS or anonymous CVS since the work last evening? The master CVS is fine, but I agree that the anonymous mirror doesn't seem to be tracking it ... Apparently the CVSup server is down too. I wonder if the anon CVS copy is taken from CVSup. -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: [EMAIL PROTECTED] Yahoo!: yscrappy ICQ: 7615664 ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Help: 8.0.3 Vacuum of an empty table never completes ...
On Nov 28, 2005, at 1:46 PM, Tom Lane wrote: James Robinson [EMAIL PROTECTED] writes: backtrace of the sshd doesn't look good: Stripped executable :-( ... you won't get much info there. What of the client at the far end of the ssh connection? You should probably assume that the blockage is there, rather than in a commonly used bit of software like ssh. Ok: cron fired off a bash running our script which performs the backup + vacuuming on the backup box side, and that script was at the point of driving vacuumdb sscadmin 20612 20610 0 Nov25 ?00:00:00 bash /usr/local/bin/ db_backup.sh sscadmin 20622 20612 0 Nov25 ?00:00:00 ssh -c blowfish [ ssh identity file + host edited out ] /usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose [ yes, verbose vacuum. Who knew that'd be the camel-breaking straw ??! ] The lines in the script invoking the ssh'd vacuumdb is: -- # Vacuum all databases, storing log results. $SSHCMD $DBHOST /usr/local/pgsql/bin/vacuumdb -U postgres --all -- analyze --verb ose $DATE/vacuum.log -- Unfortunately the dir holding that date + hour's vacuum.log was swept away by the next day's activities. The stuck bash is backtraceable: (gdb) bt #0 0xe410 in ?? () #1 0xb928 in ?? () #2 0x in ?? () #3 0xb918 in ?? () #4 0xb7ed1513 in __waitpid_nocancel () from /lib/tls/libc.so.6 #5 0x080935bf in default_tty_job_signals () #6 0x080949ca in wait_for () #7 0x0808acd7 in execute_command_internal () #8 0x0808a4f0 in execute_command () #9 0x0808241d in reader_loop () #10 0x08081364 in main () (gdb) Nothing unexpected there. Funny though, file reports /bin/bash as being stripped ( as it does for /usr/bin/ssh and /usr/sbin/sshd ), but I could get far better debugging support from it. Could the debugging issue be ssh / sshd's apparent multithreadedness: gdb /usr/bin/ssh GNU gdb 6.2.1 Copyright 2004 Free Software Foundation, Inc. 
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type show copying to see the conditions. There is absolutely no warranty for GDB. Type show warranty for details. This GDB was configured as i586-suse-linux...(no debugging symbols found)...Using host libthread_db library /lib/tls/libthread_db.so.1. (gdb) run localhost Starting program: /usr/bin/ssh localhost (no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...[Thread debugging using libthread_db enabled] [New Thread 1078408704 (LWP 29932)] (no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...The authenticity of host 'localhost (127.0.0.1)' can't be established. RSA key fingerprint is f4:cd:bc:37:d7:08:bc:4f:04:91:45:9b:44:cf:d5:b9. Are you sure you want to continue connecting (yes/no)? Program received signal SIGINT, Interrupt. [Switching to Thread 1078408704 (LWP 29932)] 0xe410 in ?? () (gdb) bt #0 0xe410 in ?? () #1 0xbfffb9e8 in ?? () #2 0x0001 in ?? () #3 0xbfffb3af in ?? () #4 0x402f94b3 in __read_nocancel () from /lib/tls/libc.so.6 #5 0x080742e1 in mkstemp64 () #6 0x080684c8 in error () #7 0x08054e91 in ?? () #8 0xbfffcac0 in ?? () #9 0x0001 in ?? () #10 0x08096230 in ?? () #11 0xbfffcac0 in ?? 
() I know you have far better things to do than teach someone how to use gdb on multithreaded programs, but could a proper backtrace be salvageable on the ssh client? If you really care, that is, otherwise I'm off to kill that vacuumdb client. At this moment in time, should we kill off the offending processes from Nov 25 -- starting from client-most side all the way to the vacuumdb process on the production server. The other vacuums would probably then complete happily, and we'd be cool again, eh? If you just want to get out of it, killing the vacuumdb should be the least dangerous way to get out of the problem. I'd suggest taking a little more time to try to find out what's stuck though. Given the other culprits in play are bash running a straightforward shellscript line with redirected output to a simple file on a non- full filesystem, I'm leaning more towards the odds that something related to the sshd + tcp/ip + ssh client portion of things went crazy. Just seems that's where more complexity is -- bash is
Hashjoin startup strategy (was Re: [HACKERS] Getting different number of results when using hashjoin on/off)
Mario Weilguni [EMAIL PROTECTED] writes: Thanks for the quick response, I've tried the patch, but it did not work as expected. When I set enable_hashjoin to off, everything works as expected, but with hashjoin on I do not even get results anymore, CPU is going up to 100% and after 3 minutes I cancelled the query (it normally would take ~100-500 milliseconds).

Try letting it run longer. I think your expectation is tuned for the broken implementation (which runs the subqueries only once instead of 26k times...) The test case I developed for this failure in the regression database is

select count(*) from tenk1 a where exists (select 1 from tenk1 b, tenk1 c where b.unique1=c.unique2 and b.hundred in (4,5) and c.hundred=a.hundred+99);

8.0 prefers a nestloop for the subquery, and that plan runs in about 600 ms on my machine. If forced to a hash join, it takes about 2450 ms. 8.1 prefers the hash join to start with, but takes 11300 ms to run it :-( (after the patch that is). The reason for the differential is that 8.1 guesses wrong about which subplan to cycle first: most of the time, the inner plan is empty and so there's no need to pull any rows from the outer plan, but 8.1 pulls the first row from the outer plan anyway, and doing that 10000 times is what's eating the extra runtime. It looks from your previous message that similar things are happening with your data distribution, allowing 8.0 to run faster for you than 8.1 does. Not sure if there's much we can do about this. The presence of the upper-query parameter in the subplan makes it difficult to derive any stats at all, let alone guess how often the subplan will be completely empty, so I'm not sure the planner can help. For a query like this, where the hash join is being done repeatedly, it might be useful for the executor itself to track how often each subplan has been seen to be empty.
In particular, the executor knows that the outer subplan is parameterless and therefore should deliver the same results each time (modulo volatile functions of course), so after the first cycle it could know that there's no point in trying the early fetch on that side. Dunno if this will be of wide enough use to be worth implementing though --- in simple cases the join won't be rescanned and so the executor can't help. Anyone have any other ideas? regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Anonymous CVS working?
On Mon, 28 Nov 2005, Alvaro Herrera wrote: Tom Lane wrote: Michael Fuhr [EMAIL PROTECTED] writes: Any problems with CVS or anonymous CVS since the work last evening? The master CVS is fine, but I agree that the anonymous mirror doesn't seem to be tracking it ... Apparently the CVSup server is down too. I wonder if the anon CVS copy is taken from CVSup. Give me a couple of minutes ... Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: [EMAIL PROTECTED] Yahoo!: yscrappy ICQ: 7615664 ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: Hashjoin startup strategy (was Re: [HACKERS] Getting different number of results when using hashjoin on/off)
If the query runs slowly it will not be such a problem, but I was very concerned about other queries having this problem too - without knowing it. I've already rewritten the query to use IN instead of EXISTS. I'll compile again and try it again. Thanks a lot! Best regards, Mario Weilguni -----Original Message----- From: Tom Lane [mailto:[EMAIL PROTECTED]] Sent: Mon 28.11.2005 19:39 To: Mario Weilguni Cc: pgsql-hackers@postgresql.org Subject: Hashjoin startup strategy (was Re: [HACKERS] Getting different number of results when using hashjoin on/off) Mario Weilguni [EMAIL PROTECTED] writes: Thanks for the quick response, I've tried the patch, but it did not work as expected. When I set enable_hashjoin to off, everything works as expected, but with hashjoin on I do not even get results anymore, CPU is going up to 100% and after 3 minutes I cancelled the query (it normally would take ~100-500 milliseconds). Try letting it run longer. I think your expectation is tuned for the broken implementation (which runs the subqueries only once instead of 26k times...) The test case I developed for this failure in the regression database is select count(*) from tenk1 a where exists (select 1 from tenk1 b, tenk1 c where b.unique1=c.unique2 and b.hundred in (4,5) and c.hundred=a.hundred+99); 8.0 prefers a nestloop for the subquery, and that plan runs in about 600 ms on my machine. If forced to a hash join, it takes about 2450 ms. 8.1 prefers the hash join to start with, but takes 11300 ms to run it :-( (after the patch, that is). The reason for the differential is that 8.1 guesses wrong about which subplan to cycle first: most of the time, the inner plan is empty and so there's no need to pull any rows from the outer plan, but 8.1 pulls the first row from the outer plan anyway, and doing that once per outer row is what's eating the extra runtime.
It looks from your previous message that similar things are happening with your data distribution, allowing 8.0 to run faster for you than 8.1 does. Not sure if there's much we can do about this. The presence of the upper-query parameter in the subplan makes it difficult to derive any stats at all, let alone guess how often the subplan will be completely empty, so I'm not sure the planner can help. For a query like this, where the hash join is being done repeatedly, it might be useful for the executor itself to track how often each subplan has been seen to be empty. In particular, the executor knows that the outer subplan is parameterless and therefore should deliver the same results each time (modulo volatile functions of course), so after the first cycle it could know that there's no point in trying the early fetch on that side. Dunno if this will be of wide enough use to be worth implementing though --- in simple cases the join won't be rescanned and so the executor can't help. Anyone have any other ideas? regards, tom lane
Re: [HACKERS] Help: 8.0.3 Vacuum of an empty table never completes ...
On Nov 28, 2005, at 4:13 PM, Tom Lane wrote: Yeah, could be. Anyway it doesn't seem like we can learn much more today. You might as well just zing the vacuumdb process and let things get back to normal. If it happens again, we'd have reason to dig deeper. Final report [ and apologies to hackers list in general -- sorry for the noise today ]. Killed the vacuumdb frontend. Then went off killing processes spawned by cron on Nov 25th related to the cronjob. All of the related backends exited peacefully, and all is well. Manual vacuum verbose analyze completes successfully. One possibly curious thing -- one final process remains on the backup box dated Nov 25: root 19912 3 0 Nov25 ? 00:00:12 [pdflush] Coincidence? This is some sort of kernel thread, right? Flushes dirty pages to disk? There are two on this machine: root 9211 3 0 Nov22 ? 00:02:56 [pdflush] root 19912 3 0 Nov25 ? 00:00:12 [pdflush] The Nov 25th pdflush's pid is suspiciously close to the pids which would be in use around the beginning of the cron'd process. [ checks /var/log/messages ... ] -- yep -- real close -- last known cross-referenceable pid is: Nov 25 04:59:01 db02 /usr/sbin/cron[20590]: (root) CMD ( rm -f /var/spool/cron/lastrun/cron.hourly) and the vacuumdb sshd connection on the production db box is logged at 05:02:22 AM, so that pdflush would have been started real close to the time at which the remote backup + vacuum script would have been running. Any Linux 2.6 gurus lurking? Under what circumstances do pdflushes get spawned? The filesystem upon which the outputs were going is a software raid partition (raid-0? raid-1? I always confuse the two) -- the interleaved one anyway, not mirrored -- formatted reiser3. Neither pdflush instance on this machine was started anywhere near the boot time of the machine -- both much later. Whereas on the production box the two pdflush instances are both dated from machine boot time. Does this perchance indicate unhappiness afoot, perhaps hardware-wise?
James Robinson Socialserve.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] comment doesn't accept expressions returning text
Added to TODO: * Allow COMMENT ON to accept an expression rather than just a string --- Michael Glaesemann wrote: I've been trying to be better at documentation in general and have been trying to take advantage of PostgreSQL's COMMENT ON functionality to provide a little more information in the DDL itself. I usually write my DDL in a text file and load it into the database using psql. To make it (a little) easier to write comments, I'd like to write the comment text as its own paragraph so I don't have to worry about accidentally deleting the opening and closing quotes. For example: create table foo (foo_id integer primary key); comment on table foo is $$ This is a comment for table foo. $$; Of course, this comment includes a newline at the beginning and end of the comment. test=# select relname, description test-# from pg_description test-# join pg_class on (pg_class.oid = pg_description.objoid) test-# where relname = 'foo'; relname | description ---------+---------------------------------- foo | This is a comment for table foo. (1 row) It would be nice to be able to strip those out using TRIM (or some other function). However, this doesn't appear to work, as COMMENT ON throws a syntax error as soon as it runs into anything that isn't a pure text string. Examples below. Would there be any objection to allowing any text-returning expression in this case? If not, what would be involved in allowing this? I'm interested in contributing the change if it's something that's considered worthwhile. Michael Glaesemann grzm myrealbox com test=# select version(); version -- PostgreSQL 8.1.0 on powerpc-apple-darwin8.3.0, compiled by GCC powerpc-apple-darwin8-gcc-4.0.0 (GCC) 4.0.0 (Apple Computer, Inc.
build 5026) (1 row) test=# create table foo (foo_id integer primary key); NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index foo_pkey for table foo CREATE TABLE test=# comment on table foo is trim('*', '***This is just an example***'); ERROR: syntax error at or near trim at character 25 LINE 1: comment on table foo is trim('*', '***This is just an exampl... ^ test=# comment on table foo is (trim('*', '***This is just an example***')); ERROR: syntax error at or near ( at character 25 LINE 1: comment on table foo is (trim('*', '***This is just an examp... ^ test=# comment on table foo is 'This is just' || ' an example'; ERROR: syntax error at or near || at character 40 LINE 1: comment on table foo is 'This is just' || ' an example'; ^ test=# comment on table foo is ('This is just' || ' an example'); ERROR: syntax error at or near ( at character 25 LINE 1: comment on table foo is ('This is just' || ' an example'); ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] POWER vs. POW ???
Christopher Kings-Lynne wrote: Also, POW() is not documented here: I think POW is just there for backward compatibility and people should use POWER(). --- http://www.postgresql.org/docs/8.1/interactive/functions-math.html Chris Christopher Kings-Lynne wrote: How come these give slightly different results? test=# SELECT POW(2,-2); pow -- 0.25 (1 row) test=# SELECT POWER(2,-2); power --- 0.25 (1 row) (Note width of result field.) Chris ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] TRUNC vs. TRUNCATE
Christopher Kings-Lynne wrote: Hi, I notice we added CEILING() as an alias to CEIL() for compatibility. We also have POWER() for POW(). I notice that MySQL uses TRUNCATE() and we only have TRUNC(). Is TRUNCATE actually spec compliant? Should we add TRUNCATE anyway for consistency and compatibility? I don't see either in the SQL99 standard. No, if we did that we would be carrying around a ton of stuff. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] NVL vs COALESCE
Michael Glaesemann wrote: On Nov 24, 2005, at 21:21, Marcus Engene wrote: When we're having an alias discussion, I'd really like to see NVL in postgres. Not because of porting from oracle as much as just spelling that without the reference manual is completely impossible. NVL: what a very unfortunate spelling. (NULL VaLue? NULL Valued Logic? Named Very Loosely? Someone help me here :) ) AFAICT, COALESCE is SQL standard, while NVL isn't. I think an index entry might be a good idea. Agreed, documentation patch applied to HEAD and 8.1.X. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073

Index: doc/src/sgml/func.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/func.sgml,v
retrieving revision 1.295
diff -c -c -r1.295 func.sgml
*** doc/src/sgml/func.sgml	19 Nov 2005 19:44:54 -0000	1.295
--- doc/src/sgml/func.sgml	28 Nov 2005 23:17:42 -0000
***************
*** 7227,7232 ****
--- 7227,7240 ----
     <primary>COALESCE</primary>
    </indexterm>
  
+   <indexterm>
+    <primary>NVL</primary>
+   </indexterm>
+ 
+   <indexterm>
+    <primary>IFNULL</primary>
+   </indexterm>
+ 
  <synopsis>
  <function>COALESCE</function>(<replaceable>value</replaceable> <optional>, ...</optional>)
  </synopsis>
***************
*** 7234,7242 ****
    <para>
     The <function>COALESCE</function> function returns the first of its
     arguments that is not null.  Null is returned only if all arguments
!    are null.  This is often useful to substitute a
!    default value for null values when data is retrieved for display,
!    for example:
  <programlisting>
  SELECT COALESCE(description, short_description, '(none)') ...
  </programlisting>
--- 7242,7249 ----
    <para>
     The <function>COALESCE</function> function returns the first of its
     arguments that is not null.  Null is returned only if all arguments
!    are null.  It is often used to substitute a default value for
!    null values when data is retrieved for display, for example:
  <programlisting>
  SELECT COALESCE(description, short_description, '(none)') ...
  </programlisting>
***************
*** 7246,7252 ****
     Like a <token>CASE</token> expression, <function>COALESCE</function> will
     not evaluate arguments that are not needed to determine the result;
     that is, arguments to the right of the first non-null argument are
!    not evaluated.
    </para>
   </sect2>
--- 7253,7261 ----
     Like a <token>CASE</token> expression, <function>COALESCE</function> will
     not evaluate arguments that are not needed to determine the result;
     that is, arguments to the right of the first non-null argument are
!    not evaluated.  This SQL-standard function provides capabilities similar
!    to <function>NVL</> and <function>IFNULL</>, which are used in some other
!    database systems.
    </para>
   </sect2>
***************
*** 7262,7277 ****
  </synopsis>
  
    <para>
!    The <function>NULLIF</function> function returns a null value if and only
!    if <replaceable>value1</replaceable> and
!    <replaceable>value2</replaceable> are equal.  Otherwise it returns
!    <replaceable>value1</replaceable>.  This can be used to perform the
!    inverse operation of the <function>COALESCE</function> example
!    given above:
  <programlisting>
  SELECT NULLIF(value, '(none)') ...
  </programlisting>
    </para>
   </sect2>
--- 7271,7289 ----
  </synopsis>
  
    <para>
!    The <function>NULLIF</function> function returns a null value if
!    <replaceable>value1</replaceable> and <replaceable>value2</replaceable>
!    are equal; otherwise it returns <replaceable>value1</replaceable>.
!    This can be used to perform the inverse operation of the
!    <function>COALESCE</function> example given above:
  <programlisting>
  SELECT NULLIF(value, '(none)') ...
  </programlisting>
    </para>
+   <para>
+    If <replaceable>value1</replaceable> is <literal>(none)</>, return a null,
+    otherwise return <replaceable>value1</replaceable>.
+   </para>
   </sect2>

---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: Hashjoin startup strategy (was Re: [HACKERS] Getting different number of results when using hashjoin on/off)
I wrote: For a query like this, where the hash join is being done repeatedly, it might be useful for the executor itself to track how often each subplan has been seen to be empty. I implemented a simple form of this, and it made 8.1 faster than 8.0 on the test case I was using. Give it a try ... regards, tom lane

Index: src/backend/executor/nodeHashjoin.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/executor/nodeHashjoin.c,v
retrieving revision 1.75.2.2
diff -c -r1.75.2.2 nodeHashjoin.c
*** src/backend/executor/nodeHashjoin.c	28 Nov 2005 17:14:47 -0000	1.75.2.2
--- src/backend/executor/nodeHashjoin.c	28 Nov 2005 23:41:28 -0000
***************
*** 120,135 ****
  	 * since we aren't going to be able to skip the join on the strength
  	 * of an empty inner relation anyway.)
  	 *
  	 * The only way to make the check is to try to fetch a tuple from the
  	 * outer plan node.  If we succeed, we have to stash it away for later
  	 * consumption by ExecHashJoinOuterGetTuple.
  	 */
! 	if (outerNode->plan->startup_cost < hashNode->ps.plan->total_cost ||
! 		node->js.jointype == JOIN_LEFT)
  	{
  		node->hj_FirstOuterTupleSlot = ExecProcNode(outerNode);
  		if (TupIsNull(node->hj_FirstOuterTupleSlot))
  			return NULL;
  	}
  	else
  		node->hj_FirstOuterTupleSlot = NULL;
--- 120,147 ----
  	 * since we aren't going to be able to skip the join on the strength
  	 * of an empty inner relation anyway.)
  	 *
+ 	 * If we are rescanning the join, we make use of information gained
+ 	 * on the previous scan: don't bother to try the prefetch if the
+ 	 * previous scan found the outer relation nonempty.  This is not
+ 	 * 100% reliable since with new parameters the outer relation might
+ 	 * yield different results, but it's a good heuristic.
+ 	 *
  	 * The only way to make the check is to try to fetch a tuple from the
  	 * outer plan node.  If we succeed, we have to stash it away for later
  	 * consumption by ExecHashJoinOuterGetTuple.
  	 */
! 	if (node->js.jointype == JOIN_LEFT ||
! 		(outerNode->plan->startup_cost < hashNode->ps.plan->total_cost &&
! 		 !node->hj_OuterNotEmpty))
  	{
  		node->hj_FirstOuterTupleSlot = ExecProcNode(outerNode);
  		if (TupIsNull(node->hj_FirstOuterTupleSlot))
+ 		{
+ 			node->hj_OuterNotEmpty = false;
  			return NULL;
+ 		}
+ 		else
+ 			node->hj_OuterNotEmpty = true;
  	}
  	else
  		node->hj_FirstOuterTupleSlot = NULL;
***************
*** 159,164 ****
--- 171,183 ----
  		 * scanning the outer relation
  		 */
  		hashtable->nbatch_outstart = hashtable->nbatch;
+ 
+ 		/*
+ 		 * Reset OuterNotEmpty for scan.  (It's OK if we fetched a tuple
+ 		 * above, because ExecHashJoinOuterGetTuple will immediately
+ 		 * set it again.)
+ 		 */
+ 		node->hj_OuterNotEmpty = false;
  	}
  
  	/*
***************
*** 454,459 ****
--- 473,479 ----
  	hjstate->js.ps.ps_TupFromTlist = false;
  	hjstate->hj_NeedNewOuter = true;
  	hjstate->hj_MatchedOuter = false;
+ 	hjstate->hj_OuterNotEmpty = false;
  
  	return hjstate;
  }
***************
*** 546,551 ****
--- 566,574 ----
  			*hashvalue = ExecHashGetHashValue(hashtable, econtext,
  											  hjstate->hj_OuterHashKeys);
  
+ 			/* remember outer relation is not empty for possible rescan */
+ 			hjstate->hj_OuterNotEmpty = true;
+ 
  			return slot;
  		}
***************
*** 810,816 ****
  	if (node->hj_HashTable->nbatch == 1 &&
  		((PlanState *) node)->righttree->chgParam == NULL)
  	{
! 		/* okay to reuse the hash table; needn't rescan inner, either */
  	}
  	else
  	{
--- 833,851 ----
  	if (node->hj_HashTable->nbatch == 1 &&
  		((PlanState *) node)->righttree->chgParam == NULL)
  	{
! 		/*
! 		 * okay to reuse the hash table; needn't rescan inner, either.
! 		 *
!
Re: Hashjoin startup strategy (was Re: [HACKERS] Getting different number of results when using hashjoin on/off)
Greg Stark [EMAIL PROTECTED] writes: I suspect this is obvious but since you asked, there isn't any way to keep around the hash table and just reuse it repeatedly instead of having to rescan the data over and over is there? We already do that when possible --- which it's not in the particular case at hand, because there's an outer-query parameter used in the hashed subplan. It occurs to me that the planner ought to favor putting parameterized subplans on the outside of a hash join instead of the inside, so as to make reuse more likely. Not sure how to factor that into the cost model though. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] gprof SELECT COUNT(*) results
On Fri, Nov 25, 2005 at 10:20:11AM -0500, Tom Lane wrote: Qingqing Zhou [EMAIL PROTECTED] writes: I can see your computer is really slow, so my theory is that since it is easy to hold a running-slowly horse than a fast one, so my spinlock on a 2.4G modern machine should takes relatively longer time to get effective. Just kidding. Is that modern machine a Xeon by any chance? We know that Xeons have fairly awful concurrent performance, and the long latency for bus lock instructions may well be the reason why. FWIW, the numbers I showed last night were for an HPPA machine, which I used just because I chanced to have CVS tip already built for profiling on it. I've since reproduced the test on a spiffy new dual Xeon that Red Hat just bought me :-) ... and I get similar numbers to yours. It'd be interesting to see the results from an SMP Opteron, if anyone's got one handy. Is there still interest in this? I've got a dual Opteron running FBSD. (What would be the profiler to use on FBSD?) -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] gprof SELECT COUNT(*) results
Jim C. Nasby [EMAIL PROTECTED] writes: (What would be the profiler to use on FBSD?) gprof should work fine. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Using multi-row technique with COPY
On Sun, Nov 27, 2005 at 07:44:55PM +, Simon Riggs wrote: not have any unique indexes or row triggers. It should be possible to take advantage of this automatically when those requirements are met, without any new options. Just as it was with Seq Scans, this is worth about 10% reduction in CPU for a COPY FROM. snip FSM access would need to change slightly to allow for whole-block-only requests to be made for heaps, without damaging the average row length calculation. It might be simpler to ignore FSM entirely? Does that mean that this fast copy would end up not re-using space on pages that have space available? ISTM that's something users would want to be able to over-ride. In fact, it seems like it shouldn't be a default behavior... -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: Hashjoin startup strategy (was Re: [HACKERS] Getting different number of results when using hashjoin on/off)
Tom Lane [EMAIL PROTECTED] writes: In particular, the executor knows that the outer subplan is parameterless and therefore should deliver the same results each time (modulo volatile functions of course), so after the first cycle it could know that there's no point in trying the early fetch on that side. Anyone have any other ideas? I suspect this is obvious but since you asked, there isn't any way to keep around the hash table and just reuse it repeatedly instead of having to rescan the data over and over is there? -- greg ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Allow an alias for the target table in UPDATE
Atsushi Ogawa wrote: I am interested in a following item in TODO. o Allow an alias to be provided for the target table in UPDATE/DELETE This is not SQL-spec but many DBMSs allow it. I think that this functionality is useful for the migration from other RDBMS. However the SQL92 spec does not allow an alias for the target table in UPDATE. Is it still TODO? If it is TODO, I will try it. Yes, this is still a valid TODO item. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
[HACKERS] Checking a heap page
With the applied patch that checks an entire heap page with one lock, is there any advantage of considering this for index pages? -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Checking a heap page
Bruce Momjian pgman@candle.pha.pa.us writes: With the applied patch that checks an entire heap page with one lock, is there any advantage of considering this for index pages? Done already (see bitmap index scans). regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[HACKERS] ice-broker scan thread
I am considering adding an ice-broker scan thread to accelerate PostgreSQL sequential scan IO speed. The basic idea of this thread is just like the read-ahead method, but the difference is that this one does not read the data into the shared buffer pool directly; instead, it reads the data into the file system cache, which makes the integration easy, and this is unique to PostgreSQL. What happens in the original sequential scan:

for (;;)
{
	/*
	 * a physical read may happen, due to current content of
	 * file system cache and if the kernel is smart enough to
	 * understand you want to do sequential scan
	 */
	physical or logical read a page;
	process the page;
}

What happens in the sequential scan with ice-broker:

for (;;)
{
	/* since the ice-broker has read the page in already */
	logical read a page with big chance;
	process the page;
}

I wrote a program to simulate the sequential scan in PostgreSQL with/without ice-broker. The results indicate this technique has the following characteristics: (1) The important factor in the speedup is how much CPU time PostgreSQL spends on each data page. If PG is fast enough, then no speedup occurs; otherwise a 10% to 20% speedup is expected per my tests. (2) It uses more CPU - this is easy to understand, since it does more work. (3) The benefits also depend on other factors, like how smart your file system ...
Here are test results from my machine:
---
$#uname -a
Linux josh.db 2.4.29-1 #2 Tue Jan 25 17:03:33 EST 2005 i686 unknown
$#cat /proc/meminfo | grep MemTotal
MemTotal: 1030988 kB
$#cat /proc/cpuinfo | grep CPU
model name : Intel(R) Pentium(R) 4 CPU 2.40GHz
$#./seqscan 10 $HOME/pginstall/bin/data/base/10794/18986 50
PostgreSQL sequential scan simulator configuration:
	Memory size: 943718400
	CPU cost per page: 50
	Scan thread read unit size: 4
With scan threads off - duration: 56862.738 ms
With scan threads on - duration: 40611.101 ms
With scan threads off - duration: 46859.207 ms
With scan threads on - duration: 38598.234 ms
With scan threads off - duration: 56919.572 ms
With scan threads on - duration: 47023.606 ms
With scan threads off - duration: 52976.825 ms
With scan threads on - duration: 43056.506 ms
With scan threads off - duration: 54292.979 ms
With scan threads on - duration: 42946.526 ms
With scan threads off - duration: 51893.590 ms
With scan threads on - duration: 42137.684 ms
With scan threads off - duration: 46552.571 ms
With scan threads on - duration: 41892.628 ms
With scan threads off - duration: 45107.800 ms
With scan threads on - duration: 38329.785 ms
With scan threads off - duration: 47527.787 ms
With scan threads on - duration: 38293.581 ms
With scan threads off - duration: 48810.656 ms
With scan threads on - duration: 39018.500 ms
---
Notice that in the above the cpu_cost=50 might look too big (if you look into the code), but in a concurrent situation it is not that huge. Also, on my Windows box (PIII, 800), a cpu_cost=5 is enough to show the 10% benefit. So in general, it does help in some situations, but it's not rocket science since we can't predict the performance of the file system. It's fairly easy to integrate, and we should add a GUC parameter to control it.
We need more tests, any comments and tests are welcome,

Regards,
Qingqing

---
/*
 * seqscan.c
 *		PostgreSQL sequential scan simulator with helper scan thread
 *
 * Note
 *		I wrote this simulator to see if there is any benefits for sequential scan to
 *		do read-ahead by another thread. The only thing you may want to change in the
 *		source file is MEMSZ, make it big enough to thrash your file system cache.
 *
 *		Use the following command to compile:
 *			$gcc -O2 -Wall -pthread -lm seqscan.c -o seqscan
 *		To use it:
 *			$./seqscan rounds datafile cpu_cost
 *		In which rounds is how many times you want to run the test (notice each round
 *		include two disk-burn test), datafile is the path to any file (suggest size
 *		100M), and cpu_cost is the cost that processing each page of the file. Try
 *		different cpu_cost.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <memory.h>
#include <errno.h>
#include <math.h>

#ifdef WIN32
#include <io.h>
#include <windows.h>
#define PG_BINARY	O_BINARY
#else
#include <unistd.h>
#include <pthread.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/file.h>
#define PG_BINARY	0
#endif

typedef char bool;
#define true	((bool) 1)
#define false	((bool) 0)

#define BLCKSZ	8192
#define UNITSZ	4
#define MEMSZ	(950*1024*1024)

char	   *data_file;
int			cpu_cost;
volatile bool stop_scan;
char		thread_buffer[BLCKSZ*UNITSZ];

static void
cleanup_cache(void)
{
	char	   *p;

	if (NULL == (p = (char *) malloc(MEMSZ)))
	{
		fprintf(stderr,
Re: [HACKERS] ice-broker scan thread
Qingqing Zhou wrote: I am considering adding an ice-broker scan thread to accelerate PostgreSQL sequential scan IO speed. The basic idea of this thread is just like the read-ahead method, but the difference is that this one does not read the data into the shared buffer pool directly; instead, it reads the data into the file system cache, which makes the integration easy, and this is unique to PostgreSQL. Interesting, and I wondered about this too. But for my taste the demonstrated benefit really isn't large enough to make it worthwhile. BTW, I heard a long time ago that NTFS has quite fancy read-ahead, where it attempts to detect the application's access pattern, including whether it is reading sequentially and even whether there is a 'stride' to the accesses when they're not contiguous. I would imagine that other filesystems attempt similar tricks. So one might expect a simple linear prefetch to not help much in the presence of such a filesystem. Were you worried about the icebreaker thread getting too far ahead of the scan? If it did, it might page out the data you're about to read, I think. Of course this could be fixed by having the read-ahead thread periodically check the current location being read by the query thread and pausing if it's got too far ahead. Anyway, the recent performance thread has been interesting to me because in all my career I've never seen a database that scanned scads of data from disk to process a query. Typically the problems I work on arrange to read the entire database into memory. I think I need to get out more... ;) ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] ice-broker scan thread
On Mon, 28 Nov 2005, Qingqing Zhou wrote: I am considering adding an ice-broker scan thread to accelerate PostgreSQL sequential scan IO speed. The basic idea of this thread is just like the read-ahead method, but the difference is that this one does not read the data into the shared buffer pool directly; instead, it reads the data into the file system cache, which makes the integration easy, and this is unique to PostgreSQL. MySQL, Oracle and others implement read-ahead threads to simulate async IO 'pre-fetching'. I've been experimenting with two ideas. The first is to increase the readahead when we're doing sequential scans (see prototype patch using posix_fadvise attached). I've not got any hardware at the moment on which I can test this patch, but I am waiting on some dbt-3 results which should indicate whether fadvise is a good idea or a bad one. The second idea is using posix async IO at key points within the system to better parallelise CPU and IO work. The areas where I think we could use async IO are: during sequential scans, use async IO to do pre-fetching of blocks; inside WAL, begin flushing WAL buffers to disk before we commit; and, inside the background writer/checkpoint process, asynchronously write out pages and, potentially, asynchronously build new checkpoint segments. The motivation for using async IO is twofold: first, the results of this paper[1] are compelling; second, modern OSs support async IO. I know that Linux[2], Solaris[3], AIX and Windows all have async IO and I presume that all their rivals have it as well. The fundamental premise of the paper mentioned above is that if the database is busy, IO should be busy. With our current block-at-a-time processing, this isn't always the case. This is why Qingqing's read-ahead thread makes sense. My reason for mailing is, however, that the async IO results are more compelling than the read-ahead thread.
I haven't had time to prototype whether we can easily implement async IO, but I am planning to work on it in December. The two main goals will be to a) integrate and utilise async IO, at least within the executor context, and b) build a primitive kind of scheduler so that we stop prefetching when we know that there are a certain number of outstanding IOs for a given device.

Thanks,

Gavin

[1] http://www.vldb2005.org/program/paper/wed/p1116-hall.pdf
[2] http://lse.sourceforge.net/io/aionotes.txt
[3] http://developers.sun.com/solaris/articles/event_completion.html - I'm fairly sure they have a posix AIO wrapper around these routines, but I cannot see it documented anywhere :-(

Index: src/backend/access/heap/heapam.c
===================================================================
RCS file: /usr/local/cvsroot/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.200
diff -c -p -r1.200 heapam.c
*** src/backend/access/heap/heapam.c	15 Oct 2005 02:49:08 -	1.200
--- src/backend/access/heap/heapam.c	18 Nov 2005 04:10:21 -
***************
*** 36,41 ****
--- 36,44 ----
   *-------------------------------------------------------------------------
   */
+ 
+ #include <fcntl.h>
+ 
  #include "postgres.h"
  
  #include "access/heapam.h"
***************
*** 49,54 ****
--- 52,58 ----
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "storage/procarray.h"
+ #include "storage/smgr.h"
  #include "utils/inval.h"
  #include "utils/relcache.h"
*************** heap_beginscan(Relation relation, Snapsh
*** 659,665 ****
  	pgstat_initstats(&scan->rs_pgstat_info, relation);
  
  	initscan(scan, key);
! 
  	return scan;
  }
--- 663,673 ----
  	pgstat_initstats(&scan->rs_pgstat_info, relation);
  
  	initscan(scan, key);
! 	if (!IsBootstrapProcessingMode())
! 	{
! 		RelationOpenSmgr(relation);
! 		RelationSetSmgrAdvice(relation, POSIX_FADV_SEQUENTIAL);
! 	}
  	return scan;
  }
*************** heap_rescan(HeapScanDesc scan,
*** 693,698 ****
--- 701,710 ----
  void
  heap_endscan(HeapScanDesc scan)
  {
+ 	/* if (!IsBootstrapProcessingMode())
+ 		smgradvise(scan->rs_rd->rd_smgr, POSIX_FADV_NORMAL);
+ 	*/
+ 
  	/* Note: no locking manipulations needed */
  
  	/*
Index: src/backend/access/index/indexam.c
===================================================================
RCS file: /usr/local/cvsroot/pgsql/src/backend/access/index/indexam.c,v
retrieving revision 1.86
diff -c -p -r1.86 indexam.c
*** src/backend/access/index/indexam.c	15 Oct 2005 02:49:09 -	1.86
--- src/backend/access/index/indexam.c	18 Nov 2005 03:13:01 -
***************
*** 61,73 ****
   *-------------------------------------------------------------------------
   */
  
  #include "postgres.h"
  
  #include "access/genam.h"
  #include "access/heapam.h"
! #include "pgstat.h"
  #include "utils/relcache.h"
  
  /*
   *	macros used in index_ routines
--- 61,77
Re: [HACKERS] ice-broker scan thread
Qingqing, I am considering adding an ice-broker scan thread to accelerate PostgreSQL sequential scan IO speed. The basic idea of this thread is just like the read-ahead method, but the difference is that this one does not read the data into the shared buffer pool directly; instead, it reads the data into the file system cache, which makes the integration easy, and this is unique to PostgreSQL.

You probably mean ice-breaker by the way :)

Chris
Re: [HACKERS] ice-broker scan thread
Gavin Sherry wrote: MySQL, Oracle and others implement read-ahead threads to simulate async IO

I always believed that Oracle used async file I/O. Not that I've seen their code, but I'm fairly sure they funded the addition of kernel aio to Linux a few years back. But Oracle comes from a time long ago when threads and decent filesystems didn't exist, so some of the things they do may not be appropriate to add to a product that doesn't have them today.

Now... network async I/O... that'd be really useful in my world...
Re: [HACKERS] ice-broker scan thread
On Mon, 28 Nov 2005, David Boreham wrote: Gavin Sherry wrote: MySQL, Oracle and others implement read-ahead threads to simulate async IO I always believed that Oracle used async file I/O. Not that I've seen their code, but I'm fairly sure they funded the addition of kernel aio to Linux a few years back.

That's right.

But Oracle comes from a time long ago when threads and decent filesystems didn't exist, so some of the things they do may not be appropriate to add to a product that doesn't have them today.

The paper I linked to seemed to suggest that they weren't using async IO in 9.2 -- which is fairly old. I'm not sure why the authors didn't test 10g.

Gavin
Re: [HACKERS] ice-broker scan thread
Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: I haven't had time to prototype whether we can easily implement async IO Just as with any suggestion to depend on threads, you are going to have to show results that border on astounding to have any chance of getting this in. Otherwise the portability issues are just going to make it not worth the trouble.

Do these ideas require threads in principle? ISTM that there could be (additional) process(es) waiting to perform pre-fetching or async io, and we could use the usual IPC machinery to talk between them...

cheers

Mark
Re: [HACKERS] ice-broker scan thread
The paper I linked to seemed to suggest that they weren't using async IO in 9.2 -- which is fairly old. I'm not sure why the authors didn't test 10g.

...reads paper... ok, interesting. Did they say that Oracle isn't using aio? I can't see that. They say that Oracle has no more than one outstanding I/O operation in flight per concurrent query, and they appear to think that's a bad thing. I'm not seeing that myself. Perhaps once I sleep on it, it'll become clear what they're getting at.

One theory for the lack of aio in Oracle as tested in that paper would be that they were testing on Linux. Since aio is relatively new in Linux, I wouldn't be surprised if Oracle didn't actually use it until it's known to be widely deployed in the field and to have proven reliability. Perhaps we've reached that state around now, and so Oracle may not yet have released an aio-capable Linux version of their RDBMS. Just a theory... someone from those tubular towers lurking here could tell us for sure I guess...
Re: [HACKERS] ice-broker scan thread
On Mon, 28 Nov 2005, Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: I haven't had time to prototype whether we can easily implement async IO Just as with any suggestion to depend on threads, you are going to have to show results that border on astounding to have any chance of getting this in. Otherwise the portability issues are just going to make it not worth the trouble.

The architecture I am looking at would not rely on threads. I didn't want to jump on list and wave my hands until I had something to show, but since Qingqing is looking at the issue I thought I better raise it.

Gavin
Re: [HACKERS] ice-broker scan thread
Gavin Sherry wrote: The paper I linked to seemed to suggest that they weren't using async IO in 9.2 -- which is fairly old. I'm not sure why the authors didn't test 10g.

There have been async io type parameters in Oracle's init.ora files from (at least) 8i (disk_async_io=true IIRC) - on Solaris anyway. Whether this enabled real or simulated async io is probably a good question - I recall during testing turning it off and seeing kio()? or similar type calls become write()/read() in truss output.

regards

Mark
Re: [HACKERS] ice-broker scan thread
On Mon, 28 Nov 2005, Mark Kirkwood wrote: Do these ideas require threads in principle? ISTM that there could be (additional) process(es) waiting to perform pre-fetching or async io, and we could use the usual IPC machinery to talk between them...

Right. I use threads because it is easy to write a simulation program :-)

Regards, Qingqing
Re: [HACKERS] ice-broker scan thread
FYI, I've personally used Oracle 9.2.0.4's async IO on Linux and have seen several installations which make use of it also.

On 11/28/05, Gavin Sherry [EMAIL PROTECTED] wrote: On Mon, 28 Nov 2005, Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: I haven't had time to prototype whether we can easily implement async IO Just as with any suggestion to depend on threads, you are going to have to show results that border on astounding to have any chance of getting this in. Otherwise the portability issues are just going to make it not worth the trouble. The architecture I am looking at would not rely on threads. I didn't want to jump on list and wave my hands until I had something to show, but since Qingqing is looking at the issue I thought I better raise it. Gavin
Re: [HACKERS] ice-broker scan thread
On Mon, 28 Nov 2005, Gavin Sherry wrote: MySQL, Oracle and others implement read-ahead threads to simulate async IO 'pre-fetching'.

From my tests on Windows (using the attached program with enable_aio=true set), it seems aio doesn't help as a separate thread - but maybe that's because my usage is wrong ...

Regards, Qingqing
Re: [HACKERS] ice-broker scan thread
On Mon, 28 Nov 2005, Gavin Sherry wrote: I didn't want to jump on list and wave my hands until I had something to show, but since Qingqing is looking at the issue I thought I better raise it.

Don't worry :-) I separated the logic into a standalone program so that more people can help on this issue.

Regards, Qingqing
Re: [HACKERS] ice-broker scan thread
David Boreham [EMAIL PROTECTED] wrote BTW, I heard a long time ago that NTFS has quite fancy read-ahead, where it attempts to detect the application's access pattern including if it is reading sequentially and even if there is a 'stride' to the accesses when they're not contiguous. I would imagine that other filesystems attempt similar tricks. So one might expect a simple linear prefetch to not help much in the presence of such a filesystem.

So we need more tests. I understand how smart current file systems are, and it seems that depends on the interval at which you send the next file block read request (decided by the cpu_cost parameter in my program). I imagine that on a multi-way machine with a strong IO device, the ice-breaker could do much better ...

Were you worried about the icebreaker thread getting too far ahead of the scan? If it did it might page out the data you're about to read, I think. Of course this could be fixed by having the read-ahead thread periodically check the current location being read by the query thread and pausing if it's got too far ahead.

Right.

Regards, Qingqing
Re: [HACKERS] ice-broker scan thread
Qingqing Zhou wrote: On Mon, 28 Nov 2005, Gavin Sherry wrote: MySQL, Oracle and others implement read-ahead threads to simulate async IO 'pre-fetching'. Due to my tests on Windows (using the attached program and change enable_aio=true), seems aio doesn't help as a separate thread - but maybe because my usage is wrong ... I don't think your NT overlapped I/O code is quite right. At least I think it will issue reads at a high rate without waiting for any of them to complete. Beyond some point that has to give the kernel gut-rot. But anyway, I wouldn't expect the use of aio to make any significant difference in an already threaded test program. The point of aio is to allow I/O concurrency _without_ the use of threads or multiple processes. You could re-write your program to have a single thread but use aio. In that case it should show the same read ahead benefit that you see with the thread.
Re: [HACKERS] ice-broker scan thread
On Mon, 28 Nov 2005, Qingqing Zhou wrote: On Mon, 28 Nov 2005, Gavin Sherry wrote: MySQL, Oracle and others implement read-ahead threads to simulate async IO 'pre-fetching'. Due to my tests on Windows (using the attached program and change enable_aio=true), seems aio doesn't help as a separate thread - but maybe because my usage is wrong ...

Right, I would imagine that it's very close. I intend to use kernel based async IO so that we can have the prefetch effect of your sample program without the need for threads.

Thanks, Gavin
Re: [HACKERS] ice-broker scan thread
David Boreham [EMAIL PROTECTED] wrote I don't think your NT overlapped I/O code is quite right. At least I think it will issue reads at a high rate without waiting for any of them to complete. Beyond some point that has to give the kernel gut-rot.

[also with reply to Gavin] look up dictionary for gut-rot, got it ... Uh, this behavior is intended - I try to push enough requests to the kernel in a short time so that it understands that I am doing a sequential scan, and will pull the data from disk to the file system cache more efficiently. Some file systems may have a free-behind mechanism, but our main thread (who really processes the query) should be fast enough before the data vanishes.

You could re-write your program to have a single thread but use aio. In that case it should show the same read-ahead benefit that you see with the thread.

I guess this is also Gavin's point - I understand that these will be two different methodologies to handle read-ahead. If no other thread/process is involved, then the main thread will be responsible for grabbing a free buffer page from the bufferpool and asking the kernel to put the data there by sync IO (as current PostgreSQL does) or async IOs. And that's what I want to avoid. I'd like to use a dedicated thread/process to break the ice only, i.e., pull data from disk to the file system cache, so that the main thread will only issue *logical* reads.

Regards, Qingqing