[HACKERS] Project proposal/comments please - query optimization
I have noticed a deficiency in the current query optimizer related to "full
outer joins". Tom Lane has confirmed to me that a fix will not be 8.1
material, and I am not able to wait for 8.2. I am in the lucky situation that
my project has money to hire consultants, so I would be very interested in
hearing from anyone who feels able to work on this, with cost estimates. The
software developed shall be freely available and will be given back to
PostgreSQL, if the project wants it. I actually think it should be a
requirement that the software is accepted into PostgreSQL, but I do not know
how to phrase that so that it is acceptable to all parties.

The specific problem can be illustrated with two example queries.

Query1:
    SELECT x, y, av, bv
    FROM at a FULL OUTER JOIN bt b USING (x, y)
    WHERE x = 52981 AND y = '2004-1-1 0:0:0';

Query2:
    SELECT x, y, av, bv
    FROM (SELECT x, y, av FROM at WHERE x = 52981 AND y = '2004-1-1 0:0:0') a
    FULL OUTER JOIN
         (SELECT x, y, bv FROM bt WHERE x = 52981 AND y = '2004-1-1 0:0:0') b
    USING (x, y);

Both queries select the same set of data (one record), but Query2 is able to
use the indexes in doing so. Looking at the "explain analyze" output, it is
clear that this is because the current PostgreSQL query optimizer is not able
to push the conditions (x = 52981 AND y = '2004-1-1 0:0:0') down into the
sub-queries, thus forcing a fetch of all data from the tables and only then
filtering out the few wanted records (zero to one row from each table).

The reason I say it is related to "full outer joins" is that if I take Query1
and substitute "full" with "left", the optimizer is capable of pushing the
conditions down into the sub-selects, and is thus able to use the indexes.

Looking forward to any comments. I am aware that there are workarounds (like
Query2, a union of two left joins, hand-coding the join from a series of
simple selects, ...) but I do not feel they are practical for my use.

Regards,
--
Kim Bisgaard
Computer Department               Phone: +45 3915 7562 (direct)
Danish Meteorological Institute   Fax: +45 3915 7460 (division)
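For reference, the union-of-two-left-joins workaround mentioned above could be
sketched as follows (it only matches the full outer join's semantics as long
as duplicate rows are not a concern, since UNION removes duplicates):

    SELECT x, y, av, bv
      FROM at a LEFT OUTER JOIN bt b USING (x, y)
     WHERE x = 52981 AND y = '2004-1-1 0:0:0'
    UNION
    SELECT x, y, av, bv
      FROM bt b LEFT OUTER JOIN at a USING (x, y)
     WHERE x = 52981 AND y = '2004-1-1 0:0:0';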
Re: [HACKERS] Use of inv_getsize in functions
Hi,

Red ears, looking down at my shoes! Well, I coded a getsize myself using
inv_seek. I had disabled a lot of compiler warnings because I am reusing a
lot of spaghetti code for testing etc. This was probably a hint to start
cleaning up the code.

Regards,
Søren

> On Tue, Aug 09, 2005 at 10:54:49PM +0200, Soeren Laursen wrote:
> > I have used other function calls like inv_open with no problem, but when
> > I load this module I get:
> >
> > undefined symbol: inv_getsize
>
> Notice the word "static" in the definition of inv_getsize() in
> src/backend/storage/large_object/inv_api.c:
>
> static uint32
> inv_getsize(LargeObjectDesc *obj_desc)
> {
>     ...
> }
>
> I don't know if there's a good reason for inv_getsize() being static.
> Maybe your code could use inv_seek() instead.
>
> > No errors when compiling.
>
> If you compile with warning flags, then you should at least have gotten a
> warning like "implicit declaration of function `inv_getsize'". That's a
> hint that something's wrong.
>
> --
> Michael Fuhr
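The same seek-to-the-end trick should also be available from plain SQL via
the server-side large object functions; a minimal sketch, assuming a large
object with OID 12345 exists (262144 is INV_READ, 2 is SEEK_END, and the
descriptor returned by lo_open is only valid inside the transaction):

    BEGIN;
    SELECT lo_lseek(lo_open(12345, 262144), 0, 2) AS lo_size;
    COMMIT;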
Re: [HACKERS] [PERFORM] Planner doesn't look at LIMIT?
I have a case that I thought was an example of this issue, and that this
patch would correct. I applied this patch to an 8.0.3 source distribution,
but it didn't seem to solve my problem. In a nutshell, I have a LIMIT query
where the planner seems to favor a merge join over a nested loop. I've
simplified the query as much as possible:

itvtrackdata3=> \d tableA
     Table "public.tableA"
 Column |   Type   | Modifiers
--------+----------+-----------
 foo    | bigint   | not null
 bar    | smallint | not null
 bap    | bigint   | not null
 bip    | bigint   | not null
 bom    | bigint   | not null
Indexes:
    "idx_tableA_bip" btree (bip) WHERE (bip = 900::bigint)
    "idx_tableA_foo" btree (foo)

itvtrackdata3=> \d tableB
     Table "tableB"
 Column |   Type   | Modifiers
--------+----------+-----------
 bim    | bigint   | not null
 bif    | smallint | not null
 baf    | smallint | not null
 bof    | smallint | not null
 buf    | smallint | not null
 foo    | bigint   | not null
Indexes:
    "idx_tableB_bim" btree ("bim", foo)

itvtrackdata3=> set default_statistics_target to 1000;
SET
Time: 0.448 ms
itvtrackdata3=> analyze tableA;
ANALYZE
Time: 4237.151 ms
itvtrackdata3=> analyze tableB;
ANALYZE
Time: 46672.939 ms
itvtrackdata3=> explain analyze SELECT * FROM tableB NATURAL JOIN tableA WHERE bim>=72555896091359 AND bim<72555935412959 AND bim=bap ORDER BY bim ASC LIMIT 1;
                                   QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=149626.57..252987.71 rows=1 width=50) (actual time=5684.013..5684.013 rows=1 loops=1)
   ->  Merge Join  (cost=149626.57..252987.71 rows=1 width=50) (actual time=5684.012..5684.012 rows=1 loops=1)
         Merge Cond: (("outer"."bim" = "inner"."bap") AND ("outer".foo = "inner".foo))
         ->  Index Scan using idx_tableB_bim on tableB  (cost=0.00..97391.22 rows=55672 width=24) (actual time=0.017..0.059 rows=29 loops=1)
               Index Cond: (("bim" >= 72555896091359::bigint) AND ("bim" < 72555935412959::bigint))
         ->  Sort  (cost=149626.57..151523.94 rows=758948 width=34) (actual time=5099.300..5442.825 rows=560856 loops=1)
               Sort Key: tableA."bap", tableA.foo
               ->  Seq Scan on tableA  (cost=0.00..47351.48 rows=758948 width=34) (actual time=0.021..1645.204 rows=758948 loops=1)
 Total runtime: 5706.655 ms
(9 rows)

Time: 5729.984 ms
itvtrackdata3=> set enable_mergejoin to false;
SET
Time: 0.373 ms
itvtrackdata3=> explain analyze SELECT * FROM tableB NATURAL JOIN tableA WHERE bim>=72555896091359 AND bim<72555935412959 AND bim=bap ORDER BY bim ASC LIMIT 1;
                                   QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.00..432619.68 rows=1 width=50) (actual time=11.149..11.150 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..432619.68 rows=1 width=50) (actual time=11.148..11.148 rows=1 loops=1)
         Join Filter: ("outer"."bim" = "inner"."bap")
         ->  Index Scan using idx_tableB_bim on tableB  (cost=0.00..97391.22 rows=55672 width=24) (actual time=0.017..0.062 rows=29 loops=1)
               Index Cond: (("bim" >= 72555896091359::bigint) AND ("bim" < 72555935412959::bigint))
         ->  Index Scan using idx_tableA_foo on tableA  (cost=0.00..6.01 rows=1 width=34) (actual time=0.007..0.379 rows=1 loops=29)
               Index Cond: ("outer".foo = tableA.foo)
 Total runtime: 11.215 ms
(8 rows)

Time: 32.007 ms

Have I just flubbed the patch, or is there something else going on here?
Thanks,
--Ian

On Fri, 2005-07-22 at 12:20, Tom Lane wrote:
> I wrote:
> > Dawid Kuroczko <[EMAIL PROTECTED]> writes:
> >> qnex=# EXPLAIN SELECT * FROM log NATURAL JOIN useragents LIMIT 1;
> >>
> >> Limit  (cost=15912.20..15912.31 rows=1 width=272)
> >>   ->  Hash Join  (cost=15912.20..5328368.96 rows=47044336 width=272)
> >>
> >> If I set enable_hashjoin=false:
> >>
> >> qnex=# EXPLAIN ANALYZE SELECT * FROM log NATURAL LEFT JOIN useragents LIMIT 1;
> >>
> >> Limit  (cost=0.00..3.07 rows=1 width=272) (actual time=74.214..74.216 rows=1 loops=1)
> >>   ->  Nested Loop Left Join  (cost=0.00..144295895.01 rows=47044336 width=272) (actual time=74.204..74.204 rows=1 loops=1)
> >
> > This is quite strange. The nestloop plan definitely should be preferred
> > in the context of the LIMIT, considering that it has far lower estimated
> > cost. And it is preferred in simple tests for me.
>
> After a sui
Re: [HACKERS] [GENERAL] Testing of MVCC
On Wed, 2005-08-10 at 16:41 -0400, Tom Lane wrote:
> Matt Miller <[EMAIL PROTECTED]> writes:
> > It seems to me that contrib/dblink could greatly simplify the design and
> > coding of multi-user regression tests.
>
> I doubt it would be very useful, since a script based on that still
> doesn't let you issue concurrent queries.

I think it would be useful to allow a test script to first create a set of
committed and uncommitted transactions, and to then issue some queries on
another connection to confirm that the other connection has a proper view of
the database at that point. This type of test is serialized, but I think it
would be a useful multi-user test, as sketched below. Also, the output from
such a test is probably pretty easy to fit into the diff-based validation of
"make check."

I realize that we also need to have tests that spawn several connections and
run scripts concurrently across those connections. I agree that this type of
test would probably not benefit fundamentally from contrib/dblink. However,
I was grasping a bit to see how the output from such a concurrent test would
be diff'ed with an expected output in a meaningful way. So, to continue to
progress on this problem, I figured that a contrib/dblink dependency would at
least allow me to start coding something...

> > Is there objection to a portion of src/test/regress depending on
> > contrib/dblink?
>
> Yes.

Understood.
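A minimal sketch of that kind of serialized visibility test, assuming a
dblink build with named persistent connections; the table and connection
string are hypothetical:

    -- open a second backend and leave an uncommitted update pending there
    SELECT dblink_connect('writer', 'dbname=regression');
    SELECT dblink_exec('writer', 'BEGIN; UPDATE accounts SET balance = 0 WHERE aid = 1;');

    -- this connection must still see the old, committed value
    SELECT balance FROM accounts WHERE aid = 1;

    -- commit on the other connection, then re-check visibility here
    SELECT dblink_exec('writer', 'COMMIT;');
    SELECT balance FROM accounts WHERE aid = 1;
    SELECT dblink_disconnect('writer');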
Re: [HACKERS] Project proposal/comments please - query optimization
Kim Bisgaard <[EMAIL PROTECTED]> writes:
> I have noticed a deficiency in the current query optimizer related to
> "full outer joins". Tom Lane has confirmed to me that it will not be 8.1
> material.

The particular case you are complaining of is fixed in CVS tip. There are
related issues involving N-way joins that we're still not very good at.

regression=# create table at (x int, y timestamp, av text);
CREATE TABLE
regression=# create table bt (x int, y timestamp, bv text);
CREATE TABLE
regression=# create index ati on at(x,y);
CREATE INDEX
regression=# create index bti on bt(x,y);
CREATE INDEX
regression=# explain SELECT x, y, av, bv FROM at a FULL OUTER JOIN bt b USING (x, y) WHERE x = 52981 AND y = '2004-1-1 0:0:0';
                                   QUERY PLAN
------------------------------------------------------------------------------
 Merge Full Join  (cost=0.00..9.66 rows=1 width=88)
   ->  Index Scan using ati on "at" a  (cost=0.00..4.83 rows=1 width=44)
         Index Cond: ((x = 52981) AND (y = '2004-01-01 00:00:00'::timestamp without time zone))
   ->  Index Scan using bti on bt b  (cost=0.00..4.83 rows=1 width=44)
         Index Cond: ((x = 52981) AND (y = '2004-01-01 00:00:00'::timestamp without time zone))
(5 rows)

regression=#

This only works for WHERE clauses that equate join alias variables to
pseudoconstants. I have this in my notes:

    Consider this version of Kim Bisgaard's example:

        SELECT FROM a join (b full join c using (id)) using (id)

    If A is small and B,C have indexes on ID then it is interesting to
    consider a plan like

        Nest Loop
            Scan A
            Merge Full Join
                Indexscan B using id = outer.id
                Indexscan C using id = outer.id

    We are fairly far from being able to do this.
    generate_outer_join_implications could easily be modified to generate
    derived equalities (I think it works to allow a deduction against any
    clause not overlapping the outerjoin itself) but the planner would want
    to evaluate them at the wrong level, and the executor doesn't have
    support for passing the outer variable down more than one level of
    join. This is why the existing hack works only for equalities to
    pseudoconstants. We could maybe mark join RestrictInfos as "valid only
    below xxx" and ignore them when processing a join that includes all of
    the indicated rels? Still not clear how you get the planner to
    recognize the above as an inner indexscan situation though.

            regards, tom lane
[HACKERS] Determining return type of polymorphic function
[Please CC any replies so I don't have to follow them via the archives]

Hi,

I'm trying to create a set of types that are going to share the INPUT and
OUTPUT functions (written in C). For output you can determine the type from
the arguments, but for INPUT you can't. The prototype is restricted (by
CREATE TYPE) and you can't specify "anyelement" as the return type because
none of the arguments use it.

My current way around that is to create an alias to the function with
different names for each type, but get_fn_expr_rettype() doesn't appear to
be filled in anyway (fcinfo->flinfo->fn_expr == NULL). What I'm trying to do
now is use fcinfo->flinfo->fn_oid to look up pg_proc and get the return type
from there, but something tells me there must be an easier way.

Or to put it another way, if I define a function like:

    CREATE FUNCTION myfunction(cstring, oid, integer) RETURNS mytype
        AS 'mylib.so' LANGUAGE 'C';

How can I determine I'm supposed to return a "mytype"? I'm running 7.4 if it
matters...

Thanks in advance,
--
Martijn van Oosterhout   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
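The per-type alias approach described above might look something like this
(the type, function, and library names are hypothetical; each SQL-level alias
points at the same C entry point, so fn_oid differs per type and can be used
to tell the types apart):

    CREATE FUNCTION mytype_a_in(cstring, oid, integer) RETURNS mytype_a
        AS 'mylib.so', 'generic_in' LANGUAGE 'C';
    CREATE FUNCTION mytype_b_in(cstring, oid, integer) RETURNS mytype_b
        AS 'mylib.so', 'generic_in' LANGUAGE 'C';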
[HACKERS] SELECT for UPDATE and outer join?
Folks,

-- SQL statement with input values :java.lang.Integer:30239. Please examine
the SQLException for more information. NestedException:
java.sql.SQLException: ERROR: SELECT FOR UPDATE cannot be applied to the
nullable side of an outer join

So, my question is: why can't we apply FOR UPDATE to an outer join? Is this
defined in the SQL Spec?

--
Josh Berkus
Aglio Database Solutions
San Francisco
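A minimal illustration of the restriction, with hypothetical table names; as
far as I can tell, restricting the lock to the non-nullable (left) side with
FOR UPDATE OF is still accepted:

    -- rejected: the right-hand side of a LEFT JOIN can produce all-NULL
    -- rows, so there is no row there to lock
    SELECT * FROM orders o LEFT JOIN order_items i ON i.order_id = o.id
        FOR UPDATE;

    -- accepted (assumption): lock only the non-nullable side
    SELECT * FROM orders o LEFT JOIN order_items i ON i.order_id = o.id
        FOR UPDATE OF o;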
Re: [HACKERS] Determining return type of polymorphic function
Martijn van Oosterhout writes:
> What I'm trying to do now is use fcinfo->flinfo->fn_oid to lookup pg_proc
> and get the return type from there, but something tells me there must be
> an easier way.

No, I think you're stuck. The internal calls for type I/O routines don't set
up fn_expr (since there is no expression tree).

One possibility, depending on your time horizon for this, is to change the
getTypeIOParam rules so that ordinary types get their own OID as second
argument.

            regards, tom lane
Re: [HACKERS] Determining return type of polymorphic function
Martijn van Oosterhout writes:
> I was thinking of actually also storing the oid in the typelem field but
> the docs imply this does something fancy with subscripting.

Yeah, like enable it ;-). You can't do that unless you are some kind of
array type. typelem pointing at yourself would be particularly bad news ---
probably lead to infinite loops ...

> Would a patch to change the rules be accepted, or would it be considered
> an unnecessary backward-incompatible change?

I wouldn't back-patch it, but it seems like something we could still put in
for 8.1.

            regards, tom lane
Re: [HACKERS] Determining return type of polymorphic function
[Please CC replies, thanks]

On Thu, Aug 11, 2005 at 02:17:30PM -0400, Tom Lane wrote:
> Martijn van Oosterhout writes:
> > What I'm trying to do now is use fcinfo->flinfo->fn_oid to lookup
> > pg_proc and get the return type from there, but something tells me
> > there must be an easier way.
>
> No, I think you're stuck. The internal calls for type I/O routines don't
> set up fn_expr (since there is no expression tree).
>
> One possibility, depending on your time horizon for this, is to change
> the getTypeIOParam rules so that ordinary types get their own OID as
> second argument.

Hmm, I was thinking about that. While reading the documentation I was
thinking "surely they'd pass their own oid, giving zero would be silly", so
I was kind of surprised when I did get zero.

I was thinking of actually also storing the oid in the typelem field but the
docs imply this does something fancy with subscripting. I haven't traced the
code paths for that yet. At the very least I think it would confuse anything
looking for arrays. I also thought about typmod (the third argument) but
that seems to almost always be -1.

Would a patch to change the rules be accepted, or would it be considered an
unnecessary backward-incompatible change?

Thanks in advance,
--
Martijn van Oosterhout   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
[HACKERS] Logging explain-analyze output in pg log?
I'm stuck with a web app that periodically has truly awful query response
times. The cause may be an interaction between system load, vacuum-analyze
scheduling, the occasional wild variation in join selectivity, and stats
collection. Logging the queries and running them later doesn't reveal an
obvious pattern, because it's hard to catch the exact circumstances where
the query takes forever, or returns zero rows. It would be difficult to
change the app to run an EXPLAIN ANALYZE and record the results. I do this
when I can, because it's the fastest way to figure out what went wrong in a
10-way join.

I would like to add a GUC variant on debug_print_plan that logs the output
of ExplainOneQuery(), rather than dumping the plan node-traversal output ---
perhaps it's just me, but I have trouble relating that output to the tables
and columns named neatly in EXPLAIN ANALYZE output. This is different from
just running EXPLAIN ANALYZE: this is asking the backend to log the EXPLAIN
ANALYZE output, but ALSO return the normal result set.

Some points I ponder:

(1) Could not find this on the TODO list, nor mentioned in the mail archive.
    Has this truly never come up? Logging detailed query execution to the
    profiler is something one gets accustomed to with MSSQL.

(2) src/backend/commands/explain.c:ExplainQuery gives a cautionary comment
    on scribbling (recording stats) in the execution tree, particularly when
    that tree will be executed repeatedly from a PL/pgSQL proc (not a
    problem in my case, but ...)

(3) any suggestions on what to avoid ...
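Until something like the proposed GUC exists, the closest a stock
installation seems to offer is logging the text and duration of the slow
statements and re-running EXPLAIN ANALYZE by hand afterwards; a sketch of
the relevant settings (the values are illustrative only):

    -- in postgresql.conf or per session:
    SET log_min_duration_statement = 500;   -- log statements slower than 500 ms
    SET log_statement = 'all';              -- or log every statement regardless of duration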
Re: [HACKERS] Race condition in backend process exit
On Sun, Aug 07, 2005 at 03:45:10PM -0400, Tom Lane wrote:
> I'm a bit hesitant to back-patch such a nontrivial and hard-to-test
> change, but it sure looks badly broken to me. Any thoughts about the
> risks involved?

If there were some way to test it reliably, it'd make me feel a lot better.
I guess I'd as soon hear about the risks involved in not back-patching,
because this seems like a pretty dangerous place to be back-patching (and
race condition fixes, in my experience, are often well-loaded foot-guns,
even if I do trust your coding abilities more than approximately anyone
else's).

A

--
Andrew Sullivan | [EMAIL PROTECTED]
In the future this spectacle of the middle classes shocking the avant-garde
will probably become the textbook definition of Postmodernism.
        --Brad Holland
Re: [HACKERS] Simplifying wal_sync_method
On Wed, Aug 10, 2005 at 02:11:48AM -0500, Thomas F. O'Connell wrote:
> I was recently witness to a benchmark of 7.4.5 on Solaris 9 wherein it was
> apparently demonstrated that fsync was the fastest option among the 7.4.x
> wal_sync_method options.
>
> If there's a way to make this information more useful by providing more
> data, please let me know, and I'll see what I can do.

What would be really interesting to me to know is what Sun did between 8 and
9 to make that so. We don't use Solaris for databases any more, but fsync
was a lot slower than whatever we ended up using on Solaris 8. I wouldn't be
surprised if they'd wired fsync directly to something else; but I can hardly
believe it'd be faster than any other option. (Mind, we were using the
Veritas filesystem with this as well, which was at least half the headache.)

A

--
Andrew Sullivan | [EMAIL PROTECTED]
The fact that technology doesn't work is no bar to success in the
marketplace.
        --Philip Greenspun
[HACKERS] data on devel code perf dip
I have an example of runs that illustrate a performance problem that
occurred between installing the 7/18 and 8/1 development release codes.

I'm running on a PPC64 8-way system with 16GB of memory (on a 15GB virtual
machine), with a DBT2 workload configured as a 16-warehouse DBT2 with 16 db
connections, no think time. Using Linux 2.6.11/gentoo.

Overview:

Before picture: run 42
  code updated from cvs July 18, 12:18pm pacific
  metric: 8374.71 NOTPM

After picture (same config params, db size, etc.): run 59
  code downloaded from cvs (all of it) Aug 1, 9:58am pacific
  metric: 4787.61 NOTPM

Oprofile data is available for 59 (top part copied at the end of this
message) but not for 42 (although the binary is still around and I could in
theory rerun and get it).

I'd like to first get back to the throughput of run 42, and then remove
what's keeping me from getting higher than 45% user with the original July
18 drop.

Here are more details:

(RUN 42) The previous performance: 8374.71 new-order transactions per minute
(NOTPM) and %user around 45% on 8 processors. Even this level of user
utilization is not good, as there is still some bandwidth left in the log
device (not much, but some).
http://developer.osdl.org/maryedie/DBT2_PGSQL/42/
This development code we got on July 18, 12:18pm pacific. Tuned it and got
to run 42 on July 26.

(RUN 59) This is an example of a run with code downloaded (whole) on Aug 1,
9:58am pacific: 4787.61 new-order transactions per minute (NOTPM) and ~22%
user over 8 cpus. This has the same config parameters as 42, with autovacuum
off. However, I turned on oprofile for this run (runs without oprofile are
equivalent).
http://developer.osdl.org/maryedie/DBT2_PGSQL/59/

Top things for postgreSQL in the oprofile are copied below.

CPU: ppc64 POWER5, speed 1656.38 MHz (estimated)
Counted CYCLES events (Processor cycles) with a unit mask of 0x00 (No unit
mask) count 10

samples    %        app name          symbol name
164623113  70.5372  kernel-2.6.11.3   .shared_idle
6641569     2.8458  libc-2.3.4.so     (no symbols)
4011874     1.7190  postgres          .AllocSetAlloc
2798355     1.1990  postgres          .CreateLWLocks
1731687     0.7420  postgres          .SearchCatCache
1701936     0.7292  postgres          .MemoryContextAllocZeroAligned
1509903     0.6470  postgres          .FunctionCall2
1320467     0.5658  postgres          .hash_search
1275278     0.5464  oprofiled         (no symbols)
1271477     0.5448  postgres          .yyparse
1099396     0.4711  postgres          .OpernameGetCandidates
794802      0.3406  postgres          .expression_tree_walker
731969      0.3136  postgres          .MemoryContextAlloc
730266      0.3129  postgres          ._bt_compare
709793      0.3041  postgres          .DirectFunctionCall1
689405      0.2954  postgres          .base_yylex
683776      0.2930  postgres          .hash_any
635680      0.2724  postgres          .DLMoveToFront
632098      0.2708  kernel-2.6.11.3   .___dst_free
562111      0.2409  kernel-2.6.11.3   .sys_setrlimit
508608      0.2179  postgres          .AtEOXact_CatCache
499318      0.2139  postgres          .heap_release_fetch
498136      0.2134  kernel-2.6.11.3   .tty_write
494020      0.2117  postgres          .XLogInsert

--
Mary Edie Meredith
Initiative Manager
Open Source Development Labs
[EMAIL PROTECTED]  503-906-1942
Re: [HACKERS] data on devel code perf dip
Mary Edie Meredith <[EMAIL PROTECTED]> writes:
> I have an example of runs that illustrate a performance problem that
> occurred between installing the 7/18 and 8/1 development release codes.

I dug through the CVS logs to see what had changed, and I'm afraid there is
just one plausible-looking candidate:

2005-07-28 23:22  momjian

    * src/backend/access/transam/xlog.c: Use O_DIRECT if available when
    using O_SYNC for wal_sync_method.  Also, write multiple WAL buffers
    out in one write() operation.

    ITAGAKI Takahiro

Most of the CVS activity in that time period had to do with stuff like roles
and the interval datatype. It's conceivable that these things had some
marginal performance cost, but if so I'd have expected it to show up as
extra CPU effort (more time checking permissions, say). This figure:

> samples    %        app name         symbol name
> 164623113  70.5372  kernel-2.6.11.3  .shared_idle

says pretty clearly that your problem is all I/O wait, and there are no
other commits that might have increased our tendency to wait for I/O.

I am sure I will get some pushback if I propose reverting the O_DIRECT
patch, so could you try to get some more-specific evidence? Like pull the
CVS tree from just before and just after this patch and compare performance?

BTW, I did check that both runs are using wal_sync_method = fdatasync and
wal_buffers = 1000, so it's not a problem of those parameters having been
changed by the patch.

            regards, tom lane
Re: [HACKERS] data on devel code perf dip
On Thu, Aug 11, 2005 at 09:02:03PM -0400, Tom Lane wrote:
> I am sure I will get some pushback if I propose reverting the O_DIRECT
> patch, so could you try to get some more-specific evidence? Like pull the
> CVS tree from just before and just after this patch and compare
> performance?

Quoth the open(2) manpage:

    O_DIRECT
        Try to minimize cache effects of the I/O to and from this file. In
        general this will degrade performance, but it is useful in special
        situations, such as when applications do their own caching. File
        I/O is done directly to/from user space buffers. The I/O is
        synchronous, i.e., at the completion of the read(2) or write(2)
        system call, data is guaranteed to have been transferred.

In light of this, may I ask whether it makes sense to compare the
performance of two runs with similar shared_buffers settings? With O_DIRECT,
I understand from this manpage that the OS is going to do little or no page
caching, so shared_buffers should be increased to account for this fact.

Am I missing something?

--
Alvaro Herrera ()
"One must remember that existence in the cosmos, and particularly the
development of civilizations within it, is unfortunately anything but
idyllic" (Ijon Tichy)
Re: [HACKERS] data on devel code perf dip
Alvaro Herrera <[EMAIL PROTECTED]> writes:
> In light of this, may I ask whether it makes sense to compare the
> performance of two runs with similar shared_buffers settings? With
> O_DIRECT, I understand from this manpage that the OS is going to do
> little or no page caching, so shared_buffers should be increased to
> account for this fact.
>
> Am I missing something?

O_DIRECT is only being used for WAL page writes (or I sure hope so anyway),
so shared_buffers should be irrelevant.

            regards, tom lane
Re: [HACKERS] data on devel code perf dip
Tom Lane wrote:
> Most of the CVS activity in that time period had to do with stuff like
> roles and the interval datatype. It's conceivable that these things had
> some marginal performance cost, but if so I'd have expected it to show up
> as extra CPU effort (more time checking permissions, say). This figure:
>
> > samples    %        app name         symbol name
> > 164623113  70.5372  kernel-2.6.11.3  .shared_idle
>
> says pretty clearly that your problem is all I/O wait, and there are no
> other commits that might have increased our tendency to wait for I/O.
>
> I am sure I will get some pushback if I propose reverting the O_DIRECT
> patch, so could you try to get some more-specific evidence? Like pull the
> CVS tree from just before and just after this patch and compare
> performance?
>
> BTW I did check that both runs are using wal_sync_method = fdatasync and
> wal_buffers = 1000, so it's not a problem of those parameters having been
> changed by the patch.

We can supply a patch with just the O_DIRECT change for you to test. The
O_DIRECT patch also had grouped WAL writes, so that might be an issue too.
Also, O_DIRECT is only used for the open_* wal sync methods, so I don't see
how it would affect this, but the grouped WAL writes might.

--
Bruce Momjian | http://candle.pha.pa.us | pgman@candle.pha.pa.us
Re: [HACKERS] data on devel code perf dip
Tom Lane wrote:
> Alvaro Herrera <[EMAIL PROTECTED]> writes:
> > In light of this, may I ask whether it makes sense to compare the
> > performance of two runs with similar shared_buffers settings? With
> > O_DIRECT, I understand from this manpage that the OS is going to do
> > little or no page caching, so shared_buffers should be increased to
> > account for this fact.
> >
> > Am I missing something?
>
> O_DIRECT is only being used for WAL page writes (or I sure hope so
> anyway), so shared_buffers should be irrelevant.

Uh, O_DIRECT really just enables when open_sync is used, and I assume that
is not used for writing dirty buffers during a checkpoint.

--
Bruce Momjian | http://candle.pha.pa.us | pgman@candle.pha.pa.us
Re: [HACKERS] data on devel code perf dip
On 2005-08-12, Alvaro Herrera <[EMAIL PROTECTED]> wrote:
> In light of this, may I ask whether it makes sense to compare the
> performance of two runs with similar shared_buffers settings? With
> O_DIRECT, I understand from this manpage that the OS is going to do
> little or no page caching, so shared_buffers should be increased to
> account for this fact.
>
> Am I missing something?

O_DIRECT should only be being used for the WAL, not for buffer I/O.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
[HACKERS] Why do index access methods use LP_DELETE?
Hi Hackers,

I found that index access methods use LP_DELETE for invisible tuples, but it
seems to be OK to remove LP_USED instead of adding LP_DELETE. I tried the
following simple replacements:

  - '|= LP_DELETE'  => '&= ~LP_USED'
  - 'ItemIdDeleted' => '!ItemIdIsUsed'

and then it passed all regression tests.

Why do index access methods use LP_DELETE? Does this change cause any
trouble? (However, I guess there is no advantage in the change, because
unused items are not recycled until the next vacuum.)

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories
Re: [HACKERS] data on devel code perf dip
Andrew - Supernews wrote:
> On 2005-08-12, Alvaro Herrera <[EMAIL PROTECTED]> wrote:
> > In light of this, may I ask whether it makes sense to compare the
> > performance of two runs with similar shared_buffers settings? With
> > O_DIRECT, I understand from this manpage that the OS is going to do
> > little or no page caching, so shared_buffers should be increased to
> > account for this fact.
> >
> > Am I missing something?
>
> O_DIRECT should only be being used for the WAL, not for buffer I/O.

And only when open_sync or open_datasync are used. But the grouped-write
patch is used in all cases --- again, only for WAL.

--
Bruce Momjian | http://candle.pha.pa.us | pgman@candle.pha.pa.us
Re: [HACKERS] data on devel code perf dip
Bruce Momjian writes:
> > O_DIRECT is only being used for WAL page writes (or I sure hope so
> > anyway), so shared_buffers should be irrelevant.
>
> Uh, O_DIRECT really just enables when open_sync is used, and I assume
> that is not used for writing dirty buffers during a checkpoint.

I double-checked that O_DIRECT is really just used for WAL, and only when
the sync mode is open_sync or open_datasync. So it seems impossible that it
affected a run with mode fdatasync. What seems the best theory at the moment
is that the grouped-WAL-write part of the patch doesn't work so well as we
thought.

            regards, tom lane
Re: [HACKERS] data on devel code perf dip
Tom Lane wrote:
> Bruce Momjian writes:
> > > O_DIRECT is only being used for WAL page writes (or I sure hope so
> > > anyway), so shared_buffers should be irrelevant.
> >
> > Uh, O_DIRECT really just enables when open_sync is used, and I assume
> > that is not used for writing dirty buffers during a checkpoint.
>
> I double-checked that O_DIRECT is really just used for WAL, and only when
> the sync mode is open_sync or open_datasync. So it seems impossible that
> it affected a run with mode fdatasync. What seems the best theory at the
> moment is that the grouped-WAL-write part of the patch doesn't work so
> well as we thought.

Yes, that's my only guess. Let us know if you want the patch to test, rather
than pulling CVS from before and after the patch was applied.

--
Bruce Momjian | http://candle.pha.pa.us | pgman@candle.pha.pa.us
Re: [HACKERS] Why do index access methods use LP_DELETE?
ITAGAKI Takahiro <[EMAIL PROTECTED]> writes:
> Why do index access methods use LP_DELETE?

My recollection is that I deliberately used LP_DELETE for the
known-dead-tuple marker so that there couldn't be any confusion with the use
of LP_USED. AFAIR, LP_USED isn't actually used in indexes, so we could do it
differently if there were another possible use for the flag bit ... have you
got one in mind?

            regards, tom lane
[HACKERS] ereport(ERROR) and files
Do we automatically close files opened with AllocateFile() on
ereport(ERROR)?

--
Bruce Momjian | http://candle.pha.pa.us | pgman@candle.pha.pa.us
Re: [HACKERS] ereport(ERROR) and files
Bruce Momjian wrote:
> Do we automatically close files opened with AllocateFile() on
> ereport(ERROR)?

I found they are closed; the fd.c comment says:

 * fd.c will automatically close all files opened with AllocateFile at
 * transaction commit or abort; this prevents FD leakage if a routine
 * that calls AllocateFile is terminated prematurely by ereport(ERROR).

--
Bruce Momjian | http://candle.pha.pa.us | pgman@candle.pha.pa.us
Re: [HACKERS] obtaining row locking information
> Hi,
>
> With help from Bruce, I wrote a small function which returns row locking
> information (see attached file if you are interested). Here is a sample
> result:
>
> test=# select * from pgrowlocks('t1');
>  locked_row | lock_type | locker | multi
> ------------+-----------+--------+-------
>  (0,1)      | Shared    |      1 | t
>  (0,3)      | Exclusive |    575 | f
> (2 rows)
>
> I think it will be more useful if the actual xids are shown in the case
> where "locker" is a multixid. It seems GetMultiXactIdMembers() does the
> job. Unfortunately that is a static function, however. Is there any
> chance GetMultiXactIdMembers() becomes a public function?

I enhanced pgrowlocks() to use GetMultiXactIdMembers() so that it displays
each xid belonging to a particular multi xid (see attached source code).

test=# select * from pgrowlocks('t1');
 locked_row | lock_type | locker | multi |   xids
------------+-----------+--------+-------+-----------
 (0,1)      | Shared    |      3 | t     | {646,647}
(1 row)

However, even when one of the transactions, for example 647, commits, it
still shows as if 647 were a member of multixid 3.

test=# select * from pgrowlocks('t1');
 locked_row | lock_type | locker | multi |   xids
------------+-----------+--------+-------+-----------
 (0,1)      | Shared    |      3 | t     | {646,647}
(1 row)

Am I missing something?
--
Tatsuo Ishii

/*
 * $PostgreSQL$
 *
 * Copyright (c) 2005  Tatsuo Ishii
 *
 * Permission to use, copy, modify, and distribute this software and
 * its documentation for any purpose, without fee, and without a
 * written agreement is hereby granted, provided that the above
 * copyright notice and this paragraph and the following two
 * paragraphs appear in all copies.
 *
 * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT,
 * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
 * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
 * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED
 * OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS
 * IS" BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE,
 * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
 */

#include "postgres.h"
#include "funcapi.h"
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/transam.h"
#include "catalog/namespace.h"
#include "catalog/pg_type.h"
#include "utils/builtins.h"

PG_FUNCTION_INFO_V1(pgrowlocks);

extern Datum pgrowlocks(PG_FUNCTION_ARGS);

/* ----------
 * pgrowlocks:
 * returns tids of rows being locked
 *
 * C FUNCTION definition
 * pgrowlocks(text) returns set of pgrowlocks_type
 * see pgrowlocks.sql for pgrowlocks_type
 * ----------
 */

#define DUMMY_TUPLE "public.pgrowlocks_type"
#define NCHARS 32

/*
 * define this if makeRangeVarFromNameList() has two arguments.  As far
 * as I know, this only happens in 8.0.x.
 */
#undef MAKERANGEVARFROMNAMELIST_HAS_TWO_ARGS

typedef struct
{
    HeapScanDesc scan;
    int          ncolumns;
} MyData;

Datum
pgrowlocks(PG_FUNCTION_ARGS)
{
    FuncCallContext *funcctx;
    HeapScanDesc scan;
    HeapTuple    tuple;
    TupleDesc    tupdesc;
    AttInMetadata *attinmeta;
    Datum        result;
    MyData      *mydata;
    Relation     rel;

    if (SRF_IS_FIRSTCALL())
    {
        text       *relname;
        RangeVar   *relrv;
        MemoryContext oldcontext;

        funcctx = SRF_FIRSTCALL_INIT();
        oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);

        tupdesc = RelationNameGetTupleDesc(DUMMY_TUPLE);
        attinmeta = TupleDescGetAttInMetadata(tupdesc);
        funcctx->attinmeta = attinmeta;

        relname = PG_GETARG_TEXT_P(0);
#ifdef MAKERANGEVARFROMNAMELIST_HAS_TWO_ARGS
        relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname, "pgrowlocks"));
#else
        relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
#endif
        rel = heap_openrv(relrv, AccessShareLock);
        scan = heap_beginscan(rel, SnapshotNow, 0, NULL);

        mydata = palloc(sizeof(*mydata));
        mydata->scan = scan;
        mydata->ncolumns = tupdesc->natts;
        funcctx->user_fctx = mydata;

        MemoryContextSwitchTo(oldcontext);
    }

    funcctx = SRF_PERCALL_SETUP();
    attinmeta = funcctx->attinmeta;
    mydata = (MyData *) funcctx->user_fctx;
    scan = mydata->scan;

    /* scan the relation */
    while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
    {
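The pgrowlocks.sql file referred to above is not attached; a guess at its
shape, based on the DUMMY_TUPLE definition and the sample output (the column
types and MODULE_PATHNAME are assumptions):

    CREATE TYPE pgrowlocks_type AS (
        locked_row tid,
        lock_type  text,
        locker     xid,
        multi      boolean,
        xids       xid[]
    );

    CREATE OR REPLACE FUNCTION pgrowlocks(text)
        RETURNS SETOF pgrowlocks_type
        AS 'MODULE_PATHNAME', 'pgrowlocks'
        LANGUAGE C STRICT;

    -- usage:
    SELECT * FROM pgrowlocks('t1');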
Re: [HACKERS] ereport(ERROR) and files
Bruce Momjian writes:
> Do we automatically close files opened with AllocateFile() on
> ereport(ERROR)?

Yes ... that's more or less the point of having it ...

            regards, tom lane
Re: [HACKERS] obtaining row locking information
On Fri, Aug 12, 2005 at 12:27:25PM +0900, Tatsuo Ishii wrote:
> However, even when one of the transactions, for example 647, commits, it
> still shows as if 647 were a member of multixid 3.
>
> test=# select * from pgrowlocks('t1');
>  locked_row | lock_type | locker | multi |   xids
> ------------+-----------+--------+-------+-----------
>  (0,1)      | Shared    |      3 | t     | {646,647}
> (1 row)
>
> Am I missing something?

By design, a MultiXactId does not change its membership, that is, no members
are added or deleted. When this has to happen (for example a row is locked
by another backend), a new MultiXactId is generated. The caller is expected
to check whether the member transactions are still running.

--
Alvaro Herrera ()
"He who lives for the future is deluded, and he who lives for the past, a
fool" (Luis Adler, "Los tripulantes de la noche")
Re: [HACKERS] obtaining row locking information
> On Fri, Aug 12, 2005 at 12:27:25PM +0900, Tatsuo Ishii wrote:
> > However, even when one of the transactions, for example 647, commits,
> > it still shows as if 647 were a member of multixid 3.
> >
> > test=# select * from pgrowlocks('t1');
> >  locked_row | lock_type | locker | multi |   xids
> > ------------+-----------+--------+-------+-----------
> >  (0,1)      | Shared    |      3 | t     | {646,647}
> > (1 row)
> >
> > Am I missing something?
>
> By design, a MultiXactId does not change its membership, that is, no
> members are added or deleted. When this has to happen (for example a row
> is locked by another backend), a new MultiXactId is generated. The caller
> is expected to check whether the member transactions are still running.

But it seems that when members are deleted, a new multixid is not generated;
i.e. I see the "locker" column does not change. Is this the expected
behavior?
--
Tatsuo Ishii
Re: [HACKERS] Why do index access methods use LP_DELETE?
Tom Lane <[EMAIL PROTECTED]> wrote:
> > Why do index access methods use LP_DELETE?
>
> My recollection is that I deliberately used LP_DELETE for the
> known-dead-tuple marker so that there couldn't be any confusion with the
> use of LP_USED. AFAIR, LP_USED isn't actually used in indexes, so we could
> do it differently if there were another possible use for the flag bit ...
> have you got one in mind?

Thanks. I understand now that there is little difference between non-LP_USED
and LP_DELETE for indexes.

I'm thinking of using LP_DELETE for relation tuples for incremental vacuum,
which is discussed in
http://archives.postgresql.org/pgsql-hackers/2005-03/msg00518.php
I'll try to mark tuples with LP_DELETE during visibility checking and
recycle the pages via the bgwriter. ...However, it is still just an idea.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories