[HACKERS] Project proposal/comments please - query optimization

2005-08-11 Thread Kim Bisgaard
I have noticed a deficiency in the current query optimizer related to 
"full outer joins". Tom Lane has confirmed to me that it will not be 8.1 
material, and I am not able to wait for 8.2.


I am in the lucky situation that my project has money to hire 
consultants, so I would be very interested in hearing from anyone who 
feels able to work on this, with cost estimates. The software developed 
shall be freely available and will be given back to PostgreSQL, if the 
project wants it. I actually think it should be a requirement that the 
software is accepted into PostgreSQL, but I do not know how to phrase 
that so that it is acceptable to all parties.


The specific problem can be illustrated with two example queries.
Query1:

SELECT x, y, av, bv
FROM at a
FULL OUTER JOIN bt b
USING (x, y)
WHERE x = 52981
 AND y = '2004-1-1 0:0:0';

Query2:

SELECT x, y, av, bv
FROM
 (SELECT x, y, av
   FROM at
   WHERE x = 52981 AND y = '2004-1-1 0:0:0') a
 FULL OUTER JOIN
 (SELECT x, y, bv
   FROM bt
   WHERE x = 52981 AND y = '2004-1-1 0:0:0') b
 USING (x, y);

Both queries select the same set of data (one record), but query2 is 
able to use the indexes in doing so. The "explain analyze" output makes 
it clear that this is because the current PostgreSQL query optimizer is 
not able to push the conditions (x = 52981 AND y = '2004-1-1 0:0:0') 
down into the sub-queries, thus forcing it to fetch all data from the 
tables and only then filter out the few matching records (zero to one 
row from each table).


The reason why I say it is related to "full outer joins" is that if I 
take Query1 and substitute "full" with "left", the optimizer is capable 
of pushing the conditions down into the sub-selects, and is thus able 
to use the indices.
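
That is, this variant of Query1 is answered with index scans; only the 
join type changes:

SELECT x, y, av, bv
FROM at a
LEFT OUTER JOIN bt b
USING (x, y)
WHERE x = 52981
 AND y = '2004-1-1 0:0:0';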


Looking forward to any comments. I am aware that there are workarounds 
(like query2, a union of two left-joins, hand-coding the join from a 
series of simple selects, ...) but I do not find them practical for 
my use.
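
One common spelling of that union workaround emulates the full join by 
hand (a sketch only; the second branch anti-joins so that matched rows 
are not duplicated):

SELECT x, y, av, bv
FROM at a
LEFT JOIN bt b USING (x, y)
WHERE x = 52981 AND y = '2004-1-1 0:0:0'
UNION ALL
SELECT x, y, NULL, bv
FROM bt b
WHERE x = 52981 AND y = '2004-1-1 0:0:0'
 AND NOT EXISTS
   (SELECT 1 FROM at a WHERE a.x = b.x AND a.y = b.y);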


Regards,

--
Kim Bisgaard

Computer Department  Phone: +45 3915 7562 (direct)
Danish Meteorological Institute  Fax: +45 3915 7460 (division)




Re: [HACKERS] Use of inv_getsize in functions

2005-08-11 Thread Soeren Laursen
Hi,

Red ears, looking down at my shoes!

Well, I coded a getsize myself using inv_seek.

I had disabled a lot of compiler warnings because I am reusing a lot of
spaghetti code for testing etc. This was probably a hint to start
cleaning up the code.
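
For the record, the same seek-to-end trick can also be done from plain
SQL with the server-side large object functions. A sketch, where 16385
stands in for a real large object OID and 262144 is INV_READ:

BEGIN;
-- open the large object read-only and seek to its end;
-- the returned position is the size in bytes (whence 2 = SEEK_END)
SELECT lo_lseek(lo_open(16385, 262144), 0, 2);
COMMIT;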

Regards,

Søren

> On Tue, Aug 09, 2005 at 10:54:49PM +0200, Soeren Laursen wrote:
> > I have used other function calls like
> > inv_open with no problem, but when I load this modules I get:
> >
> > undefined symbol: inv_getsize
>
> Notice the word "static" in the definition of inv_getsize() in
> src/backend/storage/large_object/inv_api.c:
>
> static uint32
> inv_getsize(LargeObjectDesc *obj_desc)
> {
> ...
> }
>
> I don't know if there's a good reason for inv_getsize() being static.
> Maybe your code could use inv_seek() instead.
>
> > No errors when compiling.
>
> If you compile with warning flags, then you should at least have
> gotten a warning like "implicit declaration of function `inv_getsize'".
> That's a hint that something's wrong.
>
> --
> Michael Fuhr
>





Re: [HACKERS] [PERFORM] Planner doesn't look at LIMIT?

2005-08-11 Thread Ian Westmacott
I have a case that I thought was an example of this issue,
and that this patch would correct.  I applied this patch
to an 8.0.3 source distribution, but it didn't seem to
solve my problem.

In a nutshell, I have a LIMIT query where the planner
seems to favor a merge join over a nested loop.  I've
simplified the query as much as possible:


itvtrackdata3=> \d tableA
Table "public.tableA"
 Column |   Type   | Modifiers 
--------+----------+-----------
 foo| bigint   | not null
 bar| smallint | not null
 bap| bigint   | not null
 bip| bigint   | not null
 bom| bigint   | not null
Indexes:
    "idx_tableA_bip" btree (bip) WHERE (bip = 900::bigint)
    "idx_tableA_foo" btree (foo)

itvtrackdata3=> \d tableB
Table "tableB"
 Column  |   Type   | Modifiers 
---------+----------+-----------
 bim | bigint   | not null
 bif | smallint | not null
 baf | smallint | not null
 bof | smallint | not null
 buf | smallint | not null
 foo | bigint   | not null
Indexes:
    "idx_tableB_bim" btree ("bim", foo)

itvtrackdata3=> set default_statistics_target to 1000;
SET
Time: 0.448 ms
itvtrackdata3=> analyze tableA;
ANALYZE
Time: 4237.151 ms
itvtrackdata3=> analyze tableB;
ANALYZE
Time: 46672.939 ms
itvtrackdata3=> explain analyze SELECT * FROM tableB NATURAL JOIN tableA
WHERE bim>=72555896091359 AND bim<72555935412959 AND bim=bap ORDER BY
bim ASC LIMIT 1;

                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=149626.57..252987.71 rows=1 width=50) (actual time=5684.013..5684.013 rows=1 loops=1)
   ->  Merge Join  (cost=149626.57..252987.71 rows=1 width=50) (actual time=5684.012..5684.012 rows=1 loops=1)
         Merge Cond: (("outer"."bim" = "inner"."bap") AND ("outer".foo = "inner".foo))
         ->  Index Scan using idx_tableB_bim on tableB  (cost=0.00..97391.22 rows=55672 width=24) (actual time=0.017..0.059 rows=29 loops=1)
               Index Cond: (("bim" >= 72555896091359::bigint) AND ("bim" < 72555935412959::bigint))
         ->  Sort  (cost=149626.57..151523.94 rows=758948 width=34) (actual time=5099.300..5442.825 rows=560856 loops=1)
               Sort Key: tableA."bap", tableA.foo
               ->  Seq Scan on tableA  (cost=0.00..47351.48 rows=758948 width=34) (actual time=0.021..1645.204 rows=758948 loops=1)
 Total runtime: 5706.655 ms
(9 rows)

Time: 5729.984 ms
itvtrackdata3=> set enable_mergejoin to false;
SET
Time: 0.373 ms
itvtrackdata3=> explain analyze SELECT * FROM tableB NATURAL JOIN tableA
WHERE bim>=72555896091359 AND bim<72555935412959 AND bim=bap ORDER BY
bim ASC LIMIT 1;

                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..432619.68 rows=1 width=50) (actual time=11.149..11.150 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..432619.68 rows=1 width=50) (actual time=11.148..11.148 rows=1 loops=1)
         Join Filter: ("outer"."bim" = "inner"."bap")
         ->  Index Scan using idx_tableB_bim on tableB  (cost=0.00..97391.22 rows=55672 width=24) (actual time=0.017..0.062 rows=29 loops=1)
               Index Cond: (("bim" >= 72555896091359::bigint) AND ("bim" < 72555935412959::bigint))
         ->  Index Scan using idx_tableA_foo on tableA  (cost=0.00..6.01 rows=1 width=34) (actual time=0.007..0.379 rows=1 loops=29)
               Index Cond: ("outer".foo = tableA.foo)
 Total runtime: 11.215 ms
(8 rows)

Time: 32.007 ms


Have I just flubbed the patch, or is there something else
going on here?

Thanks,

--Ian


On Fri, 2005-07-22 at 12:20, Tom Lane wrote:
> I wrote:
> > Dawid Kuroczko <[EMAIL PROTECTED]> writes:
> >> qnex=# EXPLAIN SELECT * FROM log NATURAL JOIN useragents LIMIT 1;
> 
> >> Limit  (cost=15912.20..15912.31 rows=1 width=272)
> >>   ->  Hash Join  (cost=15912.20..5328368.96 rows=47044336 width=272)
> 
> >> If I set enable_hashjoin=false:
> 
> >> qnex=# EXPLAIN ANALYZE SELECT * FROM log NATURAL LEFT JOIN useragents 
> >> LIMIT 1;
> 
> >> Limit  (cost=0.00..3.07 rows=1 width=272) (actual time=74.214..74.216 rows=1 loops=1)
> >>   ->  Nested Loop Left Join  (cost=0.00..144295895.01 rows=47044336 width=272) (actual time=74.204..74.204 rows=1 loops=1)
> 
> > This is quite strange.  The nestloop plan definitely should be preferred
> > in the context of the LIMIT, considering that it has far lower estimated
> > cost.  And it is preferred in simple tests for me.
> 
> After a sui

Re: [HACKERS] [GENERAL] Testing of MVCC

2005-08-11 Thread Matt Miller
On Wed, 2005-08-10 at 16:41 -0400, Tom Lane wrote:
> Matt Miller <[EMAIL PROTECTED]> writes:
> > It seems to me that contrib/dblink could greatly simplify the design and
> > coding of multi-user regression tests.
> 
> I doubt it would be very useful, since
> a script based on that still doesn't let you issue concurrent queries.

I think it would be useful to allow a test script to first create a set
of committed and uncommitted transactions, and to then issue some
queries on another connection to confirm that the other connection has a
proper view of the database at that point.  This type of test is
serialized, but I think it would be a useful multi-user test.  Also, the
output from such a test is probably pretty easy to fit into the
diff-based validation of "make check."
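
For example, a serialized visibility test along those lines might look
like this (a sketch; the connection string and table name are invented,
and it uses dblink's unnamed-connection calls):

CREATE TABLE mvcc_t (id int);

-- open a second backend and leave a transaction uncommitted there
SELECT dblink_connect('dbname=regression');
SELECT dblink_exec('BEGIN');
SELECT dblink_exec('INSERT INTO mvcc_t VALUES (1)');

-- this connection must not see the uncommitted row
SELECT count(*) FROM mvcc_t;  -- expect 0

-- commit on the other connection and re-check
SELECT dblink_exec('COMMIT');
SELECT count(*) FROM mvcc_t;  -- expect 1

SELECT dblink_disconnect();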

I realize that we also need to have tests that spawn several connections
and run scripts concurrently across those connections.  I agree that
this type of test would probably not benefit fundamentally from
contrib/dblink.  However, I was grasping a bit to see how the output
from such a concurrent test would be diff'ed with an expected output in
a meaningful way.  So, to continue to progress on this problem, I
figured that a contrib/dblink dependency would at least allow me to
start coding something...

> > Is there objection to a portion
> > of src/test/regress depending on contrib/dblink?
>
> Yes.

Understood.



Re: [HACKERS] Project proposal/comments please - query optimization

2005-08-11 Thread Tom Lane
Kim Bisgaard <[EMAIL PROTECTED]> writes:
> I have noticed a deficiency in the current query optimizer related to 
> "full outer joins". Tom Lane has confirmed to me that it will not be 8.1 
> material.

The particular case you are complaining of is fixed in CVS tip.  There
are related issues involving N-way joins that we're still not very
good at.

regression=# create table at (x int, y timestamp, av text);
CREATE TABLE
regression=# create table bt (x int, y timestamp, bv text);
CREATE TABLE
regression=# create index ati on at(x,y);
CREATE INDEX
regression=# create index bti on bt(x,y);
CREATE INDEX
regression=# explain SELECT x, y, av, bv FROM at a FULL OUTER JOIN bt b USING 
(x, y) WHERE x = 52981 AND y = '2004-1-1 0:0:0';
                                         QUERY PLAN
-----------------------------------------------------------------------------------------------
 Merge Full Join  (cost=0.00..9.66 rows=1 width=88)
   ->  Index Scan using ati on "at" a  (cost=0.00..4.83 rows=1 width=44)
         Index Cond: ((x = 52981) AND (y = '2004-01-01 00:00:00'::timestamp without time zone))
   ->  Index Scan using bti on bt b  (cost=0.00..4.83 rows=1 width=44)
         Index Cond: ((x = 52981) AND (y = '2004-01-01 00:00:00'::timestamp without time zone))
(5 rows)

regression=#

This only works for WHERE clauses that equate join alias variables to
pseudoconstants.  I have this in my notes:

Consider this version of Kim Bisgaard's example:
SELECT FROM a join (b full join c using (id)) using (id)
If A is small and B,C have indexes on ID then it is interesting to consider
a plan like
Nest Loop
    Scan A
    Merge Full Join
        Indexscan B using id = outer.id
        Indexscan C using id = outer.id
We are fairly far from being able to do this. generate_outer_join_implications
could easily be modified to generate derived equalities (I think it works to
allow a deduction against any clause not overlapping the outerjoin itself)
but the planner would want to evaluate them at the wrong level, and the
executor doesn't have support for passing the outer variable down more than
one level of join.  This is why the existing hack works only for equalities
to pseudoconstants.  We could maybe mark join RestrictInfos as "valid only
below xxx" and ignore them when processing a join that includes all of the
indicated rels?  Still not clear how you get the planner to recognize the
above as an inner indexscan situation though.

regards, tom lane



[HACKERS] Determining return type of polymorphic function

2005-08-11 Thread Martijn van Oosterhout
[Please CC any replies so I don't have to follow them via the archives]

Hi,

I'm trying to create a set of types that are going to share the INPUT
and OUTPUT functions (written in C). For output you can determine the
type from the arguments, but for INPUT you can't. The prototype is
restricted (by CREATE TYPE) and you can't specify "anyelement" as the
return type because none of the arguments use it.

My current way around that is to create an alias to the function with
different names for each type, but get_fn_expr_rettype() doesn't appear
to be filled in anyway (fcinfo->flinfo->fn_expr == NULL).
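
Concretely, the aliases are just separate pg_proc entries pointing at
one C symbol, roughly like this (names invented):

CREATE FUNCTION mytype_a_in(cstring, oid, integer)
RETURNS mytype_a AS 'mylib.so', 'generic_input' LANGUAGE 'C';
CREATE FUNCTION mytype_b_in(cstring, oid, integer)
RETURNS mytype_b AS 'mylib.so', 'generic_input' LANGUAGE 'C';

so inside generic_input only fcinfo->flinfo->fn_oid distinguishes which
type is wanted.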

What I'm trying to do now is use fcinfo->flinfo->fn_oid to lookup
pg_proc and get the return type from there, but something tells me
there must be an easier way.

Or to put it another way, if I define a function like:

CREATE FUNCTION myfunction(cstring, oid, integer) 
RETURNS mytype AS 'mylib.so' LANGUAGE 'C';

How can I determine I'm supposed to return a "mytype"? I'm running 7.4
if it matters...

Thanks in advance,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.




[HACKERS] SELECT for UPDATE and outer join?

2005-08-11 Thread Josh Berkus
Folks,

--
SQL statement with input values :java.lang.Integer:30239.
Please examine the SQLException for more information.
NestedException: java.sql.SQLException: ERROR: SELECT FOR UPDATE cannot
be applied to the nullable side of an outer join
--

So, my question is: why can't we apply FOR UPDATE to an outer join?   Is this 
defined in the SQL Spec?
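
For reference, a minimal reproduction looks something like this (table
and column names invented):

SELECT a.id, b.val
FROM a LEFT JOIN b ON (a.id = b.a_id)
WHERE a.id = 30239
FOR UPDATE;
-- ERROR: SELECT FOR UPDATE cannot be applied to the
-- nullable side of an outer join

Restricting the lock to the non-nullable side with FOR UPDATE OF a
should be accepted, if the app can be changed to say that.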

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [HACKERS] Determining return type of polymorphic function

2005-08-11 Thread Tom Lane
Martijn van Oosterhout  writes:
> What I'm trying to do now is use fcinfo->flinfo->fn_oid to lookup
> pg_proc and get the return type from there, but something tells me
> there must be an easier way.

No, I think you're stuck.  The internal calls for type I/O routines
don't set up fn_expr (since there is no expression tree).

One possibility, depending on your time horizon for this, is to change
the getTypeIOParam rules so that ordinary types get their own OID as
second argument.

regards, tom lane



Re: [HACKERS] Determining return type of polymorphic function

2005-08-11 Thread Tom Lane
Martijn van Oosterhout  writes:
> I was thinking of actually also storing the oid in the typelem field
> but the docs imply this does something fancy with subscripting.

Yeah, like enable it ;-).  You can't do that unless you are some kind
of array type.  typelem pointing at yourself would be particularly
bad news --- probably lead to infinite loops ...

> Would a patch to change the rules be accepted, or would it be
> considered an unnecessary backward-incompatible change?

I wouldn't back-patch it, but it seems like something we could still put
in for 8.1.

regards, tom lane



Re: [HACKERS] Determining return type of polymorphic function

2005-08-11 Thread Martijn van Oosterhout
[Please CC replies, thanks]

On Thu, Aug 11, 2005 at 02:17:30PM -0400, Tom Lane wrote:
> Martijn van Oosterhout  writes:
> > What I'm trying to do now is use fcinfo->flinfo->fn_oid to lookup
> > pg_proc and get the return type from there, but something tells me
> > there must be an easier way.
> 
> No, I think you're stuck.  The internal calls for type I/O routines
> don't set up fn_expr (since there is no expression tree).
> 
> One possibility, depending on your time horizon for this, is to change
> the getTypeIOParam rules so that ordinary types get their own OID as
> second argument.

Hmm, I was thinking about that. While reading the documentation I was
thinking "surely they'd pass their own oid, giving zero would be silly"
so I was kind of surprised when I did get zero.

I was thinking of actually also storing the oid in the typelem field
but the docs imply this does something fancy with subscripting. I
haven't traced the code paths for that yet. At the very least I think it
would confuse anything looking for arrays. I also thought about typmod
(the third argument) but that seems to almost always be -1.

Would a patch to change the rules be accepted, or would it be
considered an unnecessary backward-incompatible change?

Thanks in advance,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.




[HACKERS] Logging explain-analyze output in pg log?

2005-08-11 Thread Mischa Sandberg
I'm stuck with a web app that periodically has truly awful query
response times. The cause may be an interaction between system load,
vacuum-analyze scheduling, the occasional wild variation in join
selectivity, and stats collection. Logging the queries and running them
later doesn't create an obvious pattern, because it's hard to catch the
exact circumstances where the query takes forever, or returns zero rows.

It would be difficult to change the app to run an EXPLAIN ANALYZE and
record the results. I do this when I can, because it's the fastest way
to figure out what went wrong in a 10-way join. 
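
When I can do it by hand, it is a plpgsql wrapper along these lines (a
sketch; the names are invented, plpgsql must be installed, and it
assumes FOR ... IN EXECUTE will iterate over EXPLAIN output):

CREATE TABLE plan_log (logged_at timestamptz DEFAULT now(), line text);

CREATE OR REPLACE FUNCTION log_explain(q text) RETURNS void AS '
DECLARE
    r record;
BEGIN
    -- run the query under EXPLAIN ANALYZE and keep each plan line
    FOR r IN EXECUTE ''EXPLAIN ANALYZE '' || q LOOP
        INSERT INTO plan_log (line) VALUES (r."QUERY PLAN");
    END LOOP;
    RETURN;
END;
' LANGUAGE plpgsql;

Of course that runs the query without returning its normal result set,
which is exactly the limitation I want the new setting to remove.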

I would like to add a guc variant on debug_print_plan that logs the
output of ExplainOneQuery(), rather than dumping the plan node-traversal
output --- perhaps it's just me, but I have trouble relating that output
to the tables and columns named neatly in EXPLAIN ANALYZE output.

This is different from just running EXPLAIN ANALYZE: this is asking the
backend to log the EXPLAIN ANALYZE output, but ALSO return the normal
result set. 

Some points I ponder:

(1) Could not find this on the TODO list, nor mentioned in the mail
archive. Has this truly never come up? Logging detailed query execution
to the profiler is something one gets accustomed to with MSSQL.

(2) src/backend/commands/explain.c:ExplainQuery gives a cautionary
comment on scribbling (recording stats) in the execution tree,
particularly when that tree will be executed repeatedly from a PLPGSQL
proc (not a problem in my case, but ...)

(3) any suggestions on what to avoid ...





Re: [HACKERS] Race condition in backend process exit

2005-08-11 Thread Andrew Sullivan
On Sun, Aug 07, 2005 at 03:45:10PM -0400, Tom Lane wrote:
> 
> I'm a bit hesitant to back-patch such a nontrivial and hard-to-test
> change, but it sure looks badly broken to me.  Any thoughts about the
> risks involved?

If there were some way to test it reliably, it'd make me feel a lot
better.  I guess I'd as soon hear about the risks involved in not
back-patching, because this seems like a pretty dangerous place to be
back-patching (and race condition fixes, in my experience, are often
well-loaded foot-guns, even if I do trust your coding abilities more
than approximately anyone else's).

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]
In the future this spectacle of the middle classes shocking the avant-
garde will probably become the textbook definition of Postmodernism. 
--Brad Holland



Re: [HACKERS] Simplifying wal_sync_method

2005-08-11 Thread Andrew Sullivan
On Wed, Aug 10, 2005 at 02:11:48AM -0500, Thomas F. O'Connell wrote:
> I was recently witness to a benchmark of 7.4.5 on Solaris 9 wherein  
> it was apparently demonstrated that fsync was the fastest option  
> among the 7.4.x wal_sync_method options.
> 
> If there's a way to make this information more useful by providing  
> more data, please let me know, and I'll see what I can do.

What would be really interesting to me to know is what Sun did
between 8 and 9 to make that so.  We don't use Solaris for databases
any more, but fsync was a lot slower than whatever we ended up using
on 8.  I wouldn't be surprised if they'd wired fsync directly to
something else; but I can hardly believe it'd be faster than any
other option.  (Mind, we were using the Veritas filesystem with this, as
well, which was at least half the headache.)

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]
The fact that technology doesn't work is no bar to success in the marketplace.
--Philip Greenspun



[HACKERS] data on devel code perf dip

2005-08-11 Thread Mary Edie Meredith

I have an example of runs that illustrates a performance 
problem that occurred between installing the 7/18 and 8/1 
development code drops.  I'm running on a PPC64 8-way system 
with 16GB of memory (on a 15GB virtual machine), with a DBT2 
workload configured for 16 warehouses and 16 db connections, 
no think time.  Using Linux 2.6.11/Gentoo.


Overview:

Before picture: 
run 42 code updated from cvs July 18 12:18pm pacific
metric: 8374.71 NOTPM

After picture (same config params, db size, etc):
run 59 code downloaded from cvs (all of it) Aug1 9:58AM pacific
metric: 4787.61 NOTPM

Oprofile data is included for run 59 (the top of the profile is copied
at the end of this message) but not for run 42 (although the binary is
still around and I could in theory rerun and get it).

I'd like to first get back to the throughput of run 42, and then
remove what's keeping me from getting higher than 45% user with 
the original July 18 drop.


Here are more details:

(RUN 42) The previous performance: 8374.71 new-order transactions per
minute (NOTPM) and %user around 45% on 8 processors.  Even this level of
user utilization is not good, as there is still some bandwidth left in
the log device (not much, but some).

http://developer.osdl.org/maryedie/DBT2_PGSQL/42/


We got this development code from cvs on July 18, 12:18pm Pacific,
tuned it, and got to run 42 on July 26.

(RUN 59)  This is an example of a run with code downloaded (whole) on
Aug1 9:58am Pacific.  4787.61 new-order transactions per minute (NOTPM)
and ~22% user over 8 cpus.  This has the same config parameters as 42,
with autovacuum off.  However, I turned on oprofile for this run. (runs
without oprofile are equivalent).

http://developer.osdl.org/maryedie/DBT2_PGSQL/59/


The top entries for PostgreSQL in the oprofile output are copied below.

CPU: ppc64 POWER5, speed 1656.38 MHz (estimated)
Counted CYCLES events (Processor cycles) with a unit mask of 0x00 (No unit mask) count 10
samples    %        app name         symbol name
164623113  70.5372  kernel-2.6.11.3  .shared_idle
6641569     2.8458  libc-2.3.4.so    (no symbols)
4011874     1.7190  postgres         .AllocSetAlloc
2798355     1.1990  postgres         .CreateLWLocks
1731687     0.7420  postgres         .SearchCatCache
1701936     0.7292  postgres         .MemoryContextAllocZeroAligned
1509903     0.6470  postgres         .FunctionCall2
1320467     0.5658  postgres         .hash_search
1275278     0.5464  oprofiled        (no symbols)
1271477     0.5448  postgres         .yyparse
1099396     0.4711  postgres         .OpernameGetCandidates
794802      0.3406  postgres         .expression_tree_walker
731969      0.3136  postgres         .MemoryContextAlloc
730266      0.3129  postgres         ._bt_compare
709793      0.3041  postgres         .DirectFunctionCall1
689405      0.2954  postgres         .base_yylex
683776      0.2930  postgres         .hash_any
635680      0.2724  postgres         .DLMoveToFront
632098      0.2708  kernel-2.6.11.3  .___dst_free
562111      0.2409  kernel-2.6.11.3  .sys_setrlimit
508608      0.2179  postgres         .AtEOXact_CatCache
499318      0.2139  postgres         .heap_release_fetch
498136      0.2134  kernel-2.6.11.3  .tty_write
494020      0.2117  postgres         .XLogInsert


-- 
Mary Edie Meredith
Initiative Manager
Open Source Development Labs
[EMAIL PROTECTED]
503-906-1942




Re: [HACKERS] data on devel code perf dip

2005-08-11 Thread Tom Lane
Mary Edie Meredith <[EMAIL PROTECTED]> writes:
> I have an example of runs that illustrate a performance 
> problem that occurred between installing the 7/18 and 8/1 
> development release codes.

I dug through the CVS logs to see what had changed, and I'm afraid there
is just one plausible-looking candidate:

2005-07-28 23:22  momjian

* src/backend/access/transam/xlog.c: 
Use O_DIRECT if available when using O_SYNC for wal_sync_method.

Also, write multiple WAL buffers out in one write() operation.

ITAGAKI Takahiro

Most of the CVS activity in that time period had to do with stuff like
roles and the interval datatype.  It's conceivable that these things
had some marginal performance cost, but if so I'd have expected it to
show up as extra CPU effort (more time checking permissions, say).
This figure:

> samples  %app name symbol name
> 164623113 70.5372  kernel-2.6.11.3  .shared_idle

says pretty clearly that your problem is all I/O wait, and there are
no other commits that might have increased our tendency to wait for I/O.

I am sure I will get some pushback if I propose reverting the O_DIRECT
patch, so could you try to get some more-specific evidence?  Like pull
the CVS tree from just before and just after this patch and compare
performance?

BTW I did check that both runs are using wal_sync_method = fdatasync
and wal_buffers = 1000, so it's not a problem of those parameters having
been changed by the patch.

regards, tom lane



Re: [HACKERS] data on devel code perf dip

2005-08-11 Thread Alvaro Herrera
On Thu, Aug 11, 2005 at 09:02:03PM -0400, Tom Lane wrote:

> I am sure I will get some pushback if I propose reverting the O_DIRECT
> patch, so could you try to get some more-specific evidence?  Like pull
> the CVS tree from just before and just after this patch and compare
> performance?

Quoth the open(2) manpage:

   O_DIRECT
          Try to minimize cache effects of the I/O to and from this
          file.  In general this will degrade performance, but it is
          useful in special situations, such as when applications do
          their own caching.  File I/O is done directly to/from user
          space buffers.  The I/O is synchronous, i.e., at the
          completion of the read(2) or write(2) system call, data is
          guaranteed to have been transferred.

In light of this, may I ask whether it makes sense to compare the
performance of two runs with similar shared_buffer settings?  With
O_DIRECT, I understand from this manpage that the OS is going to do
little or no page caching, so shared_buffers should be increased to
account for this fact.

Am I missing something?

-- 
Alvaro Herrera ()
"Hay que recordar que la existencia en el cosmos, y particularmente la
elaboración de civilizaciones dentre de él no son, por desgracia,
nada idílicas" (Ijon Tichy)



Re: [HACKERS] data on devel code perf dip

2005-08-11 Thread Tom Lane
Alvaro Herrera <[EMAIL PROTECTED]> writes:
> In light of this, may I ask whether it makes sense to compare the
> performance of two runs with similar shared_buffer settings?  With
> O_DIRECT, I understand from this manpage that the OS is going to do
> little or no page caching, so shared_buffers should be increased to
> account for this fact.

> Am I missing something?

O_DIRECT is only being used for WAL page writes (or I sure hope so
anyway), so shared_buffers should be irrelevant.

regards, tom lane



Re: [HACKERS] data on devel code perf dip

2005-08-11 Thread Bruce Momjian
Tom Lane wrote:
> Most of the CVS activity in that time period had to with stuff like
> roles and the interval datatype.  It's conceivable that these things
> had some marginal performance cost, but if so I'd have expected it to
> show up as extra CPU effort (more time checking permissions, say).
> This figure:
> 
> > samples  %app name symbol name
> > 164623113 70.5372  kernel-2.6.11.3  .shared_idle
> 
> says pretty clearly that your problem is all I/O wait, and there are
> no other commits that might have increased our tendency to wait for I/O.
> 
> I am sure I will get some pushback if I propose reverting the O_DIRECT
> patch, so could you try to get some more-specific evidence?  Like pull
> the CVS tree from just before and just after this patch and compare
> performance?
> 
> BTW I did check that both runs are using wal_sync_method = fdatasync
> and wal_buffers = 1000, so it's not a problem of those parameters having
> been changed by the patch.

We can supply a patch with just the O_DIRECT change for you to test.  The
O_DIRECT patch also had grouped WAL writes, so that might be an issue
too.  Also, O_DIRECT is only used for the open_* wal sync methods, so I
don't see how it would affect this, but the grouped WAL writes might.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] data on devel code perf dip

2005-08-11 Thread Bruce Momjian
Tom Lane wrote:
> Alvaro Herrera <[EMAIL PROTECTED]> writes:
> > In light of this, may I ask whether it makes sense to compare the
> > performance of two runs with similar shared_buffer settings?  With
> > O_DIRECT, I understand from this manpage that the OS is going to do
> > little or no page caching, so shared_buffers should be increased to
> > account for this fact.
> 
> > Am I missing something?
> 
> O_DIRECT is only being used for WAL page writes (or I sure hope so
> anyway), so shared_buffers should be irrelevant.

Uh, O_DIRECT really just kicks in when open_sync is used, and I assume
that is not used for writing dirty buffers during a checkpoint.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] data on devel code perf dip

2005-08-11 Thread Andrew - Supernews
On 2005-08-12, Alvaro Herrera <[EMAIL PROTECTED]> wrote:
> In light of this, may I ask whether it makes sense to compare the
> performance of two runs with similar shared_buffer settings?  With
> O_DIRECT, I understand from this manpage that the OS is going to do
> little or no page caching, so shared_buffers should be increased to
> account for this fact.
>
> Am I missing something?

O_DIRECT should only be being used for the WAL, not for buffer I/O.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services



[HACKERS] Why do index access methods use LP_DELETE?

2005-08-11 Thread ITAGAKI Takahiro
Hi Hackers,

I found that index access methods use LP_DELETE for invisible tuples,
but it seems to be OK to clear LP_USED instead of adding LP_DELETE.

I tried the following simple replacements:
  - '|= LP_DELETE' => '&= ~LP_USED'
  - 'ItemIdDeleted' => '!ItemIdIsUsed'
and then, it passed all regression tests.

Why do index access methods use LP_DELETE?
Does this change cause any trouble?


(However, I guess there is no advantage in the change,
 because unused items are not recycled until the next vacuum.)

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories




Re: [HACKERS] data on devel code perf dip

2005-08-11 Thread Bruce Momjian
Andrew - Supernews wrote:
> On 2005-08-12, Alvaro Herrera <[EMAIL PROTECTED]> wrote:
> > In light of this, may I ask whether it makes sense to compare the
> > performance of two runs with similar shared_buffer settings?  With
> > O_DIRECT, I understand from this manpage that the OS is going to do
> > little or no page caching, so shared_buffers should be increased to
> > account for this fact.
> >
> > Am I missing something?
> 
> O_DIRECT should only be being used for the WAL, not for buffer I/O.

And only when open_sync or open_datasync is used.  The grouped-write
patch, though, applies in all cases, but again only to WAL.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] data on devel code perf dip

2005-08-11 Thread Tom Lane
Bruce Momjian  writes:
>> O_DIRECT is only being used for WAL page writes (or I sure hope so
>> anyway), so shared_buffers should be irrelevant.

> Uh, O_DIRECT really just enables when open_sync is used, and I assume
> that is not used for writing dirty buffers during a checkpoint.

I double-checked that O_DIRECT is really just used for WAL, and only
when the sync mode is open_sync or open_datasync.  So it seems
impossible that it affected a run with mode fdatasync.  What seems the
best theory at the moment is that the grouped-WAL-write part of the
patch doesn't work so well as we thought.

regards, tom lane



Re: [HACKERS] data on devel code perf dip

2005-08-11 Thread Bruce Momjian
Tom Lane wrote:
> Bruce Momjian  writes:
> >> O_DIRECT is only being used for WAL page writes (or I sure hope so
> >> anyway), so shared_buffers should be irrelevant.
> 
> > Uh, O_DIRECT really just enables when open_sync is used, and I assume
> > that is not used for writing dirty buffers during a checkpoint.
> 
> I double-checked that O_DIRECT is really just used for WAL, and only
> when the sync mode is open_sync or open_datasync.  So it seems
> impossible that it affected a run with mode fdatasync.  What seems the
> best theory at the moment is that the grouped-WAL-write part of the
> patch doesn't work so well as we thought.

Yes, that's my only guess.  Let us know if you want the patch to test,
rather than pulling CVS before and after the patch was applied.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] Why do index access methods use LP_DELETE?

2005-08-11 Thread Tom Lane
ITAGAKI Takahiro <[EMAIL PROTECTED]> writes:
> Why do index access methods use LP_DELETE?

My recollection is that I deliberately used LP_DELETE for the
known-dead-tuple marker so that there couldn't be any confusion with
the use of LP_USED.  AFAIR, LP_USED isn't actually used in indexes,
so we could do it differently if there were another possible use for
the flag bit ... have you got one in mind?

regards, tom lane



[HACKERS] ereport(ERROR) and files

2005-08-11 Thread Bruce Momjian
Do we automatically close files opened with AllocateFile() on
ereport(ERROR)?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] ereport(ERROR) and files

2005-08-11 Thread Bruce Momjian
Bruce Momjian wrote:
> Do we automatically close files opened with AllocateFile() on
> ereport(ERROR)?

I found they are closed, fd.c comment says:

 * fd.c will automatically close all files opened with AllocateFile at
 * transaction commit or abort; this prevents FD leakage if a routine
 * that calls AllocateFile is terminated prematurely by ereport(ERROR).

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] obtaining row locking information

2005-08-11 Thread Tatsuo Ishii
> Hi,
> 
> With a help from Bruce, I wrote a small function which returns row
> locking information(see attached file if you are interested). Here is
> a sample result:
> 
> test=# select * from pgrowlocks('t1');
>  locked_row | lock_type | locker | multi 
> ------------+-----------+--------+-------
>  (0,1)      | Shared    |      1 | t
>  (0,3)      | Exclusive |    575 | f
> (2 rows)
> 
> I think it will be more useful if actual xids are shown in the case
> where "locker" is a multixid. It seems GetMultiXactIdMembers() does the
> job. Unfortunately that is a static function, however. Is there any
> chance GetMultiXactIdMembers() could become a public function?

I enhanced pgrowlocks() to use GetMultiXactIdMembers() so that it
displays each xid belonging to a particular multixid (see attached
source code).

test=# select * from pgrowlocks('t1');
 locked_row | lock_type | locker | multi |   xids    
------------+-----------+--------+-------+-----------
 (0,1)      | Shared    |      3 | t     | {646,647}
(1 row)

However, even if one of the transactions, for example 647, commits, it
still shows 647 as a member of multixid 3.

test=# select * from pgrowlocks('t1');
 locked_row | lock_type | locker | multi |   xids    
------------+-----------+--------+-------+-----------
 (0,1)      | Shared    |      3 | t     | {646,647}
(1 row)

Am I missing something?
--
Tatsuo Ishii
/*
 * $PostgreSQL$
 *
 * Copyright (c) 2005   Tatsuo Ishii
 *
 * Permission to use, copy, modify, and distribute this software and
 * its documentation for any purpose, without fee, and without a
 * written agreement is hereby granted, provided that the above
 * copyright notice and this paragraph and the following two
 * paragraphs appear in all copies.
 *
 * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT,
 * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
 * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
 * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED
 * OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS
 * IS" BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE,
 * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
 */

#include "postgres.h"

#include "funcapi.h"
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/transam.h"
#include "catalog/namespace.h"
#include "catalog/pg_type.h"
#include "utils/builtins.h"


PG_FUNCTION_INFO_V1(pgrowlocks);

extern Datum pgrowlocks(PG_FUNCTION_ARGS);

/* --
 * pgrowlocks:
 * returns tids of rows being locked
 *
 * C FUNCTION definition
 * pgrowlocks(text) returns set of pgrowlocks_type
 * see pgrowlocks.sql for pgrowlocks_type
 * --
 */

#define DUMMY_TUPLE "public.pgrowlocks_type"
#define NCHARS 32

/*
 * define this if makeRangeVarFromNameList() has two arguments. As far
 * as I know, this only happens in 8.0.x.
 */
#undef MAKERANGEVARFROMNAMELIST_HAS_TWO_ARGS

typedef struct {
HeapScanDesc scan;
int ncolumns;
} MyData;

Datum
pgrowlocks(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
HeapScanDesc scan;
HeapTuple   tuple;
TupleDesc   tupdesc;
AttInMetadata *attinmeta;
Datum   result;
MyData *mydata;
Relationrel;

if (SRF_IS_FIRSTCALL())
{
text   *relname;
RangeVar   *relrv;
MemoryContext oldcontext;

funcctx = SRF_FIRSTCALL_INIT();
        oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);

tupdesc = RelationNameGetTupleDesc(DUMMY_TUPLE);
attinmeta = TupleDescGetAttInMetadata(tupdesc);
funcctx->attinmeta = attinmeta;

relname = PG_GETARG_TEXT_P(0);
#ifdef MAKERANGEVARFROMNAMELIST_HAS_TWO_ARGS
        relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname,
                                                                 "pgrowlocks"));
#else
        relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
#endif
rel = heap_openrv(relrv, AccessShareLock);
scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
mydata = palloc(sizeof(*mydata));
mydata->scan = scan;
mydata->ncolumns = tupdesc->natts;
funcctx->user_fctx = mydata;

MemoryContextSwitchTo(oldcontext);
}

funcctx = SRF_PERCALL_SETUP();
attinmeta = funcctx->attinmeta;
mydata = (MyData *)funcctx->user_fctx;
scan = mydata->scan;

/* scan the relation */
while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
{
 

Re: [HACKERS] ereport(ERROR) and files

2005-08-11 Thread Tom Lane
Bruce Momjian  writes:
> Do we automatically close files opened with AllocateFile() on
> ereport(ERROR)?

Yes ... that's more or less the point of having it ...

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] obtaining row locking information

2005-08-11 Thread Alvaro Herrera
On Fri, Aug 12, 2005 at 12:27:25PM +0900, Tatsuo Ishii wrote:

> However, even if one of the transactions, for example 647, commits, it
> still shows 647 as a member of multixid 3.
> 
> test=# select * from pgrowlocks('t1');
>  locked_row | lock_type | locker | multi |   xids    
> ------------+-----------+--------+-------+-----------
>  (0,1)      | Shared    |      3 | t     | {646,647}
> (1 row)
> 
> Am I missing something?

By design, a MultiXactId does not change its membership; that is,
members are neither added nor deleted.  When this has to happen (for
example a row is locked by another backend), a new MultiXactId is
generated.  The caller is expected to check whether the member
transactions are still running.

-- 
Alvaro Herrera ()
"El que vive para el futuro es un iluso, y el que vive para el pasado,
un imbécil" (Luis Adler, "Los tripulantes de la noche")



Re: [HACKERS] obtaining row locking information

2005-08-11 Thread Tatsuo Ishii
> On Fri, Aug 12, 2005 at 12:27:25PM +0900, Tatsuo Ishii wrote:
> 
> > However, even if one of the transactions, for example 647, commits, it
> > still shows 647 as a member of multixid 3.
> > 
> > test=# select * from pgrowlocks('t1');
> >  locked_row | lock_type | locker | multi |   xids    
> > ------------+-----------+--------+-------+-----------
> >  (0,1)      | Shared    |      3 | t     | {646,647}
> > (1 row)
> > 
> > Am I missing something?
> 
> By design, a MultiXactId does not change its membership; that is,
> members are neither added nor deleted.  When this has to happen (for
> example a row is locked by another backend), a new MultiXactId is
> generated.  The caller is expected to check whether the member
> transactions are still running.

But it seems that when members are deleted, a new multixid is not
generated, i.e. I see the "locker" column does not change. Is this
expected behavior?
--
Tatsuo Ishii



Re: [HACKERS] Why do index access methods use LP_DELETE?

2005-08-11 Thread ITAGAKI Takahiro
Tom Lane <[EMAIL PROTECTED]> wrote:

> > Why do index access methods use LP_DELETE?
> 
> My recollection is that I deliberately used LP_DELETE for the
> known-dead-tuple marker so that there couldn't be any confusion with
> the use of LP_USED.  AFAIR, LP_USED isn't actually used in indexes,
> so we could do it differently if there were another possible use for
> the flag bit ... have you got one in mind?

Thanks. I understand now that there is little difference between
clearing LP_USED and setting LP_DELETE for indexes.

I'm thinking of using LP_DELETE on heap tuples for incremental vacuum,
which is discussed in
http://archives.postgresql.org/pgsql-hackers/2005-03/msg00518.php
I'll try to mark tuples with LP_DELETE during visibility checks and
have the bgwriter recycle the pages.
...However, it is still just an idea.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories


