Re: [PERFORM] Intel SRCS16 SATA raid?

2005-05-11 Thread Bruce Momjian
[EMAIL PROTECTED] wrote:
> Greg,
> 
> I posted this link under a different thread (the $7k server thread).  It is
> a very good read on why SCSI is better for servers than ATA.  I didn't note
> bias, though it is from a drive manufacturer.  YMMV.  There is an
> interesting, though dated appendix on different manufacturers' drive
> characteristics.
> 
> http://www.seagate.com/content/docs/pdf/whitepaper/D2c_More_than_Interface_ATA_vs_SCSI_042003.pdf

I have read this and it is an _excellent_ read about disk drives.  The
bottom line is that the SCSI/IDE distinction is more an indicator of the
drive's class than its main feature.  The real distinction is that
Enterprise Storage drives are designed for high reliability and speed,
while Personal Server drives are designed for low cost.  The IDE/SCSI
interface is only an indicator of which class a drive belongs to.

There are many more differences between these two types of drives than
I knew about.  I recommend the paper to anyone who is choosing drives
for a system.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] BLOB's bypassing the OS Filesystem for better Image loading speed?

2005-05-11 Thread Enrico Weigelt
* [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

Hi,

> My next question concerns blobs in my web application (using Tomcat 5
> and JDBC, integrated with the J2EE app server JBoss).
> 
> Filesystems with many filesystem objects can slow down performance
> when opening and reading data.

As others have already pointed out, you probably meant: overcrowded
directories can make some filesystems slow. This is the case for ext2.
ReiserFS, in contrast, is designed to handle very large directories
(in fact it uses indexes similar to those a database uses).

If your application is a typical web app, you will probably have this
situation:

+ images get read quite often, while they get updated quite seldom.
+ you don't want to use image content in queries (i.e. match against it)
+ the images will be transferred directly, without further processing
+ you can give the upload and the download server access to a shared
  filesystem, or synchronize their filesystems (e.g. rsync)

Under these assumptions, I'd suggest using the filesystem directly.
This should save some load, i.e.:

+ no transfer from postgres -> webserver necessary; the webserver can
  directly fetch files from the filesystem
+ no further processing (server-side application) necessary
+ backup and synchronization are quite trivial (good old fs tools)
+ clustering (using many image webservers) is quite trivial

As already mentioned, you've got to choose the right filesystem, or at
least the right fs organization (i.e. working with an n-level hierarchy
to keep directory sizes small and lookups fast).
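
For illustration only, here is one way such a path could be derived on the
database side. The images table, the image_id column and the two-level
layout are all made up for the example; md5() is only used to spread the
ids evenly over the directories:

    -- map an image id to a two-level directory path, e.g. 'a3/f1/12345',
    -- so that no single directory grows too large
    SELECT substr(md5(image_id::text), 1, 2) || '/'
        || substr(md5(image_id::text), 3, 2) || '/'
        || image_id::text AS relative_path
      FROM images;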

An RDBMS can do this for you and so save some implementation work,
but I don't think it will be noticeably faster than a good fs-side
implementation.


Of course there may be a lot of good reasons to put images into the
database, e.g. if some clients work directly on db connections and
all work (including image upload) should go over the db link.


cu
-- 
-
 Enrico Weigelt==   metux IT service
  phone: +49 36207 519931 www:   http://www.metux.de/
  fax:   +49 36207 519932 email: [EMAIL PROTECTED]
-
  Realtime Forex/Stock Exchange trading powered by PostgreSQL :))
http://www.fxignal.net/
-

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Optimizer wrongly picks Nested Loop Left Join

2005-05-11 Thread Edin Kadribasic
From: "Tom Lane" <[EMAIL PROTECTED]>
> "Edin Kadribasic" <[EMAIL PROTECTED]> writes:
> > I have a query that is giving the optimizer (and me) great headache.
>
> The main problem seems to be that the rowcount estimates for
> axp_temp_order_match and axp_dayschedule are way off:
>
> >  ->  Index Scan using axp_temp_order_match_idx1 on
> > axp_temp_order_match a  (cost=0.00..209.65 rows=426 width=4) (actual
> > time=0.277..0.512 rows=6 loops=1)
> >Index Cond: (sid = 16072)
>
> >->  Index Scan using axp_dayschedule_day_idx on
> > axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual
> > time=0.036..3.973 rows=610 loops=1)
> >  Index Cond: (("day" >= '2005-05-12'::date) AND ("day" <= '2005-05-12'::date))
> >  Filter: (used = B'1'::"bit")
>
> >  ->  Index Scan using axp_temp_order_match_idx1 on
> > axp_temp_order_match a  (cost=0.00..2.45 rows=1 width=4) (actual
> > time=0.027..2.980 rows=471 loops=1)
> >Index Cond: (sid = 16092)
>
> >  ->  Index Scan using axp_dayschedule_day_idx on
> > axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual
> > time=0.015..3.557 rows=606 loops=471)
> >Index Cond: (("day" >= '2005-05-13'::date) AND ("day" <= '2005-05-13'::date))
> >Filter: (used = B'1'::"bit")
>
> Do you ANALYZE these tables on a regular basis?  If so, it may be
> necessary to increase the statistics target to the point where you
> get better estimates.

Increasing the statistics target didn't seem to help, but both of you gave me
an idea of what might be wrong. axp_temp_order_match contains temporary matches
for a search. Just before that query is executed, the new matches are inserted
into the table under a new search id (sid column). Since the ANALYZE was run
before that insert, the planner grossly underestimates the number of matches
for that sid. As this table is relatively small, running ANALYZE
axp_temp_order_match just before the query does not introduce a great
performance penalty (50ms), and it reduces the query execution time from up to
50s down to ~20ms.
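
In other words, the flow now looks roughly like this (the 16093 sid and the
new_matches source table are made up for the example; the real insert is of
course generated by the application):

    -- load the new matches for this search under a fresh sid
    INSERT INTO axp_temp_order_match (sid, tid)
        SELECT 16093, tid FROM new_matches;

    -- refresh statistics so the planner sees how many rows the new sid has
    ANALYZE axp_temp_order_match;

    -- then run the search query, which now gets a realistic rowcount estimate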

> > Please note that sometimes when I get "bad plan" in the logfile, I just
> > re-run the query and the optimizer chooses the more efficient one.
>
> That's fairly hard to believe, unless you've got autovacuum running
> in the background.

The application had the ANALYZE axp_temp_order_match placed in a "slightly"
wrong location: before the large insert was done (1000 rows with a new
sid). So when the app ran the next search, the previous search's rows had
been correctly analyzed, and the query execution time had dropped
dramatically by the time I tried to EXPLAIN ANALYZE the query recorded in
the log file.

Thanks for your help,

Edin


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Partitioning / Clustering

2005-05-11 Thread PFC

However, memcached (and for us, pg_memcached) is an excellent way to
improve horizontal scalability by taking disposable data (like session
information) out of the database and putting it in protected RAM.
	So, what is the advantage of such a system versus, say, a "sticky
sessions" system where each session is assigned to ONE application server
(not PHP then), which keeps it in RAM as native objects instead of
serializing and deserializing it on each request?
	I'd say the sticky sessions should perform a lot better, and if one
machine dies, only the sessions on that machine are lost.
	But of course you can't do it with PHP, as you need an app server
which can manage sessions. Potentially the savings are huge, though.

	At Google, the distributed system spans a huge number of PCs and has
redundancy built in: individual PC failure is a normal, expected event and
is handled gracefully. I read a paper on the subject; it's pretty
impressive. The Google File System has nothing to do with databases,
though; it's more of a massive data store / streaming storage system.

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Partitioning / Clustering

2005-05-11 Thread Jim C. Nasby
On Wed, May 11, 2005 at 08:57:57AM +0100, David Roussel wrote:
> For an interesting look at scalability, clustering, caching, etc for a
> large site have a look at how livejournal did it.
> http://www.danga.com/words/2004_lisa/lisa04.pdf
> 
> They have 2.6 Million active users, posting 200 new blog entries per
> minute, plus many comments and countless page views.

Neither of which is that horribly impressive. 200 TPM is less than 4TPS.
While I haven't run high transaction rate databases under PostgreSQL, I
suspect others who have will say that 4TPS isn't that big of a deal.

> Although this system is of a different sort to the type I work on it's
> interesting to see how they've made it scale.
> 
> They use mysql on dell hardware! And found single master replication did
> not scale.  There's a section on multimaster replication, not sure if
Probably didn't scale because they used to use MyISAM.

> they use it.  The main approach they use is to parition users into
> spefic database clusters.  Caching is done using memcached at the
Which means they've got a huge amount of additional code complexity, not
to mention how many times you can't post something because 'that cluster
is down for maintenance'.

> application level to avoid hitting the db for rendered pageviews.
Memcached is about the only good thing I've seen come out of
livejournal.

> It's interesting that the solution livejournal have arrived at is quite
> similar in ways to the way google is set up.

Except that unlike LJ, google stays up and it's fast. Though granted, LJ
is quite a bit faster than it was 6 months ago.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Bad plan after vacuum analyze

2005-05-11 Thread Mischa Sandberg
Quoting Guillaume Smet <[EMAIL PROTECTED]>:

> Hi,
> 
> We have some performances problem on a particular query.
...

I have to say it, this was the best laid-out set of details behind a
problem I've ever seen on this list; I'm going to try to live up to it
the next time I have a problem of my own.



---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Federated PG servers -- Was: Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL

2005-05-11 Thread Mischa Sandberg
I was curious why you pointed out SQL-MED as a SQL-standard approach to
federated servers. I had always thought of it as covering access to non-SQL
data, the way the lo_* interface works, as opposed to meshing compatible
(to say nothing of identical) SQL servers. I just checked Jim Melton's
last word on that to make sure. Is there something beyond that
that I'm missing?

The approach that made the most immediate sense to me (perhaps from having
gone down this road before) is to leave the SQL syntactically unchanged and
to manage federated relations via pg_ tables and probably procedures. MSSQL
and Sybase went that route. It wouldn't preclude later moving to a system
embedded in the SQL language.

The hurdles for federated SQL service are:
- basic syntax (how to refer to a remote object)
- connection management and delegated security
- timeouts and temporary connection failures
- efficient distributed queries with >1 remote table
- distributed transactions
- interserver integrity constraints

Sometimes the lines get weird because of opportunistic implementations.
For example, for the longest time, MSSQL supported server.db.user.object
references WITHIN STORED PROCEDURES, since the proc engine could hide
some primitive connection management. 

PG struck me as such a natural for cross-server queries, because
it keeps everything out in the open, including statistics.
PG is also well set-up to handle heterogeneous table types,
and has functions that return rowsets. Nothing needs to be bent out of
shape syntactically, or in the cross-server interface, to get over the
hurdles above.
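
To illustrate the "functions that return rowsets" point, a trivial sketch
(the orders table and remote_orders() function are hypothetical; a real
federated setup would pull the rows from the remote server, e.g. via
something like contrib/dblink, instead of from a local table):

    -- a rowset-returning function standing in for a remote table
    CREATE FUNCTION remote_orders() RETURNS SETOF orders AS '
        SELECT * FROM orders;
    ' LANGUAGE sql;

    -- usable in FROM like any other relation
    SELECT count(*) FROM remote_orders()
     WHERE order_date > current_date - 7;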

The fact that queries, and hence transactions, can't span multiple databases
tells me PG has a way to go before it can handle dependence on a
distributed transaction monitor.

My 2c.


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Sort and index

2005-05-11 Thread Jim C. Nasby
First, I've got some updated numbers up at
http://stats.distributed.net/~decibel/

timing2.log shows that the planner actually under-estimates an index
scan by several orders of magnitude. Granted, random_page_cost is set to
an unrealistic 1.1 (otherwise I can't force the index scan), but that
alone isn't enough to explain the difference.

On Wed, May 11, 2005 at 05:59:10PM +0200, Manfred Koizar wrote:
> On Sun, 24 Apr 2005 17:01:46 -0500, "Jim C. Nasby" <[EMAIL PROTECTED]>
> wrote:
> >> >> Feel free to propose better cost equations.
> 
> I did.  More than once.
> 
> >estimated index scan cost for (project_id, id, date) is
> >0.00..100117429.34 while the estimate for work_units is
> >0.00..103168408.62; almost no difference,
> 
> ~3%
> 
> > even though project_id correlation is .657
> 
> This is divided by the number of index columns, so the index correlation
> is estimated to be 0.219.

That seems like a pretty bad assumption to make.

Is there any eta on having statistics for multi-column indexes?

> >you'll see that the cost of the index scan is way overestimated. Looking
> >at the code, the runcost is calculated as
> >
> >run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);
> >
> >where csquared is indexCorrelation^2. Why is indexCorrelation squared?
> >The comments say a linear interpolation between min_IO and max_IO is
> >used, but ISTM that if it was linear then instead of csquared,
> >indexCorrelation would just be used.
> 
> In my tests I got much more plausible results with
> 
>   1 - (1 - abs(correlation))^2

What's the theory behind that?

And I'd still like to know why correlation squared is used.

> Jim, are you willing to experiment with one or two small patches of
> mine?  What version of Postgres are you running?

It depends on the patches, since this is a production machine. Currently
it's running 7.4.*mumble*, though I need to upgrade to 8, which I was
intending to do via Slony. Perhaps the best thing would be for me to get
that set up so we can experiment against version 8.0.3.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Bad plan after vacuum analyze

2005-05-11 Thread Guillaume Smet
Josh, Tom,

Thanks for your explanations.

> In the meantime it seems like the quickest answer for Guillaume might
> be to try to avoid keeping any NULLs in parent_application_id.

I can't do that, as the majority of the applications don't have a parent
one. Moreover, we use a third-party application and we cannot modify all
its internals.

Anyway, I tried to work on the statistics as you told me and here are 
the results:
ccm_perf=# ALTER TABLE acs_objects ALTER COLUMN object_id SET STATISTICS 30;
ALTER TABLE
ccm_perf=# ANALYZE acs_objects;
ANALYZE

ccm_perf=# \i query_section.sql
... correct plan ...
 Total runtime: 0.555 ms
So I think I will use this solution for the moment.
Thanks a lot for your help.
Regards
--
Guillaume
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly


Re: [PERFORM] Bad plan after vacuum analyze

2005-05-11 Thread Tom Lane
Ah-ha, I can replicate the problem.  This example uses tenk1 from the
regression database, which has a column unique2 containing just the
integers 0...

regression=# create table t1(f1 int);
CREATE TABLE
regression=# insert into t1 values(5);
INSERT 154632 1
regression=# insert into t1 values(7);
INSERT 154633 1
regression=# analyze t1;
ANALYZE
regression=# explain analyze select * from tenk1 right join t1 on (unique2=f1);
QUERY PLAN
---
 Merge Right Join  (cost=1.03..1.37 rows=2 width=248) (actual time=0.507..0.617 
rows=2 loops=1)
   Merge Cond: ("outer".unique2 = "inner".f1)
   ->  Index Scan using tenk1_unique2 on tenk1  (cost=0.00..498.24 rows=10024 
width=244) (actual time=0.126..0.242 rows=9 loops=1)
   ->  Sort  (cost=1.03..1.03 rows=2 width=4) (actual time=0.145..0.153 rows=2 
loops=1)
 Sort Key: t1.f1
 ->  Seq Scan on t1  (cost=0.00..1.02 rows=2 width=4) (actual 
time=0.029..0.049 rows=2 loops=1)
 Total runtime: 1.497 ms
(7 rows)

The planner correctly perceives that only a small part of the unique2
index will need to be scanned, and hence thinks the merge is cheap ---
much cheaper than if the whole index had to be scanned.  And it is.
Notice that only 9 rows were actually pulled from the index.  Once
we got to unique2 = 8, nodeMergejoin.c could see that no more matches
to f1 were possible.

But watch this:

regression=# insert into t1 values(null);
INSERT 154634 1
regression=# explain analyze select * from tenk1 right join t1 on (unique2=f1);
   QUERY PLAN
-
 Merge Right Join  (cost=1.03..1.37 rows=2 width=248) (actual 
time=0.560..290.874 rows=3 loops=1)
   Merge Cond: ("outer".unique2 = "inner".f1)
   ->  Index Scan using tenk1_unique2 on tenk1  (cost=0.00..498.24 rows=10024 
width=244) (actual time=0.139..106.982 rows=1 loops=1)
   ->  Sort  (cost=1.03..1.03 rows=2 width=4) (actual time=0.181..0.194 rows=3 
loops=1)
 Sort Key: t1.f1
 ->  Seq Scan on t1  (cost=0.00..1.02 rows=2 width=4) (actual 
time=0.032..0.067 rows=3 loops=1)
 Total runtime: 291.670 ms
(7 rows)

See what happened to the actual costs of the indexscan?  All of a sudden
we had to scan the whole index because there was a null in the other
input, and nulls sort high.

I wonder if it is worth fixing nodeMergejoin.c to not even try to match
nulls to the other input.  We'd have to add a check to see if the join
operator is strict or not, but it nearly always will be.

The alternative would be to make the planner only believe in the
short-circuit path occurring if it thinks that the other input is
entirely non-null ... but this seems pretty fragile, since it only
takes one null to mess things up, and ANALYZE can hardly be counted
on to detect one null in a table.

In the meantime it seems like the quickest answer for Guillaume might
be to try to avoid keeping any NULLs in parent_application_id.

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Partitioning / Clustering

2005-05-11 Thread Simon Riggs
On Wed, 2005-05-11 at 17:13 +0800, Christopher Kings-Lynne wrote:
> > Alex Stapleton wrote
> > Be more helpful, and less arrogant please. 
> 
> Simon told you all the reasons clearly and politely.

Thanks Chris for your comments.

PostgreSQL can always do with one more developer and my sole intent was
to encourage Alex and other readers to act themselves. If my words seem
arrogant, then I apologise to any and all that think so.

Best Regards, Simon Riggs 



---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Partitioning / Clustering

2005-05-11 Thread Greg Stark
Alex Stapleton <[EMAIL PROTECTED]> writes:

> Acceptable Answers to 'So, when/is PG meant to be getting a decent
> partitioning system?':
...
>  3. Your welcome to take a stab at it, I expect the community  would
> support your efforts as well.

As long as we're being curt all around, this one's not acceptable on the basis
that it's not grammatical.

-- 
greg


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Bad plan after vacuum analyze

2005-05-11 Thread Guillaume Smet
> Well, those stats certainly appear to justify the planner's belief that
> the indexscan needn't run very far: the one value of
> parent_application_id is 1031 and this is below the smallest value of
> object_id seen by analyze.
Yes, that seems rather logical, but why does it cost so much if it should
be an efficient way to find the row?

> You might have better luck if you increase
> the statistics target for acs_objects.object_id.
What do you mean exactly?
> (It'd be interesting to know what fraction of acs_objects actually
> does have object_id < 1032.)

ccm_perf=# SELECT COUNT(*) FROM acs_objects WHERE object_id<1032;
 count
---
15
ccm_perf=# SELECT COUNT(*) FROM acs_objects;
 count
---
 33510
--
Guillaume
---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Bad plan after vacuum analyze

2005-05-11 Thread Tom Lane
Guillaume Smet <[EMAIL PROTECTED]> writes:
>> If so, can we see the pg_stats rows for the object_id and
>> parent_application_id columns?

> See attached file.

Well, those stats certainly appear to justify the planner's belief that
the indexscan needn't run very far: the one value of
parent_application_id is 1031 and this is below the smallest value of
object_id seen by analyze.  You might have better luck if you increase
the statistics target for acs_objects.object_id.  (It'd be interesting
to know what fraction of acs_objects actually does have object_id < 1032.)

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Bad plan after vacuum analyze

2005-05-11 Thread Guillaume Smet
Tom,

> So, the usual questions: have these two tables been ANALYZEd lately?

Yes, of course.
As I wrote in my previous mail, here is how I reproduce the problem:
- we load the dump in a new database (to be sure, there is no problem on 
an index or something like that)
- query: it's fast (< 1ms)
- *VACUUM FULL ANALYZE;*
- query: it's really slow (130ms) and it's another plan
- set enable_seqscan=off;
- query: it's fast (< 1ms) : it uses the best plan

I reproduced it on two different servers exactly like that (7.4.5 and 
7.4.7).

I first ran into the problem on a production database with a VACUUM ANALYZE
run every night (and we don't have many inserts per day on this database).

> If so, can we see the pg_stats rows for the object_id and
> parent_application_id columns?

See attached file.

If you're interested in a dump of these tables, just tell me. There
isn't any confidential information in them.

Regards
--
Guillaume
 schemaname        | public
 tablename         | acs_objects
 attname           | object_id
 null_frac         | 0
 avg_width         | 4
 n_distinct        | -1
 most_common_vals  |
 most_common_freqs |
 histogram_bounds  | {1032,34143,112295,120811,285004,420038,449980,453451,457684,609292,710005}
 correlation       | 0.488069
(1 row)

 schemaname        | public
 tablename         | applications
 attname           | parent_application_id
 null_frac         | 0.928571
 avg_width         | 4
 n_distinct        | 1
 most_common_vals  | {1031}
 most_common_freqs | {0.0714286}
 histogram_bounds  |
 correlation       | 1
(1 row)


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Prefetch

2005-05-11 Thread Mischa Sandberg
Quoting Christopher Kings-Lynne <[EMAIL PROTECTED]>:

> > Another trick you can use with large data sets like this when you
> want 
> > results
> > back in seconds is to have regularly updated tables that aggregate
> the data
> > along each column normally aggregated against the main data set.
> 
> > Maybe some bright person will prove me wrong by posting some
> working
> > information about how to get these apparently absent features
> working.
> 
> Most people just use simple triggers to maintain aggregate summary
> tables...

Don't know if this is more appropriate to bizgres, but:
What the first poster is talking about is what OLAP cubes do.

For big aggregating systems (OLAP), triggers perform poorly, 
compared to messy hand-rolled code. You may have dozens
of aggregates at various levels. Consider the effect of having 
each detail row cascade into twenty updates. 

It's particularly silly-looking when data is coming in as 
batches of thousands of rows in a single insert, e.g.

   COPY temp_table FROM STDIN;
   UPDATE fact_table ... FROM ... temp_table
   INSERT INTO fact_table ...FROM...temp_table

   (the above pair of operations is so common, 
Oracle added its "MERGE" operator for it).
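
Spelled out a little more (purely illustrative; temp_table, fact_table and
the dim_id/amount columns are invented, and the real statements depend on
the schema), the pair of operations typically looks like:

    COPY temp_table FROM STDIN;

    -- fold the batch into aggregate rows that already exist
    UPDATE fact_table
       SET amount = fact_table.amount + t.amount
      FROM temp_table t
     WHERE fact_table.dim_id = t.dim_id;

    -- and append the keys that were not there yet
    INSERT INTO fact_table (dim_id, amount)
    SELECT t.dim_id, t.amount
      FROM temp_table t
     WHERE NOT EXISTS (SELECT 1 FROM fact_table f
                        WHERE f.dim_id = t.dim_id);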

Hence my recent post (request) for using RULES to aggregate 
--- given no luck with triggers "FOR EACH STATEMENT".


---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Bad plan after vacuum analyze

2005-05-11 Thread Tom Lane
Josh Berkus  writes:
>  ->  Merge Right Join  (cost=8.92..9.26 rows=1 width=529) (actual 
> time=129.100..129.103 rows=1 loops=1)
>Merge Cond: ("outer".object_id = "inner".parent_application_id)
>->  Index Scan using acs_objects_object_id_p_hhkb1 on 
> acs_objects t98  (cost=0.00..2554.07 rows=33510 width=81) (actual 
> time=0.043..56.392 rows=33510 loops=1)
>->  Sort  (cost=8.92..8.93 rows=1 width=452) (actual 
> time=0.309..0.310 rows=1 loops=1)
>  Sort Key: t22.parent_application_id

> Here the planner chooses a merge right join.  This decision seems to have 
> been 
> made entirely on the basis of the cost of the join itself (total of 17) 
> without taking the cost of the sort and index access (total of 2600+) into 
> account.

> Tom, is this a possible error in planner logic?

No, it certainly hasn't forgotten to add in the costs of the inputs.
There might be a bug here, but if so it's much more subtle than that.

It looks to me like the planner believes that the one value of
t22.parent_application_id joins to something very early in the
acs_objects_object_id_p_hhkb1 sort order, and that it will therefore not
be necessary to run the indexscan to completion (or indeed very far at
all, considering that it's including such a small fraction of the total
indexscan cost).

[EMAIL PROTECTED] pointed out recently that this effect doesn't apply to
the outer side of an outer join; releases before 7.4.8 mistakenly think
it does.  But unless my wires are totally crossed today, acs_objects is
the nullable side here and so that error isn't applicable anyway.

So, the usual questions: have these two tables been ANALYZEd lately?
If so, can we see the pg_stats rows for the object_id and
parent_application_id columns?

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Bad plan after vacuum analyze

2005-05-11 Thread Josh Berkus
Guillaume,

> We reproduced the problem on a 7.4.5 and on a 7.4.7 server.
> * we load the dump in a new database
> * query: it's fast (< 1ms)
> * VACUUM FULL ANALYZE;
> * query: it's really slow (130ms) and it's another plan
> * set enable_seqscan=off;
> * query: it's fast (< 1ms) : it uses the best plan

Looking at this, the planner seems convinced that the merge join is the 
easiest way to do the OUTER JOINS, but it appears to be wrong; a nested loop 
is faster.

This isn't the only place I've encountered our optimizer doing this --
underestimating the cost of a merge join.  This seems to be because the
merge_join vs. nested_loop decision is being made in the planner
without taking the double-sort and index access costs into account.  This
query is an excellent example:

"good" plan:
 Nested Loop Left Join  (cost=2.44..17.36 rows=1 width=5532) (actual 
time=0.441..0.466 rows=1 loops=1)
   Join Filter: ("outer".parent_application_id = "inner".application_id)
   ->  Nested Loop Left Join  (cost=2.44..15.73 rows=1 width=5214) (actual 
time=0.378..0.402 rows=1 loops=1)

See, here the planner thinks that the 2 nested loops will cost "35".  

"bad" plan:
 Merge Right Join  (cost=9.27..9.48 rows=1 width=545) (actual 
time=129.364..129.365 rows=1 loops=1)
   Merge Cond: ("outer".application_id = "inner".parent_application_id)
   ->  Index Scan using applicati_applicati_id_p_ogstm on applications t116  
(cost=0.00..5.51 rows=28 width=20) (actual time=0.030..0.073 rows=28 loops=1)
   ->  Sort  (cost=9.27..9.27 rows=1 width=529) (actual time=129.202..129.203 
rows=1 loops=1)
 Sort Key: t22.parent_application_id
 ->  Merge Right Join  (cost=8.92..9.26 rows=1 width=529) (actual 
time=129.100..129.103 rows=1 loops=1)
   Merge Cond: ("outer".object_id = "inner".parent_application_id)
   ->  Index Scan using acs_objects_object_id_p_hhkb1 on 
acs_objects t98  (cost=0.00..2554.07 rows=33510 width=81) (actual 
time=0.043..56.392 rows=33510 loops=1)
   ->  Sort  (cost=8.92..8.93 rows=1 width=452) (actual 
time=0.309..0.310 rows=1 loops=1)
 Sort Key: t22.parent_application_id

Here the planner chooses a merge right join.  This decision seems to have been 
made entirely on the basis of the cost of the join itself (total of 17) 
without taking the cost of the sort and index access (total of 2600+) into 
account.

Tom, is this a possible error in planner logic?


-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


[PERFORM] Bad plan after vacuum analyze

2005-05-11 Thread Guillaume Smet
Hi,
We have a performance problem with a particular query.
We reproduced the problem on a 7.4.5 and on a 7.4.7 server.
* we load the dump in a new database
* query: it's fast (< 1ms)
* VACUUM FULL ANALYZE;
* query: it's really slow (130ms) and it's another plan
* set enable_seqscan=off;
* query: it's fast (< 1ms) : it uses the best plan
I attached the EXPLAIN ANALYZE outputs, the query and the table
descriptions. I really can't understand why the planner chooses this plan,
and especially this line:
->  Index Scan using acs_objects_object_id_p_hhkb1 on acs_objects t98
(cost=0.00..2554.07 rows=33510 width=81) (actual time=0.043..56.392
rows=33510 loops=1).
I have never seen an index scan over that many rows. For your
information, there are 33510 rows in this table, so it scans the whole
table.
The problem seems to be the left join on the acs_objects t98 table for
parent_application_id: if I remove it or change it to a subquery, it's OK.
The query is automatically generated by a persistence layer, so I can't
really rewrite it.

Thanks for any help
Regards
--
Guillaume

  QUERY PLAN
  
--
 Merge Right Join  (cost=9.27..9.48 rows=1 width=545) (actual 
time=129.364..129.365 rows=1 loops=1)
   Merge Cond: ("outer".application_id = "inner".parent_application_id)
   ->  Index Scan using applicati_applicati_id_p_ogstm on applications t116  
(cost=0.00..5.51 rows=28 width=20) (actual time=0.030..0.073 rows=28 loops=1)
   ->  Sort  (cost=9.27..9.27 rows=1 width=529) (actual time=129.202..129.203 
rows=1 loops=1)
 Sort Key: t22.parent_application_id
 ->  Merge Right Join  (cost=8.92..9.26 rows=1 width=529) (actual 
time=129.100..129.103 rows=1 loops=1)
   Merge Cond: ("outer".object_id = "inner".parent_application_id)
   ->  Index Scan using acs_objects_object_id_p_hhkb1 on 
acs_objects t98  (cost=0.00..2554.07 rows=33510 width=81) (actual 
time=0.043..56.392 rows=33510 loops=1)
   ->  Sort  (cost=8.92..8.93 rows=1 width=452) (actual 
time=0.309..0.310 rows=1 loops=1)
 Sort Key: t22.parent_application_id
 ->  Nested Loop  (cost=2.44..8.91 rows=1 width=452) 
(actual time=0.259..0.287 rows=1 loops=1)
   ->  Hash Join  (cost=2.44..3.68 rows=1 width=339) 
(actual time=0.227..0.251 rows=1 loops=1)
 Hash Cond: ("outer".application_type_id = 
"inner".application_type_id)
 ->  Seq Scan on application_types t47  
(cost=0.00..1.15 rows=15 width=28) (actual time=0.009..0.025 rows=15 loops=1)
 ->  Hash  (cost=2.44..2.44 rows=1 width=315) 
(actual time=0.121..0.121 rows=0 loops=1)
   ->  Hash Join  (cost=1.01..2.44 rows=1 
width=315) (actual time=0.073..0.112 rows=1 loops=1)
 Hash Cond: ("outer".application_id 
= "inner".section_id)
 ->  Seq Scan on applications t22  
(cost=0.00..1.28 rows=28 width=70) (actual time=0.002..0.027 rows=28 loops=1)
 ->  Hash  (cost=1.01..1.01 rows=1 
width=245) (actual time=0.026..0.026 rows=0 loops=1)
   ->  Seq Scan on 
content_sections t0  (cost=0.00..1.01 rows=1 width=245) (actual 
time=0.020..0.022 rows=1 loops=1)
   ->  Index Scan using acs_objects_object_id_p_hhkb1 
on acs_objects t7  (cost=0.00..5.22 rows=1 width=121) (actual time=0.016..0.017 
rows=1 loops=1)
 Index Cond: ("outer".application_id = 
t7.object_id)
 Total runtime: 129.754 ms
(23 lignes)




SET

 QUERY PLAN 
 
-
 Merge Left Join  (cost=17.48..18.03 rows=1 width=545) (actual 
time=0.653..0.656 rows=1 loops=1)
   Merge Cond: ("outer".parent_application_id = "inner".application_id)
   ->  Merge Left Join  (cost=17.48..17.81 rows=1 width=529) (actual 
time=0.616..0.618 rows=1 loops=1)
 Merge Cond: ("outer".parent_application_id = "inner".object_id)
 ->  Sort  (cost=17.48..17.49 rows=1 width=452) (actual 
time=0.587..0.588 rows=1 loops=1)
   Sort Key: t22.parent_application_id
   ->  Nested Loop  (cost=4.68..17.47 rows=1 width=452) (actual 
t

Re: [PERFORM] Partitioning / Clustering

2005-05-11 Thread Josh Berkus
David,

> It's interesting that the solution livejournal have arrived at is quite
> similar in ways to the way google is set up.

Yes, although again, they're using memcached as pseudo-clustering software, 
and as a result are limited to what fits in RAM (RAM on 27 machines, but it's 
still RAM).  And due to limitations on memcached, the whole thing blows 
whenever a server goes out (the memcached project is working on this).  But 
any LJ user could tell you that it's a low-availability system.

However, memcached (and for us, pg_memcached) is an excellent way to improve 
horizontal scalability by taking disposable data (like session information) 
out of the database and putting it in protected RAM.  On some websites,
adding memcached can result in as much as a 60% decrease in database traffic.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Sort and index

2005-05-11 Thread Manfred Koizar
On Sun, 24 Apr 2005 17:01:46 -0500, "Jim C. Nasby" <[EMAIL PROTECTED]>
wrote:
>> >> Feel free to propose better cost equations.

I did.  More than once.

>estimated index scan cost for (project_id, id, date) is
>0.00..100117429.34 while the estimate for work_units is
>0.00..103168408.62; almost no difference,

~3%

> even though project_id correlation is .657

This is divided by the number of index columns, so the index correlation
is estimated to be 0.219.

> while work_units correlation is .116.

So csquared is 0.048 and 0.013, respectively, and you get a result not
far away from the upper bound in both cases.  The cost estimations
differ by only 3.5% of (max_IO_cost - min_IO_cost).

>you'll see that the cost of the index scan is way overestimated. Looking
>at the code, the runcost is calculated as
>
>run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);
>
>where csquared is indexCorrelation^2. Why is indexCorrelation squared?
>The comments say a linear interpolation between min_IO and max_IO is
>used, but ISTM that if it was linear then instead of csquared,
>indexCorrelation would just be used.

In my tests I got much more plausible results with

1 - (1 - abs(correlation))^2
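
Concretely, with the numbers above (nothing here beyond plugging into the
two formulas): for an estimated index correlation of 0.219,

    csquared          = 0.219^2       ~ 0.048
    1 - (1 - 0.219)^2 = 1 - 0.781^2   ~ 0.390

so the current formula keeps the estimate pinned near max_IO_cost, while
the alternative moves it much further toward min_IO_cost.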

Jim, are you willing to experiment with one or two small patches of
mine?  What version of Postgres are you running?

Servus
 Manfred

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Optimizer wrongly picks Nested Loop Left Join

2005-05-11 Thread Tom Lane
John A Meinel <[EMAIL PROTECTED]> writes:
> Unfortunately, because Hash Join doesn't report the number of rows
> (rows=0 always), it's hard to tell how good the estimator is.

This is only a cosmetic problem because you can just look at the number
of rows actually emitted by the Hash node's child; that's always exactly
the number loaded into the hashtable.

(But having said that, it is fixed in CVS tip.)

You may be confused though --- the Hash node is not the Hash Join node.
A zero report from Hash Join does actually mean that it emitted zero
rows.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Prefetch

2005-05-11 Thread Bricklen Anderson
Christopher Kings-Lynne wrote:
>> Another trick you can use with large data sets like this when you want
>> results back in seconds is to have regularly updated tables that
>> aggregate the data along each column normally aggregated against the
>> main data set.
>>
>> Maybe some bright person will prove me wrong by posting some working
>> information about how to get these apparently absent features working.
>
> Most people just use simple triggers to maintain aggregate summary
> tables...
>
> Chris
However, if (insert) triggers prove to be too much of a performance hit, try
cron'd functions that perform the aggregation for you. This system works well
for us, using the pk (a sequence) for the start and stop points.
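
A rough sketch of that kind of cron'd function (all table, column and
function names are invented; a real version would also fold the new totals
into any existing summary rows for the same day, along the lines of the
UPDATE/INSERT pair posted earlier in this thread):

    -- remembers the highest detail pk that has been aggregated so far
    CREATE TABLE agg_progress (last_id bigint NOT NULL);
    INSERT INTO agg_progress VALUES (0);

    CREATE OR REPLACE FUNCTION roll_up_details() RETURNS void AS '
    DECLARE
        start_id bigint;
        stop_id  bigint;
    BEGIN
        -- range covered by this run: everything after the last watermark
        SELECT last_id INTO start_id FROM agg_progress;
        SELECT max(id) INTO stop_id FROM detail;

        IF stop_id IS NULL OR stop_id <= start_id THEN
            RETURN;    -- nothing new since the last run
        END IF;

        INSERT INTO daily_summary (day, total)
        SELECT date_trunc(''day'', created), sum(amount)
          FROM detail
         WHERE id > start_id AND id <= stop_id
         GROUP BY date_trunc(''day'', created);

        UPDATE agg_progress SET last_id = stop_id;
        RETURN;
    END;
    ' LANGUAGE plpgsql;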

--
___
This e-mail may be privileged and/or confidential, and the sender does
not waive any related rights and obligations. Any distribution, use or
copying of this e-mail or the information it contains by other than an
intended recipient is unauthorized. If you received this e-mail in
error, please advise me (by return e-mail or otherwise) immediately.
___
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly


Re: [PERFORM] Optimizer wrongly picks Nested Loop Left Join

2005-05-11 Thread Tom Lane
"Edin Kadribasic" <[EMAIL PROTECTED]> writes:
> I have a query that is giving the optimizer (and me) great headache.

The main problem seems to be that the rowcount estimates for
axp_temp_order_match and axp_dayschedule are way off:

>  ->  Index Scan using axp_temp_order_match_idx1 on
> axp_temp_order_match a  (cost=0.00..209.65 rows=426 width=4) (actual
> time=0.277..0.512 rows=6 loops=1)
>Index Cond: (sid = 16072)

>->  Index Scan using axp_dayschedule_day_idx on
> axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual
> time=0.036..3.973 rows=610 loops=1)
>  Index Cond: (("day" >= '2005-05-12'::date)
> AND ("day" <= '2005-05-12'::date))
>  Filter: (used = B'1'::"bit")

>  ->  Index Scan using axp_temp_order_match_idx1 on
> axp_temp_order_match a  (cost=0.00..2.45 rows=1 width=4) (actual
> time=0.027..2.980 rows=471 loops=1)
>Index Cond: (sid = 16092)

>  ->  Index Scan using axp_dayschedule_day_idx on
> axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual
> time=0.015..3.557 rows=606 loops=471)
>Index Cond: (("day" >= '2005-05-13'::date) AND
> ("day" <= '2005-05-13'::date))
>Filter: (used = B'1'::"bit")

Do you ANALYZE these tables on a regular basis?  If so, it may be
necessary to increase the statistics target to the point where you
get better estimates.

> Please note that sometimes when I get "bad plan" in the logfile, I just
> re-run the query and the optimizer chooses the more efficient one.

That's fairly hard to believe, unless you've got autovacuum running
in the background.

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL

2005-05-11 Thread Neil Conway
Tom Lane wrote:
> Performance?

I'll run some benchmarks tomorrow, as it's rather late in my time zone.
If anyone wants to post some benchmark results, they are welcome to.

> I disagree completely with the idea of forcing this behavior for all
> datatypes.  It could only be sensible for fairly wide values; you don't
> save enough to justify the lossiness otherwise.

I think it would be premature to decide about this before we see some
performance numbers. I'm not fundamentally opposed, though.

> [ BTW, posting patches to pgsql-general seems pretty off-topic. ]

Not any more than discussing implementation details is :) But your point
is well taken, I'll send future patches to -patches.

-Neil
---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Optimizer wrongly picks Nested Loop Left Join

2005-05-11 Thread John A Meinel
Edin Kadribasic wrote:
Hi,
I have a query that is giving the optimizer (and me) great headache. When
its in the good mood the optimizer chooses Hash Left Join and the query
executes in 13ms or so, but sometimes (more and more often) it chooses
Nested Loop Left Join and the execution time goes up to 2-30sec.
The query:
SELECT COUNT(DISTINCT a.tid) FROM axp_temp_order_match a LEFT OUTER JOIN (
SELECT ol.tid, ds.orid FROM axp_dayschedule ds JOIN axp_order_line ol ON
ol.olid = ds.olid JOIN axp_order o ON ds.orid = o.orid WHERE o.status >= 100
AND ds.day between '2005-05-12' and '2005-05-12' AND ds.used = '1' ) b ON
(a.tid = b.tid) WHERE b.tid IS NULL AND a.sid = 16072;
Unfortunately, because Hash Join doesn't report the number of rows
(rows=0 always), it's hard to tell how good the estimator is. But I
*can* say that the NestLoop estimation is way off.
Good plan:
=
Aggregate  (cost=221.93..221.93 rows=1 width=4) (actual time=34.262..34.266
rows=1 loops=1)
   ->  Hash Left Join  (cost=9.07..220.86 rows=426 width=4) (actual
time=34.237..34.237 rows=0 loops=1)
 Hash Cond: ("outer".tid = "inner".tid)
 Filter: ("inner".tid IS NULL)
 ->  Index Scan using axp_temp_order_match_idx1 on
axp_temp_order_match a  (cost=0.00..209.65 rows=426 width=4) (actual
time=0.277..0.512 rows=6 loops=1)
   Index Cond: (sid = 16072)
 ->  Hash  (cost=9.07..9.07 rows=1 width=4) (actual
time=32.777..32.777 rows=0 loops=1)
   ->  Nested Loop  (cost=0.00..9.07 rows=1 width=4) (actual
time=0.208..31.563 rows=284 loops=1)
 ->  Nested Loop  (cost=0.00..6.05 rows=1 width=4)
(actual time=0.178..20.684 rows=552 loops=1)
   ->  Index Scan using axp_dayschedule_day_idx on
axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual
time=0.036..3.973 rows=610 loops=1)
 Index Cond: (("day" >= '2005-05-12'::date)
AND ("day" <= '2005-05-12'::date))
 Filter: (used = B'1'::"bit")
   ->  Index Scan using axp_order_orid_key on
axp_order o  (cost=0.00..3.02 rows=1 width=4) (actual time=0.009..0.013
rows=1 loops=610)
 Index Cond: ("outer".orid = o.orid)
 Filter: (status >= 100)
 ->  Index Scan using axp_order_line_pk on
axp_order_line ol  (cost=0.00..3.01 rows=1 width=8) (actual
time=0.006..0.008 rows=1 loops=552)
   Index Cond: (ol.olid = "outer".olid)
 Total runtime: 34.581 ms
Bad plan (same query different values):
===
 Aggregate  (cost=11.54..11.54 rows=1 width=4) (actual
time=11969.281..11969.285 rows=1 loops=1)
   ->  Nested Loop Left Join  (cost=0.00..11.53 rows=1 width=4) (actual
time=25.730..11967.180 rows=338 loops=1)
See here, it thinks it will only have to do 1 nestloop, which would be
quite fast, but it has to do 338.
 Join Filter: ("outer".tid = "inner".tid)
 Filter: ("inner".tid IS NULL)
 ->  Index Scan using axp_temp_order_match_idx1 on
axp_temp_order_match a  (cost=0.00..2.45 rows=1 width=4) (actual
time=0.027..2.980 rows=471 loops=1)
   Index Cond: (sid = 16092)
 ->  Nested Loop  (cost=0.00..9.07 rows=1 width=4) (actual
time=0.088..24.350 rows=285 loops=471)
Same thing here.
   ->  Nested Loop  (cost=0.00..6.04 rows=1 width=8) (actual
time=0.067..15.649 rows=317 loops=471)
And here.
 ->  Index Scan using axp_dayschedule_day_idx on
axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual
time=0.015..3.557 rows=606 loops=471)
This estimate is way off too, but it is off in both plans.
   Index Cond: (("day" >= '2005-05-13'::date) AND
("day" <= '2005-05-13'::date))
   Filter: (used = B'1'::"bit")
 ->  Index Scan using axp_order_line_pk on
axp_order_line ol  (cost=0.00..3.01 rows=1 width=8) (actual
time=0.006..0.008 rows=1 loops=285426)
This is probably what is killing you. It is doing a single lookup 285k
times. The above plan only does it 552 times.
   Index Cond: (ol.olid = "outer".olid)
   ->  Index Scan using axp_order_orid_key on axp_order o
(cost=0.00..3.02 rows=1 width=4) (actual time=0.009..0.013 rows=1
loops=149307)
 Index Cond: ("outer".orid = o.orid)
 Filter: (status >= 100)
 Total runtime: 11969.443 ms
Please note that sometimes when I get "bad plan" in the logfile, I just
re-run the query and the optimizer chooses the more efficient one. Sometime
it does not.
Your work_mem is quite high relative to your total RAM; hopefully you
don't have many allowed concurrent connections. But that is a side point.
I assume the tables are freshly VACUUM ANALYZEd. Have you tried altering
the statistics targets for the columns? One of them to look at is
axp_dayschedule(day). That one seems to be consisten

Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL

2005-05-11 Thread Greg Stark
Neil Conway <[EMAIL PROTECTED]> writes:

> I'm posting mainly because I wasn't sure what to do to avoid false positives 
> in
> the case of hash collisions. In the hash AM code it is somewhat awkward to
> fetch the pointed-to heap tuple and recheck the scankey.[1] I just did the
> first thing that came to mind -- I marked all the hash AM opclasses as 
> "lossy",
> so the index qual is rechecked. This works, but suggestions for a better way 
> to
> do things would be welcome.

I would have thought that would be the only way worth considering.

Consider for example a query involving two or more hash indexes and the new
bitmap indexscan plan. You don't want to fetch the tuples if you can eliminate
them using one of the other indexes. 

-- 
greg


---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL

2005-05-11 Thread Tom Lane
Neil Conway <[EMAIL PROTECTED]> writes:
> Greg Stark wrote:
>> What if the hash index stored *only* the hash code?

> Attached is a WIP patch that implements this.

Performance?

> I'm posting mainly because I wasn't sure what to do to avoid false 
> positives in the case of hash collisions. In the hash AM code it is 
> somewhat awkward to fetch the pointed-to heap tuple and recheck the 
> scankey.[1] I just did the first thing that came to mind -- I marked all 
> the hash AM opclasses as "lossy", so the index qual is rechecked. This 
> works, but suggestions for a better way to do things would be welcome.

AFAICS that's the *only* way to do it.

I disagree completely with the idea of forcing this behavior for all
datatypes.  It could only be sensible for fairly wide values; you don't
save enough to justify the lossiness otherwise.

It would be interesting to look into whether it could be driven on a
per-opclass basis.  Then you could have, eg, "text_lossy_hash_ops"
as a non-default opclass the DBA could select if he wanted this
behavior.  (The code could perhaps use the amopreqcheck flag to tell
it which way to behave.)  If that seems unworkable, I'd prefer to see us
set this up as a new index AM type, which would share a lot of code with
the old.
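
From the DBA's side, selecting such a non-default opclass would presumably
use the ordinary opclass syntax, something like (the opclass name is just
the hypothetical one above; foo and t are made-up table/column names):

    CREATE INDEX foo_t_hash ON foo USING hash (t text_lossy_hash_ops);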

[ BTW, posting patches to pgsql-general seems pretty off-topic. ]

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Prefetch

2005-05-11 Thread Rod Taylor
On Wed, 2005-05-11 at 12:53 +0800, Christopher Kings-Lynne wrote:
> > Another trick you can use with large data sets like this when you want 
> > results
> > back in seconds is to have regularly updated tables that aggregate the data
> > along each column normally aggregated against the main data set.
> 
> > Maybe some bright person will prove me wrong by posting some working
> > information about how to get these apparently absent features working.
> 
> Most people just use simple triggers to maintain aggregate summary tables...

Agreed. I've also got a view which calls a function that will 1) use the
summary table where data exists, or 2) calculate the summary information,
load it into the summary table, and send a copy to the client (a partial
query-results cache).

It's not all nicely abstracted behind user-friendly syntax, but most of
those features can be cobbled together (with effort) in PostgreSQL.
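
A bare-bones sketch of that kind of function (all names are invented;
error handling, locking and the multi-day case are ignored):

    CREATE FUNCTION day_summary(date) RETURNS SETOF daily_summary AS '
    DECLARE
        rec daily_summary%ROWTYPE;
    BEGIN
        -- 1) use the summary table where data already exists
        PERFORM 1 FROM daily_summary WHERE day = $1;

        -- 2) otherwise calculate it and load it into the summary table
        IF NOT FOUND THEN
            INSERT INTO daily_summary (day, total)
            SELECT $1, sum(amount)
              FROM detail
             WHERE created >= $1 AND created < ($1 + 1);
        END IF;

        -- 3) send a copy of the rows back to the client
        FOR rec IN SELECT * FROM daily_summary WHERE day = $1 LOOP
            RETURN NEXT rec;
        END LOOP;
        RETURN;
    END;
    ' LANGUAGE plpgsql;

    -- and a thin view on top for the common case
    CREATE VIEW todays_summary AS SELECT * FROM day_summary(current_date);
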
-- 


---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Partitioning / Clustering

2005-05-11 Thread Tom Lane
Mischa Sandberg <[EMAIL PROTECTED]> writes:
> So, simplicity dictates something like:

> table pg_remote(schemaname text, connectby text, remoteschema text)

Previous discussion of this sort of thing concluded that we wanted to
follow the SQL-MED standard.

regards, tom lane

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Query tuning help

2005-05-11 Thread Ulrich Wisser
Hi Dan,
I tried to understand your query, but I couldn't get my understanding of 
the query and your description in sync.

Why do you use sub selects? Wouldn't a simple "recordtext like '%RED%'" 
do the trick too?

You combine all your WHERE conditions with AND. To me this looks like
you only get rows containing both RED and CORVETTE.

From your description I would rewrite the query as
explain analyze
select distinct
em.incidentid,
ea.recordtext as retdata,
eg.long,
eg.lat
from
ea join em using(incidentid) join eg using(incidentid)
where
em.entrydate >= '2005-1-1 00:00'::date
and em.entrydate <= '2005-5-9 00:00'::date
and (  recordtext like '%RED%' or recordtext like '%CORVETTE%'  )
order by em.entrydate
That should give you all rows containing at least one of the words.
Does it work?
Is it faster? Is it fast enough?
Ulrich
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Partitioning / Clustering

2005-05-11 Thread Christopher Kings-Lynne
Acceptable Answers to 'So, when/is PG meant to be getting a decent  
partitioning system?':

1. Person X is working on it I believe.
2. It's on the list, but nobody has done anything about it yet
3. Your welcome to take a stab at it, I expect the community  would 
support your efforts as well.
4. If you have a huge pile of money you could probably buy the  
Moon. Thinking along those lines, you can probably pay someone to  write 
it for you.
5. It's a stupid idea, and it's never going to work, and heres  
why..

Unacceptable Answers to the same question:
1. Yours.
Be more helpful, and less arrogant please. Everyone else who has  
contributed to this thread has been very helpful in clarifying the  
state of affairs and pointing out what work is and isn't being done,  
and alternatives to just waiting for PG do it for you.
Please YOU be more helpful and less arrogant.  I thought your initial
email was arrogant, demanding and insulting.  Your followup email has
done nothing to dispel my impression.  Simon (one of PostgreSQL's major 
contributors AND one of the very few people working on partitioning in 
PostgreSQL, as you requested) told you all the reasons clearly and politely.

Chris
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [PERFORM] Partitioning / Clustering

2005-05-11 Thread Alex Stapleton
On 11 May 2005, at 09:50, Alex Stapleton wrote:
On 11 May 2005, at 08:57, David Roussel wrote:

For an interesting look at scalability, clustering, caching, etc  
for a
large site have a look at how livejournal did it.
http://www.danga.com/words/2004_lisa/lisa04.pdf

I have implemented similar systems in the past, it's a pretty good  
technique, unfortunately it's not very "Plug-and-Play" as you have  
to base most of your API on memcached (I imagine MySQLs NDB tables  
might work as well actually) for it to work well.


They have 2.6 Million active users, posting 200 new blog entries per
minute, plus many comments and countless page views.
Although this system is of a different sort to the type I work on  
it's
interesting to see how they've made it scale.

They use mysql on dell hardware! And found single master  
replication did
not scale.  There's a section on multimaster replication, not sure if
they use it.  The main approach they use is to parition users into
spefic database clusters.  Caching is done using memcached at the
application level to avoid hitting the db for rendered pageviews

I don't think they are storing pre-rendered pages (or bits of) in  
memcached, but are principally storing the data for the pages in  
it. Gluing pages together is not a hugely intensive process usually :)
The only problem with memcached is that the clients clustering/ 
partitioning system will probably break if a node dies, and  
probably get confused if you add new nodes onto it as well. Easily  
extensible clustering (no complete redistribution of data required  
when you add/remove nodes) with the data distributed across nodes  
seems to be nothing but a pipe dream right now.


It's interesting that the solution livejournal have arrived at is  
quite
similar in ways to the way google is set up.

Don't Google use indexing servers which keep track of where data  
is? So that you only need to update them when you add or move data,  
deletes don't even have to be propagated among indexes immediately  
really because you'll find out if data isn't there when you visit  
where it should be. Or am I talking crap?
That will teach me to RTFA first ;) OK, so LJ maintain an index of
which cluster each user is on, kind of like Google does :)


David
---(end of  
broadcast)---
TIP 8: explain analyze is your friend



---(end of  
broadcast)---
TIP 4: Don't 'kill -9' the postmaster



---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match


Re: [PERFORM] Partitioning / Clustering

2005-05-11 Thread Alex Stapleton
On 11 May 2005, at 08:57, David Roussel wrote:
For an interesting look at scalability, clustering, caching, etc for a
large site have a look at how livejournal did it.
http://www.danga.com/words/2004_lisa/lisa04.pdf
I have implemented similar systems in the past; it's a pretty good
technique. Unfortunately it's not very "Plug-and-Play", as you have to
base most of your API on memcached (I imagine MySQL's NDB tables might
work as well, actually) for it to work well.

They have 2.6 Million active users, posting 200 new blog entries per
minute, plus many comments and countless page views.
Although this system is of a different sort to the type I work on it's
interesting to see how they've made it scale.
They use mysql on dell hardware! And found single master  
replication did
not scale.  There's a section on multimaster replication, not sure if
they use it.  The main approach they use is to parition users into
spefic database clusters.  Caching is done using memcached at the
application level to avoid hitting the db for rendered pageviews
I don't think they are storing pre-rendered pages (or bits of them) in
memcached, but are principally storing the data for the pages in it.
Gluing pages together is not a hugely intensive process, usually :)
The only problem with memcached is that the client's clustering/
partitioning system will probably break if a node dies, and will probably
get confused if you add new nodes onto it as well. Easily extensible
clustering (no complete redistribution of data required when you add/
remove nodes) with the data distributed across nodes seems to be
nothing but a pipe dream right now.

It's interesting that the solution livejournal have arrived at is  
quite
similar in ways to the way google is set up.
Don't Google use indexing servers which keep track of where data is?
That way you only need to update them when you add or move data;
deletes don't even have to be propagated among indexes immediately,
really, because you'll find out the data isn't there when you visit
where it should be. Or am I talking crap?

David
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Partitioning / Clustering

2005-05-11 Thread Alex Stapleton
On 11 May 2005, at 08:16, Simon Riggs wrote:
On Tue, 2005-05-10 at 11:03 +0100, Alex Stapleton wrote:
So, when/is PG meant to be getting a decent partitioning system?
ISTM that your question seems to confuse where code comes from.  
Without
meaning to pick on you, or reply rudely, I'd like to explore that
question. Perhaps it should be a FAQ entry.

All code is written by someone, and those people need to eat. Some
people are fully or partly funded to perform their tasks on this  
project
(coding, patching, etc). Others contribute their time for a variety of
reasons where involvement has a positive benefit.

You should ask these questions:
- Is anyone currently working on (Feature X)?
- If not, can I do it myself?
- If not, and I still want it, can I fund someone else to build it for
me?
Asking "when is Feature X going to happen" is almost certainly  
going to
get the answer "never" otherwise, if the initial development is large
and complex. There are many TODO items that have lain untouched for
years, even though adding the feature has been discussed and agreed.

Best Regards, Simon Riggs
Acceptable Answers to 'So, when/is PG meant to be getting a decent  
partitioning system?':

1. Person X is working on it, I believe.
2. It's on the list, but nobody has done anything about it yet.
3. You're welcome to take a stab at it; I expect the community would
support your efforts as well.
4. If you have a huge pile of money you could probably buy the Moon.
Thinking along those lines, you can probably pay someone to write it
for you.
5. It's a stupid idea, and it's never going to work, and here's why...

Unacceptable Answers to the same question:
1. Yours.
Be more helpful and less arrogant, please. Everyone else who has
contributed to this thread has been very helpful in clarifying the
state of affairs and pointing out what work is and isn't being done,
and alternatives to just waiting for PG to do it for you.

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


[PERFORM] Optimizer wrongly picks Nested Loop Left Join

2005-05-11 Thread Edin Kadribasic
Hi,


I have a query that is giving the optimizer (and me) a great headache.
When it's in a good mood the optimizer chooses Hash Left Join and the
query executes in 13 ms or so, but sometimes (more and more often) it
chooses Nested Loop Left Join and the execution time goes up to
2-30 sec.

The query:
SELECT COUNT(DISTINCT a.tid)
FROM axp_temp_order_match a
LEFT OUTER JOIN (
    SELECT ol.tid, ds.orid
    FROM axp_dayschedule ds
    JOIN axp_order_line ol ON ol.olid = ds.olid
    JOIN axp_order o ON ds.orid = o.orid
    WHERE o.status >= 100
      AND ds.day BETWEEN '2005-05-12' AND '2005-05-12'
      AND ds.used = '1'
) b ON (a.tid = b.tid)
WHERE b.tid IS NULL AND a.sid = 16072;

Good plan:
=
 Aggregate  (cost=221.93..221.93 rows=1 width=4) (actual time=34.262..34.266 rows=1 loops=1)
   ->  Hash Left Join  (cost=9.07..220.86 rows=426 width=4) (actual time=34.237..34.237 rows=0 loops=1)
         Hash Cond: ("outer".tid = "inner".tid)
         Filter: ("inner".tid IS NULL)
         ->  Index Scan using axp_temp_order_match_idx1 on axp_temp_order_match a  (cost=0.00..209.65 rows=426 width=4) (actual time=0.277..0.512 rows=6 loops=1)
               Index Cond: (sid = 16072)
         ->  Hash  (cost=9.07..9.07 rows=1 width=4) (actual time=32.777..32.777 rows=0 loops=1)
               ->  Nested Loop  (cost=0.00..9.07 rows=1 width=4) (actual time=0.208..31.563 rows=284 loops=1)
                     ->  Nested Loop  (cost=0.00..6.05 rows=1 width=4) (actual time=0.178..20.684 rows=552 loops=1)
                           ->  Index Scan using axp_dayschedule_day_idx on axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual time=0.036..3.973 rows=610 loops=1)
                                 Index Cond: (("day" >= '2005-05-12'::date) AND ("day" <= '2005-05-12'::date))
                                 Filter: (used = B'1'::"bit")
                           ->  Index Scan using axp_order_orid_key on axp_order o  (cost=0.00..3.02 rows=1 width=4) (actual time=0.009..0.013 rows=1 loops=610)
                                 Index Cond: ("outer".orid = o.orid)
                                 Filter: (status >= 100)
                     ->  Index Scan using axp_order_line_pk on axp_order_line ol  (cost=0.00..3.01 rows=1 width=8) (actual time=0.006..0.008 rows=1 loops=552)
                           Index Cond: (ol.olid = "outer".olid)
 Total runtime: 34.581 ms

Bad plan (same query, different values):
===
 Aggregate  (cost=11.54..11.54 rows=1 width=4) (actual time=11969.281..11969.285 rows=1 loops=1)
   ->  Nested Loop Left Join  (cost=0.00..11.53 rows=1 width=4) (actual time=25.730..11967.180 rows=338 loops=1)
         Join Filter: ("outer".tid = "inner".tid)
         Filter: ("inner".tid IS NULL)
         ->  Index Scan using axp_temp_order_match_idx1 on axp_temp_order_match a  (cost=0.00..2.45 rows=1 width=4) (actual time=0.027..2.980 rows=471 loops=1)
               Index Cond: (sid = 16092)
         ->  Nested Loop  (cost=0.00..9.07 rows=1 width=4) (actual time=0.088..24.350 rows=285 loops=471)
               ->  Nested Loop  (cost=0.00..6.04 rows=1 width=8) (actual time=0.067..15.649 rows=317 loops=471)
                     ->  Index Scan using axp_dayschedule_day_idx on axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual time=0.015..3.557 rows=606 loops=471)
                           Index Cond: (("day" >= '2005-05-13'::date) AND ("day" <= '2005-05-13'::date))
                           Filter: (used = B'1'::"bit")
                     ->  Index Scan using axp_order_line_pk on axp_order_line ol  (cost=0.00..3.01 rows=1 width=8) (actual time=0.006..0.008 rows=1 loops=285426)
                           Index Cond: (ol.olid = "outer".olid)
               ->  Index Scan using axp_order_orid_key on axp_order o  (cost=0.00..3.02 rows=1 width=4) (actual time=0.009..0.013 rows=1 loops=149307)
                     Index Cond: ("outer".orid = o.orid)
                     Filter: (status >= 100)
 Total runtime: 11969.443 ms

Please note that sometimes when I get the "bad plan" in the logfile, I
just re-run the query and the optimizer chooses the more efficient one.
Sometimes it does not.

Any ideas?

postgresql-8.0.2 on 2x3.2 GHz Xeon with 2 GB RAM, Linux 2.6
shared_buffers = 15000
work_mem = 128000
effective_cache_size = 20
random_page_cost = (tried 1.0 - 4, seemingly without effect on this
particular issue).
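For what it's worth, the estimate of rows=1 on the axp_temp_order_match
scan (471 rows actually returned in the bad plan) is the kind of
misestimate that fresh, more detailed statistics can sometimes correct.
A hedged sketch of that first step, not a verified fix for this
particular case:

-- re-collect statistics on the badly misestimated table
ANALYZE axp_temp_order_match;

-- if the estimate for sid stays at rows=1, give the planner more detail
ALTER TABLE axp_temp_order_match ALTER COLUMN sid SET STATISTICS 100;
ANALYZE axp_temp_order_match;

Then re-run the query under EXPLAIN ANALYZE and see whether the Hash
Left Join comes back.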

Edin




---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Partitioning / Clustering

2005-05-11 Thread David Roussel
For an interesting look at scalability, clustering, caching, etc. for a
large site, have a look at how livejournal did it.
http://www.danga.com/words/2004_lisa/lisa04.pdf

They have 2.6 Million active users, posting 200 new blog entries per
minute, plus many comments and countless page views.

Although this system is of a different sort to the type I work on,
it's interesting to see how they've made it scale.

They use MySQL on Dell hardware! And found single master replication
did not scale.  There's a section on multimaster replication, not sure
if they use it.  The main approach they use is to partition users into
specific database clusters.  Caching is done using memcached at the
application level to avoid hitting the db for rendered pageviews.

It's interesting that the solution LiveJournal have arrived at is quite
similar in some ways to the way Google is set up.

David

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Partitioning / Clustering

2005-05-11 Thread Neil Conway
Josh Berkus wrote:
The other problem, as I was told it at OSCON, was that these were not 
high-availability clusters; it's impossible to add a server to an existing 
cluster
Yeah, that's a pretty significant problem.
a server going down is liable to take the whole cluster down.
That's news to me. Do you have more information on this?
-Neil
---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faq


Re: [PERFORM] Partitioning / Clustering

2005-05-11 Thread Simon Riggs
On Tue, 2005-05-10 at 11:03 +0100, Alex Stapleton wrote:
> So, when/is PG meant to be getting a decent partitioning system?  

ISTM that your question seems to confuse where code comes from. Without
meaning to pick on you, or reply rudely, I'd like to explore that
question. Perhaps it should be a FAQ entry.

All code is written by someone, and those people need to eat. Some
people are fully or partly funded to perform their tasks on this project
(coding, patching, etc). Others contribute their time for a variety of
reasons where involvement has a positive benefit.

You should ask these questions:
- Is anyone currently working on (Feature X)?
- If not, can I do it myself?
- If not, and I still want it, can I fund someone else to build it for
me?

Asking "when is Feature X going to happen" is almost certainly going to
get the answer "never" otherwise, if the initial development is large
and complex. There are many TODO items that have lain untouched for
years, even though adding the feature has been discussed and agreed.

Best Regards, Simon Riggs



---(end of broadcast)---
TIP 8: explain analyze is your friend