Re: [PERFORM] Intel SRCS16 SATA raid?
[EMAIL PROTECTED] wrote:
> Greg,
>
> I posted this link under a different thread (the $7k server thread). It is
> a very good read on why SCSI is better for servers than ATA. I didn't note
> bias, though it is from a drive manufacturer. YMMV. There is an
> interesting, though dated appendix on different manufacturers' drive
> characteristics.
>
> http://www.seagate.com/content/docs/pdf/whitepaper/D2c_More_than_Interface_ATA_vs_SCSI_042003.pdf

I have read this and it is an _excellent_ read about disk drives. The bottom line is that the SCSI/IDE distinction is more an indicator of the drive's class than its defining feature. The real distinction is that certain drives are Enterprise Storage drives, designed for high reliability and speed, while Personal Server drives are designed for low cost; the IDE/SCSI interface is only an indicator of which class you are buying. There is a lot more variability between these two classes of drives than I knew. I recommend it for anyone who is choosing drives for a system.

--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your message
can get through to the mailing list cleanly
Re: [PERFORM] BLOB's bypassing the OS Filesystem for better Image loading speed?
* [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
Hi,
> My next question is dedicated to blobs in my web application (using
> Tomcat 5 and JDBC, integrated in the J2EE app server JBoss).
>
> Filesystems with many filesystem objects can slow down performance
> when opening and reading data.

As others already pointed out, you probably meant: overcrowded directories can make some filesystems slow. For ext2 this is the case. ReiserFS, by contrast, is designed to handle very large directories (in fact by using indices similar to those a database uses).

If your application is a typical web app, you will probably have this situation:

+ images get read quite often, while they get updated quite seldom
+ you don't want to use image content in queries (i.e. match against it)
+ the images will be transferred directly, without further processing
+ you can give the upload and download servers access to a shared filesystem, or synchronize their filesystems (i.e. rsync)

Under these assumptions, I'd suggest using the filesystem directly. This should save some load, i.e.:

+ no transfer from postgres -> webserver and no further processing (server-side application) necessary; the webserver can fetch files directly from the filesystem
+ backup and synchronization are quite trivial (good old fs tools)
+ clustering (using many image webservers) is quite trivial

As already mentioned, you've got to choose the right filesystem, or at least the right fs organization (i.e. working with an n-level hierarchy to keep directory sizes small and lookups fast). An RDBMS can do this for you and so save some implementation work, but I don't think it will be noticeably faster than a good fs-side implementation.

Of course there may be a lot of good reasons to put images into the database, e.g. if some clients work directly on db connections and all work (including image upload) should be done over the db link.
cu
--
Enrico Weigelt == metux IT service
phone: +49 36207 519931  www: http://www.metux.de/
fax: +49 36207 519932  email: [EMAIL PROTECTED]
Realtime Forex/Stock Exchange trading powered by PostgreSQL :))
http://www.fxignal.net/
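To make the filesystem approach above concrete, here is a minimal sketch; all table and column names are hypothetical (not from the original post). The database keeps only metadata and a hashed path, while the image bytes live on disk in a shallow directory hierarchy so no single directory grows large:

```sql
-- Hypothetical schema: metadata in PostgreSQL, image bytes on the filesystem.
-- fs_path embeds a 2-level hash prefix (e.g. 'ab/cd/') so directory sizes
-- stay small and lookups stay fast on ext2-style filesystems.
CREATE TABLE images (
    image_id   serial PRIMARY KEY,
    sha1       char(40)    NOT NULL UNIQUE,  -- content hash of the file
    fs_path    text        NOT NULL,         -- e.g. 'ab/cd/abcd1234....jpg'
    mime_type  text        NOT NULL,
    uploaded   timestamptz NOT NULL DEFAULT now()
);

-- The webserver serves fs_path straight from disk; only metadata hits the DB.
SELECT fs_path, mime_type FROM images WHERE image_id = 42;
```

The upload server writes the file under its hash-derived path and inserts one row; rsync between image servers then only moves new files.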
Re: [PERFORM] Optimizer wrongly picks Nested Loop Left Join
From: "Tom Lane" <[EMAIL PROTECTED]>
> "Edin Kadribasic" <[EMAIL PROTECTED]> writes:
> > I have a query that is giving the optimizer (and me) great headache.
>
> The main problem seems to be that the rowcount estimates for
> axp_temp_order_match and axp_dayschedule are way off:
>
> > -> Index Scan using axp_temp_order_match_idx1 on axp_temp_order_match a
> >    (cost=0.00..209.65 rows=426 width=4) (actual time=0.277..0.512 rows=6 loops=1)
> >      Index Cond: (sid = 16072)
> > -> Index Scan using axp_dayschedule_day_idx on axp_dayschedule ds
> >    (cost=0.00..3.02 rows=1 width=8) (actual time=0.036..3.973 rows=610 loops=1)
> >      Index Cond: (("day" >= '2005-05-12'::date) AND ("day" <= '2005-05-12'::date))
> >      Filter: (used = B'1'::"bit")
> > -> Index Scan using axp_temp_order_match_idx1 on axp_temp_order_match a
> >    (cost=0.00..2.45 rows=1 width=4) (actual time=0.027..2.980 rows=471 loops=1)
> >      Index Cond: (sid = 16092)
> > -> Index Scan using axp_dayschedule_day_idx on axp_dayschedule ds
> >    (cost=0.00..3.02 rows=1 width=8) (actual time=0.015..3.557 rows=606 loops=471)
> >      Index Cond: (("day" >= '2005-05-13'::date) AND ("day" <= '2005-05-13'::date))
> >      Filter: (used = B'1'::"bit")
>
> Do you ANALYZE these tables on a regular basis? If so, it may be
> necessary to increase the statistics target to the point where you
> get better estimates.

Increasing statistics didn't seem to help, but both of you gave me an idea of what might be wrong. axp_temp_order_match contains temporary matches for a search. Just before execution of that query, the new matches are inserted into the table under a new search id (sid column). Since the ANALYZE was run before that insert, it grossly underestimates the number of matches for that sid. As this table is relatively small, issuing "ANALYZE axp_temp_order_match" just before running the query does not introduce a great performance penalty (50 ms), and it reduces the query execution time from up to 50 s down to ~20 ms.
> > Please note that sometimes when I get "bad plan" in the logfile, I just
> > re-run the query and the optimizer chooses the more efficient one.
>
> That's fairly hard to believe, unless you've got autovacuum running
> in the background.

The application had "ANALYZE axp_temp_order_match" placed in a slightly wrong location: before the large insert was done (1000 rows with a new sid). So by the time the app ran the next search, the previous search's rows had been correctly analyzed, and the query execution time dropped dramatically while I was trying to EXPLAIN ANALYZE the query recorded in the log file.

Thanks for your help,
Edin
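The fix described above amounts to refreshing the statistics between the bulk insert and the search query. A minimal sketch of the corrected ordering; the match conditions and the final search query are simplified stand-ins, not the application's real SQL:

```sql
-- 1. Insert the new temporary matches for this search id.
INSERT INTO axp_temp_order_match (sid, order_id)
SELECT 16092, order_id FROM orders;  -- real match conditions elided

-- 2. Re-ANALYZE *after* the insert, so the planner sees the true
--    row count for sid = 16092 (costs ~50 ms on this small table).
ANALYZE axp_temp_order_match;

-- 3. Only now run the search; the planner picks sane rowcount estimates
--    and the right join order.
SELECT ds."day", a.order_id
FROM axp_temp_order_match a
JOIN axp_dayschedule ds ON ds.used = B'1'
WHERE a.sid = 16092;
```

The key point is only the ordering: insert, then ANALYZE, then query.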
Re: [PERFORM] Partitioning / Clustering
> However, memcached (and for us, pg_memcached) is an excellent way to
> improve horizontal scalability by taking disposable data (like session
> information) out of the database and putting it in protected RAM.

So, what is the advantage of such a system versus, say, a "sticky sessions" system where each session is assigned to ONE application server (not PHP then) which keeps it in RAM as native objects instead of serializing and deserializing it on each request? I'd say the sticky sessions should perform a lot better, and if one machine dies, only the sessions on that one are lost. But of course you can't do it with PHP, as you need an app server which can manage sessions. Potentially the savings are huge, though.

On Google, their distributed system spans a huge number of PCs and it has redundancy, i.e. individual PC failure is a normal thing and is a part of the system; it is handled gracefully. I read a paper on this matter; it's pretty impressive. The Google filesystem has nothing to do with databases though; it's more a massive data store / streaming storage.
Re: [PERFORM] Partitioning / Clustering
On Wed, May 11, 2005 at 08:57:57AM +0100, David Roussel wrote:
> For an interesting look at scalability, clustering, caching, etc for a
> large site have a look at how livejournal did it.
> http://www.danga.com/words/2004_lisa/lisa04.pdf
>
> They have 2.6 Million active users, posting 200 new blog entries per
> minute, plus many comments and countless page views.

Neither of which is that horribly impressive. 200 posts per minute is less than 4 TPS. While I haven't run high transaction rate databases under PostgreSQL, I suspect others who have will say that 4 TPS isn't that big of a deal.

> Although this system is of a different sort to the type I work on it's
> interesting to see how they've made it scale.
>
> They use mysql on dell hardware! And found single master replication did
> not scale. There's a section on multimaster replication, not sure if

Probably didn't scale because they used to use MyISAM.

> they use it. The main approach they use is to partition users into
> specific database clusters. Caching is done using memcached at the

Which means they've got a huge amount of additional code complexity, not to mention how many times you can't post something because "that cluster is down for maintenance".

> application level to avoid hitting the db for rendered pageviews.

Memcached is about the only good thing I've seen come out of livejournal.

> It's interesting that the solution livejournal have arrived at is quite
> similar in ways to the way google is set up.

Except that unlike LJ, google stays up and it's fast. Though granted, LJ is quite a bit faster than it was 6 months ago.
--
Jim C. Nasby, Database Consultant [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828
Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
Re: [PERFORM] Bad plan after vacuum analyze
Quoting Guillaume Smet <[EMAIL PROTECTED]>:
> Hi,
>
> We have some performance problems on a particular query. ...

I have to say it: this was the best laid-out set of details behind a problem I've ever seen on this list. I'm going to try to live up to it the next time I have a problem of my own.
Federated PG servers -- Was: Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL
Was curious why you pointed out SQL/MED as a SQL-standard approach to federated servers. I always thought of it as covering access to non-SQL data, the way the lo_* interface works, as opposed to meshing compatible (to say nothing of identical) SQL servers. Just checked Jim Melton's last word on that, to make sure, too. Is there something beyond that, that I'm missing?

The approach that made the most sense to me at first (perhaps from having gone there before) is to leave the SQL syntactically unchanged, and to manage federated relations via pg_ tables and probably procedures. MSSQL and Sybase went that route. It won't preclude moving to a system embedded in the SQL language.

The hurdles for federated SQL service are:
- basic syntax (how to refer to a remote object)
- connection management and delegated security
- timeouts and temporary connection failures
- efficient distributed queries with >1 remote table
- distributed transactions
- interserver integrity constraints

Sometimes the lines get weird because of opportunistic implementations. For example, for the longest time, MSSQL supported server.db.user.object references WITHIN STORED PROCEDURES, since the proc engine could hide some primitive connection management.

PG struck me as such a natural for cross-server queries, because it keeps everything out in the open, including statistics. PG is also well set up to handle heterogeneous table types, and has functions that return rowsets. Nothing needs to be bent out of shape syntactically, or in the cross-server interface, to get over the hurdles above.

The fact that queries, and hence transactions, can't span multiple databases tells me PG has a way to go before it can handle dependency on a distributed transaction monitor.

My 2c.
Re: [PERFORM] Sort and index
First, I've got some updated numbers up at http://stats.distributed.net/~decibel/

timing2.log shows that the planner actually under-estimates an index scan by several orders of magnitude. Granted, random_page_cost is set to an unrealistic 1.1 (otherwise I can't force the index scan), but that alone isn't enough to explain the difference.

On Wed, May 11, 2005 at 05:59:10PM +0200, Manfred Koizar wrote:
> On Sun, 24 Apr 2005 17:01:46 -0500, "Jim C. Nasby" <[EMAIL PROTECTED]> wrote:
> >> Feel free to propose better cost equations.
>
> I did. More than once.
>
> > estimated index scan cost for (project_id, id, date) is
> > 0.00..100117429.34 while the estimate for work_units is
> > 0.00..103168408.62; almost no difference,
>
> ~3%
>
> > even though project_id correlation is .657
>
> This is divided by the number of index columns, so the index correlation
> is estimated to be 0.219.

That seems like a pretty bad assumption to make. Is there any ETA on having statistics for multi-column indexes?

> > you'll see that the cost of the index scan is way overestimated. Looking
> > at the code, the runcost is calculated as
> >
> > run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);
> >
> > where csquared is indexCorrelation^2. Why is indexCorrelation squared?
> > The comments say a linear interpolation between min_IO and max_IO is
> > used, but ISTM that if it was linear then instead of csquared,
> > indexCorrelation would just be used.
>
> In my tests I got much more plausible results with
>
> 1 - (1 - abs(correlation))^2

What's the theory behind that? And I'd still like to know why correlation squared is used.

> Jim, are you willing to experiment with one or two small patches of
> mine? What version of Postgres are you running?

It depends on the patches, since this is a production machine. Currently it's running 7.4.*mumble*, though I need to upgrade to 8, which I was intending to do via Slony.
Perhaps the best thing would be for me to get that setup and we can experiment against version 8.0.3.
--
Jim C. Nasby, Database Consultant [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828
Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
Re: [PERFORM] Bad plan after vacuum analyze
Josh, Tom,

Thanks for your explanations.

> In the meantime it seems like the quickest answer for Guillaume might
> be to try to avoid keeping any NULLs in parent_application_id.

I can't do that, as the majority of the applications don't have a parent. Moreover, we use a third-party application and we cannot modify all its internals.

Anyway, I tried to work on the statistics as you told me, and here are the results:

ccm_perf=# ALTER TABLE acs_objects ALTER COLUMN object_id SET STATISTICS 30;
ALTER TABLE
ccm_perf=# ANALYZE acs_objects;
ANALYZE
ccm_perf=# \i query_section.sql
... correct plan ...
Total runtime: 0.555 ms

So I think I will use this solution for the moment.

Thanks a lot for your help.

Regards
--
Guillaume
Re: [PERFORM] Bad plan after vacuum analyze
Ah-ha, I can replicate the problem. This example uses tenk1 from the regression database, which has a column unique2 containing just the integers 0...

regression=# create table t1(f1 int);
CREATE TABLE
regression=# insert into t1 values(5);
INSERT 154632 1
regression=# insert into t1 values(7);
INSERT 154633 1
regression=# analyze t1;
ANALYZE
regression=# explain analyze select * from tenk1 right join t1 on (unique2=f1);
                                      QUERY PLAN
---------------------------------------------------------------------------------
 Merge Right Join  (cost=1.03..1.37 rows=2 width=248) (actual time=0.507..0.617 rows=2 loops=1)
   Merge Cond: ("outer".unique2 = "inner".f1)
   ->  Index Scan using tenk1_unique2 on tenk1  (cost=0.00..498.24 rows=10024 width=244) (actual time=0.126..0.242 rows=9 loops=1)
   ->  Sort  (cost=1.03..1.03 rows=2 width=4) (actual time=0.145..0.153 rows=2 loops=1)
         Sort Key: t1.f1
         ->  Seq Scan on t1  (cost=0.00..1.02 rows=2 width=4) (actual time=0.029..0.049 rows=2 loops=1)
 Total runtime: 1.497 ms
(7 rows)

The planner correctly perceives that only a small part of the unique2 index will need to be scanned, and hence thinks the merge is cheap --- much cheaper than if the whole index had to be scanned. And it is. Notice that only 9 rows were actually pulled from the index. Once we got to unique2 = 8, nodeMergejoin.c could see that no more matches to f1 were possible.
But watch this:

regression=# insert into t1 values(null);
INSERT 154634 1
regression=# explain analyze select * from tenk1 right join t1 on (unique2=f1);
                                      QUERY PLAN
---------------------------------------------------------------------------------
 Merge Right Join  (cost=1.03..1.37 rows=2 width=248) (actual time=0.560..290.874 rows=3 loops=1)
   Merge Cond: ("outer".unique2 = "inner".f1)
   ->  Index Scan using tenk1_unique2 on tenk1  (cost=0.00..498.24 rows=10024 width=244) (actual time=0.139..106.982 rows=1 loops=1)
   ->  Sort  (cost=1.03..1.03 rows=2 width=4) (actual time=0.181..0.194 rows=3 loops=1)
         Sort Key: t1.f1
         ->  Seq Scan on t1  (cost=0.00..1.02 rows=2 width=4) (actual time=0.032..0.067 rows=3 loops=1)
 Total runtime: 291.670 ms
(7 rows)

See what happened to the actual costs of the indexscan? All of a sudden we had to scan the whole index, because there was a null in the other input, and nulls sort high.

I wonder if it is worth fixing nodeMergejoin.c to not even try to match nulls to the other input. We'd have to add a check to see if the join operator is strict or not, but it nearly always will be. The alternative would be to make the planner only believe in the short-circuit path occurring if it thinks that the other input is entirely non-null ... but this seems pretty fragile, since it only takes one null to mess things up, and ANALYZE can hardly be counted on to detect one null in a table.

In the meantime it seems like the quickest answer for Guillaume might be to try to avoid keeping any NULLs in parent_application_id.

regards, tom lane
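Until something like the strict-operator check sketched above lands in nodeMergejoin.c, the practical workaround is the one suggested: keep NULLs out of the join column, or filter them before the join. A hedged sketch against the toy tables from this example (note the second option changes the result set):

```sql
-- Option 1: disallow NULLs at the schema level, so the merge join's
-- early-exit path stays valid (the sorted inner input never produces a
-- NULL that forces scanning to the end of the other input's index).
ALTER TABLE t1 ALTER COLUMN f1 SET NOT NULL;

-- Option 2: if NULLs must stay in the table, join against a NULL-free
-- subquery; a NULL f1 can never match unique2 anyway. Caveat: unlike the
-- original right join, this drops the NULL rows from the output entirely.
SELECT *
FROM tenk1
RIGHT JOIN (SELECT f1 FROM t1 WHERE f1 IS NOT NULL) AS t1n
       ON tenk1.unique2 = t1n.f1;
```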
Re: [PERFORM] Partitioning / Clustering
On Wed, 2005-05-11 at 17:13 +0800, Christopher Kings-Lynne wrote:
> > Alex Stapleton wrote
> > Be more helpful, and less arrogant please.
>
> Simon told you all the reasons clearly and politely.

Thanks Chris for your comments. PostgreSQL can always do with one more developer, and my sole intent was to encourage Alex and other readers to act themselves. If my words seem arrogant, then I apologise to any and all that think so.

Best Regards, Simon Riggs
Re: [PERFORM] Partitioning / Clustering
Alex Stapleton <[EMAIL PROTECTED]> writes:
> Acceptable Answers to 'So, when/is PG meant to be getting a decent
> partitioning system?': ...
> 3. Your welcome to take a stab at it, I expect the community would
> support your efforts as well.

As long as we're being curt all around, this one's not acceptable on the basis that it's not grammatical.

-- greg
Re: [PERFORM] Bad plan after vacuum analyze
> Well, those stats certainly appear to justify the planner's belief that
> the indexscan needn't run very far: the one value of
> parent_application_id is 1031 and this is below the smallest value of
> object_id seen by analyze.

Yes, it seems rather logical, but then why does it cost so much if it should be an effective way to find the row?

> You might have better luck if you increase
> the statistics target for acs_objects.object_id.

What do you mean exactly?

> (It'd be interesting
> to know what fraction of acs_objects actually does have object_id < 1032.)

ccm_perf=# SELECT COUNT(*) FROM acs_objects WHERE object_id < 1032;
 count
-------
    15

ccm_perf=# SELECT COUNT(*) FROM acs_objects;
 count
-------
 33510

--
Guillaume
Re: [PERFORM] Bad plan after vacuum analyze
Guillaume Smet <[EMAIL PROTECTED]> writes:
>> If so, can we see the pg_stats rows for the object_id and
>> parent_application_id columns?
> See attached file.

Well, those stats certainly appear to justify the planner's belief that the indexscan needn't run very far: the one value of parent_application_id is 1031, and this is below the smallest value of object_id seen by analyze. You might have better luck if you increase the statistics target for acs_objects.object_id. (It'd be interesting to know what fraction of acs_objects actually does have object_id < 1032.)

regards, tom lane
Re: [PERFORM] Bad plan after vacuum analyze
Tom,

> So, the usual questions: have these two tables been ANALYZEd lately?

Yes, of course. As I wrote in my previous mail, here is how I reproduce the problem:
- we load the dump in a new database (to be sure there is no problem on an index or something like that)
- query: it's fast (< 1 ms)
- *VACUUM FULL ANALYZE;*
- query: it's really slow (130 ms) and it's another plan
- set enable_seqscan=off;
- query: it's fast (< 1 ms): it uses the best plan

I reproduced it on two different servers exactly like that (7.4.5 and 7.4.7). I first met the problem on a production database with a VACUUM ANALYZE run every night (and we don't have too many inserts a day on this database).

> If so, can we see the pg_stats rows for the object_id and
> parent_application_id columns?

See the attached file. If you're interested in a dump of these tables, just tell me. There is no confidential information in them.

Regards
--
Guillaume

pg_stats for acs_objects.object_id:
 schemaname        | public
 tablename         | acs_objects
 attname           | object_id
 null_frac         | 0
 avg_width         | 4
 n_distinct        | -1
 most_common_vals  |
 most_common_freqs |
 histogram_bounds  | {1032,34143,112295,120811,285004,420038,449980,453451,457684,609292,710005}
 correlation       | 0.488069
(1 row)

pg_stats for applications.parent_application_id:
 schemaname        | public
 tablename         | applications
 attname           | parent_application_id
 null_frac         | 0.928571
 avg_width         | 4
 n_distinct        | 1
 most_common_vals  | {1031}
 most_common_freqs | {0.0714286}
 histogram_bounds  |
 correlation       | 1
(1 row)
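For reference, the attached statistics rows can be reproduced with a query along these lines (a sketch; the column list is trimmed to what the thread actually discusses):

```sql
-- Inspect the planner's per-column statistics for the two join columns.
SELECT schemaname, tablename, attname,
       null_frac, n_distinct,
       most_common_vals, histogram_bounds, correlation
FROM pg_stats
WHERE (tablename = 'acs_objects'  AND attname = 'object_id')
   OR (tablename = 'applications' AND attname = 'parent_application_id');
```

The high null_frac (0.928571) and single non-null value of parent_application_id are exactly what the rest of the thread turns on.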
Re: [PERFORM] Prefetch
Quoting Christopher Kings-Lynne <[EMAIL PROTECTED]>:
> > Another trick you can use with large data sets like this when you want
> > results back in seconds is to have regularly updated tables that
> > aggregate the data along each column normally aggregated against the
> > main data set.
> > Maybe some bright person will prove me wrong by posting some working
> > information about how to get these apparently absent features working.
>
> Most people just use simple triggers to maintain aggregate summary tables...

Don't know if this is more appropriate to bizgres, but: what the first poster is talking about is what OLAP cubes do.

For big aggregating systems (OLAP), triggers perform poorly compared to messy hand-rolled code. You may have dozens of aggregates at various levels. Consider the effect of having each detail row cascade into twenty updates.

It's particularly silly-looking when data is coming in as batches of thousands of rows in a single insert, e.g.

COPY temp_table FROM STDIN;
UPDATE fact_table ... FROM ... temp_table
INSERT INTO fact_table ... FROM ... temp_table

(the above pair of operations is so common, Oracle added its "MERGE" operator for it). Hence my recent post (request) for using RULES to aggregate, given no luck with triggers "FOR EACH STATEMENT".
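The batched update-then-insert pair described above (what Oracle's MERGE collapses into one statement) looks roughly like this in PostgreSQL of that era, which had no MERGE. All table and column names here are hypothetical:

```sql
-- Load the incoming batch into a staging table in one round trip.
COPY temp_sales FROM STDIN;

-- Fold the batch into the aggregate: first bump the groups that exist...
UPDATE sales_daily d
SET    total = d.total + t.total,
       cnt   = d.cnt   + t.cnt
FROM  (SELECT day, sum(amount) AS total, count(*) AS cnt
       FROM temp_sales GROUP BY day) t
WHERE  d.day = t.day;

-- ...then create the groups that did not exist yet.
INSERT INTO sales_daily (day, total, cnt)
SELECT day, sum(amount), count(*)
FROM   temp_sales t
WHERE  NOT EXISTS (SELECT 1 FROM sales_daily d WHERE d.day = t.day)
GROUP BY day;
```

Two statements per aggregate per batch, instead of a per-row trigger firing thousands of times; this is the trade-off between trigger simplicity and batch performance the post is making.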
Re: [PERFORM] Bad plan after vacuum analyze
Josh Berkus writes:
> -> Merge Right Join (cost=8.92..9.26 rows=1 width=529) (actual time=129.100..129.103 rows=1 loops=1)
>      Merge Cond: ("outer".object_id = "inner".parent_application_id)
>      -> Index Scan using acs_objects_object_id_p_hhkb1 on acs_objects t98 (cost=0.00..2554.07 rows=33510 width=81) (actual time=0.043..56.392 rows=33510 loops=1)
>      -> Sort (cost=8.92..8.93 rows=1 width=452) (actual time=0.309..0.310 rows=1 loops=1)
>            Sort Key: t22.parent_application_id
>
> Here the planner chooses a merge right join. This decision seems to have been
> made entirely on the basis of the cost of the join itself (total of 17)
> without taking the cost of the sort and index access (total of 2600+) into
> account.
>
> Tom, is this a possible error in planner logic?

No, it certainly hasn't forgotten to add in the costs of the inputs. There might be a bug here, but if so it's much more subtle than that.

It looks to me like the planner believes that the one value of t22.parent_application_id joins to something very early in the acs_objects_object_id_p_hhkb1 sort order, and that it will therefore not be necessary to run the indexscan to completion (or indeed very far at all, considering that it's including such a small fraction of the total indexscan cost).

[EMAIL PROTECTED] pointed out recently that this effect doesn't apply to the outer side of an outer join; releases before 7.4.8 mistakenly think it does. But unless my wires are totally crossed today, acs_objects is the nullable side here and so that error isn't applicable anyway.

So, the usual questions: have these two tables been ANALYZEd lately? If so, can we see the pg_stats rows for the object_id and parent_application_id columns?

regards, tom lane
Re: [PERFORM] Bad plan after vacuum analyze
Guillaume,

> We reproduced the problem on a 7.4.5 and on a 7.4.7 server.
> * we load the dump in a new database
> * query: it's fast (< 1ms)
> * VACUUM FULL ANALYZE;
> * query: it's really slow (130ms) and it's another plan
> * set enable_seqscan=off;
> * query: it's fast (< 1ms) : it uses the best plan

Looking at this, the planner seems convinced that the merge join is the easiest way to do the OUTER JOINs, but it appears to be wrong; a nested loop is faster. This isn't the only place I've encountered our optimizer doing this: underestimating the cost of a merge join. This seems to be because the merge_join vs. nested_loop decision is being made in the planner without taking the double-sort and index access costs into account. This query is an excellent example:

"good" plan:
 Nested Loop Left Join  (cost=2.44..17.36 rows=1 width=5532) (actual time=0.441..0.466 rows=1 loops=1)
   Join Filter: ("outer".parent_application_id = "inner".application_id)
   ->  Nested Loop Left Join  (cost=2.44..15.73 rows=1 width=5214) (actual time=0.378..0.402 rows=1 loops=1)

See, here the planner thinks that the 2 nested loops will cost "35".
"bad" plan:
 Merge Right Join  (cost=9.27..9.48 rows=1 width=545) (actual time=129.364..129.365 rows=1 loops=1)
   Merge Cond: ("outer".application_id = "inner".parent_application_id)
   ->  Index Scan using applicati_applicati_id_p_ogstm on applications t116  (cost=0.00..5.51 rows=28 width=20) (actual time=0.030..0.073 rows=28 loops=1)
   ->  Sort  (cost=9.27..9.27 rows=1 width=529) (actual time=129.202..129.203 rows=1 loops=1)
         Sort Key: t22.parent_application_id
         ->  Merge Right Join  (cost=8.92..9.26 rows=1 width=529) (actual time=129.100..129.103 rows=1 loops=1)
               Merge Cond: ("outer".object_id = "inner".parent_application_id)
               ->  Index Scan using acs_objects_object_id_p_hhkb1 on acs_objects t98  (cost=0.00..2554.07 rows=33510 width=81) (actual time=0.043..56.392 rows=33510 loops=1)
               ->  Sort  (cost=8.92..8.93 rows=1 width=452) (actual time=0.309..0.310 rows=1 loops=1)
                     Sort Key: t22.parent_application_id

Here the planner chooses a merge right join. This decision seems to have been made entirely on the basis of the cost of the join itself (total of 17) without taking the cost of the sort and index access (total of 2600+) into account. Tom, is this a possible error in planner logic?

--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
[PERFORM] Bad plan after vacuum analyze
Hi,

We have a performance problem on a particular query. We reproduced the problem on a 7.4.5 and on a 7.4.7 server:
* we load the dump in a new database
* query: it's fast (< 1 ms)
* VACUUM FULL ANALYZE;
* query: it's really slow (130 ms) and it's another plan
* set enable_seqscan=off;
* query: it's fast (< 1 ms): it uses the best plan

I attached the EXPLAIN ANALYZE outputs, the query and the tables description. I really can't understand why the planner chooses this plan, and especially the line:
-> Index Scan using acs_objects_object_id_p_hhkb1 on acs_objects t98 (cost=0.00..2554.07 rows=33510 width=81) (actual time=0.043..56.392 rows=33510 loops=1)

I never saw an index scan on such a number of rows. For your information, there are 33510 rows in this table, so it scans the whole table. The problem seems to be the left join on the acs_objects t98 table for the parent_application_id: if I remove it, or change it to a subquery, it's OK. The query is automatically generated by a persistence layer, so I can't really rewrite it.
Thanks for any help

Regards
--
Guillaume

                                      QUERY PLAN
---------------------------------------------------------------------------------
 Merge Right Join  (cost=9.27..9.48 rows=1 width=545) (actual time=129.364..129.365 rows=1 loops=1)
   Merge Cond: ("outer".application_id = "inner".parent_application_id)
   ->  Index Scan using applicati_applicati_id_p_ogstm on applications t116  (cost=0.00..5.51 rows=28 width=20) (actual time=0.030..0.073 rows=28 loops=1)
   ->  Sort  (cost=9.27..9.27 rows=1 width=529) (actual time=129.202..129.203 rows=1 loops=1)
         Sort Key: t22.parent_application_id
         ->  Merge Right Join  (cost=8.92..9.26 rows=1 width=529) (actual time=129.100..129.103 rows=1 loops=1)
               Merge Cond: ("outer".object_id = "inner".parent_application_id)
               ->  Index Scan using acs_objects_object_id_p_hhkb1 on acs_objects t98  (cost=0.00..2554.07 rows=33510 width=81) (actual time=0.043..56.392 rows=33510 loops=1)
               ->  Sort  (cost=8.92..8.93 rows=1 width=452) (actual time=0.309..0.310 rows=1 loops=1)
                     Sort Key: t22.parent_application_id
                     ->  Nested Loop  (cost=2.44..8.91 rows=1 width=452) (actual time=0.259..0.287 rows=1 loops=1)
                           ->  Hash Join  (cost=2.44..3.68 rows=1 width=339) (actual time=0.227..0.251 rows=1 loops=1)
                                 Hash Cond: ("outer".application_type_id = "inner".application_type_id)
                                 ->  Seq Scan on application_types t47  (cost=0.00..1.15 rows=15 width=28) (actual time=0.009..0.025 rows=15 loops=1)
                                 ->  Hash  (cost=2.44..2.44 rows=1 width=315) (actual time=0.121..0.121 rows=0 loops=1)
                                       ->  Hash Join  (cost=1.01..2.44 rows=1 width=315) (actual time=0.073..0.112 rows=1 loops=1)
                                             Hash Cond: ("outer".application_id = "inner".section_id)
                                             ->  Seq Scan on applications t22  (cost=0.00..1.28 rows=28 width=70) (actual time=0.002..0.027 rows=28 loops=1)
                                             ->  Hash  (cost=1.01..1.01 rows=1 width=245) (actual time=0.026..0.026 rows=0 loops=1)
                                                   ->  Seq Scan on content_sections t0  (cost=0.00..1.01 rows=1 width=245) (actual time=0.020..0.022 rows=1 loops=1)
                           ->  Index Scan using acs_objects_object_id_p_hhkb1 on acs_objects t7  (cost=0.00..5.22 rows=1 width=121) (actual time=0.016..0.017 rows=1 loops=1)
                                 Index Cond: ("outer".application_id = t7.object_id)
 Total runtime: 129.754 ms
(23 rows)

SET
                                      QUERY PLAN
---------------------------------------------------------------------------------
 Merge Left Join  (cost=17.48..18.03 rows=1 width=545) (actual time=0.653..0.656 rows=1 loops=1)
   Merge Cond: ("outer".parent_application_id = "inner".application_id)
   ->  Merge Left Join  (cost=17.48..17.81 rows=1 width=529) (actual time=0.616..0.618 rows=1 loops=1)
         Merge Cond: ("outer".parent_application_id = "inner".object_id)
         ->  Sort  (cost=17.48..17.49 rows=1 width=452) (actual time=0.587..0.588 rows=1 loops=1)
               Sort Key: t22.parent_application_id
               ->  Nested Loop  (cost=4.68..17.47 rows=1 width=452) (actual t
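The reproduction steps from the report can be sketched as a psql session; the actual query is elided in the original (it was attached), so `SELECT ...` below is a placeholder, not the real statement:

```sql
-- Freshly loaded database: the query is fast (< 1 ms).
EXPLAIN ANALYZE SELECT ...;   -- good nested-loop plan

-- After VACUUM FULL ANALYZE the planner switches to the merge join
-- and the same query takes ~130 ms.
VACUUM FULL ANALYZE;
EXPLAIN ANALYZE SELECT ...;   -- bad merge-join plan

-- Disabling seqscans pushes it back to the fast plan, which suggests
-- a costing problem rather than a missing index.
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT ...;   -- fast again (< 1 ms)
```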
Re: [PERFORM] Partitioning / Clustering
David,

> It's interesting that the solution livejournal have arrived at is quite
> similar in ways to the way google is set up.

Yes, although again, they're using memcached as pseudo-clustering software, and as a result are limited to what fits in RAM (RAM on 27 machines, but it's still RAM). And due to limitations on memcached, the whole thing blows whenever a server goes out (the memcached project is working on this). But any LJ user could tell you that it's a low-availability system.

However, memcached (and for us, pg_memcached) is an excellent way to improve horizontal scalability by taking disposable data (like session information) out of the database and putting it in protected RAM. On some websites, adding memcached can result in as much as a 60% decrease in database traffic.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
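The pattern Josh describes — keeping disposable data like sessions in cache and only falling back to the database on a miss — is the usual cache-aside approach. A minimal sketch, with a plain dict standing in for memcached and `load_from_db` as a made-up placeholder for a real query (all names here are illustrative, not from the thread):

```python
# Cache-aside lookup for disposable session data. The dict stands in
# for memcached; load_from_db() is a hypothetical database query.

cache = {}
db_hits = 0

def load_from_db(session_id):
    # Pretend database fetch; count hits so we can see the savings.
    global db_hits
    db_hits += 1
    return {"user": "alice", "sid": session_id}

def get_session(session_id):
    # Serve from cache when possible; on a miss, load from the database
    # and populate the cache so repeat lookups skip the database.
    if session_id in cache:
        return cache[session_id]
    data = load_from_db(session_id)
    cache[session_id] = data
    return data

get_session(42)
get_session(42)   # second lookup never touches the "database"
print(db_hits)
```

Losing the cache is harmless here — the data reloads from the database — which is exactly why this only suits disposable data.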
Re: [PERFORM] Sort and index
On Sun, 24 Apr 2005 17:01:46 -0500, "Jim C. Nasby" <[EMAIL PROTECTED]> wrote: >> >> Feel free to propose better cost equations. I did. More than once. >estimated index scan cost for (project_id, id, date) is >0.00..100117429.34 while the estimate for work_units is >0.00..103168408.62; almost no difference, ~3% > even though project_id correlation is .657 This is divided by the number of index columns, so the index correlation is estimated to be 0.219. > while work_units correlation is .116. So csquared is 0.048 and 0.013, respectively, and you get a result not far away from the upper bound in both cases. The cost estimations differ by only 3.5% of (max_IO_cost - min_IO_cost). >you'll see that the cost of the index scan is way overestimated. Looking >at the code, the runcost is calculated as > >run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost); > >where csquared is indexCorrelation^2. Why is indexCorrelation squared? >The comments say a linear interpolation between min_IO and max_IO is >used, but ISTM that if it was linear then instead of csquared, >indexCorrelation would just be used. In my tests I got much more plausible results with 1 - (1 - abs(correlation))^2 Jim, are you willing to experiment with one or two small patches of mine? What version of Postgres are you running? Servus Manfred ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
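The difference between the two weightings is easy to see numerically. A small sketch of the interpolation quoted above, using the two correlations from this thread (0.219 and 0.657) and made-up I/O costs (not real planner output):

```python
# Interpolate index scan cost between best case (min_IO_cost, perfectly
# correlated index) and worst case (max_IO_cost, uncorrelated heap order).
# min_io/max_io are illustrative numbers only.

min_io, max_io = 100.0, 100_000.0

def run_cost(correlation, weight):
    # Mirrors the quoted cost_index() line: start at the worst case and
    # credit back a fraction of the gap based on the correlation weight.
    return max_io + weight(correlation) * (min_io - max_io)

csquared = lambda c: c * c                   # current: indexCorrelation^2
proposed = lambda c: 1 - (1 - abs(c)) ** 2   # Manfred's suggestion

for c in (0.219, 0.657):
    print(c, round(run_cost(c, csquared)), round(run_cost(c, proposed)))
```

Squaring a correlation below 1 shrinks it, so `csquared` keeps both estimates pinned near `max_io`, while the proposed formula gives a much larger credit at moderate correlations — matching Manfred's observation that the two estimates differ by only a few percent of the max-min gap.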
Re: [PERFORM] Optimizer wrongly picks Nested Loop Left Join
John A Meinel <[EMAIL PROTECTED]> writes: > Unfortunately, because Hash Join doesn't report the number of rows > (rows=0 always), it's hard to tell how good the estimator is. This is only a cosmetic problem because you can just look at the number of rows actually emitted by the Hash node's child; that's always exactly the number loaded into the hashtable. (But having said that, it is fixed in CVS tip.) You may be confused though --- the Hash node is not the Hash Join node. A zero report from Hash Join does actually mean that it emitted zero rows. regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Prefetch
Christopher Kings-Lynne wrote:
>> Another trick you can use with large data sets like this when you want
>> results back in seconds is to have regularly updated tables that
>> aggregate the data along each column normally aggregated against the
>> main data set.
>>
>> Maybe some bright person will prove me wrong by posting some working
>> information about how to get these apparently absent features working.
>
> Most people just use simple triggers to maintain aggregate summary tables...
>
> Chris

However, if (insert) triggers prove to be too much of a performance hit, try cron'd functions that perform the aggregation for you. This system works well for us, using the pk's (sequence) for start and stop points.

--
___
This e-mail may be privileged and/or confidential, and the sender does not waive any related rights and obligations. Any distribution, use or copying of this e-mail or the information it contains by other than an intended recipient is unauthorized. If you received this e-mail in error, please advise me (by return e-mail or otherwise) immediately.
___

---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
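The cron'd variant — remember the last sequence value processed, then fold in only newer rows on each run — can be sketched like this. Purely illustrative: in-memory lists stand in for the fact and summary tables, and every name here is made up:

```python
# Incremental batch aggregation keyed off the primary-key sequence,
# in the spirit of the cron'd approach above. rows is the "fact table";
# summary is the persisted summary "table".

rows = [(1, 10), (2, 20), (3, 30)]        # (id, value) fact rows
summary = {"last_id": 0, "total": 0}

def refresh_summary():
    # Aggregate only rows inserted since the previous run, using the
    # pk sequence as the start point; record the new stop point.
    start = summary["last_id"]
    new = [(i, v) for i, v in rows if i > start]
    if new:
        summary["total"] += sum(v for _, v in new)
        summary["last_id"] = max(i for i, _ in new)

refresh_summary()          # first pass covers ids 1..3
rows.append((4, 40))       # a later insert
refresh_summary()          # second pass folds in only id 4
print(summary["total"])
```

Each run touches only the new rows, so the batch cost stays proportional to insert volume rather than table size — the trade-off versus triggers is that the summary lags until the next run. (Note this simple start/stop scheme assumes append-only inserts; updates and deletes need extra handling.)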
Re: [PERFORM] Optimizer wrongly picks Nested Loop Left Join
"Edin Kadribasic" <[EMAIL PROTECTED]> writes: > I have a query that is giving the optimizer (and me) great headache. The main problem seems to be that the rowcount estimates for axp_temp_order_match and axp_dayschedule are way off: > -> Index Scan using axp_temp_order_match_idx1 on > axp_temp_order_match a (cost=0.00..209.65 rows=426 width=4) (actual > time=0.277..0.512 rows=6 loops=1) >Index Cond: (sid = 16072) >-> Index Scan using axp_dayschedule_day_idx on > axp_dayschedule ds (cost=0.00..3.02 rows=1 width=8) (actual > time=0.036..3.973 rows=610 loops=1) > Index Cond: (("day" >= '2005-05-12'::date) > AND ("day" <= '2005-05-12'::date)) > Filter: (used = B'1'::"bit") > -> Index Scan using axp_temp_order_match_idx1 on > axp_temp_order_match a (cost=0.00..2.45 rows=1 width=4) (actual > time=0.027..2.980 rows=471 loops=1) >Index Cond: (sid = 16092) > -> Index Scan using axp_dayschedule_day_idx on > axp_dayschedule ds (cost=0.00..3.02 rows=1 width=8) (actual > time=0.015..3.557 rows=606 loops=471) >Index Cond: (("day" >= '2005-05-13'::date) AND > ("day" <= '2005-05-13'::date)) >Filter: (used = B'1'::"bit") Do you ANALYZE these tables on a regular basis? If so, it may be necessary to increase the statistics target to the point where you get better estimates. > Please note that sometimes when I get "bad plan" in the logfile, I just > re-run the query and the optimizer chooses the more efficient one. That's fairly hard to believe, unless you've got autovacuum running in the background. regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL
Tom Lane wrote:
> Performance?

I'll run some benchmarks tomorrow, as it's rather late in my time zone. If anyone wants to post some benchmark results, they are welcome to.

> I disagree completely with the idea of forcing this behavior for all
> datatypes. It could only be sensible for fairly wide values; you don't
> save enough to justify the lossiness otherwise.

I think it would be premature to decide about this before we see some performance numbers. I'm not fundamentally opposed, though.

> [ BTW, posting patches to pgsql-general seems pretty off-topic. ]

Not any more than discussing implementation details is :) But your point is well taken, I'll send future patches to -patches.

-Neil

---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] Optimizer wrongly picks Nested Loop Left Join
Edin Kadribasic wrote: Hi, I have a query that is giving the optimizer (and me) great headache. When its in the good mood the optimizer chooses Hash Left Join and the query executes in 13ms or so, but sometimes (more and more often) it chooses Nested Loop Left Join and the execution time goes up to 2-30sec. The query: SELECT COUNT(DISTINCT a.tid) FROM axp_temp_order_match a LEFT OUTER JOIN ( SELECT ol.tid, ds.orid FROM axp_dayschedule ds JOIN axp_order_line ol ON ol.olid = ds.olid JOIN axp_order o ON ds.orid = o.orid WHERE o.status >= 100 AND ds.day between '2005-05-12' and '2005-05-12' AND ds.used = '1' ) b ON (a.tid = b.tid) WHERE b.tid IS NULL AND a.sid = 16072; Unfortunately, because Hash Join doesn't report the number of rows (rows=0 always), it's hard to tell how good the estimator is. But I *can* say that the NestLoop estimation is way off. Good plan: = Aggregate (cost=221.93..221.93 rows=1 width=4) (actual time=34.262..34.266 rows=1 loops=1) -> Hash Left Join (cost=9.07..220.86 rows=426 width=4) (actual time=34.237..34.237 rows=0 loops=1) Hash Cond: ("outer".tid = "inner".tid) Filter: ("inner".tid IS NULL) -> Index Scan using axp_temp_order_match_idx1 on axp_temp_order_match a (cost=0.00..209.65 rows=426 width=4) (actual time=0.277..0.512 rows=6 loops=1) Index Cond: (sid = 16072) -> Hash (cost=9.07..9.07 rows=1 width=4) (actual time=32.777..32.777 rows=0 loops=1) -> Nested Loop (cost=0.00..9.07 rows=1 width=4) (actual time=0.208..31.563 rows=284 loops=1) -> Nested Loop (cost=0.00..6.05 rows=1 width=4) (actual time=0.178..20.684 rows=552 loops=1) -> Index Scan using axp_dayschedule_day_idx on axp_dayschedule ds (cost=0.00..3.02 rows=1 width=8) (actual time=0.036..3.973 rows=610 loops=1) Index Cond: (("day" >= '2005-05-12'::date) AND ("day" <= '2005-05-12'::date)) Filter: (used = B'1'::"bit") -> Index Scan using axp_order_orid_key on axp_order o (cost=0.00..3.02 rows=1 width=4) (actual time=0.009..0.013 rows=1 loops=610) Index Cond: ("outer".orid = o.orid) 
Filter: (status >= 100) -> Index Scan using axp_order_line_pk on axp_order_line ol (cost=0.00..3.01 rows=1 width=8) (actual time=0.006..0.008 rows=1 loops=552) Index Cond: (ol.olid = "outer".olid) Total runtime: 34.581 ms Bad plan (same query different values): === Aggregate (cost=11.54..11.54 rows=1 width=4) (actual time=11969.281..11969.285 rows=1 loops=1) -> Nested Loop Left Join (cost=0.00..11.53 rows=1 width=4) (actual time=25.730..11967.180 rows=338 loops=1) See here, it thinks it will only have to do 1 nestloop, which would be quite fast, but it hast to do 338. Join Filter: ("outer".tid = "inner".tid) Filter: ("inner".tid IS NULL) -> Index Scan using axp_temp_order_match_idx1 on axp_temp_order_match a (cost=0.00..2.45 rows=1 width=4) (actual time=0.027..2.980 rows=471 loops=1) Index Cond: (sid = 16092) -> Nested Loop (cost=0.00..9.07 rows=1 width=4) (actual time=0.088..24.350 rows=285 loops=471) Same thing here. -> Nested Loop (cost=0.00..6.04 rows=1 width=8) (actual time=0.067..15.649 rows=317 loops=471) And here. -> Index Scan using axp_dayschedule_day_idx on axp_dayschedule ds (cost=0.00..3.02 rows=1 width=8) (actual time=0.015..3.557 rows=606 loops=471) This estimate is way off too, but it is off in both plans. Index Cond: (("day" >= '2005-05-13'::date) AND ("day" <= '2005-05-13'::date)) Filter: (used = B'1'::"bit") -> Index Scan using axp_order_line_pk on axp_order_line ol (cost=0.00..3.01 rows=1 width=8) (actual time=0.006..0.008 rows=1 loops=285426) This is probably what is killing you. It is doing a single lookup 285k times. The above plan only does it 552 times. 
Index Cond: (ol.olid = "outer".olid) -> Index Scan using axp_order_orid_key on axp_order o (cost=0.00..3.02 rows=1 width=4) (actual time=0.009..0.013 rows=1 loops=149307) Index Cond: ("outer".orid = o.orid) Filter: (status >= 100) Total runtime: 11969.443 ms

Please note that sometimes when I get "bad plan" in the logfile, I just re-run the query and the optimizer chooses the more efficient one. Sometimes it does not.

Your work_mem is quite high relative to your total RAM; hopefully you don't have many allowed concurrent connections. But that is a side point. I assume the tables are freshly VACUUM ANALYZEd. Have you tried altering the statistics for the columns? One to look at is axp_dayschedule(day). That one seems to be consisten
Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL
Neil Conway <[EMAIL PROTECTED]> writes: > I'm posting mainly because I wasn't sure what to do to avoid false positives > in > the case of hash collisions. In the hash AM code it is somewhat awkward to > fetch the pointed-to heap tuple and recheck the scankey.[1] I just did the > first thing that came to mind -- I marked all the hash AM opclasses as > "lossy", > so the index qual is rechecked. This works, but suggestions for a better way > to > do things would be welcome. I would have thought that would be the only way worth considering. Consider for example a query involving two or more hash indexes and the new bitmap indexscan plan. You don't want to fetch the tuples if you can eliminate them using one of the other indexes. -- greg ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL
Neil Conway <[EMAIL PROTECTED]> writes: > Greg Stark wrote: >> What if the hash index stored *only* the hash code? > Attached is a WIP patch that implements this. Performance? > I'm posting mainly because I wasn't sure what to do to avoid false > positives in the case of hash collisions. In the hash AM code it is > somewhat awkward to fetch the pointed-to heap tuple and recheck the > scankey.[1] I just did the first thing that came to mind -- I marked all > the hash AM opclasses as "lossy", so the index qual is rechecked. This > works, but suggestions for a better way to do things would be welcome. AFAICS that's the *only* way to do it. I disagree completely with the idea of forcing this behavior for all datatypes. It could only be sensible for fairly wide values; you don't save enough to justify the lossiness otherwise. It would be interesting to look into whether it could be driven on a per-opclass basis. Then you could have, eg, "text_lossy_hash_ops" as a non-default opclass the DBA could select if he wanted this behavior. (The code could perhaps use the amopreqcheck flag to tell it which way to behave.) If that seems unworkable, I'd prefer to see us set this up as a new index AM type, which would share a lot of code with the old. [ BTW, posting patches to pgsql-general seems pretty off-topic. ] regards, tom lane ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] Prefetch
On Wed, 2005-05-11 at 12:53 +0800, Christopher Kings-Lynne wrote: > > Another trick you can use with large data sets like this when you want > > results > > back in seconds is to have regularly updated tables that aggregate the data > > along each column normally aggregated against the main data set. > > > Maybe some bright person will prove me wrong by posting some working > > information about how to get these apparently absent features working. > > Most people just use simple triggers to maintain aggregate summary tables... Agreed. I've also got a view which calls a function that will 1) use the summary table where data exists, or 2) calculate the summary information, load it into summary table, and send a copy to the client (partial query results cache). It's not all nicely abstracted behind user friendly syntax, but most of those features can be cobbled together (with effort) in PostgreSQL. -- ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] Partitioning / Clustering
Mischa Sandberg <[EMAIL PROTECTED]> writes: > So, simplicity dictates something like: > table pg_remote(schemaname text, connectby text, remoteschema text) Previous discussion of this sort of thing concluded that we wanted to follow the SQL-MED standard. regards, tom lane ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Query tuning help
Hi Dan,

I tried to understand your query, but I couldn't get my understanding of the query and your description in sync. Why do you use sub selects? Wouldn't a simple "recordtext like '%RED%'" do the trick too? You combine all your where conditions with AND. To me this looks like you get only rows with RED and CORVETTE. From your description I would rewrite the query as

explain analyze
select distinct em.incidentid, ea.recordtext as retdata, eg.long, eg.lat
from ea
  join em using (incidentid)
  join eg using (incidentid)
where em.entrydate >= '2005-1-1 00:00'::date
  and em.entrydate <= '2005-5-9 00:00'::date
  and (recordtext like '%RED%' or recordtext like '%CORVETTE%')
order by em.entrydate

That should give you all rows containing one of the words. Does it work? Is it faster? Is it fast enough?

Ulrich

---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] Partitioning / Clustering
> Acceptable Answers to 'So, when/is PG meant to be getting a decent
> partitioning system?':
>
> 1. Person X is working on it I believe.
> 2. It's on the list, but nobody has done anything about it yet
> 3. You're welcome to take a stab at it, I expect the community would
>    support your efforts as well.
> 4. If you have a huge pile of money you could probably buy the Moon.
>    Thinking along those lines, you can probably pay someone to write it
>    for you.
> 5. It's a stupid idea, and it's never going to work, and here's why..
>
> Unacceptable Answers to the same question:
>
> 1. Yours.
>
> Be more helpful, and less arrogant please. Everyone else who has
> contributed to this thread has been very helpful in clarifying the
> state of affairs and pointing out what work is and isn't being done,
> and alternatives to just waiting for PG to do it for you.

Please YOU be more helpful and less arrogant. I thought your initial email was arrogant, demanding and insulting. Your followup email has done nothing to dispel my impression. Simon (one of PostgreSQL's major contributors AND one of the very few people working on partitioning in PostgreSQL, as you requested) told you all the reasons clearly and politely.

Chris

---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Partitioning / Clustering
On 11 May 2005, at 09:50, Alex Stapleton wrote: On 11 May 2005, at 08:57, David Roussel wrote: For an interesting look at scalability, clustering, caching, etc for a large site have a look at how livejournal did it. http://www.danga.com/words/2004_lisa/lisa04.pdf I have implemented similar systems in the past, it's a pretty good technique, unfortunately it's not very "Plug-and-Play" as you have to base most of your API on memcached (I imagine MySQLs NDB tables might work as well actually) for it to work well. They have 2.6 Million active users, posting 200 new blog entries per minute, plus many comments and countless page views. Although this system is of a different sort to the type I work on it's interesting to see how they've made it scale. They use mysql on dell hardware! And found single master replication did not scale. There's a section on multimaster replication, not sure if they use it. The main approach they use is to partition users into specific database clusters. Caching is done using memcached at the application level to avoid hitting the db for rendered pageviews I don't think they are storing pre-rendered pages (or bits of) in memcached, but are principally storing the data for the pages in it. Gluing pages together is not a hugely intensive process usually :) The only problem with memcached is that the client's clustering/partitioning system will probably break if a node dies, and probably get confused if you add new nodes onto it as well. Easily extensible clustering (no complete redistribution of data required when you add/remove nodes) with the data distributed across nodes seems to be nothing but a pipe dream right now. It's interesting that the solution livejournal have arrived at is quite similar in ways to the way google is set up. Don't Google use indexing servers which keep track of where data is? 
So that you only need to update them when you add or move data, deletes don't even have to be propagated among indexes immediately really because you'll find out if data isn't there when you visit where it should be. Or am I talking crap? That will teach me to RTFA first ;) Ok so LJ maintain an index of which cluster each user is on, kind of like google do :) David ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] Partitioning / Clustering
On 11 May 2005, at 08:57, David Roussel wrote: For an interesting look at scalability, clustering, caching, etc for a large site have a look at how livejournal did it. http://www.danga.com/words/2004_lisa/lisa04.pdf I have implemented similar systems in the past, it's a pretty good technique, unfortunately it's not very "Plug-and-Play" as you have to base most of your API on memcached (I imagine MySQLs NDB tables might work as well actually) for it to work well. They have 2.6 Million active users, posting 200 new blog entries per minute, plus many comments and countless page views. Although this system is of a different sort to the type I work on it's interesting to see how they've made it scale. They use mysql on dell hardware! And found single master replication did not scale. There's a section on multimaster replication, not sure if they use it. The main approach they use is to partition users into specific database clusters. Caching is done using memcached at the application level to avoid hitting the db for rendered pageviews I don't think they are storing pre-rendered pages (or bits of) in memcached, but are principally storing the data for the pages in it. Gluing pages together is not a hugely intensive process usually :) The only problem with memcached is that the client's clustering/partitioning system will probably break if a node dies, and probably get confused if you add new nodes onto it as well. Easily extensible clustering (no complete redistribution of data required when you add/remove nodes) with the data distributed across nodes seems to be nothing but a pipe dream right now. It's interesting that the solution livejournal have arrived at is quite similar in ways to the way google is set up. Don't Google use indexing servers which keep track of where data is? 
So that you only need to update them when you add or move data, deletes don't even have to be propagated among indexes immediately really because you'll find out if data isn't there when you visit where it should be. Or am I talking crap? David ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] Partitioning / Clustering
On 11 May 2005, at 08:16, Simon Riggs wrote: On Tue, 2005-05-10 at 11:03 +0100, Alex Stapleton wrote: So, when/is PG meant to be getting a decent partitioning system? ISTM that your question seems to confuse where code comes from. Without meaning to pick on you, or reply rudely, I'd like to explore that question. Perhaps it should be a FAQ entry. All code is written by someone, and those people need to eat. Some people are fully or partly funded to perform their tasks on this project (coding, patching, etc). Others contribute their time for a variety of reasons where involvement has a positive benefit. You should ask these questions: - Is anyone currently working on (Feature X)? - If not, Can I do it myself? - If not, and I still want it, can I fund someone else to build it for me? Asking "when is Feature X going to happen" is almost certainly going to get the answer "never" otherwise, if the initial development is large and complex. There are many TODO items that have lain untouched for years, even though adding the feature has been discussed and agreed. Best Regards, Simon Riggs

Acceptable Answers to 'So, when/is PG meant to be getting a decent partitioning system?':

1. Person X is working on it I believe.
2. It's on the list, but nobody has done anything about it yet
3. You're welcome to take a stab at it, I expect the community would support your efforts as well.
4. If you have a huge pile of money you could probably buy the Moon. Thinking along those lines, you can probably pay someone to write it for you.
5. It's a stupid idea, and it's never going to work, and here's why..

Unacceptable Answers to the same question:

1. Yours.

Be more helpful, and less arrogant please. Everyone else who has contributed to this thread has been very helpful in clarifying the state of affairs and pointing out what work is and isn't being done, and alternatives to just waiting for PG to do it for you. 
---(end of broadcast)--- TIP 8: explain analyze is your friend
[PERFORM] Optimizer wrongly picks Nested Loop Left Join
Hi,

I have a query that is giving the optimizer (and me) a great headache. When it's in a good mood the optimizer chooses Hash Left Join and the query executes in 13ms or so, but sometimes (more and more often) it chooses Nested Loop Left Join and the execution time goes up to 2-30 sec.

The query:

SELECT COUNT(DISTINCT a.tid)
FROM axp_temp_order_match a
LEFT OUTER JOIN (
    SELECT ol.tid, ds.orid
    FROM axp_dayschedule ds
    JOIN axp_order_line ol ON ol.olid = ds.olid
    JOIN axp_order o ON ds.orid = o.orid
    WHERE o.status >= 100
      AND ds.day between '2005-05-12' and '2005-05-12'
      AND ds.used = '1'
) b ON (a.tid = b.tid)
WHERE b.tid IS NULL AND a.sid = 16072;

Good plan:
=
 Aggregate  (cost=221.93..221.93 rows=1 width=4) (actual time=34.262..34.266 rows=1 loops=1)
   ->  Hash Left Join  (cost=9.07..220.86 rows=426 width=4) (actual time=34.237..34.237 rows=0 loops=1)
         Hash Cond: ("outer".tid = "inner".tid)
         Filter: ("inner".tid IS NULL)
         ->  Index Scan using axp_temp_order_match_idx1 on axp_temp_order_match a  (cost=0.00..209.65 rows=426 width=4) (actual time=0.277..0.512 rows=6 loops=1)
               Index Cond: (sid = 16072)
         ->  Hash  (cost=9.07..9.07 rows=1 width=4) (actual time=32.777..32.777 rows=0 loops=1)
               ->  Nested Loop  (cost=0.00..9.07 rows=1 width=4) (actual time=0.208..31.563 rows=284 loops=1)
                     ->  Nested Loop  (cost=0.00..6.05 rows=1 width=4) (actual time=0.178..20.684 rows=552 loops=1)
                           ->  Index Scan using axp_dayschedule_day_idx on axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual time=0.036..3.973 rows=610 loops=1)
                                 Index Cond: (("day" >= '2005-05-12'::date) AND ("day" <= '2005-05-12'::date))
                                 Filter: (used = B'1'::"bit")
                           ->  Index Scan using axp_order_orid_key on axp_order o  (cost=0.00..3.02 rows=1 width=4) (actual time=0.009..0.013 rows=1 loops=610)
                                 Index Cond: ("outer".orid = o.orid)
                                 Filter: (status >= 100)
                     ->  Index Scan using axp_order_line_pk on axp_order_line ol  (cost=0.00..3.01 rows=1 width=8) (actual time=0.006..0.008 rows=1 loops=552)
                           Index Cond: (ol.olid = "outer".olid)
 Total runtime: 34.581 ms

Bad plan (same query different values):
===
 Aggregate  (cost=11.54..11.54 rows=1 width=4) (actual time=11969.281..11969.285 rows=1 loops=1)
   ->  Nested Loop Left Join  (cost=0.00..11.53 rows=1 width=4) (actual time=25.730..11967.180 rows=338 loops=1)
         Join Filter: ("outer".tid = "inner".tid)
         Filter: ("inner".tid IS NULL)
         ->  Index Scan using axp_temp_order_match_idx1 on axp_temp_order_match a  (cost=0.00..2.45 rows=1 width=4) (actual time=0.027..2.980 rows=471 loops=1)
               Index Cond: (sid = 16092)
         ->  Nested Loop  (cost=0.00..9.07 rows=1 width=4) (actual time=0.088..24.350 rows=285 loops=471)
               ->  Nested Loop  (cost=0.00..6.04 rows=1 width=8) (actual time=0.067..15.649 rows=317 loops=471)
                     ->  Index Scan using axp_dayschedule_day_idx on axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual time=0.015..3.557 rows=606 loops=471)
                           Index Cond: (("day" >= '2005-05-13'::date) AND ("day" <= '2005-05-13'::date))
                           Filter: (used = B'1'::"bit")
                     ->  Index Scan using axp_order_line_pk on axp_order_line ol  (cost=0.00..3.01 rows=1 width=8) (actual time=0.006..0.008 rows=1 loops=285426)
                           Index Cond: (ol.olid = "outer".olid)
               ->  Index Scan using axp_order_orid_key on axp_order o  (cost=0.00..3.02 rows=1 width=4) (actual time=0.009..0.013 rows=1 loops=149307)
                     Index Cond: ("outer".orid = o.orid)
                     Filter: (status >= 100)
 Total runtime: 11969.443 ms

Please note that sometimes when I get "bad plan" in the logfile, I just re-run the query and the optimizer chooses the more efficient one. Sometimes it does not.

Any ideas?

postgresql-8.0.2 on 2x3.2 GHz Xeon with 2GB RAM, Linux 2.6
shared_buffers = 15000
work_mem = 128000
effective_cache_size = 20
random_page_cost = (tried 1.0 - 4, seemingly without effect on this particular issue).

Edin

---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
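A rough way to see why the bad plan above is so much slower: the inner side of a nested loop is re-executed once per outer row, so its total cost scales linearly with the loop count. A back-of-the-envelope sketch (the ~0.008 ms per-lookup figure comes from the `actual time` on the axp_order_line index scans in the plans; the rest is plain arithmetic):

```python
# Nested loop inner-side work = loops * per-lookup time.
# per_lookup_ms is taken from the plans' "actual time" figures;
# this is an estimate, not a reproduction of the real runtimes.

per_lookup_ms = 0.008

def inner_side_ms(loops):
    return loops * per_lookup_ms

good_plan = inner_side_ms(552)      # good plan: ol scanned 552 times
bad_plan = inner_side_ms(285426)    # bad plan: ol scanned 285426 times
print(round(good_plan, 1), round(bad_plan, 1))
```

Since the planner thought each node would produce 1 row (rows=1), it expected a cheap nested loop; the 471 outer rows multiplied through the levels turned that into hundreds of thousands of index lookups.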
Re: [PERFORM] Partitioning / Clustering
For an interesting look at scalability, clustering, caching, etc for a large site have a look at how livejournal did it. http://www.danga.com/words/2004_lisa/lisa04.pdf

They have 2.6 Million active users, posting 200 new blog entries per minute, plus many comments and countless page views. Although this system is of a different sort to the type I work on it's interesting to see how they've made it scale.

They use mysql on dell hardware! And found single master replication did not scale. There's a section on multimaster replication, not sure if they use it. The main approach they use is to partition users into specific database clusters. Caching is done using memcached at the application level to avoid hitting the db for rendered pageviews.

It's interesting that the solution livejournal have arrived at is quite similar in ways to the way google is set up.

David

---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] Partitioning / Clustering
Josh Berkus wrote:
> The other problem, as I was told it at OSCON, was that these were not
> high-availability clusters; it's impossible to add a server to an
> existing cluster

Yeah, that's a pretty significant problem.

> a server going down is liable to take the whole cluster down.

That's news to me. Do you have more information on this?

-Neil

---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Partitioning / Clustering
On Tue, 2005-05-10 at 11:03 +0100, Alex Stapleton wrote: > So, when/is PG meant to be getting a decent partitioning system? ISTM that your question seems to confuse where code comes from. Without meaning to pick on you, or reply rudely, I'd like to explore that question. Perhaps it should be a FAQ entry. All code is written by someone, and those people need to eat. Some people are fully or partly funded to perform their tasks on this project (coding, patching, etc). Others contribute their time for a variety of reasons where involvement has a positive benefit. You should ask these questions: - Is anyone currently working on (Feature X)? - If not, Can I do it myself? - If not, and I still want it, can I fund someone else to build it for me? Asking "when is Feature X going to happen" is almost certainly going to get the answer "never" otherwise, if the initial development is large and complex. There are many TODO items that have lain untouched for years, even though adding the feature has been discussed and agreed. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 8: explain analyze is your friend