Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Quoting Andrew Dunstan <[EMAIL PROTECTED]>: > After some more experimentation, I'm wondering about some sort of > adaptive algorithm, a bit along the lines suggested by Marko Ristola, but limited to 2 rounds. > > The idea would be that we take a sample (either of fixed size, or > some small proportion of the table), see how well it fits a larger sample > > (say a few times the size of the first sample), and then adjust the > formula accordingly to project from the larger sample the estimate for the full population. Math not worked out yet - I think we want to ensure that the result remains bounded by [d,N]. Perhaps I can save you some time (yes, I have a degree in Math). If I understand correctly, you're trying to extrapolate from the correlation between a tiny sample and a larger sample. Introducing the tiny sample into any decision can only produce a less accurate result than just taking the larger sample on its own; GIGO. Whether they are consistent with one another has no relationship to whether the larger sample correlates with the whole population. You can think of the tiny sample like "anecdotal" evidence for wonderdrugs. -- "Dreams come true, not free." -- S.Sondheim, ITW ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Joel's Performance Issues WAS : Opteron vs Xeon
Josh Berkus wrote: > Jim, Kevin, > > > > Hrm... I was about to suggest that for timing just the query (and not > > > output/data transfer time) using explain analyze, but then I remembered > > > that explain analyze can incur some non-trivial overhead with the timing > > > calls. Is there a way to run the query but have psql ignore the output? > > > If so, you could use \timing. > > > > Would timing "SELECT COUNT(*) FROM (query)" work? > > Just \timing would work fine; PostgreSQL doesn't return anything until it has > the whole result set. Hmm...does \timing show the amount of elapsed time between query start and the first results handed to it by the database (even if the database itself has prepared the entire result set for transmission by that time), or between query start and the last result handed to it by the database? Because if it's the latter, then things like server<->client network bandwidth are going to affect the results that \timing shows, and it won't necessarily give you a good indicator of how well the database backend is performing. I would expect that timing SELECT COUNT(*) FROM (query) would give you an idea of how the backend is performing, because the amount of result set data that has to go over the wire is trivial. Each is, of course, useful in its own right, and you want to be able to measure both (so, for instance, you can get an idea of just how much your network affects the overall performance of your queries). > That's why MSSQL vs. PostgreSQL timing comparisons are > deceptive unless you're careful: MSSQL returns the results one block at a > time, and reports execution time as the time required to return the *first* > block, as opposed to Postgres which reports the time required to return the > whole dataset. Interesting. I had no idea MSSQL did that, but I can't exactly say I'm surprised. :-) -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
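To make the difference concrete, here is a small psql sketch of the wrapping trick discussed above (the table name and predicate are made up for illustration): timing the bare query includes shipping every row to the client, while the count(*) wrapper makes the backend do essentially the same work but send only a single row back.

    \timing
    -- includes the time to transfer the whole result set to psql:
    SELECT * FROM some_big_table WHERE created >= '2005-04-01';
    -- roughly the same backend work, but only one row crosses the wire:
    SELECT count(*) FROM (SELECT * FROM some_big_table WHERE created >= '2005-04-01') AS sub;

Comparing the two timings gives a rough idea of how much of the elapsed time is network and client overhead rather than backend work.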
Re: [PERFORM] Table Partitioning: Will it be supported in Future?
On 4/26/05, Mohan, Ross <[EMAIL PROTECTED]> wrote: > Maybe he needs to spend $7K on performance improvements? > > ;-) > AAARRRGGG! I will forever hate the number 7,000 from this day forth! Seriously, though, I've never seen a thread on any list wander on so aimlessly for so long. Please, mommy, make it stop! -- Mike Rylander [EMAIL PROTECTED] GPLS -- PINES Development Database Developer http://open-ils.org ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
> -Original Message- > From: Gurmeet Manku [mailto:[EMAIL PROTECTED] > Sent: Tuesday, April 26, 2005 5:01 PM > To: Simon Riggs > Cc: Tom Lane; josh@agliodbs.com; Greg Stark; Marko Ristola; > pgsql-perform; pgsql-hackers@postgresql.org; Utkarsh Srivastava; > [EMAIL PROTECTED] > Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks > suggested? > > [...] > 2. In a single scan, it is possible to estimate n_distinct by using > a very simple algorithm: > > "Distinct sampling for highly-accurate answers to distinct value > queries and event reports" by Gibbons, VLDB 2001. > > http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf > > [...] This paper looks the most promising, and isn't too different from what I suggested about collecting stats over the whole table continuously. What Gibbons does is give a hard upper bound on the sample size by using a logarithmic technique for storing sample information. His technique appears to offer very good error bounds and confidence intervals as shown by tests on synthetic and real data. I think it deserves a hard look from people hacking the estimator. __ David B. Held Software Engineer/Array Services Group 200 14th Ave. East, Sartell, MN 56377 320.534.3637 320.253.7800 800.752.8129 ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
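For anyone who wants to experiment, here is a rough, untested PL/pgSQL sketch of the core idea as I read it (my own illustration, not code from the paper or from anything in analyze.c): keep only values whose hash "survives" the current level, halve the keep probability whenever the reservoir overflows, and scale the surviving distinct count back up at the end. The function name, the choice of hashtext() as the hash, and the temp-table bookkeeping are all assumptions made for the example.

    CREATE OR REPLACE FUNCTION distinct_sample_estimate(tbl text, col text, cap integer)
    RETURNS double precision AS '
    DECLARE
        rec    record;
        lvl    integer := 0;   -- keep a value with probability 2^-lvl
        n_kept integer := 0;   -- distinct values currently kept
    BEGIN
        CREATE TEMP TABLE ds_sample (val text PRIMARY KEY);
        FOR rec IN EXECUTE ''SELECT '' || quote_ident(col) || ''::text AS v FROM ''
                           || quote_ident(tbl) LOOP
            -- a value survives the current level if its hash is divisible by 2^lvl
            IF hashtext(rec.v) % (1 << lvl) = 0 THEN
                PERFORM 1 FROM ds_sample WHERE val = rec.v;
                IF NOT FOUND THEN
                    INSERT INTO ds_sample VALUES (rec.v);
                    n_kept := n_kept + 1;
                END IF;
                IF n_kept > cap THEN
                    -- overflow: halve the keep probability and evict non-survivors
                    lvl := lvl + 1;
                    DELETE FROM ds_sample WHERE hashtext(val) % (1 << lvl) <> 0;
                    SELECT INTO n_kept count(*) FROM ds_sample;
                END IF;
            END IF;
        END LOOP;
        DROP TABLE ds_sample;
        -- each kept distinct value stands in for roughly 2^lvl distinct values
        RETURN n_kept * (1 << lvl)::double precision;
    END;
    ' LANGUAGE plpgsql;

(Repeated calls in one session would probably need the temp-table statements rewritten with EXECUTE because of plan caching, and a fuller version would keep per-value counts as the paper does; this is only meant to show the shape of the algorithm.)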
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Simon Riggs wrote: The comment * Every value in the sample appeared more than once. Assume * the column has just these values. doesn't seem to apply when using larger samples, as Josh is using. Looking at Josh's application it does seem likely that when taking a sample, all site visitors clicked more than once during their session, especially if they include home page, adverts, images etc for each page. Could it be that we have overlooked this simple explanation and that the Haas and Stokes equation is actually quite good, but just not being applied? No, it is being applied. If every value in the sample appears more than once, then f1 in the formula is 0, and the result is then just d, the number of distinct values in the sample. cheers andrew ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
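For readers following along, the estimator under discussion (Haas and Stokes' Duj1, as used in analyze.c) can be written out as follows, with n the sample size, N the table row count, d the number of distinct values in the sample, and f1 the number of values seen exactly once in the sample (this is my transcription from the thread, not a quote of the source):

    \hat{D}_{Duj1} = \frac{n d}{\,n - f_1 + f_1 (n/N)\,},
    \qquad f_1 = 0 \;\Rightarrow\; \hat{D}_{Duj1} = \frac{n d}{n} = d,

which is exactly Andrew's point: with no singleton values the estimate collapses to the number of distinct values actually seen in the sample.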
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Tom Lane wrote: Josh Berkus writes: Overall, our formula is inherently conservative of n_distinct. That is, I believe that it is actually computing the *smallest* number of distinct values which would reasonably produce the given sample, rather than the *median* one. This is contrary to the notes in analyze.c, which seem to think that we're *overestimating* n_distinct. Well, the notes are there because the early tests I ran on that formula did show it overestimating n_distinct more often than not. Greg is correct that this is inherently a hard problem :-( I have nothing against adopting a different formula, if you can find something with a comparable amount of math behind it ... but I fear it'd only shift the failure cases around. The math in the paper does not seem to look at very low levels of q (= sample to pop ratio). The formula has a range of [d,N]. It appears intuitively (i.e. I have not done any analysis) that at very low levels of q, as f1 moves down from n, the formula moves down from N towards d very rapidly. I did a test based on the l_comments field in a TPC lineitems table. The test set has N = 6001215, D = 2921877. In my random sample of 1000 I got d = 976 and f1 = 961, for a DUJ1 figure of 24923, which is too low by 2 orders of magnitude. I wonder if this paper has anything that might help: http://www.stat.washington.edu/www/research/reports/1999/tr355.ps - if I were more of a statistician I might be able to answer :-) cheers andrew ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
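Plugging Andrew's numbers into the Duj1 estimator written out earlier (n = 1000, d = 976, f1 = 961, N = 6001215) does reproduce his figure, so the arithmetic at least checks out:

    \hat{D}_{Duj1} = \frac{1000 \cdot 976}{1000 - 961 + 961 \cdot \frac{1000}{6001215}}
                   = \frac{976000}{39.16} \approx 24923
    \quad\text{versus}\quad D = 2921877,

i.e. roughly two orders of magnitude too low, as he says.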
Re: [PERFORM] What needs to be done for real Partitioning?
On March 21, 2005 8:07 AM, Hannu Krosing wrote: > On L, 2005-03-19 at 23:47 -0500, Tom Lane wrote: > > Well, partitioning on the primary key would be Good Enough for 95% or > > 99% of the real problems out there. I'm not excited about adding a > > large chunk of complexity to cover another few percent. > > Are you sure that partitioning on anything else than PK would be > significantly harder ? > > I have a case where I do manual partitioning over start_time > (timestamp), but the PK is an id from a sequence. They are almost, but > not exactly in the same order. And I don't think that moving the PK to > be (start_time, id) just because of "partitioning on PK only" would be a > good design in any way. > > So please don't design the system to partition on PK only. I agree. I have used table partitioning to implement pseudo-partitioning, and I am very pleased with the results so far. Real partitioning would be even better, but I am partitioning by timestamp, and this is not the PK, and I don't wish to make it one. -Roger ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Josh Berkus wrote: Simon, Tom: While it's not possible to get accurate estimates from a fixed size sample, I think it would be possible from a small but scalable sample: say, 0.1% of all data pages on large tables, up to the limit of maintenance_work_mem. Setting up these samples as a % of data pages, rather than a pure random sort, makes this more feasible; for example, a 70GB table would only need to sample about 9000 data pages (or 70MB). Of course, larger samples would lead to better accuracy, and this could be set through a revised GUC (i.e., maximum_sample_size, minimum_sample_size). I just need a little help doing the math ... please? After some more experimentation, I'm wondering about some sort of adaptive algorithm, a bit along the lines suggested by Marko Ristola, but limited to 2 rounds. The idea would be that we take a sample (either of fixed size, or some small proportion of the table), see how well it fits a larger sample (say a few times the size of the first sample), and then adjust the formula accordingly to project from the larger sample the estimate for the full population. Math not worked out yet - I think we want to ensure that the result remains bounded by [d,N]. cheers andrew ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] speed up query with max() and odd estimates
On 4/26/05, Steinar H. Gunderson <[EMAIL PROTECTED]> wrote: > On Tue, Apr 26, 2005 at 03:16:57PM -0500, Matthew Nuzum wrote: > > Seq Scan on usage_access (cost=0.00..1183396.40 rows=12713851 > > width=116) (actual time=481796.22..481839.43 rows=3343 loops=1) > > That's a gross misestimation -- four orders of magnitude off! > > Have you considering doing this in two steps, first getting out whatever > comes from the subquery and then doing the query? Well, I don't know if the estimates are correct now or not, but I found that your suggestion of doing it in two steps helped a lot. For the archives, here's what made a drastic improvement: This batch program had an overhead of 25 min to build hash tables using the sql queries. It is now down to about 47 seconds. The biggest improvements (bringing it down to 9 min) were to get rid of all instances of `select max(field) from ...` and replacing them with `select field from ... order by field desc limit 1` Then, to get it down to the final 47 seconds I changed this query: SELECT client,max(atime) as atime from usage_access where atime >= (select atime - '1 hour'::interval from usage_access order by atime desc limit 1) group by client; To these three queries: SELECT atime - '1 hour'::interval from usage_access order by atime desc limit 1; SELECT client, atime into temporary table recent_sessions from usage_access where atime >= '%s'; SELECT client, max(atime) as atime from recent_sessions group by client; Thanks for the help. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] speed up query with max() and odd estimates
Matthew Nuzum wrote: I have this query that takes a little over 8 min to run: select client,max(atime) as atime from usage_access where atime >= (select atime - '1 hour'::interval from usage_access order by atime desc limit 1) group by client; I think it can go a lot faster. Any suggestions on improving this? DB is 7.3.4 I think. (There is no index on client because it is very big and this data is used infrequently.) Switch to Postgres 8.0.2 :) Actually, I think one problem that you are running into is that postgres (at least used to) has problems with selectivity of date fields when using a non-constant parameter. So it isn't switching over to using an index, even though you are restricting the access time. I would guess that creating a multi-column index on (client, atime) *might* get you the best performance. Try adding the index, and then doing this query: select atime from usage_access where client = order by atime desc limit 1; If you can get that query to use an index, then you can put it in a loop. Something like: CREATE FUNCTION last_client_access() RETURNS SETOF time AS ' DECLARE client_id INT; client_time TIME; BEGIN FOR client_id IN SELECT id FROM LOOP SELECT INTO client_time atime FROM usage_access WHERE client = client_id ORDER BY atime DESC LIMIT 1; RETURN NEXT client_time; END LOOP; END; ' LANGUAGE plpgsql; If you really need high speed, you could create a partial index for each client id, something like: CREATE INDEX usage_access_atime_client1_idx ON usage_access(atime) WHERE client = client1; But that is a lot of indexes to maintain. I'm hoping that the multi-column index would be enough. You might also try something like: SELECT client, max(atime) FROM usage_access WHERE atime > now - '1 hour'::interval GROUP BY client; now is more of a constant, so postgres might have a better time figuring out the selectivity. I don't know your table, but I assume you are constantly inserting new rows, and the largest atime value will be close to now(). Remember, in this query (and in your original query) clients with their last access time > then 1 hour since the max time (of all clients) will not be shown. (Example, client 1 accessed yesterday, client 2 accessed right now your original last atime would be today, which would hide client 1). Also, if it is simply a problem of the planner mis-estimating the selectivity of the row, you can alter the statistics for atime. ALTER TABLE usage_access ALTER COLUMN atime SET STATISTICS 1000; I'm not really sure what else to try, but you might start there. Also, I still recommend upgrading to postgres 8, as I think it handles a lot of these things better. (7.3 is pretty old). 
John =:->
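The PL/pgSQL sketch above lost a table name in transit and declares TIME where the access-time column looks like a timestamp, so here is a cleaned-up, untested version of the same idea. The "clients" table with an "id" column is a guess at what was meant, not something named in the original message; it would be called as SELECT * FROM last_client_access();

    CREATE OR REPLACE FUNCTION last_client_access() RETURNS SETOF timestamp AS '
    DECLARE
        rec         record;
        client_time timestamp;
    BEGIN
        -- "clients" (with an "id" column) stands in for the list of client ids
        FOR rec IN SELECT id FROM clients LOOP
            -- newest access time for this client; an index on (client, atime)
            -- lets this be answered without scanning the whole table
            SELECT INTO client_time atime
              FROM usage_access
             WHERE client = rec.id
             ORDER BY atime DESC
             LIMIT 1;
            RETURN NEXT client_time;
        END LOOP;
        RETURN;
    END;
    ' LANGUAGE plpgsql;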
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Hi everybody! Perhaps the following papers are relevant to the discussion here (their contact authors have been cc'd): 1. The following proposes effective algorithms for using block-level sampling for n_distinct estimation: "Effective use of block-level sampling in statistics estimation" by Chaudhuri, Das and Srivastava, SIGMOD 2004. http://www-db.stanford.edu/~usriv/papers/block-sampling.pdf 2. In a single scan, it is possible to estimate n_distinct by using a very simple algorithm: "Distinct sampling for highly-accurate answers to distinct value queries and event reports" by Gibbons, VLDB 2001. http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf 3. In fact, Gibbons' basic idea has been extended to "sliding windows" (this extension is useful in streaming systems like Aurora / Stream): "Distributed streams algorithms for sliding windows" by Gibbons and Tirthapura, SPAA 2002. http://home.eng.iastate.edu/~snt/research/tocs.pdf Thanks, Gurmeet Gurmeet Singh Manku Google Inc. http://www.cs.stanford.edu/~manku (650) 967 1890 ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] speed up query with max() and odd estimates
On 4/26/05, Steinar H. Gunderson <[EMAIL PROTECTED]> wrote: > On Tue, Apr 26, 2005 at 03:16:57PM -0500, Matthew Nuzum wrote: > > Seq Scan on usage_access (cost=0.00..1183396.40 rows=12713851 > > width=116) (actual time=481796.22..481839.43 rows=3343 loops=1) > > That's a gross misestimation -- four orders of magnitude off! > > Have you considering doing this in two steps, first getting out whatever > comes from the subquery and then doing the query? Have you ANALYZEd recently? > Do you have an index on atime? > Yes, there is an index on atime. I'll re-analyze but I'm pretty certain that runs nightly. Regarding two steps, are you suggesting: begin; select * into temp_table...; select * from temp_table...; drop temp_table; rollback; I have not tried that but will. BTW, I created an index on clients just for the heck of it and there was no improvement. (actually, a slight degradation) -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
On Mon, 2005-04-25 at 17:10 -0400, Tom Lane wrote: > Simon Riggs <[EMAIL PROTECTED]> writes: > > On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote: > >> It's not just the scan --- you also have to sort, or something like > >> that, if you want to count distinct values. I doubt anyone is really > >> going to consider this a feasible answer for large tables. > > > Assuming you don't use the HashAgg plan, which seems very appropriate > > for the task? (...but I understand the plan otherwise). > > The context here is a case with a very large number of distinct > values... Yes, but is there another way of doing this other than sampling a larger proportion of the table? I don't like that answer either, for the reasons you give. The manual doesn't actually say this, but you can already alter the sample size by setting one of the statistics targets higher, but all of those samples are fixed sample sizes, not a proportion of the table itself. It seems reasonable to allow an option to scan a higher proportion of the table. (It would be even better if you could say "keep going until you run out of memory, then stop", to avoid needing to have an external sort mode added to ANALYZE). Oracle and DB2 allow a proportion of the table to be specified as a sample size during statistics collection. IBM seem to be ignoring their own research note on estimating ndistinct... > keep in mind also that we have to do this for *all* the > columns of the table. You can collect stats for individual columns. You need only use an option to increase sample size when required. Also, if you have a large table and the performance of ANALYZE worries you, set some fields to 0. Perhaps that should be the default setting for very long text columns, since analyzing those doesn't help much (usually) and takes ages. (I'm aware we already don't analyze var length column values > 1024 bytes). > A full-table scan for each column seems > right out to me. Some systems analyze multiple columns simultaneously. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
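As a concrete illustration of the per-column control Simon mentions (table and column names invented for the example), one can push the sample up for an important join column and turn statistics collection off entirely for a long text column:

    -- 1000 is the current maximum statistics target
    ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
    -- 0 means collect no statistics for this column at all
    ALTER TABLE orders ALTER COLUMN long_note SET STATISTICS 0;
    ANALYZE orders;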
Re: [PERFORM] speed up query with max() and odd estimates
On Tue, Apr 26, 2005 at 03:16:57PM -0500, Matthew Nuzum wrote: > Seq Scan on usage_access (cost=0.00..1183396.40 rows=12713851 > width=116) (actual time=481796.22..481839.43 rows=3343 loops=1) That's a gross misestimation -- four orders of magnitude off! Have you considering doing this in two steps, first getting out whatever comes from the subquery and then doing the query? Have you ANALYZEd recently? Do you have an index on atime? /* Steinar */ -- Homepage: http://www.sesse.net/ ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Simon, > Could it be that we have overlooked this simple explanation and that the > Haas and Stokes equation is actually quite good, but just not being > applied? That's probably part of it, but I've tried Haas and Stokes on a pure random sample and it's still bad, or more specifically overly conservative. -- --Josh Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
On Sun, 2005-04-24 at 00:48 -0400, Tom Lane wrote: > Josh Berkus writes: > > Overall, our formula is inherently conservative of n_distinct. That is, I > > believe that it is actually computing the *smallest* number of distinct > > values which would reasonably produce the given sample, rather than the > > *median* one. This is contrary to the notes in analyze.c, which seem to > > think that we're *overestimating* n_distinct. > > Well, the notes are there because the early tests I ran on that formula > did show it overestimating n_distinct more often than not. Greg is > correct that this is inherently a hard problem :-( > > I have nothing against adopting a different formula, if you can find > something with a comparable amount of math behind it ... but I fear > it'd only shift the failure cases around.

Perhaps the formula is not actually being applied? The code looks like this...

    if (nmultiple == 0)
    {
        /* If we found no repeated values, assume it's a unique column */
        stats->stadistinct = -1.0;
    }
    else if (toowide_cnt == 0 && nmultiple == ndistinct)
    {
        /*
         * Every value in the sample appeared more than once. Assume
         * the column has just these values.
         */
        stats->stadistinct = ndistinct;
    }
    else
    {
        /*--
         * Estimate the number of distinct values using the estimator
         * proposed by Haas and Stokes in IBM Research Report RJ 10025:

The middle chunk of code looks to me like if we find a distribution where values all occur at least twice, then we won't bother to apply the Haas and Stokes equation. That type of frequency distribution would be very common in a set of values with very high ndistinct, especially when sampled. The comment

    * Every value in the sample appeared more than once. Assume
    * the column has just these values.

doesn't seem to apply when using larger samples, as Josh is using. Looking at Josh's application it does seem likely that when taking a sample, all site visitors clicked more than once during their session, especially if they include home page, adverts, images etc for each page. Could it be that we have overlooked this simple explanation and that the Haas and Stokes equation is actually quite good, but just not being applied? Best Regards, Simon Riggs ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
[PERFORM] speed up query with max() and odd estimates
I have this query that takes a little over 8 min to run: select client,max(atime) as atime from usage_access where atime >= (select atime - '1 hour'::interval from usage_access order by atime desc limit 1) group by client; I think it can go a lot faster. Any suggestions on improving this? DB is 7.3.4 I think. (There is no index on client because it is very big and this data is used infrequently.) explain ANALYZE select client,max(atime) as atime from usage_access where atime >= (select atime - '1 hour'::interval from usage_access order by atime desc limit 1) group by client; QUERY PLAN Aggregate (cost=3525096.28..3620450.16 rows=1271385 width=20) (actual time=482676.95..482693.69 rows=126 loops=1) InitPlan -> Limit (cost=0.00..0.59 rows=1 width=8) (actual time=0.40..0.41 rows=1 loops=1) -> Index Scan Backward using usage_access_atime on usage_access (cost=0.00..22657796.18 rows=38141552 width=8) (actual time=0.39..0.40 rows=2 loops=1) -> Group (cost=3525096.28..3588665.53 rows=12713851 width=20) (actual time=482676.81..482689.29 rows=3343 loops=1) -> Sort (cost=3525096.28..3556880.90 rows=12713851 width=20) (actual time=482676.79..482679.16 rows=3343 loops=1) Sort Key: client -> Seq Scan on usage_access (cost=0.00..1183396.40 rows=12713851 width=20) (actual time=482641.57..482659.18 rows=3343 loops=1) Filter: (atime >= $0) Total runtime: 482694.65 msec I'm starting to understand this, which is quite frightening to me. I thought that maybe if I shrink the number of rows down I could improve things a bit, but my first attempt didn't work. I thought I'd replace the "from usage_access" with this query instead: select * from usage_access where atime >= (select atime - '1 hour'::interval from usage_access order by atime desc limit 1); QUERY PLAN Seq Scan on usage_access (cost=0.00..1183396.40 rows=12713851 width=116) (actual time=481796.22..481839.43 rows=3343 loops=1) Filter: (atime >= $0) InitPlan -> Limit (cost=0.00..0.59 rows=1 width=8) (actual time=0.41..0.42 rows=1 loops=1) -> Index Scan Backward using usage_access_atime on usage_access (cost=0.00..22657796.18 rows=38141552 width=8) (actual time=0.40..0.41 rows=2 loops=1) Total runtime: 481842.47 msec It doesn't look like this will help at all. This table is primarily append, however I just recently deleted a few million rows from the table, if that helps anyone. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PERFORM] pgbench Comparison of 7.4.7 to 8.0.2
Tom, Honestly, you've got me. It was either a comment from Tom Lane or Josh that the OS is caching the results (I may not be using the right terms here), so I thought if the database is dropped and recreated, I would see less of a skew (or variation) in the results. Someone care to comment? Steve Poe Thomas F.O'Connell wrote: Considering the default vacuuming behavior, why would this be? -tfo -- Thomas F. O'Connell Co-Founder, Information Architect Sitening, LLC Strategic Open Source: Open Your i™ http://www.sitening.com/ 110 30th Avenue North, Suite 6 Nashville, TN 37203-6320 615-260-0005 On Apr 25, 2005, at 12:18 PM, Steve Poe wrote: Tom, Just a quick thought: after each run/sample of pgbench, I drop the database and recreate it. When I don't, my results become more skewed. Steve Poe ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Table Partitioning: Will it be supported in Future?
Maybe he needs to spend $7K on performance improvements? ;-) -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Josh Berkus Sent: Tuesday, April 26, 2005 8:00 PM To: Richard Huxton Cc: [EMAIL PROTECTED]; pgsql-performance@postgresql.org Subject: Re: [PERFORM] Table Partitioning: Will it be supported in Future? Richard, > I believe these are being worked on at the moment. You might want to > search the archives of the hackers mailing list to see if the plans > will suit your needs. Actually, this is being discussed through the Bizgres project: www.bizgres.org. However, I agree that a 1GB table is not in need of partitioning. -- Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED] ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PERFORM] Table Partitioning: Will it be supported in Future?
Richard, > I believe these are being worked on at the moment. You might want to > search the archives of the hackers mailing list to see if the plans will > suit your needs. Actually, this is being discussed through the Bizgres project: www.bizgres.org. However, I agree that a 1GB table is not in need of partitioning. -- Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] Table Partitioning: Will it be supported in Future?
[EMAIL PROTECTED] wrote: Hmm, I have asked some people on the list and someone posted this link: http://archives.postgresql.org/pgsql-performance/2004-12/msg00101.php It is quite useful to read, but I am not sure that this trick is very helpful. I want to split my 1 GByte table into some little partitions, but how should I do that? With the ORACLE Partitioning Option I can configure my table with Enterprise Manager or SQL*Plus, but in this case it looks like a trap. Should I really decrease my table data size and spread it to other tables with the same structure by limiting records??? The next problem I see: how should I do an Insert/Update/Delete on 4 tables of the same structure in one query??? No misunderstanding: we are not talking about normalization or restructuring the columns of a table. We are talking about partitioning, and in this case in Postgres (emulation of partitioning with UNIONs for performance tuning). From your description I don't see evidence that you should need to partition your table at all. A 1GB table is very common for pgsql. Spend some hard disks on your storage subsystem and you'll gain the performance you want, without trouble on the SQL side. For specific requirements, you might see improvements from partial indexes. Regards, Andreas ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
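A tiny, hypothetical example of the partial-index idea Andreas mentions: if the hot queries only ever touch a small, identifiable slice of the table, index just that slice and the index stays small and cheap to maintain.

    -- index only the rows the frequent queries actually hit (names invented)
    CREATE INDEX orders_open_created_idx ON orders (created) WHERE status = 'open';
    -- a query whose WHERE clause implies the index predicate can use it
    SELECT * FROM orders WHERE status = 'open' AND created >= '2005-04-01';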
Re: [PERFORM] two queries and dual cpu (perplexed)
Shoaib Burq (VPAC) wrote: OK ... so just to clearify... (and pardon my ignorance): I need to increase the value of 'default_statistics_target' variable and then run VACUUM ANALYZE, right? If so what should I choose for the 'default_statistics_target'? BTW I only don't do any sub-selection on the View. I have attached the view in question and the output of: SELECT oid , relname, relpages, reltuples FROM pg_class ORDER BY relpages DESC; reg shoaib Actually, you only need to alter the statistics for that particular column, not for all columns in the db. What you want to do is: ALTER TABLE "ClimateChangeModel40" ALTER COLUMN SET STATISTICS 100; VACUUM ANALYZE "ClimateChangeModel40"; The column is just the column that you have the "IX_ClimateId" index on, I don't know which one that is. The statistics value ranges from 1 - 1000, the default being 10, and for indexed columns you are likely to want somewhere between 100-200. If you set it to 100 and the planner is still mis-estimating the number of rows, try 200, etc. The reason to keep the number low is because with a high number the planner has to spend more time planning. But especially for queries like this one, you'd rather the query planner spent a little bit more time planning, and got the right plan. John =:-> signature.asc Description: OpenPGP digital signature
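The column name appears to have been eaten by the mail formatting above; judging from the index name and the plan quoted elsewhere in the thread, the join column is probably "ClimateId", in which case the commands would look like this (substitute the real column if that guess is wrong):

    ALTER TABLE "ClimateChangeModel40" ALTER COLUMN "ClimateId" SET STATISTICS 100;
    VACUUM ANALYZE "ClimateChangeModel40";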
Re: [PERFORM] two queries and dual cpu (perplexed)
> -Original Message- > From: Shoaib Burq (VPAC) [mailto:[EMAIL PROTECTED] > Sent: Tuesday, April 26, 2005 9:31 AM > To: Tom Lane > Cc: John A Meinel; Russell Smith; Jeff; > pgsql-performance@postgresql.org > Subject: Re: [PERFORM] two queries and dual cpu (perplexed) > > > OK ... so just to clearify... (and pardon my ignorance): > > I need to increase the value of 'default_statistics_target' > variable and then run VACUUM ANALYZE, right? Not necessarily. You can set the statistics for a single column with ALTER TABLE. > If so what should I choose for the 'default_statistics_target'? > [...] Since you have a decently large table, go for the max setting which is 1000. __ David B. Held Software Engineer/Array Services Group 200 14th Ave. East, Sartell, MN 56377 320.534.3637 320.253.7800 800.752.8129 ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] two queries and dual cpu (perplexed)
OK ... so just to clearify... (and pardon my ignorance): I need to increase the value of 'default_statistics_target' variable and then run VACUUM ANALYZE, right? If so what should I choose for the 'default_statistics_target'? BTW I only don't do any sub-selection on the View. I have attached the view in question and the output of: SELECT oid , relname, relpages, reltuples FROM pg_class ORDER BY relpages DESC; reg shoaib On Sat, 23 Apr 2005, Tom Lane wrote: > John A Meinel <[EMAIL PROTECTED]> writes: > > Actually, you probably don't want enable_seqscan=off, you should try: > > SET enable_nestloop TO off. > > The problem is that it is estimating there will only be 44 rows, but in > > reality there are 13M rows. It almost definitely should be doing a > > seqscan with a sort and merge join. > > Not nestloops anyway. > > > I don't understand how postgres could get the number of rows that wrong. > > No stats, or out-of-date stats is the most likely bet. > > > I can't figure out exactly what is where from the formatting, but the query > > that seems misestimated is: > > -> Index Scan using "IX_ClimateId" on "ClimateChangeModel40" > > (cost=0.00..1063711.75 rows=265528 width=20) (actual time=28.311..17212.703 > > rows=13276368 loops=1) > > Index Cond: ("outer"."ClimateId" = "ClimateChangeModel40"."ClimateId") > > Yeah, that's what jumped out at me too. It's not the full explanation > for the join number being so far off, but this one at least you have a > chance to fix by updating the stats on ClimateChangeModel40. > > regards, tom lane > > ---(end of broadcast)--- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED]) > -- Shoaib Burq -- VPAC - Geospatial Applications Developer Building 91, 110 Victoria Street, Carlton South, Vic 3053, Australia ___ w: www.vpac.org | e: sab_AT_vpac_DOT_org | mob: +61.431-850039 oid| relname | relpages | reltuples ---+-+--+- 16996 | CurrentAusClimate | 474551 | 8.06736e+07 16983 | ClimateChangeModel40| 338252 | 5.31055e+07 157821816 | PK_CurrentAusClimate| 265628 | 8.06736e+07 157835995 | idx_climateid | 176645 | 8.06736e+07 157835996 | idx_ausposnum | 176645 | 8.06736e+07 157835997 | idx_climatevalue| 176645 | 8.06736e+07 157821808 | PK_ClimateModelChange_40| 174858 | 5.31055e+07 157821788 | IX_iMonth001| 116280 | 5.31055e+07 157821787 | IX_ClimateId| 116280 | 5.31055e+07 157821786 | IX_AusPosNumber | 116280 | 5.31055e+07 17034 | NeighbourhoodTable |54312 | 1.00476e+07 157821854 | PK_NeighbourhoodTable |27552 | 1.00476e+07 157821801 | IX_NeighbourhoodId |22002 | 1.00476e+07 157821800 | IX_NAusPosNumber|22002 | 1.00476e+07 157821799 | IX_AusPosNumber006 |22002 | 1.00476e+07 17012 | FutureEvapMonth |12026 | 1.10636e+06 17014 | FutureMaxTMonth |12026 | 1.10636e+06 17016 | FutureMinTMonth |12026 | 1.10636e+06 17018 | FutureRainMonth |12026 | 1.10636e+06 17000 | CurrentEvapMonth| 8239 | 1.12047e+06 17002 | CurrentMaxTMonth| 8239 | 1.12047e+06 17004 | CurrentMinTMonth| 8239 | 1.12047e+06 17006 | CurrentR_RMonth | 8239 | 1.12047e+06 17008 | CurrentRadMonth | 8239 | 1.12047e+06 17010 | CurrentRainMonth| 8239 | 1.12047e+06 16977 | Aus40_DEM | 6591 | 1.12047e+06 16979 | Aus40DemRandom | 6057 | 1.12047e+06 16981 | ClimateChange | 3752 | 543984 157821780 | IX_Random | 3075 | 1.12047e+06 157821832 | PK_FutureEvapMonth | 3036 | 1.10636e+06 157821834 | PK_FutureMaxTMonth | 3036 | 1.10636e+06 157821836 | PK_FutureMinTMonth | 3036 | 1.10636e+06 157821838 | PK_FutureRainMonth | 3036 | 1.10636e+06 
157821804 | PK_Aus40DemRandom | 2456 | 1.12047e+06 157821802 | PK_Aus40_DEM| 2456 | 1.12047e+06 157821820 | PK_CurrentEvapMonth | 2456 | 1.12047e+06 157821822 | PK_CurrentMaxTMonth | 2456 | 1.12047e+06 157821824 | PK_CurrentMinTMonth | 2456 | 1.12047e+06 157821826 | PK_CurrentR_RMonth | 2456 | 1.12047e+06 157821828 | PK_CurrentRadMonth
Re: [PERFORM] Table Partitioning: Will it be supported in Future?
Hmm, I have asked some people on the list and someone posted this link: http://archives.postgresql.org/pgsql-performance/2004-12/msg00101.php It is quite useful to read, but I am not sure that this trick is very helpful. I want to split my 1 GByte table into some little partitions, but how should I do that? With the ORACLE Partitioning Option I can configure my table with Enterprise Manager or SQL*Plus, but in this case it looks like a trap. Should I really decrease my table data size and spread it to other tables with the same structure by limiting records??? The next problem I see: how should I do an Insert/Update/Delete on 4 tables of the same structure in one query??? No misunderstanding: we are not talking about normalization or restructuring the columns of a table. We are talking about partitioning, and in this case in Postgres (emulation of partitioning with UNIONs for performance tuning). Josh ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
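To make the emulation concrete, here is one minimal, untested sketch (all names invented) of how several identically-structured tables can be made to look like a single partitioned table using inheritance plus rewrite rules; this is just one way to wire it up, not necessarily what the archived message proposes. SELECT, UPDATE and DELETE against the parent automatically cover the children, and the rules route INSERTs to the matching child, which answers the "one query over 4 tables" question.

    -- parent plus identically-structured children, split by date range
    CREATE TABLE orders (id integer, created timestamp, amount numeric);
    CREATE TABLE orders_2005h1 () INHERITS (orders);
    CREATE TABLE orders_2005h2 () INHERITS (orders);

    -- route inserts on the parent into the matching child
    CREATE RULE orders_ins_h1 AS ON INSERT TO orders
        WHERE new.created < '2005-07-01'
        DO INSTEAD INSERT INTO orders_2005h1 VALUES (new.id, new.created, new.amount);
    CREATE RULE orders_ins_h2 AS ON INSERT TO orders
        WHERE new.created >= '2005-07-01'
        DO INSTEAD INSERT INTO orders_2005h2 VALUES (new.id, new.created, new.amount);

    -- queries against the parent see all of the children
    SELECT count(*) FROM orders WHERE created >= '2005-06-01';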
Re: [PERFORM] BLOB's bypassing the OS Filesystem for better Image
Which filesystems? I know ext2 used to have issues with many-thousands of files in one directory, but that was a directory scanning issue rather than file reading. From my point of view, I think it is better to let one process do the operation on a Postgres cluster file structure than to bypass it with a second process. For example: a user uploads some JPEG images over HTTP. a) (Filesystem) On the filesystem it would be written to a file with a randomly generated filename (timestamp or whatever) (the directory expands, and over a million file objects will be archived, written, replaced, etc.). b) (Database) The JPEG image information will be stored in a BLOB as part of a special table, which is linked with the custid of the primary user table. From my point of view, any outside process (which must be created, forked, have memory allocated, etc.) is a bad choice. I think it is generally better to support the postmaster in all ways and do some hardware RAID configuration. My Question: Can I speed up my web application if I store my JPEG images with small sizes inside my PostgreSQL database (on very large databases over 1 GByte and above, without images at this time!)? No. Otherwise the filesystem people would build their filesystems on top of PostgreSQL not the other way around. Of course, if you want image updates to be part of a database transaction, then it might be worth storing them in the database. Hmm, ORACLE is going the other way. All file objects can be stored in the database if the DB has the IFS option (database filesystem and file server inside the database). I hope some people can give me a tip or hint on where I can find some useful information about it! Look into having a separate server (process or actual hardware) to handle requests for static text and images. Keep the Java server for actually processing data. Thanks
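For what it's worth, a minimal sketch of the "special table linked with the custid" design described above could look like the following (all names are hypothetical); from JDBC the image bytes would be written and read with PreparedStatement.setBytes() and ResultSet.getBytes() against the bytea column.

    -- one row per uploaded image, tied to the existing user table
    CREATE TABLE user_image (
        img_id   serial PRIMARY KEY,
        custid   integer NOT NULL REFERENCES users (custid),
        mimetype text    NOT NULL DEFAULT 'image/jpeg',
        img      bytea   NOT NULL           -- the JPEG bytes themselves
    );
    CREATE INDEX user_image_custid_idx ON user_image (custid);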
Re: [PERFORM] BLOB's bypassing the OS Filesystem for better Image
[EMAIL PROTECTED] wrote: Hi all again, My next question is dedicated to BLOBs in my web application (using Tomcat 5 and JDBC integrated in the J2EE app server JBoss). Filesystems with many filesystem objects can slow down the performance of opening and reading data. Which filesystems? I know ext2 used to have issues with many-thousands of files in one directory, but that was a directory scanning issue rather than file reading. My Question: Can I speed up my web application if I store my JPEG images with small sizes inside my PostgreSQL database (on very large databases over 1 GByte and above, without images at this time!)? No. Otherwise the filesystem people would build their filesystems on top of PostgreSQL not the other way around. Of course, if you want image updates to be part of a database transaction, then it might be worth storing them in the database. I hope some people can give me a tip or hint on where I can find some useful information about it! Look into having a separate server (process or actual hardware) to handle requests for static text and images. Keep the Java server for actually processing data. -- Richard Huxton Archonet Ltd ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
[PERFORM] BLOB's bypassing the OS Filesystem for better Image loading speed?
Hi all again, My next question is dedicated to BLOBs in my web application (using Tomcat 5 and JDBC integrated in the J2EE app server JBoss). Filesystems with many filesystem objects can slow down the performance of opening and reading data. My Question: Can I speed up my web application if I store my JPEG images with small sizes inside my PostgreSQL database (on very large databases over 1 GByte and above, without images at this time!)? I hope some people can give me a tip or hint on where I can find some useful information about it! Thanks Josh ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] Table Partitioning: Will it be supported in Future?
[EMAIL PROTECTED] wrote: Hi all, I am a guy from Germany and a strong Postgres believer! It is the best open-source database I have ever tasted, and I try to use it in all database environments. It is exciting to see that version 8.0 has tablespaces like ORACLE and DB/2, but I need partitioning on a few very large tables. I believe these are being worked on at the moment. You might want to search the archives of the hackers mailing list to see if the plans will suit your needs. -- Richard Huxton Archonet Ltd ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
[PERFORM] Table Partitioning: Will it be supported in Future? (splitting large Tables)
Hi all, I am a guy from Germany and a strong Postgres believer! It is the best open-source database I have ever tasted, and I try to use it in all database environments. It is exciting to see that version 8.0 has tablespaces like ORACLE and DB/2, but I need partitioning on a few very large tables. The tables are not very complex, but they are extremely large (1 GByte and above), and I think table partitioning is the right way to split them off onto some physical hard drives. I am not sure that a common hard-drive RAID or SAN storage system will do it for me. The ORACLE table partitioning features are very useful, but my favorite database is PSQL. Are there any plans that Postgres will support partitioning in the near future? Thanks ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings