Re: [PERFORM] Index only scan sometimes switches to sequential scan for small amount of rows
Sorry, didn't respond to all your questions: What version of PostgreSQL are running? 'select version();' PostgreSQL 9.3.4 on x86_64-pc-linux-gnu, compiled by gcc-4.6.real (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit What do you get when to do analyze verbose print_list? # analyze verbose print_list ; INFO: analyzing print_list INFO: print_list: scanned 53712 of 53712 pages, containing 3626950 live rows and 170090 dead rows; 30 rows in sample, 3626950 estimated total rows ANALYZE Time: 6656.037 ms -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Index only scan sometimes switches to sequential scan for small amount of rows
On 25 March 2015 at 22:45, Jeff Janes jeff.ja...@gmail.com wrote: How can the avg_width be 4 when the vast majority of entries are 7 characters long? The datatype is an enum, as I understand it, an enum type always occupies 4 bytes -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] query faster with a foreign table
Hello, I have an executing time problem for a query : this time is very different as I used a local table or a foreign table : 20 times faster for the foreign table On a server 9.4.1, I have 2 spatial bases b1 (size 5.4 Go) et b2 (size 19Mo) and in the base b1 the table tmp_obs_coordgps (61 Mo, 502982 lignes). When I use a JOIN construct beetwen this table tmp_obs_coordgps and a foreign table fao_areas (table in b2), the performance are best than with a local table fao_aires_local (in b1). These 2 tables fao_areas and fao_aires_local are identical (build with select * from or with pg_dump : the results are the same) The links to explain analyze are * foreign table fao_areas : http://explain.depesz.com/s/4hO select count(*) from tmp_obs_coordgps o, fao_areas f where o.code_fao=f.f_code and st_contains(f.the_geom, o.geom); * local table fao_aires_local : http://explain.depesz.com/s/BvDb select count(*) from tmp_obs_coordgps o, fao_aires_local f where o.code_fao=f.f_code and st_contains(f.the_geom, o.geom); Thanks by advance -- Dominique Vallée UMS 3468 Bases de données sur la Biodiversité, Ecologie, Environnement et Sociétés (BBEES) MNHN - Muséum national d'Histoire naturelle 01 40 79 53 70 -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Index only scan sometimes switches to sequential scan for small amount of rows
On 26.3.2015 08:48, Jeff Janes wrote: OK, this is starting to look like a long-standing bug to me. If it only sees 3 distinct values, and all three are present at least twice, it throws all of them into the MCV list. But if one of those 3 were present just once, then it tests them to see if they qualify. The test for inclusion is that it has to be present more than once, and that it must be over-represented by 25%. Lets say it sampled 3 rows and found 29,900 of one value, 99 of another, and 1 of a third. But that turns into the second one needing to be present 12,500 times. The average value is present 10,000 times (30,000 samples with 3 distinct values) and 25 more than that is 12,500. So it excluded. It seems to me that a more reasonable criteria is that it must be over-represented 25% compared to the average of all the remaining values not yet accepted into the MCV list. I.e. all the greater ones should be subtracted out before computing the over-representation threshold. That might work IMO, but maybe we should increase the coefficient a bit (say, from 1.25 to 2), not to produce needlessly long MCV lists. It is also grossly inconsistent with the other behavior. If they are 29900; 98; 2 then all three go into the MCV. Isn't the mincount still 12500? How could all three get into the MCV? -- Tomas Vondrahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Index only scan sometimes switches to sequential scan for small amount of rows
On Thu, Mar 26, 2015 at 5:44 AM, Tomas Vondra tomas.von...@2ndquadrant.com wrote: On 26.3.2015 08:48, Jeff Janes wrote: OK, this is starting to look like a long-standing bug to me. If it only sees 3 distinct values, and all three are present at least twice, it throws all of them into the MCV list. But if one of those 3 were present just once, then it tests them to see if they qualify. The test for inclusion is that it has to be present more than once, and that it must be over-represented by 25%. Lets say it sampled 3 rows and found 29,900 of one value, 99 of another, and 1 of a third. But that turns into the second one needing to be present 12,500 times. The average value is present 10,000 times (30,000 samples with 3 distinct values) and 25 more than that is 12,500. So it excluded. It seems to me that a more reasonable criteria is that it must be over-represented 25% compared to the average of all the remaining values not yet accepted into the MCV list. I.e. all the greater ones should be subtracted out before computing the over-representation threshold. That might work IMO, but maybe we should increase the coefficient a bit (say, from 1.25 to 2), not to produce needlessly long MCV lists. That wouldn't work here, because at the point of decision the value present 99 times contributes half the average, so the average is 50, and of course it can't possibly be twice of that. I have a patch, but is there a way to determine how it affects a wide variety of situations? I guess run `make installcheck`, then analyze, then dump pg_stats, with the patch and without the patch, and then compare the dumpsj? It is also grossly inconsistent with the other behavior. If they are 29900; 98; 2 then all three go into the MCV. Isn't the mincount still 12500? How could all three get into the MCV? If all observed values are observed at least twice, it takes a different path through the code. It just keeps them all in the MCV list. That is what is causing the instability for the OP. If the 3rd most common is seen twice, then all three are kept. If it is seen once, then only the most common is kept. See if statements at 2494 and 2585 else if (toowide_cnt == 0 nmultiple == ndistinct) if (track_cnt == ndistinct Cheers, Jeff analyze_highly_skewed.patch Description: Binary data -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Index only scan sometimes switches to sequential scan for small amount of rows
On 26.3.2015 17:35, Jeff Janes wrote: On Thu, Mar 26, 2015 at 5:44 AM, Tomas Vondra tomas.von...@2ndquadrant.com mailto:tomas.von...@2ndquadrant.com wrote: That might work IMO, but maybe we should increase the coefficient a bit (say, from 1.25 to 2), not to produce needlessly long MCV lists. That wouldn't work here, because at the point of decision the value present 99 times contributes half the average, so the average is 50, and of course it can't possibly be twice of that. Oh, right. How could I miss that? ;-) I have a patch, but is there a way to determine how it affects a wide variety of situations? I guess run `make installcheck`, then analyze, then dump pg_stats, with the patch and without the patch, and then compare the dumpsj? I doubt there's such way. I'd argue that if you can show this always generates longer MCV lists, we can assume the stats are probably more accurate, and thus the plans should be better. Of course, there's always the possibility that the plan was good by luck, and improving the estimates will result in a worse plan. But I don't think we can really fix that :-( It is also grossly inconsistent with the other behavior. If they are 29900; 98; 2 then all three go into the MCV. Isn't the mincount still 12500? How could all three get into the MCV? If all observed values are observed at least twice, it takes a different path through the code. It just keeps them all in the MCV list. That is what is causing the instability for the OP. If the 3rd most common is seen twice, then all three are kept. If it is seen once, then only the most common is kept. See if statements at 2494 and 2585 else if (toowide_cnt == 0 nmultiple == ndistinct) if (track_cnt == ndistinct Aha, I think I see it now. I've been concentrating on this code: avgcount = (double) samplerows / ndistinct; /* set minimum threshold count to store a value */ mincount = avgcount * 1.25; if (mincount 2) mincount = 2; but this is actually too late, because first we do this: else if (toowide_cnt == 0 nmultiple == ndistinct) { stats-stadistinct = ndistinct; } and that only happens if each item is observed at least 2x in the sample (and the actual Haas and Stokes estimator it not used). And then we do this: if (track_cnt == ndistinct toowide_cnt == 0 stats-stadistinct 0 track_cnt = num_mcv) { num_mcv = track_cnt; } so that we track everything. If at least one value is seen only 1x, it works differently, and we use the code with (1.25*avgcount) threshold. I wonder where the 1.25x threshold comes from - whether it's something we came up with, or if it comes from some paper. I guess the former. -- Tomas Vondrahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Index only scan sometimes switches to sequential scan for small amount of rows
On Wed, Mar 25, 2015 at 1:00 PM, Feike Steenbergen feikesteenber...@gmail.com wrote: On 25 March 2015 at 19:07, Jeff Janes jeff.ja...@gmail.com wrote: Also, I doubt that that is the problem in the first place. If you collect a sample of 30,000 (which the default target size of 100 does), and the frequency of the second most common is really 0.00307333 at the time you sampled it, you would expect to find it 92 times in the sample. The chances against actually finding 1 instead of around 92 due to sampling error are astronomical. It can be that the distribution of values is very volatile; we hope the increased stats target (from the default=100 to 1000 for this column) and frequent autovacuum and autoanalyze helps in keeping the estimates correct. It seems that it did find some other records ( 'PRINTED), as is demonstrated in the stats where there was only one value in the MCV list: the frequency was 0.996567 and the fraction of nulls was 0, therefore leaving 0.03+ for other values. But because none of them were in the MCV and MCF list, they were all treated as equals. They are certainly not equal. I not know why some values were found (they are mentioned in the histogram_bounds), but are not part of the MCV list, as you say, the likeliness of only 1 item being found is very small. Does anyone know the criteria for a value to be included in the MCV list? OK, this is starting to look like a long-standing bug to me. If it only sees 3 distinct values, and all three are present at least twice, it throws all of them into the MCV list. But if one of those 3 were present just once, then it tests them to see if they qualify. The test for inclusion is that it has to be present more than once, and that it must be over-represented by 25%. Lets say it sampled 3 rows and found 29,900 of one value, 99 of another, and 1 of a third. But that turns into the second one needing to be present 12,500 times. The average value is present 10,000 times (30,000 samples with 3 distinct values) and 25 more than that is 12,500. So it excluded. It seems to me that a more reasonable criteria is that it must be over-represented 25% compared to the average of all the remaining values not yet accepted into the MCV list. I.e. all the greater ones should be subtracted out before computing the over-representation threshold. It is also grossly inconsistent with the other behavior. If they are 29900; 98; 2 then all three go into the MCV. If they are 29900; 99; 1 then only the highest one goes in. The second one gets evicted for being slightly *more* popular. This is around line 2605 of src/backend/commands/analyze.c in head. Cheers, Jeff