Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2010-01-05 Thread Robert Haas
On Tue, Jan 5, 2010 at 7:49 AM, Chetan Suttraway wrote: > if we can treat this case as similar to that of merging of histograms in > case of joins involving 2 tables and generating the histograms for the > cartesian (result) node, ...which you can't, because it's totally different, so I think the

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2010-01-05 Thread Chetan Suttraway
Hi, My suggestion is to keep two sets of histograms: one generated by running ANALYZE, and the other generated dynamically from the entries logged by insert/update/delete operations. I am not sure how difficult it is to read such record details fr

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2010-01-04 Thread Tom Lane
Josh Berkus writes: >> Great. When you have it set up, you might want to play with enabling >> the mergejoinscansel change as well, and see if that is a net plus or >> minus for you. > How would I *disable* it? It's #ifdef NOT_USED in CVS at the moment. regards, tom lane

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2010-01-04 Thread Josh Berkus
> Great. When you have it set up, you might want to play with enabling > the mergejoinscansel change as well, and see if that is a net plus or > minus for you. How would I *disable* it? --Josh

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2010-01-04 Thread Josh Berkus
> I've applied a patch to HEAD that does the above. Can you test it to > see how well it fixes your problem? Sure. It'll take us a while to set up a test environment; the database is 1TB in size so converting it to 8.5 isn't quick. Will report back. --Josh

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2010-01-04 Thread Tom Lane
Josh Berkus writes: >> I've applied a patch to HEAD that does the above. Can you test it to >> see how well it fixes your problem? > Sure. It'll take us a while to set up a test environment; the database > is 1TB in size so converting it to 8.5 isn't quick. Great. When you have it set up, you

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2010-01-04 Thread Csaba Nagy
On Wed, 2009-12-30 at 17:16 +0100, Tom Lane wrote: > I think the cleanest solution to this would be to make ANALYZE > cheaper, perhaps by finding some way for it to work incrementally. What if, when tuples are inserted or deleted, a random sample of them were passed into an auto-analyze buffer?
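
A rough SQL sketch of that buffer idea -- all names hypothetical, and shown only as a user-level approximation of what an in-server mechanism might do:

    CREATE TABLE events (ts timestamptz NOT NULL DEFAULT now());
    CREATE TABLE analyze_buffer (ts timestamptz);

    CREATE FUNCTION sample_into_buffer() RETURNS trigger AS $$
    BEGIN
        -- keep roughly 1% of inserted rows for a later incremental pass
        IF random() < 0.01 THEN
            INSERT INTO analyze_buffer VALUES (NEW.ts);
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER events_sample AFTER INSERT ON events
        FOR EACH ROW EXECUTE PROCEDURE sample_into_buffer();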

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2010-01-03 Thread Tom Lane
I wrote: > Actually, in the problematic cases, it's interesting to consider the > following strategy: when scalarineqsel notices that it's being asked for > a range estimate that's outside the current histogram bounds, first try > to obtain the actual current max() or min() of the column value ---
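
The probe described here is cheap because min()/max() on a btree-indexed column reduce to a single index endpoint fetch. A minimal sketch, with hypothetical table and index names, and the approximate plan shape:

    CREATE TABLE events (ts timestamptz NOT NULL DEFAULT now());
    CREATE INDEX events_ts_idx ON events (ts);

    EXPLAIN SELECT max(ts) FROM events;
    -- Result
    --   InitPlan 1 (returns $0)
    --     ->  Limit
    --           ->  Index Scan Backward using events_ts_idx on events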

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2010-01-01 Thread Simon Riggs
On Thu, 2009-12-31 at 21:29 +0000, Simon Riggs wrote: > On Thu, 2009-12-31 at 15:18 -0500, Tom Lane wrote: > > Simon Riggs writes: > > > Why not get both max() and min(), then rebase the histogram according to > > > those values. That way the histogram can still move significantly and > > > the te

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-31 Thread Simon Riggs
On Thu, 2009-12-31 at 15:18 -0500, Tom Lane wrote: > Simon Riggs writes: > > Why not get both max() and min(), then rebase the histogram according to > > those values. That way the histogram can still move significantly and > > the technique will still work. > > Define "rebase", keeping in mind t
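
One plausible reading of "rebase", for datatypes that do have a distance metric: keep the histogram's shape and linearly rescale each boundary b_i onto the freshly probed extremes,

    b_i' = min_new + (b_i - min_old) * (max_new - min_new) / (max_old - min_old)

which is only well-defined where subtraction and division make sense -- exactly the objection raised in the next entry.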

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-31 Thread Tom Lane
Simon Riggs writes: > Why not get both max() and min(), then rebase the histogram according to > those values. That way the histogram can still move significantly and > the technique will still work. Define "rebase", keeping in mind that this has to work on datatypes that we don't have a distance

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-31 Thread Simon Riggs
On Wed, 2009-12-30 at 14:55 -0500, Tom Lane wrote: > Actually, in the problematic cases, it's interesting to consider the > following strategy: when scalarineqsel notices that it's being asked for > a range estimate that's outside the current histogram bounds, first try > to obtain the actual curr

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-31 Thread Dimitri Fontaine
Tom Lane writes: > Actually, in the problematic cases, it's interesting to consider the > following strategy: when scalarineqsel notices that it's being asked for > a range estimate that's outside the current histogram bounds, first try > to obtain the actual current max() or min() of the column v

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-30 Thread Craig Ringer
While regular ANALYZE seems to be pretty good ... is it insane to suggest determining the min/max bounds of problem columns by looking at a btree index on the column in ANALYZE, instead of relying on random data sampling? An ANALYZE that didn't even have to scan the indexes but just look at the
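
Relatedly, one can already measure how far the stored bound has drifted from the index's actual endpoint, using only the stats catalog plus one endpoint probe (table and column names hypothetical):

    SELECT (histogram_bounds::text::timestamptz[])[
               array_upper(histogram_bounds::text::timestamptz[], 1)]
               AS stats_high_bound,
           (SELECT max(ts) FROM events) AS actual_high_bound
    FROM pg_stats
    WHERE tablename = 'events' AND attname = 'ts';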

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-30 Thread Craig Ringer
On 31/12/2009 12:33 AM, Kevin Grittner wrote: > Tom Lane wrote: >> Well, the problem Josh has got is exactly that a constant high bound doesn't work. > I thought the problem was that the high bound in the statistics fell too far below the actual high end in the data. > This tends (in my experience)

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-30 Thread Greg Smith
Joshua D. Drake wrote: postgres=# analyze verbose test_ten_million; INFO: analyzing "public.test_ten_million" INFO: "test_ten_million": scanned 3000 of 44248 pages, containing 678000 live rows and 0 dead rows; 3000 rows in sample, 10000048 estimated total rows ANALYZE Time: 20145.148 ms At
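
Those numbers are self-consistent: 678000 rows in 3000 pages is 226 rows per page, and 226 * 44248 gives the ~10M-row estimate. A hypothetical reconstruction of the test, assuming the table was built from a simple series:

    CREATE TABLE test_ten_million AS
        SELECT g AS i FROM generate_series(1, 10000000) g;
    ANALYZE VERBOSE test_ten_million;
    -- at a statistics target of 10, the sample is 300 * 10 = 3000 rows,
    -- a bounded cost however large the table grows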

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-30 Thread Tom Lane
Chris Browne writes: > I find it curious that ANALYZE *would* take a long time to run. > After all, its sampling strategy means that, barring having SET > STATISTICS to some ghastly high number, it shouldn't need to do > materially more work to analyze a 1TB table than is required to analyze > a
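
The "ghastly high number" knob is the per-column statistics target: the sample is about 300 * target rows regardless of table size, so only raising the target materially raises ANALYZE's work. For example (table name hypothetical):

    ALTER TABLE big_table ALTER COLUMN ts SET STATISTICS 1000;
    ANALYZE big_table;  -- now samples ~300000 rows, still independent
                        -- of whether the table is 1GB or 1TB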

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-30 Thread Greg Stark
Well, that's interesting, because they claim to be doing exactly the same amount of I/O in terms of pages. In the first case it's reading 3/4 of the table, so it's effectively doing a sequential scan. In the second case it's only scanning 7.5%, so you would expect it to be slower but not that much

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-30 Thread Chris Browne
j...@commandprompt.com ("Joshua D. Drake") writes: > On the other hand ANALYZE also: > > 1. Uses lots of memory > 2. Lots of processor > 3. Can take a long time > > We normally don't notice because most sets won't incur a penalty. We got a > customer who > has a single table that is over 1TB in siz

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-30 Thread Tom Lane
Peter Eisentraut writes: > On Tue, 2009-12-29 at 22:08 -0500, Tom Lane wrote: >> This seems like a fundamentally broken approach, first because "time >> between analyzes" is not even approximately a constant, and second >> because it assumes that we have a distance metric for all datatypes. > May

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-30 Thread Peter Eisentraut
On Tue, 2009-12-29 at 22:08 -0500, Tom Lane wrote: > This seems like a fundamentally broken approach, first because "time > between analyzes" is not even approximately a constant, and second > because it assumes that we have a distance metric for all datatypes. Maybe you could compute a correlatio

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-30 Thread Joshua D. Drake
On Wed, 30 Dec 2009 18:42:38 +0000, Greg Stark wrote: > I'm a bit puzzled by people's repeated suggestion here that large > tables take a long time to analyze. The sample analyze takes to > generate statistics is not heavily influenced by the size of the > table. Your 1TB table should take basica

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-30 Thread Greg Stark
On Wed, Dec 30, 2009 at 4:31 PM, Joshua D. Drake wrote: > On the other hand ANALYZE also: > > 1. Uses lots of memory > 2. Lots of processor > 3. Can take a long time > > We normally don't notice because most sets won't incur a penalty. We got a > customer who > has a single table that is over 1TB

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-30 Thread Tom Lane
Alvaro Herrera writes: > Tom Lane wrote: >> Normally, yeah. I think Josh's problem is that he's got >> performance-critical queries that are touching the "moving edge" of the >> data set, and so the part of the stats that are relevant to them is >> changing fast, even though in an overall sens

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-30 Thread Alvaro Herrera
Tom Lane wrote: > Greg Smith writes: > > Right, and the only thing that makes this case less painful is that you > > don't really need the stats to be updated quite as often in situations > > with that much data. If, say, your stats say there's 2B rows in the > > table but there's actually

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-30 Thread Kevin Grittner
Greg Smith wrote: > If, say, your stats say there's 2B rows in the table but there's > actually 2.5B, that's a big error, but unlikely to change the > types of plans you get. Once there's millions of distinct values > it takes a big change for plans to shift, etc. Well, the exception to tha

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-30 Thread Tom Lane
Greg Smith writes: > Right, and the only thing that makes this case less painful is that you > don't really need the stats to be updated quite as often in situations > with that much data. If, say, your stats say there's 2B rows in the > table but there's actually 2.5B, that's a big error, but

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-30 Thread Greg Smith
Joshua D. Drake wrote: > We normally don't notice because most sets won't incur a penalty. We got a > customer who has a single table that is over 1TB in size... We notice. > Granted that is the extreme but it would only take a quarter of that size > (which is common) to start seeing issues. Right,

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-30 Thread Kevin Grittner
Tom Lane wrote: > Well, the problem Josh has got is exactly that a constant high > bound doesn't work. I thought the problem was that the high bound in the statistics fell too far below the actual high end in the data. This tends (in my experience) to be much more painful than an artificially

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-30 Thread Joshua D. Drake
On Wed, 30 Dec 2009 11:16:45 -0500, Tom Lane wrote: > "Kevin Grittner" writes: >> Tom Lane wrote: >>> I don't have a better idea at the moment :-( > >> It's been a while since I've been bitten by this issue -- the last >> time was under Sybase. The Sybase suggestion was to either add >> "dumm

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-30 Thread Tom Lane
"Kevin Grittner" writes: > Tom Lane wrote: >> I don't have a better idea at the moment :-( > It's been a while since I've been bitten by this issue -- the last > time was under Sybase. The Sybase suggestion was to either add > "dummy rows" [YUCK!] to set the extreme bounds or to "lie to the >

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-30 Thread Kevin Grittner
Tom Lane wrote: > Josh Berkus writes: >> My thoughts on dealing with this intelligently without a major >> change to statistics gathering went along these lines: > >> 1. add columns to pg_statistic to hold estimates of upper and >> lower bounds growth between analyzes. > > This seems like a fund

Re: [HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-29 Thread Tom Lane
Josh Berkus writes: > My thoughts on dealing with this intelligently without a major change to > statistics gathering went along these lines: > 1. add columns to pg_statistic to hold estimates of upper and lower > bounds growth between analyzes. This seems like a fundamentally broken approach, fi

[HACKERS] Thoughts on statistics for continuously advancing columns

2009-12-29 Thread Josh Berkus
All, One of our clients is having query plan issues with a table with a continuously advancing timestamp column (i.e. one with default now()). The newest rows, which are the most in demand, are always estimated to be fewer than they are, or even non-existent. As a result, the user has to analyze t
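
A minimal reproduction of the complaint, with hypothetical names -- anything newer than the histogram's recorded high bound is estimated at close to zero rows until the next ANALYZE:

    CREATE TABLE log (ts timestamptz NOT NULL DEFAULT now());
    INSERT INTO log
        SELECT now() - g * interval '1 second'
        FROM generate_series(1, 100000) g;
    ANALYZE log;

    -- new rows arrive above the recorded high bound...
    INSERT INTO log SELECT now() FROM generate_series(1, 5000);

    -- ...and are badly underestimated until the next ANALYZE
    EXPLAIN SELECT * FROM log WHERE ts > now() - interval '1 second';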