Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-04 Thread John A Meinel
Josh Berkus wrote:
Mischa,

Okay, although given the track record of page-based sampling for
n-distinct, it's a bit like looking for your keys under the streetlight,
rather than in the alley where you dropped them :-)

Bad analogy, but funny.
The issue with page-based vs. pure random sampling is that sampling, for example,
10% of rows purely at random would actually mean loading about 50% of the pages.
With 20% of rows, you might as well scan the whole table.
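
(For the curious, that arithmetic can be sanity-checked in SQL. Assuming,
hypothetically, about 7 wide tuples per 8K page, a uniform sample of a
fraction f of the rows touches an expected 1 - (1 - f)^7 of the pages:

    SELECT f AS row_fraction,
           round(1 - power(1 - f, 7), 2) AS expected_page_fraction
    FROM (VALUES (0.10), (0.20)) AS s(f);
    -- => 0.10 | 0.52 and 0.20 | 0.79; more tuples per page pushes the
    --    page fraction even higher

More rows per page only makes random row sampling look worse.)
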
Unless, of course, we use indexes for sampling, which seems like a *really
good* idea to me 
But doesn't an index only sample one column at a time, whereas with
page-based sampling you can sample all of the columns at once? And not
all columns would have indexes, though it could be assumed that if a
column doesn't have an index, then it doesn't matter as much for
calculations such as n_distinct.
But if you had 5 indexed columns in your table, then doing it index-wise
would mean making 5 passes instead of just one.
Though I agree that page-based sampling is important for performance
reasons.
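
To make the one-pass point concrete: a single page-level sample can feed
every column's estimate at once, where index scans would need one pass per
column. (Illustrative only: some_table and its columns are made up, and
the TABLESAMPLE clause is a later addition, from PostgreSQL 9.5.)

    SELECT count(DISTINCT col_a) AS d_a,
           count(DISTINCT col_b) AS d_b,
           count(DISTINCT col_c) AS d_c
    FROM some_table TABLESAMPLE SYSTEM (1);  -- one pass over ~1% of pages
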
John
=:->




Re: [HACKERS] Recognizing range constraints (was Re: [PERFORM] Plan

2005-04-10 Thread John A Meinel
Tom Lane wrote:
Jim C. Nasby [EMAIL PROTECTED] writes:
On Wed, Apr 06, 2005 at 06:09:37PM -0400, Tom Lane wrote:
Can anyone suggest a more general rule?  Do we need for example to
consider whether the relation membership is the same in two clauses
that might be opposite sides of a range restriction?  It seems like
a.x > b.y AND a.x < b.z

In a case like this, you could actually look at the data in b and see
what the average range size is.

Not with the current statistics --- you'd need some kind of cross-column
statistics involving both y and z.  (That is, I doubt it would be
helpful to estimate the average range width by taking the difference of
independently-calculated mean values of y and z ...)  But yeah, in
principle it would be possible to make a non-default estimate.
			regards, tom lane
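
For the record, a direct cross-column measurement of the kind Jim implies
might look like the query below, using b, y, and z from the clause above
(the bucket bounds are placeholders). It is a full-table aggregate, not
something the current statistics gather:

    -- Distribution of range widths, measured directly rather than derived
    -- from independently computed means of y and z:
    SELECT width_bucket(z - y, 0, 1000, 10) AS width_bucket,
           count(*) AS n
    FROM b
    GROUP BY 1
    ORDER BY 1;
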
Actually, I think he was saying to do a nested loop, and for each row of
the outer loop, re-evaluate whether an index or a sequential scan is more
efficient.
I don't think Postgres re-plans once it has started, though you could
test this in a plpgsql function.
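
A minimal sketch of such a test (all names invented). A static query in
plpgsql is planned on first execution and the plan is cached for later
calls, while EXECUTE plans its query string afresh each time, so comparing
the two across different parameter values shows whether a fresh plan would
help:

    CREATE OR REPLACE FUNCTION count_range(lo integer, hi integer)
    RETURNS bigint AS $$
    DECLARE
        n bigint;
    BEGIN
        -- Static SQL: planned once, cached plan reused on later calls.
        SELECT count(*) INTO n FROM some_table WHERE x > lo AND x < hi;
        -- For contrast, the same query via EXECUTE is re-planned per call:
        -- EXECUTE 'SELECT count(*) FROM some_table WHERE x > ' || lo
        --         || ' AND x < ' || hi INTO n;
        RETURN n;
    END;
    $$ LANGUAGE plpgsql;
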
John
=:->




Re: [HACKERS] [PERFORM] How to read query plan

2005-03-20 Thread John Arbash Meinel
Miroslav Šulc wrote:
Tom Lane wrote:
...
I think the reason this is popping to the top of the runtime is that the
joins are so wide (an average of ~85 columns in a join tuple according
to the numbers above).  Because there are lots of variable-width columns
involved, most of the time the fast path for field access doesn't apply
and we end up going to nocachegetattr --- which itself is going to be
slow because it has to scan over so many columns.  So the cost is
roughly O(N^2) in the number of columns.

As there are a lot of varchar(1) columns in the AdDevicesSites table,
wouldn't it be helpful to change them to char(1)? Would that solve the
variable-width problem, at least for some fields, and speed the query up?
I'm guessing there really wouldn't be a difference. I think varchar()
and char() are stored the same way, just that char() always space-pads to
the declared length. I believe they are both varlena types, so they are
still variable-length.
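
A quick way to see this (pg_column_size is a later addition, in 8.1 and
up; the sizes shown assume the usual 4-byte varlena header):

    SELECT pg_column_size('x'::char(1))    AS bpchar_bytes,
           pg_column_size('x'::varchar(1)) AS varchar_bytes;
    -- => typically 5 and 5: both carry a length header, so char(1) buys
    --    no fixed-width fast path over varchar(1)
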
As a short-term hack, you might be able to improve matters if you can
reorder your LEFT JOINs to have the minimum number of columns
propagating up from the earlier join steps.  In other words make the
later joins add more columns than the earlier, as much as you can.

That will be hard, as the main table, which contains most of the fields,
is LEFT JOINed with the others. I'll look at it and see whether I can
find some way to improve it.
One thing that you could try is to select just the primary keys from
the main table, and then later on join back to that table to get the
rest of the columns. It is a little bit hackish, but if it makes your
query faster, you might want to try it.
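
A sketch of that rewrite (the devices and sites tables and their key
columns are invented; AdDevicesSites is from the thread). The inner query
carries only the keys through the LEFT JOINs, and the wide columns are
picked up by one final join:

    SELECT wide.*
    FROM (
        SELECT ads.id
        FROM AdDevicesSites ads
        LEFT JOIN devices d ON d.id = ads.device_id
        LEFT JOIN sites   s ON s.id = ads.site_id
    ) AS k
    JOIN AdDevicesSites wide ON wide.id = k.id;
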
I'm not sure I understand how the plan is carried out, but I imagine
the data from AdDevicesSites are retrieved only once, when they are
loaded, and maybe kept in memory. Are the columns stored in the order
they appear in the SQL command? If so, wouldn't it be useful to move all
the varchar fields to the end of the SELECT query? I'm just guessing,
because I don't know at all how a database server is implemented and
what it really does.
I don't think they are stored in the order of the SELECT portion. I'm
guessing they are loaded and saved as you go, but the order of the
LEFT JOINs at the end is probably important.
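
One detail worth noting: the order that matters for the fast path Tom
describes is the column order in the table definition, not in the SELECT
list. Attribute offsets can only be cached up to the first variable-width
column, so a layout like the following keeps the leading fixed-width
attributes on the fast path (illustrative DDL, names invented):

    CREATE TABLE ad_devices_sites_layout (
        id         integer,      -- fixed width: offsets cacheable
        site_id    integer,
        created_on date,
        name       varchar(40),  -- first varlena column: the cached-offset
        note       text          -- fast path stops here
    );
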
...
regards, tom lane

Miroslav

John
=:->

