Re: [PERFORM] How to unique-ify HUGE table?

2008-12-24 Thread Kynn Jones
Thank you all for the very helpful advice.  Upping work_mem made it possible
for me to generate the table within this century without bringing the server
to a near standstill.  I have not yet experimented with GROUP BY, but I'll
do this next.
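
For the archives, the kind of thing I plan to try next (just a sketch; the
work_mem value is only a guess for my box, and huge_table_uniq is a made-up
name):

    SET work_mem = '512MB';           -- session-local; size to your RAM

    CREATE TEMP TABLE huge_table_uniq AS
    SELECT x, y
      FROM huge_table
     GROUP BY x, y;                   -- same result as SELECT DISTINCT x, y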

Cheers,

Kynn


[PERFORM] How to unique-ify HUGE table?

2008-12-23 Thread Kynn Jones
Hi everyone!
I have a very large 2-column table (about 500M records) from which I want to
remove duplicate records.

I have tried many approaches, but they all take forever.

The table's definition consists of two short TEXT columns.  It is a
temporary table generated from a query:

CREATE TEMP TABLE huge_table AS SELECT x, y FROM ... ;

Initially I tried

CREATE TEMP TABLE huge_table AS SELECT DISTINCT x, y FROM ... ;

but after waiting for nearly an hour I aborted the query, and repeated it
after getting rid of the DISTINCT clause.

Everything takes forever with this monster!  It's uncanny.  Even printing it
out to a file takes forever, let alone creating an index for it.

Any words of wisdom on how to speed this up would be appreciated.

TIA!

Kynn


[PERFORM] How to profile an SQL script?

2008-12-03 Thread Kynn Jones
Hi.  I have a longish collection of SQL statements stored in a file that I
run periodically via cron.  Running this script takes a bit too long, even
for a cron job, and I would like to streamline it.
Is there a way to tell Postgres to print out, after each SQL statement is
executed, how long it took to execute?
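
Something along these lines is what I'm after, if it exists (sketch only):

    -- in psql, at the top of the script file:
    \timing                           -- toggles per-statement elapsed-time display

    -- or server-wide, in postgresql.conf, to log every statement's duration:
    -- log_min_duration_statement = 0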

Thanks!

Kynn


Re: [PERFORM] How to profile an SQL script?

2008-12-03 Thread Kynn Jones
Andreas, Heikki:

Thanks!
Kynn


[PERFORM] Performance and IN clauses

2008-11-18 Thread Kynn Jones
Hi.  I have a Perl script whose main loop generates thousands of SQL updates
of the form
UPDATE edge SET keep = true WHERE node1 IN ( $node_list ) AND node2 =
$node_id;

...where here $node_list stands for a comma-separated list of integers, and
$node_id stands for some integer.

The list represented by $node_list can be fairly long (on average it has
around 900 entries, and can be as long as 30K entries), and I'm concerned
about the performance cost of testing for inclusion in such a long list.  Is
this done by a sequential search?  If so, is there a better way to write
this query?  (FWIW, I have two indexes on the edge table using btree( node1
) and btree( node2 ), respectively.)

Also, assuming that the optimal way to write the query depends on the length
of $node_list, how can I estimate the critical length at which I should
switch from one form of the query to the other?
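
In case it matters, the rewrite I've been considering looks roughly like this
(a sketch; node_list_tmp is a made-up name):

    -- load the ids into a scratch table instead of a literal IN ( ... ) list
    CREATE TEMP TABLE node_list_tmp ( node1 int );
    -- ...populate node_list_tmp from the Perl script (e.g. via COPY)...

    UPDATE edge
       SET keep = true
      FROM node_list_tmp nl
     WHERE edge.node1 = nl.node1
       AND edge.node2 = $node_id;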

TIA!

Kynn


[PERFORM] The many nulls problem

2008-03-14 Thread Kynn Jones
It often happens that a particular piece of information is non-null for a
small minority of cases.  A superficially different manifestation of this is
when two pieces of information are identical in all but a small minority of
cases.  This can be easily mapped to the previous description by defining a
null in one column to mean that its contents should be obtained from those
of another column.  A further variant of this is when one piece of
information is a simple function of another one in all but a small minority
of cases.

(BTW, I vaguely recall that RDb theorists have a technical term for this
particular design issue, but I don't remember it.)

In all these cases, the design choice, at least according to RDb's 101, is
between including a column in the table that will be NULL most of the time,
or defining a second auxiliary column that references the first one and
holds the non-redundant information for the minority of cases for which this
is necessary (and maybe define a VIEW that includes all the columns).

But for me it is a frequent occurrence that my quaint and simple RDb's 101
reasoning doesn't really apply for PostgreSQL.  Basically, Pg is too smart
for it!  For example, does a large proportion of NULLs really imply a lot of
wasted space?  Maybe this is true for fixed-length data types, but what
about for type TEXT or VARCHAR?

Just to be concrete, consider the case of a customers database for some home
shopping website.  Suppose that, as it happens, for the majority of this
site's customers, the shipping and billing addresses are identical.  Or
consider the scenario of a company in which, for most employees, the email
address can be readily computed from the first and last name using the rule
First M. Last = [EMAIL PROTECTED], but the company allows some
flexibility for special cases (e.g. for people like Yasuhiro Tanaka who's
known to everyone by his nickname, Yaz, the email is
[EMAIL PROTECTED]; hardly anyone remembers or even knows his
full name.)
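
To make the shipping/billing example concrete, the two designs I'm weighing
look roughly like this (a sketch; all names made up):

    -- design A: one table, with a column that is NULL most of the time
    CREATE TABLE customer (
        id            serial PRIMARY KEY,
        billing_addr  text NOT NULL,
        shipping_addr text            -- NULL means "same as billing_addr"
    );

    -- design B: drop shipping_addr from customer and keep an auxiliary
    -- table holding only the exceptional cases
    CREATE TABLE customer_shipping (
        customer_id   int PRIMARY KEY REFERENCES customer( id ),
        shipping_addr text NOT NULL
    );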

What's your schema design approach for such situations?  How would you go
about deciding whether the number of exceptional cases is small enough to
warrant a second table?  Of course, one could do a systematic profiling of
various possible scenarios, but as a first approximation what's your
rule-of-thumb?

TIA!

Kynn


Re: [PERFORM] The many nulls problem

2008-03-14 Thread Kynn Jones
On Fri, Mar 14, 2008 at 3:46 PM, Heikki Linnakangas [EMAIL PROTECTED]
wrote:

 [tons of useful info snipped]

 From performance point of view, I would go with a single table with
 NULL fields on PostgreSQL.


Wow.  I'm so glad I asked!  Thank you very much!

Kynn


Re: [PERFORM] The many nulls problem

2008-03-14 Thread Kynn Jones
On Fri, Mar 14, 2008 at 2:59 PM, Oleg Bartunov [EMAIL PROTECTED] wrote:

 have you seen contrib/hstore ? You can have one table with common
 attributes
 and hide others in hstore


That's interesting.  I'll check it out.  Thanks!
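
(A minimal sketch of how I picture it, in case it helps someone searching the
archives later; it assumes contrib/hstore is installed, and the names are made
up:)

    CREATE TABLE customer (
        id     serial PRIMARY KEY,
        name   text NOT NULL,
        extras hstore                 -- rarely-used attributes go here
    );

    -- fetch a rare attribute; yields NULL when the key is absent:
    SELECT name, extras -> 'nickname' AS nickname FROM customer;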

Kynn


Re: [PERFORM] Joins and DELETE FROM

2008-03-11 Thread Kynn Jones
Thank you for your post.  I finally spent some quality time with the query
planner section in the docs' server config chapter.  Very instructive, even
considering that most of it went over my head!

On Sat, Mar 8, 2008 at 4:08 PM, Tom Lane [EMAIL PROTECTED] wrote:

...have you got effective_cache_size set to something that's realistic for
 your machine?


I guess not.  It was the default value (128MB) on a machine with 4GB of RAM.
 It's not a dedicated server, though, so I'll set it to 1G for now.

But before doing so I need a clarification.  The docs state that this
parameter is used only for cost estimation, and has no effect on actual
memory allocations.  I imagine that if other memory-related settings are not
somehow in line with it, it could lead to estimates that are out of touch
with reality.  If this is correct what other memory-related parameters do I
need to adjust to ensure that both the planner's estimates and the actual
execution agree and fit well with the available memory?
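
For concreteness, here is roughly what I have in mind for postgresql.conf (the
values are guesses for a non-dedicated 4GB box, not recommendations):

    shared_buffers       = 512MB    # real allocation (shared memory)
    work_mem             = 32MB     # real allocation, per sort/hash operation
    effective_cache_size = 1GB      # planner hint only; nothing is allocated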

One problem with this test is that your smaller tables probably fit in
 memory whereas the big ones may not, so it's not a given that your test
 accurately reflects how the real query will go down.


That's a very helpful reminder.  Thanks.

Kynn


Re: [PERFORM] Q on views and performance

2008-02-26 Thread Kynn Jones
On Mon, Feb 25, 2008 at 11:56 AM, Matthew [EMAIL PROTECTED] wrote:

 On Mon, 25 Feb 2008, Kynn Jones wrote:
  This is just GREAT!!!  It fits the problem to a tee.

 It makes the queries quick then?


It is good that you ask.  Clearly you know the story: a brilliant-sounding
optimization that in practice has only a small effect at best...

I'm totally puzzled.  It makes absolutely no sense to me...

For my analysis, in addition to creating the index on (type, zipk) that you
suggested, I also added an extra column to T containing a random integer in
the range 0..99, and created an index on this, so that I could produce a
totally shuffled clustering.  I compared the performance in going from a
randomly-clustered table to a (type, zipk)-clustered table, and the output
of EXPLAIN was encouraging, but when I ran the actual queries under EXPLAIN
ANALYZE the difference in execution time was negligible.
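
(For reference, the setup I used was approximately the following; the index
names are made up, and I used the pre-8.3 CLUSTER syntax:)

    ALTER TABLE T ADD COLUMN shuffle integer;
    UPDATE T SET shuffle = floor( random() * 100 )::int;   -- 0..99

    CREATE INDEX t_shuffle_idx   ON T ( shuffle );
    CREATE INDEX t_type_zipk_idx ON T ( type, zipk );

    CLUSTER t_shuffle_idx   ON T;    -- the "randomly clustered" baseline
    -- ...run EXPLAIN ANALYZE on the queries...
    CLUSTER t_type_zipk_idx ON T;    -- the suggested clustering
    ANALYZE T;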

Live and learn!

Actually, what's driving me absolutely insane is the documentation for
EXPLAIN and for Pg's query planning in general.  I've read the docs (in
particular the chapter on performance), but I still can't make any sense of
EXPLAIN's results, so I can't begin to understand why optimizations like the
one you suggested turned out to be ineffective.  For example, the first
lines of two recent EXPLAIN ANALYZE outputs are

Nested Loop Left Join  (cost=58.00..1154.22 rows=626 width=26) (actual time=
1.462..26.494 rows=2240 loops=1)
Merge Left Join  (cost=33970.96..34887.69 rows=58739 width=26) (actual time=
106.961..126.589 rows=7042 loops=1)

Actual runtimes are 27ms and 128ms.  The ratio 128/27 is much smaller than
one would expect from the relative costs of the two queries.  It looks like
there is no proportionality at all between the estimated costs and actual
running time...  (BTW, all these runs of EXPLAIN were done after calls to
VACUUM ANALYZE.)  This is one of the many things I don't understand about
this case...

What I would like to be able to do is to at least make enough sense of query
plans to determine whether they are reasonable or not.  This requires
knowing the algorithms behind each type of query tree node, but I have not
found this info...

On the positive side, in the course of all this analysis I must have done
*something* to improve the performance, because now even the unoptimized
queries are running pretty fast (e.g. queries that used to take about
1.5 seconds are now taking 130 ms).  But unfortunately I don't know what it
was that I did to bring this speed-up about!

Anyway, be that as it may, thank you very much for your suggestion.

Kynn


Re: [PERFORM] Q on views and performance

2008-02-25 Thread Kynn Jones
On Mon, Feb 25, 2008 at 8:45 AM, Matthew [EMAIL PROTECTED] wrote:

 On Fri, 22 Feb 2008, Kynn Jones wrote:
  Hi.  I'm trying to optimize...
 
  (Q1)   SELECT a1.word, a2.word
  FROM T a1 JOIN T a2 USING ( zipk )
 WHERE a1.type = int1
   AND a2.type = int2;

 Okay, try this:

 Create an index on T(type, zipk), and then CLUSTER on that index...


This is just GREAT!!!  It fits the problem to a tee.

Many, many thanks!

Also, including zipk in the index is a really nice extra boost.  (If you
hadn't mentioned it I would have just settled for clustering only on
type...)

Thanks for that also!

Kynn


Re: [PERFORM] Q on views and performance

2008-02-23 Thread Kynn Jones
On Fri, Feb 22, 2008 at 8:48 PM, Dean Gibson (DB Administrator) 
[EMAIL PROTECTED] wrote:

 On 2008-02-22 12:49, Kynn Jones wrote:
  Of course, I expect that using views Vint1 and Vint2... would
  result in a loss in performance relative to a version that used bona
  fide tables Tint1 and Tint2.  My question is, how can I minimize
  this performance loss?

 That used to be my thoughts too, but I have found over the years that
 the PostgreSQL execution planner is able to flatten SELECTs using
 VIEWs, ALMOST ALWAYS in a way that does not adversely affect
 performance, and often gives an IMPROVEMENT in performance, probably
 because by using VIEWs I am stating the query problem in a better way
 than if I try to guess the best way to optimize a SELECT.


Well, the last consideration you mention there does not apply to the two
alternatives I'm comparing because they differ only in that one uses views
V1, V2, V3, ... , V100 where the other one uses the corresponding tables T1,
T2, T3, ... , T100, so the query statements would be identical in both
cases.


 I have at least a 10:1 ratio of VIEWs to TABLEs.  Occasionally, with
 some query that is slow, I will try to rewrite it without VIEWs.  This
 ALMOST NEVER results in an improvement in performance...


That's truly amazing!  Just to make sure I get you right, you're saying that
when you replace a view by its equivalent table you see no performance gain?
 How could it be?  With views every query entails the additional work of
searching the underlying tables for the records that make up the views...

OK, if I think a bit more about it I suppose that a view could be
implemented for performance as a special sort of table consisting of a
single column of pointers to the true records, in which case using views
would entail only the cost of this indirection, and not the cost of a
search...  (And also the cost of maintaining this pointer table, if the
underlying tables are mutable.)  So I guess views could be implemented in
such a way that the difference in SELECT performance relative to replacing
them with tables would be negligible...

Anyway, your post once again reminded me of the awesomeness of PostgreSQL.
 Props to the developers!

kynn


Re: [PERFORM] Q on views and performance

2008-02-23 Thread Kynn Jones
On Fri, Feb 22, 2008 at 8:48 PM, Dean Gibson (DB Administrator) 
[EMAIL PROTECTED] wrote:

 On 2008-02-22 12:49, Kynn Jones wrote:
  Of course, I expect that using views Vint1 and Vint2... would
  result in a loss in performance relative to a version that used bona
  fide tables Tint1 and Tint2.  My question is, how can I minimize
  this performance loss?

 That used to be my thoughts too, but I have found over the years that
 the PostgreSQL execution planner is able to flatten SELECTs using
 VIEWs, ALMOST ALWAYS in a way that does not adversely affect
 performance, and often gives an IMPROVEMENT in performance, probably
 because by using VIEWs I am stating the query problem in a better way
 than if I try to guess the best way to optimize a SELECT.

 I have at least a 10:1 ratio of VIEWs to TABLEs.  Occasionally, with
 some query that is slow, I will try to rewrite it without VIEWs.  This
 ALMOST NEVER results in an improvement in performance, and when it does,
 I am able to find another way to write the VIEW and SELECT to recapture
 the gain.


Since you have experience working with views, let me ask you this.  The
converse strategy to the one I described originally would be to create the
individual tables T1, T2, T3, ..., T100, but instead of keeping around the
original (and now redundant) table T, replace it with a view V made up of
the union of T1, T2, T3, ..., T100.  The problem with this alternative is
that one cannot index V, or define a primary key constraint for it, because
it's a view.  This means that a search in V, even for a primary key value,
would *have to be* very inefficient (i.e. I don't see how even the very
clever PostgreSQL implementers could get around this one!), because the
engine would have to search *all* the underlying tables, T1 through T100,
even if it found the desired record in T1, since it has no way of knowing
that the value is unique all across V.
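
(Spelled out, the view I have in mind is simply this; a sketch:)

    CREATE VIEW V AS
              SELECT * FROM T1
    UNION ALL SELECT * FROM T2
    -- ...one branch per subtable...
    UNION ALL SELECT * FROM T100;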

Is there a way around this?

kynn


Re: [PERFORM] Q on views and performance

2008-02-23 Thread Kynn Jones
Hi, Dean.  The system I'm working with is very similar in spirit to a
large multilingual dictionary covering 100 languages.  Using this analogy,
the type column would correspond to the language, and the zipk column
would correspond to some language-independent key associated with a concept
(concept key for short).  So, if it were indeed a multilingual dictionary,
records in T would look like
  word   | zipk | language
-+--+---
 house   | 1234 | english
 casa| 1234 | spanish
 haus| 1234 | german
 piano   | 2345 | english
 piano   | 2345 | spanish
 cat | 3456 | english
 chat| 3456 | french
 chat| 4567 | english
 plausch | 4567 | german

...where I used the notation lang to denote the integer id assigned to
language lang.  Therefore typically there are about 100 records in T for
any given zipk, one for each language.  But the correspondence is not
perfect, since, for example, some languages have, proverbially, more than
one word for snow, and some (maybe from some tropical island in the South
Pacific) have none.  (This last case, BTW, is what accounts for the use of
left joins, as will become clear in a minute.)

The table S can be thought of as a collection of words to
be translated to some target language.  In the first type of query (Q1), all
the words in S are effectively declared to belong to the same source
language, whereas in the second type of query (Q2) the source language for
the words in S is left unspecified (in this case S may contain words from
various languages, or words--like piano or chat in the example
above--that belong simultaneously to different languages, and which may (e.g.
piano) or may not (e.g. chat) have the same zipk [concept key] for each of
these languages).

So, regarding your question about (Q1**) and (Q2**):

(Q1**)  SELECT a1.word, sq.word FROM
   S  JOIN T a1 USING ( word )
 LEFT JOIN ( SELECT * FROM T a2
 WHERE a2.type = int2 ) sq USING ( zipk )
 WHERE a1.type = int1;

(Q2**)  SELECT a1.word, sq.word FROM
   S  JOIN T a1 USING ( word )
 LEFT JOIN ( SELECT * FROM T a2
 WHERE a2.type = int2 ) sq USING ( zipk )

...the inner join with S is intended to pick out all the records in the
source table (either Tint1 in Q1** or T in Q2**) corresponding to words in
S, while the second (left) join, is there to find all the translations in
the target language.  I use a left join so that even those words in S for
which no translations exist will show up in the query results.

3. Why not write:

 CREATE VIEW txt AS
   SELECT a1.word AS word1, a1.type AS type1, a2.word AS word2, a2.type AS
 type2
 FROM T a1 [LEFT] JOIN T a2 USING( zipk );  -- Use LEFT if
 appropriate
 SELECT word1, word2
   FROM S JOIN txt ON word = word1
   WHERE type1 = int1 AND type2 = int2;


This would indeed produce the same results as Q1, but this approach would
require defining about 10,000 views, one for each possible pair of int1 and
int2 (or pair of languages, to continue the multilingual dictionary
analogy), which freaks me out for some reason.  (Actually, the number of
such views would be many more than that, because in the actual application
there is not just one T but several dozen, similar to what would happen to
the schema in the multilingual dictionary analogy if we wanted to
pre-segregate the words according to some categories, say a T for animals, a
T for fruits, a T for verbs, a T for professions, etc.)

(I need to do a bit more work before I can post the EXPLAIN results.)

kynn


[PERFORM] Q on views and performance

2008-02-22 Thread Kynn Jones
Hi.  I'm trying to optimize the performance of a database whose main purpose
is to support two (rather similar) kinds of queries.  The first kind, which
is expected to be the most common (I estimate it will account for about 90%
of all the queries performed on this DB), has the following general
structure:

(Q1)   SELECT a1.word, a2.word
 FROM T a1 JOIN T a2 USING ( zipk )
WHERE a1.type = int1
  AND a2.type = int2;

...where int1 and int2 stand for some two integers.  In English, this
query essentially executes an inner join between two virtual subtables of
table T, which are defined by the value of the type column.  For brevity, I
will refer to these (virtual) subtables as Tint1 and Tint2.  (I should
point out that T holds about 2 million records, spread roughly evenly over
about 100 values of the type column.  So each of these virtual subtables has
about 20K records.  Also, for practical purposes T may be regarded as an
immutable, read-only table, since it gets re-built from scratch about once a
month.  And, FWIW, all the columns mentioned in this post have a NOT NULL
constraint.)

The second form is similar to the first, except that now the join is taken
between T and Tint2:

(Q2)   SELECT a1.word, a2.word
 FROM T a1 JOIN T a2 USING ( zipk )
WHERE a2.type = int2;

(Both the forms above are somewhat oversimplified relative to the actual
situation; in our actual application, the joins are actually left outer
ones, and each query also involves an additional inner join with another
table, S.  For the sake of completeness, I give the real-world versions of
these queries at the end of this post, but I think that for the purpose of
my question, the additional complications they entail can be neglected.)

One way to speed (Q1) would be to break T into its subtables, i.e. to create
T1, T2, T3, ... , T100 as bona fide tables.  Then the query would become a
simple join without the two conditions of the original WHERE clause, which
I figure should make it noticeably faster.

But since the second kind of query (Q2) requires T, we can't get rid of this
table, so all the data would need to be stored twice, once in T and once in
some Tint*.

In trying to come up with a way around this duplication, it occurred to me
that instead of creating tables T1, T2, etc., I could create the analogous
views V1, V2, etc.  (e.g. CREATE VIEW V1 AS SELECT * FROM T WHERE type = 1).
 With this design, the two queries above would become

(Q1*)  SELECT Vint1.word, Vint2.word
 FROM Vint1 JOIN Vint2 USING ( zipk );

(Q2*)  SELECT T.word, Vint2.word
 FROM T JOIN Vint2 USING ( zipk );

Of course, I expect that using views Vint1 and Vint2... would result in
a loss in performance relative to a version that used bona fide tables
Tint1 and Tint2.  My question is, how can I minimize this performance
loss?

More specifically, how can I go about building table T and the views
Vint?'s to maximize the performance of (Q1)?  For example, I'm thinking
that if T had an additional id column and were built in such a way that all
the records belonging to each Vint? were physically contiguous, and (say)
had contiguous values in the id column, then I could define each view like
this

  CREATE VIEW Vint1 AS SELECT * FROM T
   WHERE start_int1 <= id AND id < start_int1+1;

So my question is, what design would make querying V1, V2, V3 ... as fast as
possible?  Is it possible to approach the performance of the design that
uses bona fide tables T1, T2, T3, ... instead of views V1, V2, V3 ...?

Thank you very much for reading this long post, and many thanks in advance
for your comments!

Kynn


P.S.  Here are the actual form of the queries.  They now include an initial
join with table S, and the join with Tint2 (or Vint2) is a left outer
join.  Interestingly, even though the queries below that use views (i.e.
Q1*** and Q2***) are not much more complex-looking than before, the other
two (Q1** and Q2**) are.  I don't know if this is because of my ineptitude
with SQL, but I am not able to render (Q1**) and (Q2**) without resorting to the
subquery sq.

(Q1**)  SELECT a1.word, sq.word FROM
   S  JOIN T a1 USING ( word )
 LEFT JOIN ( SELECT * FROM T a2
 WHERE a2.type = int2 ) sq USING ( zipk )
 WHERE a1.type = int1;

(Q2**)  SELECT a1.word, sq.word FROM
   S  JOIN T a1 USING ( word )
 LEFT JOIN ( SELECT * FROM T a2
 WHERE a2.type = int2 ) sq USING ( zipk )

   -

(Q1***) SELECT Vint1.word, Vint2.word FROM
   S  JOIN Vint1 USING ( word )
 LEFT JOIN Vint2 USING ( zipk );

(Q2***) SELECT T.word, Vint2.word
  FROM S  JOIN T   USING ( word )
 LEFT JOIN Vint2 USING ( zipk );


[PERFORM] Basic Q on superfluous primary keys

2007-04-14 Thread Kynn Jones

Consider these two very similar schemas:

Schema 1:


CREATE TABLE foo (
 id serial PRIMARY KEY,
 frobnitz character varying(100) NOT NULL UNIQUE
);


CREATE TABLE bar (
 id serial PRIMARY KEY,
 foo_id int REFERENCES foo(id)
);


Schema 2:


CREATE TABLE foo (
 frobnitz character varying(100) PRIMARY KEY
);


CREATE TABLE bar (
 id serial PRIMARY KEY,
 frobnitz character varying(100) REFERENCES foo(frobnitz)
);




The two situations are semantically identical: each record in table bar
refers to a record in table foo.  The difference is that in the first
schema, this referencing is done through an artificial serial-integer
primary key, while in the second schema this reference is done through a
data field that happens to be unique and not null, so it can serve as
primary key.


I find Schema 1 awkward and unnatural; more specifically, foo.id seems
unnecessary in light of the non-null uniqueness of foo.frobnitz.  But I
remember once reading that long fields like foo.frobnitz did not make good
primary keys.


Is the field foo.id in Schema 1 superfluous?  For example, wouldn't the
referencing from bar to foo really be done behind the scenes through some
hidden field (oid?) instead of through the frobnitz text field?  Which of
the two schemas would give better performance?


Thanks!


kj


Re: [PERFORM] Optimizing a huge_table/tiny_table join

2006-05-30 Thread Kynn Jones

On 5/24/06, Tom Lane [EMAIL PROTECTED] wrote:

 [EMAIL PROTECTED] writes:
   Limit  (cost=19676.75..21327.99 rows=6000 width=84)
     ->  Hash Join  (cost=19676.75..1062244.81 rows=3788315 width=84)
           Hash Cond: (upper((outer.id)::text) = upper((inner.id)::text))
           ->  Seq Scan on huge_table h  (cost=0.00..51292.43 rows=2525543 width=46)
           ->  Hash  (cost=19676.00..19676.00 rows=300 width=38)
                 ->  Seq Scan on tiny_table t  (cost=0.00..19676.00 rows=300 width=38)

 Um, if huge_table is so much bigger than tiny_table, why are the cost
 estimates for seqscanning them only about 2.5x different?  There's
 something wacko about your statistics, methinks.


You mean there's a bug in explain?  I agree that it makes no sense that the
costs don't differ as much as one would expect, but you can see right there
the numbers of rows for the two tables, which are exactly as I described.  At
any rate, how would one go about finding an explanation for these strange
stats?

More bewildering still (and infuriating as hell--because it means that all of
my work for yesterday has been wasted) is that I can no longer reproduce the
best query plan, even though the tables have not changed at all.  (Hence I
can't post the explain analyze for the best query plan.)  No matter what value
I use for LIMIT, the query planner now insists on sequentially scanning
huge_table and ignoring the available index.  (If I turn off enable_seqscan,
I get the second worst query plan I posted yesterday.)


Anyway, I take it that there is no way to bypass the optimizer and instruct PostgreSQL exactly how one wants the search performed?

Thanks!

kj



Re: [PERFORM] Optimizing a huge_table/tiny_table join

2006-05-25 Thread kynn

On 5/24/06, Tom Lane [EMAIL PROTECTED] wrote:

 [EMAIL PROTECTED] writes:
   Limit  (cost=19676.75..21327.99 rows=6000 width=84)
     ->  Hash Join  (cost=19676.75..1062244.81 rows=3788315 width=84)
           Hash Cond: (upper((outer.id)::text) = upper((inner.id)::text))
           ->  Seq Scan on huge_table h  (cost=0.00..51292.43 rows=2525543 width=46)
           ->  Hash  (cost=19676.00..19676.00 rows=300 width=38)
                 ->  Seq Scan on tiny_table t  (cost=0.00..19676.00 rows=300 width=38)

 Um, if huge_table is so much bigger than tiny_table, why are the cost
 estimates for seqscanning them only about 2.5x different?  There's
 something wacko about your statistics, methinks.



Well, they're not my statistics; they're explain's.  You mean there's
a bug in explain?  I agree that it makes no sense that the costs don't
differ as much as one would expect, but you can see right there the
numbers of rows for the two tables.  At any rate, how would one go
about finding an explanation for these strange stats?
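
(The only probes I can think of; a sketch, and corrections are welcome:)

    -- what the planner currently believes about the two tables:
    SELECT relname, relpages, reltuples
      FROM pg_class
     WHERE relname IN ( 'huge_table', 'tiny_table' );

    -- refresh the estimates and re-check the plan:
    VACUUM ANALYZE tiny_table;
    VACUUM ANALYZE huge_table;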

More bewildering still (and infuriating as hell--because it means that
all of my work for yesterday has been wasted) is that I can no longer
reproduce the best query plan I posted earlier, even though the tables
have not changed at all.  (Hence I can't post the explain analyze for
the best query plan, which Josh Drake asked for.)  No matter what
value I use for LIMIT, the query planner now insists on sequentially
scanning huge_table and ignoring the available index.  (If I turn off
enable_seqscan, I get the second worst query plan I posted yesterday.)

Anyway, I take it that there is no way to bypass the optimizer and
instruct PostgreSQL exactly how one wants the search performed?

Thanks!

kj


[PERFORM] Optimizing a huge_table/tiny_table join

2006-05-24 Thread kynn




I want to optimize this simple join:

SELECT * FROM huge_table h, tiny_table t WHERE UPPER( h.id ) = UPPER( t.id )

huge_table has about 2.5 million records, can be assumed to be fixed, and
has the following index:

CREATE INDEX huge_table_index ON huge_table( UPPER( id ) );

...while tiny_table changes with each user request, and typically will
contain on the order of 100-1000 records.  For this analysis, I put
300 records in tiny_table, resulting in 505 records in the join.

I tried several approaches.  In order of increasing speed of
execution:

1. executed as shown above, with enable_seqscan on: about 100 s.

2. executed as shown above, with enable_seqscan off: about 10 s.

3. executed with a LIMIT 6000 clause added to the SELECT statement, and
   enable_seqscan on: about 5 s.

4. executed with a LIMIT 600 clause added to the SELECT statement, and
   enable_seqscan on: less than 1 s.



Clearly, using LIMIT is the way to go.  Unfortunately I *do* want all
the records that would have been produced without the LIMIT clause,
and I don't have a formula for the limit that will guarantee this.  I
could use a very large value (e.g. 20x the size of tiny_table, as in
approach 3 above) which would make the probability of hitting the
limit very small, but unfortunately, the query plan in this case is
different from the query plan when the limit is just above the
expected number of results (approach 4 above).

The query plan for the fastest approach is this:

   QUERY PLAN
------------------------------------------------------------------------------------
 Limit  (cost=0.01..2338.75 rows=600 width=84)
   ->  Nested Loop  (cost=0.01..14766453.89 rows=3788315 width=84)
         ->  Seq Scan on tiny_table t  (cost=0.00..19676.00 rows=300 width=38)
         ->  Index Scan using huge_table_index on huge_table h
               (cost=0.01..48871.80 rows=12628 width=46)
               Index Cond: (upper((outer.id)::text) = upper((h.id)::text))



How can I *force* this query plan even with a higher limit value?

I found, by dumb trial and error, that in this case the switch happens
at LIMIT 5432, which, FWIW, is about 0.2% of the size of huge_table.
Is there a simpler way to determine this limit (hopefully
programmatically)?


Alternatively, I could compute the value for LIMIT as 2x the number of
records in tiny_table, and if the number of records found is *exactly*
this number, I would know that (most likely) some records were left
out.  In this case, I could use the fact that, according to the query
plan above, the scan of tiny_table is sequential to infer which
records in tiny_table were disregarded when the limit was reached, and
then repeat the query with only these left over records in tiny_table.
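
(In SQL, the safeguard I mean would look something like this sketch:)

    SELECT count(*) FROM tiny_table;             -- suppose it returns 300

    SELECT * FROM huge_table h, tiny_table t
     WHERE UPPER( h.id ) = UPPER( t.id )
     LIMIT 600;                                  -- i.e. 2 x the tiny_table count

    -- if exactly 600 rows come back, the LIMIT may have truncated the result,
    -- so repeat with only the tiny_table rows not yet seen in the output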

What's your opinion of this strategy?  Is there a good way to improve
it?

Many thanks in advance!

kj

PS:  FWIW, the query plan for the query with LIMIT 6000 is this:

 QUERY PLAN
------------------------------------------------------------------------------------
 Limit  (cost=19676.75..21327.99 rows=6000 width=84)
   ->  Hash Join  (cost=19676.75..1062244.81 rows=3788315 width=84)
         Hash Cond: (upper((outer.id)::text) = upper((inner.id)::text))
         ->  Seq Scan on huge_table h  (cost=0.00..51292.43 rows=2525543 width=46)
         ->  Hash  (cost=19676.00..19676.00 rows=300 width=38)
               ->  Seq Scan on tiny_table t  (cost=0.00..19676.00 rows=300 width=38)
