Re: [PERFORM] Forcing seq_scan off for large table joined with tiny table yields improved performance

2009-04-06 Thread Mario Splivalo

Scott Marlowe wrote:


It's not really solved, it's just a happy coincidence that the current
plan runs well.  In order to keep the query planner making good
choices you need to increase stats target for the field in the index
above.  The easiest way to do so is to do this:

alter database mydb set default_statistics_target=100;

and run analyze again:

analyze;


So, I removed the index on field_name, set default_statistics_target to 100, analyzed, and the results are the same:


                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1.04..58676.31 rows=218048 width=68) (actual time=0.067..12268.394 rows=6 loops=1)
   Hash Cond: ((u.field_name)::text = (t.key)::text)
   ->  Seq Scan on photo_info_data u  (cost=0.00..47500.30 rows=2398530 width=50) (actual time=0.013..6426.611 rows=2398446 loops=1)
   ->  Hash  (cost=1.02..1.02 rows=2 width=18) (actual time=0.015..0.015 rows=2 loops=1)
         ->  Seq Scan on t_query_data t  (cost=0.00..1.02 rows=2 width=18) (actual time=0.002..0.006 rows=2 loops=1)
 Total runtime: 12268.459 ms
(6 rows)

I even changed default_statistics_target to 1000:


 Hash Join  (cost=1.04..58580.29 rows=208561 width=67) (actual time=0.054..12434.283 rows=6 loops=1)
   Hash Cond: ((u.field_name)::text = (t.key)::text)
   ->  Seq Scan on photo_info_data u  (cost=0.00..47499.46 rows=2398446 width=49) (actual time=0.012..6129.923 rows=2398446 loops=1)
   ->  Hash  (cost=1.02..1.02 rows=2 width=18) (actual time=0.015..0.015 rows=2 loops=1)
         ->  Seq Scan on t_query_data t  (cost=0.00..1.02 rows=2 width=18) (actual time=0.002..0.004 rows=2 loops=1)
 Total runtime: 12434.338 ms
(6 rows)


Even when I run this query, I get a sequential scan:

explain analyze select * from photo_info_data where field_name = 
'f-spot' or field_name = 'shutter';


                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Seq Scan on photo_info_data  (cost=0.00..59491.69 rows=1705 width=49) (actual time=0.018..1535.963 rows=6 loops=1)
   Filter: (((field_name)::text = 'f-spot'::text) OR ((field_name)::text = 'shutter'::text))
 Total runtime: 1536.010 ms
(3 rows)

These are the representations of the values 'f-spot' and 'shutter' for
the field field_name in the photo_info_data table:


phototest=# select field_name, count(*) from photo_info_data where
field_name in ('f-spot', 'shutter') group by field_name;

 field_name | count
------------+-------
 'f-spot'   |     3
 'shutter'  |     3
(2 rows)


Maybe my test data is poor? As I've mentioned, photo_info_data has a
little over 2.300.000 rows. And this is the complete 'distribution' of the data:


phototest=# select field_name, count(*) from photo_info_data group by
field_name order by count(*) desc;

  field_name  | count
--------------+--------
 field_Xx1    |     35
 field_Xx2    | 332447
 field_Xx3    | 297414
 field_Xx4    | 262394
 field_Xx5    | 227396
 field_Xx6    | 192547
 field_Xx7    | 157612
 field_Xx8    | 122543
 field_Xx9    |  87442
 field_Xx10   |  52296
 field_1      |      5
 field_2      |  47389
 field_3      |  42412
 field_4      |  37390
 field_5      |  32366
 field_6      |  27238
 field_7      |  22360
 field_Xx11   |  17589
 field_8      |  17412
 field_9      |  12383
 field_10     |   7386
 field_11     |   2410
 f-spot       |      3
 shutter      |      3
 focal        |      3
 flash        |      3
 m_city       |      3
 person       |      3
 iso          |      2
(29 rows)

No matter what field_name value I enter in the WHERE condition, the planner
chooses a sequential scan. Only when I add a separate index on field_name
does the planner choose an index scan or a bitmap index scan.
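
One way to see what the planner actually knows about field_name is to look at the statistics it has gathered, with something like this (just a sketch, using the table and column names from above):

select n_distinct, most_common_vals, most_common_freqs
from pg_stats
where tablename = 'photo_info_data' and attname = 'field_name';

If the rare values such as 'f-spot' and 'shutter' don't show up in most_common_vals, the planner has to fall back on an average-frequency guess for them, which would explain estimates in the hundreds of thousands for keys that really match only 3 rows.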


Mike



Re: [PERFORM] Forcing seq_scan off for large table joined with tiny table yields improved performance

2009-04-06 Thread Scott Marlowe
On Mon, Apr 6, 2009 at 6:20 AM, Mario Splivalo
mario.spliv...@megafon.hr wrote:
 Scott Marlowe wrote:

 It's not really solved, it's just a happy coincidence that the current
 plan runs well.  In order to keep the query planner making good
 choices you need to increase stats target for the field in the index
 above.  The easiest way to do so is to do this:

 alter database mydb set default_statistics_target=100;

 and run analyze again:

 analyze;

 So, I removed the index on field_name, set default_statistics_target
 to 100, analyzed, and the results are the same:

Why did you remove the index?



Re: [PERFORM] Forcing seq_scan off for large table joined with tiny table yields improved performance

2009-04-06 Thread Mario Splivalo

Scott Marlowe wrote:

On Mon, Apr 6, 2009 at 6:20 AM, Mario Splivalo
mario.spliv...@megafon.hr wrote:

Scott Marlowe wrote:

It's not really solved, it's just a happy coincidence that the current
plan runs well.  In order to keep the query planner making good
choices you need to increase stats target for the field in the index
above.  The easiest way to do so is to do this:

alter database mydb set default_statistics_target=100;

and run analyze again:

analyze;

So, I removed the index on field_name, set default_statistics_target
to 100, analyzed, and the results are the same:


Why did you remove the index?



Because I already have an index on that column: the index needed to enforce
the PK constraint. Here is the original DDL for the table:


CREATE TABLE photo_info_data
(
  photo_id integer NOT NULL,
  field_name character varying NOT NULL,
  field_value character varying,
  CONSTRAINT photo_info_data_pk PRIMARY KEY (photo_id, field_name)
);

CREATE INDEX photo_info_data_ix_field_value
  ON photo_info_data USING btree (field_value);

So, there is an index on (photo_id, field_name). Postgres is using the index
for photo_id (... WHERE photo_id = 12345) but not for field_name (... WHERE
field_name = 'f-spot'). When I add an extra index on field_name:


CREATE INDEX photo_info_data_ix__field_name
  ON photo_info_data USING btree (field_name);

Then that index is used.

Mike



Re: [PERFORM] Forcing seq_scan off for large table joined with tiny table yields improved performance

2009-04-06 Thread Scott Marlowe
On Mon, Apr 6, 2009 at 8:37 AM, Mario Splivalo
mario.spliv...@megafon.hr wrote:
 Scott Marlowe wrote:

 On Mon, Apr 6, 2009 at 6:20 AM, Mario Splivalo
 mario.spliv...@megafon.hr wrote:

 Scott Marlowe wrote:

 It's not really solved, it's just a happy coincidence that the current
 plan runs well.  In order to keep the query planner making good
 choices you need to increase stats target for the field in the index
 above.  The easiest way to do so is to do this:

 alter database mydb set default_statistics_target=100;

 and run analyze again:

 analyze;

 So, I removed the index on field_name, set default_statistics_target
 to 100, analyzed, and the results are the same:

 Why did you remove the index?


 Because I already have an index on that column: the index needed to enforce
 the PK constraint. Here is the original DDL for the table:

 CREATE TABLE photo_info_data
 (
  photo_id integer NOT NULL,
  field_name character varying NOT NULL,
  field_value character varying,
  CONSTRAINT photo_info_data_pk PRIMARY KEY (photo_id, field_name)
 )

 CREATE INDEX photo_info_data_ix_field_value
  ON photo_info_data USING btree (field_value);

 So, there is an index on (photo_id, field_name). Postgres is using the index for
 photo_id (... WHERE photo_id = 12345) but not for field_name (... WHERE
 field_name = 'f-spot'). When I add an extra index on field_name:

 CREATE INDEX photo_info_data_ix__field_name
  ON photo_info_data USING btree (field_name);

 Then that index is used.

On older versions of pgsql, the second of two terms in a multicolumn
index can't be used alone.  On newer versions it can, but it is much
less efficient than if it's a single column index or if the term is
the first one not the second.
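
A rough way to see the difference for yourself (names taken from your schema, purely illustrative):

explain select * from photo_info_data where photo_id = 12345;
-- photo_id is the leading column of the PK, so the (photo_id, field_name)
-- index is usable here

explain select * from photo_info_data where field_name = 'shutter';
-- without a dedicated index on field_name this tends to end up as a seq
-- scan (or an expensive scan of the whole PK index); with the
-- single-column index it can use a bitmap index scan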



Re: [PERFORM] Forcing seq_scan off for large table joined with tiny table yields improved performance

2009-04-06 Thread Mario Splivalo

Scott Marlowe wrote:

CREATE INDEX photo_info_data_ix_field_value
 ON photo_info_data USING btree (field_value);

So, there is an index on (photo_id, field_name). Postgres is using the index for
photo_id (... WHERE photo_id = 12345) but not for field_name (... WHERE
field_name = 'f-spot'). When I add an extra index on field_name:

CREATE INDEX photo_info_data_ix__field_name
 ON photo_info_data USING btree (field_name);

Then that index is used.


On older versions of pgsql, the second of two terms in a multicolumn
index can't be used alone.  On newer versions it can, but it is much
less efficient than if it's a single column index or if the term is
the first one not the second.


I'm using 8.3.7. So you'd also suggest keeping that extra (somewhat
redundant) index on field_name, since I need the PK on (photo_id, field_name)?


Mike



Re: [PERFORM] Forcing seq_scan off for large table joined with tiny table yields improved performance

2009-04-06 Thread Scott Marlowe
On Mon, Apr 6, 2009 at 8:50 AM, Mario Splivalo
mario.spliv...@megafon.hr wrote:
 Scott Marlowe wrote:

 CREATE INDEX photo_info_data_ix_field_value
  ON photo_info_data USING btree (field_value);

 So, there is an index on (photo_id, field_name). Postgres is using the index for
 photo_id (... WHERE photo_id = 12345) but not for field_name (... WHERE
 field_name = 'f-spot'). When I add an extra index on field_name:

 CREATE INDEX photo_info_data_ix__field_name
  ON photo_info_data USING btree (field_name);

 Then that index is used.

 On older versions of pgsql, the second of two terms in a multicolumn
 index can't be used alone.  On newer versions it can, but it is much
 less efficient than if it's a single column index or if the term is
 the first one not the second.

 I'm using 8.3.7. So you'd also suggest keeping that extra (somewhat
 redundant) index on field_name, since I need the PK on (photo_id, field_name)?

Either that or reverse the terms in the pk.

Also, you might want to look at adjusting random_page_cost to
something around 1.5 to 2.0.
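
If you go the key-reordering route, something along these lines should do it (a sketch only; rebuilding the PK takes a lock on the table, so do it in a quiet window):

alter table photo_info_data drop constraint photo_info_data_pk;
alter table photo_info_data
  add constraint photo_info_data_pk primary key (field_name, photo_id);

-- and the cost tweak mentioned above (1.5 to 2.0 is just a starting point;
-- the database name is the one from your psql prompt):
alter database phototest set random_page_cost = 2.0;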



[PERFORM] Forcing seq_scan off for large table joined with tiny table yields improved performance

2009-03-30 Thread Mario Splivalo

I have two tables, like this:

Big table:

CREATE TABLE photo_info_data
(
  photo_id integer NOT NULL,
  field_name character varying NOT NULL,
  field_value character varying,
  CONSTRAINT photo_info_data_pk PRIMARY KEY (photo_id, field_name)
)
WITH (OIDS=FALSE);

CREATE INDEX photo_info_data_ix_field_value
  ON photo_info_data
  USING btree
  (field_value);


Small table:

CREATE TABLE t_query_data
(
  i integer,
  key character varying,
  op character varying,
  value character varying
)
WITH (OIDS=FALSE);

I have around 2.400.000 rows in photo_info_data, and just two rows in
t_query_data:

 i |   key   | op | value
---+---------+----+-------
 1 | f-stop  | eq | 2.6
 2 | shutter | gt | 1/100


This is the query I'm executing:

SELECT
*
FROM
photo_info_data u
JOIN t_query_data t on u.field_name = key

This query takes around 900ms to execute. It returns 6 rows.

When I do 'explain analyze' for some reason it takes around 7 seconds, 
and this is what I get:


phototest=# explain analyze select * from photo_info_data u join 
t_query_data t on u.field_name = key;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1.04..58676.31 rows=218048 width=68) (actual time=2381.895..7087.225 rows=6 loops=1)
   Hash Cond: ((u.field_name)::text = (t.key)::text)
   ->  Seq Scan on photo_info_data u  (cost=0.00..47500.30 rows=2398530 width=50) (actual time=0.042..3454.112 rows=2398446 loops=1)
   ->  Hash  (cost=1.02..1.02 rows=2 width=18) (actual time=0.016..0.016 rows=2 loops=1)
         ->  Seq Scan on t_query_data t  (cost=0.00..1.02 rows=2 width=18) (actual time=0.003..0.007 rows=2 loops=1)
 Total runtime: 7087.291 ms
(6 rows)

Time: 7088.663 ms


I can rerun this query many times; it always takes around 7 seconds. I/O
wait during the query is nonexistent, it just takes 100% of CPU time (I
have a dual-core Opteron server).


If I force the planner not to use a sequential scan, here is what I get:

phototest=# explain analyze select * from photo_info_data u join 
t_query_data t on u.field_name = key;
                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=100039134.84..100130206.79 rows=218048 width=68) (actual time=271.138..540.998 rows=6 loops=1)
   ->  Seq Scan on t_query_data t  (cost=100000000.00..100000001.02 rows=2 width=18) (actual time=0.008..0.015 rows=2 loops=1)
   ->  Bitmap Heap Scan on photo_info_data u  (cost=39134.84..63740.08 rows=109024 width=50) (actual time=270.464..270.469 rows=3 loops=2)
         Recheck Cond: ((u.field_name)::text = (t.key)::text)
         ->  Bitmap Index Scan on photo_info_data_pk  (cost=0.00..39107.59 rows=109024 width=0) (actual time=270.435..270.435 rows=3 loops=2)
               Index Cond: ((u.field_name)::text = (t.key)::text)
 Total runtime: 541.065 ms
(7 rows)

Time: 542.147 ms


The database currently has only those two tables. I vacuumed them prior
to running the above queries.


I thought this information might also be important:

phototest=# select key, count(*) from photo_info_data u join
t_query_data t on u.field_name = key group by key;

   key   | count
---------+-------
 f-stop  |     3
 shutter |     3
(2 rows)


Am I doing something wrong here? The photo_info_data table will hold around
10.000.000 records. Should I be doing 'set enable_seqscan to false' each time
I want to run this query? (Since I'm accessing Postgres through JDBC, I'll
have the same situation I had a few weeks ago, which I also described here.)
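
For what it's worth, if I do end up forcing it, I assume something like this would keep the setting local to a single transaction, so it wouldn't affect anything else running on the same JDBC connection:

begin;
set local enable_seqscan = off;
select * from photo_info_data u join t_query_data t on u.field_name = t.key;
commit;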


Mike



Re: [PERFORM] Forcing seq_scan off for large table joined with tiny table yields improved performance

2009-03-30 Thread Tom Lane
Mario Splivalo mario.spliv...@megafon.hr writes:
    ->  Bitmap Heap Scan on photo_info_data u  (cost=39134.84..63740.08 rows=109024 width=50) (actual time=270.464..270.469 rows=3 loops=2)
          Recheck Cond: ((u.field_name)::text = (t.key)::text)
          ->  Bitmap Index Scan on photo_info_data_pk  (cost=0.00..39107.59 rows=109024 width=0) (actual time=270.435..270.435 rows=3 loops=2)
                Index Cond: ((u.field_name)::text = (t.key)::text)

You need to figure out why that rowcount estimate is off by more than
four orders of magnitude :-(

regards, tom lane



Re: [PERFORM] Forcing seq_scan off for large table joined with tiny table yields improved performance

2009-03-30 Thread Mario Splivalo

Tom Lane wrote:

Mario Splivalo mario.spliv...@megafon.hr writes:
   ->  Bitmap Heap Scan on photo_info_data u  (cost=39134.84..63740.08 rows=109024 width=50) (actual time=270.464..270.469 rows=3 loops=2)
         Recheck Cond: ((u.field_name)::text = (t.key)::text)
         ->  Bitmap Index Scan on photo_info_data_pk  (cost=0.00..39107.59 rows=109024 width=0) (actual time=270.435..270.435 rows=3 loops=2)
               Index Cond: ((u.field_name)::text = (t.key)::text)


You need to figure out why that rowcount estimate is off by more than
four orders of magnitude :-(


Huh, thnx! :) Could you give me some starting points? What do I do?

Could it be because the table is quite large, and there are only 3 rows
that match the join condition?


Now, just after I finished writing the above lines, the index creation on
photo_info_data(field_name) completed. When I rerun the above query, here is
what I get:


phototest=# explain analyze select field_name, count(*) from 
t_query_data t join photo_info_data u on t.key = u.field_name group by 
field_name;


                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=57414.33..57414.61 rows=22 width=9) (actual time=0.135..0.139 rows=2 loops=1)
   ->  Nested Loop  (cost=2193.50..56324.09 rows=218048 width=9) (actual time=0.063..0.114 rows=6 loops=1)
         ->  Seq Scan on t_query_data t  (cost=0.00..1.02 rows=2 width=6) (actual time=0.019..0.022 rows=2 loops=1)
         ->  Bitmap Heap Scan on photo_info_data u  (cost=2193.50..26798.74 rows=109024 width=9) (actual time=0.025..0.030 rows=3 loops=2)
               Recheck Cond: ((u.field_name)::text = (t.key)::text)
               ->  Bitmap Index Scan on photo_info_data_ix__field_name  (cost=0.00..2166.24 rows=109024 width=0) (actual time=0.019..0.019 rows=3 loops=2)
                     Index Cond: ((u.field_name)::text = (t.key)::text)
 Total runtime: 0.200 ms
(8 rows)


So, I guess I solved my problem! :) The explain analyze still shows that the
row estimate is quite off (109024 estimated vs. only 3 actual), but the
query is lightning fast :)


I thought that having the primary key (and the index automatically created
for it) on (photo_id, field_name) would be enough. Now I have two indexes
covering field_name, but that seems to do the job.


Mike



Re: [PERFORM] Forcing seq_scan off for large table joined with tiny table yields improved performance

2009-03-30 Thread Scott Marlowe
On Mon, Mar 30, 2009 at 9:34 AM, Mario Splivalo
mario.spliv...@megafon.hr wrote:

         ->  Bitmap Heap Scan on photo_info_data u  (cost=2193.50..26798.74 rows=109024 width=9) (actual time=0.025..0.030 rows=3 loops=2)
               Recheck Cond: ((u.field_name)::text = (t.key)::text)
               ->  Bitmap Index Scan on photo_info_data_ix__field_name  (cost=0.00..2166.24 rows=109024 width=0) (actual time=0.019..0.019 rows=3 loops=2)

 So, I guess I solved my problem! :) The explain analyze still shows that the row
 estimate is quite off (109024 estimated vs. only 3 actual), but the query
 is lightning fast :)

It's not really solved, it's just a happy coincidence that the current
plan runs well.  In order to keep the query planner making good
choices you need to increase stats target for the field in the index
above.  The easiest way to do so is to do this:

alter database mydb set default_statistics_target=100;

and run analyze again:

analyze;
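
If you'd rather not raise the default for the whole database, a narrower variant of the same idea (just a sketch) is to bump the target only for the column whose estimate is off and re-analyze that table:

alter table photo_info_data alter column field_name set statistics 1000;
analyze photo_info_data;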

 I thought that having the primary key (and the index automatically created for it) on
 (photo_id, field_name) would be enough. Now I have two indexes covering
 field_name, but that seems to do the job.

Nope, it's the collected statistics that let the planner make the right choice.
