Re: [PERFORM] mis-estimation on data-warehouse aggregate creation

2004-11-17 Thread Kris Jurka


On Tue, 16 Nov 2004, Simon Riggs wrote:

 The join condition has so many ANDed predicates that we assume that this
 will reduce the selectivity considerably. It does not, and so you pay
 the cost dearly later on.
 

Yes, that makes a lot of sense.  Without some incredibly good cross-column
statistics there is no way it could expect all of the rows to match.  
Thanks for the analysis.
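
(To see the independence assumption at work: if each of three ANDed join
quals is estimated at 10% selectivity, the planner multiplies them to
0.1 * 0.1 * 0.1 = 0.001 of the input rows, even when the columns are
perfectly correlated and the true combined selectivity is still 10%.)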

Kris Jurka



Re: [PERFORM] query plan question

2004-11-17 Thread David Parker
Oh, I didn't realize that analyze gave that much more info. I've got a
lot to learn about this tuning stuff ;-) 

I've attached the output. I see from the new output where the slow query
is taking its time (the nested loop at line 10), but I still have no
idea why this plan is getting chosen.

Thanks!

- DAP

-----Original Message-----
From: [EMAIL PROTECTED] 
[mailto:[EMAIL PROTECTED] On Behalf Of 
Russell Smith
Sent: Tuesday, November 16, 2004 11:36 PM
To: [EMAIL PROTECTED]
Subject: Re: [PERFORM] query plan question

On Wed, 17 Nov 2004 02:54 pm, you wrote:
 I have a query for which postgres is generating a different 
plan on different machines. The database schema is the same, 
the dataset is the same, the configuration is the same (e.g., 
pg_autovacuum running in both cases), both systems are Solaris 
9. The main difference in the two systems is that one is sparc 
and the other is intel.
 
 The query runs in about 40 ms on the intel box, but takes 
about 18 seconds on the sparc box. Now, the intel boxes we 
have are certainly faster, but I'm curious why the query plan 
might be different.
 
 For the intel:
 
 QUERY PLAN
 Unique  (cost=11.50..11.52 rows=2 width=131)
   ->  Sort  (cost=11.50..11.50 rows=2 width=131)
         Sort Key: up.prefix, s.name, s.tuid, s.foundryversion
         ->  Hash Join  (cost=10.42..11.49 rows=2 width=131)
               Hash Cond: ("outer".dbid = "inner".schema)
               ->  Seq Scan on schema s  (cost=0.00..1.02 rows=2 width=128)
               ->  Hash  (cost=10.41..10.41 rows=4 width=11)
                     ->  Nested Loop  (cost=0.00..10.41 rows=4 width=11)
                           ->  Nested Loop  (cost=0.00..2.14 rows=4 width=4)
                                 ->  Seq Scan on flow fl  (cost=0.00..0.00 rows=1 width=4)
                                       Filter: (servicetype = 646)
                                 ->  Index Scan using usage_flow_i on usage u  (cost=0.00..2.06 rows=6 width=8)
                                       Index Cond: (u.flow = "outer".dbid)
                           ->  Index Scan using usageparameter_usage_i on usageparameter up  (cost=0.00..2.06 rows=1 width=15)
                                 Index Cond: (up.usage = "outer".dbid)
                                 Filter: ((prefix)::text <> 'xsd'::text)
 
 For the sparc:
 
 QUERY PLAN
 Unique  (cost=10.81..10.83 rows=1 width=167)
   ->  Sort  (cost=10.81..10.82 rows=1 width=167)
         Sort Key: up.prefix, s.name, s.tuid, s.foundryversion
         ->  Nested Loop  (cost=9.75..10.80 rows=1 width=167)
               Join Filter: ("outer".flow = "inner".dbid)
               ->  Hash Join  (cost=9.75..10.79 rows=1 width=171)
                     Hash Cond: ("outer".dbid = "inner".schema)
                     ->  Seq Scan on schema s  (cost=0.00..1.02 rows=2 width=128)
                     ->  Hash  (cost=9.75..9.75 rows=1 width=51)
                           ->  Nested Loop  (cost=0.00..9.75 rows=1 width=51)
                                 Join Filter: ("inner".usage = "outer".dbid)
                                 ->  Index Scan using usage_flow_i on usage u  (cost=0.00..4.78 rows=1 width=8)
                                 ->  Index Scan using usageparameter_schema_i on usageparameter up  (cost=0.00..4.96 rows=1 width=51)
                                       Filter: ((prefix)::text <> 'xsd'::text)
               ->  Seq Scan on flow fl  (cost=0.00..0.00 rows=1 width=4)
                     Filter: (servicetype = 646)
 
Unique  (cost=11.50..11.52 rows=2 width=131)
Unique  (cost=10.81..10.83 rows=1 width=167)

The cost estimates are basically the same, about 10 to the first row.
Can you supply EXPLAIN ANALYZE output to see what it's actually doing?

Russell Smith


QUERY PLAN
Unique  (cost=10.81..10.83 rows=1 width=167) (actual time=19390.684..19390.687 rows=1 loops=1)
  ->  Sort  (cost=10.81..10.82 rows=1 width=167) (actual time=19390.678..19390.679 rows=1 loops=1)
        Sort Key: up.prefix, s.name, s.tuid, s.foundryversion
        ->  Nested Loop  (cost=9.75..10.80 rows=1 width=167) (actual time=19377.051..19390.391 rows=1 loops=1)
              Join Filter: ("outer".flow = "inner".dbid)
              ->  Hash Join  (cost=9.75..10.79 rows=1 width=171) (actual time=19173.684..19181.827 rows=770 loops=1)
                    Hash Cond: ("outer".dbid = "inner".schema)
                    ->  Seq Scan on schema s  (cost=0.00..1.02 rows=2 width=128) (actual time=0.240..0.693 rows=20 loops=1)
                    ->  Hash  (cost=9.75..9.75 rows=1 width=51) (actual time=19173.354..19173.354 rows=0 loops=1)
                          ->  Nested Loop  (cost=0.00..9.75 rows=1 width=51) (actual time=30.456..19166.759 rows=770 loops=1)
                                Join Filter: ("inner".usage = "outer".dbid)
                                ->  Index

Re: [PERFORM] query plan question

2004-11-17 Thread David Parker
If they are the same and the PostgreSQL versions are the same, are the
intel machines Xeons?

Yup, dual 3.06-GHz Intel Xeon Processors.

I'm not sure off the top of my head what the sparcs are exactly. We're
in the process of moving completely to intel, but we still have to
support our app on sparc, and we are seeing these weird differences...

- DAP



Re: [PERFORM] query plan question

2004-11-17 Thread Jeff
On Nov 17, 2004, at 7:32 AM, David Parker wrote:
Oh, I didn't realize that analyze gave that much more info. I've got a
lot to learn about this tuning stuff ;-)
I've attached the output. I see from the new output where the slow 
query
is taking its time (the nested loop at line 10), but I still have no
idea why this plan is getting chosen

looks like your stats are incorrect on the sparc.
Did you forget to run vacuum analyze on it?
also, do both db's have the same data loaded?
there are some very different numbers in terms of actual rows floating 
around there...

--
Jeff Trout [EMAIL PROTECTED]
http://www.jefftrout.com/
http://www.stuarthamm.net/


Re: [PERFORM] query plan question

2004-11-17 Thread David Parker
I've got pg_autovacuum running on both platforms. I've verified that the
tables involved in the query have the same number of rows on both
databases.

I'm not sure where to look to see how the stats might be different. The
"good" database's pg_statistic table has 24 more rows than the "bad"
database's, so there's definitely a difference. The "good" database's
pg_statistic has rows for 2 extra tables, but they are not tables
involved in the query in question...

So something must be up with stats, but can you tell me what the most
significant columns in the pg_statistic table are for the planner making
its decision? I'm sure this has been discussed before, so if there's a
thread you can point me to, that would be great - I realize it's a big
general question.
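
A minimal way to eyeball those statistics is the standard pg_stats system
view (a readable presentation of pg_statistic); a sketch, using the table
names from this thread:

    SELECT tablename, attname, null_frac, n_distinct, most_common_vals
    FROM pg_stats
    WHERE tablename IN ('schema', 'usage', 'usageparameter', 'flow');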

Thanks for your time.

- DAP

-----Original Message-----
From: Jeff [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 17, 2004 9:01 AM
To: David Parker
Cc: Russell Smith; [EMAIL PROTECTED]
Subject: Re: [PERFORM] query plan question


On Nov 17, 2004, at 7:32 AM, David Parker wrote:

 Oh, I didn't realize that analyze gave that much more info. 
I've got a 
 lot to learn about this tuning stuff ;-)

 I've attached the output. I see from the new output where the slow 
 query is taking its time (the nested loop at line 10), but I still 
 have no idea why this plan is getting chosen


looks like your stats are incorrect on the sparc.
Did you forget to run vacuum analyze on it?

also, do both db's have the same data loaded?
there are some very different numbers in terms of actual rows floating 
around there...

--
Jeff Trout [EMAIL PROTECTED]
http://www.jefftrout.com/
http://www.stuarthamm.net/





Re: [PERFORM] query plan question

2004-11-17 Thread David Parker
Hmm, I'm really a beginner at this...

It turns out that the pg_statistic table in my good database has records
in it for the tables in the query, while the pg_statistic table in my
bad database has no records for those tables at all!

So I guess I need to figure out why pg_autovacuum isn't analyzing those
tables.

- DAP 

-----Original Message-----
From: David Parker 
Sent: Wednesday, November 17, 2004 9:44 AM
To: 'Jeff'
Cc: Russell Smith; [EMAIL PROTECTED]
Subject: RE: [PERFORM] query plan question

I've got pg_autovacuum running on both platforms. I've 
verified that the tables involved in the query have the same 
number of rows on both databases.

I'm not sure where to look to see how the stats might be 
different. The good database's pg_statistic table has 24 
more rows than that in the bad database, so there's 
definitely a difference. The good database's pg_statistic has 
rows for 2 extra tables, but they are not tables involved in 
the query in question...

So something must be up with stats, but can you tell me what
the most significant columns in the pg_statistic table are for
the planner making its decision? I'm sure this has been
discussed before, so if there's a thread you can point me to,
that would be great - I realize it's a big general question.

Thanks for your time.

- DAP

-----Original Message-----
From: Jeff [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 17, 2004 9:01 AM
To: David Parker
Cc: Russell Smith; [EMAIL PROTECTED]
Subject: Re: [PERFORM] query plan question


On Nov 17, 2004, at 7:32 AM, David Parker wrote:

 Oh, I didn't realize that analyze gave that much more info. 
I've got a
 lot to learn about this tuning stuff ;-)

 I've attached the output. I see from the new output where the slow 
 query is taking its time (the nested loop at line 10), but I still 
 have no idea why this plan is getting chosen


looks like your stats are incorrect on the sparc.
Did you forget to run vacuum analyze on it?

also, do both db's have the same data loaded?
there are some very different numbers in terms of actual rows 
floating 
around there...

--
Jeff Trout [EMAIL PROTECTED]
http://www.jefftrout.com/
http://www.stuarthamm.net/





Re: [PERFORM] query plan question

2004-11-17 Thread Tom Lane
David Parker <[EMAIL PROTECTED]> writes:
 So I guess I need to figure out why pg_autovacuum isn't analyzing those
 tables.

Which autovacuum version are you using?  The early releases had some
nasty bugs that would allow it to skip tables sometimes.  I think all
the known problems are fixed as of recent 7.4.x updates.

regards, tom lane



Re: [PERFORM] query plan question

2004-11-17 Thread David Parker
We're using postgresql 7.4.5. I've only recently put pg_autovacuum in
place as part of our installation, and I'm basically taking the
defaults. I doubt it's a problem with autovacuum itself, but rather with
my configuration of it. I have some reading to do, so any pointers to
existing autovacuum threads would be greatly appreciated!

Thanks.

- DAP

-----Original Message-----
From: Tom Lane [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 17, 2004 10:46 AM
To: David Parker
Cc: Jeff; Russell Smith; [EMAIL PROTECTED]
Subject: Re: [PERFORM] query plan question 

David Parker <[EMAIL PROTECTED]> writes:
 So I guess I need to figure out why pg_autovacuum isn't analyzing 
 those tables.

Which autovacuum version are you using?  The early releases 
had some nasty bugs that would allow it to skip tables 
sometimes.  I think all the known problems are fixed as of 
recent 7.4.x updates.

   regards, tom lane




[PERFORM] Analyzer is clueless

2004-11-17 Thread David Brown
I'm doing some performance profiling with a simple two-table query:

SELECT L.ProductID, sum(L.Amount)
FROM drinv H
JOIN drinvln L ON L.OrderNo = H.OrderNo
WHERE
(OrderDate between '2003-01-01' AND '2003-04-30')
GROUP BY L.ProductID

drinv and drinvln have about 100,000 and 3,500,000 rows respectively. Actual 
data size in the large table is 500-600MB. OrderNo is indexed in both tables, 
as is OrderDate.

The environment is PGSQL 8 on Win2k with 512MB RAM (results are similar to 7.3 
from Mammoth). I've tried tweaking various conf parameters, but apart from 
using up memory, nothing seems to have had a tangible effect - the Analyzer 
doesn't seem to take resources into account like some of the doco suggests.

The date selection represents about 5% of the range. Here's the plan summaries:

Three months (2003-01-01 to 2003-03-30) = 1 second

HashAggregate  (cost=119365.53..119368.74 rows=642 width=26)
  ->  Nested Loop  (cost=0.00..118791.66 rows=114774 width=26)
        ->  Index Scan using drinv_OrderDate on drinv h  (cost=0.00..200.27 rows=3142 width=8)
              Index Cond: ((OrderDate >= '2003-01-01'::date) AND (OrderDate <= '2003-03-30'::date))
        ->  Index Scan using drinvln_OrderNo on drinvln l  (cost=0.00..28.73 rows=721 width=34)
              Index Cond: (l.OrderNo = "outer".OrderNo)


Four months (2003-01-01 to 2003-04-30) = 60 seconds

HashAggregate  (cost=126110.53..126113.74 rows=642 width=26)
  ->  Hash Join  (cost=277.55..125344.88 rows=153130 width=26)
        Hash Cond: ("outer".OrderNo = "inner".OrderNo)
        ->  Seq Scan on drinvln l  (cost=0.00..106671.35 rows=3372935 width=34)
        ->  Hash  (cost=267.07..267.07 rows=4192 width=8)
              ->  Index Scan using drinv_OrderDate on drinv h  (cost=0.00..267.07 rows=4192 width=8)
                    Index Cond: ((OrderDate >= '2003-01-01'::date) AND (OrderDate <= '2003-04-30'::date))


Four months (2003-01-01 to 2003-04-30) with Seq_scan disabled = 75 seconds


HashAggregate  (cost=130565.83..130569.04 rows=642 width=26)
  ->  Merge Join  (cost=519.29..129800.18 rows=153130 width=26)
        Merge Cond: ("outer".OrderNo = "inner".OrderNo)
        ->  Sort  (cost=519.29..529.77 rows=4192 width=8)
              Sort Key: h.OrderNo
              ->  Index Scan using drinv_OrderDate on drinv h  (cost=0.00..267.07 rows=4192 width=8)
                    Index Cond: ((OrderDate >= '2003-01-01'::date) AND (OrderDate <= '2003-04-30'::date))
        ->  Index Scan using drinvln_OrderNo on drinvln l  (cost=0.00..119296.29 rows=3372935 width=34)

Statistics were run on each table before query execution. The random page cost 
was lowered to 2, but as you can see, the estimated costs are wild anyway.

As a comparison, MS SQL Server took less than 15 seconds, or 4 times faster.

MySQL (InnoDB) took 2 seconds, which is 30 times faster.

The query looks straightforward to me (it might be clearer with a subselect), 
so what on earth is wrong?



Re: [PERFORM] memcached and PostgreSQL

2004-11-17 Thread Mike Rylander
On 17 Nov 2004 03:08:20 -0500, Greg Stark <[EMAIL PROTECTED]> wrote:
 Josh Berkus <[EMAIL PROTECTED]> writes:
 
  So memcached becomes a very good place to stick data that's read often but
  not updated often, or alternately data that changes often but is disposable.
  An example of the former is a user+ACL list; an example of the latter is web
  session information ... or simple materialized views.
 
 I would like very much to use something like memcached for a materialized view
 I have. The problem is that I have to join it against other tables.
 
 I've thought about providing a SRF in postgres to read records out of
 memcached but I'm unclear whether it would really help at all.
 
 Has anyone tried anything like this?

I haven't tried it yet, but I plan to.  An interesting case might be
to use plperlu to interface with memcached and store hashes in the
cache via some external process, like a CGI script.  Then just define
a TYPE for the perl SRF to return, and store the data as an array of
hashes with keys matching the TYPE.

A (perhaps useless) example could then be something like:

CREATE TYPE user_info AS (sessionid TEXT, userid INT, lastaccess
TIMESTAMP, lastrequest TEXT);

CREATE FUNCTION get_user_info_by_session (TEXT) RETURNS SETOF user_info AS $$
  use Cache::Memcached;

  my $session = shift;

  # Cache the connection handle across calls; 'servers' takes an arrayref
  # of host:port strings (the port was elided here).
  my $c = $_SHARED{memcached} ||= Cache::Memcached->new( {servers =>
['127.0.0.1:']} );

  my $user_info = $c->get('web_access_list');

  # $user_info looks like
  # [ {userid => 5, lastrequest => 'http://...', lastaccess => localtime(),
  #    sessionid => '123456789'}, { ... } ]
  # and is stored by a CGI.

  my @info = grep { $$_{sessionid} eq $session } @$user_info;

  return \@info;
$$ LANGUAGE 'plperlu';

SELECT u.username, f.lastrequest FROM users u,
get_user_info_by_session('123456789') f WHERE f.userid = u.userid;


Any thoughts? 

 
 --
 greg
 
 


-- 
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer



Re: [PERFORM] query plan question

2004-11-17 Thread Matthew T. O'Connor
David Parker wrote:
We're using postgresql 7.4.5. I've only recently put pg_autovacuum in
place as part of our installation, and I'm basically taking the
defaults. I doubt it's a problem with autovacuum itself, but rather with
my configuration of it. I have some reading to do, so any pointers to
existing autovacuum threads would be greatly appreciated!
Well the first thing to do is increase the verbosity of the 
pg_autovacuum logging output.  If you use -d2 or higher, pg_autovacuum 
will print out a lot of detail on what it thinks the thresholds are and 
why it is or isn't performing vacuums and analyzes.   Attach some of the 
log and I'll take a look at it.
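
For example, a sketch (-d sets the debug level and -L the log file, per the
contrib/pg_autovacuum README; the path and connection options here are
illustrative):

    pg_autovacuum -d 2 -L /var/log/pg_autovacuum.log -U postgres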



Re: [PERFORM] memcached and PostgreSQL

2004-11-17 Thread Darcy Buskermolen
On November 16, 2004 08:00 pm, Michael Adler wrote:
 http://pugs.postgresql.org/sfpug/archives/21.html

 I noticed that some of you left coasters were talking about memcached
 and pgsql. I'm curious to know what was discussed.

 In reading about memcached, it seems that many people are using it to
 circumvent the scalability problems of MySQL (lack of MVCC).

 from their site:

 <snip>
 Shouldn't the database do this?

 Regardless of what database you use (MS-SQL, Oracle, Postgres,
 MysQL-InnoDB, etc..), there's a lot of overhead in implementing ACID
 properties in a RDBMS, especially when disks are involved, which means
 queries are going to block. For databases that aren't ACID-compliant
 (like MySQL-MyISAM), that overhead doesn't exist, but reading threads
 block on the writing threads. memcached never blocks.
 </snip>

 So What does memcached offer pgsql users? It would still seem to offer
 the benefit of a multi-machined cache.

Have a look at the pdf presentation found on the following site:

http://people.freebsd.org/~seanc/pgmemcache/



 -Mike


-- 
Darcy Buskermolen
Wavefire Technologies Corp.
ph: 250.717.0200
fx:  250.763.1759
http://www.wavefire.com



Re: [PERFORM] Tsearch2 really slower than ilike ?

2004-11-17 Thread Hervé Piedvache
Oleg,

Sorry but when I do your request I get :
# select id_site from site where idx_site_name @@  'livejourn';
ERROR:  type   does not exist

What is this?

(private: I don't know what happened with my mail, but I do nothing special
to disturb the contents when I'm writing to you ...)

On Tuesday, November 16, 2004, at 22:13, Oleg Bartunov wrote:
 ok, I downloaded dump of table and here is what I found:

 zz=# select count(*) from tt;
  count
 --------
  183956
 (1 row)

 zz=# select * from stat('select tt from tt') order by ndoc desc, nentry desc, word limit 10;
      word     | ndoc  | nentry
 --------------+-------+--------
  blog         | 12710 |  12835
  weblog       |  4857 |   4859
  news         |  4402 |   4594
  life         |  4136 |   4160
  world        |  1980 |   1986
  journal      |  1882 |   1883
  livejourn    |  1737 |   1737
  thought      |  1669 |   1677
  web          |  1154 |   1161
  scotsman.com |  1138 |   1138
 (10 rows)

 zz=# explain analyze select tt from tt where tt @@ 'blog';
                              QUERY PLAN
 ----------------------------------------------------------------------
  Index Scan using tt_idx on tt  (cost=0.00..728.83 rows=184 width=32) (actual time=0.047..141.110 rows=12710 loops=1)
    Index Cond: (tt @@ '\'blog\''::tsquery)
    Filter: (tt @@ '\'blog\''::tsquery)
  Total runtime: 154.105 ms
 (4 rows)

 It's really fast! So, I don't understand your problem.
 I ran the query on my desktop machine, nothing special.


   Oleg

 On Tue, 16 Nov 2004, Hervé Piedvache wrote:
  Hi,
 
  I'm completely disappointed with Tsearch2 ...
 
  I have a table like this :
            Table "public.site"
      Column    |            Type             |                         Modifiers
  --------------+-----------------------------+------------------------------------------------------------
   id_site      | integer                     | not null default nextval('public.site_id_site_seq'::text)
   site_name    | text                        |
   site_url     | text                        |
   url          | text                        |
   language     | text                        |
   datecrea     | date                        | default now()
   id_category  | integer                     |
   time_refresh | integer                     |
   active       | integer                     |
   error        | integer                     |
   description  | text                        |
   version      | text                        |
   idx_site_name| tsvector                    |
   lastcheck    | date                        |
   lastupdate   | timestamp without time zone |
  Indexes:
      "site_id_site_key" unique, btree (id_site)
      "ix_idx_site_name" gist (idx_site_name)
  Triggers:
      tsvectorupdate_site_name BEFORE INSERT OR UPDATE ON site FOR EACH ROW
      EXECUTE PROCEDURE tsearch2('idx_site_name', 'site_name')
 
  I have 183 956 records in the database ...
 
  SELECT s.site_name, s.id_site, s.description, s.site_url,
 case when exists (select id_user
 from user_choice u
  where u.id_site=s.id_site
 and u.id_user = 1)
  then 1 else 0 end as bookmarked
FROM site s
  WHERE s.idx_site_name @@ to_tsquery('atari');
 
  Explain Analyze:
    QUERY PLAN
  ----------------------------------------------------------------------
   Index Scan using ix_idx_site_name on site s  (cost=0.00..1202.12 rows=184 width=158) (actual time=4687.674..4698.422 rows=1 loops=1)
     Index Cond: (idx_site_name @@ '\'atari\''::tsquery)
     Filter: (idx_site_name @@ '\'atari\''::tsquery)
     SubPlan
       ->  Seq Scan on user_choice u  (cost=0.00..3.46 rows=1 width=4) (actual time=0.232..0.232 rows=0 loops=1)
             Filter: ((id_site = $0) AND (id_user = 1))
   Total runtime: 4698.608 ms
 
  The first time I run the request I get a result in about 28 seconds.
 
  SELECT s.site_name, s.id_site, s.description, s.site_url,
 case when exists (select id_user
 from user_choice u
  where u.id_site=s.id_site
 and u.id_user = 1)
  then 1 else 0 end as bookmarked
FROM site_rss s
  WHERE s.site_name ilike '%atari%'
 
    QUERY PLAN
  ----------------------------------------------------------------------
   Seq Scan on site_rss s  (cost=0.00..11863.16 rows=295 width=158) (actual time=17.414..791.937 rows=12 loops=1)
     Filter: (site_name

Re: [PERFORM] Tsearch2 really slower than ilike ?

2004-11-17 Thread Oleg Bartunov
On Wed, 17 Nov 2004, Hervé Piedvache wrote:
Oleg,
Sorry but when I do your request I get :
# select id_site from site where idx_site_name @@ 'livejourn';
ERROR: type  does not exist
no idea :) btw, what version of postgresql and OS are you running?
Could you try minimal test - check sql commands from tsearch2 sources,
some basic queries from tsearch2 documentation, tutorials.
btw, your query should look like
select id_site from site_rss where idx_site_name @@ 'livejourn';

How did you run your queries at all? I mean in your first message about
poor tsearch2 performance.

What is this?
(private: I don't know what happened with my mail, but I do nothing special to
disturb the contents when I'm writing to you ...)
On Tuesday, November 16, 2004, at 22:13, Oleg Bartunov wrote:
ok, I downloaded dump of table and here is what I found:
zz=# select count(*) from tt;
  count
 --------
  183956
(1 row)
zz=# select * from stat('select tt from tt') order by ndoc desc, nentry desc, word limit 10;
      word     | ndoc  | nentry
 --------------+-------+--------
  blog         | 12710 |  12835
  weblog       |  4857 |   4859
  news         |  4402 |   4594
  life         |  4136 |   4160
  world        |  1980 |   1986
  journal      |  1882 |   1883
  livejourn    |  1737 |   1737
  thought      |  1669 |   1677
  web          |  1154 |   1161
  scotsman.com |  1138 |   1138
(10 rows)
zz=# explain analyze select tt from tt where tt @@ 'blog';
                             QUERY PLAN
----------------------------------------------------------------------
 Index Scan using tt_idx on tt  (cost=0.00..728.83 rows=184 width=32) (actual time=0.047..141.110 rows=12710 loops=1)
   Index Cond: (tt @@ '\'blog\''::tsquery)
   Filter: (tt @@ '\'blog\''::tsquery)
 Total runtime: 154.105 ms
(4 rows)
It's really fast! So, I don't understand your problem.
I ran the query on my desktop machine, nothing special.
Oleg
On Tue, 16 Nov 2004, Hervé Piedvache wrote:
Hi,
I'm completely disappointed with Tsearch2 ...
I have a table like this :
           Table "public.site"
     Column    |            Type             |                         Modifiers
 --------------+-----------------------------+------------------------------------------------------------
  id_site      | integer                     | not null default nextval('public.site_id_site_seq'::text)
  site_name    | text                        |
  site_url     | text                        |
  url          | text                        |
  language     | text                        |
  datecrea     | date                        | default now()
  id_category  | integer                     |
  time_refresh | integer                     |
  active       | integer                     |
  error        | integer                     |
  description  | text                        |
  version      | text                        |
  idx_site_name| tsvector                    |
  lastcheck    | date                        |
  lastupdate   | timestamp without time zone |
 Indexes:
     "site_id_site_key" unique, btree (id_site)
     "ix_idx_site_name" gist (idx_site_name)
 Triggers:
     tsvectorupdate_site_name BEFORE INSERT OR UPDATE ON site FOR EACH ROW
     EXECUTE PROCEDURE tsearch2('idx_site_name', 'site_name')
I have 183 956 records in the database ...
SELECT s.site_name, s.id_site, s.description, s.site_url,
   case when exists (select id_user
   from user_choice u
where u.id_site=s.id_site
   and u.id_user = 1)
then 1 else 0 end as bookmarked
  FROM site s
WHERE s.idx_site_name @@ to_tsquery('atari');
Explain Analyze:
   QUERY PLAN
 ----------------------------------------------------------------------
  Index Scan using ix_idx_site_name on site s  (cost=0.00..1202.12 rows=184 width=158) (actual time=4687.674..4698.422 rows=1 loops=1)
    Index Cond: (idx_site_name @@ '\'atari\''::tsquery)
    Filter: (idx_site_name @@ '\'atari\''::tsquery)
    SubPlan
      ->  Seq Scan on user_choice u  (cost=0.00..3.46 rows=1 width=4) (actual time=0.232..0.232 rows=0 loops=1)
            Filter: ((id_site = $0) AND (id_user = 1))
  Total runtime: 4698.608 ms
The first time I run the request I get a result in about 28 seconds.
SELECT s.site_name, s.id_site, s.description, s.site_url,
   case when exists (select id_user
   from user_choice u
where u.id_site=s.id_site
   and u.id_user = 1)
then 1 else 0 end as bookmarked
  FROM site_rss s
WHERE s.site_name ilike '%atari%'
 

Re: [PERFORM] query plan question

2004-11-17 Thread Matthew T. O'Connor
Well based on the autovacuum log that you attached, all of those tables
are insert-only (at least during the time period included in the log).
Is that correct?  If so, autovacuum will never do a vacuum (unless
required by xid wraparound issues) on those tables.  So this doesn't
appear to be an autovacuum problem.  I'm not sure about the missing
pg_statistic entries; anyone else care to field that one?
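
(For reference, pg_autovacuum computes both of its per-table thresholds as

    threshold = base_value + scaling_factor * reltuples

and compares the vacuum threshold against updates-plus-deletes but the
analyze threshold against inserts-plus-updates-plus-deletes, which is why
insert-only tables see ANALYZEs but never VACUUMs. The -v/-V and -a/-A
flags control the two pairs of settings; the exact defaults are in the
contrib/pg_autovacuum README.)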

Matthew
David Parker wrote:
Thanks. The tables I'm concerned with are named: 'schema', 'usage',
'usageparameter', and 'flow'. It looks like autovacuum is performing
analyzes:
% grep "Performing:" logs/.db.tazz.vacuum.log
[2004-11-17 12:05:58 PM] Performing: ANALYZE
public.scriptlibrary_library
[2004-11-17 12:15:59 PM] Performing: ANALYZE
public.scriptlibraryparm
[2004-11-17 12:15:59 PM] Performing: ANALYZE public.usageparameter
[2004-11-17 12:21:00 PM] Performing: ANALYZE public.usageproperty
[2004-11-17 12:21:00 PM] Performing: ANALYZE public.route
[2004-11-17 12:21:00 PM] Performing: ANALYZE public.usageparameter
[2004-11-17 12:21:00 PM] Performing: ANALYZE
public.scriptlibrary_library
[2004-11-17 12:26:01 PM] Performing: ANALYZE public.usage
[2004-11-17 12:26:01 PM] Performing: ANALYZE public.usageparameter
[2004-11-17 12:31:04 PM] Performing: ANALYZE public.usageproperty
[2004-11-17 12:36:04 PM] Performing: ANALYZE public.route
[2004-11-17 12:36:04 PM] Performing: ANALYZE public.service_usage
[2004-11-17 12:36:04 PM] Performing: ANALYZE public.usageparameter
But when I run the following:
select * from pg_statistic where starelid in 
(select oid from pg_class where relname in
('schema','usageparameter','flow','usage'))

it returns no records. Shouldn't it? It doesn't appear to be doing a
vacuum anywhere, which makes sense because none of these tables have
over the default threshold of 1000. Are there statistics which only get
generated by vacuum?
I've attached a gzip of the pg_autovacuum log file, with -d 3.
Thanks again.
- DAP
 

-----Original Message-----
From: Matthew T. O'Connor [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 17, 2004 11:41 AM
To: David Parker
Cc: Tom Lane; Jeff; Russell Smith; [EMAIL PROTECTED]
Subject: Re: [PERFORM] query plan question

David Parker wrote:

We're using postgresql 7.4.5. I've only recently put pg_autovacuum in
place as part of our installation, and I'm basically taking the
defaults. I doubt it's a problem with autovacuum itself, but rather
with my configuration of it. I have some reading to do, so any pointers
to existing autovacuum threads would be greatly appreciated!

Well the first thing to do is increase the verbosity of the
pg_autovacuum logging output.  If you use -d2 or higher, pg_autovacuum
will print out a lot of detail on what it thinks the thresholds are and
why it is or isn't performing vacuums and analyzes.  Attach some of the
log and I'll take a look at it.




Re: [PERFORM] Analyzer is clueless

2004-11-17 Thread Josh Berkus
Dan,

 I'm doing some performance profiling with a simple two-table query:

Please send EXPLAIN ANALYZE for each query, and not just EXPLAIN.   Thanks!

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] memcached and PostgreSQL

2004-11-17 Thread Michael Adler
On Wed, Nov 17, 2004 at 09:13:09AM -0800, Darcy Buskermolen wrote:
 On November 16, 2004 08:00 pm, Michael Adler wrote:
  http://pugs.postgresql.org/sfpug/archives/21.html
 
  I noticed that some of you left coasters were talking about memcached
  and pgsql. I'm curious to know what was discussed.
 
 Have a look at the pdf presentation found on the following site:
 
 http://people.freebsd.org/~seanc/pgmemcache/

Thanks for that.

That presentation was rather broad and the API seems rather general
purpose, but I wonder why you would really want to access the cache by
way of the DB? If one major point of memcache is to allocate RAM to a
low-overhead server instead of to the RDBMS's disk cache, why would
you add the overhead of the RDBMS to the process?  (This is a bit of a
straw man, but I'm just trying to flesh out the pros and cons.)

Still, it seems like a convenient way to maintain cache coherency,
assuming that your application doesn't already have a clean way to do
that.

(just my uninformed opinion, though...)

-Mike



Re: [PERFORM] Analyzer is clueless

2004-11-17 Thread Joshua D. Drake
Hello,
Have you tried increasing the statistics target for orderdate and 
rerunning analyze?
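
A minimal sketch of that (the target value is illustrative, and the new
statistics only exist after the next analyze):

    ALTER TABLE drinv ALTER COLUMN OrderDate SET STATISTICS 150;
    ANALYZE drinv;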

Sincerely,
Joshua D. Drake
David Brown wrote:
I'm doing some performance profiling with a simple two-table query:
SELECT L.ProductID, sum(L.Amount)
FROM drinv H
JOIN drinvln L ON L.OrderNo = H.OrderNo
WHERE
(OrderDate between '2003-01-01' AND '2003-04-30')
GROUP BY L.ProductID
drinv and drinvln have about 100,000 and 3,500,000 rows respectively. Actual 
data size in the large table is 500-600MB. OrderNo is indexed in both tables, 
as is OrderDate.
The environment is PGSQL 8 on Win2k with 512MB RAM (results are similar to 7.3 
from Mammoth). I've tried tweaking various conf parameters, but apart from 
using up memory, nothing seems to have had a tangible effect - the Analyzer 
doesn't seem to take resources into account like some of the doco suggests.
The date selection represents about 5% of the range. Here's the plan summaries:
Three months (2003-01-01 to 2003-03-30) = 1 second
HashAggregate  (cost=119365.53..119368.74 rows=642 width=26)
  ->  Nested Loop  (cost=0.00..118791.66 rows=114774 width=26)
        ->  Index Scan using drinv_OrderDate on drinv h  (cost=0.00..200.27 rows=3142 width=8)
              Index Cond: ((OrderDate >= '2003-01-01'::date) AND (OrderDate <= '2003-03-30'::date))
        ->  Index Scan using drinvln_OrderNo on drinvln l  (cost=0.00..28.73 rows=721 width=34)
              Index Cond: (l.OrderNo = "outer".OrderNo)

Four months (2003-01-01 to 2003-04-30) = 60 seconds

HashAggregate  (cost=126110.53..126113.74 rows=642 width=26)
  ->  Hash Join  (cost=277.55..125344.88 rows=153130 width=26)
        Hash Cond: ("outer".OrderNo = "inner".OrderNo)
        ->  Seq Scan on drinvln l  (cost=0.00..106671.35 rows=3372935 width=34)
        ->  Hash  (cost=267.07..267.07 rows=4192 width=8)
              ->  Index Scan using drinv_OrderDate on drinv h  (cost=0.00..267.07 rows=4192 width=8)
                    Index Cond: ((OrderDate >= '2003-01-01'::date) AND (OrderDate <= '2003-04-30'::date))

Four months (2003-01-01 to 2003-04-30) with Seq_scan disabled = 75 seconds

HashAggregate  (cost=130565.83..130569.04 rows=642 width=26)
  ->  Merge Join  (cost=519.29..129800.18 rows=153130 width=26)
        Merge Cond: ("outer".OrderNo = "inner".OrderNo)
        ->  Sort  (cost=519.29..529.77 rows=4192 width=8)
              Sort Key: h.OrderNo
              ->  Index Scan using drinv_OrderDate on drinv h  (cost=0.00..267.07 rows=4192 width=8)
                    Index Cond: ((OrderDate >= '2003-01-01'::date) AND (OrderDate <= '2003-04-30'::date))
        ->  Index Scan using drinvln_OrderNo on drinvln l  (cost=0.00..119296.29 rows=3372935 width=34)
Statistics were run on each table before query execution. The random page cost 
was lowered to 2, but as you can see, the estimated costs are wild anyway.
As a comparison, MS SQL Server took less than 15 seconds, or 4 times faster.
MySQL (InnoDB) took 2 seconds, which is 30 times faster.
The query looks straightforward to me (it might be clearer with a subselect), 
so what on earth is wrong?

--
Command Prompt, Inc., home of PostgreSQL Replication, and plPHP.
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - [EMAIL PROTECTED] - http://www.commandprompt.com
Mammoth PostgreSQL Replicator. Integrated Replication for PostgreSQL




Re: [PERFORM] memcached and PostgreSQL

2004-11-17 Thread Josh Berkus
Michael,

 Still, it seems like a convenient way to maintain cache coherency,
 assuming that your application doesn't already have a clean way to do
 that.

Precisely.  The big problem with memory caching is the cache getting out of
sync with the database.  Updating the cache through database triggers helps
ameliorate that.
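
A minimal sketch of the trigger approach, assuming pgmemcache's
memcache_delete() is installed (the table, column, and key scheme are
illustrative):

    CREATE OR REPLACE FUNCTION invalidate_user_cache() RETURNS trigger AS '
    BEGIN
        -- drop the stale entry; the next read repopulates the cache
        PERFORM memcache_delete(''user:'' || NEW.userid);
        RETURN NEW;
    END;
    ' LANGUAGE plpgsql;

    CREATE TRIGGER users_cache_inval AFTER UPDATE ON users
        FOR EACH ROW EXECUTE PROCEDURE invalidate_user_cache();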

However, our inability to pass messages with NOTIFY somewhat limits the
utility of this solution.  Sean wants on-commit triggers, but there are some
major issues to work out with that.  Passing messages with NOTIFY would be
easier and almost as good.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



[PERFORM] sort_mem affect on inserts?

2004-11-17 Thread David Parker
I understand that the sort_mem conf setting affects queries with order by, 
etc., and the doc mentions that it is used in create index. Does sort_mem 
affect the updating of indexes, i.e., can the sort_mem setting affect the 
performance of inserts?

- DAP
--
David Parker    Tazz Networks    (401) 709-5130
 



Re: [PERFORM] sort_mem affect on inserts?

2004-11-17 Thread Josh Berkus
David,

 I understand that the sort_mem conf setting affects queries with order by,
 etc., and the doc mentions that it is used in create index. Does sort_mem
 affect the updating of indexes, i.e., can the sort_mem setting affect the
 performance of inserts?

Only if the table has Foreign Keys whose lookup might require a large sort.
Otherwise, no.
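
Where such a sort is anticipated (say, a bulk load against a table with
large foreign-key checks, or a CREATE INDEX), sort_mem can be raised for
just that session; the value below is illustrative and in kilobytes:

    SET sort_mem = 65536;   -- 64MB for this session only
    -- ... run the load or CREATE INDEX ...
    RESET sort_mem;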

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



[PERFORM] postgres eating CPU

2004-11-17 Thread [EMAIL PROTECTED]
Hi,
in March there was an interesting discussion on the list with the 
subject postgres eating CPU on HP9000.

Now I have the same problem on a Dell dual-processor machine.
Anybody know if there was a solution?
Thanks
Piergiorgio


Re: [PERFORM] postgres eating CPU

2004-11-17 Thread Josh Berkus

 in March there was an interesting discussion on the list with the
 subject postgres eating CPU on HP9000.

Link, please?

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Analyzer is clueless

2004-11-17 Thread Simon Riggs
On Thu, 2004-11-18 at 02:08, David Brown wrote:
 Statistics were run on each table before query execution. The random page 
 cost was lowered to 2, but as you can see, the estimated costs are wild 
 anyway.
 
 As a comparison, MS SQL Server took less than 15 seconds, or 4 times faster.
 
 MySQL (InnoDB) took 2 seconds, which is 30 times faster.
 
 The query looks straightforward to me (it might be clearer with a subselect), 
 so what on earth is wrong?

The query is, as you say, straightforward.

You are clearly working with a query that is on the very edge of the
decision between using an index or not. 

The main issue is that PostgreSQL's default histogram statistics setting
is lower than other RDBMS. This means that it is less able to
discriminate between cases such as yours that are close to the edge.
This is a trade-off between run-time of the ANALYZE command and the
benefit it produces. As Joshua suggests, increasing the statistics
target for this table will likely allow the optimizer to correctly
determine the selectivity of the index and take the right path.

If this is a general RDBMS comparison, you may wish to raise the
system-wide default_statistics_target to 80, or at least above the
default of 10.

To improve this query, you may wish to extend the table's statistics
target using:

ALTER TABLE drinv
ALTER COLUMN OrderDate SET STATISTICS 100;
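
The new target only takes effect the next time statistics are gathered, so
follow it with:

    ANALYZE drinv;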

which should allow the planner to more accurately estimate statistics
and thereby select an index, if appropriate.

The doco has recently been changed with regard to effective_cache_size;
you don't mention what beta release level you're using. That is the only
planner parameter that takes cache size into account, so any other
changes would certainly have zero effect on this *plan* though might
still benefit execution time.

Please post EXPLAIN ANALYZE output for any further questions.

-- 
Best Regards, Simon Riggs




Re: [PERFORM] postgres eating CPU

2004-11-17 Thread [EMAIL PROTECTED]
Josh Berkus wrote:

in March there was an interesting discussion on the list with the
subject postgres eating CPU on HP9000.

Link, please?

http://archives.postgresql.org/pgsql-performance/2004-03/msg00380.php



Re: [PERFORM] postgres eating CPU

2004-11-17 Thread Josh Berkus

 in March there was an interesting discussion on the list with the
 subject postgres eating CPU on HP9000.

Aha, this one.   Yeah, I believe that they upgraded to 7.4 in order to deal
with REINDEX issues.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] postgres eating CPU

2004-11-17 Thread Tom Lane
[EMAIL PROTECTED] <[EMAIL PROTECTED]> writes:
 in March there was an interesting discussion on the list with the
 subject postgres eating CPU on HP9000.
http://archives.postgresql.org/pgsql-performance/2004-03/msg00380.php

Reviewing that, the problem is most likely that (a) they didn't have
max_fsm_pages set high enough to cover the database, and (b) they were
running 7.3.* which is prone to index bloat.
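
For reference, the settings in question live in postgresql.conf; the values
below are purely illustrative, and max_fsm_pages should be sized to cover
the pages with reclaimable space across the whole cluster:

    max_fsm_pages = 200000
    max_fsm_relations = 1000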

regards, tom lane



Re: [PERFORM] Analyzer is clueless

2004-11-17 Thread Jim C. Nasby
On Wed, Nov 17, 2004 at 10:32:48PM +, Simon Riggs wrote:
 The main issue is that PostgreSQL's default histogram statistics setting
 is lower than other RDBMS. This means that it is less able to
 discriminate between cases such as yours that are close to the edge.
 This is a trade-off between run-time of the ANALYZE command and the
 benefit it produces. As Joshua suggests, increasing the statistics
 target for this table will likely allow the optimizer to correctly
 determine the selectivity of the index and take the right path.

Is there still a good reason to have the histogram stats so low? Should
the default be changed to more like 100 at this point?

Also, how extensively does the planner use n_distinct, null_frac,
reltuples and the histogram to see what the odds are of finding a unique
value or a low number of values? I've seen cases where it seems the
planner doesn't think it'll be getting a unique value or a small set of
values even though the stats indicate that it should be.

One final question... would there be interest in a process that would
dynamically update the histogram settings for tables based on how
distinct/unique each field was?
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?



Re: [PERFORM] Analyzer is clueless

2004-11-17 Thread Josh Berkus
Jim,

 Is there still a good reason to have the histogram stats so low? Should
 the default be changed to more like 100 at this point?

Low overhead.  This is actually a TODO for me for 8.1.   I need to find some 
test cases to set a differential level of histogram access for indexed 
fields, so like 10 for most fields but 100/150/200 for indexed fields.

However, I got stalled on finding test cases and then ran out of time.

 Also, how extensively does the planner use n_distinct, null_frac,
 reltuples and the histogram to see what the odds are of finding a unique
 value or a low number of values? I've seen cases where it seems the
 planer doesn't think it'll be getting a unique value or a small set of
 values even though stats indicates that it should be.

 One final question... would there be interest in a process that would
 dynamically update the histogram settings for tables based on how
 distinct/unique each field was?

Well, the process by which the analyzer decides that a field is unique could 
probably use some troubleshooting.   And we always, always could use 
suggestions/tests/help with the query planner.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Analyzer is clueless

2004-11-17 Thread Tom Lane
 I've seen cases where it seems the
 planner doesn't think it'll be getting a unique value or a small set of
 values even though the stats indicate that it should be.

A test case exhibiting the problem would be helpful.

regards, tom lane



Re: [PERFORM] postgres eating CPU

2004-11-17 Thread Joshua D. Drake
Josh Berkus wrote:
in March there was an interesting discussion on the list with the
subject postgres eating CPU on HP9000.
http://archives.postgresql.org/pgsql-performance/2004-03/msg00380.php
--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - [EMAIL PROTECTED] - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL

