Re: [PERFORM] Speeding up this function

2004-10-19 Thread Jeremy Dunn
 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of 
 Matt Nuzum
 Sent: Tuesday, October 19, 2004 3:35 PM
 To: pgsql-performance
 Subject: [PERFORM] Speeding up this function
 
snip
 
 All it does is try to link pageviews together into a session. 
 here's the function:
  create or replace function usage_normalize_session
  (varchar(12), inet, timestamptz) returns integer as '
  DECLARE
  -- $1 = Account ID, $2 = IP Address, $3 = Time
  RecordSet record;
  BEGIN
  SELECT INTO RecordSet DISTINCT sessionid FROM usage_access ua
  WHERE ua.accountid = $1
  AND ua.client = $2
  AND ua.atime >= ($3 - ''20 min''::interval)::timestamptz;

  if found
  then return RecordSet.sessionid;
  end if;

  return nextval(''usage_session_ids'');
  END;'
  language plpgsql;
 

This is probably a stupid question, but why are you trying to create
sessions after the fact?  Since it appears that users of your site must
log in, why not just assign a sessionID to them at login time, and keep
it in the URL for the duration of the session?  Then it would be easy to
track where they've been.
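
For illustration, a minimal sketch of that approach, reusing the sequence
and column names from the function above (the literal values and the exact
column list of usage_access are assumptions on my part):

    -- At login, allocate one session id for the user:
    SELECT nextval('usage_session_ids');   -- e.g. returns 42

    -- Carry that id in the URL, and stamp it onto each pageview as it is
    -- logged, instead of reconstructing sessions afterwards:
    INSERT INTO usage_access (accountid, client, atime, sessionid)
    VALUES ('ACCT0001', '192.168.0.10'::inet, now(), 42);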

- Jeremy


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Equivalent praxis to CLUSTERED INDEX?

2004-08-27 Thread Jeremy Dunn


 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of 
 Bruce Momjian
 Sent: Friday, August 27, 2004 1:27 PM
 To: Adi Alurkar
 Cc: [EMAIL PROTECTED]
 Subject: Re: [PERFORM] Equivalent praxis to CLUSTERED INDEX?
 
 
 
 But what is the advantage of non-full pages in Oracle?
 

One advantage has to do with updates of variable-length columns, e.g.
varchars.

If the block is fully packed with data, an update to a varchar column
that makes the column wider causes row chaining.  This means that a
portion of the row is stored in a different data block, which may be
somewhere completely different in the storage array.  Retrieving that
row (or even just that column from that row) as a unit may now require
additional disk seek(s).

Leaving some space for updates in each data block doesn't prevent this
problem completely, but mitigates it to a certain extent.  If for
instance a row is typically inserted with a null value for a varchar
column, but the application developer knows it will almost always get
updated with some value later on, then leaving a certain percentage of
empty space in each block allocated to that table makes sense.

Conversely, if you know that your data is never going to get updated
(e.g. a data warehousing application), you might specify to pack the
blocks as full as possible.  This makes for the most efficient data
retrieval performance.
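
For concreteness, here is a hedged Oracle DDL sketch (the table names and
percentages are invented for illustration); PCTFREE is the knob that
reserves that per-block free space:

    -- Leave 20% of each block free for rows whose varchars grow after insert.
    CREATE TABLE customer_notes (
        note_id  NUMBER PRIMARY KEY,
        note     VARCHAR2(4000)
    ) PCTFREE 20;

    -- Insert-once, read-mostly warehouse data: pack blocks as full as possible.
    CREATE TABLE customer_notes_archive (
        note_id  NUMBER,
        note     VARCHAR2(4000)
    ) PCTFREE 0;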

- Jeremy


---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] 50 000 000 Table entries and I have to do a keyword search HELP NEEDED

2004-06-15 Thread Jeremy Dunn

One option that does not take advantage of any fancy indexing methods is
to create a trigger on the table, on insert/update/delete, which extracts
each individual word from the field you care about and creates an entry
in a separate 'keyword' table (keyword = word, value = pk of your
original table).  Then index the keyword table on the keyword column and
do your searches from there.  This should improve performance
substantially, even on very large result sets, because the keyword table
rows are very small and thus many of them fit in a disk block.  A rough
sketch follows below.
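
Here is a rough, untested sketch of that approach; the table and column
names (items, title, id) are placeholders, and the syntax is written for a
newer PostgreSQL release than the versions discussed in this thread
(dollar quoting, regexp_split_to_table):

    -- Keyword side table: one small row per (word, original-row) pair.
    CREATE TABLE item_keywords (
        keyword  text    NOT NULL,
        item_id  integer NOT NULL   -- pk of the original row
    );
    CREATE INDEX item_keywords_kw_idx ON item_keywords (keyword);

    CREATE OR REPLACE FUNCTION items_keyword_sync() RETURNS trigger AS $$
    BEGIN
        IF TG_OP IN ('UPDATE', 'DELETE') THEN
            DELETE FROM item_keywords WHERE item_id = OLD.id;
        END IF;
        IF TG_OP IN ('INSERT', 'UPDATE') THEN
            INSERT INTO item_keywords (keyword, item_id)
            SELECT DISTINCT lower(w), NEW.id
            FROM regexp_split_to_table(NEW.title, E'\\W+') AS w
            WHERE w <> '';
        END IF;
        RETURN NULL;  -- AFTER trigger: return value is ignored
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER items_keyword_sync
    AFTER INSERT OR UPDATE OR DELETE ON items
    FOR EACH ROW EXECUTE PROCEDURE items_keyword_sync();

    -- A keyword search then becomes an indexed join:
    -- SELECT i.* FROM items i
    --   JOIN item_keywords k ON k.item_id = i.id
    --  WHERE k.keyword = 'searched';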

- Jeremy

  
  -Original Message-
  From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of borajetta
  Sent: Monday, June 07, 2004 5:47 PM
  To: [EMAIL PROTECTED]
  Subject: [PERFORM] 50 000 000 Table entries and I have to do a keyword search HELP NEEDED
  
  So I have a table with about 50 million entries in it, and I have to do
  a keyword search.

  The keyword search is done on the title of the entry.  For example, an
  entry could be "This is a title string which could be searched".

  I have tried a few ways to search but I get horrible search times.  Some
  keywords will come up with matches as big as .25 million, but most are
  around 1000-5000.

  I use an index which narrows the table down to about 1.5-2 million
  entries.

  I used 2 tables which had a 1:1 correspondence.  One held a gist index
  which was on an int field which searched for the keyword.  Then I would
  join the table to another to retrieve the rest of the information about
  the items it matched.

  This was slow even for returning 100 entries: about 10 seconds,
  sometimes 5.  But when I start getting 1xxx entries it's about 30-50
  seconds.  The rest is just horrible.

  How should I set up my indexes and/or tables?  We were thinking of
  putting the index inside one table so the join would not have to be
  done, but this still returns rather slow results.

  I have not fully tested this method, but it looks like when run for just
  the keyword search on the title, with no joining, it can return in about
  10 seconds or less.  This is a great improvement, but I am currently
  going to make the table all in one and see how long it will take.  I
  believe it will not be much more, as there will be no join needed, only
  the returning of some attribute fields.

  This is still not the kind of time I would like to see; I wanted
  something around 2 seconds or less.  I know there is a lot of
  information, especially if .25 million rows are to be returned, but if
  there are only 1xxx-9xxx rows to be returned, I believe 2 seconds seems
  about right.

  How do search engines do it?
  Any suggestions are welcome.

  Thanks


Re: [PERFORM] index v. seqscan for certain values

2004-04-13 Thread Jeremy Dunn
  When I just tried it again with a value of 300, analyze, then run the
  query, I get a *worse* result for an estimate.  I don't understand this.
 
 That's annoying.  How repeatable are these results --- if you 
 do ANALYZE over again several times, how much does the row 
 count estimate change each time?  (It should change somewhat, 
 since ANALYZE is taking a random sample, but one would like 
 to think not a whole lot.)  Is the variance more or less at 
 the higher stats target?  Take a look at a few different CID 
 values to get a sense of the accuracy, don't look at just one ...

Yes, it's repeatable.  I tried a bunch of times, and there are only
small variations in the stats for the higher stat targets.

 (Actually, you might find it more profitable to look at the 
 pg_stats entry for the CID column rather than 
 reverse-engineering the stats via ANALYZE.  Look at how well 
 the most-common-values list and associated frequency numbers 
 track reality.)

I checked the accuracy of the stats for various values, and there is a
wide variation.  I see some values where the estimate is 1.75x the
actual; and others where the estimate is .44x the actual.
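
(For anyone wanting to repeat that check, a sketch of the comparison I
mean, assuming the table is still called xxx and the column is cid; the
literal CID is just one of the values quoted earlier in this thread:)

   -- Planner's view: the most-common CID values and their estimated frequencies.
   SELECT most_common_vals, most_common_freqs
   FROM pg_stats
   WHERE tablename = 'xxx' AND attname = 'cid';

   -- Reality check for one of those values: divide this count by the total
   -- row count (~1.3M) and compare it with the frequency reported above.
   SELECT count(*) FROM xxx WHERE cid = 7191032;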

 Also, can you think of any reason for the distribution of CID 
 values to be nonuniform within the table?  For instance, do 
 rows get inserted in order of increasing CID, or is there any 
 clustering of rows with the same CID?

This is almost certainly the answer.  The data is initially inserted in
chunks for each CID, and later on there is a more normal distribution of
insert/update/deletes across all CIDs; and then again a new CID will
come with a large chunk of rows, etc.

Interestingly, I tried increasing the stats target for the CID column to
2000, analyzing, and checking the accuracy of the stats again.  Even
with this relatively high value, the accuracy of the stats is not that
close.  The value giving .44x previously now gives an estimate .77x of
actual.  Another value which was at 1.38x of actual is now at .71x of
actual!

Then just for kicks I set the statistics target to 100,000 (!), analyzed,
and ran the query again.  For the same CID I still got an estimated row
count that is .71x the actual rows returned.  Why is this not better?  I
wonder how high I'd have to set the statistics target to get really
good data, given the uneven data distribution of this table.  Is there
any other technique that works better to get good estimates, given
uneven distribution of values?

So I think this explains the inaccurate stats; and the solution as far
as I'm concerned is to adjust the two params mentioned yesterday
(raising effective_cache_size and lowering random_page_cost).

Thanks again for the help!
- Jeremy


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] index v. seqscan for certain values

2004-04-13 Thread Jeremy Dunn

 There's a hard limit of 1000, I believe.  Didn't it give you
 a warning saying so?

No warning at 2000, and no warning at 100,000 either!

Remember we are still on 7.2.x.  The docs here
http://www.postgresql.org/docs/7.2/static/sql-altertable.html don't say
anything about a limit.  

This is good to know, if it's true.  Can anyone confirm?
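
(One way to see what value actually got stored, a sketch assuming the
table/column names from earlier in the thread; whether any cap is applied
at ALTER time or at ANALYZE time I can't say:)

   -- attstattarget is the per-column statistics target ANALYZE will use;
   -- if ALTER TABLE silently clamps large values, it should show up here.
   SELECT a.attstattarget
   FROM pg_attribute a, pg_class c
   WHERE a.attrelid = c.oid
     AND c.relname = 'xxx'
     AND a.attname = 'cid';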

- Jeremy


---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


[PERFORM] index v. seqscan for certain values

2004-04-12 Thread Jeremy Dunn

I've searched the archives and can't find an answer to this seemingly
simple question.  Apologies if it's too common.

The table in question has ~1.3M rows.  It has 85 columns, 5 of which
have single-column indexes.

The column in question (CID) has 183 distinct values.  For these values,
the largest has ~38,000 rows, and the smallest has 1 row.  About 30
values have < 100 rows, and about 10 values have > 20,000 rows.

The database is 7.2.3 running on RedHat 7.1. (we are in process of
upgrading to PG 7.4.2)  All of the query plan options are enabled, and
the cpu costs are set to the default values (cpu_tuple_cost is 0.01,
cpu_index_tuple_cost is 0.001).  The database is VACUUM'd every night.

The problem:
A simple query:
   select count(*) from xxx where CID=smalval
where smalval is a CID value which has relatively few rows, returns a
plan using the index on that column.

   explain analyze select count(*) from xxx where cid=869366;
   Aggregate  (cost=19136.33..19136.33 rows=1 width=0) (actual time=78.49..78.49 rows=1 loops=1)
     ->  Index Scan using xxx_cid on emailrcpts  (cost=0.00..19122.21 rows=5648 width=0) (actual time=63.40..78.46 rows=1 loops=1)
   Total runtime: 78.69 msec

The same plan is true for values which have up to about 20,000 rows:

   explain analyze select count(*) from xxx where cid=6223341;
   Aggregate  (cost=74384.19..74384.19 rows=1 width=0) (actual time=11614.89..11614.89 rows=1 loops=1)
     ->  Index Scan using xxx_cid on emailrcpts  (cost=0.00..74329.26 rows=21974 width=0) (actual time=35.75..11582.10 rows=20114 loops=1)
   Total runtime: 11615.05 msec

However for the values that have > 20,000 rows, the plan changes to a
sequential scan, which is proportionately much slower.

   explain analyze select count(*) from xxx where cid=7191032;
   Aggregate  (cost=97357.61..97357.61 rows=1 width=0) (actual time=46427.81..46427.82 rows=1 loops=1)
     ->  Seq Scan on xxx  (cost=0.00..97230.62 rows=50792 width=0) (actual time=9104.45..46370.27 rows=37765 loops=1)
   Total runtime: 46428.00 msec

The question: why does the planner consider a sequential scan to be
better for these top 10 values?  In terms of elapsed time it is more
than twice as slow, proportionate to an index scan for the same number
of rows.

What I tried:

A) alter table xxx alter column cid set statistics 500;
   analyze xxx;
This does not affect the results.

B) dropped/rebuilt the index, with no improvement.

C) decreasing cpu_index_tuple_cost by a factor of up to 1000, with no
success.

D) force an index scan for the larger values by using a very high value
for cpu_tuple_cost (e.g. .5), but this doesn't seem like a wise thing to
do.

Your thoughts appreciated in advance!

- Jeremy

7+ years experience in Oracle performance-tuning
relatively new to postgresql




Re: [PERFORM] index v. seqscan for certain values

2004-04-12 Thread Jeremy Dunn
Sorry, I should have written that we do VACUUM VERBOSE ANALYZE every
night.

- Jeremy

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Bill Moran
Sent: Monday, April 12, 2004 12:09 PM
To: [EMAIL PROTECTED]
Cc: Postgresql Performance
Subject: Re: [PERFORM] index v. seqscan for certain values


Quick bit of input, since you didn't mention it.

How often do you run ANALYZE?  I found it interesting that a database I
was doing tests on sped up by a factor of 20 after ANALYZE.  If your
data changes a lot, you should probably schedule ANALYZE to run with
VACUUM.

Jeremy Dunn wrote:
 I've searched the archives and can't find an answer to this seemingly
 simple question.  Apologies if it's too common.
  
 The table in question has ~1.3M rows.  It has 85 columns, 5 of which
 have single-column indexes.
  
 The column in question (CID) has 183 distinct values.  For these 
 values,
 the largest has ~38,000 rows, and the smallest has 1 row.  About 30 
 values have < 100 rows, and about 10 values have > 20,000 rows.
  
 The database is 7.2.3 running on RedHat 7.1. (we are in process of 
 upgrading to PG 7.4.2)  All of the query plan options are enabled, and 
 the cpu costs are set to the default values (cpu_tuple_cost is 0.01,
 cpu_index_tuple_cost is 0.001).  The database is VACUUM'd every night.
  
 The problem:
 A simple query:
 select count(*) from xxx where CID=smalval
 where smalval is a CID value which has relatively few rows, returns 
 a
 plan using the index on that column.
  
explain analyze select count(*) from xxx where cid=869366;
Aggregate  (cost=19136.33..19136.33 rows=1 width=0) (actual
 time=78.49..78.49 rows=1 loops=1)
   ->  Index Scan using xxx_cid on emailrcpts  (cost=0.00..19122.21 
 rows=5648 width=0) (actual time=63.40..78.46 rows=1 loops=1)
Total runtime: 78.69 msec
  
 The same plan is true for values which have up to about 20,000 rows:
  
explain analyze select count(*) from xxx where cid=6223341;
Aggregate  (cost=74384.19..74384.19 rows=1 width=0) (actual
 time=11614.89..11614.89 rows=1 loops=1)
   ->  Index Scan using xxx_cid on emailrcpts  (cost=0.00..74329.26 
 rows=21974 width=0) (actual time=35.75..11582.10 rows=20114 loops=1)
Total runtime: 11615.05 msec
 However for the values that have > 20,000 rows, the plan changes to a 
 sequential scan, which is proportionately much slower.
  
explain analyze select count(*) from xxx where cid=7191032;
Aggregate  (cost=97357.61..97357.61 rows=1 width=0) (actual
 time=46427.81..46427.82 rows=1 loops=1)
   ->  Seq Scan on xxx  (cost=0.00..97230.62 rows=50792 width=0) 
 (actual time=9104.45..46370.27 rows=37765 loops=1)
 Total runtime: 46428.00 msec
  
  
 The question: why does the planner consider a sequential scan to be
 better for these top 10 values?  In terms of elapsed time it is more 
 than twice as slow, proportionate to an index scan for the same number 
 of rows.
  
 What I tried:
  
 A) alter table xxx alter column cid set statistics 500;   
 analyze xxx;
 This does not affect the results.
  
 B)  dropped/rebuilt the index, with no improvement.
  
 C) decreasing cpu_index_tuple_cost by a factor of up to 1000, with no
 success
  
 D) force an index scan for the larger values by using a very high 
 value
 for cpu_tuple_cost (e.g. .5) but this doesn't seem like a wise thing
to do.
  
 Your thoughts appreciated in advance!
  
 - Jeremy
  
 7+ years experience in Oracle performance-tuning
 relatively new to postgresql


-- 
Bill Moran
Potential Technologies
http://www.potentialtech.com




---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] index v. seqscan for certain values

2004-04-12 Thread Jeremy Dunn
 Jeremy Dunn [EMAIL PROTECTED] writes:
  The question: why does the planner consider a sequential scan to be 
  better for these top 10 values?
 
 At some point a seqscan *will* be better.  In the limit, if 
 the key being sought is common enough to occur on every page 
 of the table, it's certain that a seqscan will require less 
 I/O than an indexscan (because reading the index isn't 
 actually saving you any heap fetches). In practice the 
 breakeven point is less than that because Unix kernels are 
 better at handling sequential than random access.
 
 Your gripe appears to be basically that the planner's idea of 
 the breakeven point is off a bit.  It looks to me like it's 
 within about a factor of 2 of being right, though, which is 
 not all that bad when it's using generic cost parameters.

Agreed.  However, given that count(*) is a question that can be answered
_solely_ using the index (without reference to the actual data blocks),
I'd expect that the break-even point would be considerably higher than
the < 3% (~38,000 / ~1.3M) I'm currently getting.  Does PG not use
solely the index in this situation??

  A) alter table xxx alter column cid set statistics 500;
  analyze xxx;
  This does not affect the results.
 
 It probably improved the accuracy of the row count estimates, 
 no? The estimate you show for cid=7191032 is off by more than 
 25% (37765 vs 50792), which seems like a lot of error for one 
 of the most common values in the table.  (I hope that was 
 with default stats target and not 500.)  That leads directly 
 to a 25% overestimate of the cost of an indexscan, while 
 having IIRC no impact on the cost of a seqscan. Since the 
 cost ratio was more than 25%, this didn't change the selected 
 plan, but you want to fix that error as best you can before 
 you move on to tweaking cost parameters.

Actually it made them worse!  Yes, this was the default statistics (10).
When I just tried it again with a value of 300, analyze, then run the
query, I get a *worse* result for an estimate.  I don't understand this.


   alter table xxx alter column cid set statistics 300;
   analyze emailrcpts;
   set random_page_cost to 2;
   explain analyze select count(*) from xxx where cid=7191032;

   Aggregate  (cost=20563.28..20563.28 rows=1 width=0) (actual time=7653.90..7653.90 rows=1 loops=1)
     ->  Index Scan using xxx_cid on xxx  (cost=0.00..20535.82 rows=10983 width=0) (actual time=72.24..7602.38 rows=37765 loops=1)
   Total runtime: 7654.14 msec

Now it estimates I have only 10,983 rows (~3x too low) instead of the
old estimate of 50,792 (~1.3x too high).  Why is that?

Anyway, a workable solution seems to be using a lower value for
Random_Page_Cost.  Thanks to everyone who replied with this answer. 

 Also it is likely appropriate to increase 
 effective_cache_size, which is awfully small in the default 
 configuration.  I'd set that to something related to your 
 available RAM before trying to home in on a suitable random_page_cost.

We have ours set to the default value of 1000, which does seem low for a
system with 1GB of RAM.  We'll up this once we figure out what's
available.  Then tweak the Random_Page_Cost appropriately at that point.
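
(For reference, a rough sizing sketch; it assumes effective_cache_size on
this release is counted in 8 KB pages and that roughly 700 MB of the 1 GB
box ends up as OS file cache.  The numbers are illustrative, not a
recommendation:)

   -- ~700 MB of the 1 GB box assumed left for OS file cache;
   -- 700 MB / 8 KB per page  =  ~90000 pages.
   -- Set in postgresql.conf, or per session for testing:
   SET effective_cache_size = 90000;
   SET random_page_cost = 2;   -- lowered from the default of 4, as in this thread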

I'd still like to understand the strangeness above, if anyone can shed
light.

- Jeremy


---(end of broadcast)---
TIP 8: explain analyze is your friend