Re: [PERFORM] Index Backward Scan fast / Index Scan slow !

2004-04-13 Thread Pailloncy Jean-Gérard
Hm, this is odd.  That says you've got 349519 live index entries in only
463 actively-used index pages, or an average of 754 per page, which
AFAICS could not fit in an 8K page.  Are you using a nondefault value of
BLCKSZ?  If so what?
Sorry, I forgot to specify that I use a BLCKSZ of 32768, the same block
size for newfs, and the same for the RAID slice's size.
I tested the drive some time ago, and found a speed win when the slice size,
the disk block size, and the read block size were all the same.

I do not think that a different BLCKSZ should cause a slowdown like the
one I found.

If you *are* using default BLCKSZ then this index must be corrupt, and
what you probably need to do is REINDEX it.  But before you do that,
could you send me a copy of the index file?
Do you want the index file now, or may I try something before?
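(For reference, the reindex Tom suggests would just be something along these
lines; the index name is a placeholder, since it is not given in the thread:)

REINDEX INDEX my_index_name;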

Cordialement,
Jean-Gérard Pailloncy


Re: [PERFORM] Index Backward Scan fast / Index Scan slow !

2004-04-13 Thread Tom Lane
Pailloncy Jean-Gérard [EMAIL PROTECTED] writes:
 Are you using a nondefault value of
 BLCKSZ?  If so what?

 Sorry, I forgot to specify I use BLCKSZ of 32768,

Okay, the numbers are sensible then.  The index density seems a bit low
(754 entries/page where the theoretical ideal would be about 1365) but
not really out-of-line.

 could you send me a copy of the index file?

 Do you want the index file now, or may I try something before?

If you're going to reindex, please do send me a copy of the file first.

regards, tom lane



Re: [PERFORM] Deleting certain duplicates

2004-04-13 Thread Rajesh Kumar Mallah
Shea,Dan [CIS] wrote:

The index is
Indexes:
   forecastelement_rwv_idx btree (region_id, wx_element, valid_time)
-Original Message-
From: Shea,Dan [CIS] [mailto:[EMAIL PROTECTED]
Sent: Monday, April 12, 2004 10:39 AM
To: Postgres Performance
Subject: [PERFORM] Deleting certain duplicates
We have a large database which recently increased dramatically in size due to a
change in our insert program allowing all entries to be inserted.
PWFPM_DEV=# select relname,relfilenode,reltuples from pg_class where relname
= 'forecastelement';
     relname     | relfilenode |  reltuples
-----------------+-------------+-------------
 forecastelement |   361747866 | 4.70567e+08

     Column     |            Type             | Modifiers
----------------+-----------------------------+-----------
 version        | character varying(99)       |
 origin         | character varying(10)       |
 timezone       | character varying(99)       |
 region_id      | character varying(20)       |
 wx_element     | character varying(99)       |
 value          | character varying(99)       |
 flag           | character(3)                |
 units          | character varying(99)       |
 valid_time     | timestamp without time zone |
 issue_time     | timestamp without time zone |
 next_forecast  | timestamp without time zone |
 reception_time | timestamp without time zone |
The program is supposed to check, using a select statement, that all fields but
reception_time are unique, and if so, insert the row.
Due to an error in a change, reception_time was included in the select that checks
for duplicates.  The reception_time is created by the program that creates the dat
file to insert.
Essentially this let all duplicate rows be inserted.
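(For illustration, the intended pre-insert check, with reception_time deliberately
left out, would be a select along these lines; the $n placeholders stand for the
values of the incoming row:)

SELECT 1
  FROM forecastelement
 WHERE version = $1 AND origin = $2 AND timezone = $3 AND region_id = $4
   AND wx_element = $5 AND value = $6 AND flag = $7 AND units = $8
   AND valid_time = $9 AND issue_time = $10 AND next_forecast = $11;
-- insert the new row only if this returns no rows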

I tried the delete query below.
PWFPM_DEV=# delete from forecastelement where oid not in (select min(oid)
from forecastelement group by version, origin, timezone, region_id,
wx_element, value, flag, units, valid_time, issue_time, next_forecast);
It ran for 3 days, creating what I assume are temporary sort files in pgsql_tmp
for the group by.
The query ended up failing with "ERROR: write failed".
Well the long weekend is over and we do not have the luxury of trying this
again. 
So I was thinking maybe of doing the deletion in chunks, perhaps based on
reception time.
 

It's more of an SQL question, though.

To deduplicate on the basis of

version, origin, timezone, region_id, wx_element, value, flag, units,
valid_time, issue_time, next_forecast

you could do this:

begin work;
create table temp_table as
  select distinct on (version, origin, timezone, region_id, wx_element,
                      value, flag, units, valid_time, issue_time, next_forecast)
         *
    from forecastelement;
truncate table forecastelement;
drop index forecastelement_rwv_idx;      -- drop the existing index(es)
insert into forecastelement select * from temp_table;
commit;
create index forecastelement_rwv_idx     -- recreate the index(es)
    on forecastelement (region_id, wx_element, valid_time);
analyze forecastelement;

Note that DISTINCT ON will keep only one row out of all the rows having the
same values in the specified columns. Kindly go through the DISTINCT ON
documentation before trying the queries.
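As an aside on that point (a hedged sketch, not part of the original suggestion):
adding an ORDER BY lets you control which row of each duplicate group survives,
for example the one with the earliest reception_time:

create table temp_table as
  select distinct on (version, origin, timezone, region_id, wx_element,
                      value, flag, units, valid_time, issue_time, next_forecast)
         *
    from forecastelement
   order by version, origin, timezone, region_id, wx_element,
            value, flag, units, valid_time, issue_time, next_forecast,
            reception_time;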

regds
mallah.
Are there any suggestions for a better way to do this, or for using multiple
queries to delete selectively, a week at a time, based on the reception_time?
I would say there are a lot of duplicate entries between mid-March and the
first week of April.
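(One hedged sketch of such a week-at-a-time pass; the dates are just an example
week, and it assumes the compared columns are never NULL, since the equality
tests would skip NULL matches. Like the earlier attempt, it keeps the lowest-oid
row of each duplicate group:)

DELETE FROM forecastelement
 WHERE reception_time >= '2004-03-15 00:00:00'
   AND reception_time <  '2004-03-22 00:00:00'
   AND EXISTS (SELECT 1
                 FROM forecastelement g     -- the (region_id, wx_element, valid_time) index can drive this probe
                WHERE g.version       = forecastelement.version
                  AND g.origin        = forecastelement.origin
                  AND g.timezone      = forecastelement.timezone
                  AND g.region_id     = forecastelement.region_id
                  AND g.wx_element    = forecastelement.wx_element
                  AND g.value         = forecastelement.value
                  AND g.flag          = forecastelement.flag
                  AND g.units         = forecastelement.units
                  AND g.valid_time    = forecastelement.valid_time
                  AND g.issue_time    = forecastelement.issue_time
                  AND g.next_forecast = forecastelement.next_forecast
                  AND g.oid < forecastelement.oid);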




Re: [PERFORM] index v. seqscan for certain values

2004-04-13 Thread Jeremy Dunn
  When I just tried it again with a value of 300, analyze, then run the
  query, I get a *worse* result for an estimate.  I don't understand this.
 
 That's annoying.  How repeatable are these results --- if you 
 do ANALYZE over again several times, how much does the row 
 count estimate change each time?  (It should change somewhat, 
 since ANALYZE is taking a random sample, but one would like 
 to think not a whole lot.)  Is the variance more or less at 
 the higher stats target?  Take a look at a few different CID 
 values to get a sense of the accuracy, don't look at just one ...

Yes, it's repeatable.  I tried a bunch of times, and there are only
small variations in the stats for the higher stat targets.

 (Actually, you might find it more profitable to look at the 
 pg_stats entry for the CID column rather than 
 reverse-engineering the stats via ANALYZE.  Look at how well 
 the most-common-values list and associated frequency numbers 
 track reality.)

I checked the accuracy of the stats for various values, and there is a
wide variation.  I see some values where the estimate is 1.75x the
actual; and others where the estimate is .44x the actual.
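(A sketch of that kind of check; the table name here is hypothetical, since the
real one isn't shown in the thread:)

SELECT most_common_vals, most_common_freqs, n_distinct
  FROM pg_stats
 WHERE tablename = 'mytable' AND attname = 'cid';

-- then compare against reality, e.g.:
SELECT cid, count(*) FROM mytable GROUP BY cid ORDER BY count(*) DESC LIMIT 10;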

 Also, can you think of any reason for the distribution of CID 
 values to be nonuniform within the table?  For instance, do 
 rows get inserted in order of increasing CID, or is there any 
 clustering of rows with the same CID?

This is almost certainly the answer.  The data is initially inserted in
chunks for each CID, and later on there is a more normal distribution of
insert/update/deletes across all CIDs; and then again a new CID will
come with a large chunk of rows, etc.

Interestingly, I tried increasing the stat size for the CID column to
2000, analyzing, and checking the accuracy of the stats again.  Even
with this relatively high value, the accuracy of the stats is not that
close.  The value that previously gave .44x now gives an estimate .77x of
actual.  Another value which was at 1.38x of actual is now at .71x of
actual!  

Then just for kicks I set the statistics size to 100,000 (!), analyzed,
and ran the query again.  For the same CID I still got an estimated row
count that is .71x the actual rows returned.  Why is this not better?  I
wonder how high I'd have to set the statistics collector to get really
good data, given the uneven data distribution of this table.  Is there
any other technique that works better to get good estimates, given
uneven distribution of values?
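(For reference, each round of the experiment above boils down to something like
this; the table and column names are placeholders:)

ALTER TABLE mytable ALTER COLUMN cid SET STATISTICS 1000;
ANALYZE mytable;
EXPLAIN SELECT * FROM mytable WHERE cid = 12345;  -- compare estimated rows vs. actual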

So I think this explains the inaccurate stats; and the solution as far
as I'm concerned is to increase the two params mentioned yesterday
(effective_cache_size and random_page_cost).
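(A per-session sketch of adjusting those two parameters; the values below are
placeholders, not recommendations:)

SET effective_cache_size = 100000;  -- measured in 8 kB disk pages
SET random_page_cost = 2;           -- default is 4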

Thanks again for the help!
- Jeremy




Re: [PERFORM] index v. seqscan for certain values

2004-04-13 Thread Tom Lane
Jeremy Dunn [EMAIL PROTECTED] writes:
 Interestingly, I tried increasing the stat size for the CID column to
 2000, analyzing, and checking the accuracy of the stats again.

There's a hard limit of 1000, I believe.  Didn't it give you a warning
saying so?

At 1000 the ANALYZE sample size would be 300,000 rows, or about a quarter
of your table.  I would have thought this would give frequency estimates
with much better precision than you seem to be seeing --- but my
statistics are rusty enough that I'm not sure about it.  Possibly the
nonuniform clumping of CID has something to do with the poor results.

Any stats majors on the list?

regards, tom lane



[PERFORM] query slows down with more accurate stats

2004-04-13 Thread Robert Treat
In the process of optimizing some queries, I have found that the following
query seems to degrade in performance the more accurate I make the
statistics on the table... whether by using increased "alter table ...
set statistics" or by using vacuum.

SELECT 
count( cl.caller_id ), 
npanxx.city, 
npanxx.state 
FROM 
cl 
LEFT OUTER JOIN npanxx 
  on substr( cl.caller_id, 1, 3 ) = npanxx.npa 
  and substr( cl.caller_id, 4, 3 ) = npanxx.nxx 
LEFT OUTER JOIN cp 
  ON cl.caller_id = cp.caller_id 
WHERE 
cl.ivr_system_id = 130 
AND 
cl.call_time > '2004-03-01 06:00:00.0 CST' 
AND 
cl.call_time < '2004-04-01 06:00:00.0 CST' 
AND 
cp.age >= 18 
AND 
cp.age <= 24 
AND 
cp.gender = 'm' 
GROUP BY 
npanxx.city, 
npanxx.state


live=# analyze cl;
ANALYZE
live=# select reltuples from pg_class where relname = 'cl';
 reltuples 
---
 53580
(1 row)

live=# select count(*) from cl;
  count  
-
 1140166
(1 row)

The plan i get under these conditions is actually pretty good...

 HashAggregate  (cost=28367.22..28367.66 rows=174 width=32) (actual time=1722.060..1722.176 rows=29 loops=1)
   ->  Nested Loop  (cost=0.00..28365.92 rows=174 width=32) (actual time=518.592..1716.254 rows=558 loops=1)
         ->  Nested Loop Left Join  (cost=0.00..20837.33 rows=1248 width=32) (actual time=509.991..1286.755 rows=13739 loops=1)
               ->  Index Scan using cl_ivr_system_id on cl  (cost=0.00..13301.15 rows=1248 width=14) (actual time=509.644..767.421 rows=13739 loops=1)
                     Index Cond: (ivr_system_id = 130)
                     Filter: ((call_time > '2004-03-01 07:00:00-05'::timestamp with time zone) AND (call_time < '2004-04-01 07:00:00-05'::timestamp with time zone))
               ->  Index Scan using npanxx_pkey on npanxx  (cost=0.00..6.02 rows=1 width=32) (actual time=0.025..0.027 rows=1 loops=13739)
                     Index Cond: ((substr(("outer".caller_id)::text, 1, 3) = (npanxx.npa)::text) AND (substr(("outer".caller_id)::text, 4, 3) = (npanxx.nxx)::text))
         ->  Index Scan using cp_pkey on cp  (cost=0.00..6.02 rows=1 width=14) (actual time=0.027..0.027 rows=0 loops=13739)
               Index Cond: (("outer".caller_id)::text = (cp.caller_id)::text)
               Filter: ((age >= 18) AND (age <= 24) AND (gender = 'm'::bpchar))
 Total runtime: 1722.489 ms
(12 rows)


but when I do 

live=# vacuum cl;
VACUUM
live=# select reltuples from pg_class where relname = 'cl';
  reltuples  
-
 1.14017e+06
(1 row)

(or alternatively increase the stats target on the table)

I now get the following plan:

 HashAggregate  (cost=80478.74..80482.41 rows=1471 width=32) (actual time=8132.261..8132.422 rows=29 loops=1)
   ->  Merge Join  (cost=79951.95..80467.70 rows=1471 width=32) (actual time=7794.338..8130.041 rows=558 loops=1)
         Merge Cond: ("outer".?column4? = "inner".?column2?)
         ->  Sort  (cost=55719.06..55745.42 rows=10546 width=32) (actual time=4031.827..4052.526 rows=13739 loops=1)
               Sort Key: (cl.caller_id)::text
               ->  Merge Right Join  (cost=45458.30..55014.35 rows=10546 width=32) (actual time=2944.441..3796.787 rows=13739 loops=1)
                     Merge Cond: ((("outer".npa)::text = "inner".?column2?) AND (("outer".nxx)::text = "inner".?column3?))
                     ->  Index Scan using npanxx_pkey on npanxx  (cost=0.00..8032.99 rows=132866 width=32) (actual time=0.200..461.767 rows=130262 loops=1)
                     ->  Sort  (cost=45458.30..45484.67 rows=10546 width=14) (actual time=2942.994..2967.935 rows=13739 loops=1)
                           Sort Key: substr((cl.caller_id)::text, 1, 3), substr((cl.caller_id)::text, 4, 3)
                           ->  Seq Scan on cl  (cost=0.00..44753.60 rows=10546 width=14) (actual time=1162.423..2619.662 rows=13739 loops=1)
                                 Filter: ((ivr_system_id = 130) AND (call_time > '2004-03-01 07:00:00-05'::timestamp with time zone) AND (call_time < '2004-04-01 07:00:00-05'::timestamp with time zone))
         ->  Sort  (cost=24232.89..24457.06 rows=89667 width=14) (actual time=3761.703..3900.340 rows=98010 loops=1)
               Sort Key: (cp.caller_id)::text
               ->  Seq Scan on cp  (cost=0.00..15979.91 rows=89667 width=14) (actual time=0.128..1772.215 rows=100302 loops=1)
                     Filter: ((age >= 18) AND (age <= 24) AND (gender = 'm'::bpchar))
 Total runtime: 8138.607 ms
(17 rows)


so I guess I am wondering if there is something I should be doing to
help get the better plan at the more accurate stats levels, and/or why it
doesn't stick with the original plan (I noticed that disabling merge joins
does seem to push it back to the original plan).
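(For reference, the merge-join test mentioned above is just a session-level
toggle, roughly:)

SET enable_mergejoin = off;
-- re-run the EXPLAIN ANALYZE above and compare the plan
SET enable_mergejoin = on;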

alternatively if anyone has any general suggestions on speeding up the
query I'd be open to that too :-) 


Robert Treat
-- 
Build A Brighter Lamp :: 

Re: [PERFORM] index v. seqscan for certain values

2004-04-13 Thread Jeremy Dunn

 There's a hard limit of 1000, I believe.  Didn't it give you
 a warning saying so?

No warning at 2000, and no warning at 100,000 either!

Remember we are still on 7.2.x.  The docs here
http://www.postgresql.org/docs/7.2/static/sql-altertable.html don't say
anything about a limit.  

This is good to know, if it's true.  Can anyone confirm?

- Jeremy




Re: [PERFORM] configure shmmax on MAC OS X

2004-04-13 Thread Jeff Bohmer

On OS X, I've always made these changes in:

/System/Library/StartupItems/SystemTuning/SystemTuning

and manually checked it with sysctl after reboot.  Works for me.

100k buffers is probably overkill.  There can be a performance penalty with too
many buffers.  See this list's archives for more.  10k would probably be a better start.
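(A quick, hedged way to confirm what the running server actually allocated,
as a plain SQL check:)

SHOW shared_buffers;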

- Jeff


Hi, all,

I have got a new Mac G5 with 8GB RAM, so I tried to increase
the shmmax in the kernel so that I can take advantage of the RAM.

I searched the web and read the manual for PG7.4 chapter 16.5.1.
After that, I edited the /etc/rc file:

sysctl -w kern.sysv.shmmax=4294967296 // byte
sysctl -w kern.sysv.shmmin=1
sysctl -w kern.sysv.shmmni=32
sysctl -w kern.sysv.shmseg=8
sysctl -w kern.sysv.shmall=1048576 //4kpage

for 4G shared RAM.

Then I changed postgresql.conf:
shared_buffers = 100000   // could be bigger?

and restarted the machine and the postgres server. To my surprise, the postgres server
wouldn't start, saying that the requested shared memory exceeds the kernel's shmmax.

My suspicion is that the change I made in /etc/rc does not take effect. Is there a way
to check it?  Is there an upper limit on how much RAM can be allocated for shared
buffers in Mac OS X?  Or is there something wrong with my calculation of the numbers?

Thanks a lot!

Qing




-- 

Jeff Bohmer
VisionLink, Inc.
_
303.402.0170
www.visionlink.org
_
People. Tools. Change. Community.



Re: [PERFORM] configure shmmax on MAC OS X

2004-04-13 Thread Qing Zhao
Tom:

I used sysctl -A to see the kernel state, and I got:
kern.sysv.shmmax: -1
It looks like the value is too big!

Thanks!

Qing
On Apr 13, 2004, at 12:55 PM, Tom Lane wrote:
Qing Zhao [EMAIL PROTECTED] writes:
My suspicion is that the change I made in /etc/rc does not take
effect. Is there a way to check it?
sysctl has an option to show the values currently in effect.

I believe that /etc/rc is the correct place to set shmmax on OSX 10.3 or
later ... but we have seen prior reports of people having trouble
getting the setting to take.  There may be some other constraint
involved.

sysctl -w kern.sysv.shmmax=4294967296 // byte
Hmm, does sysctl work for values that exceed the range of int?

There's no particularly good reason to try to set shmmax as high as you
are trying anyhow; you really don't need more than a couple hundred meg
in Postgres shared memory.  It's better to leave the kernel to manage
the bulk of your RAM.
			regards, tom lane





Re: [PERFORM] configure shmmax on MAC OS X

2004-04-13 Thread Joshua D. Drake
Hello,

I found that if your SHMALL value was less than your SHMMAX value,
the setting wouldn't take.
J

Tom Lane wrote:

Qing Zhao [EMAIL PROTECTED] writes:

My suspicion is that the change I made in /etc/rc does not take
effect. Is there a way to check it?


sysctl has an option to show the values currently in effect.

I believe that /etc/rc is the correct place to set shmmax on OSX 10.3 or
later ... but we have seen prior reports of people having trouble
getting the setting to take.  There may be some other constraint
involved.

sysctl -w kern.sysv.shmmax=4294967296 // byte


Hmm, does sysctl work for values that exceed the range of int?

There's no particularly good reason to try to set shmmax as high as you
are trying anyhow; you really don't need more than a couple hundred meg
in Postgres shared memory.  It's better to leave the kernel to manage
the bulk of your RAM.
			regards, tom lane



Re: [PERFORM] View columns calculated

2004-04-13 Thread Tom Lane
Peter Darley [EMAIL PROTECTED] writes:
   I have a question about views:  I want to have a fairly wide view (lots of
 columns) where most of the columns have some heavyish calculations in them,
 but I'm concerned that it will have to calculate every column even when I'm
 not selecting them.  So, the question is, if I have 5 columns in a view but
 only select 1 column, is the system smart enough to not calculate the unused
 columns,

It depends on what the rest of your view looks like.  If the view is
simple enough to be flattened into the parent query then the unused
columns will disappear into the ether.  If it's not flattenable then
they will get evaluated.  You can check by seeing whether an EXPLAIN
shows a separate subquery scan node corresponding to the view.
(Without bothering to look at the code, an unflattenable view is one
that uses GROUP BY, DISTINCT, aggregates, ORDER BY, LIMIT, UNION,
INTERSECT, EXCEPT, probably a couple other things.)
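(A small made-up illustration of the difference; the table, view, and column
names here are hypothetical:)

CREATE TABLE t (id int, a numeric, b numeric);

CREATE VIEW v_flat AS
    SELECT id, sqrt(a) AS c1, sqrt(b) AS c2 FROM t;            -- flattenable

CREATE VIEW v_grouped AS
    SELECT id, sum(a) AS c1, sum(b) AS c2 FROM t GROUP BY id;  -- GROUP BY blocks flattening

EXPLAIN SELECT id FROM v_flat;     -- no subquery scan node; c1/c2 are never evaluated
EXPLAIN SELECT id FROM v_grouped;  -- subquery scan node; the sums still get computed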

regards, tom lane



[PERFORM] View columns calculated

2004-04-13 Thread Peter Darley
Folks,
I have a question about views:  I want to have a fairly wide view (lots of
columns) where most of the columns have some heavyish calculations in them,
but I'm concerned that it will have to calculate every column even when I'm
not selecting them.  So, the question is, if I have 5 columns in a view but
only select 1 column, is the system smart enough to not calculate the unused
columns, or am I taking a performance hit over a smaller view that doesn't
have the extra 4 columns?
Thanks,
Peter Darley

