[PERFORM] Adding RAM: seeking advice & warnings of hidden "gotchas"

2003-12-17 Thread Nick Fankhauser
Hi-

After having done my best to squeeze better performance out of our
application by tuning within our existing resources, I'm falling back on
adding memory as a short-term solution while we get creative for a long-term
fix. I'm curious about what experiences others have had with the process of
adding big chunks of RAM. In particular, if I'm trying to encourage the OS
to cache more of my index information in RAM, what sort of configuration
should I do at both the PostgreSQL and OS level?

In a slightly off-topic vein, I'd also like to hear from anyone who knows of
gotchas at the OS level that might become a problem.

The server is a dual processor Athlon 1.2GHz box with hardware SCSI RAID. It
currently has 1 GB RAM, and we're planning to add one GB more for a total of
2GB. The OS is Debian Linux Kernel 2.4.x, and we're on PostgreSQL v7.3.2

My current memory related settings are:

SHMMAX and SHMALL set to 128MB (OS setting)
shared buffers 8192 (64MB)
sort_mem 16384 (16MB)
effective_cache_size  65536 (512MB)
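
For scale, here's the sort of adjustment I have in mind once the second GB is
in place- these are guesses pending advice rather than tested settings, with
shared_buffers and effective_cache_size in 7.3's 8KB-page units:

SHMMAX and SHMALL raised to 256MB (OS setting), e.g.
  kernel.shmmax = 268435456   (bytes)
  kernel.shmall = 65536       (4KB pages = 256MB)
shared buffers 16384 (128MB)
sort_mem 16384 (16MB, unchanged)
effective_cache_size 163840 (1.25GB)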


We support up to 70 active users, sharing a connection pool of 16
connections. Most of the queries center around 3 tables that are about 1.5
GB each.


Thanks.
   -Nick

-----
Nick Fankhauser

[EMAIL PROTECTED]  Phone 1.765.965.7363  Fax 1.765.962.9788
doxpop - Court records at your fingertips - http://www.doxpop.com/





Re: [PERFORM] Nested loop performance

2003-12-17 Thread Nick Fankhauser

> As a question, what does explain analyze give you if you
> set enable_nestloop=false; before trying the query?

Here are the results- It looks quite a bit more painful than the other plan,
although the wall time is in the same ballpark.

alpha=# explain analyze
alpha-#   select
alpha-# min(actor.actor_id) as actor_id,
alpha-# min(actor.actor_entity_type) as actor_entity_type,
alpha-# min(actor.role_class_code) as role_class_code,
alpha-# min(actor.actor_full_name) as actor_full_name,
alpha-# min(actor.actor_person_date_of_birth) as
actor_person_date_of_birth,
alpha-# min(actor.actor_entity_acronym) as actor_entity_acronym,
alpha-# min(actor.actor_person_last_name) as actor_person_last_name,
alpha-# min(actor.actor_person_first_name) as actor_person_first_name,
alpha-# min(actor.actor_person_middle_name) as actor_person_middle_name,
alpha-# min(actor.actor_person_name_suffix) as actor_person_name_suffix,
alpha-# min(actor.actor_person_place_of_birth) as
actor_person_place_of_birth,
alpha-# min(actor.actor_person_height) as actor_person_height,
alpha-# min(actor.actor_person_height_unit) as actor_person_height_unit,
alpha-# min(actor.actor_person_weight) as actor_person_weight,
alpha-# min(actor.actor_person_weight_unit) as actor_person_weight_unit,
alpha-# min(actor.actor_person_ethnicity) as actor_person_ethnicity,
alpha-# min(actor.actor_person_citizenship_count) as
actor_person_citizenship_count,
alpha-# min(actor.actor_person_hair_color) as actor_person_hair_color,
alpha-# min(actor.actor_person_scars_marks_tatto) as
actor_person_scars_marks_tatto,
alpha-# min(actor.actor_person_marital_status) as
actor_person_marital_status,
alpha-# min(actor.actor_alias_for_actor_id) as actor_alias_for_actor_id,
alpha-# min(to_char(data_source.source_last_update, 'MM/DD/ HH12:MI
AM TZ')) as last_update,
alpha-# min(actor_summary.single_case_public_id) as case_public_id,
alpha-# min(actor_summary.single_case_id) as case_id,
alpha-# sum(actor_summary.case_count)as case_count
alpha-#   from
alpha-# actor,
alpha-# actor_summary,
alpha-# data_source
alpha-#   where
alpha-# actor.actor_id = actor_summary.actor_id
alpha-# and data_source.source_id = actor.source_id
alpha-# and actor.actor_full_name_uppercase like upper('sanders%')
alpha-#   group by
alpha-# actor.actor_id
alpha-#   order by
alpha-# min(actor.actor_full_name_uppercase),
alpha-# case_count desc,
alpha-# min(actor_summary.case_disp_global_code)
alpha-#   limit
alpha-# 1000;

                                  QUERY PLAN
--------------------------------------------------------------------------
 Limit  (cost=168919.98..168920.03 rows=20 width=548) (actual
time=91247.95..91249.05 rows=1000 loops=1)
   ->  Sort  (cost=168919.98..168920.03 rows=20 width=548) (actual
time=91247.95..91248.35 rows=1001 loops=1)
 Sort Key: min((actor.actor_full_name_uppercase)::text),
sum(actor_summary.case_count),
min((actor_summary.case_disp_global_code)::text)
 ->  Aggregate  (cost=168904.95..168919.54 rows=20 width=548)
(actual time=91015.00..91164.68 rows=3590 loops=1)
   ->  Group  (cost=168904.95..168905.95 rows=201 width=548)
(actual time=90999.87..91043.25 rows=3594 loops=1)
 ->  Sort  (cost=168904.95..168905.45 rows=201
width=548) (actual time=90999.83..91001.57 rows=3594 loops=1)
   Sort Key: actor.actor_id
   ->  Hash Join  (cost=903.08..168897.24 rows=201
width=548) (actual time=25470.63..90983.45 rows=3594 loops=1)
 Hash Cond: ("outer".actor_id =
"inner".actor_id)
 ->  Seq Scan on actor_summary
(cost=0.00..150715.43 rows=3455243 width=73) (actual time=8.03..52902.24
rows=3455243 loops=1)
 ->  Hash  (cost=902.57..902.57 rows=204
width=475) (actual time=25459.92..25459.92 rows=0 loops=1)
   ->  Hash Join  (cost=1.14..902.57
rows=204 width=475) (actual time=155.92..25451.25 rows=3639 loops=1)
 Hash Cond: ("outer".source_id =
"inner".source_id)
 ->  Index Scan using
actor_full_name_uppercase on actor  (cost=0.00..897.20 rows=223 width=463)
(actual time=144.93..25404.
10 rows=3639 loops=1)
   Index Cond:
((actor_full_name_uppercase >= 'SANDERS'::character varying) AND
(actor_full_name_uppercase < 'SANDERT'::
character varying))
   Filter:
(actor_full_name_uppercase ~~ 'SANDERS%'::text)
 ->  Hash  (cost=1.11..1.11
rows=11 width=12) (actual time=10.66..10.66 rows=0 loops=1)
  

Re: [PERFORM] Nested loop performance

2003-12-17 Thread Nick Fankhauser
> It seems that your basic problem is that you're fetching lots of rows
> from two big ol' tables.

> It doesn't seem to me that there would be a substantially better plan
> for this query with your tables as they stand.

That's more or less the conclusion I had come to. I was just hoping someone
else could point out an approach I've been missing. (sigh!)



> If your data were more
> normalised, then your big scans might be quicker (because their rows
> would be smaller so they would hit fewer disk pages),

This started off as a 5-table join on well-normalized data. Unfortunately,
the actor table doesn't get any smaller, and the work involved in
calculating the "case_count" information on the fly was clearly becoming a
problem- particularly with actors that had a heavy caseload. (Busy attorneys
and judges.) The actor_summary approach makes these previous problem cases
go away, but the payback is that (as you correctly pointed out) queries on
average citizens who only have one case suffer from the de-normalized
approach.

We're currently considering the approach of just returning all of the rows
to our application, and doing the aggregation and limit work in the app. The
inconsistency of the data makes it very tough for the query planner to come
up with a strategy that is always a winner.
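
In rough terms, the query would collapse to something like this sketch (same
tables and columns as the plans above; the app would then do the grouping,
summing and limiting itself):

  select
    actor.*,
    data_source.source_last_update,
    actor_summary.single_case_public_id,
    actor_summary.single_case_id,
    actor_summary.case_count
  from
    actor,
    actor_summary,
    data_source
  where
    actor.actor_id = actor_summary.actor_id
    and data_source.source_id = actor.source_id
    and actor.actor_full_name_uppercase like upper('sanders%');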

Thanks for your thoughts!

-Nick





Re: [PERFORM] Nested loop question

2003-12-17 Thread Nick Fankhauser

> The fact that it's taking you 9ms to do each index lookup
> suggests to me that
> it's going to disk each time. Does that sound plausible, or do
> you think you
> have enough RAM to cache your large indexes?

I'm sure we don't have enough RAM to cache all of our large indexes, so your
supposition makes sense. We have 1GB on this machine. In responding to the
performance problems we're having, one of the questions has been adding
memory vs crafting "helper" tables to speed things up. The issue is that
this database needs to be able to scale easily to about 10 times the size,
so although we could easily triple the memory at reasonable expense, we'd
eventually hit a wall.

Is there any solid method to ensure that a particular index always resides
in memory? A hybrid approach that might scale reliably would be to bump up
our memory and then make sure key indexes are cached. However, I'm concerned
that if we didn't have a way to ensure that the indexes we choose remain
cached, we would have very inconsistent response times.

Thanks for your ideas!

-Nick






[PERFORM] Nested loop question

2003-12-16 Thread Nick Fankhauser - Doxpop
7) (actual
time=48604.04..48647.91 rows=3594 loops=1)
 ->  Sort  (cost=2555.50..2555.50 rows=1 width=547)
(actual time=48604.01..48605.70 rows=3594 loops=1)
   Sort Key: actor.actor_id
   ->  Nested Loop  (cost=1.14..2555.49 rows=1
width=547) (actual time=69.09..48585.83 rows=3594 loops=1)
 ->  Hash Join  (cost=1.14..900.39 rows=204
width=475) (actual time=46.92..15259.02 rows=3639 loops=1)
   Hash Cond: ("outer".source_id =
"inner".source_id)
   ->  Index Scan using
actor_full_name_uppercase on actor  (cost=0.00..895.04 rows=222 width=463)
(actual time=46.54..15220.77 rows=3639 loops=1)
 Index Cond:
((actor_full_name_uppercase >= 'SANDERS'::character varying) AND
(actor_full_name_uppercase < 'SANDERT'::character varying))
 Filter:
(actor_full_name_uppercase ~~ 'SANDERS%'::text)
   ->  Hash  (cost=1.11..1.11 rows=11
width=12) (actual time=0.05..0.05 rows=0 loops=1)
 ->  Seq Scan on data_source
(cost=0.00..1.11 rows=11 width=12) (actual time=0.02..0.04 rows=11 loops=1)
 ->  Index Scan using actor_summary_pk on
actor_summary  (cost=0.00..8.11 rows=1 width=72) (actual time=9.14..9.15
rows=1 loops=3639)
   Index Cond: ("outer".actor_id =
actor_summary.actor_id)
 Total runtime: 48851.85 msec
(18 rows)


-
Nick Fankhauser

[EMAIL PROTECTED]  Phone 1.765.965.7363  Fax 1.765.962.9788
doxpop - Court records at your fingertips - http://www.doxpop.com/





[PERFORM] Nested loop performance

2003-12-16 Thread Nick Fankhauser
7) (actual
time=48604.04..48647.91 rows=3594 loops=1)
 ->  Sort  (cost=2555.50..2555.50 rows=1 width=547)
(actual time=48604.01..48605.70 rows=3594 loops=1)
   Sort Key: actor.actor_id
   ->  Nested Loop  (cost=1.14..2555.49 rows=1
width=547) (actual time=69.09..48585.83 rows=3594 loops=1)
 ->  Hash Join  (cost=1.14..900.39 rows=204
width=475) (actual time=46.92..15259.02 rows=3639 loops=1)
   Hash Cond: ("outer".source_id =
"inner".source_id)
   ->  Index Scan using
actor_full_name_uppercase on actor  (cost=0.00..895.04 rows=222 width=463)
(actual time=46.54..15220.77 rows=3639 loops=1)
 Index Cond:
((actor_full_name_uppercase >= 'SANDERS'::character varying) AND
(actor_full_name_uppercase < 'SANDERT'::character varying))
 Filter:
(actor_full_name_uppercase ~~ 'SANDERS%'::text)
   ->  Hash  (cost=1.11..1.11 rows=11
width=12) (actual time=0.05..0.05 rows=0 loops=1)
 ->  Seq Scan on data_source
(cost=0.00..1.11 rows=11 width=12) (actual time=0.02..0.04 rows=11 loops=1)
 ->  Index Scan using actor_summary_pk on
actor_summary  (cost=0.00..8.11 rows=1 width=72) (actual time=9.14..9.15
rows=1 loops=3639)
   Index Cond: ("outer".actor_id =
actor_summary.actor_id)
 Total runtime: 48851.85 msec
(18 rows)


-
Nick Fankhauser

[EMAIL PROTECTED]  Phone 1.765.965.7363  Fax 1.765.962.9788
doxpop - Court records at your fingertips - http://www.doxpop.com/





Re: [PERFORM] n_distinct way off, but following a pattern.

2003-11-16 Thread Nick Fankhauser


> It would be interesting to see exactly what inputs are going into this
> equation.  Do you feel like adding some debug printouts into this code?
> Or just looking at the variables with a debugger?  In 7.3 it's about
> line 1060 in src/backend/commands/analyze.c.

Tom-

I don't really have time to follow up at this moment, but I think this would
be interesting to look into, so I'll plan to dig into it over the
Thanksgiving Holiday when I'll have a little time free to follow up on some
fun projects. Your pointers should let me get into it pretty quickly.

In the meantime, I'll just set up a cron job that runs behind my nightly
analyze to put the correct numbers into pg_statistic on the tables that this
affects.
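
(Roughly along these lines- just a sketch, since poking values into
pg_statistic by hand is unsupported and needs superuser access; 3340724 is
the actual distinct count of actor_id from the session in my original post:)

  update pg_statistic
     set stadistinct = 3340724
   where starelid = 'actor_case_assignment'::regclass
     and staattnum = (select attnum from pg_attribute
                       where attrelid = 'actor_case_assignment'::regclass
                         and attname = 'actor_id');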

Thanks-
-Nick






[PERFORM] n_distinct way off, but following a pattern.

2003-11-14 Thread Nick Fankhauser
Hi-

I'm seeing estimates for n_distinct that are way off for a large table
(8,700,000 rows). They get better by setting the stats target higher, but
are still off by a factor of 10 with the stats set to 1000. I've noticed and
reported a similar pattern before on another table. Because this follows the
same very consistent pattern, I figured it was worth reporting again. This
looks more like a bug than randomness. If the poor result was simply due to
having a small sample to work from, the estimates should be all over the
map, but these are consistently low, and vary in almost exact inverse
proportion to the stats target:

                                         run 1:  run 2:  run 3:
n_distinct estimate, statistics = 10:    3168    3187    3212
n_distinct estimate, statistics = 100:   23828   24059   23615
n_distinct estimate, statistics = 1000:  194690  194516  194081
Actual distinct values:                  3340724

Or to put it another way, if you were to take the estimate from analyze,
divide it by the stats target and multiply by 10,000, the result would be
pretty close to exact. (Within a factor of 2, which ought to be plenty close
for planning purposes.)
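
Worked out with the run 1 numbers above: 3168 / 10 * 10,000 = 3,168,000;
23828 / 100 * 10,000 = 2,382,800; 194690 / 1000 * 10,000 = 1,946,900 -
versus the actual 3,340,724.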

I'm running version 7.3.2

Any thoughts from folks familiar with this part of the source code?

Regards,
 -Nick

PS-
Here's a log of the session that I got this from.

alpha=# select count(distinct actor_id) from actor_case_assignment;
-[ RECORD 1 ]--
count | 3340724
alpha=# analyze;
ANALYZE
alpha=# SELECT * FROM pg_stats
alpha-#  WHERE tablename='actor_case_assignment' AND attname='actor_id';
-[ RECORD 1 ]-+---------------------------------------------------------------
schemaname| public
tablename | actor_case_assignment
attname   | actor_id
null_frac | 0
avg_width | 16
n_distinct| 3168
most_common_vals  |
{18105XS,18115XS,18106XS,18113JD02,18115JD02,18106J27,18113XS,18113A10656,18
115LST,18108XS}
most_common_freqs |
{0.0206667,0.0206667,0.0196667,0.019,0.0176667,0.017,0.016,0.015,0.0
14,0.0136667}
histogram_bounds  |
{18067A000-07P,18067PD397SC1574-1,18105LBPD,18106A2119-49,18106PD399IF845-1,
18108A03068-20,18108LECS207,18108PTW03737278-2,18111A19788-77,18115A50420,18
115XC}
correlation   | 0.876795
alpha=#
alpha=# alter table actor_case_assignment alter column actor_id set
statistics 100;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# SELECT * FROM pg_stats
alpha-#  WHERE tablename='actor_case_assignment' AND attname='actor_id';
-[ RECORD 1 ]

schemaname| public
tablename | actor_case_assignment
attname   | actor_id
null_frac | 0
avg_width | 17
n_distinct| 23828
most_common_vals  | {18115XS,18113JD02,18106XS,1

alpha=# alter table actor_case_assignment alter column actor_id set
statistics 1000;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# SELECT * FROM pg_stats
alpha-#  WHERE tablename='actor_case_assignment' AND attname='actor_id';
-[ RECORD 1 ]-

schemaname| public
tablename | actor_case_assignment
attname   | actor_id
null_frac | 0
avg_width | 16
n_distinct| 194690
most_common_vals  | {18106XS,18115XS,18115...


alpha=# \x
Expanded display is off.
alpha=# alter table actor_case_assignment alter column actor_id set
statistics 10;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# select n_distinct from pg_stats where
tablename='actor_case_assignment and attname='actor_id';
alpha'# ';
ERROR:  parser: parse error at or near "actor_id" at character 85
alpha=# select n_distinct from pg_stats where
tablename='actor_case_assignment' and attname='actor_id';
 n_distinct

   3187
(1 row)

alpha=# alter table actor_case_assignment alter column actor_id set
statistics 10;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# select n_distinct from pg_stats where
tablename='actor_case_assignment' and attname='actor_id';
 n_distinct

   3212
(1 row)

alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# alter table actor_case_assignment alter column actor_id set
statistics 100;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# select n_distinct from pg_stats where
tablename='actor_case_assignment' and attname='actor_id';
 n_distinct

  24059
(1 row)

alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# alter table actor_case_assignment alter column actor_id set
statistics 100;
ALTER TABLE
alpha=# select n_distinct from pg_stats where
tablename='actor_case_assignment' and attname='actor_id';
 n_distinct

  23615
(1 row)

alpha=# alter table actor_case_assignment alter column actor_id set
statistics 1000;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# select n_distinct from pg_stats where
tablename='actor_cas

Re: [PERFORM] Seeking help with a query that takes too long

2003-11-14 Thread Nick Fankhauser

> Does actor_case_assignment contain more columns than just the two ids?
> If yes, do these additional fields account for ca. 70 bytes per tuple?
> If not, try
>   VACUUM FULL ANALYSE actor_case_assignment;

actor_case_assignment has its own primary key and a "role" field in addition
to the ids you've seen, so 70 bytes sounds reasonable. (The PK is to allow a
remote mirroring application to update these records- otherwise it would be
unnecessary.)



> 7ms per
> tuple returned looks like a lot of disk seeks are involved.  Is
> clustering actor on actor_full_name_uppercase an option or would this
> slow down other queries?

Good question... I've never used clustering in PostgreSQL before, so I'm
unsure. I presume this is like clustering in Oracle where the table is
ordered to match the index? If so, I think you may be onto something because
the only other field we regularly query on is the actor_id. Actor_id has a
unique index with no clustering currently, so I don't think I'd lose a thing
by clustering on actor_full_name_uppercase.
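
If I have the 7.3 syntax right, the experiment would be something like this
(CLUSTER rewrites the table and takes an exclusive lock, so off-hours):

  cluster actor_full_name_uppercase on actor;
  analyze actor;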

I'll give this a try & let you know how it changes.

BTW, you are correct that caching has a big effect on the actual time
figures in this case- I'm working on my development DB, so cached info
doesn't get trampled as quickly by other users. Is there a way to flush out
the cache in a testing situation like this in order to start from a
consistent base?


Thanks!
-Nick





[PERFORM] Seeking help with a query that takes too long

2003-11-12 Thread Nick Fankhauser - Doxpop
Hi-

I have a query that takes too long. I haven't been able to come up with any
ideas for speeding it up, so I'm seeking some input from the list.

I'm using version 7.3.2

I have three tables:

case_data (1,947,386 rows)
actor (3,385,669 rows)
actor_case_assignment (8,668,650 rows)

As the names imply, actor_case_assignment contains records that assign an
actor to a case. Actors such as attorneys or judges may have many cases,
while the average actor (we hope) only has one.

What I'm trying to do is link these tables to get back a single row per
actor that shows the actor's name, the number of cases that actor is
assigned to, and if they only have one case, I want the public_id for that
case. This means I have to do a group by to get the case count, but I'm then
forced to use an aggregate function like max on the other fields.

All of the fields ending in "_id" have unique indexes, and
actor_full_name_uppercase is indexed.

Here's the select:

  select
actor.actor_id,
max(actor.actor_full_name),
max(case_data.case_public_id),
max(case_data.case_id),
count(case_data.case_id) as case_count
  from
actor,
actor_case_assignment,
case_data
  where
actor.actor_full_name_uppercase like upper('sanders%')
and actor.actor_id = actor_case_assignment.actor_id
and case_data.case_id = actor_case_assignment.case_id
  group by
actor.actor_id
  order by
max(actor.actor_full_name),
case_count desc
  limit
1000;


Here's the explain analyze:


                                  QUERY PLAN
--------------------------------------------------------------------------
 Limit  (cost=2214.71..2214.72 rows=1 width=115) (actual
time=120034.61..120035.67 rows=1000 loops=1)
   ->  Sort  (cost=2214.71..2214.72 rows=1 width=115) (actual
time=120034.60..120034.98 rows=1001 loops=1)
 Sort Key: max((actor.actor_full_name)::text),
count(case_data.case_id)
 ->  Aggregate  (cost=2214.67..2214.70 rows=1 width=115) (actual
time=119962.80..120011.49 rows=3456 loops=1)
   ->  Group  (cost=2214.67..2214.68 rows=2 width=115) (actual
time=119962.76..119987.04 rows=5879 loops=1)
 ->  Sort  (cost=2214.67..2214.68 rows=2 width=115)
(actual time=119962.74..119965.09 rows=5879 loops=1)
   Sort Key: actor.actor_id
   ->  Nested Loop  (cost=0.00..2214.66 rows=2
width=115) (actual time=59.05..119929.71 rows=5879 loops=1)
 ->  Nested Loop  (cost=0.00..2205.26 rows=3
width=76) (actual time=51.46..66089.04 rows=5882 loops=1)
   ->  Index Scan using
actor_full_name_uppercase on actor  (cost=0.00..6.01 rows=1 width=42)
(actual time=37.62..677.44 rows=3501 loops=1)
 Index Cond:
((actor_full_name_uppercase >= 'SANDERS'::character varying) AND
(actor_full_name_uppercase < 'SANDERT'::character varying))
 Filter:
(actor_full_name_uppercase ~~ 'SANDERS%'::text)
   ->  Index Scan using
actor_case_assignment_actor_id on actor_case_assignment  (cost=0.00..2165.93
rows=2666 width=34) (actual time=16.37..18.67 rows=2 loops=3501)
 Index Cond: ("outer".actor_id =
actor_case_assignment.actor_id)
 ->  Index Scan using case_data_case_id on
case_data  (cost=0.00..3.66 rows=1 width=39) (actual time=9.14..9.15 rows=1
loops=5882)
   Index Cond: (case_data.case_id =
"outer".case_id)
 Total runtime: 120038.60 msec
(17 rows)


Any ideas?

Thanks!
 -Nick


-
Nick Fankhauser

[EMAIL PROTECTED]  Phone 1.765.965.7363  Fax 1.765.962.9788
doxpop - Court records at your fingertips - http://www.doxpop.com/





Re: [PERFORM] Seeking help with a query that takes too long

2003-11-12 Thread Nick Fankhauser

> You might have to resort to brute force, like "set enable_nestloop=false".
> Just out of curiosity, what do you get if you do that?

I get a different plan, but similar execution time:


 Limit  (cost=323437.13..323437.13 rows=1 width=115) (actual
time=170921.89..170922.95 rows=1000 loops=1)
   ->  Sort  (cost=323437.13..323437.13 rows=1 width=115) (actual
time=170921.89..170922.26 rows=1001 loops=1)
 Sort Key: max((actor.actor_full_name)::text),
count(case_data.case_id)
 ->  Aggregate  (cost=323437.08..323437.12 rows=1 width=115) (actual
time=170849.94..170898.06 rows=3457 loops=1)
   ->  Group  (cost=323437.08..323437.09 rows=3 width=115)
(actual time=170849.90..170873.60 rows=5880 loops=1)
 ->  Sort  (cost=323437.08..323437.08 rows=3 width=115)
(actual time=170847.97..170850.21 rows=5880 loops=1)
   Sort Key: actor.actor_id
   ->  Hash Join  (cost=25.29..323437.06 rows=3
width=115) (actual time=122873.80..170814.27 rows=5880 loops=1)
 Hash Cond: ("outer".case_id =
"inner".case_id)
 ->  Seq Scan on case_data
(cost=0.00..60368.16 rows=1947116 width=39) (actual time=12.95..43542.25
rows=1947377 loops=1)
 ->  Hash  (cost=25.28..25.28 rows=3
width=76) (actual time=122844.40..122844.40 rows=0 loops=1)
   ->  Hash Join  (cost=6.02..25.28
rows=3 width=76) (actual time=24992.70..122810.32 rows=5883 loops=1)
 Hash Cond: ("outer".actor_id =
"inner".actor_id)
 ->  Seq Scan on
actor_case_assignment  (cost=0.00..209980.49 rows=8669349 width=34) (actual
time=9.13..85504.05 rows=8670467 loops=1)
 ->  Hash  (cost=6.01..6.01
rows=1 width=42) (actual time=24926.56..24926.56 rows=0 loops=1)
   ->  Index Scan using
actor_full_name_uppercase on actor  (cost=0.00..6.01 rows=1 width=42)
(actual time=51.67..24900.53 rows=3502 loops=1)
 Index Cond:
((actor_full_name_uppercase >= 'SANDERS'::character varying) AND
(actor_full_name_uppercase < 'SANDERT'::character varying))
 Filter:
(actor_full_name_uppercase ~~ 'SANDERS%'::text)
 Total runtime: 170925.93 msec
(19 rows)


-Nick





Re: [PERFORM] Seeking help with a query that takes too long

2003-11-12 Thread Nick Fankhauser

> It looks like you are running with the default statistics target (10).
> Try boosting it to 100 or even more for this column (see ALTER TABLE
> SET STATISTICS, then re-ANALYZE) and see if the estimate gets better.


Here are the results & a few more clues:

prod1=# alter table actor alter column actor_full_name_uppercase set
statistics 1000;
ALTER TABLE
prod1=# analyze actor;
ANALYZE
prod1=# select count(distinct actor_full_name_uppercase) from actor;
  count
-
 1453371
(1 row)

prod1=# select count(actor_id) from actor;
  count
-
 3386359
(1 row)

This indicates to me that 1 isn't too shabby as an estimate if the whole
name is specified, but I'm not sure how this estimate gets adjusted in the
case of a "LIKE".


prod1=# \x
Expanded display is on.
prod1=# SELECT * FROM pg_stats
prod1-# WHERE tablename='actor' AND attname='actor_full_name_uppercase';



schemaname| public
tablename | actor
attname   | actor_full_name_uppercase
null_frac | 0.000586667
avg_width | 21
n_distinct| -0.14701



correlation   | -0.00211291


Question: What does it mean when n_distinct is negative?

New results of explain analyze:



                                  QUERY PLAN
--------------------------------------------------------------------------
 Limit  (cost=252683.61..252683.68 rows=28 width=116) (actual
time=169377.32..169378.39 rows=1000 loops=1)
   ->  Sort  (cost=252683.61..252683.68 rows=29 width=116) (actual
time=169377.31..169377.69 rows=1001 loops=1)
 Sort Key: max((actor.actor_full_name)::text),
count(case_data.case_id)
 ->  Aggregate  (cost=252678.57..252682.91 rows=29 width=116)
(actual time=169305.79..169354.50 rows=3456 loops=1)
   ->  Group  (cost=252678.57..252680.01 rows=289 width=116)
(actual time=169305.76..169330.00 rows=5879 loops=1)
 ->  Sort  (cost=252678.57..252679.29 rows=289
width=116) (actual time=169305.75..169308.15 rows=5879 loops=1)
   Sort Key: actor.actor_id
   ->  Nested Loop  (cost=0.00..252666.74 rows=289
width=116) (actual time=89.27..169273.51 rows=5879 loops=1)
 ->  Nested Loop  (cost=0.00..251608.11
rows=289 width=77) (actual time=57.73..92753.49 rows=5882 loops=1)
   ->  Index Scan using
actor_full_name_uppercase on actor  (cost=0.00..456.88 rows=113 width=42)
(actual time=32.80..3197.28 rows=3501 loops=1)
 Index Cond:
((actor_full_name_uppercase >= 'SANDERS'::character varying) AND
(actor_full_name_uppercase < 'SANDERT'::character varying))
 Filter:
(actor_full_name_uppercase ~~ 'SANDERS%'::text)
   ->  Index Scan using
actor_case_assignment_actor_id on actor_case_assignment  (cost=0.00..2181.29
rows=2616 width=35) (actual time=22.26..25.57 rows=2 loops=3501)
 Index Cond: ("outer".actor_id =
actor_case_assignment.actor_id)
 ->  Index Scan using case_data_case_id on
case_data  (cost=0.00..3.65 rows=1 width=39) (actual time=13.00..13.00
rows=1 loops=5882)
   Index Cond: (case_data.case_id =
"outer".case_id)
 Total runtime: 169381.38 msec
(17 rows)





Re: [PERFORM] Seeking help with a query that takes too long

2003-11-12 Thread Nick Fankhauser

> >actor_full_name_uppercase on actor  (cost=0.00..6.01 rows=1 width=42)
>   ^^
> >(actual time=37.62..677.44 rows=3501 loops=1)
> ^
> Nick, can you find out why this row count estimation is so far off?
^

Oops- I read this backward- I see what you mean now. That's a good question.
I'm not sure what part of the stats this estimate might be pulled from. The
average is 357, but the most common frequency may be around 1.

-Nick





Re: [PERFORM] Seeking help with a query that takes too long

2003-11-12 Thread Nick Fankhauser


>(actual time=37.62..677.44 rows=3501 loops=1)
^

> Nick, can you find out why this row count estimation is so far off?

It's actually correct:

prod1=# select count(actor_id) from actor where actor_full_name_uppercase
like 'SANDERS%';
 count
---
  3501
(1 row)

Of course, I merely chose "SANDERS" arbitrarily as a name that falls
somewhere near the middle of the frequency range for names. SMITH or JONES
would represent a worst-case, and something like KOIZAR would probably be
unique.


Here are the stats:

prod1=# SELECT * FROM pg_stats
prod1-# WHERE tablename='actor' AND attname='actor_full_name_uppercase';
-[ RECORD 1 ]-+---------------------------------------------------------------
schemaname| public
tablename | actor
attname   | actor_full_name_uppercase
null_frac | 0.00033
avg_width | 21
n_distinct| 24215
most_common_vals  | {"STATE OF INDIANA","INDIANA DEPARTMENT OF
REVENUE","BARTH CONS SCHOOL CORP","HOWARD COUNTY CLERK","ADVANCED RECOVERY
SERVICES","STATE OF INDIANA-DEPT OF REVENUE","ALLIED COLLECTION SERVICE
INC","CREDIT BUREAU OF LAPORTE","MIDWEST COLLECTION SVC INC","NCO FINANCIAL
SYSTEMS INC"}
most_common_freqs |
{0.015,0.014,0.0043,0.0043,0.004,0.0037,0.0033,0.003
3,0.0027,0.0027}
histogram_bounds  | {"(POE) ESTELLE, DENISE","BRIEN, LIISI","COTTRELL,
CAROL","FAMILY RENTALS","HAYNES, TAMIKA","KESSLER, VICTORIA","MEFFORD,
VERNON L","PHILLIPS, GERALD L","SHELTON, ANTOINETTE","TRICARICO, MELISSA
SUE","ZUEHLKE, THOMAS L"}
correlation   | -0.00147395


I think this means that the average is 357 per actor. As you can see, the
range of assignments varies from people with a single parking ticket to
"State of Indiana", which is party to many thousands of cases.


> BTW, there seem to be missing cases:
> >  ->  Nested Loop  (cost=0.00..2214.66 rows=2 width=115)
> >   (actual time=59.05..119929.71 rows=5879 loops=1)
>
> >->  Nested Loop  (cost=0.00..2205.26 rows=3 width=76)
> > (actual time=51.46..66089.04 rows=5882 loops=1)

This is expected- We actually aggregate data from many county court
databases, with varying levels of data "cleanliness".

Regards,
-Nick






[PERFORM] Seeking help with a query that takes too long

2003-11-12 Thread Nick Fankhauser
[I originally posted this using the wrong E-Mail account, so a double
posting may occur if the first message gets released by the moderator later-
sorry!]

Hi-

I have a query that I'm trying to speed up. I haven't been able to come up
with any workable ideas for speeding it up, so I'm seeking some input from
the list.

I'm using version 7.3.2

I have three tables:

case_data (1,947,386 rows)
actor (3,385,669 rows)
actor_case_assignment (8,668,650 rows)

As the names imply, actor_case_assignment contains records that assign an
actor to a case. Actors such as attorneys or judges may have many cases,
while the average actor (we hope) only has one.

What I'm trying to do is link these tables to get back a single row per
actor that shows the actor's name, the number of cases that actor is
assigned to, and if they only have one case, I want the public_id for that
case. This means I have to do a group by to get the case count, but I'm then
forced to use an aggregate function like max on the other fields.

All of the fields ending in "_id" have unique indexes, and
actor_full_name_uppercase is indexed. An analyze is done every night & the
database is fairly stable in its composition.

Here's the select:

  select
actor.actor_id,
max(actor.actor_full_name),
max(case_data.case_public_id),
max(case_data.case_id),
count(case_data.case_id) as case_count
  from
actor,
actor_case_assignment,
case_data
  where
actor.actor_full_name_uppercase like upper('sanders%')
and actor.actor_id = actor_case_assignment.actor_id
and case_data.case_id = actor_case_assignment.case_id
  group by
actor.actor_id
  order by
max(actor.actor_full_name),
case_count desc
  limit
1000;


Here's the explain analyze:


                                  QUERY PLAN
--------------------------------------------------------------------------
 Limit  (cost=2214.71..2214.72 rows=1 width=115) (actual
time=120034.61..120035.67 rows=1000 loops=1)
   ->  Sort  (cost=2214.71..2214.72 rows=1 width=115) (actual
time=120034.60..120034.98 rows=1001 loops=1)
 Sort Key: max((actor.actor_full_name)::text),
count(case_data.case_id)
 ->  Aggregate  (cost=2214.67..2214.70 rows=1 width=115) (actual
time=119962.80..120011.49 rows=3456 loops=1)
   ->  Group  (cost=2214.67..2214.68 rows=2 width=115) (actual
time=119962.76..119987.04 rows=5879 loops=1)
 ->  Sort  (cost=2214.67..2214.68 rows=2 width=115)
(actual time=119962.74..119965.09 rows=5879 loops=1)
   Sort Key: actor.actor_id
   ->  Nested Loop  (cost=0.00..2214.66 rows=2
width=115) (actual time=59.05..119929.71 rows=5879 loops=1)
 ->  Nested Loop  (cost=0.00..2205.26 rows=3
width=76) (actual time=51.46..66089.04 rows=5882 loops=1)
   ->  Index Scan using
actor_full_name_uppercase on actor  (cost=0.00..6.01 rows=1 width=42)
(actual time=37.62..677.44 rows=3501 loops=1)
 Index Cond:
((actor_full_name_uppercase >= 'SANDERS'::character varying) AND
(actor_full_name_uppercase < 'SANDERT'::character varying))
 Filter:
(actor_full_name_uppercase ~~ 'SANDERS%'::text)
   ->  Index Scan using
actor_case_assignment_actor_id on actor_case_assignment  (cost=0.00..2165.93
rows=2666 width=34) (actual time=16.37..18.67 rows=2 loops=3501)
 Index Cond: ("outer".actor_id =
actor_case_assignment.actor_id)
 ->  Index Scan using case_data_case_id on
case_data  (cost=0.00..3.66 rows=1 width=39) (actual time=9.14..9.15 rows=1
loops=5882)
   Index Cond: (case_data.case_id =
"outer".case_id)
 Total runtime: 120038.60 msec
(17 rows)


Any ideas?

Thanks!
 -Nick


-
Nick Fankhauser

[EMAIL PROTECTED]  Phone 1.765.965.7363  Fax 1.765.962.9788
doxpop - Court records at your fingertips - http://www.doxpop.com/





Re: [pgsql-advocacy] [PERFORM] OFFTOPIC: PostgreSQL vs MySQL

2003-10-09 Thread Nick Fankhauser
> Have you checked these pages?  They've been posted on this list numerous
> times:
> http://techdocs.postgresql.org
> http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html
> http://www.varlena.com/varlena/GeneralBits/Tidbits/annotated_conf_e.html
>

Josh- It would be great to have a link to those last two excellent resources
from the techdocs area- perhaps from the "optimizing" section in
http://techdocs.postgresql.org/oresources.php. Who should we suggest this
to? (I submitted these using the form in that area, but you may have better
connections.)

-Nick





Re: [PERFORM] How to make n_distinct more accurate.

2003-09-24 Thread Nick Fankhauser

The performance list seemed to be off-line for a while, so I posed the same
question on the admin list and Tom Lane has been helping in that forum.

-Nick

> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf Of Nick
> Fankhauser
> Sent: Monday, September 22, 2003 3:42 PM
> To: [EMAIL PROTECTED] Org
> Subject: [PERFORM] How to make n_distinct more accurate.
>
>
> Hi-
>
> I have a table- called "event" with a field event_date_time that
> is indexed.
> There are 1,700,000 rows in the table and 92,000 distinct values of
> event_date_time with anywhere from 1 to 400 rows sharing the same
> value. (I
> did a count grouped by event_date_time & scanned it to get this info.)
>
> When I look at the pg_stats on this table, I always see 15,000 or lower in
> the n_distinct column for event_date_time. (I re-ran analyze
> several times &
> then checked pg_stats to see if the numbers varied significantly.)
>
> Since this is off by about a factor of 6, I think the planner is
> missing the
> chance to use this table as the "driver" in a complex query plan that I'm
> trying to optimize.
>
> So the question is- how can I get a better estimate of n_distinct from
> analyze?
>
> If I alter the stats target as high as it will go, I get closer, but it
> still shows the index to be about 1/2 as selective as it actually is:
>
> alpha=# alter table event alter column event_date_time set
> statistics 1000;
> ALTER TABLE
> alpha=# analyze event;
> ANALYZE
> alpha=# select n_distinct from pg_stats where tablename='event' and
> attname='event_date_time';
>  n_distinct
> 
>   51741
> (1 row)
>
> This number seems to be consistently around 51,000 if I re-run
> analyze a few
> times.
>
> I guess my question is two-part:
>
> (1)Is there any tweak to make this estimate work better?
>
> (2)Since I'm getting numbers that are consistent but way off, is
> there a bug
> here?
>
> (2-1/2) Or alternately, am I totally missing what n-distinct is
> supposed to
> denote?
>
> Thanks!
>-Nick
>
> -
> Nick Fankhauser
>
> [EMAIL PROTECTED]  Phone 1.765.965.7363  Fax 1.765.962.9788
> doxpop - Court records at your fingertips - http://www.doxpop.com/
>
>
>
> ---(end of broadcast)---
> TIP 7: don't forget to increase your free space map settings
>




[PERFORM] How to make n_distinct more accurate.

2003-09-23 Thread Nick Fankhauser
Hi-

I have a table- called "event" with a field event_date_time that is indexed.
There are 1,700,000 rows in the table and 92,000 distinct values of
event_date_time with anywhere from 1 to 400 rows sharing the same value. (I
did a count grouped by event_date_time & scanned it to get this info.)

When I look at the pg_stats on this table, I always see 15,000 or lower in
the n_distinct column for event_date_time. (I re-ran analyze several times &
then checked pg_stats to see if the numbers varied significantly.)

Since this is off by about a factor of 6, I think the planner is missing the
chance to use this table as the "driver" in a complex query plan that I'm
trying to optimize.

So the question is- how can I get a better estimate of n_distinct from
analyze?

If I alter the stats target as high as it will go, I get closer, but it
still shows the index to be about 1/2 as selective as it actually is:

alpha=# alter table event alter column event_date_time set statistics 1000;
ALTER TABLE
alpha=# analyze event;
ANALYZE
alpha=# select n_distinct from pg_stats where tablename='event' and
attname='event_date_time';
 n_distinct

  51741
(1 row)

This number seems to be consistently around 51,000 if I re-run analyze a few
times.

I guess my question is two-part:

(1)Is there any tweak to make this estimate work better?

(2)Since I'm getting numbers that are consistent but way off, is there a bug
here?

(2-1/2) Or alternately, am I totally missing what n-distinct is supposed to
denote?

Thanks!
   -Nick

-----
Nick Fankhauser

[EMAIL PROTECTED]  Phone 1.765.965.7363  Fax 1.765.962.9788
doxpop - Court records at your fingertips - http://www.doxpop.com/





Re: [PERFORM] SELECT's take a long time compared to other DBMS

2003-09-03 Thread Nick Fankhauser

> Yes I Analyze also, but there was no need to because it was a fresh brand
> new database.

This apparently wasn't the source of the problem since he did an analyze
anyway, but my impression was that a fresh, brand-new database is exactly the
situation where an analyze is needed- i.e., a batch of data has just been
loaded and stats haven't been collected yet.

Am I mistaken?

-Nick





Re: [PERFORM] PostgreSQL is slow...HELP

2003-09-03 Thread Nick Fankhauser

> For small records
> it is not much problems.  But as the records grew (up to 12,000
> records) the
> difference is quite significant.

Although there are many tuning options, I'd suggest starting by making sure
you have an index (unique where appropriate) on accposd.date, accposd.item,
accposd.loose and items.name. Then do an "analyze;" on the DB so the planner
has the statistics it needs to take advantage of the indexes where
appropriate.
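
For example (index and table names here are just placeholders derived from
the column names above- adjust to the real schema):

  create index accposd_date_idx on accposd (date);
  create index accposd_item_idx on accposd (item);
  create index accposd_loose_idx on accposd (loose);
  create index items_name_idx on items (name);
  analyze;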

If this doesn't help, there are other options to pursue, but this is where I
would start.

-Nick





Re: [PERFORM] How to efficiently duplicate a whole schema?

2003-08-06 Thread Nick Fankhauser
Sebastien-

I have a similar nightly process to keep our development system synched with
production. I just do a complete pg_dump of production, do a dropdb &
createdb to empty the database for development, and then restore the whole
db from the pg_dump file. Our database is about 12 GB currently, and it
takes less than one hour to dump & restore back into dev if I go through a
file. I can go even faster by piping the data to eliminate one set of reads
& writes to the disk:

dropdb dev
createdb dev
pg_dump prod | psql dev

This of course only works if you haven't changed your data structure in the
development area, but it is very simple and reasonably quick.

In situations where the data structure has changed, I run a more complex
system that deletes data rather than dropping the whole db, but I always
drop the indexes in development before restoring data and recreate them
afterwards.
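
In sketch form (index, table and column names are placeholders, not our real
ones):

  drop index big_table_name_idx;
  -- delete & reload the data here
  create index big_table_name_idx on big_table (name_column);
  analyze big_table;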

-Nick


> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf Of Sebastien
> Lemieux
> Sent: Wednesday, August 06, 2003 1:56 PM
> To: Postgresql-performance
> Subject: [PERFORM] How to efficiently duplicate a whole schema?
>
>
> Hi,
>
>   I'm running on Redhat 7.2 with postgresql 7.3.2 and I have two schema in
> the same database 'db' and 'db_dev'.  Both contain a set of >20 tables for
> a total of less than 50 Mb of data each (on the order of 50k rows in
> total).  Once in a while (often these days!), I need to synchronize the
> dev version from the production 'db'.  Currently, I do this by setting
> constraints to deferred, deleting everything in db_dev, then issue a serie
> of insert ... select ... to copy data from each table in db to the
> equivalent table in db_dev.
>
>   This approach used to run in less than 30 seconds in MySQL, but in
> PostgreSQL it currently takes around 30 minutes.  The postmaster process
> is running at 100% cpu all the time.  I enclosed all the delete statement
> in one transaction and all the insert statements in a second
> transaction.
> All the time is taken at the commit of both transaction.
>
>   Is there a more straightforward way to synchronize a development
> database to a production one?  Is there anyway to increase the
> performance
> of this delete/insert combination?  I've got indexes and constraints on
> most tables, could that be the problem?  At some point in the future, I
> will also need to make a copy of a whole schema ('db' into 'db_backup'),
> what would be an efficient way to do that?
>
>   These are the parameters I've adjusted in the postgresql.conf:
>
> max_connections = 16
> shared_buffers = 3000
> max_fsm_relations = 2000
> max_fsm_pages = 2
> sort_mem = 2
> vacuum_mem = 2
> effective_cache_size = 15000
>
>   And this is the memory state of the machine:
>
> [EMAIL PROTECTED]> free
>              total       used       free     shared    buffers     cached
> Mem:       2059472    2042224      17248      24768     115712    1286572
> -/+ buffers/cache:     639940    1419532
> Swap:      2096440     490968    1605472
>
> thanks,
>
> --
> Sebastien Lemieux
> Bioinformatics, post-doc
> Elitra-canada
>
>
> ---(end of broadcast)---
> TIP 6: Have you searched our list archives?
>
>http://archives.postgresql.org
>




Re: [PERFORM] Optimization

2003-07-28 Thread Nick Fankhauser


>Can someone tell me what effective_cache_size should be set to?

You may be able to intuit this from my last post, but if I understand
correctly, what you should be doing is estimating how much memory is likely
to be "left over" for the OS to do disk caching with after all of the basic
needs of the OS, PostgreSQL & any other applications are taken care of. You
then tell postgresql what to expect in terms of caching resources by putting
this number into effective_cache_size, and this allows the query planner
to come up with a strategy that is optimized for the expected cache size.

So the "formula" would be: Figure out how much memory is normally in use
allowing adequate margins, subtract this from your total RAM, and make the
remainder your effective_cache_size.
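
As a worked example with the numbers from my own 1GB box: roughly 512MB is
expected to be left over for the OS cache, and 512MB / 8KB per buffer page =
65536, so effective_cache_size = 65536.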


-Nick




Re: [PERFORM] Optimization

2003-07-28 Thread Nick Fankhauser - Doxpop
Justin-

It sounds like you're on a system similar to ours, so I'll pass along the
changes that I made, which seem to have increased performance, and most
importantly, haven't hurt anything. The main difference in our environment
is that we are less Update/Insert intensive than you are- in our
application, 90% of our information (court cases) is static (Closed) and 10%
are frequently being updated (Pending/Active). This means I only vacuum once
a week. I haven't had a chance to churn out objective tests yet, but my
subjective judgment is that this set of params works well:

Set SHMMAX and SHMALL in the kernel to 134217728 (128MB)
Set shared_buffers to 8192 (64MB)
Set sort_mem to 16384 (16MB)
Set effective_cache_size to 65536 (1/2 GB)


The Hardware is a dual-processor Athlon 1.2 Ghz box with 1 GB of RAM and the
DB on SCSI RAID drives.

The database size is about 8GB, with the largest table 2.5 GB, and the two
most commonly queried tables at 1 GB each.

The OS is Debian Linux kernel 2.4.x (recompiled custom kernel for dual
processor support)
The PostgreSQL version is 7.3.2

My reasoning was to increase shared_buffers based on anecdotal
recommendations I've seen on this list to 64MB and boost the OS SHMMAX to
twice that value to allow adequate room for other shared memory needs, thus
reserving 128MB. Of the remaining memory, 256MB goes to 16MB sort space
times a guesstimate of 16 simultaneous sorts at any given time. If I leave about
128 MB for headroom, then 1/2 GB should be left available for the effective
cache size.

I've never been tempted to turn fsync off. That seems like a risky move.

Regards,
 -Nick

-----
Nick Fankhauser

[EMAIL PROTECTED]  Phone 1.765.965.7363  Fax 1.765.962.9788
doxpop - Court records at your fingertips - http://www.doxpop.com/





Re: [PERFORM] Sanity check requested

2003-07-18 Thread Nick Fankhauser
I'm confused:

Ang Chin Han wrote:

> We've been using ext3fs for our production systems. (Red Hat Advanced
> Server 2.1)

Vincent van Leeuwen wrote:

> I'd upgrade to a journaling filesystem as soon as possible for
> reliability.

...About one year ago I considered moving to a journaling file system, but
opted not to because it seems like that's what WAL does for us already. How
does putting a journaling file system under it add more reliability?

I also guessed that a journaling file system would add overhead because now
a write to the WAL file could itself be deferred and logged elsewhere.

...So now I'm really puzzled because folks are weighing in with solid
anecdotal evidence saying that I'll get both better reliability and
performance. Can someone explain what I'm missing about the concept?

-A puzzled Nick




File systems (RE: [PERFORM] Sanity check requested)

2003-07-18 Thread Nick Fankhauser

Thanks for the suggestions in the FS types- especially the Debian oriented
info. I'll start by playing with the memory allocation parameters that I
originally listed (seems like they should provide results in a way that is
unaffected by the disk IO). Then once I have them at optimal values, move on
to trying different file systems.

I assume that as I make changes that affect the disk IO performance, I'll
then need to do some testing to find new values for the IO cost for the
planner- Do you folks have some ballpark numbers to start with for this
based on your experience? I'm departing in three ways from the simple IDE
model that (I presume) the default random page cost of 4 is based on- The
disks are SCSI & RAID and the FS would be different.
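
One easy way to try alternate values during those test runs is a per-session
override, e.g.:

  set random_page_cost = 2;

(2 is just an arbitrary first trial value for SCSI/RAID, not a
recommendation.)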

At this point, I can't think of any better way to test this than simply
running my local test suite with various values and recording the wall-clock
results. Is there a different approach that might make more sense? (This
means that my results will be skewed to my environment, but I'll post them
anyway.)

I'll post results back to the list as I get to it- it might be a slow
process. Since I spend about 18 hours of each day keeping the business
running, I'll have to cut back on sleep & do this in the other 10 hours.

-NF


> Shridhar Daithankar wrote:
> I appreciate your approach but it almost proven that ext2 is
> not the best and fastest out there.
>
> Agreed.

> Ang Chin Han wrote:
> We've been using ext3fs for our production systems. (Red Hat Advanced
> Server 2.1)
>
> And since your (Nick) system is based on Debian, I have done some rough
> testing on Debian sarge (testing) (with custom 2.4.20) with ext3fs,
> reiserfs and jfs. Can't get XFS going easily on Debian, though.
>
> I used a single partition mkfs'd with ext3fs, reiserfs and jfs one after
> the other on an IDE disk. Ran pgbench and osdb-x0.15-0 on it.
>
> jfs's has been underperforming for me. Somehow the CPU usage is higher
> than the other two. As for ext3fs and reiserfs, I can't detect any
> significant difference. So if you're in a hurry, it'll be easier to
> convert your ext2 to ext3 (using tune2fs) and use that. Otherwise, it'd
> be nice if you could do your own testing, and post it to the list.
>
> --
> Linux homer 2.4.18-14 #1 Wed Sep 4 13:35:50 EDT 2002 i686 i686 i386
> GNU/Linux
>2:30pm  up 204 days,  5:35,  5 users,  load average: 5.50, 5.18, 5.13
>




Re: [PERFORM] Sanity check requested

2003-07-17 Thread Nick Fankhauser

> Wrong, actually.   Sort memory is allocated *per sort*, not per connection
> or per query.   So a single complex query could easily use 4xsort_mem if it
> has several merge joins ...

Thanks for the correction- it sounds like this is one where usage can't be
precisely controlled in a dynamic user environment & I just need to get a
feel for what works under a load that approximates my production system.

-Nick




Re: [PERFORM] Sanity check requested

2003-07-17 Thread Nick Fankhauser
Shridhar-

I appreciate your thoughts- I'll be running some before & after tests on
this using one of our development/hot-swap boxes, so I'll report the results
back to the list.

A few more thoughts/questions:

> 1. 30 users does not seem to be much of an overhead. If possible
> try doing away with connection pooling.

The application needs to scale up gracefully. We actually have about 200
users that could decide to log on at the same time- 30 is just a typical
load. We'd also prefer to have 20,000 subscribers so we can start making a
living with this business .

> 2. While increasing sort memory, try 4/8/16 in that order. That way you
> will get a better picture of load behaviour. Though whatever you put
> appears reasonable, having more data always help.

I'll try that approach while testing. Is it the case that the sort memory is
allocated for each connection and becomes unavailable to other processes
while the connection exists? If so, since I'm using a connection pool, I
should be able to control total usage precisely. Without a connection pool,
I could start starving the rest of the system for resources if the number of
users spiked unexpectedly. Correct?



> 3. I don't know how this affects SCSI drives, but what file
> system are you using? Can you try different ones?

> 4. OK, this is too much but linux kernel 2.6 is in test and has
> vastly improved IO...

I'm using ext2. For now, I'll leave this and the OS version alone. If I
change too many variables, I won't be able to discern which one is causing a
change. Although I understand that there's an element of art to tuning, I'm
enough of a neophyte that I don't have a "feeling" for the tuning parameters
yet and hence I have to take a scientific approach of just tweaking a few
variables in an otherwise controlled and unchanged environment. If I can't
reach my goals with the simple approach, I'll consider some of the more
radical ideas.

Again, thanks for the ideas- I'll feed the results back after I've done some
tests.

-Nick








Re: [PERFORM] Tuning FreeBSD and PostgreSQL

2003-07-14 Thread Nick Fankhauser

> I still would like some guidance on tunning FreeBSD (shmmax and
> shmmaxpgs).
> Do I need to even touch these settings?

Stephen- I have no idea what these are set to by default in FreeBSD, but
here's the page that covers changing it in the postgresql docs:

http://www.postgresql.org/docs/7.3/static/kernel-resources.html

-Nick




[PERFORM] Sanity check requested

2003-07-14 Thread Nick Fankhauser
Hi folks-

For some time, we've been running Postgres with the default configuration &
getting adequate performance, but the time has come to tune a bit, so I've
been lurking on this list & gathering notes. Now I'm about ready to make a
change & would appreciate it if a few more experienced folks could comment
on whether I appear to be heading in the right direction-

Here's what I'm planning:

Increase SHMMAX and SHMALL in my kernel to 134217728 (128MB)

Increase shared_buffers to 8192 (64MB)

Increase sort_mem to 16384 (16MB)

Increase effective_cache_size to 65536 (1/2 GB)
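
For the kernel piece, I'd presumably apply it with something like this
(on Linux, shmmax is in bytes but shmall is counted in 4KB pages, so 32768
pages covers the same 128MB):

  sysctl -w kernel.shmmax=134217728
  sysctl -w kernel.shmall=32768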


Here's the environment:

The Hardware is a dual-processor Athlon 1.2 Ghz box with 1 GB of RAM and the
DB on SCSI RAID drives.

The server runs only PostgreSQL

The database size is about 8GB, with the largest table 2.5 GB, and the two
most commonly queried tables at 1 GB each.

The two most commonly queried tables are usually queried based on a
non-unique indexed varchar field typically 20 chars long. The query is a
"like" on people's names with trailing %, so this often gets pushed to seq
scan or returns several thousand records. (As when someone searches on
'Jones%'.)

Records from the largest table are always accessed via unique index in
groups of 20 or less.

The OS is Debian Linux kernel 2.4.x (recompiled custom kernel for dual
processor support)
The PostgreSQL version is 7.3.2

We typically have about 30 interactive users on the DB, but they're using a
shared connection pool of 16. Our main problem appears to be when one of the
users fires up a large query and creates a log-jam with resources.


My reasoning is that I'll increase shared_buffers based on anecdotal
recommendations I've seen on this list to 64MB. I'll boost the OS SHMMAX to
twice that value to allow adequate room for other shared memory needs, thus
reserving 128MB. Of the remaining memory, 256MB goes to 16 connections *
16MB sort space, if I leave about 128 MB for headroom, then 1/2 GB should be
left available for the effective cache size.
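
Spelled out, the budget I'm assuming looks like:

  1024MB total RAM
 - 128MB shared memory ceiling (SHMMAX)
 - 256MB sort space (16 connections * 16MB sort_mem)
 - 128MB headroom for the OS and other processes
 = 512MB expected to remain as OS disk cache  (65536 8KB pages)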

Any thoughts? Is this a sane plan? Are there other parameters I should
consider changing first?


Thanks!
 -Nick

-
Nick Fankhauser

[EMAIL PROTECTED]  Phone 1.765.965.7363  Fax 1.765.962.9788
doxpop - Court records at your fingertips - http://www.doxpop.com/



