[PERFORM] Adding RAM: seeking advice & warnings of hidden "gotchas"
Hi-

After having done my best to squeeze better performance out of our application by tuning within our existing resources, I'm falling back on adding memory as a short-term solution while we get creative for a long-term fix. I'm curious about what experiences others have had with the process of adding big chunks of RAM. In particular, if I'm trying to encourage the OS to cache more of my index information in RAM, what sort of configuration should I do at both the PostgreSQL and OS level?

In a slightly off-topic vein, I'd also like to hear about it if anyone knows about any gotchas at the OS level that might become a problem.

The server is a dual-processor Athlon 1.2GHz box with hardware SCSI RAID. It currently has 1 GB RAM, and we're planning to add one GB more for a total of 2 GB. The OS is Debian Linux kernel 2.4.x, and we're on PostgreSQL v7.3.2.

My current memory-related settings are:

SHMMAX and SHMALL set to 128MB (OS setting)
shared_buffers = 8192 (64MB)
sort_mem = 16384 (16MB)
effective_cache_size = 65536 (512MB)

We support up to 70 active users, sharing a connection pool of 16 connections. Most of the queries center around 3 tables that are about 1.5 GB each.

Thanks.
-Nick

-----
Nick Fankhauser [EMAIL PROTECTED]
Phone 1.765.965.7363 Fax 1.765.962.9788
doxpop - Court records at your fingertips - http://www.doxpop.com/

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
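For what it's worth, the settings quoted above can be sanity-checked with a little arithmetic. This is my own back-of-envelope sketch, not anything from the original post; the 1.5 GB figure is an assumption about how much of the upgraded 2 GB box might end up as OS file cache.

```python
# shared_buffers and effective_cache_size are counted in 8 kB pages,
# so the settings quoted above convert to MB like this:
PAGE_KB = 8

def pages_to_mb(pages):
    # 1024 kB per MB
    return pages * PAGE_KB // 1024

print(pages_to_mb(8192))    # shared_buffers: 64 (MB)
print(pages_to_mb(65536))   # effective_cache_size: 512 (MB)

# If, hypothetically, ~1.5 GB of the upgraded 2 GB box ends up as OS
# file cache, the matching effective_cache_size would be:
print(1536 * 1024 // PAGE_KB)   # 196608 pages
```

So after the upgrade, raising effective_cache_size toward roughly 196608 pages would keep it in step with the actual OS cache, if that assumption about free memory holds.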
Re: [PERFORM] Nested loop performance
> As a question, what does explain analyze give you if you
> set enable_nestloop=false; before trying the query?

Here are the results- It looks quite a bit more painful than the other plan, although the wall time is in the same ballpark.

alpha=# explain analyze
alpha-# select
alpha-# min(actor.actor_id) as actor_id,
alpha-# min(actor.actor_entity_type) as actor_entity_type,
alpha-# min(actor.role_class_code) as role_class_code,
alpha-# min(actor.actor_full_name) as actor_full_name,
alpha-# min(actor.actor_person_date_of_birth) as actor_person_date_of_birth,
alpha-# min(actor.actor_entity_acronym) as actor_entity_acronym,
alpha-# min(actor.actor_person_last_name) as actor_person_last_name,
alpha-# min(actor.actor_person_first_name) as actor_person_first_name,
alpha-# min(actor.actor_person_middle_name) as actor_person_middle_name,
alpha-# min(actor.actor_person_name_suffix) as actor_person_name_suffix,
alpha-# min(actor.actor_person_place_of_birth) as actor_person_place_of_birth,
alpha-# min(actor.actor_person_height) as actor_person_height,
alpha-# min(actor.actor_person_height_unit) as actor_person_height_unit,
alpha-# min(actor.actor_person_weight) as actor_person_weight,
alpha-# min(actor.actor_person_weight_unit) as actor_person_weight_unit,
alpha-# min(actor.actor_person_ethnicity) as actor_person_ethnicity,
alpha-# min(actor.actor_person_citizenship_count) as actor_person_citizenship_count,
alpha-# min(actor.actor_person_hair_color) as actor_person_hair_color,
alpha-# min(actor.actor_person_scars_marks_tatto) as actor_person_scars_marks_tatto,
alpha-# min(actor.actor_person_marital_status) as actor_person_marital_status,
alpha-# min(actor.actor_alias_for_actor_id) as actor_alias_for_actor_id,
alpha-# min(to_char(data_source.source_last_update, 'MM/DD/ HH12:MI AM TZ')) as last_update,
alpha-# min(actor_summary.single_case_public_id) as case_public_id,
alpha-# min(actor_summary.single_case_id) as case_id,
alpha-# sum(actor_summary.case_count) as case_count
alpha-# from
alpha-# actor,
alpha-# actor_summary,
alpha-# data_source
alpha-# where
alpha-# actor.actor_id = actor_summary.actor_id
alpha-# and data_source.source_id = actor.source_id
alpha-# and actor.actor_full_name_uppercase like upper('sanders%')
alpha-# group by
alpha-# actor.actor_id
alpha-# order by
alpha-# min(actor.actor_full_name_uppercase),
alpha-# case_count desc,
alpha-# min(actor_summary.case_disp_global_code)
alpha-# limit
alpha-# 1000;

QUERY PLAN
--
Limit (cost=168919.98..168920.03 rows=20 width=548) (actual time=91247.95..91249.05 rows=1000 loops=1)
  -> Sort (cost=168919.98..168920.03 rows=20 width=548) (actual time=91247.95..91248.35 rows=1001 loops=1)
     Sort Key: min((actor.actor_full_name_uppercase)::text), sum(actor_summary.case_count), min((actor_summary.case_disp_global_code)::text)
     -> Aggregate (cost=168904.95..168919.54 rows=20 width=548) (actual time=91015.00..91164.68 rows=3590 loops=1)
        -> Group (cost=168904.95..168905.95 rows=201 width=548) (actual time=90999.87..91043.25 rows=3594 loops=1)
           -> Sort (cost=168904.95..168905.45 rows=201 width=548) (actual time=90999.83..91001.57 rows=3594 loops=1)
              Sort Key: actor.actor_id
              -> Hash Join (cost=903.08..168897.24 rows=201 width=548) (actual time=25470.63..90983.45 rows=3594 loops=1)
                 Hash Cond: ("outer".actor_id = "inner".actor_id)
                 -> Seq Scan on actor_summary (cost=0.00..150715.43 rows=3455243 width=73) (actual time=8.03..52902.24 rows=3455243 loops=1)
                 -> Hash (cost=902.57..902.57 rows=204 width=475) (actual time=25459.92..25459.92 rows=0 loops=1)
                    -> Hash Join (cost=1.14..902.57 rows=204 width=475) (actual time=155.92..25451.25 rows=3639 loops=1)
                       Hash Cond: ("outer".source_id = "inner".source_id)
                       -> Index Scan using actor_full_name_uppercase on actor (cost=0.00..897.20 rows=223 width=463) (actual time=144.93..25404.10 rows=3639 loops=1)
                          Index Cond: ((actor_full_name_uppercase >= 'SANDERS'::character varying) AND (actor_full_name_uppercase < 'SANDERT'::character varying))
                          Filter: (actor_full_name_uppercase ~~ 'SANDERS%'::text)
                       -> Hash (cost=1.11..1.11 rows=11 width=12) (actual time=10.66..10.66 rows=0 loops=1)
Re: [PERFORM] Nested loop performance
> It seems that your basic problem is that you're fetching lots of rows
> from two big ol' tables.
>
> It doesn't seem to me that there would be a substantially better plan
> for this query with your tables as they stand.

That's more or less the conclusion I had come to. I was just hoping someone else could point out an approach I've been missing. (sigh!)

> If your data were more
> normalised, then your big scans might be quicker (because their rows
> would be smaller so they would hit fewer disk pages),

This started off as a 5-table join on well-normalized data. Unfortunately, the actor table doesn't get any smaller, and the work involved in calculating the "case_count" information on the fly was clearly becoming a problem- particularly with actors that had a heavy caseload. (Busy attorneys and judges.) The actor_summary approach makes these previous problem cases go away, but the payback is that (as you correctly pointed out) queries on average citizens who only have one case suffer from the de-normalized approach.

We're currently considering the approach of just returning all of the rows to our application and doing the aggregation and limit work in the app. The inconsistency of the data makes it very tough for the query planner to come up with a strategy that is always a winner.

Thanks for your thoughts!

-Nick
Re: [PERFORM] Nested loop question
> The fact that it's taking you 9ms to do each index lookup
> suggests to me that it's going to disk each time. Does that sound
> plausible, or do you think you have enough RAM to cache your large
> indexes?

I'm sure we don't have enough RAM to cache all of our large indexes, so your supposition makes sense. We have 1GB on this machine.

In responding to the performance problems we're having, one of the questions has been adding memory vs. crafting "helper" tables to speed things up. The issue is that this database needs to be able to scale easily to about 10 times the size, so although we could easily triple the memory at reasonable expense, we'd eventually hit a wall.

Is there any solid method to ensure that a particular index always resides in memory? A hybrid approach that might scale reliably would be to bump up our memory and then make sure key indexes are cached. However, I'm concerned that if we didn't have a way to ensure that the indexes we choose remain cached, we would have very inconsistent responses.

Thanks for your ideas!

-Nick
[PERFORM] Nested loop question
7) (actual time=48604.04..48647.91 rows=3594 loops=1)
  -> Sort (cost=2555.50..2555.50 rows=1 width=547) (actual time=48604.01..48605.70 rows=3594 loops=1)
     Sort Key: actor.actor_id
     -> Nested Loop (cost=1.14..2555.49 rows=1 width=547) (actual time=69.09..48585.83 rows=3594 loops=1)
        -> Hash Join (cost=1.14..900.39 rows=204 width=475) (actual time=46.92..15259.02 rows=3639 loops=1)
           Hash Cond: ("outer".source_id = "inner".source_id)
           -> Index Scan using actor_full_name_uppercase on actor (cost=0.00..895.04 rows=222 width=463) (actual time=46.54..15220.77 rows=3639 loops=1)
              Index Cond: ((actor_full_name_uppercase >= 'SANDERS'::character varying) AND (actor_full_name_uppercase < 'SANDERT'::character varying))
              Filter: (actor_full_name_uppercase ~~ 'SANDERS%'::text)
           -> Hash (cost=1.11..1.11 rows=11 width=12) (actual time=0.05..0.05 rows=0 loops=1)
              -> Seq Scan on data_source (cost=0.00..1.11 rows=11 width=12) (actual time=0.02..0.04 rows=11 loops=1)
        -> Index Scan using actor_summary_pk on actor_summary (cost=0.00..8.11 rows=1 width=72) (actual time=9.14..9.15 rows=1 loops=3639)
           Index Cond: ("outer".actor_id = actor_summary.actor_id)
Total runtime: 48851.85 msec
(18 rows)

-Nick
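A quick bit of arithmetic on the timings above (mine, not from the thread) shows why the per-probe cost matters so much here: the inner index scan runs once per outer row, and at roughly 9 ms per probe it accounts for most of the runtime.

```python
# Timings copied from the EXPLAIN ANALYZE output above.
outer_hash_join_ms = 15259.02   # actor joined to data_source
probe_ms = 9.15                 # upper bound per actor_summary_pk probe
loops = 3639                    # one probe per row from the outer side

inner_total_ms = probe_ms * loops
print(round(inner_total_ms))    # ~33297 ms spent in index probes alone

# Together the two pieces account for nearly all of the
# 48585 ms reported for the Nested Loop node:
print(round(outer_hash_join_ms + inner_total_ms))  # ~48556
```

At ~9 ms per probe, each lookup is costing about one random disk seek, which is what prompted the "is it going to disk each time?" question in the parallel thread.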
Re: [PERFORM] n_distinct way off, but following a pattern.
> It would be interesting to see exactly what inputs are going into this
> equation. Do you feel like adding some debug printouts into this code?
> Or just looking at the variables with a debugger? In 7.3 it's about
> line 1060 in src/backend/commands/analyze.c.

Tom-

I don't really have time to follow up at this moment, but I think this would be interesting to look into, so I'll plan to dig into it over the Thanksgiving holiday when I'll have a little time free to follow up on some fun projects. Your pointers should let me get into it pretty quickly.

In the meantime, I'll just set up a cron job that runs behind my nightly analyze to put the correct numbers into pg_statistic on the tables that this affects.

Thanks-
-Nick
[PERFORM] n_distinct way off, but following a pattern.
Hi-

I'm seeing estimates for n_distinct that are way off for a large table (8,700,000 rows). They get better by setting the stats target higher, but are still off by a factor of 10 with the stats set to 1000. I've noticed and reported a similar pattern before on another table. Because this follows the same very consistent pattern, I figured it was worth reporting again. This looks more like a bug than randomness. If the poor result were simply due to having a small sample to work from, the estimates should be all over the map, but these are consistently low, and vary in almost exact inverse proportion to the stats target:

                                          run 1:   run 2:   run 3:
n_distinct estimate, statistics = 10:       3168     3187     3212
n_distinct estimate, statistics = 100:     23828    24059    23615
n_distinct estimate, statistics = 1000:   194690   194516   194081

Actual distinct values: 3340724

Or to put it another way, if you were to take the estimate from analyze, divide by the stats target and multiply by 10,000, the result would be pretty close to exact. (Within a factor of 2, which ought to be plenty close for planning purposes.)

I'm running version 7.3.2

Any thoughts from folks familiar with this part of the source code?

Regards,
-Nick

PS- Here's a log of the session that I got this from.
alpha=# select count(distinct actor_id) from actor_case_assignment;
-[ RECORD 1 ]--
count | 3340724

alpha=# analyze;
ANALYZE
alpha=# SELECT * FROM pg_stats
alpha-# WHERE tablename='actor_case_assignment' AND attname='actor_id';
-[ RECORD 1 ]-+---
schemaname        | public
tablename         | actor_case_assignment
attname           | actor_id
null_frac         | 0
avg_width         | 16
n_distinct        | 3168
most_common_vals  | {18105XS,18115XS,18106XS,18113JD02,18115JD02,18106J27,18113XS,18113A10656,18115LST,18108XS}
most_common_freqs | {0.0206667,0.0206667,0.0196667,0.019,0.0176667,0.017,0.016,0.015,0.014,0.0136667}
histogram_bounds  | {18067A000-07P,18067PD397SC1574-1,18105LBPD,18106A2119-49,18106PD399IF845-1,18108A03068-20,18108LECS207,18108PTW03737278-2,18111A19788-77,18115A50420,18115XC}
correlation       | 0.876795

alpha=# alter table actor_case_assignment alter column actor_id set statistics 100;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# SELECT * FROM pg_stats
alpha-# WHERE tablename='actor_case_assignment' AND attname='actor_id';
-[ RECORD 1 ]
schemaname       | public
tablename        | actor_case_assignment
attname          | actor_id
null_frac        | 0
avg_width        | 17
n_distinct       | 23828
most_common_vals | {18115XS,18113JD02,18106XS,1

alpha=# alter table actor_case_assignment alter column actor_id set statistics 1000;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# SELECT * FROM pg_stats
alpha-# WHERE tablename='actor_case_assignment' AND attname='actor_id';
-[ RECORD 1 ]-
schemaname       | public
tablename        | actor_case_assignment
attname          | actor_id
null_frac        | 0
avg_width        | 16
n_distinct       | 194690
most_common_vals | {18106XS,18115XS,18115...

alpha=# \x
Expanded display is off.
alpha=# alter table actor_case_assignment alter column actor_id set statistics 10;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# select n_distinct from pg_stats where tablename='actor_case_assignment and attname='actor_id';
alpha'# ';
ERROR:  parser: parse error at or near "actor_id" at character 85
alpha=# select n_distinct from pg_stats where tablename='actor_case_assignment' and attname='actor_id';
 n_distinct
       3187
(1 row)

alpha=# alter table actor_case_assignment alter column actor_id set statistics 10;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# select n_distinct from pg_stats where tablename='actor_case_assignment' and attname='actor_id';
 n_distinct
       3212
(1 row)

alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# alter table actor_case_assignment alter column actor_id set statistics 100;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# select n_distinct from pg_stats where tablename='actor_case_assignment' and attname='actor_id';
 n_distinct
      24059
(1 row)

alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# alter table actor_case_assignment alter column actor_id set statistics 100;
ALTER TABLE
alpha=# select n_distinct from pg_stats where tablename='actor_case_assignment' and attname='actor_id';
 n_distinct
      23615
(1 row)

alpha=# alter table actor_case_assignment alter column actor_id set statistics 1000;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# select n_distinct from pg_stats where tablename='actor_cas
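The pattern in the session above is striking enough to state precisely. Taking each estimate, dividing by its stats target, and multiplying by 10,000 (my reading of the "almost exact inverse proportion" observation) lands within a factor of two of the true count every time:

```python
# Nine ANALYZE runs from the session above, keyed by statistics target.
actual = 3340724
runs = {
    10:   [3168, 3187, 3212],
    100:  [23828, 24059, 23615],
    1000: [194690, 194516, 194081],
}
for target, estimates in runs.items():
    for est in estimates:
        scaled = est / target * 10000
        # every scaled estimate is within a factor of two of the truth
        assert 0.5 < scaled / actual < 2.0, (target, est)
print("all nine estimates fit the inverse-proportion pattern")
```

The factor 10,000 is an empirical fit to these particular numbers, not anything derived from the ANALYZE source; it's just a compact way of showing how systematic (rather than random) the underestimate is.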
Re: [PERFORM] Seeking help with a query that takes too long
> Does actor_case_assignment contain more columns than just the two ids?
> If yes, do these additional fields account for ca. 70 bytes per tuple?
> If not, try
> VACUUM FULL ANALYSE actor_case_assignment;

actor_case_assignment has its own primary key and a "role" field in addition to the ids you've seen, so 70 bytes sounds reasonable. (The PK is to allow a remote mirroring application to update these records- otherwise it would be unnecessary.)

> 7ms per
> tuple returned looks like a lot of disk seeks are involved. Is
> clustering actor on actor_full_name_uppercase an option or would this
> slow down other queries?

Good question... I've never used clustering in PostgreSQL before, so I'm unsure. I presume this is like clustering in Oracle, where the table is ordered to match the index? If so, I think you may be onto something, because the only other field we regularly query on is the actor_id. Actor_id has a unique index with no clustering currently, so I don't think I'd lose a thing by clustering on actor_full_name_uppercase. I'll give this a try & let you know how it changes.

BTW, you are correct that caching has a big effect on the actual time figures in this case- I'm working on my development DB, so cached info doesn't get trampled as quickly by other users. Is there a way to flush out the cache in a testing situation like this in order to start from a consistent base?

Thanks!
-Nick
[PERFORM] Seeking help with a query that takes too long
Hi-

I have a query that takes too long. I haven't been able to come up with any ideas for speeding it up, so I'm seeking some input from the list.

I'm using version 7.3.2

I have three tables:

case_data (1,947,386 rows)
actor (3,385,669 rows)
actor_case_assignment (8,668,650 rows)

As the names imply, actor_case_assignment contains records that assign an actor to a case. Actors such as attorneys or judges may have many cases, while the average actor (we hope) only has one. What I'm trying to do is link these tables to get back a single row per actor that shows the actor's name, the number of cases that actor is assigned to, and if they only have one case, I want the public_id for that case. This means I have to do a group by to get the case count, but I'm then forced to use an aggregate function like max on the other fields.

All of the fields ending in "_id" have unique indexes, and actor_full_name_uppercase is indexed.

Here's the select:

select
  actor.actor_id,
  max(actor.actor_full_name),
  max(case_data.case_public_id),
  max(case_data.case_id),
  count(case_data.case_id) as case_count
from
  actor,
  actor_case_assignment,
  case_data
where
  actor.actor_full_name_uppercase like upper('sanders%')
  and actor.actor_id = actor_case_assignment.actor_id
  and case_data.case_id = actor_case_assignment.case_id
group by
  actor.actor_id
order by
  max(actor.actor_full_name),
  case_count desc
limit 1000;

Here's the explain analyze:

QUERY PLAN
--
Limit (cost=2214.71..2214.72 rows=1 width=115) (actual time=120034.61..120035.67 rows=1000 loops=1)
  -> Sort (cost=2214.71..2214.72 rows=1 width=115) (actual time=120034.60..120034.98 rows=1001 loops=1)
     Sort Key: max((actor.actor_full_name)::text), count(case_data.case_id)
     -> Aggregate (cost=2214.67..2214.70 rows=1 width=115) (actual time=119962.80..120011.49 rows=3456 loops=1)
        -> Group (cost=2214.67..2214.68 rows=2 width=115) (actual time=119962.76..119987.04 rows=5879 loops=1)
           -> Sort (cost=2214.67..2214.68 rows=2 width=115) (actual time=119962.74..119965.09 rows=5879 loops=1)
              Sort Key: actor.actor_id
              -> Nested Loop (cost=0.00..2214.66 rows=2 width=115) (actual time=59.05..119929.71 rows=5879 loops=1)
                 -> Nested Loop (cost=0.00..2205.26 rows=3 width=76) (actual time=51.46..66089.04 rows=5882 loops=1)
                    -> Index Scan using actor_full_name_uppercase on actor (cost=0.00..6.01 rows=1 width=42) (actual time=37.62..677.44 rows=3501 loops=1)
                       Index Cond: ((actor_full_name_uppercase >= 'SANDERS'::character varying) AND (actor_full_name_uppercase < 'SANDERT'::character varying))
                       Filter: (actor_full_name_uppercase ~~ 'SANDERS%'::text)
                    -> Index Scan using actor_case_assignment_actor_id on actor_case_assignment (cost=0.00..2165.93 rows=2666 width=34) (actual time=16.37..18.67 rows=2 loops=3501)
                       Index Cond: ("outer".actor_id = actor_case_assignment.actor_id)
                 -> Index Scan using case_data_case_id on case_data (cost=0.00..3.66 rows=1 width=39) (actual time=9.14..9.15 rows=1 loops=5882)
                    Index Cond: (case_data.case_id = "outer".case_id)
Total runtime: 120038.60 msec
(17 rows)

Any ideas? Thanks!

-Nick
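Multiplying the per-loop times by their loop counts (my arithmetic, using the numbers in the plan above) shows where the 120 seconds go: almost entirely into the two inner index scans, at roughly 18 ms and 9 ms per probe.

```python
# Per-node actual times from the EXPLAIN ANALYZE above
# (using the upper end of each per-loop range).
actor_scan_ms = 677.44            # index scan on actor, runs once
assignment_ms = 18.67 * 3501      # actor_case_assignment probe per actor row
case_ms = 9.15 * 5882             # case_data probe per assignment row

total_ms = actor_scan_ms + assignment_ms + case_ms
print(round(total_ms))   # ~119861 of the 119930 ms in the Nested Loop node
```

In other words, the plan shape is cheap; the cost is ~9500 index probes that each appear to pay a disk seek, which is why the later replies focus on caching and clustering rather than on a different join order.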
Re: [PERFORM] Seeking help with a query that takes too long
> You might have to resort to brute force, like "set enable_nestloop=false".
> Just out of curiosity, what do you get if you do that?

I get a different plan, but similar execution time:

Limit (cost=323437.13..323437.13 rows=1 width=115) (actual time=170921.89..170922.95 rows=1000 loops=1)
  -> Sort (cost=323437.13..323437.13 rows=1 width=115) (actual time=170921.89..170922.26 rows=1001 loops=1)
     Sort Key: max((actor.actor_full_name)::text), count(case_data.case_id)
     -> Aggregate (cost=323437.08..323437.12 rows=1 width=115) (actual time=170849.94..170898.06 rows=3457 loops=1)
        -> Group (cost=323437.08..323437.09 rows=3 width=115) (actual time=170849.90..170873.60 rows=5880 loops=1)
           -> Sort (cost=323437.08..323437.08 rows=3 width=115) (actual time=170847.97..170850.21 rows=5880 loops=1)
              Sort Key: actor.actor_id
              -> Hash Join (cost=25.29..323437.06 rows=3 width=115) (actual time=122873.80..170814.27 rows=5880 loops=1)
                 Hash Cond: ("outer".case_id = "inner".case_id)
                 -> Seq Scan on case_data (cost=0.00..60368.16 rows=1947116 width=39) (actual time=12.95..43542.25 rows=1947377 loops=1)
                 -> Hash (cost=25.28..25.28 rows=3 width=76) (actual time=122844.40..122844.40 rows=0 loops=1)
                    -> Hash Join (cost=6.02..25.28 rows=3 width=76) (actual time=24992.70..122810.32 rows=5883 loops=1)
                       Hash Cond: ("outer".actor_id = "inner".actor_id)
                       -> Seq Scan on actor_case_assignment (cost=0.00..209980.49 rows=8669349 width=34) (actual time=9.13..85504.05 rows=8670467 loops=1)
                       -> Hash (cost=6.01..6.01 rows=1 width=42) (actual time=24926.56..24926.56 rows=0 loops=1)
                          -> Index Scan using actor_full_name_uppercase on actor (cost=0.00..6.01 rows=1 width=42) (actual time=51.67..24900.53 rows=3502 loops=1)
                             Index Cond: ((actor_full_name_uppercase >= 'SANDERS'::character varying) AND (actor_full_name_uppercase < 'SANDERT'::character varying))
                             Filter: (actor_full_name_uppercase ~~ 'SANDERS%'::text)
Total runtime: 170925.93 msec
(19 rows)

-Nick
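For comparison, the same accounting on this hash-join plan (again my own arithmetic on the numbers in the plan above) shows the time has simply moved into the two sequential scans plus the oddly slow actor index scan:

```python
# Actual times from the enable_nestloop=false plan above.
case_data_seq_ms = 43542.25      # seq scan on case_data (1.9M rows)
assignment_seq_ms = 85504.05     # seq scan on actor_case_assignment (8.7M rows)
actor_index_ms = 24900.53        # index scan on actor

accounted = case_data_seq_ms + assignment_seq_ms + actor_index_ms
print(round(accounted))   # ~153947 of the ~170926 ms total runtime
```

So neither plan shape wins: one pays ~9500 random probes, the other pays two full scans of the big tables. That is consistent with the later conclusion that the tables themselves, not the plan, are the bottleneck.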
Re: [PERFORM] Seeking help with a query that takes too long
> It looks like you are running with the default statistics target (10).
> Try boosting it to 100 or even more for this column (see ALTER TABLE
> SET STATISTICS, then re-ANALYZE) and see if the estimate gets better.

Here are the results & a few more clues:

prod1=# alter table actor alter column actor_full_name_uppercase set statistics 1000;
ALTER TABLE
prod1=# analyze actor;
ANALYZE
prod1=# select count(distinct actor_full_name_uppercase) from actor;
  count
---------
 1453371
(1 row)

prod1=# select count(actor_id) from actor;
  count
---------
 3386359
(1 row)

This indicates to me that 1 isn't too shabby as an estimate if the whole name is specified, but I'm not sure how this gets altered in the case of a "LIKE"

prod1=# \x
Expanded display is on.
prod1=# SELECT * FROM pg_stats
prod1-# WHERE tablename='actor' AND attname='actor_full_name_uppercase';
schemaname  | public
tablename   | actor
attname     | actor_full_name_uppercase
null_frac   | 0.000586667
avg_width   | 21
n_distinct  | -0.14701
correlation | -0.00211291

Question: What does it mean when n_distinct is negative?
New results of explain analyze:

QUERY PLAN
--
Limit (cost=252683.61..252683.68 rows=28 width=116) (actual time=169377.32..169378.39 rows=1000 loops=1)
  -> Sort (cost=252683.61..252683.68 rows=29 width=116) (actual time=169377.31..169377.69 rows=1001 loops=1)
     Sort Key: max((actor.actor_full_name)::text), count(case_data.case_id)
     -> Aggregate (cost=252678.57..252682.91 rows=29 width=116) (actual time=169305.79..169354.50 rows=3456 loops=1)
        -> Group (cost=252678.57..252680.01 rows=289 width=116) (actual time=169305.76..169330.00 rows=5879 loops=1)
           -> Sort (cost=252678.57..252679.29 rows=289 width=116) (actual time=169305.75..169308.15 rows=5879 loops=1)
              Sort Key: actor.actor_id
              -> Nested Loop (cost=0.00..252666.74 rows=289 width=116) (actual time=89.27..169273.51 rows=5879 loops=1)
                 -> Nested Loop (cost=0.00..251608.11 rows=289 width=77) (actual time=57.73..92753.49 rows=5882 loops=1)
                    -> Index Scan using actor_full_name_uppercase on actor (cost=0.00..456.88 rows=113 width=42) (actual time=32.80..3197.28 rows=3501 loops=1)
                       Index Cond: ((actor_full_name_uppercase >= 'SANDERS'::character varying) AND (actor_full_name_uppercase < 'SANDERT'::character varying))
                       Filter: (actor_full_name_uppercase ~~ 'SANDERS%'::text)
                    -> Index Scan using actor_case_assignment_actor_id on actor_case_assignment (cost=0.00..2181.29 rows=2616 width=35) (actual time=22.26..25.57 rows=2 loops=3501)
                       Index Cond: ("outer".actor_id = actor_case_assignment.actor_id)
                 -> Index Scan using case_data_case_id on case_data (cost=0.00..3.65 rows=1 width=39) (actual time=13.00..13.00 rows=1 loops=5882)
                    Index Cond: (case_data.case_id = "outer".case_id)
Total runtime: 169381.38 msec
(17 rows)
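On the negative n_distinct question above: in pg_stats a negative value is a ratio rather than a count; it is the negated fraction of rows that ANALYZE believes are distinct, stored that way so the figure scales automatically as the table grows. Checking it against the counts from the session (my arithmetic):

```python
# Values from the psql session above.
rows = 3386359            # select count(actor_id) from actor
n_distinct = -0.14701     # pg_stats for actor_full_name_uppercase
true_distinct = 1453371   # select count(distinct actor_full_name_uppercase)

implied = round(-n_distinct * rows)
print(implied)                           # ~497829 distinct names implied
print(round(true_distinct / rows, 3))    # true fraction is ~0.429, not 0.147
```

So even the stats-target-1000 run is still underestimating the distinct fraction by roughly 3x on this column, which fits the systematic-underestimate pattern reported in the n_distinct thread elsewhere in this digest.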
Re: [PERFORM] Seeking help with a query that takes too long
> > actor_full_name_uppercase on actor (cost=0.00..6.01 rows=1 width=42)
> >                                                     ^^^^^^
> > (actual time=37.62..677.44 rows=3501 loops=1)
> >                            ^^^^^^^^^
> Nick, can you find out why this row count estimation is so far off?

Oops- I read this backward- I see what you mean now. That's a good question. I'm not sure what part of the stats this estimate might be pulled from. The average is 357, but the most common frequency may be around 1.

-Nick
Re: [PERFORM] Seeking help with a query that takes too long
> (actual time=37.62..677.44 rows=3501 loops=1)
>
> Nick, can you find out why this row count estimation is so far off?

It's actually correct:

prod1=# select count(actor_id) from actor where actor_full_name_uppercase like 'SANDERS%';
 count
-------
  3501
(1 row)

Of course, I merely chose "SANDERS" arbitrarily as a name that falls somewhere near the middle of the frequency range for names. SMITH or JONES would represent a worst-case, and something like KOIZAR would probably be unique.

Here are the stats:

prod1=# SELECT * FROM pg_stats
prod1-# WHERE tablename='actor' AND attname='actor_full_name_uppercase';
-[ RECORD 1 ]-+---
schemaname        | public
tablename         | actor
attname           | actor_full_name_uppercase
null_frac         | 0.00033
avg_width         | 21
n_distinct        | 24215
most_common_vals  | {"STATE OF INDIANA","INDIANA DEPARTMENT OF REVENUE","BARTH CONS SCHOOL CORP","HOWARD COUNTY CLERK","ADVANCED RECOVERY SERVICES","STATE OF INDIANA-DEPT OF REVENUE","ALLIED COLLECTION SERVICE INC","CREDIT BUREAU OF LAPORTE","MIDWEST COLLECTION SVC INC","NCO FINANCIAL SYSTEMS INC"}
most_common_freqs | {0.015,0.014,0.0043,0.0043,0.004,0.0037,0.0033,0.0033,0.0027,0.0027}
histogram_bounds  | {"(POE) ESTELLE, DENISE","BRIEN, LIISI","COTTRELL, CAROL","FAMILY RENTALS","HAYNES, TAMIKA","KESSLER, VICTORIA","MEFFORD, VERNON L","PHILLIPS, GERALD L","SHELTON, ANTOINETTE","TRICARICO, MELISSA SUE","ZUEHLKE, THOMAS L"}
correlation       | -0.00147395

I think this means that the average is 357 per actor. As you can see, the range of assignments varies from people with a single parking ticket to "State of Indiana", which is party to many thousands of cases.

> BTW, there seem to be missing cases:
> > -> Nested Loop (cost=0.00..2214.66 rows=2 width=115)
> >    (actual time=59.05..119929.71 rows=5879 loops=1)
> > -> Nested Loop (cost=0.00..2205.26 rows=3 width=76)
> >    (actual time=51.46..66089.04 rows=5882 loops=1)

This is expected- We actually aggregate data from many county court databases, with varying levels of data "cleanliness".

Regards,
-Nick
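Where the "average is 357 per actor" figure above presumably comes from (my reconstruction, not stated explicitly in the thread): dividing the total number of assignment rows by the estimated number of distinct uppercase names.

```python
# Counts from earlier in this digest and the pg_stats output above.
assignments = 8668650     # rows in actor_case_assignment
distinct_names = 24215    # n_distinct estimate for actor_full_name_uppercase

print(assignments / distinct_names)   # ~358 assignments per distinct name
```

That average is exactly the kind of number the skew makes misleading: the most_common_freqs above show "STATE OF INDIANA" alone at 1.5% of the table, while a typical citizen's name appears only once or twice.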
[PERFORM] Seeking help with a query that takes too long
[I originally posted this using the wrong E-Mail account, so a double posting may occur if the first message gets released by the moderator later- sorry!] Hi- I have a query that I'm trying to speed up. I haven't been able to come up with any workable ideas for speeding it up, so I'm seeking some input from the list. I'm using version 7.3.2 I have three tables: case_data (1,947,386 rows) actor (3,385,669 rows) actor_case_assignment (8,668,650 rows) As the names imply, actor_case_assignment contains records that assign an actor to a case. Actors such as attorneys or judges may have many cases, while the average actor (we hope) only has one. What I'm trying to do is link these tables to get back a single row per actor that shows the actor's name, the number of cases that actor is assigned to, and if they only have one case, I want the public_id for that case. This means I have to do a group by to get the case count, but I'm then forced to use an aggregate function like max on the other fields. All of the fields ending in "_id" have unique indexes, and actor_full_name_uppercase is indexed. An analyze is done every night & the database is fairly stable in it's composition. 
Here's the select:

select actor.actor_id,
       max(actor.actor_full_name),
       max(case_data.case_public_id),
       max(case_data.case_id),
       count(case_data.case_id) as case_count
  from actor, actor_case_assignment, case_data
 where actor.actor_full_name_uppercase like upper('sanders%')
   and actor.actor_id = actor_case_assignment.actor_id
   and case_data.case_id = actor_case_assignment.case_id
 group by actor.actor_id
 order by max(actor.actor_full_name), case_count desc
 limit 1000;

Here's the explain analyze:

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit (cost=2214.71..2214.72 rows=1 width=115) (actual time=120034.61..120035.67 rows=1000 loops=1)
   -> Sort (cost=2214.71..2214.72 rows=1 width=115) (actual time=120034.60..120034.98 rows=1001 loops=1)
        Sort Key: max((actor.actor_full_name)::text), count(case_data.case_id)
        -> Aggregate (cost=2214.67..2214.70 rows=1 width=115) (actual time=119962.80..120011.49 rows=3456 loops=1)
           -> Group (cost=2214.67..2214.68 rows=2 width=115) (actual time=119962.76..119987.04 rows=5879 loops=1)
              -> Sort (cost=2214.67..2214.68 rows=2 width=115) (actual time=119962.74..119965.09 rows=5879 loops=1)
                 Sort Key: actor.actor_id
                 -> Nested Loop (cost=0.00..2214.66 rows=2 width=115) (actual time=59.05..119929.71 rows=5879 loops=1)
                    -> Nested Loop (cost=0.00..2205.26 rows=3 width=76) (actual time=51.46..66089.04 rows=5882 loops=1)
                       -> Index Scan using actor_full_name_uppercase on actor (cost=0.00..6.01 rows=1 width=42) (actual time=37.62..677.44 rows=3501 loops=1)
                          Index Cond: ((actor_full_name_uppercase >= 'SANDERS'::character varying) AND (actor_full_name_uppercase < 'SANDERT'::character varying))
                          Filter: (actor_full_name_uppercase ~~ 'SANDERS%'::text)
                       -> Index Scan using actor_case_assignment_actor_id on actor_case_assignment (cost=0.00..2165.93 rows=2666 width=34) (actual time=16.37..18.67 rows=2 loops=3501)
                          Index Cond: ("outer".actor_id = actor_case_assignment.actor_id)
                    -> Index Scan using case_data_case_id on case_data (cost=0.00..3.66 rows=1 width=39) (actual time=9.14..9.15 rows=1 loops=5882)
                       Index Cond: (case_data.case_id = "outer".case_id)
 Total runtime: 120038.60 msec
(17 rows)

Any ideas? Thanks! -Nick - Nick Fankhauser [EMAIL PROTECTED] Phone 1.765.965.7363 Fax 1.765.962.9788 doxpop - Court records at your fingertips - http://www.doxpop.com/
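For readers reproducing this, the planner toggle referenced at the top of the thread is session-local, so it can be flipped just for the experiment (the long select itself is elided here):

```sql
SET enable_nestloop = false;  -- session-local: makes nested-loop joins look very expensive
EXPLAIN ANALYZE SELECT ...;   -- re-run the query under the alternate plan
SET enable_nestloop = true;   -- restore the default for this session
```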
Re: [pgsql-advocacy] [PERFORM] OFFTOPIC: PostgreSQL vs MySQL
> Have you checked these pages? They've been posted on this list numerous > times: > http://techdocs.postgresql.org > http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html > http://www.varlena.com/varlena/GeneralBits/Tidbits/annotated_conf_e.html > Josh- It would be great to have a link to those last two excellent resources from the techdocs area- perhaps from the "optimizing" section in http://techdocs.postgresql.org/oresources.php. Who should we suggest this to? (I submitted these using the form in that area, but you may have better connections.) -Nick
Re: [PERFORM] How to make n_distinct more accurate.
The performance list seemed to be off-line for a while, so I posed the same question on the admin list and Tom Lane has been helping in that forum. -Nick > -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] Behalf Of Nick > Fankhauser > Sent: Monday, September 22, 2003 3:42 PM > To: [EMAIL PROTECTED] Org > Subject: [PERFORM] How to make n_distinct more accurate. > > > Hi- > > I have a table- called "event" with a field event_date_time that > is indexed. > There are 1,700,000 rows in the table and 92,000 distinct values of > event_date_time with anywhere from 1 to 400 rows sharing the same > value. (I > did a count grouped by event_date_time & scanned it to get this info.) > > When I look at the pg_stats on this table, I always see 15,000 or lower in > the n_distinct column for event_date_time. (I re-ran analyze > several times & > then checked pg_stats to see if the numbers varied significantly.) > > Since this is off by about a factor of 6, I think the planner is > missing the > chance to use this table as the "driver" in a complex query plan that I'm > trying to optimize. > > So the question is- how can I get a better estimate of n_distinct from > analyze? > > If I alter the stats target as high as it will go, I get closer, but it > still shows the index to be about 1/2 as selective as it actually is: > > alpha=# alter table event alter column event_date_time set > statistics 1000; > ALTER TABLE > alpha=# analyze event; > ANALYZE > alpha=# select n_distinct from pg_stats where tablename='event' and > attname='event_date_time'; > n_distinct > > 51741 > (1 row) > > This number seems to be consistently around 51,000 if I re-run > analyze a few > times. > > I guess my question is two-part: > > (1)Is there any tweak to make this estimate work better? > > (2)Since I'm getting numbers that are consistent but way off, is > there a bug > here? > > (2-1/2) Or alternately, am I totally missing what n-distinct is > supposed to > denote? > > Thanks! 
>-Nick > > - > Nick Fankhauser > > [EMAIL PROTECTED] Phone 1.765.965.7363 Fax 1.765.962.9788 > doxpop - Court records at your fingertips - http://www.doxpop.com/
[PERFORM] How to make n_distinct more accurate.
Hi- I have a table called "event" with a field event_date_time that is indexed. There are 1,700,000 rows in the table and 92,000 distinct values of event_date_time with anywhere from 1 to 400 rows sharing the same value. (I did a count grouped by event_date_time & scanned it to get this info.) When I look at the pg_stats on this table, I always see 15,000 or lower in the n_distinct column for event_date_time. (I re-ran analyze several times & then checked pg_stats to see if the numbers varied significantly.) Since this is off by about a factor of 6, I think the planner is missing the chance to use this table as the "driver" in a complex query plan that I'm trying to optimize. So the question is- how can I get a better estimate of n_distinct from analyze? If I alter the stats target as high as it will go, I get closer, but it still shows the index to be about 1/2 as selective as it actually is:

alpha=# alter table event alter column event_date_time set statistics 1000;
ALTER TABLE
alpha=# analyze event;
ANALYZE
alpha=# select n_distinct from pg_stats where tablename='event' and attname='event_date_time';
 n_distinct
------------
      51741
(1 row)

This number seems to be consistently around 51,000 if I re-run analyze a few times. I guess my question is two-part: (1) Is there any tweak to make this estimate work better? (2) Since I'm getting numbers that are consistent but way off, is there a bug here? (2-1/2) Or alternately, am I totally missing what n_distinct is supposed to denote? Thanks! -Nick ----- Nick Fankhauser [EMAIL PROTECTED] Phone 1.765.965.7363 Fax 1.765.962.9788 doxpop - Court records at your fingertips - http://www.doxpop.com/
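One way to pin down the discrepancy exactly (not in the original post; the full scan is slow on 1.7 million rows, but it yields the true figure to hold against the estimate):

```sql
-- Exact distinct count (full table scan):
SELECT count(DISTINCT event_date_time) AS actual_distinct FROM event;

-- Planner's sampled estimate, for comparison:
SELECT n_distinct
  FROM pg_stats
 WHERE tablename = 'event' AND attname = 'event_date_time';
```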
Re: [PERFORM] SELECT's take a long time compared to other DBMS
> Yes I Analyze also, but there was no need to because it was a fresh brand > new database. This apparently wasn't the source of the problem since he did an analyze anyway, but my impression was that a fresh brand new database is exactly the situation where an analyze is needed- ie: a batch of data has just been loaded and stats haven't been collected yet. Am I mistaken? -Nick
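For reference, the usual post-load sequence looks like this (table name and file path are hypothetical):

```sql
COPY orders FROM '/tmp/orders.copy';  -- hypothetical bulk load into a fresh database
ANALYZE orders;                       -- collect stats; until this runs the planner
                                      -- works from default assumptions, not real row counts
```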
Re: [PERFORM] PostgreSQL is slow...HELP
> For small records > it is not much problems. But as the records grew (up to 12,000 > records) the > difference is quite significant. Although there are many tuning options, I'd suggest starting by making sure you have an index (unique in cases where appropriate) on accposd.date, accposd.item, items.name, and accposd.loose. Then do an "analyze;" on the DB to make sure the database takes advantage of the indexes where appropriate. If this doesn't help, there are other options to pursue, but this is where I would start. -Nick
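A sketch of those suggestions, assuming the table and column names mentioned above map directly onto the real schema (adjust as needed):

```sql
CREATE INDEX accposd_date_idx ON accposd (date);
CREATE INDEX accposd_item_idx ON accposd (item);
CREATE INDEX accposd_loose_idx ON accposd (loose);
CREATE INDEX items_name_idx ON items (name);
ANALYZE;  -- refresh statistics so the planner will consider the new indexes
```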
Re: [PERFORM] How to efficiently duplicate a whole schema?
Sebastien- I have a similar nightly process to keep our development system synched with production. I just do a complete pg_dump of production, do a dropdb & createdb to empty the database for development, and then restore the whole db from the pg_dump file. Our database is about 12 GB currently, and it takes less than one hour to dump & restore back into dev if I go through a file. I can go even faster by piping the data to eliminate one set of reads & writes to the disk:

dropdb dev
createdb dev
pg_dump prod | psql dev

This of course only works if you haven't changed your data structure in the development area, but it is very simple and reasonably quick. In situations where the data structure has changed, I run a more complex system that deletes data rather than drop the whole db, but I always drop the indexes in development before restoring data and recreate them afterwards. -Nick

> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf Of Sebastien Lemieux
> Sent: Wednesday, August 06, 2003 1:56 PM
> To: Postgresql-performance
> Subject: [PERFORM] How to efficiently duplicate a whole schema?
>
> Hi,
>
> I'm running on Redhat 7.2 with postgresql 7.3.2 and I have two schemas in the same database, 'db' and 'db_dev'. Both contain a set of >20 tables for a total of less than 50 Mb of data each (on the order of 50k rows in total). Once in a while (often these days!), I need to synchronize the dev version from the production 'db'. Currently, I do this by setting constraints to deferred, deleting everything in db_dev, then issuing a series of insert ... select ... to copy data from each table in db to the equivalent table in db_dev.
>
> This approach used to run in less than 30 seconds in MySQL, but in PostgreSQL it currently takes around 30 minutes. The postmaster process is running at 100% cpu all the time. I enclosed all the delete statements in one transaction and all the insert statements in a second transaction. All the time is taken at the commit of both transactions.
>
> Is there a more straightforward way to synchronize a development database to a production one? Is there any way to increase the performance of this delete/insert combination? I've got indexes and constraints on most tables, could that be the problem? At some point in the future, I will also need to make a copy of a whole schema ('db' into 'db_backup'), what would be an efficient way to do that?
>
> These are the parameters I've adjusted in the postgresql.conf:
>
> max_connections = 16
> shared_buffers = 3000
> max_fsm_relations = 2000
> max_fsm_pages = 2
> sort_mem = 2
> vacuum_mem = 2
> effective_cache_size = 15000
>
> And this is the memory state of the machine:
>
> [EMAIL PROTECTED]> free
>              total       used      free    shared   buffers    cached
> Mem:       2059472    2042224     17248     24768    115712   1286572
> -/+ buffers/cache:     639940   1419532
> Swap:      2096440     490968   1605472
>
> thanks,
>
> --
> Sebastien Lemieux
> Bioinformatics, post-doc
> Elitra-canada
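The drop-indexes-before-reload tactic mentioned above might look roughly like this for one table in the quoted setup (index, table, and column names are hypothetical; in 7.3, TRUNCATE cannot run inside a transaction block, hence the DELETE):

```sql
BEGIN;
DROP INDEX db_dev_case_data_case_id_idx;  -- hypothetical index name; drop before the bulk insert
DELETE FROM db_dev.case_data;             -- empty the dev copy
INSERT INTO db_dev.case_data
     SELECT * FROM db.case_data;          -- repopulate from production schema
COMMIT;
-- Rebuild the index once, after the data is in place (much cheaper than
-- maintaining it row-by-row during the insert):
CREATE INDEX db_dev_case_data_case_id_idx ON db_dev.case_data (case_id);
```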
Re: [PERFORM] Optimization
>Can someone tell me what effective_cache_size should be set to? You may be able to intuit this from my last post, but if I understand correctly, what you should be doing is estimating how much memory is likely to be "left over" for the OS to do disk caching with after all of the basic needs of the OS, PostgreSQL & any other applications are taken care of. You then tell postgresql what to expect in terms of caching resources by putting this number into effective_cache_size, and this allows the query planner to come up with a strategy that is optimized for the expected cache size. So the "formula" would be: Figure out how much memory is normally in use allowing adequate margins, subtract this from your total RAM, and make the remainder your effective_cache_size. -Nick
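The "formula" can be sketched with hypothetical figures; substitute numbers observed on your own system (e.g. from `free`):

```shell
total_ram_mb=1024    # total RAM in the machine
in_use_mb=256        # typical steady-state use: OS + PostgreSQL + other apps
headroom_mb=128      # safety margin

cache_mb=$((total_ram_mb - in_use_mb - headroom_mb))
echo "$cache_mb MB expected to be free for OS disk caching"   # prints: 640 MB ...

# PostgreSQL 7.3 expects effective_cache_size in 8KB pages:
echo $((cache_mb * 1024 / 8))    # prints: 81920
```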
Re: [PERFORM] Optimization
Justin- It sounds like you're on a system similar to ours, so I'll pass along the changes that I made, which seem to have increased performance, and most importantly, haven't hurt anything. The main difference in our environment is that we are less Update/Insert intensive than you are- in our application, 90% of our information (court cases) is static (Closed) and 10% are frequently being updated (Pending/Active). This means I only vacuum once a week. I haven't had a chance to churn out objective tests yet, but my subjective judgment is that this set of params works well:

Set SHMMAX and SHMALL in the kernel to 134217728 (128MB)
Set shared_buffers to 8192 (64MB)
Set sort_mem to 16384 (16MB)
Set effective_cache_size to 65536 (1/2 GB)

The Hardware is a dual-processor Athlon 1.2 Ghz box with 1 GB of RAM and the DB on SCSI RAID drives. The database size is about 8GB, with the largest table 2.5 GB, and the two most commonly queried tables at 1 GB each. The OS is Debian Linux kernel 2.4.x (recompiled custom kernel for dual processor support). The PostgreSQL version is 7.3.2. My reasoning was to increase shared_buffers based on anecdotal recommendations I've seen on this list to 64MB and boost the OS SHMMAX to twice that value to allow adequate room for other shared memory needs, thus reserving 128MB. Of the remaining memory, 256MB goes to 16MB sort space times a guesstimate of 16 simultaneous sorts at any given time. If I leave about 128 MB for headroom, then 1/2 GB should be left available for the effective cache size. I've never been tempted to turn fsync off. That seems like a risky move. Regards, -Nick ----- Nick Fankhauser [EMAIL PROTECTED] Phone 1.765.965.7363 Fax 1.765.962.9788 doxpop - Court records at your fingertips - http://www.doxpop.com/
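Spelled out as concrete config fragments (my rendering of the list above; 7.3 units: shared_buffers and effective_cache_size are 8KB pages, sort_mem is KB). One caveat: on Linux, SHMALL is counted in 4KB pages rather than bytes, so the 128MB figure is 32768 there:

```
# /etc/sysctl.conf (Linux 2.4)
kernel.shmmax = 134217728      # bytes: 128MB
kernel.shmall = 32768          # 4KB pages: 32768 * 4KB = 128MB

# postgresql.conf
shared_buffers = 8192          # 8192 * 8KB = 64MB
sort_mem = 16384               # KB per sort: 16MB
effective_cache_size = 65536   # 65536 * 8KB = 512MB
```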
Re: [PERFORM] Sanity check requested
I'm confused: Ang Chin Han wrote: > We've been using ext3fs for our production systems. (Red Hat Advanced > Server 2.1) Vincent van Leeuwen wrote: > I'd upgrade to a journaling filesystem as soon as possible for > reliability. ...About one year ago I considered moving to a journaling file system, but opted not to because it seems like that's what WAL does for us already. How does putting a journaling file system under it add more reliability? I also guessed that a journaling file system would add overhead because now a write to the WAL file could itself be deferred and logged elsewhere. ...So now I'm really puzzled because folks are weighing in with solid anecdotal evidence saying that I'll get both better reliability and performance. Can someone explain what I'm missing about the concept? -A puzzled Nick
File systems (RE: [PERFORM] Sanity check requested)
Thanks for the suggestions in the FS types- especially the Debian oriented info. I'll start by playing with the memory allocation parameters that I originally listed (seems like they should provide results in a way that is unaffected by the disk IO). Then once I have them at optimal values, move on to trying different file systems. I assume that as I make changes that affect the disk IO performance, I'll then need to do some testing to find new values for the IO cost for the planner- Do you folks have some ballpark numbers to start with for this based on your experience? I'm departing in three ways from the simple IDE model that (I presume) the default random page cost of 4 is based on- The disks are SCSI & RAID and the FS would be different. At this point, I can't think of any better way to test this than simply running my local test suite with various values and recording the wall-clock results. Is there a different approach that might make more sense? (This means that my results will be skewed to my environment, but I'll post them anyway.) I'll post results back to the list as I get to it- It might be a slow process. Since I spend about 18 hours of each day keeping the business running, I'll have to cut back on sleep & do this in the other 10 hours. -NF > Shridhar Daithankar wrote: > I appreciate your approach but it is almost proven that ext2 is > not the best and fastest out there. > > Agreed. > Ang Chin Han wrote: > We've been using ext3fs for our production systems. (Red Hat Advanced > Server 2.1) > > And since your (Nick) system is based on Debian, I have done some rough > testing on Debian sarge (testing) (with custom 2.4.20) with ext3fs, > reiserfs and jfs. Can't get XFS going easily on Debian, though. > > I used a single partition mkfs'd with ext3fs, reiserfs and jfs one after > the other on an IDE disk. Ran pgbench and osdb-x0.15-0 on it. > > jfs has been underperforming for me. Somehow the CPU usage is higher > than the other two. As for ext3fs and reiserfs, I can't detect any > significant difference. So if you're in a hurry, it'll be easier to > convert your ext2 to ext3 (using tune2fs) and use that. Otherwise, it'd > be nice if you could do your own testing, and post it to the list. > > -- > Linux homer 2.4.18-14 #1 Wed Sep 4 13:35:50 EDT 2002 i686 i686 i386 > GNU/Linux > 2:30pm up 204 days, 5:35, 5 users, load average: 5.50, 5.18, 5.13
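Conveniently, random_page_cost can be tried per session before committing it to postgresql.conf, so the wall-clock comparisons above don't need a server restart (the test query is elided):

```sql
SET random_page_cost = 2;    -- below the default of 4; plausible starting point for SCSI RAID
EXPLAIN ANALYZE SELECT ...;  -- compare actual times against a run at the default
RESET random_page_cost;      -- back to the server-wide setting
```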
Re: [PERFORM] Sanity check requested
> Wrong, actually. Sort memory is allocated *per sort*, not per connection or > per query. So a single complex query could easily use 4xsort_mem if it has > several merge joins ... Thanks for the correction- it sounds like this is one where usage can't be precisely controlled in a dynamic user environment & I just need to get a feel for what works under a load that approximates my production system. -Nick
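Given that sort_mem is per sort rather than per connection, one partial lever (my sketch, not from the thread) is a per-session override for connections that are about to run big multi-join reports:

```sql
SET sort_mem = 8192;   -- KB: cap each individual sort in this session at 8MB
-- ... run the report queries ...
RESET sort_mem;        -- return to the postgresql.conf value
```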
Re: [PERFORM] Sanity check requested
Shridhar- I appreciate your thoughts- I'll be running some before & after tests on this using one of our development/hot-swap boxes, so I'll report the results back to the list. A few more thoughts/questions: > 1. 30 users does not seem to be much of an overhead. If possible > try doing away with connection pooling. The application needs to scale up gracefully. We actually have about 200 users that could decide to log on at the same time- 30 is just a typical load. We'd also prefer to have 20,000 subscribers so we can start making a living with this business. > 2. While increasing sort memory, try 4/8/16 in that order. That > way you will get a better picture of load behaviour. Though whatever you put appears > reasonable, having more data always helps. I'll try that approach while testing. Is it the case that the sort memory is allocated for each connection and becomes unavailable to other processes while the connection exists? If so, since I'm using a connection pool, I should be able to control total usage precisely. Without a connection pool, I could start starving the rest of the system for resources if the number of users spiked unexpectedly. Correct? > 3. I don't know how this affects on SCSI drives, but what file > system are you using? Can you try different ones? > 4. OK, this is too much but linux kernel 2.6 is in test and has > vastly improved IO... I'm using ext2. For now, I'll leave this and the OS version alone. If I change too many variables, I won't be able to discern which one is causing a change. Although I understand that there's an element of art to tuning, I'm enough of a neophyte that I don't have a "feeling" for the tuning parameters yet and hence I have to take a scientific approach of just tweaking a few variables in an otherwise controlled and unchanged environment. If I can't reach my goals with the simple approach, I'll consider some of the more radical ideas. Again, thanks for the ideas- I'll feed the results back after I've done some tests. -Nick
Re: [PERFORM] Tuning FreeBSD and PostgreSQL
> I still would like some guidance on tuning FreeBSD (shmmax and > shmmaxpgs). > Do I need to even touch these settings? Stephen- I have no idea what these are set to by default in FreeBSD, but here's the page that covers changing it in the postgresql docs: http://www.postgresql.org/docs/7.3/static/kernel-resources.html -Nick
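As a sketch of what that docs page covers: SHMMAX is specified in bytes, so a 128MB segment works out as below (the FreeBSD sysctl names follow the 7.3 docs; verify them on your release):

```shell
# 128MB expressed in bytes, the unit SHMMAX uses:
echo $((128 * 1024 * 1024))    # prints: 134217728

# On FreeBSD the knobs live under kern.ipc (run as root; semaphore
# limits like SEMMNS may also need raising per the same docs page):
#   sysctl kern.ipc.shmmax=134217728
#   sysctl kern.ipc.shmall=32768     # counted in 4KB pages: 32768 * 4KB = 128MB
```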
[PERFORM] Sanity check requested
Hi folks- For some time, we've been running Postgres with the default configuration & getting adequate performance, but the time has come to tune a bit, so I've been lurking on this list & gathering notes. Now I'm about ready to make a change & would appreciate it if a few more experienced folks could comment on whether I appear to be heading in the right direction- Here's what I'm planning:

Increase SHMMAX and SHMALL in my kernel to 134217728 (128MB)
Increase shared_buffers to 8192 (64MB)
Increase sort_mem to 16384 (16MB)
Increase effective_cache_size to 65536 (1/2 GB)

Here's the environment: The Hardware is a dual-processor Athlon 1.2 Ghz box with 1 GB of RAM and the DB on SCSI RAID drives. The server runs only PostgreSQL. The database size is about 8GB, with the largest table 2.5 GB, and the two most commonly queried tables at 1 GB each. The two most commonly queried tables are usually queried based on a non-unique indexed varchar field typically 20 chars long. The query is a "like" on people's names with trailing %, so this often gets pushed to seq scan or returns several thousand records. (As when someone searches on 'Jones%'.) Records from the largest table are always accessed via unique index in groups of 20 or less. The OS is Debian Linux kernel 2.4.x (recompiled custom kernel for dual processor support). The PostgreSQL version is 7.3.2. We typically have about 30 interactive users on the DB, but they're using a shared connection pool of 16. Our main problem appears to be when one of the users fires up a large query and creates a log-jam with resources. My reasoning is that I'll increase shared_buffers based on anecdotal recommendations I've seen on this list to 64MB. I'll boost the OS SHMMAX to twice that value to allow adequate room for other shared memory needs, thus reserving 128MB. Of the remaining memory, 256MB goes to 16 connections * 16MB sort space; if I leave about 128 MB for headroom, then 1/2 GB should be left available for the effective cache size.
Any thoughts? Is this a sane plan? Are there other parameters I should consider changing first? Thanks! -Nick - Nick Fankhauser [EMAIL PROTECTED] Phone 1.765.965.7363 Fax 1.765.962.9788 doxpop - Court records at your fingertips - http://www.doxpop.com/
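As a quick sanity check on the arithmetic in the plan above (all figures taken from the message, in MB, on the 1GB box):

```shell
total=1024
shm=128                   # SHMMAX reservation: shared_buffers (64MB) plus slack
sorts=$((16 * 16))        # 16 pooled connections * 16MB sort_mem
headroom=128              # OS and everything else

cache=$((total - shm - sorts - headroom))
echo "$cache"               # prints: 512 -- the 1/2 GB in the plan
echo $((cache * 1024 / 8))  # prints: 65536 -- effective_cache_size in 8KB pages
```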