Re: [PERFORM] sum of all values

2005-01-14 Thread Richard Huxton
Madison Kelly wrote:
Richard Huxton wrote:
Madison Kelly wrote:
Hi all,
  Is there a fast(er) way to get the sum of all integer values for a 
certain condition over many thousands of rows? What I am currently 
doing is this (which takes ~5-10sec.):
OK, I'm assuming you've configured PG to your satisfaction and this is 
the only query giving you problems.
  This is a program for general consumption (hopefully... eventually...) 
so I want to leave the psql config alone. Once I am happier with the 
program I will try different tuning options and write a faq though I 
expect 9 out of 10 users won't read it.
PostgreSQL is not FireFox, and you can't expect it to work efficiently 
without doing at least some configuration. The settings to support 100 
simultaneous connections on a dual-Opteron with 8GB RAM are not the same 
as on a single-user laptop.
Take half an hour to read through the performance-tuning guide here:
  http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html

SELECT SUM (a.file_size) FROM file_info_1 a, file_set_1 b WHERE 
a.file_name=b.fs_name AND a.file_parent_dir=b.fs_parent_dir AND 
a.file_type=b.fs_type AND b.fs_backup='t';
You'll want to run EXPLAIN ANALYSE SELECT SUM... and post the output 
of that, although the query looks straightforward enough.
tle-bu=> EXPLAIN ANALYZE SELECT SUM (a.file_size) FROM file_info_1 a, 
file_set_1 b WHERE a.file_name=b.fs_name AND 
a.file_parent_dir=b.fs_parent_dir AND a.file_type=b.fs_type AND 
b.fs_backup='t';

QUERY PLAN

 Aggregate  (cost=2202.54..2202.54 rows=1 width=8) (actual 
time=5078.744..5078.748 rows=1 loops=1)
   ->  Merge Join  (cost=724.94..2202.51 rows=11 width=8) (actual 
time=3281.677..4969.719 rows=12828 loops=1)
 Merge Cond: (("outer".file_parent_dir = "inner".fs_parent_dir) 
AND ("outer".file_name = "inner".fs_name) AND ("outer".file_type = 
"inner".fs_type))
 ->  Index Scan using file_info_1_search_idx on file_info_1 a 
(cost=0.00..1317.11 rows=12828 width=104) (actual time=0.042..116.825 
rows=12828 loops=1)
 ->  Sort  (cost=724.94..740.97 rows=6414 width=96) (actual 
time=3281.516..3350.640 rows=12828 loops=1)
   Sort Key: b.fs_parent_dir, b.fs_name, b.fs_type
   ->  Seq Scan on file_set_1 b  (cost=0.00..319.35 
rows=6414 width=96) (actual time=0.029..129.129 rows=12828 loops=1)
 Filter: (fs_backup = true)
 Total runtime: 5080.729 ms
Well, it's slow, but that's probably your settings. Run VACUUM ANALYSE 
on the tables though, it looks like you've got default statistics (It's 
expecting exactly 1/2 the fs_backup values to be true - 6414 out of 12828).

  Here are the schemas, in case they help:
tle-bu=> \d file_info_1Table "public.file_info_1"
 Column  |  Type   | Modifiers
-+-+
 file_acc_time   | bigint  | not null
 file_group_name | text| not null
 file_group_uid  | integer | not null
 file_mod_time   | bigint  | not null
 file_name   | text| not null
 file_parent_dir | text| not null
 file_perm   | text| not null
 file_size   | bigint  | not null
 file_type   | text| not null default 'f'::text
 file_user_name  | text| not null
 file_user_uid   | integer | not null
Indexes:
"file_info_1_display_idx" btree (file_parent_dir, file_name, 
file_type)
"file_info_1_search_idx" btree (file_parent_dir, file_name, 
file_type)

tle-bu=> \d file_set_1   Table "public.file_set_1"
Column |  Type   | Modifiers
---+-+
 fs_backup | boolean | not null default true
 fs_display| boolean | not null default false
 fs_name   | text| not null
 fs_parent_dir | text| not null
 fs_restore| boolean | not null default false
 fs_type   | text| not null default 'f'::text
Indexes:
"file_set_1_sync_idx" btree (fs_parent_dir, fs_name, fs_type)

1. WHERE ARE YOUR PRIMARY KEYS???
2. Why do you have two identical indexes on file_info_1
3. WHERE ARE YOUR PRIMARY KEYS???
4. Am I right in thinking that always, file_name==fs_name (i.e. they 
represent the same piece of information) and if so, why are you 
storing it twice? Same for _parent_dir too
5. file_type/fs_type are being held as unbounded text? Not an index 
into some lookup table or a varchar(N)?

Can you explain what you're trying to do here - it might be you want 
to alter your database design.
--
  Richard Huxton
  Archonet Ltd
  This is where I have to admit my novice level of knowledge. Until now 
I have been more concerned with "making it work". It is only now that I 
have finished (more or less) the program that I have started going back 
and trying to find ways to speed it up. I have not used postgres (or 
perl or anything) before this program. I hope my questions aren't too 
basic. ^.^;
There's a rule of thumb about throwing the 

Re: [PERFORM] sum of all values

2005-01-14 Thread Madison Kelly
Richard Huxton wrote:
Madison Kelly wrote:
Hi all,
  Is there a fast(er) way to get the sum of all integer values for a 
certain condition over many thousands of rows? What I am currently 
doing is this (which takes ~5-10sec.):

OK, I'm assuming you've configured PG to your satisfaction and this is 
the only query giving you problems.
  This is a program for general consumption (hopefully... 
eventually...) so I want to leave the psql config alone. Once I am 
happier with the program I will try different tuning options and write a 
faq though I expect 9 out of 10 users won't read it.

SELECT SUM (a.file_size) FROM file_info_1 a, file_set_1 b WHERE 
a.file_name=b.fs_name AND a.file_parent_dir=b.fs_parent_dir AND 
a.file_type=b.fs_type AND b.fs_backup='t';

You'll want to run EXPLAIN ANALYSE SELECT SUM... and post the output of 
that, although the query looks straightforward enough.
tle-bu=> EXPLAIN ANALYZE SELECT SUM (a.file_size) FROM file_info_1 a, 
file_set_1 b WHERE a.file_name=b.fs_name AND 
a.file_parent_dir=b.fs_parent_dir AND a.file_type=b.fs_type AND 
b.fs_backup='t';

QUERY PLAN

 Aggregate  (cost=2202.54..2202.54 rows=1 width=8) (actual 
time=5078.744..5078.748 rows=1 loops=1)
   ->  Merge Join  (cost=724.94..2202.51 rows=11 width=8) (actual 
time=3281.677..4969.719 rows=12828 loops=1)
 Merge Cond: (("outer".file_parent_dir = "inner".fs_parent_dir) 
AND ("outer".file_name = "inner".fs_name) AND ("outer".file_type = 
"inner".fs_type))
 ->  Index Scan using file_info_1_search_idx on file_info_1 a 
(cost=0.00..1317.11 rows=12828 width=104) (actual time=0.042..116.825 
rows=12828 loops=1)
 ->  Sort  (cost=724.94..740.97 rows=6414 width=96) (actual 
time=3281.516..3350.640 rows=12828 loops=1)
   Sort Key: b.fs_parent_dir, b.fs_name, b.fs_type
   ->  Seq Scan on file_set_1 b  (cost=0.00..319.35 
rows=6414 width=96) (actual time=0.029..129.129 rows=12828 loops=1)
 Filter: (fs_backup = true)
 Total runtime: 5080.729 ms
(9 rows)

  Here are the schemas, in case they help:
tle-bu=> \d file_info_1Table "public.file_info_1"
 Column  |  Type   | Modifiers
-+-+
 file_acc_time   | bigint  | not null
 file_group_name | text| not null
 file_group_uid  | integer | not null
 file_mod_time   | bigint  | not null
 file_name   | text| not null
 file_parent_dir | text| not null
 file_perm   | text| not null
 file_size   | bigint  | not null
 file_type   | text| not null default 'f'::text
 file_user_name  | text| not null
 file_user_uid   | integer | not null
Indexes:
"file_info_1_display_idx" btree (file_parent_dir, file_name, 
file_type)
"file_info_1_search_idx" btree (file_parent_dir, file_name, 
file_type)

tle-bu=> \d file_set_1   Table "public.file_set_1"
Column |  Type   | Modifiers
---+-+
 fs_backup | boolean | not null default true
 fs_display| boolean | not null default false
 fs_name   | text| not null
 fs_parent_dir | text| not null
 fs_restore| boolean | not null default false
 fs_type   | text| not null default 'f'::text
Indexes:
"file_set_1_sync_idx" btree (fs_parent_dir, fs_name, fs_type)

1. WHERE ARE YOUR PRIMARY KEYS???
2. Why do you have two identical indexes on file_info_1
3. WHERE ARE YOUR PRIMARY KEYS???
4. Am I right in thinking that always, file_name==fs_name (i.e. they 
represent the same piece of information) and if so, why are you storing 
it twice? Same for _parent_dir too
5. file_type/fs_type are being held as unbounded text? Not an index into 
some lookup table or a varchar(N)?

Can you explain what you're trying to do here - it might be you want to 
alter your database design.
--
  Richard Huxton
  Archonet Ltd
  This is where I have to admit my novice level of knowledge. Until now 
I have been more concerned with "making it work". It is only now that I 
have finished (more or less) the program that I have started going back 
and trying to find ways to speed it up. I have not used postgres (or 
perl or anything) before this program. I hope my questions aren't too 
basic. ^.^;

  I keep hearing about Primary Keys but I can't say that I know what 
they are or how they are used. If I do understand, it is a way to 
reference another table's entry (using a foreign key)? The two matching 
indexes is a typo in my program that I hadn't noticed, I'll fix that asap.

  Here is what the database is used for:
  This is a backup program and I use the DB to store extended 
information on all selected files and directories on a partition. Each 
partition has it's own 'file_info_#' and 'file_set_#' tables where '#' 
matches the ID stored for that partition in the DB in another table.

  The 'file_info_#' table stored the data that 

Re: [PERFORM] sum of all values

2005-01-14 Thread Richard Huxton
Madison Kelly wrote:
Hi all,
  Is there a fast(er) way to get the sum of all integer values for a 
certain condition over many thousands of rows? What I am currently doing 
is this (which takes ~5-10sec.):
OK, I'm assuming you've configured PG to your satisfaction and this is 
the only query giving you problems.

SELECT SUM (a.file_size) FROM file_info_1 a, file_set_1 b WHERE 
a.file_name=b.fs_name AND a.file_parent_dir=b.fs_parent_dir AND 
a.file_type=b.fs_type AND b.fs_backup='t';
You'll want to run EXPLAIN ANALYSE SELECT SUM... and post the output of 
that, although the query looks straightforward enough.

  Here are the schemas, in case they help:
tle-bu=> \d file_info_1Table "public.file_info_1"
 Column  |  Type   | Modifiers
-+-+
 file_acc_time   | bigint  | not null
 file_group_name | text| not null
 file_group_uid  | integer | not null
 file_mod_time   | bigint  | not null
 file_name   | text| not null
 file_parent_dir | text| not null
 file_perm   | text| not null
 file_size   | bigint  | not null
 file_type   | text| not null default 'f'::text
 file_user_name  | text| not null
 file_user_uid   | integer | not null
Indexes:
"file_info_1_display_idx" btree (file_parent_dir, file_name, file_type)
"file_info_1_search_idx" btree (file_parent_dir, file_name, file_type)
tle-bu=> \d file_set_1   Table "public.file_set_1"
Column |  Type   | Modifiers
---+-+
 fs_backup | boolean | not null default true
 fs_display| boolean | not null default false
 fs_name   | text| not null
 fs_parent_dir | text| not null
 fs_restore| boolean | not null default false
 fs_type   | text| not null default 'f'::text
Indexes:
"file_set_1_sync_idx" btree (fs_parent_dir, fs_name, fs_type)
1. WHERE ARE YOUR PRIMARY KEYS???
2. Why do you have two identical indexes on file_info_1
3. WHERE ARE YOUR PRIMARY KEYS???
4. Am I right in thinking that always, file_name==fs_name (i.e. they 
represent the same piece of information) and if so, why are you storing 
it twice? Same for _parent_dir too
5. file_type/fs_type are being held as unbounded text? Not an index into 
some lookup table or a varchar(N)?

Can you explain what you're trying to do here - it might be you want to 
alter your database design.
--
  Richard Huxton
  Archonet Ltd

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])