Re: [PERFORM] Hardware purchase question

2005-01-03 Thread Madison Kelly
Mitch Pirtle wrote:
On Mon, 13 Dec 2004 09:23:13 -0800, Joshua D. Drake
[EMAIL PROTECTED] wrote:
RAID 10 will typically always outperform RAID 5 with the same HD config.

Isn't RAID10 just RAID5 mirrored?  How does that speed up performance?
 Or am I missing something?
-- Mitch
Hi Mitch,
  Nope, RAID 10 (one zero) is a stripe of mirrors, no parity. With RAID 
10 you get the benefit of a full mirror, which means your system does 
not need to calculate XOR parity, but you only get 50% disk usage. The 
mirror causes a slight write hit as the data needs to be written to two 
disks (or in this case, to mirrored pairs), but reads can be up to twice 
as fast (theoretically). By adding the stripe you negate the write hit 
and actually gain write performance, because half the data goes to 
mirror A and half to mirror B (same with reads, roughly).

  RAID 10 is a popular choice for software RAID because of the reduced 
overhead. RAID 5, on the other hand, requires that parity be calculated 
across every N-1 disks on each write. With RAID 5 you get N-1 disk usage 
(e.g., the combined capacity of 3 disks in a 4-disk RAID 5 array) and 
still get the benefit of striping across the disks, so long as you have 
a dedicated RAID ASIC that can do the XOR calculations. Without it, 
especially in a degraded state, performance can collapse as the CPU 
performs all that extra math.
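
  (A concrete sketch of that parity math, for reference; this example is 
mine, not from the original post. With four disks holding data blocks D1, 
D2, D3 plus parity P, the controller computes P = D1 xor D2 xor D3 on 
every write. If the disk holding D2 dies, each of its blocks is rebuilt 
as D2 = P xor D1 xor D3. That per-block xor is the work a dedicated RAID 
ASIC offloads from the CPU.)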

hth
Madison


Re: [PERFORM] Hardware purchase question

2005-01-03 Thread Madison Kelly
Madison Kelly wrote:
  Nope, Raid 10 (one zero) is a mirror is stripes, no parity. with r10 
Whoops, that should have read a stripe of mirrors.
By the way, what you are thinking of is possible; it would be RAID 15 
(one five; a RAID 5 built on mirrors) or RAID 51 (five one; a mirror of 
RAID 5 arrays). Always be careful; 10 and 01 are also not the same. You 
want to think carefully about what you want out of your array before 
building it.

Madison


[PERFORM] sum of all values

2005-01-13 Thread Madison Kelly
Hi all,
  Is there a fast(er) way to get the sum of an integer column for rows 
matching a certain condition, over many thousands of rows? What I am 
currently doing is this (which takes ~5-10sec.):

SELECT SUM (a.file_size) FROM file_info_1 a, file_set_1 b WHERE 
a.file_name=b.fs_name AND a.file_parent_dir=b.fs_parent_dir AND 
a.file_type=b.fs_type AND b.fs_backup='t';

  I need to keep parts of the data in two tables. I currently use 
'file_name/fs_name', 'file_parent_dir/fs_parent_dir' and 
'file_type/fs_type' to match the entries in the two tables. The 
'file_info_#' table is frequently dropped and re-created so this was the 
only way I could think to match the data.

  I am hoping that maybe there is something I can do differently that 
will return this value a lot faster (ideally within a second). I know 
that this is heavily dependent on the system underneath, but the program 
is designed for Joe/Jane User so I am trying to do what I can in the 
script and within my DB calls to make this as efficient as possible. I 
realise that my goal may not be viable.

  Here are the schemas, in case they help:
tle-bu= \d file_info_1
 Table public.file_info_1
 Column  |  Type   | Modifiers
-+-+
 file_acc_time   | bigint  | not null
 file_group_name | text| not null
 file_group_uid  | integer | not null
 file_mod_time   | bigint  | not null
 file_name   | text| not null
 file_parent_dir | text| not null
 file_perm   | text| not null
 file_size   | bigint  | not null
 file_type   | text| not null default 'f'::text
 file_user_name  | text| not null
 file_user_uid   | integer | not null
Indexes:
file_info_1_display_idx btree (file_parent_dir, file_name, file_type)
file_info_1_search_idx btree (file_parent_dir, file_name, file_type)
tle-bu= \d file_set_1
   Table public.file_set_1
Column |  Type   | Modifiers
---+-+
 fs_backup | boolean | not null default true
 fs_display| boolean | not null default false
 fs_name   | text| not null
 fs_parent_dir | text| not null
 fs_restore| boolean | not null default false
 fs_type   | text| not null default 'f'::text
Indexes:
file_set_1_sync_idx btree (fs_parent_dir, fs_name, fs_type)
  Thanks all!
Madison


Re: [PERFORM] sum of all values

2005-01-14 Thread Madison Kelly
Richard Huxton wrote:
Madison Kelly wrote:
Hi all,
  Is there a fast(er) way to get the sum of an integer column for rows 
matching a certain condition, over many thousands of rows? What I am 
currently doing is this (which takes ~5-10sec.):

OK, I'm assuming you've configured PG to your satisfaction and this is 
the only query giving you problems.
  This is a program for general consumption (hopefully... 
eventually...) so I want to leave the PostgreSQL config alone. Once I am 
happier with the program I will try different tuning options and write a 
FAQ, though I expect 9 out of 10 users won't read it.

SELECT SUM (a.file_size) FROM file_info_1 a, file_set_1 b WHERE 
a.file_name=b.fs_name AND a.file_parent_dir=b.fs_parent_dir AND 
a.file_type=b.fs_type AND b.fs_backup='t';

You'll want to run EXPLAIN ANALYSE SELECT SUM... and post the output of 
that, although the query looks straightforward enough.
tle-bu= EXPLAIN ANALYZE SELECT SUM (a.file_size) FROM file_info_1 a, 
file_set_1 b WHERE a.file_name=b.fs_name AND 
a.file_parent_dir=b.fs_parent_dir AND a.file_type=b.fs_type AND 
b.fs_backup='t';

QUERY PLAN

 Aggregate  (cost=2202.54..2202.54 rows=1 width=8) (actual time=5078.744..5078.748 rows=1 loops=1)
   ->  Merge Join  (cost=724.94..2202.51 rows=11 width=8) (actual time=3281.677..4969.719 rows=12828 loops=1)
         Merge Cond: (("outer".file_parent_dir = "inner".fs_parent_dir) AND ("outer".file_name = "inner".fs_name) AND ("outer".file_type = "inner".fs_type))
         ->  Index Scan using file_info_1_search_idx on file_info_1 a  (cost=0.00..1317.11 rows=12828 width=104) (actual time=0.042..116.825 rows=12828 loops=1)
         ->  Sort  (cost=724.94..740.97 rows=6414 width=96) (actual time=3281.516..3350.640 rows=12828 loops=1)
               Sort Key: b.fs_parent_dir, b.fs_name, b.fs_type
               ->  Seq Scan on file_set_1 b  (cost=0.00..319.35 rows=6414 width=96) (actual time=0.029..129.129 rows=12828 loops=1)
                     Filter: (fs_backup = true)
 Total runtime: 5080.729 ms
(9 rows)

  Here are the schemas, in case they help:
tle-bu= \d file_info_1
 Table public.file_info_1
 Column  |  Type   | Modifiers
-+-+
 file_acc_time   | bigint  | not null
 file_group_name | text| not null
 file_group_uid  | integer | not null
 file_mod_time   | bigint  | not null
 file_name   | text| not null
 file_parent_dir | text| not null
 file_perm   | text| not null
 file_size   | bigint  | not null
 file_type   | text| not null default 'f'::text
 file_user_name  | text| not null
 file_user_uid   | integer | not null
Indexes:
file_info_1_display_idx btree (file_parent_dir, file_name, 
file_type)
file_info_1_search_idx btree (file_parent_dir, file_name, 
file_type)

tle-bu= \d file_set_1
   Table public.file_set_1
Column |  Type   | Modifiers
---+-+
 fs_backup | boolean | not null default true
 fs_display| boolean | not null default false
 fs_name   | text| not null
 fs_parent_dir | text| not null
 fs_restore| boolean | not null default false
 fs_type   | text| not null default 'f'::text
Indexes:
file_set_1_sync_idx btree (fs_parent_dir, fs_name, fs_type)

1. WHERE ARE YOUR PRIMARY KEYS???
2. Why do you have two identical indexes on file_info_1?
3. WHERE ARE YOUR PRIMARY KEYS???
4. Am I right in thinking that always, file_name==fs_name (i.e. they 
represent the same piece of information) and if so, why are you storing 
it twice? Same for _parent_dir too
5. file_type/fs_type are being held as unbounded text? Not an index into 
some lookup table or a varchar(N)?

Can you explain what you're trying to do here - it might be you want to 
alter your database design.
--
  Richard Huxton
  Archonet Ltd
  This is where I have to admit my novice level of knowledge. Until now 
I have been more concerned with making it work. It is only now that I 
have finished (more or less) the program that I have started going back 
and trying to find ways to speed it up. I have not used postgres (or 
perl or anything) before this program. I hope my questions aren't too 
basic. ^.^;

  I keep hearing about Primary Keys but I can't say that I know what 
they are or how they are used. If I do understand, a primary key is a way 
to uniquely identify a table's entries, which another table can then 
reference (using a foreign key)? The two matching indexes are a typo in 
my program that I hadn't noticed; I'll fix that ASAP.
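
  (For readers of the archive: a minimal sketch of what a primary key and 
foreign key could look like on these two tables. The 'fs_id' column and 
the exact layout are hypothetical, not from the program's real schema:)

CREATE TABLE file_set_1 (
    fs_id         serial PRIMARY KEY,         -- unique identifier for each row
    fs_parent_dir text    NOT NULL,
    fs_name       text    NOT NULL,
    fs_type       text    NOT NULL DEFAULT 'f',
    fs_backup     boolean NOT NULL DEFAULT true,
    UNIQUE (fs_parent_dir, fs_name, fs_type)  -- the natural key used for matching
);

CREATE TABLE file_info_1 (
    fs_id     integer NOT NULL REFERENCES file_set_1 (fs_id),  -- foreign key
    file_size bigint  NOT NULL
    -- ... remaining columns as in the schema above
);

  With something like that in place, the three-way text join becomes a 
single integer equality (a.fs_id = b.fs_id).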

  Here is what the database is used for:
  This is a backup program and I use the DB to store extended 
information on all selected files and directories on a partition. Each 
partition has its own 'file_info_#' and 'file_set_#' tables where '#' 
matches the ID stored for that partition in the DB in another table.

  The 'file_info_#' table stores the data that can change, such as file 
size

[PERFORM] Very long time to execute an Update, suggestions?

2005-03-31 Thread Madison Kelly
Hi all,
  I have a table with a little over 200,000 rows in it that I need 
to update using a regular expression. I admit that, though I am not a 
beginner with postgres, I am also far from an expert. :p

  I tried to create an Index that would optimize the UPDATE but I may 
have made an error in how I built it. Here is the table structure, the 
index I tried to create and an 'EXPLAIN ANALYZE' of the UPDATE (though I 
 am still just now learning how to use 'EXPLAIN').

tle-bu= \d file_info_3
Table public.file_info_3
 Column  | Type |Modifiers
-+--+-
 file_group_name | text | not null
 file_group_uid  | bigint   | not null
 file_mod_time   | bigint   | not null
 file_name   | text | not null
 file_parent_dir | text | not null
 file_perm   | text | not null
 file_size   | bigint   | not null
 file_type   | character varying(2) | not null default 
'f'::character varying
 file_user_name  | text | not null
 file_user_uid   | bigint   | not null
 file_backup | boolean  | not null default true
 file_display| boolean  | not null default false
 file_restore| boolean  | not null default false
Indexes:
file_info_3_display_idx btree (file_type, file_parent_dir, file_name)

  Here is the EXPLAIN:
tle-bu= EXPLAIN ANALYZE UPDATE file_info_3 SET file_backup='f' WHERE 
file_parent_dir~'^/home' OR (file_parent_dir='/' AND file_name='home');
  QUERY PLAN
---
 Seq Scan on file_info_3  (cost=0.00..7770.00 rows=1006 width=206) 
(actual time=1050.813..5648.462 rows=67835 loops=1)
   Filter: ((file_parent_dir ~ '^/home'::text) OR ((file_parent_dir = 
'/'::text) AND (file_name = 'home'::text)))
 Total runtime: 68498.898 ms
(3 rows)

  I thought that it would have used the index because 'file_parent_dir' 
and 'file_name' are in the index, but if I am reading the EXPLAIN 
output right it isn't; it is doing a sequential scan instead. If that is 
the case, how would I best build the index? Should I have used just 
'file_parent_dir' and 'file_name'?

  Thanks all!!
Madison
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-


Re: [PERFORM] Very long time to execute an Update, suggestions?

2005-03-31 Thread Madison Kelly
Philip Hallstrom wrote:
I'm not sure about this, which is why I'm replying off list, but your 
index is on file_type, file_parent_dir, and file_name and your query 
is on file_parent_dir and file_name.

I seem to remember reading that the index will only get used if the 
columns in the where clause match up in order.

That is, um... if you have an index on columns a and b and a where clause 
of b = 1 it won't use the index, since the index looks like

a, b
a, b
a, b
etc...
Does that make any sense?  Not sure if that's right or not, but it's easy 
enough to remove the file_type from your index and try it.

post back to the list if that's it.
-philip
Thanks for the reply!
  I have played around a little more and have created a few different 
test Indexes and it looks like it is the regex that is causing it to do 
the sequential scan. If I remove the regex and create a 
'file_parent_dir', 'file_name' index it will use it. If I create an 
Index just for 'file_parent_dir' and change my UPDATE to just look for 
the regex '... WHERE file_parent_dir~'^/dir'...' it will still do the 
sequential scan anyway.

  So I need to either find an Index that will work with regexes or 
re-write my code to update each subdirectory separately and use a 
simpler UPDATE statement for each.

  Thanks again!
Madison
PS - I cc'ed the list to follow up on what I found out so far. (Hi list!)
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-


Re: [PERFORM] Index not being used

2005-06-12 Thread Madison Kelly

Tobias Brox wrote:

[EMAIL PROTECTED] - Fri at 12:10:19PM -0400]


tle-bu= EXPLAIN ANALYZE SELECT file_type, file_parent_dir, file_name FROM
file_info_7;
   QUERY PLAN
--
Seq Scan on file_info_7  (cost=0.00..11028.35 rows=294035 width=118)
(actual time=0.122..2707.764 rows=294035 loops=1)
Total runtime: 3717.862 ms
(2 rows)




As far as I can see, you are selecting everything from the table without any
sort order.  The only rational thing to do then is a sequential scan; there's
no point in an index scan.



  Thanks for replying, Tobias and Jacques!

  Doh! This is a case of over-simplification, I think. I was trying to 
simplify my query as much as I could and then work it back up to the 
actual query I want. It would seem I don't understand how to use indexes 
quite right. Do you think you might be able to help me with a useful index?


  Here is the 'file_info_7' schema, my query and the 'explain analyze' 
results:


tle-bu= \d file_info_7
  Table public.file_info_7
Column| Type |Modifiers
--+--+-
 file_group_name  | text |
 file_group_uid   | bigint   | not null
 file_mod_time| bigint   | not null
 file_name| text | not null
 file_parent_dir  | text | not null
 file_perm| text | not null
 file_size| bigint   | not null
 file_type| character varying(2) | not null default 
'f'::character varying

 file_user_name   | text |
 file_user_uid| bigint   | not null
 file_backup  | boolean  | not null default true
 file_display | boolean  | not null default false
 file_restore_display | boolean  | not null default false
 file_restore | boolean  | not null default false
Indexes:
file_info_7_display_idx btree (file_type, file_parent_dir, file_name)

  Here is my full query:

tle-bu= EXPLAIN ANALYZE SELECT file_name, file_parent_dir, file_type 
FROM file_info_7 WHERE file_type='d' ORDER BY file_parent_dir ASC, 
file_name ASC;

 QUERY PLAN
-
 Sort  (cost=14541.24..14603.48 rows=24895 width=118) (actual 
time=15751.804..15967.591 rows=25795 loops=1)

   Sort Key: file_parent_dir, file_name
   ->  Seq Scan on file_info_7  (cost=0.00..11763.44 rows=24895 
width=118) (actual time=19.289..3840.845 rows=25795 loops=1)

 Filter: ((file_type)::text = 'd'::text)
 Total runtime: 16043.075 ms
(5 rows)

  This is my index (which I guess is wrong):

tle-bu= \d file_info_7_display_idx
 Index public.file_info_7_display_idx
 Column  | Type
-+--
 file_type   | character varying(2)
 file_parent_dir | text
 file_name   | text
btree, for table public.file_info_7

  Those are the three columns I am using in my restrictions so I 
thought that would create an index this query would use. Do I need to do 
something different because of the 'ORDER BY...'?


  Thanks again for the replies!

Madison

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-



Re: [PERFORM] Index not being used

2005-06-12 Thread Madison Kelly

Tom Lane wrote:

Madison Kelly [EMAIL PROTECTED] writes:


  Here is my full query:



tle-bu= EXPLAIN ANALYZE SELECT file_name, file_parent_dir, file_type 
FROM file_info_7 WHERE file_type='d' ORDER BY file_parent_dir ASC, 
file_name ASC;




  This is my index (which I guess is wrong):




tle-bu= \d file_info_7_display_idx
 Index public.file_info_7_display_idx
 Column  | Type
-+--
 file_type   | character varying(2)
 file_parent_dir | text
 file_name   | text
btree, for table public.file_info_7



The index is fine, but you need to phrase the query as

... ORDER BY file_type, file_parent_dir, file_name;

(Whether you use ASC or not doesn't matter.)  Otherwise the planner
won't make the connection to the sort ordering of the index.

regards, tom lane


Hi Tom and Bruno,

  After sending that email I kept plucking away and in the course of 
doing so decided that I didn't need to return the 'file_type' column. 
Other than that, it would seem my query now matches what you two have 
recommended on the 'ORDER BY...' front, but I still can't get an index 
scan.


  Here is the latest query and the new index:

tle-bu= \d file_info_7_display_idx;
Index public.file_info_7_display_idx
 Column  | Type
-+--
 file_parent_dir | text
 file_name   | text
btree, for table public.file_info_7

tle-bu= EXPLAIN ANALYZE SELECT file_parent_dir, file_name, file_display 
FROM file_info_7 WHERE file_type='d' ORDER BY file_parent_dir ASC, 
file_name ASC;

 QUERY PLAN

 Sort  (cost=14509.53..14571.76 rows=24895 width=114) (actual 
time=19995.250..20123.874 rows=25795 loops=1)

   Sort Key: file_parent_dir, file_name
   ->  Seq Scan on file_info_7  (cost=0.00..11762.44 rows=24895 
width=114) (actual time=0.123..3228.446 rows=25795 loops=1)

 Filter: ((file_type)::text = 'd'::text)
 Total runtime: 20213.443 ms

  The 'Sort' is taking 20 seconds on my Pentium III 1GHz (not great, 
but...). If I follow you right, my index is 'file_parent_dir' first and 
'file_name' second (does order matter?). So I figured the query:


SELECT file_parent_dir, file_name, file_display
FROM file_info_7
WHERE file_type='d'
ORDER BY file_parent_dir ASC, file_name ASC;

  Would hit the index for the sort. Is there any way other than 
'EXPLAIN ANALYZE...' to get a better understanding of what is happening 
in there? For what it's worth, there are a little under 300,000 entries 
in this table of which, as you can see above, 25,795 are being returned.


  Yet again, thank you both!! I'm off to keep trying to figure this out...

Madison

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-



Re: [PERFORM] Index not being used

2005-06-12 Thread Madison Kelly

Bruno Wolff III wrote:

On Sun, Jun 12, 2005 at 18:52:05 -0400,
  Madison Kelly [EMAIL PROTECTED] wrote:

 After sending that email I kept plucking away and in the course of 
doing so decided that I didn't need to return the 'file_type' column. 
Other than that, it would see my query now matches what you two have 
recommended in the 'ORDER BY...' front but I still can't get an index 
search.



No it doesn't. Even if you don't return file_type you still need it
in the order by clause if you want postgres to consider using your
index.

Is there some reason you didn't actually try out our suggestion, but are
now asking for more advice?


No good excuse.

I'll recreate the index and test out your suggestion...

tle-bu= EXPLAIN ANALYZE SELECT file_name, file_parent_dir, file_type 
FROM file_info_7 WHERE file_type='d' ORDER BY file_type ASC, 
file_parent_dir ASC, file_name ASC;

 QUERY PLAN

 Sort  (cost=14789.92..14857.06 rows=26856 width=117) (actual 
time=16865.473..16989.104 rows=25795 loops=1)

   Sort Key: file_type, file_parent_dir, file_name
   ->  Seq Scan on file_info_7  (cost=0.00..11762.44 rows=26856 
width=117) (actual time=0.178..1920.413 rows=25795 loops=1)

 Filter: ((file_type)::text = 'd'::text)
 Total runtime: 17102.925 ms
(5 rows)

tle-bu= \d file_info_7_display_idx
 Index public.file_info_7_display_idx
 Column  | Type
-+--
 file_type   | character varying(2)
 file_parent_dir | text
 file_name   | text
btree, for table public.file_info_7

  I'm still getting the sequential scan.

Madison


--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-



Re: [PERFORM] Index not being used

2005-06-12 Thread Madison Kelly

Bruno Wolff III wrote:

On Sun, Jun 12, 2005 at 23:42:05 -0400,
  Madison Kelly [EMAIL PROTECTED] wrote:

As you probably saw in my last reply, I went back to the old index and 
tried the query you and Tom Lane recommended. Should this not have 
used the index?



Probably, but there might be some other reason the planner thought it
was better to not use it. Using indexes is not always faster.

It would help to see your latest definition of the table and indexes,
the exact query you used and explain analyze output.



Okay, here's what I have at the moment:

tle-bu= \d file_info_7
   Table public.file_info_7

Column| Type |Modifiers
--+--+-
 file_group_name  | text |
 file_group_uid   | bigint   | not null
 file_mod_time| bigint   | not null
 file_name| text | not null
 file_parent_dir  | text | not null
 file_perm| text | not null
 file_size| bigint   | not null
 file_type| character varying(2) | not null default 
'f'::character varying

 file_user_name   | text |
 file_user_uid| bigint   | not null
 file_backup  | boolean  | not null default true
 file_display | boolean  | not null default false
 file_restore_display | boolean  | not null default false
 file_restore | boolean  | not null default false
Indexes:
file_info_7_display_idx btree (file_parent_dir, file_name)


tle-bu= \d file_info_7_display_idx
Index public.file_info_7_display_idx
 Column  | Type
-+--
 file_parent_dir | text
 file_name   | text
btree, for table public.file_info_7


tle-bu= EXPLAIN ANALYZE SELECT file_name, file_parent_dir, file_display 
FROM file_info_7 WHERE file_type='d' ORDER BY file_parent_dir ASC, 
file_name ASC;

 QUERY PLAN

 Sort  (cost=15091.53..15165.29 rows=29502 width=114) (actual 
time=12834.933..12955.136 rows=25795 loops=1)

   Sort Key: file_parent_dir, file_name
   ->  Seq Scan on file_info_7  (cost=0.00..11762.44 rows=29502 
width=114) (actual time=0.244..2533.388 rows=25795 loops=1)

 Filter: ((file_type)::text = 'd'::text)
 Total runtime: 13042.421 ms
(5 rows)


  Since my last post I went back to a query closer to what I actually 
want. What is most important to me is that 'file_parent_dir, file_name, 
file_display' are returned, that the results are sorted by 
'file_parent_dir, file_name' and that the results are restricted to 
where 'file_info='d''.


  Basically what I am trying to do is display a directory tree in a 
file browser. I had this working before, but it was far, far too slow 
once the number of directories to display got much higher than 1,000. 
That is what 'file_display' is for, by the way.


  Again, thank you!

Madison

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-



Re: [PERFORM] Index not being used

2005-06-13 Thread Madison Kelly

Bruno Wolff III wrote:

I am guessing you mean 'file_type' instead of 'file_info'.

To do this efficiently you want an index on (file_type, file_parent_dir,
file_name). Currently you only have an index on (file_parent_dir, file_name)
which won't help for this query. You also need to order by file_type
even though it will be constant for all of the returned rows in order
to help out the planner. This will allow an index scan over the desired
rows that returns them in the desired order.

Please actually try this before changing anything else.
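
(Spelled out, Bruno's suggestion is the index definition this thread 
started out with; shown here as a sketch for anyone following along:)

CREATE INDEX file_info_7_display_idx
    ON file_info_7 (file_type, file_parent_dir, file_name);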


  If I follow, then I tried it but still got the sequential scan. Here's 
the index and query (copied from the 'psql' shell):



tle-bu= \d file_info_7_display_idx
 Index public.file_info_7_display_idx
 Column  | Type
-+--
 file_type   | character varying(2)
 file_parent_dir | text
 file_name   | text
btree, for table public.file_info_7

tle-bu= EXPLAIN ANALYZE SELECT file_name, file_parent_dir, file_display 
FROM file_info_7 WHERE file_type='d' ORDER BY file_type ASC, 
file_parent_dir ASC, file_name ASC;

 QUERY PLAN

 Sort  (cost=14810.92..14874.65 rows=25490 width=119) (actual 
time=15523.767..15731.136 rows=25795 loops=1)

   Sort Key: file_type, file_parent_dir, file_name
   ->  Seq Scan on file_info_7  (cost=0.00..11956.84 rows=25490 
width=119) (actual time=0.132..2164.757 rows=25795 loops=1)

 Filter: ((file_type)::text = 'd'::text)
 Total runtime: 15884.188 ms
(5 rows)


  If I follow, all three 'ORDER BY...' items match the three columns in 
the index.


  Again, thanks!

Madison

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-



Re: [PERFORM] Index not being used

2005-06-13 Thread Madison Kelly

Kevin Grittner wrote:

tle-bu= \d file_info_7_display_idx
 Index public.file_info_7_display_idx
 Column  | Type
-+--
 file_type   | character varying(2)
 file_parent_dir | text
 file_name   | text
btree, for table public.file_info_7



The index is fine, but you need to phrase the query as

... ORDER BY file_type, file_parent_dir, file_name;

(Whether you use ASC or not doesn't matter.)  Otherwise the planner
won't make the connection to the sort ordering of the index.

regards, tom lane


With Bruno's help I've gone back and tried just this with no luck. I've 
re-written the query to include all three items in the 'ORDER BY...' 
clause in the same order, but the sort still takes a long time and a 
sequential scan is being done instead of using the index.


For what it's worth, and being somewhat of a n00b, I agree with the idea 
of a smarter, more flexible planner. I guess the trade-off is the added 
overhead needed versus the size of the average query.


Madison

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-



Re: [PERFORM] Index not being used

2005-06-13 Thread Madison Kelly

Tom Lane wrote:

Madison Kelly [EMAIL PROTECTED] writes:


Bruno Wolff III wrote:


Please actually try this before changing anything else.




  If I follow then I tried it but still got the sequential scan.



Given the fairly large number of rows being selected, it seems likely
that the planner thinks this is faster than an indexscan.  It could
be right, too.  Have you tried set enable_seqscan = off to see if
the index is used then?  If so, is it faster or slower?  Comparing
EXPLAIN ANALYZE results with enable_seqscan on and off would be useful.


Wow!

With the sequential scan off my query took less than 2sec. When I turned 
it back on the time jumped back up to just under 14sec.



tle-bu= set enable_seqscan = off; SET
tle-bu= EXPLAIN ANALYZE SELECT file_name, file_parent_dir, file_display 
FROM file_info_7 WHERE file_type='d' ORDER BY file_type ASC, 
file_parent_dir ASC, file_name ASC;


QUERY PLAN

 Index Scan using file_info_7_display_idx on file_info_7 
(cost=0.00..83171.78 rows=25490 width=119) (actual 
time=141.405..1700.459 rows=25795 loops=1)

   Index Cond: ((file_type)::text = 'd'::text)
 Total runtime: 1851.366 ms
(3 rows)


tle-bu= set enable_seqscan = on; SET
tle-bu= EXPLAIN ANALYZE SELECT file_name, file_parent_dir, file_display 
FROM file_info_7 WHERE file_type='d' ORDER BY file_type ASC, 
file_parent_dir ASC, file_name ASC;

 QUERY PLAN

 Sort  (cost=14810.92..14874.65 rows=25490 width=119) (actual 
time=13605.185..13728.436 rows=25795 loops=1)

   Sort Key: file_type, file_parent_dir, file_name
   ->  Seq Scan on file_info_7  (cost=0.00..11956.84 rows=25490 
width=119) (actual time=0.048..2018.996 rows=25795 loops=1)

 Filter: ((file_type)::text = 'd'::text)
 Total runtime: 13865.830 ms
(5 rows)

  So the index obviously provides a major performance boost! I just need 
to figure out how to tell the planner to use it...


Madison

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-




Re: [PERFORM] System Requirement

2005-06-13 Thread Madison Kelly

Saranya Sivakumar wrote:

Hi All,

We are looking to upgrade to 8.0 from 7.3.2 on production server. The 
current production system we are using is




2 x 2.4 Ghz Intel Xeon CPU with HT(4 virtual CPUs)

RAM - 1GB

HDD - 34GB SCSI

-

Production DB size: 10.89 GB

Number of tables: 253

We are planning to get a new server/system and upgrade to 8.0 on it. 
What is the recommended system requirement for Postgres 8.0?


Please give me your inputs on this.

Thanks

Saranya


Hi,

  Let me be the first to recommend RAM. From what little I know so far, 
I think it is still important to know more about what your database 
looks like and how it is used/accessed. Can you post some more 
information on the details of your database? Is it a few users with 
large datasets (like a research project) or many users with small data 
sets (like a website)?


http://www.postgresql.org/docs/8.0/interactive/kernel-resources.html

See if that helps a bit. My first suggestion would be to simply increase 
your RAM to at least 2GB. Anything more would be beneficial up to the 
point of being able to load your entire DB into RAM (16GB RAM should 
allow for that plus other OS overhead).


Well, I'm relatively new so I defer to others, but this is my suggestion.

Best of luck!

Madison
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-



Pseudo-Solved was: (Re: [PERFORM] Index not being used)

2005-06-13 Thread Madison Kelly

Bruno Wolff III wrote:

On Mon, Jun 13, 2005 at 15:05:00 -0400,
  Madison Kelly [EMAIL PROTECTED] wrote:


Wow!

With the sequential scan off my query took less than 2sec. When I turned 
it back on the time jumped back up to just under 14sec.



tle-bu= set enable_seqscan = off; SET
tle-bu= EXPLAIN ANALYZE SELECT file_name, file_parent_dir, file_display 
FROM file_info_7 WHERE file_type='d' ORDER BY file_type ASC, 
file_parent_dir ASC, file_name ASC;


QUERY PLAN

Index Scan using file_info_7_display_idx on file_info_7 
(cost=0.00..83171.78 rows=25490 width=119) (actual 
time=141.405..1700.459 rows=25795 loops=1)

  Index Cond: ((file_type)::text = 'd'::text)
Total runtime: 1851.366 ms
(3 rows)


tle-bu= set enable_seqscan = on; SET
tle-bu= EXPLAIN ANALYZE SELECT file_name, file_parent_dir, file_display 
FROM file_info_7 WHERE file_type='d' ORDER BY file_type ASC, 
file_parent_dir ASC, file_name ASC;

QUERY PLAN

Sort  (cost=14810.92..14874.65 rows=25490 width=119) (actual 
time=13605.185..13728.436 rows=25795 loops=1)

  Sort Key: file_type, file_parent_dir, file_name
  ->  Seq Scan on file_info_7  (cost=0.00..11956.84 rows=25490 
width=119) (actual time=0.048..2018.996 rows=25795 loops=1)

Filter: ((file_type)::text = 'd'::text)
Total runtime: 13865.830 ms
(5 rows)

 So the index obviously provides a major performance boost! I just need 
to figure out how to tell the planner to use it...



The two things you probably want to look at are (in postgresql.conf):
effective_cache_size = 1000   # typically 8KB each
random_page_cost = 2   # units are one sequential page fetch cost

Increasing effective cache size and decreasing the penalty for random
disk fetches will favor using index scans. People have reported that
dropping random_page_cost from the default of 4 to 2 works well.
Effective cache size should be set to some reasonable estimate of
the memory available on your system to postgres, not counting that
set aside for shared buffers.

However, since the planner thought the index scan plan was going to be 6 times
slower than the sequential scan plan, I don't know whether tweaking these values
enough to switch the plan choice might cause problems for other queries.


Hmm,

  In this case I am trying to avoid modifying 'postgresql.conf' and am 
trying to handle any performance tweaks within my program through SQL 
calls. This is because (I hope) my program will be installed by many 
users and I don't want to expect them to be able/comfortable playing 
with 'postgresql.conf'. I do plan later, though, to create a section in 
the docs with extra tweaks for more advanced users, and in that case I 
will come back to this and try/record just that.
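
  (For the archives: both of the settings Bruno mentions can also be set 
per connection with plain SQL, which would fit the keep-it-in-the-program 
approach. A minimal sketch; the values are illustrative, assuming 8KB 
pages and roughly 512MB of OS cache, not tuned numbers:)

SET random_page_cost = 2;          -- lower the assumed penalty for random (index) page fetches
SET effective_cache_size = 65536;  -- 65536 x 8KB pages = ~512MB of assumed cache

  These last for the life of the connection, so a program can issue them 
right after connecting.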


  In the meantime, Tom's recommendation works from perl by calling:

$DB->do("SET ENABLE_SEQSCAN TO OFF") || die...
query...
$DB->do("SET ENABLE_SEQSCAN TO ON") || die...

  This forces the index to be used. It isn't clean, but it works for now 
and I don't need to do anything outside my program.


  Lacking any other ideas, thank you very, very much for sticking with 
this and helping me out!


Madison

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-



Re: [PERFORM] Index not being used

2005-06-13 Thread Madison Kelly

Tom Lane wrote:

Madison Kelly [EMAIL PROTECTED] writes:

  So the index obviously provides a major performance boost! I just need 
to figure out how to tell the planner to use it...



Simple division shows that the planner's cost estimate ratio between the
seqscan and the indexscan (11956.84 vs 83171.78) is off by a factor of
more than 8 compared to reality (2018.996 vs 1700.459).  Also the cost of
the sort seems to be drastically underestimated.

I suspect this may be a combination of random_page_cost being too high
(since your test case, at least, is no doubt fully cached in RAM) and
cpu_operator_cost being too low.  I'm wondering if text comparisons
are really slow on your machine --- possibly due to strcoll being
inefficient in the locale you are using, which you didn't say.  That
would account for both the seqscan being slower than expected and the
sort taking a long time.

It'd be interesting to look at the actual runtimes of this seqscan vs
one that is doing a simple integer comparison over the same number of
rows (and, preferably, returning about the same number of rows as this).

regards, tom lane


  This is where I should mention that though 'n00b' might be a little 
harsh, I am still somewhat of a beginner (only been using postgres or 
programming at all for a little over a year).


  What is, and how do I check, 'strcoll'? Is there a way that I can 
clear the PostgreSQL cache to make the tests more accurate to real-world 
situations? For what it's worth, the program is working (I am doing 
stress-testing and optimizing now) and the data in this table is actual 
data, not a construct.


  As I mentioned to Bruno in my reply to him, I am trying to keep as 
many tweaks as I can inside my program. The reason for this is that this 
is a backup program that I am trying to aim at more mainstream users, or 
at setups where a techy would set it up and then it would be used by 
mainstream users. At this point I want to avoid, as best I can, any 
changes from default to the 'postgresql.conf' file or other external 
files. Later though, once I finish this testing phase, I plan to write a 
section on external tweaking where I will test these changes out and 
note my success for more advanced users who feel more comfortable 
playing with postgres (and web server, rsync, etc) configs.


  If there is any way that I can make changes like this from 
inside my (perl) program I would prefer that. For example, I implemented 
the 'enable_seqscan' toggle via:


$DB->do("SET ENABLE_SEQSCAN TO OFF") || die...
...
$DB->do("SET ENABLE_SEQSCAN TO ON") || die...

  Thank you very kindly! You and Bruno are wonderfully helpful! (as are 
the other's who have replied ^_^;)


Madison

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-



Re: [PERFORM] Index not being used

2005-06-15 Thread Madison Kelly

Karim Nassar wrote:

Your goal is admirable. However, many people tweak their postgresql.conf
files, and your program can't know whether or not this has happened. It
might be a good idea to have a var $do_db_optimization, which defaults
to on. Then, if your users have trouble or are advanced admins they can
turn it off. My personal opinion is that there are too many
architectures and configurations for you to accurately optimize inside
your program, and this gives you and your users an easy out.

if ($do_db_optimization == 1) {
  $DB->do("SET ENABLE_SEQSCAN TO OFF") || die...
} else {
  # do nothing -- postgresql will figure it out
}


That is a wonderful idea and I already have the foundation in place to 
easily implement this. Thanks!!


Madison

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-



[PERFORM] B-Tree index not being used

2005-07-02 Thread Madison Kelly

Hi all,

  I have gone back to my index problem from a while ago where I am 
trying to do an update with a regex in the WHERE clause. If I specify a 
constant the index is used, so that much I know is working.


  I've been reading the 7.4 docs and I saw that a B-tree index *should* 
be used when the regex is anchored to the start. This is from section 
11.2 of the docs; it says "The optimizer can also use a B-tree index for 
queries involving the pattern matching operators LIKE, ILIKE, ~, and ~* 
... if the pattern is anchored to the beginning of the string." In my 
case that is what I will always do.
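
  (As a sketch of why the anchor matters: with a compatible locale, the 
optimizer rewrites an anchored pattern into an ordinary index range scan, 
roughly like the following; illustrative only:)

-- WHERE file_parent_dir ~ '^/foo/bar' becomes, in effect:
SELECT * FROM file_info_11
 WHERE file_parent_dir >= '/foo/bar'
   AND file_parent_dir <  '/foo/bas';  -- upper bound: last character bumped by one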


  Specifically, this is a backup program I am using the DB for. The 
table I am working on stores all the file and directory information for 
a given partition. When the user toggles the checkbox for a given 
directory (to indicate whether or not they want that directory backed 
up) I make a call to the DB telling it to change that column to the 
given state.


  When the user toggles a directory I want to propagate that change to 
all subdirectories and all files within those directories. The way I do 
this is:


UPDATE file_info_11 SET file_backup='t' WHERE file_parent_dir~'^/foo/bar';

  Which is basically just to say: change every directory and file with 
this parent directory, and all subdirectories, to the new backup state. 
From what I gather this query should have used the index. Here is what 
I am actually getting though:


tle-bu= EXPLAIN ANALYZE UPDATE file_info_11 SET file_backup='t' WHERE 
file_parent_dir~'^/';

  QUERY PLAN
--
 Seq Scan on file_info_11  (cost=0.00..13484.23 rows=1 width=183) 
(actual time=13.560..22040.603 rows=336039 loops=1)

   Filter: (file_parent_dir ~ '^/'::text)
 Total runtime: 514099.565 ms
(3 rows)

  Now if I define a static directory the index IS used:

tle-bu= EXPLAIN ANALYZE UPDATE file_info_11 SET file_backup='t' WHERE 
file_parent_dir='/';

 QUERY PLAN
-
 Index Scan using file_info_11_update_idx on file_info_11 
(cost=0.00..109.69 rows=66 width=183) (actual time=22.828..62.020 rows=3 
loops=1)

   Index Cond: (file_parent_dir = '/'::text)
 Total runtime: 88.334 ms
(3 rows)

  Here is the table and index schemas:

tle-bu= \d file_info_11; \d file_info_11_update_idx;
  Table public.file_info_11
Column| Type |Modifiers
--+--+-
 file_group_name  | text |
 file_group_uid   | bigint   | not null
 file_mod_time| bigint   | not null
 file_name| text | not null
 file_parent_dir  | text | not null
 file_perm| text | not null
 file_size| bigint   | not null
 file_type| character varying(2) | not null default 
'f'::character varying

 file_user_name   | text |
 file_user_uid| bigint   | not null
 file_backup  | boolean  | not null default true
 file_display | boolean  | not null default false
 file_restore_display | boolean  | not null default false
 file_restore | boolean  | not null default false
Indexes:
file_info_11_display_idx btree (file_type, file_parent_dir, 
file_name)

file_info_11_update_idx btree (file_parent_dir)

Index public.file_info_11_update_idx
 Column  | Type
-+--
 file_parent_dir | text
btree, for table public.file_info_11


  Can anyone see why the index might not be being used?

  I know that 'tsearch2' would probably work but it seems like way more 
than I need (because I will never be searching the middle of a string).


Thanks for any advice/help/pointers!

Madison

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-




Re: [PERFORM] B-Tree index not being used

2005-07-02 Thread Madison Kelly

Tom Lane wrote:

Madison Kelly [EMAIL PROTECTED] writes:


  Can anyone see why the index might not be being used?



You didn't initdb in 'C' locale.  You can either re-initdb,
or create a specialized index with a non-default operator class
to support LIKE.  See the documentation.

regards, tom lane


I'll look into the non-default op class. I want to keep anything that 
tweaks the DB in my code so that a user doesn't need to modify anything 
on their system.
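
(For the archives, a minimal sketch of such an index; the index name here 
is made up, but this is the same text_pattern_ops approach that appears 
in a later thread in this digest:)

CREATE INDEX file_info_11_pattern_idx
    ON file_info_11 (file_parent_dir text_pattern_ops);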


Thanks!

Madison

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-



[PERFORM] Need a high-level suggestion on how to solve a performance problem

2005-07-07 Thread Madison Kelly

Hi all,

  I hope I am not asking too many questions. :)

  I have been trying to solve a performance problem in my program for a 
while now and, after getting an index to work which didn't speed things 
up enough, I am stumped. I am hoping someone here might have come across 
a similar issue and come up with a creative solution they wouldn't mind 
sharing.


  I am not looking for details, I expect to do my homework, I just need 
a pointer, suggestion or trick.


  The problem I have is that I am using pgSQL as a back end for my 
web-based *nix backup program. Part of the database stores info on every 
file and directory per partition. I use this information to build my 
directory tree. I have a test partition with ~325,000 files of which 
~30,000 are directories. I have been able to get the performance up to a 
reasonable level for displaying the directory tree including expanding 
and contracting branches (~3-5sec). I do this by loading all the 
directory info into an array and a hash once and using them as needed 
instead of hitting the DB.


  The problem comes when the user toggles a directory branch's backup 
flag (a simple check box beside the directory name). If it's a directory 
near the end of a branch it is fast enough. If they toggle a single file 
it is nearly instant. However if they toggle say the root directory, so 
every file and directory below it needs to be updated, it can take 
500-600sec to return. Obviously this is no good.


  What I need is a scheme for being able to say, essentially:

UPDATE file_info_1 SET file_backup='t' WHERE file_parent_dir~'^/';

  Faster. An index isn't enough because it needs to hit every entry anyway.

  I use perl to access the DB and generate the web pages. The file 
browser portion looks and acts like most file browsers (directory tree 
in the left frame with expanding and contracting directory branches and 
a list of files in a given directory on the right). It does not use any 
plug-ins like Java and that is important to me that it stays that way (I 
want it to be as simple as possible for the user to install).


  So far the only suggestion I've received is to keep a secondary 
'delta' table to store just the requests. Then on load, get the existing 
data and check it against the delta table before creating the page. The 
biggest drawback for me with this is that currently I don't need to 
provide an 'Apply' button because a simple javascript call passes the 
request onto the perl script immediately. I really like the Mac-esque 
approach of keeping the UI as simple and straightforward as possible. 
So, a suggestion that doesn't require something like an 'Apply' button 
would be much appreciated.


  Thanks for any suggestions in advance!

Madison

PS - For what it's worth, this is the last hurdle for me to overcome 
before I can finally release my program as 'beta' after over 15 months 
of work! :)




Re: [PERFORM] Need a high-level suggestion on how to solve

2005-07-07 Thread Madison Kelly

PFC wrote:


Hello,
I once upon a time worked at a company doing backup software and I 
remember these problems; we had exactly the same!


  Pretty neat. :)

The file tree was all in memory and every time the user clicked on 
something it had to update everything. Being C++ it was very fast, but 
to back up a million files you needed a gig of RAM, which is... a 
problem, let's say, when you consider my linux laptop has about 400k 
files on it.


  I want this to run on average systems (I'm developing it primarily 
on my modest P3 1GHz Thinkpad w/ 512MB RAM running Debian) so expecting 
that much free memory is not reasonable. As it is my test DB, with a 
realistic amount of data, is ~150MB.


So we rewrote the project entirely with the purpose of doing the 
million-files thingy on a clunky Pentium 90 with 64 megabytes of RAM, 
and it worked.

What I did was this :
- use Berkeley DB

snip
- the price of the licence to be able to embed it in your product 
and  sell it is expensive, and if you want crash-proof, it's insanely 
expensive.


  This is the kicker right there; my program is released under the GPL 
so it's fee-free. I can't eat anything costly like that. As it is there 
are hundreds and hundreds of hours in this program that I am already 
hoping to recoup one day through support contracts. Adding commercial 
software I am afraid is not an option.


bonus: if you check a directory as include and one of its 
subdirectories as exclude, and the user adds files all over the place, 
the files added in the included directory will be automatically 
backed up and the ones in the 'ignored' directory will be automatically 
ignored; you have nothing to change.

snip

IMHO it's the only solution.


  Now *this* is an idea worth looking into. How I will implement it 
with my system I don't know yet but it's a new line of thinking. Wonderful!


Now you'll ask me: but how do I calculate the total size of the 
backup without looking at all the files? When I click on a directory I 
don't know what files are in it, and which will inherit and which will not.


It's simple : you precompute it when you scan the disk for changed 
files.  This is the only time you should do a complete tree exploration.


  This is already what I do. When a user selects a partition they want 
to select files to back up or restore, the partition is scanned. The 
scan looks at every file, directory and symlink and records its size (on 
disk), its mtime, owner, group, etc., and records it all to the 
database. I've got this scan/update running at ~1,500 files/second on my 
laptop. That was actually the first performance tuning I started with. :)


  With all the data in the DB the backup script can calculate rather 
intelligently where it wants to copy each directory to.


On each directory we put a matrix [M]x[N], M and N each being one of the 
three states above, containing the amount of stuff in the directory 
which would be in state M if the directory were in state N. This is very 
easy to compute when you scan for new files. Then when a directory 
changes state, you have to sum a few cells of that matrix to know how 
much more that adds to the backup. And you only look up 1 record.
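
  (One way PFC's matrix could be kept in the database; a sketch with 
made-up table and column names:)

CREATE TABLE dir_state_matrix (
    dir_id  integer NOT NULL,  -- the directory this row describes
    state_m char(1) NOT NULL,  -- resulting state: 't', 'f' or 'i'
    state_n char(1) NOT NULL,  -- state the directory would be switched to
    bytes   bigint  NOT NULL,  -- amount of data that would end up in state M
    PRIMARY KEY (dir_id, state_m, state_n)
);

  Toggling a directory would then cost one lookup plus a small sum, 
instead of touching every row underneath it.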


  In my case what I do is calculate the size of all the files selected 
for backup in each directory, sort the directories from all sources by 
the total size of all their selected files, and then start assigning the 
directories, largest to smallest, to each of my available destination 
media. If it runs out of destination space it backs up what it can, 
waits a user-definable amount of time, and then checks to see if any 
new destination media has been made available. If so, it again tries to 
assign the files/directories that didn't fit. It will loop a 
user-definable number of times before giving up and warning the user 
that more destination space is needed for that backup job.



Is that helpful ?


  The three states (inherited, backup, ignore) have definitely caught my 
attention. Thank you very much for your idea and lengthy reply!


Madison



[PERFORM] Another index question

2005-07-22 Thread Madison Kelly

Hi all,

  I am trying to do an update on a table but so far I can't seem to 
come up with a usable index. After my last question/thread the user 
'PFC' recommended I store whether a file was to be backed up as either 
't'(rue), 'f'(alse) or 'i'(nherit) to speed up changing files and sub 
directories under a given directory when it was toggled. I've more or 
less finished implementing this and it is certainly a LOT faster but I 
am hoping to make it just a little faster still with an Index.


  Tom Lane pointed out to me that I needed 'text_pattern_ops' on my 
'file_parent_dir' column in the index if I wanted to do pattern matching 
(the C locale wasn't set). Now I have added an additional condition and 
I think this might be my problem. Here is a sample query I am trying to 
create my index for:



UPDATE file_info_2 SET file_backup='i' WHERE file_backup!='i' AND 
file_parent_dir='/';


  This would be an example of someone changing the backup state of the 
root of a partition. It could also be:



UPDATE file_info_2 SET file_backup='i' WHERE file_backup!='i' AND 
file_parent_dir='/usr';


  If, for example, the user was toggling the backup state of the '/usr' 
directory.


  I suspected that because I was using file_backup!='i' I might be 
running into the same problem as before, so I tried creating the index:



tle-bu= CREATE INDEX file_info_2_mupdate_idx ON file_info_2 
(file_backup bpchar_pattern_ops, file_parent_dir text_pattern_ops);


tle-bu= EXPLAIN ANALYZE UPDATE file_info_2 SET file_backup='i' WHERE 
file_backup!='i' AND file_parent_dir~'^/'; 
  QUERY PLAN

-
 Seq Scan on file_info_2  (cost=0.00..13379.38 rows=1 width=134) 
(actual time=1623.819..1624.087 rows=4 loops=1)
   Filter: ((file_backup <> 'i'::bpchar) AND (file_parent_dir ~ 
'^/'::text))

 Total runtime: 1628.053 ms
(3 rows)


  This index wasn't used though, even when I set 'enable_seqscan' to 
'OFF'. The column 'file_backup' is 'char(1)' and the column 
'file_parent_dir' is 'text'.



tle-bu= \d file_info_2; \di file_info_2_mupdate_idx; 
Table public.file_info_2

 Column  | Type |  Modifiers
-+--+--
 file_group_name | text |
 file_group_uid  | integer  | not null
 file_mod_time   | bigint   | not null
 file_name   | text | not null
 file_parent_dir | text | not null
 file_perm   | integer  | not null
 file_size   | bigint   | not null
 file_type   | character(1) | not null
 file_user_name  | text |
 file_user_uid   | integer  | not null
 file_backup | character(1) | not null default 'i'::bpchar
 file_display| character(1) | not null default 'i'::bpchar
 file_restore| character(1) | not null default 'i'::bpchar
Indexes:
file_info_2_mupdate_idx btree (file_backup bpchar_pattern_ops, 
file_parent_dir text_pattern_ops)

file_info_2_supdate_idx btree (file_parent_dir, file_name, file_type)

List of relations
 Schema |  Name   | Type  |  Owner  |Table
+-+---+-+-
 public | file_info_2_mupdate_idx | index | madison | file_info_2
(1 row)

  Could it be that there needs to be a certain number of rows with 
file_backup!='i' before the planner will use the index? I have also 
tried not defining an op class on both columns (and on one at a time) 
but I can't seem to figure this out.


  As always, thank you!

Madison



Solved (was: Re: [PERFORM] Another index question)

2005-07-22 Thread Madison Kelly

Line noise, sorry...

  After posting I went back to reading the pgsql docs and saw the query:


SELECT am.amname AS index_method, opc.opcname AS opclass_name, 
opr.oprname AS opclass_operator FROM pg_am am, pg_opclass opc, pg_amop 
amop, pg_operator opr WHERE opc.opcamid = am.oid AND amop.amopclaid = 
opc.oid AND amop.amopopr = opr.oid ORDER BY index_method, opclass_name, 
opclass_operator;


  Which listed all the op_classes. I noticed none of the 
opclass_operators supported '!=' so I wondered if that was simply an 
unindexable (is that a word?) operator. So I tried creating the index:



tle-bu= CREATE INDEX file_info_2_mupdate_idx ON file_info_2 
(file_backup, file_parent_dir text_pattern_ops);


  And changing my query to:


tle-bu= EXPLAIN ANALYZE UPDATE file_info_2 SET file_backup='i' WHERE 
file_backup='t' OR file_backup='f' AND file_parent_dir~'^/';


QUERY PLAN
-
 Index Scan using file_info_2_mupdate_idx, file_info_2_mupdate_idx on 
file_info_2  (cost=0.00..10.04 rows=1 width=134) (actual 
time=0.112..0.718 rows=4 loops=1)
   Index Cond: ((file_backup = 't'::bpchar) OR ((file_backup = 
'f'::bpchar) AND (file_parent_dir ~>=~ '/'::text) AND (file_parent_dir 
~<~ '0'::text)))
   Filter: ((file_backup = 't'::bpchar) OR ((file_backup = 'f'::bpchar) 
AND (file_parent_dir ~ '^/'::text)))

 Total runtime: 60.359 ms
(4 rows)

  Bingo!

  Hopefully someone might find this useful in the archives. :p

Madison


Madison Kelly wrote:

Hi all,

  I am trying to do an update on a table but so far I can't seem to come 
up with a usable index. After my last question/thread the user 'PFC' 
recommended I store whether a file was to be backed up as either 
't'(rue), 'f'(alse) or 'i'(nherit) to speed up changing files and sub 
directories under a given directory when it was toggled. I've more or 
less finished implementing this and it is certainly a LOT faster but I 
am hoping to make it just a little faster still with an Index.


  Tom Lane pointed out to me that I needed 'text_pattern_ops' on my 
'file_parent_dir' column in the index if I wanted to do pattern matching 
(the C locale wasn't set). Now I have added an additional condition and 
I think this might be my problem. Here is a sample query I am trying to 
create my index for:



UPDATE file_info_2 SET file_backup='i' WHERE file_backup!='i' AND 
file_parent_dir='/';


  This would be an example of someone changing the backup state of the 
root of a partition. It could also be:



UPDATE file_info_2 SET file_backup='i' WHERE file_backup!='i' AND 
file_parent_dir='/usr';


  If, for example, the user was toggling the backup state of the '/usr' 
directory.


  I suspected that because I was using file_backup!='i' I might be 
running into the same problem as before, so I tried creating the index:


tle-bu=> CREATE INDEX file_info_2_mupdate_idx ON file_info_2 
(file_backup bpchar_pattern_ops, file_parent_dir text_pattern_ops);


tle-bu=> EXPLAIN ANALYZE UPDATE file_info_2 SET file_backup='i' WHERE 
file_backup!='i' AND file_parent_dir~'^/';
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Seq Scan on file_info_2  (cost=0.00..13379.38 rows=1 width=134) (actual 
time=1623.819..1624.087 rows=4 loops=1)
   Filter: ((file_backup <> 'i'::bpchar) AND (file_parent_dir ~ 
'^/'::text))

 Total runtime: 1628.053 ms
(3 rows)


  This index wasn't used though, even when I set 'enable_seqscan' to 
'OFF'. The column 'file_backup' is 'char(1)' and the column 
'file_parent_dir' is 'text'.



tle-bu=> \d file_info_2; \di file_info_2_mupdate_idx;
              Table "public.file_info_2"
     Column      |     Type     |          Modifiers
-----------------+--------------+-------------------------------
 file_group_name | text         | 
 file_group_uid  | integer      | not null
 file_mod_time   | bigint       | not null
 file_name       | text         | not null
 file_parent_dir | text         | not null
 file_perm       | integer      | not null
 file_size       | bigint       | not null
 file_type       | character(1) | not null
 file_user_name  | text         | 
 file_user_uid   | integer      | not null
 file_backup     | character(1) | not null default 'i'::bpchar
 file_display    | character(1) | not null default 'i'::bpchar
 file_restore    | character(1) | not null default 'i'::bpchar
Indexes:
    "file_info_2_mupdate_idx" btree (file_backup bpchar_pattern_ops, file_parent_dir text_pattern_ops)
    "file_info_2_supdate_idx" btree (file_parent_dir, file_name, file_type)

                       List of relations
 Schema |          Name           | Type  |  Owner  |    Table
--------+-------------------------+-------+---------+-------------
 public | file_info_2_mupdate_idx | index | madison | file_info_2
(1 row)

  Could it be that there needs to be a certain number of rows where 
file_backup != 'i' before the planner will use the index?

[PERFORM] MySQL is faster than PgSQL by a large margin in my program... any ideas why?

2005-12-21 Thread Madison Kelly

Hi all,

  On a user's request, I recently added MySQL support to my backup 
program which had been written for PostgreSQL exclusively until now. 
What surprises me is that MySQL is about 20%(ish) faster than PostgreSQL.


  Now, I love PostgreSQL and I want to continue recommending it as the 
database engine of choice but it is hard to ignore a performance 
difference like that.


  My program is a perl backup app that scans the content of a given 
mounted partition, 'stat's each file and then stores that data in the 
database. To maintain certain data (the backup, restore and display 
values for each file) I first read in all the data from a given table 
(one table per partition) into a hash, drop and re-create the table, 
then start (in PostgreSQL) a bulk 'COPY..' call through the 'psql' shell 
app.
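
  For illustration, the generated bulk load looks something like this 
(abridged to a few columns, with made-up file names and sizes):

COPY file_info_1 (file_parent_dir, file_name, file_type, file_size) FROM stdin;
/usr/bin	perl	f	1146660
/usr/bin	psql	f	332132
\.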


  In MySQL there is no 'COPY...' equivalent so instead I generate a 
large 'INSERT INTO file_info_X (col1, col2, ... coln) VALUES (...), 
(blah) ... (blah);'. This doesn't support automatic quoting, obviously, 
so I manually quote my values before adding the value to the INSERT 
statement. I suspect this might be part of the performance difference?


  I take the total number of files processed when updating a partition 
(load old data into a hash + scan all files and prepare the COPY/INSERT 
+ commit new data) and divide by the number of seconds needed, to get a 
score I call a 'U.Rate'. On average on my Pentium3 1GHz laptop I get a 
U.Rate of ~400-500. With MySQL though I usually get a U.Rate of ~700-800.


  If the performance difference comes from the 'COPY...' command being 
slower because of the automatic quoting, can I somehow tell PostgreSQL 
that the data is pre-quoted? Could the performance difference be 
something else?


  If it would help I can provide code samples. I haven't done so yet 
because it's a little convoluted. ^_^;


  Thanks as always!

Madison



--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Madison Kelly (Digimer)
   TLE-BU; The Linux Experience, Back Up
Main Project Page:  http://tle-bu.org
Community Forum:http://forum.tle-bu.org
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] MySQL is faster than PgSQL by a large margin in my

2005-12-21 Thread Madison Kelly

Stephen Frost wrote:

* Madison Kelly ([EMAIL PROTECTED]) wrote:

 If the performance difference comes from the 'COPY...' command being 
slower because of the automatic quoting, can I somehow tell PostgreSQL 
that the data is pre-quoted? Could the performance difference be 
something else?



I doubt the issue is with the COPY command being slower than INSERTs
(I'd expect the opposite generally, actually...).  What's the table type
of the MySQL tables?  Is it MyISAM or InnoDB (I think those are the main
alternatives)?  IIRC, MyISAM doesn't do ACID and isn't transaction safe,
and has problems with data reliability (aiui, equivalent to doing 'fsync
= false' for Postgres).  InnoDB, again iirc, is transaction safe and
whatnot, and more akin to the default PostgreSQL setup.

I expect some others will comment along these lines too, if my response
isn't entirely clear. :)

Stephen


Ah, that makes a lot of sense (I read about the 'fsync' issue before, 
now that you mention it). I am not too familiar with MySQL but IIRC 
MyISAM is their open-source DB and InnoDB is their commercial one, ne? 
If so, then I am running MyISAM.


  Here is the MySQL table. The main difference from the PostgreSQL 
table is that the 'varchar(255)' columns are 'text' columns in PostgreSQL.


mysql> DESCRIBE file_info_1;
+-----------------+--------------+------+-----+---------+-------+
| Field           | Type         | Null | Key | Default | Extra |
+-----------------+--------------+------+-----+---------+-------+
| file_group_name | varchar(255) | YES  |     | NULL    |       |
| file_group_uid  | int(11)      |      |     | 0       |       |
| file_mod_time   | bigint(20)   |      |     | 0       |       |
| file_name       | varchar(255) |      |     |         |       |
| file_parent_dir | varchar(255) |      | MUL |         |       |
| file_perm       | int(11)      |      |     | 0       |       |
| file_size       | bigint(20)   |      |     | 0       |       |
| file_type       | char(1)      |      |     |         |       |
| file_user_name  | varchar(255) | YES  |     | NULL    |       |
| file_user_uid   | int(11)      |      |     | 0       |       |
| file_backup     | char(1)      |      | MUL | i       |       |
| file_display    | char(1)      |      |     | i       |       |
| file_restore    | char(1)      |      |     | i       |       |
+-----------------+--------------+------+-----+---------+-------+

  I will try turning off 'fsync' on my test box to see how much of a 
performance gain I get and to see if it is close to what I am getting 
out of MySQL. If that does turn out to be the case though I will be able 
to comfortably continue recommending PostgreSQL from a stability point 
of view.
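
  For the test itself, I gather that is just a postgresql.conf change on 
the throwaway test box, followed by a restart (and never on a box whose 
data matters):

fsync = off   # test box only!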


Thanks!!

Madison

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Madison Kelly (Digimer)
   TLE-BU; The Linux Experience, Back Up
Main Project Page:  http://tle-bu.org
Community Forum:http://forum.tle-bu.org
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] MySQL is faster than PgSQL by a large margin in

2005-12-21 Thread Madison Kelly

Luke Lonergan wrote:

What version of postgres?

Copy has been substantially improved in bizgres and also in 8.1.
- Luke


Currently 7.4 (what comes with Debian Sarge). I have run my program on 
8.0 but not since I have added MySQL support. I should run the tests on 
the newer versions of both DBs (using v4.1 for MySQL which is also 
mature at this point).


As others mentioned though, so far the most likely explanation is the 
'fsync' being enabled on PostgreSQL.


Thanks for the reply!

Madison

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Madison Kelly (Digimer)
   TLE-BU; The Linux Experience, Back Up
Main Project Page:  http://tle-bu.org
Community Forum:http://forum.tle-bu.org
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[PERFORM] Setting nice values

2006-11-02 Thread Madison Kelly

Hi all,

  I've got a script (perl, in case it matters) that I need to run once
a month to prepare statements. This script queries and updates the
database a *lot*. I am not concerned with the performance of the SQL
calls so much as I am about the impact it has on the server's load.

  Is there a way to limit a query's speed (ie: set a low 'nice' value on
a query)? This might be an odd question, or I could be asking the
question the wrong way, but hopefully you get the idea. :)

Thanks!

Madi


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Setting nice values

2006-11-02 Thread Madison Kelly

Scott Marlowe wrote:

On Thu, 2006-11-02 at 09:14, Madison Kelly wrote:

Hi all,

   I've got a script (perl, in case it matters) that I need to run once
a month to prepare statements. This script queries and updates the
database a *lot*. I am not concerned with the performance of the SQL
calls so much as I am about the impact it has on the server's load.

   Is there a way to limit a query's speed (ie: set a low 'nice' value on
a query)? This might be an odd question, or I could be asking the
question the wrong way, but hopefully you get the idea. :)


While you can safely set the priority lower on the calling perl script,
setting db backend priorities lower can result in problems caused by
priority inversion.  Look up that phrase on the pgsql admin, perform,
general, or hackers lists for an explanation, or go here:

http://en.wikipedia.org/wiki/Priority_inversion

I have a simple script that grabs raw data from an oracle db and shoves
it into a postgresql database for reporting purposes.  Every 100 rows I
put into postgresql, I usleep 10 or so and the load caused by that
script on both systems is minimal.  You might try something like that.


Will the priority of the script pass down to the pgsql queries it calls? 
I figured (likely incorrectly) that because the queries are executed by 
the PostgreSQL server, they run with the server's priority. If this 
isn't the case, then perfect. :)
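
  In case it is that simple, I assume you mean just launching the job as 
something like this (script name made up):

nice -n 19 ./monthly_reports.pl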


Thanks for the tip, too, it's something I will try.

Madi

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Setting nice values

2006-11-06 Thread Madison Kelly

Tobias Brox wrote:

[Madison Kelly - Thu at 10:25:07AM -0500]
Will the priority of the script pass down to the pgsql queries it calls? 
I figured (likely incorrectly) that because the queries are executed by 
the PostgreSQL server, they run with the server's priority. 


I think you are right, and in any case, I don't think the niceness
value will help much if the bottleneck is iowait.

In our application, I've made a special function for doing
low-priority transactions which I believe is quite smart - though maybe
not always.  Before introducing this logic, we observed we had a tipping
point: with too many queries, the database wouldn't swallow them fast
enough, and the database server just jammed up, trying to work on too
many queries at once and yielding the results far too slowly.

In the config file, I now have those two flags set:

 stats_start_collector = on
 stats_command_string = on

This will unfortunately cause some CPU-load, but the benefit is great
- one can actually check what the server is working with at any time:

  select * from pg_stat_activity

with those, it is possible to check a special view pg_stat_activity -
it will contain all the queries the database is working on right now.
My idea is to peek into this table - if there is no active queries,
the database is idle, and it's safe to start our low-priority
transaction.  If this view is full of stuff, one should certainly not
run any low-priority transactions, rather sleep a bit and try again
later.

 select count(*) from pg_stat_activity where not current_query like
 '<IDLE>%' and query_start + ? < now()

The algorithm takes four parameters: the time value to put in above,
the maximum number of queries allowed to run, the sleep time between
each attempt, and the number of attempts to try before giving up.


So here are the cons and drawbacks:

 con: Given small queries and small transactions, one can tune this in
  such a way that the low priority queries (almost) never cause
  significant delay for the higher priority queries.

 con: can be used to block users of an interactive query
  application from causing disturbances on the production database.

 con: can be used for pausing low-priority batch jobs to execute only
  when the server is idle.

 drawback: unsuitable for long-running queries and transactions 


 drawback: with fixed values in the parameters above, one risks that
   the queries never get run if the server is sufficiently stressed.

 drawback: the stats collection requires some CPU

 drawback: the select * from pg_stats_activity query requires some CPU

 drawback: the pg_stats_activity-view is constant within the
   transaction, so one has to roll back if there is activity
   (this is however not a really bad thing, because one
   certainly shouldn't leave an idle transaction around if the
   database is stressed).


I can see how this would be very useful (and may make use of it later!). 
For the current job at hand though, at full tilt it can take a few hours 
to run, which puts it into your drawback section. The server in 
question is also almost always under load of some sort, too.


A great tip and one I am sure to make use of later, thanks!

Madi

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Setting nice values

2006-11-06 Thread Madison Kelly

Scott Marlowe wrote:

nope, the priorities don't pass down.  you connect via a client lib to
the server, which spawns a backend process that does the work for you. 
The backend process inherits its priority from the postmaster that

spawns it, and they all run at the same priority.


Shoot, but figured. :)


Thanks for the tip, too, it's something I will try.


Sometimes it's the simple solutions that work best.  :)  Welcome to the
world of pgsql, btw...


Heh, if only I was new to pgsql I wouldn't feel silly for asking so many 
questions :P. In the same right though, I enjoy PgSQL/Linux/FOSS in 
general *because* there seems to never be a shortage of things to learn.


Thanks!

Madi

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Setting nice values

2006-11-06 Thread Madison Kelly

Andreas Kostyrka wrote:

On Thursday, 2006-11-02, at 09:41 -0600, Scott Marlowe wrote:

Sometimes it's the simple solutions that work best.  :)  Welcome to the
world of pgsql, btw...


OTOH, there are also non-simple solutions to this, which might make
sense anyway: Install slony, and run your queries against a readonly
replica of your data.


Bingo! This seems like exactly what we can/should do, and it will likely 
help with other jobs we run, too.


I feel a little silly for not having thought of this myself... Guess I 
was too focused on niceness :). Thanks!


Madi

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Setting nice values

2006-11-06 Thread Madison Kelly

Tobias Brox wrote:

[Madison Kelly - Mon at 08:10:12AM -0500]
to run, which puts it into your drawback section. The server in 
question is also almost always under load of some sort, too.


A great tip and one I am sure to make use of later, thanks!


I must have been sleepy, listing up cons vs drawbacks ;-)


:) I noticed but figured what you meant (I certainly do similar flubs!).


Anyway, the central question is not the size of the job, but the size of
the transactions within the job - if the job consists of many
transactions, my test can be run before every transaction.  Having
transactions lasting for hours is a very bad thing to do, anyway.


Ah, sorry, long single queries is what you meant. I have inherited this 
code so I am not sure how long a given query takes, though they do use a 
lot of joins and such, so I suspect it isn't quick; indexes aside. When 
I get some time (and get the backup server running) I plan to play with 
this. Currently the DB is on a production server so I am hesitant to 
poke around just now. Once I get the backup server though, I will play 
with your suggestions. I am quite curious to see how it will work out.


Thanks again!

Madi


---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [PERFORM] Setting nice values

2006-11-06 Thread Madison Kelly

Tobias Brox wrote:

[Madison Kelly - Mon at 08:48:19AM -0500]
Ah, sorry, long single queries is what you meant. 


No - long running single transactions :-)  If it's only read-only
queries, one will probably benefit by having one transaction for every
query.



In this case, what happens is the result of one kinda ugly, big query is 
read into a hash, and then looped through (usually ~10,000 rows). On each 
loop another, slightly less ugly query is performed based on the first 
query's values, now in the hash (these queries being where throttling 
might help). Then, after the second query is parsed, a PDF file is created 
(also a big source of slowness). It isn't entirely read-only though, 
because as the PDFs are created a flag is updated in the given record's 
row. So yeah, I need to experiment some. :)


Madi

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


[PERFORM] Writing a search engine for a pgsql DB

2007-02-26 Thread Madison Kelly

Hi all,

  I am asking in this list because, at the end of the day, this is a 
performance question.


  I am looking at writing a search engine of sorts for my database. I 
have only ever written very simple search engines before, which amounted 
to not much more than the query string being used with ILIKE on a pile 
of columns. This was pretty rudimentary and didn't offer anything like 
relevance sorting and such (I'd sort by result name, age or whatnot).


  So I am hoping some of you guys and gals might be able to point me 
towards some resources or offer some tips or gotcha's before I get 
started on this. I'd really like to come up with a more intelligent 
search engine that doesn't take two minutes to return results. :) I 
know that, in the end, good indexes and the underlying hardware will be 
important, but a query structure that is as sane as possible helps to 
start with.


  Thanks all!!

Madison

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] Writing a search engine for a pgsql DB

2007-02-26 Thread Madison Kelly

Mark Stosberg wrote:

Madison Kelly wrote:

  I think the more direct question I was trying to get at is "How do you
build a 'relevance' search engine? One where results are returned/sorted
by relevance of some sort?". At this point, the best I can think of
would be to perform multiple queries; first matching the whole search
term, then the search term starting a row, then ending a row, then
anywhere in a row, and scoring the results based on which query they
came out on. This seems terribly cumbersome (and probably slow, indexes
be damned) though. I'm hoping there is a better way! :)


Madison,

I think your basic thinking is correct. However, the first select can be
done offline -- sometime beforehand.

For example, you might create a table called "keywords" that includes
the list of words mined from the other tables, along with references to
where the words are found, and how many times they are mentioned.

Then, when someone actually searches, the search is primarily on the
keywords table, which now gives you a way to sort by rank, since the
table contains how many times each keyword matches. The final result can
be constructed by using the details in the keywords table to pull up the
actual records needed.

My expectation however is that there are enough details in the system,
that I would first look at trying a package like tsearch2 to help solve
the problem, before trying to write another system like this from scratch.

  Mark


Now see, this is exactly the kind of sagely advice I was hoping for! :)

I'll look into tsearch2, and failing that for some reason, I love the 
keyword table idea.
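
For anyone finding this in the archives, here is a rough sketch of how I 
understood the keyword table idea (all names are made up, and untested):

CREATE TABLE keywords (
    kw_word        text    not null,  -- a word mined from the searchable tables
    kw_source      text    not null,  -- which table the word was found in
    kw_record_id   int     not null,  -- the row it points back to
    kw_count       int     not null   -- how many times the word appears there
);
CREATE INDEX keywords_word_idx ON keywords (kw_word);

-- At search time, rank results by total match count:
SELECT kw_source, kw_record_id, SUM(kw_count) AS rank
FROM keywords
WHERE kw_word IN ('foo', 'bar')
GROUP BY kw_source, kw_record_id
ORDER BY rank DESC;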


Thanks kindly!!

Madi

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[PERFORM] Optimizing a VIEW

2008-08-15 Thread Madison Kelly

Hi all,

  I've got a simple table with a lot of data in it:

CREATE TABLE customer_data (
    cd_id           int         primary key default(nextval('cd_seq')),
    cd_cust_id      int         not null,
    cd_variable     text        not null,
    cd_value        text,
    cd_tag          text,
    added_user      int         not null,
    added_date      timestamp   not null    default now(),
    modified_user   int         not null,
    modified_date   timestamp   not null    default now(),

    FOREIGN KEY(cd_cust_id) REFERENCES customer(cust_id)
);

  The 'cust_id' references the customer that the given data belongs to. 
The reason for this data bucket (does this structure have a proper 
name?) is that the data I need to store on a given customer is quite 
variable and outside of my control. As it is, there are about 400 
different variable/value pairs I need to store per customer.
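
  To make the layout concrete, a typical row looks something like this 
(the values are made up):

INSERT INTO customer_data (cd_cust_id, cd_variable, cd_value, cd_tag, 
added_user, modified_user) 
VALUES (514, 'upstream_speed', '4544', 'sync_rate', 1, 1);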


  This table has a copy in a second historical schema that matches this 
one in public but with an additional 'history_id' sequence. I use a 
simple function to copy an INSERT or UPDATE to any entry in the 
historical schema.


  Now I want to graph a certain subset of these variable/value pairs, 
so I created a simple (in concept) view to pull out the historical data 
set for a given customer. I do this by pulling up a set of records based 
on the name of the 'cd_variable' and 'cd_tag' columns and connecting the 
records together using a matching timestamp.


  The problem is that this view has very quickly become terribly slow.

  I've got indexes on the 'cd_variable', 'cd_tag' and the parent 
'cust_id' columns, and the plan seems to show that the indexes are 
indeed being used, but the query against this view can take up to 10 
minutes to respond. I am hoping to avoid making a dedicated table as 
what I use to build this dataset may change over time.


  Below I will post the VIEW and a sample of the query's EXPLAIN 
ANALYZE. Thanks for any tips/help/clue-stick-beating you may be able to 
share!


Madi

-=] VIEW

CREATE VIEW view_sync_rate_history AS
SELECT
    a.cust_id                                  AS  vsrh_cust_id,
    a.cust_name                                AS  vsrh_cust_name,
    a.cust_business                            AS  vsrh_cust_business,
    a.cust_nexxia_id||'-'||a.cust_nexxia_seq   AS  vsrh_cust_nexxia,
    a.cust_phone                               AS  vsrh_cust_phone,
    b.cd_value                                 AS  vsrh_up_speed,
    b.history_id                               AS  vsrh_up_speed_history_id,
    c.cd_value                                 AS  vsrh_up_rco,
    c.history_id                               AS  vsrh_up_rco_history_id,
    d.cd_value                                 AS  vsrh_up_nm,
    d.history_id                               AS  vsrh_up_nm_history_id,
    e.cd_value                                 AS  vsrh_up_sp,
    e.history_id                               AS  vsrh_up_sp_history_id,
    f.cd_value                                 AS  vsrh_up_atten,
    f.history_id                               AS  vsrh_up_atten_history_id,
    g.cd_value                                 AS  vsrh_down_speed,
    g.history_id                               AS  vsrh_down_speed_history_id,
    h.cd_value                                 AS  vsrh_down_rco,
    h.history_id                               AS  vsrh_down_rco_history_id,
    i.cd_value                                 AS  vsrh_down_nm,
    i.history_id                               AS  vsrh_down_nm_history_id,
    j.cd_value                                 AS  vsrh_down_sp,
    j.history_id                               AS  vsrh_down_sp_history_id,
    k.cd_value                                 AS  vsrh_down_atten,
    k.history_id                               AS  vsrh_down_atten_history_id,
    l.cd_value                                 AS  vsrh_updated,
    l.history_id                               AS  vsrh_updated_history_id
FROM
    customer a,
    history.customer_data b,
    history.customer_data c,
    history.customer_data d,
    history.customer_data e,
    history.customer_data f,
    history.customer_data g,
    history.customer_data h,
    history.customer_data i,
    history.customer_data j,
    history.customer_data k,
    history.customer_data l
WHERE
    a.cust_id=b.cd_cust_id AND
    a.cust_id=c.cd_cust_id AND
    a.cust_id=d.cd_cust_id AND
    a.cust_id=e.cd_cust_id AND
    a.cust_id=f.cd_cust_id AND
    a.cust_id=g.cd_cust_id AND
    a.cust_id=h.cd_cust_id AND
    a.cust_id=i.cd_cust_id AND
    a.cust_id=j.cd_cust_id AND
    a.cust_id=k.cd_cust_id AND
    a.cust_id=l.cd_cust_id AND
    b.cd_tag='sync_rate' AND
    c.cd_tag='sync_rate' AND
    d.cd_tag='sync_rate' AND
    e.cd_tag='sync_rate' AND
    f.cd_tag='sync_rate' AND
    g.cd_tag='sync_rate' AND
    h.cd_tag='sync_rate' AND
    i.cd_tag='sync_rate' AND
    j.cd_tag='sync_rate' AND
    k.cd_tag='sync_rate' AND
    l.cd_tag='sync_rate' AND
    b.cd_variable='upstream_speed' AND
    c.cd_variable='upstream_relative_capacity_occupation' AND
    d.cd_variable='upstream_noise_margin' AND
    e.cd_variable='upstream_signal_power' AND
    f.cd_variable='upstream_attenuation' 

Re: [PERFORM] Optimizing a VIEW

2008-08-17 Thread Madison Kelly

Decibel! wrote:

On Aug 15, 2008, at 1:36 PM, Madison Kelly wrote:
The 'cust_id' references the customer that the given data belongs to. 
The reason for this data bucket (does this structure have a proper 
name?) is that the data I need to store on a given customer is quite 
variable and outside of my control. As it is, there are about 400 
different variable/value pairs I need to store per customer.



It's called Entity-Attribute-Value, and its performance is pretty much 
guaranteed to suck for any kind of a large dataset. The problem is that 
you're storing a MASSIVE amount of extra information for every single 
value. Consider:


If each data point was just a field in a table, then even if we left 
cd_value as text, each data point would consume 4 bytes* + 1 byte per 
character (I'm assuming you don't need extra UTF8 chars or anything). Of 
course if you know you're only storing numbers or the like then you can 
make that even more efficient.


* In 8.3, the text field overhead could be as low as 1 byte if the field 
is small enough.


OTOH, your table is going to use 32+24 bytes per row just for the per-row 
overhead, ints and timestamps. Each text field will have 1 or 4 bytes of 
overhead, then you have to store the actual data. Realistically, you're 
looking at 60+ bytes per data point, as opposed to maybe 15, or even 
down to 4 if you know you're storing an int.


Now figure out what that turns into if you have 100 data points per 
minute. It doesn't take very long until you have a huge pile of data 
you're trying to deal with. (As an aside, I once consulted with a 
company that wanted to do this... they wanted to store about 400 data 
points from about 1000 devices on a 5 minute interval. That worked out 
to something like 5GB per day, just for the EAV table. Just wasn't going 
to scale...)


So, back to your situation... there's several things you can do that 
will greatly improve things.


Identify data points that are very common and don't use EAV to store 
them. Instead, store them as regular fields in a table (and don't use 
text if at all possible).


You need to trim down your EAV table. Throw out the added/modified info; 
there's almost certainly no reason to store that *per data point*. Get 
rid of cd_id; there should be a natural PK you can use, and you 
certainly don't want anything else referring to this table (other tables 
referring to it being the big reason to use a surrogate key in the first 
place).


cd_variable and cd_tag need to be ints that point at other tables. For 
that matter, do you really need to tag each *data point*? Probably not...


Finally, if you have a defined set of points that you need to report on, 
create a materialized view that has that information.


BTW, it would probably be better to store data either in the main table, 
or the history table, but not both places.


This is a very long and thoughtful reply, thank you very kindly.

Truth be told, I sort of expected this would be what I had to do. I 
think I asked this more in hoping that there might be some magic I 
didn't know about, but I see now that's not the case. :)


As my data points grew to 500,000+, the time it took to return these 
results grew to well over 10 minutes on a decent server, and the DB size 
was growing rapidly, as you spoke of.


So I did just as you suggested: I took the variable names I knew about 
specifically and created a dedicated table for them (these are the ones 
that are updated most often, hourly per customer), made each column an 
'int' or 'real' where possible, and ditched the tracking of the 
adding/modifying user and time stamp. I had added those out of habit, 
more than anything. This data will always come from a system app though, 
so...
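
  The new table looks roughly like this (trimmed to a few columns, and 
the names here are illustrative rather than the real ones):

CREATE TABLE customer_sync_rate (
    csr_cust_id     int         not null    REFERENCES customer(cust_id),
    csr_up_speed    int,
    csr_up_nm       real,
    csr_down_speed  int,
    csr_down_nm     real,
    csr_updated     timestamp   not null    default now()
);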


Given that my DB is in development and how very long and intensive it 
would have been to pull out the existing data, I have started over and 
am now gathering new data. In a week or so I should have the same amount 
of data as I had before and I will be able to do a closer comparison test.


However, I already suspect the growth of the database will be 
substantially slower and the queries will return substantially faster.


Thank you again!

Madi

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] DB is slow until DB is reloaded

2010-01-04 Thread Madison Kelly

Hi all,

  I've got a fairly small DB (~850MB when pg_dump'ed) running on PgSQL
v8.1.11 on a CentOS 5.3 x86_64 Xen-based virtual machine. The filesystem is
ext3 on LVM with 32MB extents. It's about the only real resource-hungry
VM on the server.

  It slows down over time and I can't seem to find a way to get the
performance to return without doing a dump and reload of the database.
I've tried manually running 'VACUUM FULL' and restarting the postgresql
daemon without success.

For example, here is an actual query before the dump and again after the
dump (sorry for the large query):

-=] Before the dump/reload [=-
ser...@iwt=> EXPLAIN ANALYZE SELECT lor_order_type, lor_order_date,
lor_order_time, lor_isp_agent_id, lor_last_modified_date,
lor_isp_order_number, lor_instr_for_bell_rep, lor_type_of_service,
lor_local_voice_provider, lor_dry_loop_instr, lor_termination_location,
lor_req_line_speed, lor_server_from, lor_rate_band,
lor_related_order_nums, lor_related_order_types, lor_activation_date,
lor_cust_first_name, lor_cust_last_name, lor_req_activation_date,
lor_street_number, lor_street_number_suffix, lor_street_name,
lor_street_type, lor_street_direction, lor_location_type_1,
lor_location_number_1, lor_location_type_2, lor_location_number_2,
lor_postal_code, lor_municipality, lor_province, lor_customer_group,
lor_daytime_phone, lor_daytime_phone_ext, lod_value AS circuit_number
FROM line_owner_report LEFT JOIN line_owner_data ON (lor_id=lod_lo_id
AND lod_variable='ISPCircuitNumber1') WHERE lor_lo_id=514;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Hash Left Join  (cost=2115.43..112756.81 rows=8198 width=1152) (actual
time=1463.311..1463.380 rows=1 loops=1)
   Hash Cond: ("outer".lor_id = "inner".lod_lo_id)
   ->  Seq Scan on line_owner_report  (cost=0.00..108509.85 rows=8198
width=1124) (actual time=1462.810..1462.872 rows=1 loops=1)
         Filter: (lor_lo_id = 514)
   ->  Hash  (cost=2112.85..2112.85 rows=1033 width=36) (actual
time=0.421..0.421 rows=5 loops=1)
         ->  Bitmap Heap Scan on line_owner_data  (cost=9.61..2112.85
rows=1033 width=36) (actual time=0.274..0.378 rows=5 loops=1)
               Recheck Cond: (lod_variable = 'ISPCircuitNumber1'::text)
               ->  Bitmap Index Scan on lod_variable_index
(cost=0.00..9.61 rows=1033 width=0) (actual time=0.218..0.218 rows=5
loops=1)
                     Index Cond: (lod_variable = 'ISPCircuitNumber1'::text)
 Total runtime: 1463.679 ms
(10 rows)

-=] After the dump/reload [=-
ser...@iwt=> EXPLAIN ANALYZE SELECT lor_order_type, lor_order_date,
lor_order_time, lor_isp_agent_id, lor_last_modified_date,
lor_isp_order_number, lor_instr_for_bell_rep, lor_type_of_service,
lor_local_voice_provider, lor_dry_loop_instr, lor_termination_location,
lor_req_line_speed, lor_server_from, lor_rate_band,
lor_related_order_nums, lor_related_order_types, lor_activation_date,
lor_cust_first_name, lor_cust_last_name, lor_req_activation_date,
lor_street_number, lor_street_number_suffix, lor_street_name,
lor_street_type, lor_street_direction, lor_location_type_1,
lor_location_number_1, lor_location_type_2, lor_location_number_2,
lor_postal_code, lor_municipality, lor_province, lor_customer_group,
lor_daytime_phone, lor_daytime_phone_ext, lod_value AS circuit_number
FROM line_owner_report LEFT JOIN line_owner_data ON (lor_id=lod_lo_id
AND lod_variable='ISPCircuitNumber1') WHERE lor_lo_id=514;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=10.84..182.57 rows=5 width=1152) (actual
time=1.980..2.083 rows=1 loops=1)
   ->  Seq Scan on line_owner_report  (cost=0.00..70.05 rows=5
width=1124) (actual time=1.388..1.485 rows=1 loops=1)
         Filter: (lor_lo_id = 514)
   ->  Bitmap Heap Scan on line_owner_data  (cost=10.84..22.47 rows=3
width=36) (actual time=0.562..0.562 rows=0 loops=1)
         Recheck Cond: (("outer".lor_id = line_owner_data.lod_lo_id)
AND (line_owner_data.lod_variable = 'ISPCircuitNumber1'::text))
         ->  BitmapAnd  (cost=10.84..10.84 rows=3 width=0) (actual
time=0.552..0.552 rows=0 loops=1)
               ->  Bitmap Index Scan on lod_id_index  (cost=0.00..4.80
rows=514 width=0) (actual time=0.250..0.250 rows=126 loops=1)
                     Index Cond: ("outer".lor_id =
line_owner_data.lod_lo_id)
               ->  Bitmap Index Scan on lod_variable_index
(cost=0.00..5.80 rows=514 width=0) (actual time=0.262..0.262 rows=5 loops=1)
                     Index Cond: (lod_variable = 'ISPCircuitNumber1'::text)
 Total runtime: 2.576 ms
(11 rows)

  Any idea on what might be causing the slowdown? Is it likely
filesystem related, or am I missing some maintenance step?

Thanks!

Madi


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)

Re: [PERFORM] DB is slow until DB is reloaded

2010-01-04 Thread Madison Kelly

Steve Crawford wrote:

Madison Kelly wrote:

Hi all,

  I've got a fairly small DB...

  It slows down over time and I can't seem to find a way to get the
performance to return without doing a dump and reload of the database...


Some questions:

Is autovacuum running? This is the most likely suspect. If not, things 
will bloat and you won't be getting appropriate analyze runs. Speaking 
of which, what happens if you just run analyze?


And as long as you are dumping and reloading anyway, how about version 
upgrading for bug reduction, performance improvement, and cool new 
features.


Cheers,
Steve



Yup, I even tried manually running 'VACUUM FULL' and it didn't help. As 
for upgrading:


a) I am trying to find a way around the dump/reload. I am doing it as a 
last resort only.

b) I want to keep the version in CentOS' repo.

I'd not tried simply updating the stats via ANALYZE... I'll keep an eye 
on performance and if it starts to slip again, I will run ANALYZE and 
see if that helps. If there is a way to run ANALYZE against a query that 
I am missing, please let me know.
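
  For the archives: it turns out ANALYZE runs against tables (or the 
whole database), not individual queries. That is, something like:

ANALYZE;                    -- refresh planner stats for the whole database
ANALYZE line_owner_report;  -- or for just one table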


Madi

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] DB is slow until DB is reloaded

2010-01-04 Thread Madison Kelly

Gary Doades wrote:
 From your queries it definitely looks like it's your stats that are the 
problem. When the stats get well out of date the planner is choosing a 
hash join because it thinks thousands of rows are involved, whereas only 
a few are actually involved. That's why, with better stats, the second 
query is using a loop join over very few rows and running much quicker.


Therefore it's ANALYZE you need to run as well as regular VACUUMing. 
There should be no need to VACUUM FULL at all as long as you VACUUM and 
ANALYZE regularly. Once a day may be enough, but you don't say how long 
it takes your database to become slow.


You can VACUUM either the whole database (often easiest) or individual 
tables if you know in more detail what the problem is and that only 
certain tables need it.


Setting up autovacuum may well be sufficient.

Cheers,
Gary.


That explains things, thank you!

For the record: it was taking a few months for the performance to become 
intolerable. I've added CLUSTER -> ANALYZE -> VACUUM to my nightly 
routine and dropped the VACUUM FULL call. I'll see how this works.


Cheers!

Madi

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] DB is slow until DB is reloaded

2010-01-04 Thread Madison Kelly

Kevin Grittner wrote:

Madison Kelly li...@alteeve.com wrote:
 
I've added CLUSTER -> ANALYZE -> VACUUM to my nightly 
routine and dropped the VACUUM FULL call.
 
The CLUSTER is probably not going to make much difference once
you've eliminated bloat, unless your queries do a lot of searches in
the sequence of the index used.  Be sure to run VACUUM ANALYZE as
one statement, not two separate steps.
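
i.e., something like:

  VACUUM ANALYZE;   -- one statement

rather than a separate VACUUM; followed by an ANALYZE;.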
 
-Kevin


Ah, noted and updated, thank you.

Madi

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] DB is slow until DB is reloaded

2010-01-04 Thread Madison Kelly

Brad Nicholson wrote:

I think you are going down the wrong route here - you should be looking
at preventative maintenance instead of fixing it after its broken.

Ensure that autovacuum is running for the database (assuming that you
are on a relatively modern version of PG), and possibly tune it to be
more aggressive (we can help).

This will ensure that the condition never comes up.

ps - if you do go with the route specified, no need to VACUUM after the
CLUSTER.  CLUSTER gets rid of the dead tuples - nothing for VACUUM to
do.



  I wanted to get ahead of the problem, hence my question here. :) I've 
set this to run at night ('iwt' being the DB in question):


su postgres -c "psql iwt -c \"VACUUM ANALYZE VERBOSE\""

  I will keep an eye on the output for a little while (it appends to a 
log) and see what it says. Also, I read that CLUSTER can mess up backups 
as it makes tables look empty while running. If the above doesn't 
seem to help, I will swap out the VACUUM and run a CLUSTER before the 
ANALYZE and see how that works.


Madi

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] DB is slow until DB is reloaded

2010-01-04 Thread Madison Kelly

Steve Crawford wrote:

Madison Kelly wrote:

Steve Crawford wrote:

Madison Kelly wrote:

Hi all,

  I've got a fairly small DB...

  It slows down over time and I can't seem to find a way to get the
performance to return without doing a dump and reload of the 
database...



Yup, I even tried manually running 'VACUUM FULL' and it didn't help.
That's because VACUUM reclaims space (er, actually marks space that is 
available for reuse) while ANALYZE refreshes the statistics that the 
planner uses.



As for upgrading;

a) I am trying to find a way around the dump/reload. I am doing it as 
a last resort only.
Agreed - it is the last resort. But since you were doing it I was just 
suggesting that you could combine it with an upgrade and get more benefits.

b) I want to keep the version in CentOS' repo.
Depends on reasoning. If you absolutely require a particular fully 
vanilla version of CentOS for some reason then fine. But telling 
CentOS to use the PostgreSQL Development Group pre-built releases for 
CentOS is a very easy one-time process (it's what I do on my CentOS 
machines). From memory (but read to the end for warnings):


Download the setup rpm:
wget http://yum.pgsqlrpms.org/reporpms/8.4/pgdg-centos-8.4-1.noarch.rpm

Install it:
rpm -i pgdg-centos-8.4-1.noarch.rpm

Note: This does not install PostgreSQL - it just updates your repository 
list to add the repository containing PostgreSQL binaries. Now make sure 
that you get your updates from PostgreSQL, not CentOS:


Edit /etc/yum.repos.d/CentOS-Base.repo and add exclude=postgresql* to 
the  [base] and [updates] sections.
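
That is, each of those two sections ends up with a line like this 
(section shown abridged):

[base]
exclude=postgresql*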


Now you can use yum as normal and you will get PostgreSQL 8.4 and 
updates thereto rather than using 8.1.


BUT!! I have only done this on new installs. I have not tried it on an 
already running machine. As always, test first on a dev machine and do 
your pre-update dump using the new version of the pg_dump utilities, not 
the old ones.


Cheers,
Steve




I'd not tried simply updating the stats via ANALYZE... I'll keep an 
eye on performance and if it starts to slip again, I will run ANALYZE 
and see if that helps. If there is a way to run ANALYZE against a 
query that I am missing, please let me know.
If you stick with 8.1x, you may want to edit postgresql.conf and change 
default_statistics_target to 100 if it is still at the previous default 
of 10. 100 is the new default setting as testing indicates that it tends 
to yield better query plans with minimal additional overhead.


Cheers,
Steve


I think for now, I will stick with 8.1, but I will certainly try out 
your repo edit above on a test machine and see how that works out. I am 
always reticent to change something as fundamental as postgres without 
good reason. I guess I am a fan of "if it ain't broke" :)


As for the edit to postgresql.conf, I've made the change. Thanks for the 
detailed input on that.


Madi

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] DB is slow until DB is reloaded

2010-01-04 Thread Madison Kelly

Rosser Schwarz wrote:

On Mon, Jan 4, 2010 at 2:57 PM, Madison Kelly li...@alteeve.com wrote:

As for autovacuum, I assumed (yes, I know) that all v8.x releases enabled it
by default. How would I confirm that it's running or not?


I believe it's not enabled by default in 8.1-land, and is as of 8.2
and later.  Whether it's running or not, try "SELECT * FROM
pg_autovacuum;".  If that returns the null set, it's not doing
anything, as it hasn't been told it has anything to do.

IME, however, if you really want to benefit from the autovacuum
daemon, you probably do want to be on something more recent than 8.1.
(And, yes, this is a bit of the pot calling the kettle black: I have a
mixed set of 8.1 and 8.3 hosts.  Autovacuum is only running on the
latter, while the former are queued for an upgrade.)

rls


You are right, autovacuum is not running after all. From your comment, I 
am wondering if you'd recommend I turn it on or not? If so, given that I 
doubt I will upgrade any time soon, how would I enable it? I suppose I 
could google that, but google rarely shares gotcha's. :)
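
  For anyone finding this later: from what I have read, on 8.1 turning 
autovacuum on amounts to the following postgresql.conf settings plus a 
restart (but check the 8.1 docs rather than trusting my notes):

stats_start_collector = on   # autovacuum needs the stats collector...
stats_row_level = on         # ...including row-level stats
autovacuum = on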


Madi

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] DB is slow until DB is reloaded

2010-01-04 Thread Madison Kelly

Scott Marlowe wrote:

On Mon, Jan 4, 2010 at 3:13 PM, Greg Smith g...@2ndquadrant.com wrote:

Madison Kelly wrote:

I think for now, I will stick with 8.1, but I will certainly try out your
repo edit above on a test machine and see how that works out. I am always
reticent to change something as fundamental as postgres without good
reason. I guess I am a fan of if it ain't broke :)

PostgreSQL has many fundamental limitations that cannot be resolved no
matter what you do in 8.1 that are fixed in later versions.  The default
behavior for the problem you're having has been massively improved by
updates made in 8.2, 8.3, and 8.4.  8.1 can certainly be considered broken
in regards to its lack of good and automatic VACUUM and ANALYZE behavior,
and you're just seeing the first round of issues in that area.  Every minute
you spend applying temporary fixes to the fundamental issues is time you
could be better spending toward upgrading instead.


Also, the HOT updates in 8.3 made a compelling case for us to update,
and if the OP is suffering from table bloat, HOT might help a lot.



These are certainly compelling reasons for me to try upgrading... I will 
try a test upgrade on a devel server tomorrow using Steve's repo edits.


Madi

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance