Re: [PERFORM] Configuration Advice

2007-01-19 Thread Merlin Moncure

On 1/17/07, Steve [EMAIL PROTECTED] wrote:

Hey there;
I've been lurking on this list awhile, and I've been working with postgres
for a number of years so I'm not exactly new to this.  But I'm still
having trouble getting a good balance of settings and I'd like to see what
other people think.  We may also be willing to hire a contractor to help
tackle this problem if anyone is interested.


I happen to be something of a COBOL-SQL expert; if you are interested
in some advice you can contact me off-list.  I converted an enterprise
COBOL (AcuCOBOL) app to PostgreSQL by plugging pg into the COBOL
system via a custom C++ ISAM driver.


I've got an application here that runs large (in terms of length -- the
queries have a lot of conditions in them) queries that can potentially
return millions of rows but on average probably return tens of thousands
of rows.  It's read only for most of the day, and pretty much all the
queries except one are really fast.


If it's just one query I think I'd focus on optimizing that query, not
.conf settings.  In my opinion .conf tuning (a few gotchas aside)
doesn't really get you all that much.


However, each night we load data from a legacy cobol system into the SQL
system and then we summarize that data to make the reports faster.  This
load process is intensely insert/update driven but also has a hefty
amount of selects as well.  This load process is taking ever longer to
complete.


SO ... our goal here is to make this load process take less time.  It
seems the big part is building the big summary table; this big summary
table is currently 9 million rows big.  Every night, we drop the table,
re-create it, build the 9 million rows of data (we use COPY to put the
data in when it's prepared, not INSERT), and then build the indexes on it
-- of which there are many.  Unfortunately this table gets queried
in a lot of different ways and needs these indexes; also unfortunately, we
have operator class indexes to support both ASC and DESC sorting on


I have some very specific advice here.  Check out the row-wise comparison
feature introduced in 8.2.
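
For illustration, a minimal sketch of the 8.2 row-wise comparison syntax;
the table and column names here are made up, not the poster's schema:

select *
  from summary
 where (col_a, col_b) > (42, 17)   -- the two columns are compared as one row value
 order by col_a, col_b
 limit 100;

In 8.2 a condition like this can be satisfied by a single scan of a
multicolumn index on (col_a, col_b), which is handy for "resume after the
last row seen" style paging queries.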


columns so these are for all intents and purposes duplicate but required
under Postgres 8.1 (we've recently upgraded to Postgres 8.2, is this still
a requirement?)



Building these indexes takes forever!  It's a long grind through inserts
and then building the indexes takes a hefty amount of time too.  (about 9
hours).  Now, the application is likely part at fault, and we're working
to make it more efficient, but it has nothing to do with the index
building time.  I'm wondering what we can do to make this better if
anything; would it be better to leave the indexes on?  It doesn't seem to
be.  Would it be better to use INSERTs instead of copies?  Doesn't seem to


no.

Probably any optimization strategy would focus on reducing the
amount of data you have to load.

merlin

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Configuration Advice

2007-01-18 Thread Arjen van der Meijden

On 18-1-2007 0:37 Adam Rich wrote:

4) Complex queries that might take advantage of the MySQL Query Cache
since the base data never changes


Have you ever compared MySQL's performance with complex queries to 
PostgreSQL's? I once had a query which would operate on a recordlist and 
see whether there were any gaps larger than 1 between consecutive 
primary keys.


Normally that information isn't very useful, but this time it was. 
Since the data was in MySQL I tried several variations of queries in 
MySQL... After ten minutes or so I gave up waiting, but left my last 
version running. In the meantime I dumped the data, reloaded the data 
in PostgreSQL and ran some test queries there. I came up with a query 
that took only 0.5 seconds on Postgres pretty soon. The query on MySQL 
still wasn't finished...
In my experience it is (even with the 5.0 release) easier to get good 
performance from complex queries in postgresql. And postgresql gives you 
more useful information on why a query takes a long time when using 
explain (analyze). There are some drawbacks too of course, but while we 
in our company use mysql I switched to postgresql for some read-only 
complex query stuff just for its performance...


Besides that, mysql rewrites the entire table for most table-altering 
statements you do (including indexes). For small tables that's no issue, 
but if you somehow can't add all your indexes in a single statement to a 
table, you'll be waiting a lot longer for new indexes than with 
postgresql. And that situation isn't so unusual if you think of a query 
which needs an index that isn't there yet. That's apart from the fact that 
it doesn't have functional indexes and such.


Long story short: MySQL still isn't the best performer when looking at 
the more complex queries. I've seen performance which made me assume it 
can't optimise sequential scans (when it is forced to loop using a seq 
scan it appears to do a new seq scan for each round in the loop...) and 
various other cases PostgreSQL can execute much more efficiently.


So unless you run the same queries many times and know of a way to 
get them fast enough the first time, the query cache is not much help.


Best regards,

Arjen

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Configuration Advice

2007-01-18 Thread Scott Marlowe
On Wed, 2007-01-17 at 18:27, Steve wrote:
  Generally speaking, once you've gotten to the point of swapping, even a
  little, you've gone too far.  A better approach is to pick some
  conservative number, like 10-25% of your ram for shared_buffers, and 1
  gig or so for maintenance work_mem, and then increase them while
  exercising the system, and measure the difference increasing them makes.
 
  If going from 1G shared buffers to 2G shared buffers gets you a 10%
  increase, then good.  If going from 2G to 4G gets you a 1.2% increase,
  it's questionable.  You should reach a point where throwing more
  shared_buffers stops helping before you start swapping.  But you might
  not.
 
  Same goes for maintenance work mem.  Incremental changes, accompanied by
  reproducible benchmarks / behaviour measurements are the way to
  determine the settings.
 
  Note that you can also vary those during different times of the day.
  you can have maint_mem set to 1Gig during the day and crank it up to 8
  gig or something while loading data.  Shared_buffers can't be changed
  without restarting the db though.
 
 
 I'm currently benchmarking various configuration adjustments.  Problem is 
 these tests take a really long time because I have to run the load 
 process... which is like a 9 hour deal.  That's why I'm asking for advice 
 here, because there's a lot of variables here and it's really time costly 
 to test :)
 
 I'm still working on the benchmarking and by Friday I should have some 
 interesting statistics to work with and maybe help figure out what's going 
 on.

You can probably take a portion of what you're loading and make a
benchmark of the load process that is repeatable (same data, size,
etc...) each time, but only takes 30 minutes to an hour to run each
time.  That shortens your test iterations AND makes them reliably repeatable.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Configuration Advice

2007-01-18 Thread Scott Marlowe
On Thu, 2007-01-18 at 04:24, Arjen van der Meijden wrote:
 On 18-1-2007 0:37 Adam Rich wrote:
  4) Complex queries that might take advantage of the MySQL Query Cache
  since the base data never changes
 
 Have you ever compared MySQL's performance with complex queries to 
 PostgreSQL's? I once had a query which would operate on a recordlist and 
 see whether there were any gaps larger than 1 between consecutive 
 primary keys.
 
 Normally that information isn't very useful, but this time it was. 
 Since the data was in MySQL I tried several variations of queries in 
 MySQL... After ten minutes or so I gave up waiting, but left my last 
 version running. In the meantime I dumped the data, reloaded the data 
 in PostgreSQL and ran some test queries there. I came up with a query 
 that took only 0.5 seconds on Postgres pretty soon. The query on MySQL 
 still wasn't finished...

I have had similar experiences in the past.  Conversely, I've had
similar things happen the other way around.  The biggest difference?  If
I report something like that happening in postgresql, it's easier to get
a fix or workaround, and if it's a code bug, the fix is usually released
as a patch within a day or two.  With MySQL, if it's a common problem,
then I can find it on the internet with google, otherwise it might take
a while to get a good workaround / fix.  And if it's a bug, it might
take much longer to get a working patch.

 In my experience it is (even with the 5.0 release) easier to get good 
 performance from complex queries in postgresql.

Agreed.  For data warehousing / OLAP stuff, postgresql is generally
better than mysql.  

 Besides that, mysql rewrites the entire table for most table-altering 
 statements you do (including indexes). For small tables that's no issue, 
 but if you somehow can't add all your indexes in a single statement to a 
 table, you'll be waiting a lot longer for new indexes than with 
 postgresql. And that situation isn't so unusual if you think of a query 
 which needs an index that isn't there yet. That's apart from the fact that 
 it doesn't have functional indexes and such.

Note that this applies to the myisam table type.  innodb works quite
differently.  It is more like pgsql in behaviour, and is an mvcc storage
engine.  Like all storage engines, it's a collection of compromises. 
In some areas it's better than pgsql, in others worse.  Sadly, it lives
under the hood of a database that can do some pretty stupid things, like
ignore column-level constraint definitions without telling you.

 Long story short: MySQL still isn't the best performer when looking at 
 the more complex queries. 

Agreed.  And those are the queries that REALLY kick your ass.  Or your
server's ass, I guess.

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Configuration Advice

2007-01-18 Thread Arjen van der Meijden

On 18-1-2007 17:20 Scott Marlowe wrote:
Besides that, mysql rewrites the entire table for most table-altering 
statements you do (including indexes). 


Note that this applies to the myisam table type.  innodb works quite
differently.  It is more like pgsql in behaviour, and is an mvcc storage


Afaik this is not engine-specific and also applies to InnoDB. Here is 
what the MySQL manual says about it:
In most cases, ALTER TABLE works by making a temporary copy of the 
original table. The alteration is performed on the copy, and then the 
original table is deleted and the new one is renamed. While ALTER TABLE 
 is executing, the original table is readable by other clients. Updates 
and writes to the table are stalled until the new table is ready, and 
then are automatically redirected to the new table without any failed 
updates.


http://dev.mysql.com/doc/refman/5.0/en/alter-table.html

If it were myisam-only they sure would've mentioned that. Besides, this 
is the behaviour we've seen on our site as well.


Since 'create index' is also an alter table statement for mysql, this 
also applies to adding indexes.


Best regards,

Arjen


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Configuration Advice

2007-01-18 Thread Jeremy Haile
 I once had a query which would operate on a recordlist and 
 see whether there were any gaps larger than 1 between consecutive 
 primary keys.

Would you mind sharing the query you described?  I am attempting to do
something similar now. 

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PERFORM] Configuration Advice

2007-01-18 Thread Arjen van der Meijden

On 18-1-2007 18:28 Jeremy Haile wrote:
I once had a query which would operate on a recordlist and 
see whether there were any gaps larger than 1 between consecutive 
primary keys.


Would you mind sharing the query you described?  I am attempting to do
something similar now. 



Well it was over a year ago, so I don't know what I did back then. But 
since it was a query adjusted from what I did in MySQL there were no 
subqueries involved, I think it was something like this:

select a.id, min(b.id)
 from
  members a
  join members b on a.id < b.id
  left join members c on a.id +1 = c.id
where c.id IS NULL
group by a.id;

Rewriting it to this one halves the execution time, though:

select a.id, min(b.id)
 from
  members a
  left join members c on a.id +1 = c.id
  join members b on a.id < b.id
where c.id IS NULL
group by a.id;

Although this query seems to be much faster with 150k records:

select aid, bid
from
(select a.id as aid, (select min(b.id) from members b where b.id > a.id) 
as bid
 from
  members a
group by a.id) as foo
where bid > aid+1;

The first one takes about 16 seconds on my system with PG 8.2, the 
second about 1.8 seconds. But back then the list was much shorter, so it 
may have been the first one or a variant on that. On MySQL the first 
takes much more than the 16 seconds PostgreSQL uses, and after editing 
this e-mail it still isn't finished... The second one made EXPLAIN hang 
in my 5.0.32-bk, so I didn't try that for real.


Best regards,

Arjen

PS, In case any of the planner-hackers are reading, here are the plans 
of the first two queries, just to see if something can be done to 
decrease the differences between them. The main difference seems to be 
the GroupAggregate vs the HashAggregate?


 GroupAggregate  (cost=34144.16..35144.38 rows=50011 width=8) (actual time=17653.401..23881.320 rows=71 loops=1)
   ->  Sort  (cost=34144.16..34269.19 rows=50011 width=8) (actual time=17519.274..21423.128 rows=7210521 loops=1)
         Sort Key: a.id
         ->  Nested Loop  (cost=11011.41..30240.81 rows=50011 width=8) (actual time=184.412..10945.189 rows=7210521 loops=1)
               ->  Hash Left Join  (cost=11011.41..28739.98 rows=1 width=4) (actual time=184.384..1452.467 rows=72 loops=1)
                     Hash Cond: ((a.id + 1) = c.id)
                     Filter: (c.id IS NULL)
                     ->  Seq Scan on members a  (cost=0.00..9903.33 rows=150033 width=4) (actual time=0.009..71.463 rows=150033 loops=1)
                     ->  Hash  (cost=9903.33..9903.33 rows=150033 width=4) (actual time=146.040..146.040 rows=150033 loops=1)
                           ->  Seq Scan on members c  (cost=0.00..9903.33 rows=150033 width=4) (actual time=0.002..77.066 rows=150033 loops=1)
               ->  Index Scan using members_pkey on members b  (cost=0.00..875.69 rows=50011 width=4) (actual time=0.025..78.971 rows=100146 loops=72)
                     Index Cond: (a.id < b.id)
 Total runtime: 23882.511 ms
(13 rows)

 HashAggregate  (cost=30240.82..30240.83 rows=1 width=8) (actual time=12870.440..12870.504 rows=71 loops=1)
   ->  Nested Loop  (cost=11011.41..30240.81 rows=1 width=8) (actual time=168.658..9466.644 rows=7210521 loops=1)
         ->  Hash Left Join  (cost=11011.41..28739.98 rows=1 width=4) (actual time=168.630..865.690 rows=72 loops=1)
               Hash Cond: ((a.id + 1) = c.id)
               Filter: (c.id IS NULL)
               ->  Seq Scan on members a  (cost=0.00..9903.33 rows=150033 width=4) (actual time=0.012..70.612 rows=150033 loops=1)
               ->  Hash  (cost=9903.33..9903.33 rows=150033 width=4) (actual time=140.432..140.432 rows=150033 loops=1)
                     ->  Seq Scan on members c  (cost=0.00..9903.33 rows=150033 width=4) (actual time=0.003..76.709 rows=150033 loops=1)
         ->  Index Scan using members_pkey on members b  (cost=0.00..875.69 rows=50011 width=4) (actual time=0.023..73.317 rows=100146 loops=72)
               Index Cond: (a.id < b.id)
 Total runtime: 12870.756 ms
(11 rows)

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Configuration Advice

2007-01-18 Thread Tom Lane
Arjen van der Meijden [EMAIL PROTECTED] writes:
 PS, In case any of the planner-hackers are reading, here are the plans 
 of the first two queries, just to see if something can be done to 
 decrease the differences between them.

Increase work_mem?  It's not taking the hash because it thinks it won't
fit in memory ...

There is a bug here, I'd say: the rowcount estimate ought to be the same
either way.  Dunno why it's not, but will look --- I see the same
misbehavior with a toy table here.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Configuration Advice

2007-01-18 Thread Arjen van der Meijden

On 18-1-2007 23:11 Tom Lane wrote:

Increase work_mem?  It's not taking the hash because it thinks it won't
fit in memory ...


When I increase it to 128MB in the session (an arbitrarily selected, 
relatively large value) it indeed uses the other plan.
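
For reference, a session-local experiment like that needs nothing more than
the following (8.2 accepts memory units in SET; the query is the first
variant posted earlier in the thread):

SET work_mem = '128MB';   -- affects only this session; postgresql.conf is untouched
EXPLAIN ANALYZE
select a.id, min(b.id)
  from members a
  join members b on a.id < b.id
  left join members c on a.id + 1 = c.id
 where c.id IS NULL
 group by a.id;
RESET work_mem;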


Best regards,

Arjen

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Adam Rich

Doesn't sound like you want postgres at all.  Try mysql.



-Original Message-
From: Steve [EMAIL PROTECTED]
To: pgsql-performance@postgresql.org
Sent: 1/17/2007 2:41 PM
Subject: [PERFORM] Configuration Advice

Hey there;

I've been lurking on this list awhile, and I've been working with postgres 
for a number of years so I'm not exactly new to this.  But I'm still 
having trouble getting a good balance of settings and I'd like to see what 
other people think.  We may also be willing to hire a contractor to help 
tackle this problem if anyone is interested.

I've got an application here that runs large (in terms of length -- the 
queries have a lot of conditions in them) queries that can potentially 
return millions of rows but on average probably return tens of thousands 
of rows.  It's read only for most of the day, and pretty much all the 
queries except one are really fast.

However, each night we load data from a legacy cobol system into the SQL 
system and then we summarize that data to make the reports faster.  This 
load process is intensely insert/update driven but also has a hefty 
amount of selects as well.  This load process is taking ever longer to 
complete.


SO ... our goal here is to make this load process take less time.  It 
seems the big part is building the big summary table; this big summary 
table is currently 9 million rows big.  Every night, we drop the table, 
re-create it, build the 9 million rows of data (we use COPY to put the 
data in when it's prepared, not INSERT), and then build the indexes on it 
-- of which there are many.  Unfortunately this table gets queried 
in a lot of different ways and needs these indexes; also unfortunately, we 
have operator class indexes to support both ASC and DESC sorting on 
columns so these are for all intents and purposes duplicate but required 
under Postgres 8.1 (we've recently upgraded to Postgres 8.2, is this still 
a requirement?)

Building these indexes takes forever!  It's a long grind through inserts 
and then building the indexes takes a hefty amount of time too.  (about 9 
hours).  Now, the application is likely part at fault, and we're working 
to make it more efficient, but it has nothing to do with the index 
building time.  I'm wondering what we can do to make this better if 
anything; would it be better to leave the indexes on?  It doesn't seem to 
be.  Would it be better to use INSERTs instead of copies?  Doesn't seem to 
be.


Anyway -- ANYTHING we can do to make this go faster is appreciated :) 
Here's some vital statistics:

- Machine is a 16 GB, 4 actual CPU dual-core opteron system using SCSI 
discs.  The disc configuration seems to be a good one, it's the best of 
all the ones we've tested so far.

- The load process itself takes about 6 gigs of memory, the rest is free 
for postgres because this is basically all the machine does.

- If this was your machine and situation, how would you lay out the memory 
settings?  What would you set the FSM to?  Would you leave the bgwriter on 
or off?  We've already got FSYNC off because data integrity doesn't 
matter -- this stuff is religiously backed up and we've got no problem 
reinstalling it.  Besides, in order for this machine to go down, data 
integrity of the DB is the least of the worries :)

Do wal_buffers/full_page_writes matter if FSYNC is off?  If so, what 
settings?  What about checkpoints?

Any finally, any ideas on planner constants?  Here's what I'm using:

seq_page_cost = 0.5 # measured on an arbitrary scale
random_page_cost = 1.0  # same scale as above
cpu_tuple_cost = 0.001  # same scale as above
cpu_index_tuple_cost = 0.0001   # same scale as above
cpu_operator_cost = 0.00025 # same scale as above
effective_cache_size = 679006

I really don't remember how I came up with that effective_cache_size 
number


Anyway... any advice would be appreciated :)


Steve

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Bricklen Anderson

Adam Rich wrote:

Doesn't sound like you want postgres at all.  Try mysql.


Could you explain your reason for suggesting mysql? I'm simply curious 
why you would offer that as a solution.


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Tomas Vondra
 Any finally, any ideas on planner constants?  Here's what I'm using:
 
 seq_page_cost = 0.5 # measured on an arbitrary scale
 random_page_cost = 1.0  # same scale as above
 cpu_tuple_cost = 0.001  # same scale as above
 cpu_index_tuple_cost = 0.0001   # same scale as above
 cpu_operator_cost = 0.00025 # same scale as above
 effective_cache_size = 679006
 
 I really don't remember how I came up with that effective_cache_size
 number

I don't have much experience with the way your application works, but:

1) What is the size of the whole database? Does that fit in your memory?
   That's the first thing I'd like to know and I can't find it in your
   post.

   I'm missing several other important values too - namely

 shared_buffers
 max_fsm_pages
 work_mem
 maintenance_work_mem

   BTW, is the autovacuum daemon running? If yes, try to stop it during
   the import (and run ANALYZE after the import of data).

2) What is the size of a disc page? Without that we can only guess what
   the effective_cache_size number means - in the usual case it's
   8kB, thus giving about 5.2 GiB of memory.

   As suggested in http://www.powerpostgresql.com/PerfList I'd increase
   that to about 1,400,000, which is about 10.5 GiB (about 2/3 of RAM).

   Anyway - don't be afraid this breaks something. This is just
   information for PostgreSQL about how much memory the OS is probably using
   as a filesystem cache. PostgreSQL uses this to evaluate the
   probability that a page is in the cache.

3) What is the value of maintenance_work_mem? This is a very important
   value for CREATE INDEX (and some other operations). The lower this value is,
   the slower the CREATE INDEX is. So try to increase the value as much
   as you can - this could / should improve the import performance
   considerably.

   But be careful - this does influence the amount of memory allocated
   by PostgreSQL. Being in your position I wouldn't do this in
   postgresql.conf - I'd do it in the connection used by the import
   using the SET command, i.e. something like

   SET maintenance_work_mem = 524288;
   CREATE INDEX ...
   CREATE INDEX ...
   CREATE INDEX ...
   CREATE INDEX ...

   for a 512 MiB of maintenance_work_mem. Maybe even a higher value
   could be used (1 GiB?). Just try to fiddle with this a little.

4) Try to set up some performance monitoring - for example 'dstat' is
   a nice way to do that. This way you can find out where the
   bottleneck is (memory, I/O etc.)

That's basically all I can think of right now.

Tomas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Chad Wagner

On 1/17/07, Steve [EMAIL PROTECTED] wrote:


However, each night we load data from a legacy cobol system into the SQL
system and then we summarize that data to make the reports faster.  This
load process is intensely insert/update driven but also has a hefty
amount of selects as well.  This load process is taking ever longer to
complete.



How many rows do you typically load each night?  If it is, say, less than 10%
of the total rows, then perhaps the suggestion in the next paragraph is
reasonable.

SO ... our goal here is to make this load process take less time.  It

seems the big part is building the big summary table; this big summary
table is currently 9 million rows big.  Every night, we drop the table,
re-create it, build the 9 million rows of data (we use COPY to put the
data in when it's prepared, not INSERT), and then build the indexes on it



Perhaps placing a trigger on the source table and building a change log
would be useful.  For example, you could scan the change log (looking for
inserts, updates, and deletes) and integrate those changes into your summary
table.  Obviously if you are using complex aggregates it may not be possible
to adjust the summary table, but if you are performing simple SUMs,
COUNTs, etc., then this is a workable solution -- see the sketch below.
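
A minimal sketch of that change-log idea, assuming a hypothetical source
table named source_table with an integer primary key id (the real schema
will of course differ):

CREATE TABLE source_change_log (
    changed_at  timestamptz NOT NULL DEFAULT now(),
    op          char(1)     NOT NULL,   -- 'I'nsert, 'U'pdate or 'D'elete
    source_id   integer     NOT NULL
);

CREATE OR REPLACE FUNCTION log_source_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO source_change_log (op, source_id) VALUES ('D', OLD.id);
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO source_change_log (op, source_id) VALUES ('U', NEW.id);
    ELSE
        INSERT INTO source_change_log (op, source_id) VALUES ('I', NEW.id);
    END IF;
    RETURN NULL;   -- AFTER row trigger, the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER source_change_log_trg
    AFTER INSERT OR UPDATE OR DELETE ON source_table
    FOR EACH ROW EXECUTE PROCEDURE log_source_change();

The nightly job would then read source_change_log, fold the corresponding
adjustments into the summary table, and truncate the log.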


--
Chad
http://www.postgresqlforums.com/


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Heikki Linnakangas

Steve wrote:
SO ... our goal here is to make this load process take less time.  It 
seems the big part is building the big summary table; this big summary 
table is currently 9 million rows big.  Every night, we drop the table, 
re-create it, build the 9 million rows of data (we use COPY to put the 
data in when it's prepared, not INSERT), and then build the indexes on 
it -- of which there are many.  


Would it be possible to just update the summary table, instead of 
recreating it from scratch every night?


Unfortunately this table gets queried in 
a lot of different ways and needs these indexes; also unfortunately, we 
have operator class indexes to support both ASC and DESC sorting on 
columns so these are for all intents and purposes duplicate but required 
under Postgres 8.1 (we've recently upgraded to Postgres 8.2, is this 
still a requirement?)


I don't think this has changed in 8.2.

Building these indexes takes forever!  It's a long grind through inserts 
and then building the indexes takes a hefty amount of time too.  (about 
9 hours).  Now, the application is likely part at fault, and we're 
working to make it more efficient, but it has nothing to do with the 
index building time.  I'm wondering what we can do to make this better 
if anything; would it be better to leave the indexes on?  It doesn't 
seem to be.  Would it be better to use INSERTs instead of copies?  
Doesn't seem to be.


Would it help if you created multiple indexes simultaneously? You have 
enough CPU to do it. Is the index creation CPU or I/O bound? 9 million 
rows should fit in 16 GB of memory, right?
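
For example (a sketch only, with made-up index and column names): the same
CREATE INDEX statements as today, but issued from two or more separate
connections at the same time, each with its own slice of the memory budget:

-- session 1
SET maintenance_work_mem = '1GB';
CREATE INDEX summary_col_a_idx ON big_summary (col_a);

-- session 2, started concurrently from another connection
SET maintenance_work_mem = '1GB';
CREATE INDEX summary_col_b_idx ON big_summary (col_b);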


- The load process itself takes about 6 gigs of memory, the rest is free 
for postgres because this is basically all the machine does.


Can you describe the load process in more detail? What's it doing with 
the 6 gigs?


- If this was your machine and situation, how would you lay out the 
memory settings?  What would you set the FSM to? 


FSM seems irrelevant here..

Do wal_buffers/full_page_writes matter if FSYNC is off?  


Better turn off full_page_writes, since you can kiss goodbye to data 
integrity anyway with fsync=off.



Anyway... any advice would be appreciated :)


What's your maintenance_work_mem setting? It can make a big difference 
in sorting the data for indexes.


If you could post the schema including the indexes, people might have 
more ideas...


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Steve


1) What is the size of the whole database? Does that fit in your memory?
  That's the first thing I'd like to know and I can't find it in your
  post.


Current on-disk size is about 51 gig.  I'm not sure if there's a different 
size I should be looking at instead, but that's what du tells me the 
directory for the database in the base directory is sized at.  So, no, 
it doesn't fit into memory all the way.



  I'm missing several other important values too - namely

shared_buffers
max_fsm_pages
work_mem
maintenance_work_mem



I didn't share these because they've been in flux :)  I've been 
experimenting with different values, but currently we're using:


8GB shared_buffers
10  max_fsm_pages
256MB   work_mem
6GB maintenance_work_mem


  BTW, is the autovacuum daemon running? If yes, try to stop it during
  the import (and run ANALYZE after the import of data).


No.  All vacuums are done explicitly since the database doesn't change 
during the day.  The 'order of operations' is:


- Load COBOL data into database (inserts/updates)
- VACUUM COBOL data
- Summarize COBOL data (inserts/updates with the big table using COPY)
- VACUUM summary tables

So everything gets vacuumed as soon as it's updated.


2) What is the size of a disc page? Without that we can only guess what
  the effective_cache_size number means - in the usual case it's
  8kB, thus giving about 5.2 GiB of memory.



I believe it's 8kB.  I definitely haven't changed it :)


  As suggested in http://www.powerpostgresql.com/PerfList I'd increase
  that to about 1,400,000, which is about 10.5 GiB (about 2/3 of RAM).

  Anyway - don't be afraid this breaks something. This is just
  information for PostgreSQL about how much memory the OS is probably using
  as a filesystem cache. PostgreSQL uses this to evaluate the
  probability that a page is in the cache.



Okay, I'll try the value you recommend. :)



3) What is the value of maintenance_work_mem? This is a very important
  value for CREATE INDEX (and some other). The lower this value is,
  the slower the CREATE INDEX is. So try to increase the value as much
  as you can - this could / should improve the import performance
  considerably.

  But be careful - this does influence the amount of memory allocated
  by PostgreSQL. Being in your position I wouldn't do this in
  postgresql.conf - I'd do it in the connection used by the import
  using the SET command, i.e. something like

  SET maintenance_work_mem = 524288;
  CREATE INDEX ...
  CREATE INDEX ...
  CREATE INDEX ...
  CREATE INDEX ...

  for a 512 MiB of maintenance_work_mem. Maybe even a higher value
  could be used (1 GiB?). Just try to fiddle with this a little.


	It's currently at 6GB in postgresql.conf, though you have a good 
point in that maybe it should be set right before the indexes are made, to save 
room.  Things are certainly kinda tight in the config as is.



4) Try to set up some performance monitoring - for example 'dstat' is
  a nice way to do that. This way you can find out where the
  bottleneck is (memory, I/O etc.)

That's basically all I can think of right now.



Thanks for the tips :)


Steve

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Steve



On Wed, 17 Jan 2007, Benjamin Minshall wrote:




Building these indexes takes forever!


Anyway -- ANYTHING we can do to make this go faster is appreciated :) 
Here's some vital statistics:


- Machine is a 16 GB, 4 actual CPU dual-core opteron system using SCSI 
discs.  The disc configuration seems to be a good one, it's the best of all 
the ones we've tested so far.


What are your shared_buffers, work_mem, and maintenance_work_mem settings?

maintenance_work_mem is used for CREATE INDEX, and with 16GB of memory in the 
machine, maintenance_work_mem should be set to at least 1GB in my opinion.




shared_buffers = 8GB
work_mem = 256MB
maintenance_work_mem = 6GB

So that should be covered, unless I'm using too much memory and swapping. 
It does look like it's swapping a little, but not too badly as far as I 
can tell.  I'm thinking of dialing back everything a bit, but I'm not 
really sure what the heck to do :)  It's all guessing for me right now.



Steve

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Benjamin Minshall



Building these indexes takes forever!


Anyway -- ANYTHING we can do to make this go faster is appreciated :) 
Here's some vital statistics:


- Machine is a 16 GB, 4 actual CPU dual-core opteron system using SCSI 
discs.  The disc configuration seems to be a good one, it's the best of 
all the ones we've tested so far.


What are your shared_buffers, work_mem, and maintenance_work_mem settings?

maintenance_work_mem is used for CREATE INDEX, and with 16GB of memory 
in the machine, maintenance_work_mem should be set to at least 1GB in my 
opinion.


--
Benjamin Minshall [EMAIL PROTECTED]
Senior Developer -- Intellicon, Inc.
http://www.intellicon.biz




Re: [PERFORM] Configuration Advice

2007-01-17 Thread Joshua D. Drake
Bricklen Anderson wrote:
 Adam Rich wrote:
 Doesn't sound like you want postgres at all.  Try mysql.
 
 Could you explain your reason for suggesting mysql? I'm simply curious
 why you would offer that as a solution.

He sounds a little trollish to me. I would refer to the other, actually
helpful posts on the topic.

Sincerely,

Joshua D. Drake


 
 ---(end of broadcast)---
 TIP 5: don't forget to increase your free space map settings
 


-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Bruno Wolff III
 From: Steve [EMAIL PROTECTED]
 To: pgsql-performance@postgresql.org
 Sent: 1/17/2007 2:41 PM
 Subject: [PERFORM] Configuration Advice
 
 SO ... our goal here is to make this load process take less time.  It 
 seems the big part is building the big summary table; this big summary 
 table is currently 9 million rows big.  Every night, we drop the table, 
 re-create it, build the 9 million rows of data (we use COPY to put the 
 data in when it's prepared, not INSERT), and then build the indexes on it 
 -- of which there are many.  Unfortunately this table gets queried 
 in a lot of different ways and needs these indexes; also unfortunately, we 
 have operator class indexes to support both ASC and DESC sorting on 
 columns so these are for all intents and purposes duplicate but required 
 under Postgres 8.1 (we've recently upgraded to Postgres 8.2, is this still 
 a requirement?)

Note that you only need to have the ASC and DESC versions of opclasses when
you are going to use multicolumn indexes with some columns in ASC order and
some in DESC order. For columns used by themselves in an index, you don't
need to do this, no matter which order you are sorting on.
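
In other words (with made-up names, and a hypothetical reverse-sort operator
class called int4_desc_ops standing in for whatever the custom opclass is
actually named): only the first of these needs the special opclass under
8.1/8.2, because a plain single-column index can simply be scanned backwards
for DESC ordering:

-- mixed-order multicolumn index: needs the reverse-sort operator class
CREATE INDEX summary_a_asc_b_desc_idx ON big_summary (col_a, col_b int4_desc_ops);

-- single-column case: one ordinary index covers both ASC and DESC sorts
CREATE INDEX summary_a_idx ON big_summary (col_a);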

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Scott Marlowe
On Wed, 2007-01-17 at 15:58, Steve wrote:
 On Wed, 17 Jan 2007, Benjamin Minshall wrote:
 
 
  Building these indexes takes forever!
 
  Anyway -- ANYTHING we can do to make this go faster is appreciated :) 
  Here's some vital statistics:
 
  - Machine is a 16 GB, 4 actual CPU dual-core opteron system using SCSI 
  discs.  The disc configuration seems to be a good one, it's the best of 
  all 
  the ones we've tested so far.
 
  What are your shared_buffers, work_mem, and maintenance_work_mem settings?
 
  maintenance_work_mem is used for CREATE INDEX, and with 16GB of memory in 
  the 
  machine, maintenance_work_mem should be set to at least 1GB in my opinion.
 
 
 shared_buffers = 8GB
 work_mem = 256MB
 maintenance_work_mem = 6GB
 
 So that should be covered, unless I'm using too much memory and swapping. 
 It does look like it's swapping a little, but not too badly as far as I 
 can tell.  I'm thinking of dialing back everything a bit, but I'm not 
 really sure what the heck to do :)  It's all guessing for me right now.

Generally speaking, once you've gotten to the point of swapping, even a
little, you've gone too far.  A better approach is to pick some
conservative number, like 10-25% of your ram for shared_buffers, and 1
gig or so for maintenance work_mem, and then increase them while
exercising the system, and measure the difference increasing them makes.

If going from 1G shared buffers to 2G shared buffers gets you a 10%
increase, then good.  If going from 2G to 4G gets you a 1.2% increase,
it's questionable.  You should reach a point where throwing more
shared_buffers stops helping before you start swapping.  But you might
not.

Same goes for maintenance work mem.  Incremental changes, accompanied by
reproducible benchmarks / behaviour measurements are the way to
determine the settings.

Note that you can also vary those during different times of the day. 
you can have maint_mem set to 1Gig during the day and crank it up to 8
gig or something while loading data.  Shared_buffers can't be changed
without restarting the db though.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Steve

How many rows do you typically load each night?  If it is say less than 10%
of the total rows, then perhaps the suggestion in the next paragraph is
reasonable.


	Hrm.  It's very, very variable.  I'd say it's more than 10% on 
average, and it can actually be pretty close to 50-100% on certain days. 
Our data is based upon customer submissions, and normally it's a daily 
basis kind of deal, but sometimes they'll resubmit their entire year on 
certain deadlines to make sure it's all in.  Now, we don't have to 
optimize for those deadlines, just the 'average daily load'.  It's okay if 
on those deadlines it takes forever, because that's understandable.


	However, I will look into this and see if I can figure out this 
average value.  This may be a valid idea, and I'll look some more at it.



Thanks!

Steve


SO ... our goal here is to make this load process take less time.  It

seems the big part is building the big summary table; this big summary
table is currently 9 million rows big.  Every night, we drop the table,
re-create it, build the 9 million rows of data (we use COPY to put the
data in when it's prepared, not INSERT), and then build the indexes on it



Perhaps, placing a trigger on the source table and building a change log
would be useful.  For example, you could scan the change log (looking for
insert, update, and deletes) and integrate those changes into your summary
table.  Obviously if you are using complex aggregates it may not be possible
to adjust the summary table, but if you are performing simple SUM's,
COUNT's, etc.  then this is a workable solution.


--
Chad
http://www.postgresqlforums.com/



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Adam Rich

Sorry if this came off sounding trollish.  All databases have their
strengths & weaknesses, and I feel the responsible thing to do is
exploit
those strengths where possible, rather than expend significant time and
effort coaxing one database to do something it wasn't designed to.
There's just no such thing as one size fits all.

I have professional experience with MS-SQL, Oracle, MySQL, and Postgres,
and the scenario described sounds more ideal for MySQL & MyISAM than 
anything else:

1) No concerns for data loss (turning fsync & full_page_writes off)
since the data can be reloaded

2) No need for MVCC or transactions, since the database is read-only

3) No worries about lock contention

4) Complex queries that might take advantage of the MySQL Query Cache
since the base data never changes

5) Queries that might obtain data directly from indexes without having
to touch tables (again, no need for MVCC)

If loading in the base data and creating the summary table is taking 
a lot of time, using MySQL with MyISAM tables (and binary logging
disabled) should provide significant time savings, and it doesn't 
sound like there are any concerns about the downsides.  

Yes, postgresql holds an edge over MySQL for heavy OLTP applications;
I use it for that and I love it.  But for the scenario the original 
poster is asking about, MySQL/MyISAM is ideal.  




-Original Message-
From: Bricklen Anderson [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, January 17, 2007 3:29 PM
To: Adam Rich
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Configuration Advice


Adam Rich wrote:
 Doesn't sound like you want postgres at all.  Try mysql.

Could you explain your reason for suggesting mysql? I'm simply curious 
why you would offer that as a solution.


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Dave Cramer


On 17-Jan-07, at 3:41 PM, Steve wrote:


Hey there;

I've been lurking on this list awhile, and I've been working with  
postgres for a number of years so I'm not exactly new to this.  But  
I'm still having trouble getting a good balance of settings and I'd  
like to see what other people think.  We may also be willing to  
hire a contractor to help tackle this problem if anyone is interested.


I've got an application here that runs large (in terms of length --  
the queries have a lot of conditions in them) queries that can  
potentially return millions of rows but on average probably return  
tens of thousands of rows.  It's read only for most of the day, and  
pretty much all the queries except one are really fast.


However, each night we load data from a legacy cobol system into  
the SQL system and then we summarize that data to make the reports  
faster.  This load process is intensely insert/update driven but  
also has a hefty amount of selects as well.  This load process is  
taking ever longer to complete.



SO ... our goal here is to make this load process take less time.   
It seems the big part is building the big summary table; this big  
summary table is currently 9 million rows big.  Every night, we  
drop the table, re-create it, build the 9 million rows of data (we  
use COPY to put the data in when it's prepared, not INSERT), and  
then build the indexes on it -- of which there are many.   
Unfortunately this table gets queried in a lot of different ways  
and needs these indexes; also unfortunately, we have operator class  
indexes to support both ASC and DESC sorting on columns so these  
are for all intents and purposes duplicate but required under  
Postgres 8.1 (we've recently upgraded to Postgres 8.2, is this  
still a requirement?)


Building these indexes takes forever!  It's a long grind through  
inserts and then building the indexes takes a hefty amount of time  
too.  (about 9 hours).  Now, the application is likely part at  
fault, and we're working to make it more efficient, but it has  
nothing to do with the index building time.  I'm wondering what we  
can do to make this better if anything; would it be better to leave  
the indexes on?  It doesn't seem to be.  Would it be better to use  
INSERTs instead of copies?  Doesn't seem to be.



Anyway -- ANYTHING we can do to make this go faster is  
appreciated :) Here's some vital statistics:


- Machine is a 16 GB, 4 actual CPU dual-core opteron system using  
SCSI discs.  The disc configuration seems to be a good one, it's  
the best of all the ones we've tested so far.


The basic problem here is simply writing all the data to disk. You  
are building 9M rows of data plus numerous indexes. How much data are  
you actually writing to the disk? Try looking at iostat while this is  
going on.


My guess is you are maxing out the disk write speed.
- The load process itself takes about 6 gigs of memory, the rest is  
free for postgres because this is basically all the machine does.


- If this was your machine and situation, how would you lay out the  
memory settings?  What would you set the FSM to?  Would you leave  
the bgwriter on or off?  We've already got FSYNC off because data  
integrity doesn't matter -- this stuff is religiously backed up  
and we've got no problem reinstalling it.  Besides, in order for  
this machine to go down, data integrity of the DB is the least of  
the worries :)


Do wal_buffers/full_page_writes matter if FSYNC is off?  If so,  
what settings?  What about checkpoints?



Not really; I'd have WAL buffers write to a RAM disk

Any finally, any ideas on planner constants?  Here's what I'm using:

seq_page_cost = 0.5 # measured on an arbitrary  
scale

random_page_cost = 1.0  # same scale as above
cpu_tuple_cost = 0.001  # same scale as above
cpu_index_tuple_cost = 0.0001   # same scale as above
cpu_operator_cost = 0.00025 # same scale as above
effective_cache_size = 679006



As a general rule make shared buffers about 25% of free mem and  
effective cache 75%, but with a write-intensive load like you have I  
think the first thing to look at is write speed.
I really don't remember how I came up with that  
effective_cache_size number



Anyway... any advice would be appreciated :)


Steve

---(end of  
broadcast)---

TIP 2: Don't 'kill -9' the postmaster




---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Steve
Would it be possible to just update the summary table, instead of recreating 
it from scratch every night?


	Hrm, I believe it's probably less work for the computer to do if 
it's rebuilt.  Any number of rows may be changed during an update, not 
including additions, so I'd have to pull out what's changed and sync it 
with what's in the summary table already.  It'll be a lot more selects and 
program-side computation to save the big copy; it might work out, but I'd 
say this would be my last ditch thing. :)


Building these indexes takes forever!  It's a long grind through inserts 
and then building the indexes takes a hefty amount of time too.  (about 9 
hours).  Now, the application is likely part at fault, and we're working to 
make it more efficient, but it has nothing to do with the index building 
time.  I'm wondering what we can do to make this better if anything; would 
it be better to leave the indexes on?  It doesn't seem to be.  Would it be 
better to use INSERTs instead of copies?  Doesn't seem to be.


Would it help if you created multiple indexes simultaneously? You have enough 
CPU to do it. Is the index creation CPU or I/O bound? 9 million rows should 
fit in 16 GB of memory, right?


	This is a very very interesting idea.  It looks like we're 
probably not fully utilizing the machine for the index build, and this 
could be the ticket for us.  I'm going to go ahead and set up a test for 
this and we'll see how it goes.


Can you describe the load process in more detail? What's it doing with the 6 
gigs?


	There's two halves to the load process; the loader and the 
summarizer.  The loader is the part that takes 6 gigs; the summarizer only 
takes a few hundred MEG.


	Basically we have these COBOL files that vary in size but 
are usually in the hundreds-of-MEG realm.  These files contain new data 
OR updates to existing data.  We load this data from the COBOL files in 
chunks, so that's not a place where we're burning a lot of memory.


	The first thing we do is cache the list of COBOL ID codes that are 
already in the DB; the COBOL ID codes are really long numeric strings, so 
we use a sequenced integer primary key.  The cache translates COBOL IDs to 
primary keys, and this takes most of our memory nowadays.  Our cache is 
fast, but it's kind of a memory hog.  We're working on trimming that down, 
but it's definitely faster than making a query for each COBOL ID.


	The load is relatively fast and is considered acceptable, and 
has been relatively constant in speed.  It's the summarizer that's brutal.


	The summarizer produces 3 main summary tables and a few 
secondaries that don't take much time to make.  Two of them are smallish 
and not that big a deal, and the last one is the biggie that's 9 mil rows 
and growing.  To produce the 9 mil row table, we query out the data in 
groups, do our processing, and save that data to a series of text files 
that are in blocks of 10,000 rows as I recall.  We then copy each file 
into the DB (there were some issues with copying in an entire 9 mil row 
file in the past, which is why we don't use just one file -- those issues 
have been fixed, but we didn't undo the change).


What's your maintenance_work_mem setting? It can make a big difference in 
sorting the data for indexes.


6 gigs currently. :)

If you could post the schema including the indexes, people might have more 
ideas...


I'll have to ask first, but I'll see if I can :)

Talk to you later, and thanks for the info!


Steve

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Steve

Note that you only need to have the ASC and DESC versions of opclasses when
you are going to use multicolumn indexes with some columns in ASC order and
some in DESC order. For columns used by themselves in an index, you don't
need to do this, no matter which order you are sorting on.



Yeah, I assumed the people 'in the know' on this kind of stuff would know 
the details of why I have to have those, and therefore I wouldn't have to 
go into detail as to why -- but you put your finger right on it. :) 
Unfortunately the customer this is for wants certain columns joined at the 
hip for querying and sorting, and this method was a performance godsend 
when we implemented it (with a C .so library, not using SQL in our 
opclasses or anything like that).



Steve

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Chad Wagner

On 1/17/07, Steve [EMAIL PROTECTED] wrote:


However, I will look into this and see if I can figure out this
average value.  This may be a valid idea, and I'll look some more at it.



It must be, Oracle sells it pretty heavily as a data warehousing feature
;).  Oracle calls it a materialized view, and the basic premise is you have
a change log (called a materialized log by Oracle) and you have a job that
runs through the change log and applies the changes to the materialized
view.

If you are using aggregates, be careful and make sure you use simple forms
of those aggregates.  For example, if you are using an average function,
then you should store two columns, sum and count, instead.  Some aggregates are
too complex and cannot be represented by this solution, and you will find
that you can't update the summary tables, so definitely try to stay away
from complex aggregates if you do not need them.
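
A small illustration of that point, with hypothetical names: store the
additive pieces and derive the average at query time, so the summary row
can be adjusted incrementally from a change log:

CREATE TABLE sales_summary (
    customer_id  integer PRIMARY KEY,
    amount_sum   numeric NOT NULL DEFAULT 0,
    amount_count bigint  NOT NULL DEFAULT 0
);

SELECT customer_id,
       amount_sum / NULLIF(amount_count, 0) AS avg_amount
  FROM sales_summary;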

Here is a link to a PL/pgSQL effort that tries to simulate materialized
views:

http://jonathangardner.net/PostgreSQL/materialized_views/matviews.html

I don't know how complete it is, and it looks like a project was
started but has been abandoned for the last 3 years.

--
Chad
http://www.postgresqlforums.com/


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Chris Browne
[EMAIL PROTECTED] (Steve) writes:
 I'm wondering what we can do to make
 this better if anything; would it be better to leave the indexes on?
 It doesn't seem to be.  

Definitely NOT.  Generating an index via a bulk sort is a LOT faster
than loading data into an index one tuple at a time.

We saw a BIG increase in performance of Slony-I when, in version
1.1.5, we added a modification that shuts off indexes during COPY and
then does a reindex.  Conceivably, you might look at how Slony-I does
that, and try doing the same thing; it might well be faster than doing
a bunch of reindexes serially.  (Or not...)

 Would it be better to use INSERTs instead of copies?  Doesn't seem
 to be.

I'd be mighty surprised.

 - The load process itself takes about 6 gigs of memory, the rest is
 free for postgres because this is basically all the machine does.

The settings you have do not seem conspicuously wrong in any way.

The one thought which comes to mind is that if you could turn this
into a *mostly* incremental change, that might help.

The thought:

 - Load the big chunk of data into a new table

 - Generate some minimal set of indices on the new table

 - Generate four queries that compare old to new:
  q1 - See which tuples are unchanged from yesterday to today
  q2  - See which tuples have been deleted from yesterday to today
  q3  - See which tuples have been added
  q4  - See which tuples have been modified

 If the unchanged set is extremely large, then you might see benefit
 to doing updates based on deleting the rows indicated by q2,
 inserting rows based on q3, and updating based on q4.  

In principle, computing and applying those 4 queries might be quicker
than rebuilding from scratch.

In principle, applying q2, then q4, then vacuuming, then q3, ought to
be optimal.
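
A rough sketch of applying q2, q4 and q3 in that order, assuming
hypothetical tables summary (yesterday's version) and summary_new (tonight's
freshly loaded copy), both keyed on id with a single payload column val:

-- q2: remove rows that disappeared since yesterday
DELETE FROM summary
 WHERE NOT EXISTS (SELECT 1 FROM summary_new n WHERE n.id = summary.id);

-- q4: update rows whose payload changed
UPDATE summary
   SET val = n.val
  FROM summary_new n
 WHERE n.id = summary.id
   AND n.val IS DISTINCT FROM summary.val;

VACUUM summary;

-- q3: add the brand-new rows
INSERT INTO summary (id, val)
SELECT n.id, n.val
  FROM summary_new n
 WHERE NOT EXISTS (SELECT 1 FROM summary WHERE summary.id = n.id);
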
-- 
let name=cbbrowne and tld=cbbrowne.com in name ^ @ ^ tld;;
http://linuxdatabases.info/info/linux.html
A 'Cape Cod Salsa' just isn't right. -- Unknown

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Steve

Generally speaking, once you've gotten to the point of swapping, even a
little, you've gone too far.  A better approach is to pick some
conservative number, like 10-25% of your ram for shared_buffers, and 1
gig or so for maintenance work_mem, and then increase them while
exercising the system, and measure the difference increasing them makes.

If going from 1G shared buffers to 2G shared buffers gets you a 10%
increase, then good.  If going from 2G to 4G gets you a 1.2% increase,
it's questionable.  You should reach a point where throwing more
shared_buffers stops helping before you start swapping.  But you might
not.

Same goes for maintenance work mem.  Incremental changes, accompanied by
reproducible benchmarks / behaviour measurements are the way to
determine the settings.

Note that you can also vary those during different times of the day.
you can have maint_mem set to 1Gig during the day and crank it up to 8
gig or something while loading data.  Shared_buffers can't be changed
without restarting the db though.



I'm currently benchmarking various configuration adjustments.  Problem is 
these tests take a really long time because I have to run the load 
process... which is like a 9 hour deal.  That's why I'm asking for advice 
here, because there's a lot of variables here and it's really time costly 
to test :)


I'm still working on the benchmarking and by Friday I should have some 
interesting statistics to work with and maybe help figure out what's going 
on.



Thanks!

Steve

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Steve
This would probably also be another last ditch option. :)  Our stuff is 
designed to pretty much work on any DB but there are postgres specific 
things in there... not to mention ramp up time on MySQL.  I mean, I know 
MySQL from a user point of view and in a very limited way 
administratively, but I'd be back to square one on learning performance 
stuff :)


Anyway -- I'll listen to what people have to say, and keep this in mind. 
It would be an interesting test to take parts of the process and compare 
at least, if not converting the whole thing.


talk to you later,

Steve

On Wed, 17 Jan 2007, Adam Rich wrote:



Sorry if this came off sounding trollish.  All databases have their
strengths & weaknesses, and I feel the responsible thing to do is
exploit
those strengths where possible, rather than expend significant time and
effort coaxing one database to do something it wasn't designed to.
There's just no such thing as one size fits all.

I have professional experience with MS-SQL, Oracle, MySQL, and Postgres,
and the scenario described sounds more ideal for MySQL & MyISAM than
anything else:

1) No concerns for data loss (turning fsync & full_page_writes off)
since the data can be reloaded

2) No need for MVCC or transactions, since the database is read-only

3) No worries about lock contention

4) Complex queries that might take advantage of the MySQL Query Cache
since the base data never changes

5) Queries that might obtain data directly from indexes without having
to touch tables (again, no need for MVCC)

If loading in the base data and creating the summary table is taking
a lot of time, using MySQL with MyISAM tables (and binary logging
disabled) should provide significant time savings, and it doesn't
sound like there are any concerns about the downsides.

Yes, postgresql holds an edge over MySQL for heavy OLTP applications;
I use it for that and I love it.  But for the scenario the original
poster is asking about, MySQL/MyISAM is ideal.




-Original Message-
From: Bricklen Anderson [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 17, 2007 3:29 PM
To: Adam Rich
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Configuration Advice


Adam Rich wrote:

Doesn't sound like you want postgres at all.  Try mysql.


Could you explain your reason for suggesting mysql? I'm simply curious
why you would offer that as a solution.


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] Configuration Advice

2007-01-17 Thread Steve

The thought:

- Load the big chunk of data into a new table

- Generate some minimal set of indices on the new table

- Generate four queries that compare old to new:
 q1 - See which tuples are unchanged from yesterday to today
 q2  - See which tuples have been deleted from yesterday to today
 q3  - See which tuples have been added
 q4  - See which tuples have been modified

If the unchanged set is extremely large, then you might see benefit
to doing updates based on deleting the rows indicated by q2,
inserting rows based on q3, and updating based on q4.

In principle, computing and applying those 4 queries might be quicker
than rebuilding from scratch.

In principle, applying q2, then q4, then vacuuming, then q3, ought to
be optimal.



	This looks like an interesting idea, and I'm going to take a look 
at how feasible it'll be to implement.  I may be able to combine this 
with Mr. Wagner's idea to make a much more efficient system overall.  It's 
going to be a pretty big programming task, but I have a feeling this 
summarizer thing may just need to be re-written with a smarter system 
like this to get something faster.



Thanks!

Steve

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings