Re: [PERFORM] Tsearch2 Initial Search Speed

2008-06-17 Thread Howard Cole

Alan Hodgson wrote:
It's because everything is cached, in particular the relevant rows from 
the email table (accessing which took 22 of the original 27 seconds).


The plan looks good for what it's doing.

I don't see that query getting much faster unless you could add a lot more 
cache RAM; 30K random IOs off disk is going to take a fair bit of time 
regardless of what you do. 

  


Thanks Alan, I guessed that the caching was the difference, but I do not 
understand why there is a heap scan on the email table. The query seems 
to use the email_fts_index correctly, which only takes 6 seconds, so why 
does it then need to scan the email table?


Sorry if I sound a bit stupid - I am not very experienced with the 
analyse statement.


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Tsearch2 Initial Search Speed

2008-06-17 Thread Howard Cole
I think I may have answered my own question partially, the problem may 
be how I structure the query.


I always structured my tsearch queries as follows, based on my initial 
read of the tsearch2 instructions...


select email_id from email, to_tsquery('default', 'howard') as q where 
q @@ fts;


However, if I construct them in the following way, as stipulated in the 
8.3 documentation:


select email_id from email where fts @@ to_tsquery('default', 'howard');

then the results are better, because the email table is not necessarily 
scanned, as can be seen from the two analyse outputs:


Original statement:

Nested Loop  (cost=4.40..65.08 rows=16 width=8)
  ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32)
  ->  Bitmap Heap Scan on email  (cost=4.40..64.87 rows=16 width=489)
        Filter: (email.fts @@ q.q)
        ->  Bitmap Index Scan on email_fts_index  (cost=0.00..4.40 rows=16 width=0)
              Index Cond: (email.fts @@ q.q)

Second statement:

Bitmap Heap Scan on email  (cost=4.40..64.91 rows=16 width=8)
  Filter: (fts @@ '''howard'''::tsquery)
  ->  Bitmap Index Scan on email_fts_index  (cost=0.00..4.40 rows=16 width=0)
        Index Cond: (fts @@ '''howard'''::tsquery)

This misses out the random access of the email table, turning my 27 
second query into 6 seconds.


I guess the construction of the first statement effectively stops the 
query optimisation from working.





Re: [PERFORM] Tsearch2 Initial Search Speed

2008-06-17 Thread Matthew Wakeling

On Tue, 17 Jun 2008, Howard Cole wrote:

Alan Hodgson wrote:
It's because everything is cached, in particular the relevant rows from the 
email table (accessing which took 22 of the original 27 seconds).


Thanks Alan, I guessed that the caching was the difference, but I do not 
understand why there is a heap scan on the email table? The query seems to 
use the email_fts_index correctly, which only takes 6 seconds, why does it 
then need to scan the email table?


It's not a sequential scan - that really would take a fair time. It's a 
bitmap heap scan - that is, it has built a bitmap of the rows needed by 
using the index, and now it needs to fetch all those rows from the email 
table. There's 14938 of them, and they're likely scattered all over the 
table, so you'll probably have to do 14938 seeks on the disc. At 5ms a 
pop, that would be 70 seconds, so count yourself lucky it only takes 22 
seconds instead!
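That back-of-the-envelope sum, using the estimated figures above (not measurements), can be done in SQL itself:

```sql
-- 14938 random heap fetches at ~5 ms per seek, expressed in seconds
SELECT 14938 * 5 / 1000.0 AS estimated_seconds;  -- roughly 75 seconds, the ballpark quoted above
```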


If you aren't actually interested in having all 14938 rows splurged at 
you, try using the LIMIT keyword at the end of the query. That would make 
it run a bit faster, and would make sense if you only want to display the 
first twenty on a web page or something.
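For instance, a sketch of the limited query (column and configuration names as used earlier in this thread):

```sql
-- Fetch only the first twenty matches; Postgres can stop fetching
-- heap rows once the limit is satisfied.
select email_id from email
where fts @@ to_tsquery('default', 'howard')
limit 20;
```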


Matthew

--
For every complex problem, there is a solution that is simple, neat, and wrong.
 -- H. L. Mencken 




Re: [PERFORM] Tsearch2 Initial Search Speed

2008-06-17 Thread Matthew Wakeling

On Tue, 17 Jun 2008, Howard Cole wrote:
I think I may have answered my own question partially, the problem may be how 
I structure the query.


Original statement:

Nested Loop  (cost=4.40..65.08 rows=16 width=8)
  ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32)
  ->  Bitmap Heap Scan on email  (cost=4.40..64.87 rows=16 width=489)
        Filter: (email.fts @@ q.q)
        ->  Bitmap Index Scan on email_fts_index  (cost=0.00..4.40 rows=16 width=0)
              Index Cond: (email.fts @@ q.q)

Second statement:

Bitmap Heap Scan on email  (cost=4.40..64.91 rows=16 width=8)
  Filter: (fts @@ '''howard'''::tsquery)
  ->  Bitmap Index Scan on email_fts_index  (cost=0.00..4.40 rows=16 width=0)
        Index Cond: (fts @@ '''howard'''::tsquery)


As far as I can see, that shouldn't make any difference. Both queries 
still do the bitmap heap scan, and have almost exactly the same cost.


Matthew

--
Lord grant me patience, and I want it NOW!



Re: [PERFORM] Tsearch2 Initial Search Speed

2008-06-17 Thread Howard Cole


As far as I can see, that shouldn't make any difference. Both queries 
still do the bitmap heap scan, and have almost exactly the same cost.


Matthew

You may have a point there, Matthew - they both appear to do a scan on the 
email table (why?). But for whatever reason, I swear the second method 
is significantly faster. If I run the new-style query first, then the 
original style (to_tsquery as q), the original style still takes 
longer, even with the new style cached!


Incidentally, how can I clear the cache in between queries?



Re: [PERFORM] Tsearch2 Initial Search Speed

2008-06-17 Thread Matthew Wakeling

On Tue, 17 Jun 2008, Howard Cole wrote:

They both appear to do a scan on the email table (Why?).


The indexes don't contain copies of the row data. They only contain 
pointers to the rows in the table. So once the index has been consulted, 
Postgres still needs to look at the table to fetch the actual rows. Of 
course, it only needs to bother looking where the index points, and that 
is the benefit of an index.


Matthew

--
I've run DOOM more in the last few days than I have the last few
months.  I just love debugging ;-)  -- Linus Torvalds



Re: [PERFORM] Tsearch2 Initial Search Speed

2008-06-17 Thread Howard Cole

Matthew Wakeling wrote:

On Tue, 17 Jun 2008, Howard Cole wrote:

They both appear to do a scan on the email table (Why?).


The indexes don't contain copies of the row data. They only contain 
pointers to the rows in the table. So once the index has been 
consulted, Postgres still needs to look at the table to fetch the 
actual rows. Of course, it only needs to bother looking where the 
index points, and that is the benefit of an index.


Matthew

Thanks for your patience with me here Matthew, but what I don't 
understand is why it needs to do a scan on email. If I do a query that 
uses another index, then it uses the index only and does not scan the 
email table. The scan on the fts index takes 6 seconds, which presumably 
returns email_ids (the email_id being the primary key) - what does it 
then need from the email table that takes 22 seconds?


e.g.

triohq=> explain select email_id from email where email_directory_id=1;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Index Scan using email_email_directory_id_idx on email  (cost=0.00..129.01 rows=35 width=8)
   Index Cond: (email_directory_id = 1)
(2 rows)



Re: [PERFORM] Tsearch2 Initial Search Speed

2008-06-17 Thread Matthew Wakeling

On Tue, 17 Jun 2008, Howard Cole wrote:
If I do a query that uses another index, then it uses the index only and 
does not scan the email table.


Not true. It only looks a little bit like that from the explain output. 
However, if you look closely:



Index Scan using email_email_directory_id_idx on email  (cost=0.00..129.01 
rows=35 width=8)
 Index Cond: (email_directory_id = 1)
(2 rows)


It's a scan *using* the index, but *on* the table email. This index scan 
is having to read the email table too.


The scan on the fts index takes 6 seconds, which presumably returns 
email_id's (the email_id being the primary key) - what does it then need 
from the email table that takes 22 seconds?


Actually, the index returns page numbers in the table on disc which may 
contain one or more rows that are relevant. Postgres has to fetch the 
whole row to find out the email_id and any other information, including 
whether the row is visible in your current transaction (concurrency 
control complicates it all). Just having a page number isn't much use to 
you!
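One way to see where that time goes (a sketch using the table from this thread) is EXPLAIN ANALYZE, which reports actual elapsed time for each plan node, so the bitmap index scan and the heap fetches show up separately:

```sql
-- The "actual time" on the Bitmap Heap Scan node, minus that of the
-- Bitmap Index Scan under it, is roughly the cost of the heap fetches.
EXPLAIN ANALYZE
SELECT email_id FROM email WHERE fts @@ to_tsquery('default', 'howard');
```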


Matthew

--
First law of computing:  Anything can go wro
sig: Segmentation fault.  core dumped.



Re: [PERFORM] Migration Articles.. ???

2008-06-17 Thread Matthew Wakeling

On Tue, 17 Jun 2008, sathiya psql wrote:

I have a database in Postgres X.Y which has around 90 tables, and a lot of
data in it. The next version of the product adds some more tables - there
may be 150 tables in total. Of the original 90 tables, 70 stay the same,
20 got deleted, and 80 are new. I want the 70 unchanged tables to keep
their data as it is.


Please do not cross-post. This question has nothing to do with 
performance. (Cross-posting answer so everyone else doesn't answer the 
same.)


You'll want to dump the source database selectively, and then reload the 
dump into a new database. RTFM on pg_dump, especially the -t and -T 
options.


Matthew

--
All of this sounds mildly turgid and messy and confusing... but what the
heck. That's what programming's all about, really
   -- Computer Science Lecturer



Re: [PERFORM] Tsearch2 Initial Search Speed

2008-06-17 Thread Howard Cole


Actually, the index returns page numbers in the table on disc which 
may contain one or more rows that are relevant. Postgres has to fetch 
the whole row to find out the email_id and any other information, 
including whether the row is visible in your current transaction 
(concurrency control complicates it all). Just having a page number 
isn't much use to you!


Matthew


I learn something new every day.

Thanks Matthew.



[PERFORM] Which hardware ?

2008-06-17 Thread Lionel
Hi,

I need to install an 8.3 database and was wondering which hardware would be 
sufficient to have good performance (less than 30s for the slowest select).

Database size: 25 GB/year, 5 years of history.
One main table containing 40 million lines per year.
Batch inserts of 10 lines. Very very few deletes, few updates.

30 other tables, 4 levels of hierarchy, containing from 10 lines up to 2 
lines.
5 of them have foreign keys on the main table.

I will use table partitioning on the year column.

Statements will mainly do sums on the main table, grouped by whatever column 
of the database (3-5 joined tables, or join on join), with some criteria 
that may vary, and lots of joined varchar in ('a','b',...,'z').
It's almost impossible to predict what users will do via the web application 
that queries this database: almost all select, join, group by, where... 
possibilities are available.

Up to 4 simultaneous users.

I'm planning to host it on a quad Xeon 2.66GHz with 8 GB of DDR2, and a dual 
(RAID1) SATA2 750 GB HD.
Perhaps with another HD for indexes.

Do you think it will be enough?
Is another RAID a minimum requirement for better performance?
Will a secondary HD for indexes help?

Which OS would you use? (knowing that there will be a JDK 1.6 installed 
too)

With 5 million lines, the same application runs quite fast on Windows 
2000 on a single P4 2.8 GHz (very few statements last more than 10s, mostly 
when concurrent statements are made). Each statement consumes 100% of the 
CPU.


Thanks for your advice.





Re: [PERFORM] Which hardware ?

2008-06-17 Thread Scott Marlowe
On Tue, Jun 17, 2008 at 7:38 AM, Lionel <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I need to install an 8.3 database and was wondering which hardware would be
> sufficient to have good performance (less than 30s for the slowest select).
>
> Database size: 25 GB/year, 5 years of history.
> One main table containing 40 million lines per year.
> Batch inserts of 10 lines. Very very few deletes, few updates.
>
> 30 other tables, 4 levels of hierarchy, containing from 10 lines up to 2
> lines.
> 5 of them have foreign keys on the main table.
>
> I will use table partitioning on the year column.
>
> Statements will mainly do sums on the main table, grouped by whatever column
> of the database (3-5 joined tables, or join on join), with some criteria
> that may vary, lots of joined varchar in ('a','b',...,'z').
> It's almost impossible to predict what users will do via the web application
> that queries this database: almost all select, join, group by, where...
> possibilities are available.
>
> Up to 4 simultaneous users.
>
> I'm planning to host it on a quad Xeon 2.66GHz with 8 GB of DDR2, and a dual
> (RAID1) SATA2 750 GB HD.
> Perhaps with another HD for indexes.
>
> Do you think it will be enough?
> Is another RAID a minimum requirement for better performance?
> Will a secondary HD for indexes help?

More drives, all in the same RAID-10 setup.  For reporting like this,
writing speed often isn't that critical, so you are often better off
with software RAID-10 than using a mediocre hardware RAID controller
(most Adaptecs, low-end LSI, etc.).

You'd be surprised what going from a 2-disk RAID1 to a 4-disk RAID10
can do in these circumstances.  Going up to 6, 8, 10 or more disks
really makes a difference.

> Which OS would you use? (knowing that there will be a JDK 1.6 installed
> too)

I'd use RHEL5 because it's what I'm familiar with.  Any stable flavor
of Linux or FreeBSD 7 is a good performance choice if you know how to
drive them.

> With 5 million lines, the same application runs quite fast on Windows
> 2000 on a single P4 2.8 GHz (very few statements last more than 10s, mostly
> when concurrent statements are made). Each statement consumes 100% of the
> CPU.

That statement about concurrent statements REALLY sells me on the idea
of a many-disk RAID10 here.  I'd take that over quad cores for what
you're doing any day.  Not that I'd turn down quad cores here either.
:)



Re: [PERFORM] Which hardware ?

2008-06-17 Thread Andrew Sullivan
On Tue, Jun 17, 2008 at 03:38:59PM +0200, Lionel wrote:
> Hi,
>
> I need to install an 8.3 database and was wondering which hardware would be
> sufficient to have good performance (less than 30s for the slowest select).

> Statements will mainly do sums on the main table, grouped by whatever column
> of the database (3-5 joined tables, or join on join), with some criteria
> that may vary, lots of joined varchar in ('a','b',...,'z').
> It's almost impossible to predict what users will do via the web application
> that queries this database: almost all select, join, group by, where...
> possibilities are available.

I'm not sure that I have any specific recommendation to make in the
face of such sweeping requirements.  But I'd say you need to make I/O
cheap, which means piles of memory and extremely fast disk
subsystems.

Also, there's another important question (which never gets asked in
these discussions), which is: "How much is the performance worth to
you?"  If the last 10% of users get something longer than 30s, but
less than 40s, and they will pay no more to get the extra 10s response
time, then it's worth nothing to you, and you shouldn't fix it.

> Up to 4 simultaneous users.

You won't need lots of processor, then.

> I'm planning to host it on a quad Xeon 2.66GHz with 8 GB of DDR2, and a dual
> (RAID1) SATA2 750 GB HD.
> Perhaps with another HD for indexes.

How big's the database?  If you can have enough memory to hold the
whole thing, including all indexes, in memory, that's what you want.
Apart from that, dual SATA2 is probably underpowered.  But. . .

> Which OS would you use? (knowing that there will be a JDK 1.6 installed
> too)

. . .I think this is the real mistake.  Get a separate database box.
It's approximately impossible to tune a box correctly for both your
application and your database, in my experience.

A

-- 
Andrew Sullivan
[EMAIL PROTECTED]
+1 503 667 4564 x104
http://www.commandprompt.com/



Re: [PERFORM] Which hardware ?

2008-06-17 Thread Matthew Wakeling

On Tue, 17 Jun 2008, Lionel wrote:

I need to install a 8.3 database and was wondering which hardware would be
sufficient to have good performances (less than 30s for the slowest select).



It's almost impossible to predict what users will do via the webapplication
that queries this database: almost all select, join, group by, where...
possibilities are available.


Well, Scott has given you some good pointers on how to make a fast system. 
However, your original question (is this fast enough) is impossible to 
answer, especially if the users are allowed to run absolutely anything 
they want. I bet you I could craft a query that takes more than 30 seconds 
regardless of how fast you make your system.


Having said that, I'll add the suggestion that you should put as much RAM 
in the box as possible. It can only help. As others have said, if you only 
have four users, then CPU power isn't going to be such a problem, and 
given that, I'd disagree with Andrew and say as long as you have plenty of 
RAM, Java can play well with a database on the same box. Depends what it 
is doing, of course.


Matthew

--
To be or not to be   -- Shakespeare
To do is to be   -- Nietzsche
To be is to do   -- Sartre
Do be do be do   -- Sinatra


Re: [PERFORM] Which hardware ?

2008-06-17 Thread Lionel
Andrew Sullivan wrote:
> You won't need lots of processor, then.

I can't find less than quad core in this price range...

> How big's the database?

With 20 million rows, the main table is 3.5 GB on Win XP,
with 8 GB of indexes.

I estimate the whole database at around 30 GB/year.

> If you can have enough memory to hold the
> whole thing, including all indexes, in memory, that's what you want.
> Apart from that, dual SATA2 is probably underpowered.  But. . .

RAID is twice as expensive.
(600 euros/month for a 5x750 GB SATA2 with 12 GB of RAM and an unnecessary 
2x quad core)

I didn't find any not-too-expensive dedicated server with RAID 10.

If this setup is twice as fast, I can afford it. But if it is 30s vs 
40s... I'm not sure my customer will pay.

>> Which OS would you use? (knowing that there will be a JDK 1.6
>> installed too)
>
> . . .I think this is the real mistake.  Get a separate database box.
> It's approximately impossible to tune a box correctly for both your
> application and your database, in my experience.

My Tomcat webapp is well coded and consumes nearly nothing.
On such powerful hardware, I prefer to run both on the same server.
I could possibly run it on a different server, much less powerful, but 
it's not on the same network; I guess this would be an issue. 





Re: [PERFORM] Which hardware ?

2008-06-17 Thread Greg Smith

On Tue, 17 Jun 2008, Andrew Sullivan wrote:


Which OS would you use ? (knowing that there will be a JDK 1.6 installed
too)


. . .I think this is the real mistake.  Get a separate database box.
It's approximately impossible to tune a box correctly for both your
application and your database, in my experience.


I can't remember the last time I deployed a PG box that didn't have a Java 
app or three on it, too.  You've got even odds that putting it on a separate 
system will even be an improvement.  Yes, if the Java app is a pig and the 
machine doesn't have enough resources, separating it out to another system 
will help.  But there are plenty of these buggers that will end up so much 
slower from the additional network latency that it's a net loss (depends 
on how the app groups its requests for rows).


If you know enough about Java to watch things like how much memory the 
JVMs are taking up, I wouldn't hesitate to put them all on the same 
machine.  Considering that Lionel's system seems pretty overpowered for 
what he's doing--runs plenty fast on a much slower system, enough RAM to 
hold a large portion of the primary tables and database, all batch updates 
that don't really need a good RAID setup--I'd say looks good here and 
recommend he just follow the plan he outlined.  Just watch the system with 
top for a bit under load to make sure the Java processes are staying under 
control.


As for OS, a RHEL5 or clone like CentOS should work fine here, which is 
more appropriate depends on your support requirements.  I would recommend 
against using FreeBSD as it's not the friendliest Java platform, and the 
additional complexity of Solaris seems like overkill for your app. 
Basically, even though it's available for more of them, I only consider 
deploying a Java app on one of the mainstream platforms listed at 
http://www.java.com/en/download/manual.jsp right now because those are the 
mature releases.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] Which hardware ?

2008-06-17 Thread Andrew Sullivan
On Tue, Jun 17, 2008 at 04:49:17PM +0200, Lionel wrote:
> My tomcat webapp is well coded and consumes nearly nothing.

If I were ever inclined to say, "Nonsense," about code I've never
seen, this is probably the occasion on which I'd do it.  A running JVM
is necessarily going to use some memory, and that is memory use that
you won't be able to factor out properly when developing models of
your database system performance.

> I could eventually run it on a different server, much less powerful, but
> it's not on the same network, I guess this would be an issue.

The power of the system is hard to know about in the context (with
only 8 GB of memory, I don't consider this a powerful box at all,
note).  But why wouldn't it be on the same network?  You're using the
network stack anyway, note: JVMs can't go over domain sockets.

A

-- 
Andrew Sullivan
[EMAIL PROTECTED]
+1 503 667 4564 x104
http://www.commandprompt.com/



Re: [PERFORM] Which hardware ?

2008-06-17 Thread Scott Marlowe
On Tue, Jun 17, 2008 at 9:32 AM, Greg Smith <[EMAIL PROTECTED]> wrote:
> Considering that Lionel's system seems pretty overpowered for what he's
> doing--runs plenty fast on a much slower system, enough RAM to hold a large
> portion of the primary tables and database, all batch updates that don't
> really need a good RAID setup--I'd say looks good here and recommend he
> just follow the plan he outlined.  Just watch the system with top for a bit
> under load to make sure the Java processes are staying under control.

In the original post he mentioned that he had 5 years of data at about
25G / year.

With 125G of data, it's likely that if most queries are on recent data
it'll be in RAM, but anything that hits older data will NOT have that
luxury.  Which is why I recommended RAID-10.  It doesn't have to be on
a $1200 card with 44 disks or something, but even 4 disks in a sw
RAID-10 will be noticeably faster (about 2x) than a simple RAID-1 at
hitting that old data.

We had a reporting server with about 80G of data on a machine with 4G
ram last place I worked, and it could take it a few extra seconds to
hit the old data, but the SW RAID-10 on it made it much faster at
reporting than it would have been with a single disk.



Re: [PERFORM] Tsearch2 Initial Search Speed

2008-06-17 Thread Alan Hodgson
On Tuesday 17 June 2008, Howard Cole <[EMAIL PROTECTED]> wrote:
> This misses out the random access of the email table, turning my 27
> second query into 6 seconds.

It took less time because it retrieved a lot less data - it still has to 
look at the table.

-- 
Alan



Re: [PERFORM] Tsearch2 Initial Search Speed

2008-06-17 Thread Alan Hodgson
On Tuesday 17 June 2008, Howard Cole <[EMAIL PROTECTED]> wrote:
> Incidentally, how can I clear the cache in between queries?

Stop PostgreSQL, unmount the filesystem it's on, remount it, restart 
PostgreSQL. Works under Linux.

If it's on a filesystem you can't unmount hot, you'll need to reboot.

-- 
Alan



Re: [PERFORM] Tsearch2 Initial Search Speed

2008-06-17 Thread Matthew Wakeling

On Tue, 17 Jun 2008, Alan Hodgson wrote:

On Tuesday 17 June 2008, Howard Cole [EMAIL PROTECTED] wrote:

Incidentally, how can I clear the cache in between queries?


Stop PostgreSQL, unmount the filesystem it's on, remount it, restart
PostgreSQL. Works under Linux.

If it's on a filesystem you can't unmount hot, you'll need to reboot.


Not true - on recent Linux kernels, you can drop the OS cache by running

echo 1 > /proc/sys/vm/drop_caches

as root. You'll still need to restart Postgres to drop its cache too.

Matthew

--
Richards' Laws of Data Security:
1. Don't buy a computer.
2. If you must buy a computer, don't turn it on.



Re: [PERFORM] Which hardware ?

2008-06-17 Thread Scott Marlowe
On Tue, Jun 17, 2008 at 9:42 AM, Andrew Sullivan <[EMAIL PROTECTED]> wrote:
> On Tue, Jun 17, 2008 at 04:49:17PM +0200, Lionel wrote:
>> My tomcat webapp is well coded and consumes nearly nothing.
>
> If I were ever inclined to say, "Nonsense," about code I've never
> seen, this is probably the occasion on which I'd do it.  A running JVM
> is necessarily going to use some memory, and that is memory use that
> you won't be able to factor out properly when developing models of
> your database system performance.

But if that amount of memory is 256 megs and it only ever acts as a
control panel or data access point, it's probably not a huge issue.
If it's 2 gigs, it's another issue.  It's all about scale.  The real
performance hog for me on all-in-one boxes has been Perl / FastCGI
setups.

> The power of the system is hard to know about in the context (with
> only 8 GB of memory, I don't consider this a powerful box at all,
> note).

I always think of main memory in terms of how high a cache hit rate it
can get me.  If 8G gets you a 50% hit rate, and 16G gets you a 95% hit
rate, then 16G is the way to go.  But if 8G gets you to 75% and 32G
gets you to 79% because of your usage patterns (the world isn't always
bell curve shaped) then 8G is plenty and it's time to work on faster
disk subsystems if you need more performance.
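That hit-rate reasoning can be sketched numerically (hypothetical latencies: 0.1ms for a cached read, 5ms for a disk seek):

```sql
-- Expected cost per access = hit_rate * cache_latency + miss_rate * disk_latency
SELECT 0.50 * 0.1 + 0.50 * 5.0 AS avg_ms_at_50pct_hits,   -- 2.55 ms
       0.95 * 0.1 + 0.05 * 5.0 AS avg_ms_at_95pct_hits;   -- 0.345 ms
```

With those figures, the jump from a 50% to a 95% hit rate is worth roughly 7x on average access time, which is why the shape of the usage pattern matters so much.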



Re: [PERFORM] Which hardware ?

2008-06-17 Thread Greg Smith

On Tue, 17 Jun 2008, Andrew Sullivan wrote:

A running JVM is necessarily going to use some memory, and that is 
memory use that you won't be able to factor out properly when developing 
models of your database system performance.


Now you've wandered into pure FUD.  Tuning maximum memory usage on a Java 
app so you can model it is straightforward (albeit a little confusing at 
first), and in most cases you can just sample it periodically to get a 
good enough estimate for database tuning purposes.  JVMs let you adjust 
maximum memory use with -Xmx, and if anything the bigger problem I run 
into is that using too much memory hits that limit and crashes Java long 
before it becomes a hazard to the database.


This is a system with 8GB of RAM here; having some Tomcat instances 
co-existing with the database when there's that much room to work is not 
that hard.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] Which hardware ?

2008-06-17 Thread Greg Smith

On Tue, 17 Jun 2008, Scott Marlowe wrote:


We had a reporting server with about 80G of data on a machine with 4G
ram last place I worked, and it could take it a few extra seconds to
hit the old data, but the SW RAID-10 on it made it much faster at
reporting than it would have been with a single disk.


I agree with your statement above, that query time could likely be dropped 
a few seconds with a better disk setup.  I just question whether that's 
necessary given the performance target here.


Right now the app is running on an underpowered Windows box and is 
returning results in around 10s, on a sample data set that sounds like 1/8 
of a year worth of data (1/40 of the total).  It is seemingly CPU bound 
with not enough processor to handle concurrent queries being the source of 
the worst-case behavior.  The target is keeping that 30s on more powerful 
hardware, with at least 6X as much processor power and a more efficient 
OS, while using yearly partitions to keep the amount of data to juggle at 
once under control.  That seems reasonable to me, and while better disks 
would be nice, I don't see any evidence they're really needed here.  This 
application sounds like a batch processing/reporting one where plus or minus 
a few seconds doesn't have a lot of business value.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] Which hardware ?

2008-06-17 Thread Scott Marlowe
On Tue, Jun 17, 2008 at 10:56 AM, Greg Smith [EMAIL PROTECTED] wrote:
 On Tue, 17 Jun 2008, Scott Marlowe wrote:

 We had a reporting server with about 80G of data on a machine with 4G
 ram last place I worked, and it could take it a few extra seconds to
 hit the old data, but the SW RAID-10 on it made it much faster at
 reporting than it would have been with a single disk.

 I agree with your statement above, that query time could likely be dropped a
 few seconds with a better disk setup.  I just question whether that's
 necessary given the performance target here.

 Right now the app is running on an underpowered Windows box and is returning
 results in around 10s, on a sample data set that sounds like 1/8 of a year
 worth of data (1/40 of the total).  It is seemingly CPU bound with not
 enough processor to handle concurrent queries being the source of the
 worst-case behavior.  The target is keeping that 30s on more powerful
 hardware, with at least 6X as much processor power and a more efficient OS,
 while using yearly partitions to keep the amount of data to juggle at once
 under control.  That seems reasonable to me, and while better disks would be
 nice I don't see any evidence they're really needed here.  This application
 sounds a batch processing/reporting one where plus or minus a few seconds
 doesn't have a lot of business value.

I think you're making a big assumption that this is CPU bound.  And it
may be that when all the queries are operating on current data that it
is.  But as soon as a few ugly queries fire that need to read tens of
gigs of data off the drives, then you'll start to switch to I/O bound
and the system will slow a lot.

We had a single drive box doing work on an 80G set that was just fine
with the most recent bits.  Until I ran a report that ran across the
last year instead of the last two days, and took 2 hours to run.

All the queries that had run really quickly on all the recent data
suddenly were going from 1 or 2 seconds to 2 or 3 minutes.  And I'd
have to kill my reporting query.

Moved it to the same exact hardware but with a 4 disc RAID-10 and the
little queries stayed 1-2 seconds while the reporting queries were cut
down by factors of about 4 to 10.  RAID-1 will be somewhere between
them I'd imagine.  RAID-10 has an amazing ability to handle parallel
accesses without falling over performance-wise.

You're absolutely right though, we really need to know the value of
fast performance here.

If you're monitoring industrial systems you need fast enough response
to spot problems before they escalate to disasters.

If you're running aggregations of numbers used for filling out
quarterly reports, not so much.



Re: [PERFORM] Which hardware ?

2008-06-17 Thread Scott Marlowe
On Tue, Jun 17, 2008 at 11:59 AM, Lionel [EMAIL PROTECTED] wrote:
 Scott Marlowe wrote:
 You're absolutely right though, we really need to know the value of
 fast performance here.

 the main problem is that my customers are used to having their reports after
 a few seconds.
 They want to have 10 times more data but still have the same speed, which
 is, I think, quite impossible.

 If you're running aggregations of numbers used for filling out
 quarterly reports, not so much.

 The application is used to analyse products sales behaviour, display charts,
 perform comparisons, study progression...
 10-40 seconds seems to be a quite good performance.
 More than 1 minute will be too slow (meaning they won't pay for that).

 I did some tests with a 20 million line database on a single disk dual core
 2GB Win XP system (default postgresql config); most of the time is spent in
 I/O: 50-100 secs for statements that scan 6 million lines, which will
 happen.  Almost no CPU activity.

 So here is the next question: 4 disk RAID10 (did not find a French web host
 yet) or 5 disk RAID5 (found at 600 euros/month)?
 I don't want to have any RAID issues...
 I have not had any problems with my basic RAID1 in many years, and don't
 want that to change.

Do you have root access on your servers?  Then just ask for 5 disks
with one holding the OS / Apps and you'll do the rest.  Software RAID
is probably a good fit for cheap right now.

If you can set it up yourself, you might be best off with 2 disk
RAID-1.  5 750G disks in a RAID-1 yields 750G of storage (duh) but
allows for five different readers to operate without the heads having
to seek.  Large amounts of data can be read at a medium speed from a
RAID-1 like this.  But most RAID implementations don't aggregate
bandwidth for RAID-1.

They do for RAID-0.  So, having a huge RAID-0 array allows for
reading a large chunk of data really fast from all disks at once.

RAID1+0 gives you the ability to tune this in either direction.  But
the standard config of a 4 disk setup (striping two mirrors, each made
from two disks) is a good compromise to start with.  Average read
speed of array is doubled, and the ability to have two reads not
conflict helps too.

RAID5 is a compromise that provides the most storage while having mediocre
performance or, when degraded, horrifically poor performance.

Hard drives are cheap, hosting not as much.

Also, always look at optimizing their queries.  A lot of analysis is
done by brute force queries that, rewritten intelligently, suddenly run
in minutes not hours, or seconds not minutes.



Re: [PERFORM] Which hardware ?

2008-06-17 Thread Greg Smith

On Tue, 17 Jun 2008, Lionel wrote:


I did some tests with a 20 million line database on a single disk dual core
2GB Win XP system (default postgresql config); most of the time is spent in
I/O: 50-100 secs for statements that scan 6 million lines, which will
happen.  Almost no CPU activity.


I hope you're aware that the default config is awful, and there are all 
sorts of possible causes for heavy I/O churn that might improve if you 
setup the postgresql.conf file to use the server's resources more 
aggressively (the default is setup for machines with a very small amount 
of RAM).  There are lots of links to articles that cover the various areas 
you might improve at 
http://wiki.postgresql.org/wiki/Performance_Optimization
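
As a rough illustration only (the values are ballpark figures for a 
dedicated machine with a few GB of RAM, not recommendations for this 
workload), the settings most often raised from their tiny defaults are:

```
shared_buffers = 512MB          # default is only a few MB
effective_cache_size = 1536MB   # tells the planner how much OS cache exists
work_mem = 32MB                 # per-sort/per-hash memory, watch concurrency
checkpoint_segments = 16        # fewer, larger checkpoints for bulk loads
```

The articles linked above cover how to size each of these properly for a 
given machine; the point here is just that the defaults assume far less 
RAM than any modern server has.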


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] Which hardware ?

2008-06-17 Thread Jon D



 From: [EMAIL PROTECTED]
 Subject: [PERFORM] Which hardware ?
 Date: Tue, 17 Jun 2008 15:38:59 +0200
 To: pgsql-performance@postgresql.org
 
 Hi,
 
 I need to install an 8.3 database and was wondering which hardware would be 
 sufficient to have good performance (less than 30s for the slowest select).
 
 Database size: 25 GB/year, 5 years of history
 One main table containing 40 million lines per year.
 Batch inserts of 10 lines. Very very few deletes, few updates.
 
 30 other tables, 4 levels of hierarchy, containing from 10 lines up to 2 
 lines.
 5 of them have foreign keys on the main table.
 
 I will use table partitioning on the year column.
 
 Statements will mainly do sums on the main table, grouped by whatever column 
 of the database (3-5 joined tables, or join on join), with some criteria 
 that may vary, lots of joined varchar in ('a','b',...,'z').
 It's almost impossible to predict what users will do via the web application 
 that queries this database: almost all select, join, group by, where... 
 possibilities are available.
 
 Up to 4 simultaneous users.
 
 I'm planning to host it on a quad Xeon 2.66GHz with 8 GB of DDR2, and a dual 
 (RAID1) SATA2 750 GB HD.
 Perhaps with another HD for indexes.
 
 Do you think it will be enough ?
 Is another RAID for better performance a minimum requirement ?
 Will a secondary HD for indexes help ?
 
 Which OS would you use ? (knowing that there will be a JDK 1.6 installed 
 too)
 
 With 5 million lines, the same application runs quite fast on Windows 
 2000 on a single P4 2.8 GHz (very few statements last more than 10s, mostly 
 when concurrent statements are made).  Each statement consumes 100% of the 
 CPU.
 
 
 thanks for advices.
 
 
 


I think hardware isn't going to solve your problem, especially the CPU.  You 
only have four users, and Postgres can only use 1 core per query.  If you 
have sequential scans that span this table and it has 60-80 million rows, 
they could take longer than 30 seconds even if you have a lot of RAM.  Just 
imagine what Postgres is doing: if its search is going to end in scanning 
40 million rows and it has to aggregate on two or three columns, it's going 
to be slow.  No amount of hardware is going to fix this.  Sure, you can gain 
some speed by having entire tables in RAM, but there's no magic bullet here.  
Disk is definitely not a magic bullet either: even a bunch of fast disks is 
still much slower than RAM at performing reads, so if you're read-heavy then 
adding more disks isn't going to just solve all your problems.  RAM is nice: 
the more pages you can keep in RAM, the less reading from the disk.

Even with all that said and done, aggregating lots of rows takes time.  I 
suggest you come up with a system for pre-aggregating your data if possible.  
Identify all of your target dimensions.  If you're lucky, you only have a few 
key dimensions, which can reduce the size of the table by a lot and reduce 
queries to 1-2 seconds.  There are a number of ways to tackle this, but 
Postgres is a nice db to do this with, since writers do not block readers.
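
A minimal sketch of that pre-aggregation idea (table and column names are 
hypothetical): roll the fact table up to the few dimensions the reports 
actually use, and rebuild the rollup in batch:

```sql
-- Summary table rolled up to the reporting grain: one row per
-- product per month instead of one row per individual sale.
CREATE TABLE sales_summary AS
SELECT product_id,
       date_trunc('month', sale_date) AS month,
       sum(amount) AS total_amount,
       count(*)    AS n_sales
FROM sales
GROUP BY product_id, date_trunc('month', sale_date);

-- Reports then aggregate a few thousand summary rows instead of
-- scanning 40 million detail rows:
SELECT month, sum(total_amount)
FROM sales_summary
WHERE product_id = 42
GROUP BY month;
```

The summary can be rebuilt nightly, or maintained incrementally for just 
the current period, depending on how fresh the reports need to be.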

I think you should focus on getting this system to work well with minimal 
hardware first.  Then you can upgrade.   Over the next few years the db is only 
going to get larger.  You have 4 users now, but who's to say what it will 
evolve into.
_
Earn cashback on your purchases with Live Search - the search that pays you 
back!
http://search.live.com/cashback/?pkw=form=MIJAAF/publ=HMTGL/crea=earncashback