subject:"\[PERFORM\] PostgreSQL clustering VS MySQL clustering"

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-02-06 Thread Jan Wieck

On 1/28/2005 2:49 PM, Christopher Browne wrote:
But there's nothing wrong with the idea of using pg_dump --data-only
against a subscriber node to get you the data without putting a load
on the origin.  And then pulling the schema from the origin, which
oughtn't be terribly expensive there.
And there is a script in the current CVS head that extracts the schema 
from the origin in a clean, slony-traces-removed state.

Jan
--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-31 Thread Alex Turner

fsync on.

Alex Turner
NetEconomist


On Fri, 28 Jan 2005 11:19:44 -0500, Merlin Moncure
[EMAIL PROTECTED] wrote:
  With the right configuration you can get very serious throughput.  The
  new system is processing over 2500 insert transactions per second.  We
  don't need more RAM with this config.  The disks are fast enough.
  2500 transaction/second is pretty damn fast.
 
 fsync on/off?
 
 Merlin
 


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-31 Thread Olivier Sirven

Le Vendredi 21 Janvier 2005 19:18, Marty Scholes a écrit :
 The indexes can be put on a RAM disk tablespace and that's the end of
 index problems -- just make sure you have enough memory available.  Also
 make sure that the machine can restart correctly after a crash: the
 tablespace is dropped and recreated, along with the indexes.  This will
 cause a machine restart to take some time.
Tell me if I am wrong but it sounds to me like like an endless problemThis 
solution may work with small indexes (less than 4GB) but what appends when 
the indexes grow ? You would add more memory to your server ? But there will 
be a moment were you can not add more so what's next ?

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-29 Thread Christopher Weimann

On 01/28/2005-10:59AM, Alex Turner wrote:
 At this point I will interject a couple of benchmark numbers based on
 a new system we just configured as food for thought.
 
 System A (old system):
 Compaq Proliant Dual Pentium III 933 with Smart Array 5300, one RAID
 1, one 3 Disk RAID 5 on 10k RPM drives, 2GB PC133 RAM.  Original
 Price: $6500
 
 System B (new system):
 Self Built Dual Opteron 242 with 2x3ware 9500S-8MI SATA, one RAID 1
 (OS), one 4 drive RAID 10 (pg_xlog), one 6 drive RAID 10 (data) on 10k
 RPM Raptors, 4GB PC3200 RAM.  Current price $7200
 
 System A for our large insert job: 125 minutes
 System B for our large insert job: 10 minutes.
 
 There is no logical way there should be a 12x performance difference
 between these two systems,  maybe 2x or even 4x, but not 12x
 

Your system A has the absolute worst case Raid 5, 3 drives. The more
drives you add to Raid 5 the better it gets but it will never beat Raid
10. On top of it being the worst case, pg_xlog is not on a separate
spindle.

Your system B has a MUCH better config. Raid 10 is faster than Raid 5 to
begin with but on top of that you have more drives involved plus pg_xlog
is on a separate spindle.

I'd say I am not surprised by your performance difference.

 Bad controler cards/configuration will seriously ruin your day.  3ware
 escalade cards are very well supported on linux, and work excellently.
  Compaq smart array cards are not.  Bonnie++ benchmarks show a 9MB/sec
 write, 29MB/sec read on the RAID 5, but a 172MB/sec write on the
 6xRAID 10, and 66MB/sec write on the RAID 1 on the 3ware.
 

What does bonnie say about the Raid 1 on the Compaq? Comparing the two
Raid 1s is really the only valid comparison that can be made between
these two machines.  Other than that you are comparing apples to 
snow shovels.


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-29 Thread Christopher Weimann

On 01/28/2005-05:57PM, Alex Turner wrote:
  
  Your system A has the absolute worst case Raid 5, 3 drives. The more
  drives you add to Raid 5 the better it gets but it will never beat Raid
  10. On top of it being the worst case, pg_xlog is not on a separate
  spindle.
  
 
 True for writes, but not for reads.
 

Good point.

 
 My main point is that you can spend $7k on a server and believe you
 have a fast system.  The person who bought the original system was
 under the delusion that it would make a good DB server.  For the same
 $7k a different configuration can yield a vastly different performance
 output.  This means that it's not quite apples to snow shovels. 

That point is definatly made.  I primarily wanted to point out that the
controlers involved were not the only difference. 

In my experience with SQL servers of various flavors fast disks and 
getting things onto a separate spindles is more important than just
about anything else.  Depending on the size of your 'hot' dataset
RAM could be more important and CPU never is.  


---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-28 Thread Andrew Sullivan

On Mon, Jan 24, 2005 at 01:28:29AM +0200, Hannu Krosing wrote:
 
 IIRC it hates pg_dump mainly on master. If you are able to run pg_dump
 from slave, it should be ok.

For the sake of the archives, that's not really a good idea.  There
is some work afoot to solve it, but at the moment dumping from a
slave gives you a useless database dump.

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]
The fact that technology doesn't work is no bar to success in the marketplace.
--Philip Greenspun

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-28 Thread Alex Turner

At this point I will interject a couple of benchmark numbers based on
a new system we just configured as food for thought.

System A (old system):
Compaq Proliant Dual Pentium III 933 with Smart Array 5300, one RAID
1, one 3 Disk RAID 5 on 10k RPM drives, 2GB PC133 RAM.  Original
Price: $6500

System B (new system):
Self Built Dual Opteron 242 with 2x3ware 9500S-8MI SATA, one RAID 1
(OS), one 4 drive RAID 10 (pg_xlog), one 6 drive RAID 10 (data) on 10k
RPM Raptors, 4GB PC3200 RAM.  Current price $7200

System A for our large insert job: 125 minutes
System B for our large insert job: 10 minutes.

There is no logical way there should be a 12x performance difference
between these two systems,  maybe 2x or even 4x, but not 12x

Bad controler cards/configuration will seriously ruin your day.  3ware
escalade cards are very well supported on linux, and work excellently.
 Compaq smart array cards are not.  Bonnie++ benchmarks show a 9MB/sec
write, 29MB/sec read on the RAID 5, but a 172MB/sec write on the
6xRAID 10, and 66MB/sec write on the RAID 1 on the 3ware.

With the right configuration you can get very serious throughput.  The
new system is processing over 2500 insert transactions per second.  We
don't need more RAM with this config.  The disks are fast enough. 
2500 transaction/second is pretty damn fast.

Alex Turner

On Fri, 28 Jan 2005 10:31:38 -0500, Andrew Sullivan [EMAIL PROTECTED] wrote:
 On Thu, Jan 20, 2005 at 10:40:02PM -0200, Bruno Almeida do Lago wrote:
 
  I was thinking the same! I'd like to know how other databases such as Oracle
  do it.
 
 You mean how Oracle does it.  They're the only ones in the market
 that really have this technology.
 
 A
 
 --
 Andrew Sullivan  | [EMAIL PROTECTED]
 This work was visionary and imaginative, and goes to show that visionary
 and imaginative work need not end up well.
 --Dennis Ritchie
 
 ---(end of broadcast)---
 TIP 7: don't forget to increase your free space map settings


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-28 Thread Christopher Browne

[EMAIL PROTECTED] (Andrew Sullivan) writes:
 On Mon, Jan 24, 2005 at 01:28:29AM +0200, Hannu Krosing wrote:
 
 IIRC it hates pg_dump mainly on master. If you are able to run pg_dump
 from slave, it should be ok.

 For the sake of the archives, that's not really a good idea.  There
 is some work afoot to solve it, but at the moment dumping from a
 slave gives you a useless database dump.

That overstates things a tad; I think it's worth elaborating on a bit.

There's a problem with the results of dumping the _schema_ from a
Slony-I 'subscriber' node; you want to get the schema from the origin
node.  The problem has to do with triggers; Slony-I suppresses RI
triggers and such like on subscriber nodes in a fashion that leaves
the dumped schema a bit broken with regard to triggers.

But there's nothing wrong with the idea of using pg_dump --data-only
against a subscriber node to get you the data without putting a load
on the origin.  And then pulling the schema from the origin, which
oughtn't be terribly expensive there.
-- 
cbbrowne,@,ca.afilias.info
http://dev6.int.libertyrms.com/
Christopher Browne
(416) 673-4124 (land)

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-28 Thread Kevin Brown

PFC wrote:
 So, here is something annoying with the current approach : Updating rows  
 in a table bloats ALL indices, not just those whose indexed values have  
 been actually updated. So if you have a table with many indexed fields and  
 you often update some obscure timestamp field, all the indices will bloat,  
 which will of course be corrected by VACUUM, but vacuum will have extra  
 work to do.

The MVCC approach probably doesn't leave you with many choices here.
The index entries point directly to the rows in the table, and since
an update creates a new row (it's the equivalent of doing an insert
then a delete), all indexes have to be updated to reflect the location
of the new row.

Unless my understanding of how this works is completely off...




-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-28 Thread William Yu

Hervé Piedvache wrote:
My point being is that there is no free solution.  There simply isn't.
I don't know why you insist on keeping all your data in RAM, but the
mysql cluster requires that ALL data MUST fit in RAM all the time.

I don't insist about have data in RAM  but when you use PostgreSQL with 
big database you know that for quick access just for reading the index file 
for example it's better to have many RAM as possible ... I just want to be 
able to get a quick access with a growing and growind database ...
If it's an issue of RAM and not CPU power, think about this scenario. 
Let's just say you *COULD* partition your DB over multiple servers. What 
 are your plans then? Are you going to buy 4 Dual Xeon servers? Ok, 
let's price that out.

For a full-blown rackmount server w/ RAID, 6+ SCSI drives and so on, you 
are looking at roughly $4000 per machine. So now you have 4 machines -- 
total of 16GB of RAM over the 4 machines.

On the otherhand, let's say you spent that money on a Quad Opteron 
instead. 4x850 will cost you roughly $8000. 16GB of RAM using 1GB DIMMs 
is $3000. If you went with 2GB DIMMs, you could stuff 32GB of RAM onto 
that machine for $7500.

Let's review the math:
4X server cluster, total 16GB RAM = $16K
1 beefy server w/ 16GB RAM = $11K
1 beefy server w/ 32GB RAM = $16K
I know what I would choose. I'd get the mega server w/ a ton of RAM and 
skip all the trickyness of partitioning a DB over multiple servers. Yes 
your data will grow to a point where even the XXGB can't cache 
everything. On the otherhand, memory prices drop just as fast. By that 
time, you can ebay your original 16/32GB and get 64/128GB.

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-28 Thread Alex Turner

On Fri, 28 Jan 2005 11:54:57 -0500, Christopher Weimann
[EMAIL PROTECTED] wrote:
 On 01/28/2005-10:59AM, Alex Turner wrote:
  At this point I will interject a couple of benchmark numbers based on
  a new system we just configured as food for thought.
 
  System A (old system):
  Compaq Proliant Dual Pentium III 933 with Smart Array 5300, one RAID
  1, one 3 Disk RAID 5 on 10k RPM drives, 2GB PC133 RAM.  Original
  Price: $6500
 
  System B (new system):
  Self Built Dual Opteron 242 with 2x3ware 9500S-8MI SATA, one RAID 1
  (OS), one 4 drive RAID 10 (pg_xlog), one 6 drive RAID 10 (data) on 10k
  RPM Raptors, 4GB PC3200 RAM.  Current price $7200
 
  System A for our large insert job: 125 minutes
  System B for our large insert job: 10 minutes.
 
  There is no logical way there should be a 12x performance difference
  between these two systems,  maybe 2x or even 4x, but not 12x
 
 
 Your system A has the absolute worst case Raid 5, 3 drives. The more
 drives you add to Raid 5 the better it gets but it will never beat Raid
 10. On top of it being the worst case, pg_xlog is not on a separate
 spindle.
 

True for writes, but not for reads.

 Your system B has a MUCH better config. Raid 10 is faster than Raid 5 to
 begin with but on top of that you have more drives involved plus pg_xlog
 is on a separate spindle.

I absolutely agree, it is a much better config, thats why we bought it
;)..  In system A, the xlog was actualy on the RAID 1, so it was
infact on a seperate spindle set.

 
 I'd say I am not surprised by your performance difference.
 

I'm not surprised at all that the new system outperformed the old,
it's more the factor of improvement.  12x is a _VERY_ big performance
jump.

  Bad controler cards/configuration will seriously ruin your day.  3ware
  escalade cards are very well supported on linux, and work excellently.
   Compaq smart array cards are not.  Bonnie++ benchmarks show a 9MB/sec
  write, 29MB/sec read on the RAID 5, but a 172MB/sec write on the
  6xRAID 10, and 66MB/sec write on the RAID 1 on the 3ware.
 
 
 What does bonnie say about the Raid 1 on the Compaq? Comparing the two
 Raid 1s is really the only valid comparison that can be made between
 these two machines.  Other than that you are comparing apples to
 snow shovels.
 
 


My main point is that you can spend $7k on a server and believe you
have a fast system.  The person who bought the original system was
under the delusion that it would make a good DB server.  For the same
$7k a different configuration can yield a vastly different performance
output.  This means that it's not quite apples to snow shovels. 
People who _believe_ they have an adequate config are often sorely
mistaken, and ask misguided questions about needed 20GB of RAM because
the system can't page to disk fast enough, when what they really need
is a good RAID 10 with a high quality controler.  A six drive RAID 10
is going to run a bit less than 20G of SSD.

Alex Turner
NetEconomist

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-28 Thread Greg Stark


William Yu [EMAIL PROTECTED] writes:

 1 beefy server w/ 32GB RAM = $16K
 
 I know what I would choose. I'd get the mega server w/ a ton of RAM and skip
 all the trickyness of partitioning a DB over multiple servers. Yes your data
 will grow to a point where even the XXGB can't cache everything. On the
 otherhand, memory prices drop just as fast. By that time, you can ebay your
 original 16/32GB and get 64/128GB.

a) What do you do when your calculations show you need 256G of ram? [Yes such
machines exist but you're not longer in the realm of simply add more RAM.
Administering such machines is nigh as complex as clustering]

b) What do you do when you find you need multiple machines anyways to divide
the CPU or I/O or network load up. Now you need n big beefy servers when n
servers 1/nth as large would really have sufficed. This is a big difference
when you're talking about the difference between colocating 16 1U boxen with
4G of ram vs 16 4U opterons with 64G of RAM...

All that said, yes, speaking as a user I think the path of least resistance is
to build n complete slaves using Slony and then just divide the workload.
That's how I'm picturing going when I get to that point.

Even if I just divide the workload randomly it's easier than building a
machine with n times the cpu and i/o. And if I divide the workload up in a way
that correlates with data in the database I can probably get close to the same
performance as clustering. The actual cost of replicating the unused data is
slight. And the simplicity of master-slave makes it much more appealing than
full on clustering.

-- 
greg


---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-26 Thread Hannu Krosing

Ühel kenal päeval (teisipäev, 25. jaanuar 2005, 10:41-0500), kirjutas
Tom Lane:
 Hannu Krosing [EMAIL PROTECTED] writes:
  Why is removing index entries essential ?
 
 Because once you re-use the tuple slot, any leftover index entries would
 be pointing to the wrong rows.

That much I understood ;)

But can't clearing up the index be left for later ? 

Indexscan has to check the data tuple anyway, at least for visibility.
would adding the check for field sameness in index and data tuples be
too big performance hit ?

   regards, tom lane
 
 ---(end of broadcast)---
 TIP 5: Have you checked our extensive FAQ?
 
http://www.postgresql.org/docs/faq
-- 
Hannu Krosing [EMAIL PROTECTED]

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-26 Thread Tom Lane

Hannu Krosing [EMAIL PROTECTED] writes:
 But can't clearing up the index be left for later ? 

Based on what?  Are you going to store the information about what has to
be cleaned up somewhere else, and if so where?

 Indexscan has to check the data tuple anyway, at least for visibility.
 would adding the check for field sameness in index and data tuples be
 too big performance hit ?

It does pretty much suck, especially when you think about functional
indexes on expensive functions.

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-26 Thread PFC

http://borg.postgresql.org/docs/8.0/interactive/storage-page-layout.html

If you vacuum as part of the transaction it's going to be more efficient
of resources, because you have more of what you need right there (ie:
odds are that you're on the same page as the old tuple). In cases like
that it very likely makes a lot of sense to take a small hit in your
transaction time up-front, instead of a larger hit doing a vacuum down
the road.
	Some pros would be that you're going to make a disk write anyway because  
the page is modified, so why not vacuum that page while it's there. If the  
machine is CPU bound you lose, if it's IO bound you save some IO, but the  
cost of index updates has to be taken into account...

It prompted a few questions :
Note : temp contains 128k (131072) values generated from a sequence.
create table test (id serial primary key, a integer, z integer, e integer,  
r integer, t integer, y integer ) without oids;
insert into test (id,a,z,e,r,t,y) select id,0,0,0,0,0,0 from temp;
 INSERT 0 131072

explain analyze update test set y=1;
 Seq Scan on test  (cost=0.00..2226.84 rows=126284 width=30) (ac Seq Scan  
on test  (cost=0.00..2274.72 rows=131072 width=30) (actual  
time=0.046..964.590 rows=131072 loops=1)
 Total runtime: 15628.143 ms
tual time=0.047..617.553 rows=131072 loops=1)
 Total runtime: 4432.509 ms

explain analyze update test set y=1;
 Seq Scan on test  (cost=0.00..4453.68 rows=252568 width=30) (actual  
time=52.198..611.594 rows=131072 loops=1)
 Total runtime: 5739.064 ms

explain analyze update test set y=1;
 Seq Scan on test  (cost=0.00..6680.52 rows=378852 width=30) (actual  
time=127.301..848.762 rows=131072 loops=1)
 Total runtime: 6548.206 ms

Gets slower as more and more dead tuples accumulate... normal as this is a  
seq scan. Note the row estimations getting bigger with the table size...

vacuum full test;
explain analyze update test set y=1;
 Seq Scan on test  (cost=0.00..2274.72 rows=131072 width=30) (actual  
time=0.019..779.864 rows=131072 loops=1)
 Total runtime: 5600.311 ms

vacuum full test;
explain analyze update test set y=1;
 Seq Scan on test  (cost=0.00..2274.72 rows=131072 width=30) (actual  
time=0.039..1021.847 rows=131072 loops=1)
 Total runtime: 5126.590 ms

- Seems vacuum full does its job
vacuum test;
explain analyze update test set y=1;
 Seq Scan on test  (cost=0.00..3894.08 rows=196608 width=30) (actual  
time=36.491..860.135 rows=131072 loops=1)
 Total runtime: 7293.698 ms

vacuum test;
explain analyze update test set y=1;
 Seq Scan on test  (cost=0.00..3894.08 rows=196608 width=30) (actual  
time=0.044..657.125 rows=131072 loops=1)
 Total runtime: 5934.141 ms

vacuum analyze test;
explain analyze update test set y=1;
 Seq Scan on test  (cost=0.00..3894.08 rows=196608 width=30) (actual  
time=0.018..871.132 rows=131072 loops=1)
 Total runtime: 5548.053 ms

- here vacuum is about as slow as vacuum full (which is normal as the  
whole table is updated) however the row estimation is still off even after  
ANALYZE.

 Let's create a few indices :
vacuum full test;
create index testa on test(a);
create index testz on test(z);
create index teste on test(e);
create index testr on test(r);
create index testt on test(t);
-- we don't create an index on y
vacuum full test;
explain analyze update test set a=id;
 Seq Scan on test  (cost=0.00..2274.72 rows=131072 width=30) (actual  
time=0.044..846.102 rows=131072 loops=1)
 Total runtime: 14998.307 ms

We see that the index updating time has made this query a lot slower. This  
is normal, but :

vacuum full test;
explain analyze update test set a=id;
 Seq Scan on test  (cost=0.00..2274.72 rows=131072 width=30) (actual  
time=0.045..1387.626 rows=131072 loops=1)
 Total runtime: 17644.368 ms

Now, we updated ALL rows but didn't actually change a single value.  
However it took about the same time as the first one. I guess the updates  
all really took place, even if all it did was copy the rows with new  
transaction ID's.
Now, let's update a column which is not indexed :

vacuum full test;
explain analyze update test set y=id;
 Seq Scan on test  (cost=0.00..2274.72 rows=131072 width=30) (actual  
time=0.046..964.590 rows=131072 loops=1)
 Total runtime: 15628.143 ms

Takes 'bout the same time : the indexes still have to be updated to  
reference the new rows after all.

So, here is something annoying with the current approach : Updating rows  
in a table bloats ALL indices, not just those whose indexed values have  
been actually updated. So if you have a table with many indexed fields and  
you often update some obscure timestamp field, all the indices will bloat,  
which will of course be corrected by VACUUM, but vacuum will have extra  
work to do.

I don't have suggestions, just questions :
	Is there a way that an update to the indices can be avoided if the  
indexed values do not change ?
	Would it depend if an updated tuple can be stored on the same page it was  
before (along with the

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-25 Thread Hannu Krosing

Ühel kenal päeval (esmaspäev, 24. jaanuar 2005, 11:52+0900), kirjutas
Tatsuo Ishii:
  Tatsuo Ishii [EMAIL PROTECTED] writes:
   Probably VACUUM works well for small to medium size tables, but not
   for huge ones. I'm considering about to implement on the spot
   salvaging dead tuples.
  
  That's impossible on its face, except for the special case where the
  same transaction inserts and deletes a tuple.  In all other cases, the
  transaction deleting a tuple cannot know whether it will commit.
 
 Of course. We need to keep a list of such that tuples until commit or
 abort.

what about other transactions, which may have started before current one
and be still running when current one commites ?


I once proposed an extra parameter added to VACUUM FULL which determines
how much free space to leave in each page vacuumed. If there were room
the new tuple could be placed near the old one in most cases and thus
avoid lots of disk head movement when updating huge tables in one go.



Hannu Krosing [EMAIL PROTECTED]

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-25 Thread Hannu Krosing

Ühel kenal päeval (pühapäev, 23. jaanuar 2005, 15:40-0500), kirjutas Tom
Lane:
 Simon Riggs [EMAIL PROTECTED] writes:
  Changing the idea slightly might be better: if a row update would cause
  a block split, then if there is more than one row version then we vacuum
  the whole block first, then re-attempt the update.
 
 Block split?  I think you are confusing tables with indexes.
 
 Chasing down prior versions of the same row is not very practical
 anyway, since there is no direct way to find them.
 
 One possibility is, if you tried to insert a row on a given page but
 there's not room, to look through the other rows on the same page to see
 if any are deletable (xmax below the GlobalXmin event horizon).  This
 strikes me as a fairly expensive operation though, especially when you
 take into account the need to get rid of their index entries first.

Why is removing index entries essential ?

In pg yuo always have to visit data page, so finding the wrong tuple
there could just produce the same result as deleted tuple (which in this
case it actually is). The cleaning of index entries could be left to the
real vacuum.

-- 
Hannu Krosing [EMAIL PROTECTED]

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-25 Thread Hannu Krosing

Ühel kenal päeval (neljapäev, 20. jaanuar 2005, 16:00+0100), kirjutas
Hervé Piedvache:

  Will both do what you want. Replicator is easier to setup but
  Slony is free.
 
 No ... as I have said ... how I'll manage a database getting a table of may 
 be 
 250 000 000 records ? I'll need incredible servers ... to get quick access or 
 index reading ... no ?
 
 So what we would like to get is a pool of small servers able to make one 
 virtual server ... for that is called a Cluster ... no ?
 
 I know they are not using PostgreSQL ... but how a company like Google do to 
 get an incredible database in size and so quick access ?

They use lots of boxes and lots custom software to implement a very
specific kind of cluster.

 regards,
-- 
Hannu Krosing [EMAIL PROTECTED]

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-25 Thread Hannu Krosing

Ühel kenal päeval (neljapäev, 20. jaanuar 2005, 11:02-0500), kirjutas
Rod Taylor:


 Slony has some other issues with databases  200GB in size as well
 (well, it hates long running transactions -- and pg_dump is a regular
 long running transaction)

IIRC it hates pg_dump mainly on master. If you are able to run pg_dump
from slave, it should be ok.

-- 
Hannu Krosing [EMAIL PROTECTED]

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-25 Thread Tatsuo Ishii

   Tatsuo Ishii [EMAIL PROTECTED] writes:
Probably VACUUM works well for small to medium size tables, but not
for huge ones. I'm considering about to implement on the spot
salvaging dead tuples.
   
   That's impossible on its face, except for the special case where the
   same transaction inserts and deletes a tuple.  In all other cases, the
   transaction deleting a tuple cannot know whether it will commit.
  
  Of course. We need to keep a list of such that tuples until commit or
  abort.
 
 what about other transactions, which may have started before current one
 and be still running when current one commites ?

Then dead tuples should be left. Perhaps in this case we could
register them in FSM or whatever for later processing.
--
Tatsuo Ishii

 I once proposed an extra parameter added to VACUUM FULL which determines
 how much free space to leave in each page vacuumed. If there were room
 the new tuple could be placed near the old one in most cases and thus
 avoid lots of disk head movement when updating huge tables in one go.
 
 
 
 Hannu Krosing [EMAIL PROTECTED]
 
 ---(end of broadcast)---
 TIP 6: Have you searched our list archives?
 
http://archives.postgresql.org
 

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-25 Thread Tom Lane

Hannu Krosing [EMAIL PROTECTED] writes:
 Why is removing index entries essential ?

Because once you re-use the tuple slot, any leftover index entries would
be pointing to the wrong rows.

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-24 Thread Jim C. Nasby

On Sun, Jan 23, 2005 at 03:40:03PM -0500, Tom Lane wrote:
 The real issue with any such scheme is that you are putting maintenance
 costs into the critical paths of foreground processes that are executing
 user queries.  I think that one of the primary advantages of the
 Postgres storage design is that we keep that work outside the critical
 path and delegate it to maintenance processes that can run in the
 background.  We shouldn't lightly toss away that advantage.

To pull out the oft-used show me the numbers card... has anyone done a
study to see if keeping this stuff out of the 'critical path' actually
helps overall system performance? While the current scheme initially
speeds up transactions, eventually you have to run vacuum, which puts a
big load on the system. If you can put off vacuuming until off-hours
(assuming your system has off-hours), then this doesn't matter, but more
and more we're seeing systems where vacuum is a big performance issue
(hence recent work with the delay in vacuum so as not to swamp the IO
system).

If you vacuum as part of the transaction it's going to be more efficient
of resources, because you have more of what you need right there (ie:
odds are that you're on the same page as the old tuple). In cases like
that it very likely makes a lot of sense to take a small hit in your
transaction time up-front, instead of a larger hit doing a vacuum down
the road.

Of course, without numbers this is a bunch of hand-waving, but I don't
think it's valid to assume that minimizing the amount of work you do in
a transaction means better throughput without considering what it will
cost to do the work you're putting off until later.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-23 Thread Simon Riggs

On Sat, 2005-01-22 at 16:10 -0500, Tom Lane wrote:
 Tatsuo Ishii [EMAIL PROTECTED] writes:
  Probably VACUUM works well for small to medium size tables, but not
  for huge ones. I'm considering about to implement on the spot
  salvaging dead tuples.
 
 That's impossible on its face, except for the special case where the
 same transaction inserts and deletes a tuple.  In all other cases, the
 transaction deleting a tuple cannot know whether it will commit.

Perhaps Tatsuo has an idea...

As Tom says, if you have only a single row version and then you update
that row to create a second version, then we must not remove the first
version, since it is effectively the Undo copy.

However, if there were already 2+ row versions, then as Tatsuo suggests,
it might be possible to use on the spot salvaging of dead tuples. It
might be worth checking the Xid of the earlier row version(s), to see if
they are now expired and could be removed immediately.

However, if you had a high number of concurrent updaters, this extra
effort would not be that useful, since the other row versions might
still be transaction-in-progress versions. That would mean implementing
this idea would be useful often, but not in the case of repeatedly
updated rows.

Changing the idea slightly might be better: if a row update would cause
a block split, then if there is more than one row version then we vacuum
the whole block first, then re-attempt the update. That way we wouldn't
do the row every time, just when it becomes a problem.

I'm suggesting putting a call to vacuum_page() into heap_update(),
immediately before any call to RelationGetBufferForTuple().

We already know that page splitting is an expensive operation, so doing
some work to try to avoid that could frequently pay off. This would be
isolated to updating. 

This wouldn't remove the need for vacuuming, but it would act to prevent
severe performance degradation caused by frequent re-updating.

What do you think?

-- 
Best Regards, Simon Riggs


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-23 Thread Tom Lane

Simon Riggs [EMAIL PROTECTED] writes:
 Changing the idea slightly might be better: if a row update would cause
 a block split, then if there is more than one row version then we vacuum
 the whole block first, then re-attempt the update.

Block split?  I think you are confusing tables with indexes.

Chasing down prior versions of the same row is not very practical
anyway, since there is no direct way to find them.

One possibility is, if you tried to insert a row on a given page but
there's not room, to look through the other rows on the same page to see
if any are deletable (xmax below the GlobalXmin event horizon).  This
strikes me as a fairly expensive operation though, especially when you
take into account the need to get rid of their index entries first.
Moreover, the check would often be unproductive.

The real issue with any such scheme is that you are putting maintenance
costs into the critical paths of foreground processes that are executing
user queries.  I think that one of the primary advantages of the
Postgres storage design is that we keep that work outside the critical
path and delegate it to maintenance processes that can run in the
background.  We shouldn't lightly toss away that advantage.

There was some discussion in Toronto this week about storing bitmaps
that would tell VACUUM whether or not there was any need to visit
individual pages of each table.  Getting rid of useless scans through
not-recently-changed areas of large tables would make for a significant
reduction in the cost of VACUUM.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-23 Thread Jim C. Nasby

On Sun, Jan 23, 2005 at 03:40:03PM -0500, Tom Lane wrote:
 There was some discussion in Toronto this week about storing bitmaps
 that would tell VACUUM whether or not there was any need to visit
 individual pages of each table.  Getting rid of useless scans through
 not-recently-changed areas of large tables would make for a significant
 reduction in the cost of VACUUM.
FWIW, that's already on the TODO. See also
http://lnk.nu/archives.postgresql.org/142.php.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-23 Thread Josh Berkus

Tatsuo,

 I'm not clear what pgPool only needs to monitor update switching by

 *connection* not by *table* means. In your example:
  (1) 00:00 User A updates My Profile
  (2) 00:01 My Profile UPDATE finishes executing.
  (3) 00:02  User A sees My Profile re-displayed
  (6) 00:04  My Profile:UserA cascades to the last Slave server

 I think (2) and (3) are on different connections, thus pgpool cannot
 judge if SELECT in (3) should go only to the master or not.

 To solve the problem you need to make pgpool understand web sessions
 not database connections and it seems impossible for pgpool to
 understand sessions.

Depends on your connection pooling software, I suppose.  Most connection 
pooling software only returns connections to the pool after a user has been 
inactive for some period ... generally more than 3 seconds.  So connection 
continuity could be trusted.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-23 Thread Guy Thornley

 The real issue with any such scheme is that you are putting maintenance
 costs into the critical paths of foreground processes that are executing
 user queries.  I think that one of the primary advantages of the
 Postgres storage design is that we keep that work outside the critical
 path and delegate it to maintenance processes that can run in the
 background.  We shouldn't lightly toss away that advantage.

As a rather naive user, I'd consider modifying the FSM so that it has pages
with 'possibly freeable' space on them, as well as those with free space.

This way when the pages of actually free space is depleted, the list of
'possibly freeable' pages could be vacuumed (as a batch for that relation)
then placed on the actually-free list like vacuum currently does

Since there is concern about critical path performance, there could be an
extra backend process that would wake up perodically (or on a signal) and
vacuum the pages, so theyre not processed inline with some transaction. Then
grabbing a page with free space is the same as it is currently.

Actually I was hoping to find some time to investigate this myself, but my
employer is keeping me busy with other tasks ;/. Our particular data
management problems could be mitigated much better with a data partitioning
approach, anyway.

On another note, is anybody investigating backing up the FSM with disk files
so when the FSM size exceeds memory allocated, the appropriate data is
swapped to disk? At least since 7.4 you no longer need a VACUUM when
postgres starts, to learn about free space ;)

- Guy Thornley

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-23 Thread Tatsuo Ishii

 Tatsuo,
 
  I'm not clear what pgPool only needs to monitor update switching by
 
  *connection* not by *table* means. In your example:
   (1) 00:00 User A updates My Profile
   (2) 00:01 My Profile UPDATE finishes executing.
   (3) 00:02  User A sees My Profile re-displayed
   (6) 00:04  My Profile:UserA cascades to the last Slave server
 
  I think (2) and (3) are on different connections, thus pgpool cannot
  judge if SELECT in (3) should go only to the master or not.
 
  To solve the problem you need to make pgpool understand web sessions
  not database connections and it seems impossible for pgpool to
  understand sessions.
 
 Depends on your connection pooling software, I suppose.  Most connection 
 pooling software only returns connections to the pool after a user has been 
 inactive for some period ... generally more than 3 seconds.  So connection 
 continuity could be trusted.

Not sure what you mean by most connection pooling software, but I'm
sure that pgpool behaves differently.
--
Tatsuo Ishii

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-23 Thread Tatsuo Ishii

 Tatsuo Ishii [EMAIL PROTECTED] writes:
  Probably VACUUM works well for small to medium size tables, but not
  for huge ones. I'm considering about to implement on the spot
  salvaging dead tuples.
 
 That's impossible on its face, except for the special case where the
 same transaction inserts and deletes a tuple.  In all other cases, the
 transaction deleting a tuple cannot know whether it will commit.

Of course. We need to keep a list of such that tuples until commit or
abort.
--
Tatsuo Ishii

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

[PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-22 Thread Dawid Kuroczko

On Sat, 22 Jan 2005 12:13:00 +0900 (JST), Tatsuo Ishii
[EMAIL PROTECTED] wrote:
 IMO the bottle neck is not WAL but table/index bloat. Lots of updates
 on large tables will produce lots of dead tuples. Problem is, There'
 is no effective way to reuse these dead tuples since VACUUM on huge
 tables takes longer time. 8.0 adds new vacuum delay
 paramters. Unfortunately this does not help. It just make the
 execution time of VACUUM longer, that means more and more dead tuples
 are being made while updating.

 Probably VACUUM works well for small to medium size tables, but not
 for huge ones. I'm considering about to implement on the spot
 salvaging dead tuples.

Quick thought -- would it be to possible to implement a 'partial VACUUM'
per analogiam to partial indexes?

It would be then posiible to do:
VACUUM footable WHERE footime  current_date - 60;
after a statement to DELETE all/some rows older than 60 days.

The VACUUM would check visibility of columns which are mentioned
in an index (in this case: footable_footime_index ;)).

Of course it is not a great solution, but could be great for doing
housecleaning after large update/delete in a known range.

...and should be relatively simple to implement, I guess
(maybe without 'ANALYZE' part).

  Regards,
  Dawid

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-22 Thread Greg Stark

Dawid Kuroczko [EMAIL PROTECTED] writes:

 Quick thought -- would it be to possible to implement a 'partial VACUUM'
 per analogiam to partial indexes?

No.

But it gave me another idea. Perhaps equally infeasible, but I don't see why.

What if there were a map of modified pages. So every time any tuple was marked
deleted it could be marked in the map as modified. VACUUM would only have to
look at these pages. And if it could mark as free every tuple that was marked
as deleted then it could unmark the page.

The only downside I see is that this could be a source of contention on
multi-processor machines running lots of concurrent update/deletes.

-- 
greg


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-22 Thread Bruno Wolff III

On Sat, Jan 22, 2005 at 12:13:00 +0900,
  Tatsuo Ishii [EMAIL PROTECTED] wrote:
 
 Probably VACUUM works well for small to medium size tables, but not
 for huge ones. I'm considering about to implement on the spot
 salvaging dead tuples.

You are probably vacuuming too often. You want to wait until a significant
fraction of a large table is dead tuples before doing a vacuum. If you are
scanning a large table and only marking a few tuples as deleted, you aren't
getting much bang for your buck.

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-22 Thread Rod Taylor

On Sat, 2005-01-22 at 12:41 -0600, Bruno Wolff III wrote:
 On Sat, Jan 22, 2005 at 12:13:00 +0900,
   Tatsuo Ishii [EMAIL PROTECTED] wrote:
  
  Probably VACUUM works well for small to medium size tables, but not
  for huge ones. I'm considering about to implement on the spot
  salvaging dead tuples.
 
 You are probably vacuuming too often. You want to wait until a significant
 fraction of a large table is dead tuples before doing a vacuum. If you are
 scanning a large table and only marking a few tuples as deleted, you aren't
 getting much bang for your buck.

The big problem occurs when you have a small set of hot tuples within a
large table.  In the time it takes to vacuum a table with 200M tuples
one can update a small subset of that table many many times.

Some special purpose vacuum which can target hot spots would be great,
but I've always assumed this would come in the form of table
partitioning and the ability to vacuum different partitions
independently of each-other.

-- 


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-22 Thread Jim C. Nasby

From http://developer.postgresql.org/todo.php:

Maintain a map of recently-expired rows

This allows vacuum to reclaim free space without requiring a sequential
scan 

On Sat, Jan 22, 2005 at 12:20:53PM -0500, Greg Stark wrote:
 Dawid Kuroczko [EMAIL PROTECTED] writes:
 
  Quick thought -- would it be to possible to implement a 'partial VACUUM'
  per analogiam to partial indexes?
 
 No.
 
 But it gave me another idea. Perhaps equally infeasible, but I don't see why.
 
 What if there were a map of modified pages. So every time any tuple was marked
 deleted it could be marked in the map as modified. VACUUM would only have to
 look at these pages. And if it could mark as free every tuple that was marked
 as deleted then it could unmark the page.
 
 The only downside I see is that this could be a source of contention on
 multi-processor machines running lots of concurrent update/deletes.
 
 -- 
 greg
 
 
 ---(end of broadcast)---
 TIP 6: Have you searched our list archives?
 
http://archives.postgresql.org
 

-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-22 Thread Tom Lane

Tatsuo Ishii [EMAIL PROTECTED] writes:
 Probably VACUUM works well for small to medium size tables, but not
 for huge ones. I'm considering about to implement on the spot
 salvaging dead tuples.

That's impossible on its face, except for the special case where the
same transaction inserts and deletes a tuple.  In all other cases, the
transaction deleting a tuple cannot know whether it will commit.

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-22 Thread Christopher Browne

In an attempt to throw the authorities off his trail, [EMAIL PROTECTED] (Hervé 
Piedvache) transmitted:
 Le Jeudi 20 Janvier 2005 15:24, Christopher Kings-Lynne a écrit :
  Is there any solution with PostgreSQL matching these needs ... ?

 You want: http://www.slony.info/

  Do we have to backport our development to MySQL for this kind of problem
  ? Is there any other solution than a Cluster for our problem ?

 Well, Slony does replication which is basically what you want :)

 Only master-slave though, so you will need to have all inserts go via
 the master server, but selects can come off any server.

 Sorry but I don't agree with this ... Slony is a replication
 solution ... I don't need replication ... what will I do when my
 database will grow up to 50 Gb ... I'll need more than 50 Gb of RAM
 on each server ???  This solution is not very realistic for me ...

Huh?  Why on earth do you imagine that Slony-I requires a lot of
memory?

It doesn't.  A fairly _large_ Slony-I process is about 10MB.  There
will be some demand for memory on the DB servers, but you don't need
an enormous quantity of extra memory to run it.

There is a MySQL replicating/clustering system that uses an
in-memory database which means that if your DB is 50GB in size, you
need something like 200GB of RAM.  If you're thinking of that, that's
not relevant to PostgreSQL or Slony-I...

 I need a Cluster solution not a replication one or explain me in
 details how I will do for managing the scalabilty of my database ...

I'm not sure you understand clustering if you imagine it doesn't
involve replication.

There are numerous models for clustering, much as there are numerous
RAID models.

But the only sorts of clustering cases where you get to NOT do
replication are the cases where all you're looking for from clustering
is improved speed, and you're willing for any breakage on any host to
potentially destroy your cluster.

Perhaps you need to describe what you _think_ you mean by a cluster
solution.  It may be that it'll take further thought to determine
what you actually need...
-- 
output = (cbbrowne @ gmail.com)
http://www3.sympatico.ca/cbbrowne/postgresql.html
Not  me, guy. I  read the  Bash man  page each  day like  a Jehovah's
Witness reads  the Bible.  No  wait, the Bash  man page IS  the bible.
Excuseme...(More   onconfusing   aliases,taken   from
comp.os.linux.misc)

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-22 Thread Christopher Browne

In the last exciting episode, [EMAIL PROTECTED] (Hervé Piedvache) wrote:
 Le Jeudi 20 Janvier 2005 16:05, Joshua D. Drake a écrit :
 Christopher Kings-Lynne wrote:
  Or you could fork over  hundreds of thousands of dollars for Oracle's
  RAC.
 
  No please do not talk about this again ... I'm looking about a
  PostgreSQL solution ... I know RAC ... and I'm not able to pay for a
  RAC certify hardware configuration plus a RAC Licence.
 
  There is absolutely zero PostgreSQL solution...

 I just replied the same thing but then I was thinking. Couldn't he use
 multiple databases
 over multiple servers with dblink?

 It is not exactly how I would want to do it, but it would provide what
 he needs I think???

 Yes seems to be the only solution ... but I'm a little disapointed about 
 this ... could you explain me why there is not this kind of 
 functionnality ... it seems to be a real need for big applications no ?

If this is what you actually need, well, it's something that lots of
people would sort of like to have, but it's really DIFFICULT to
implement it.

Partitioning data onto different servers appears like it ought to be a
good idea.  Unfortunately, getting _exactly_ the right semantics is
hard.  

If the data is all truly independent, then it's no big deal; just have
one server for one set of data, and another for the other.

But reality normally is that if you _think_ you need a cluster, that's
because some of the data needs to be _shared_, which means you need to
either:

 a) Have queries that run across two databases, or

 b) Replicate the shared data between the systems.

We're likely back to the need for replication.
-- 
If this was helpful, http://svcs.affero.net/rm.php?r=cbbrowne rate me
http://www3.sympatico.ca/cbbrowne/rdbms.html
It is the user who should parameterize procedures, not their
creators. -- Alan Perlis

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-22 Thread Christopher Browne

Quoth Ron Mayer [EMAIL PROTECTED]:
 Merlin Moncure wrote:
 ...You need to build a bigger, faster box with lots of storage...
 Clustering ... B: will cost you more, not less


 Is this still true when you get to 5-way or 17-way systems?

 My (somewhat outdated) impression is that up to about 4-way systems
 they're price competitive; but beyond that, I thought multiple cheap
 servers scales much more afordably than large servers.   Certainly
 at the point of a 129-CPU system I bet you're better off with a
 network of cheap servers.

Not necessarily.

If you have 129 boxes that you're trying to keep synced, it is likely
that the cost of syncing them will be greater than the other write
load.

If the problem being addressed is that a 4-way box won't handle the
transaction load, it is unlikely that building a cluster of _smaller_
machines will help terribly much.

The reason to cluster in the context of a transactional system is
that you need improved _reliability_.  

Since communications between servers is _thousands_ of times slower
than communicating with local memory, you have to be willing to live
with an ENORMOUS degradation of performance when hosts are
synchronized.

And if real estate has a cost, where you have to pay for rack space,
having _fewer_ machines is preferable to having more.
-- 
output = (cbbrowne @ gmail.com)
http://www.ntlug.org/~cbbrowne/postgresql.html
If con is the opposite of pro, is Congress the opposite of progress?

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-22 Thread Christopher Browne

After a long battle with technology, [EMAIL PROTECTED] (Hervé Piedvache), an 
earthling, wrote:
 Joshua,

 Le Jeudi 20 Janvier 2005 15:44, Joshua D. Drake a écrit :
 Hervé Piedvache wrote:
 
 My company, which I actually represent, is a fervent user of PostgreSQL.
 We used to make all our applications using PostgreSQL for more than 5
  years. We usually do classical client/server applications under Linux,
  and Web interface (php, perl, C/C++). We used to manage also public web
  services with 10/15 millions records and up to 8 millions pages view by
  month.

 Depending on your needs either:

 Slony: www.slony.info

 or

 Replicator: www.commandprompt.com

 Will both do what you want. Replicator is easier to setup but
 Slony is free.

 No ... as I have said ... how I'll manage a database getting a table
 of may be 250 000 000 records ? I'll need incredible servers ... to
 get quick access or index reading ... no ?

 So what we would like to get is a pool of small servers able to make
 one virtual server ... for that is called a Cluster ... no ?

The term cluster simply indicates the use of multiple servers.

There are numerous _DIFFERENT_ forms of clusters, so that for
someone to say I want a cluster commonly implies that since they
didn't realize the need to specify things further, they really don't
know what they want in a usefully identifiable way.

 I know they are not using PostgreSQL ... but how a company like
 Google do to get an incredible database in size and so quick access
 ?

Google has built a specialized application that evidently falls into
the category known as embarrassingly parallel.
http://c2.com/cgi/wiki?EmbarrassinglyParallel

There are classes of applications that are amenable to
parallelization.

Those tend to be applications completely different from those
implemented atop transactional data stores like PostgreSQL.

If your problem is embarrassingly parallel, then I'd bet lunch that
PostgreSQL (and all other SQL databases) are exactly the _wrong_ tool
for implementing its data store.

If your problem is _not_ embarrassingly parallel, then you'll almost
certainly discover that the cheapest way to make it fast involves
fitting all the data onto _one_ computer so that you do not have to
pay the costs of transmitting data over slow inter-computer
communications links.
-- 
let name=cbbrowne and tld=gmail.com in String.concat @ [name;tld];;
http://www.ntlug.org/~cbbrowne/
It isn't that  physicists enjoy physics  more than they enjoy sex, its
that they enjoy sex more when they are thinking of physics.

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-21 Thread Tatsuo Ishii

 Tatsuo,
 
  Yes. However it would be pretty easy to modify pgpool so that it could
  cope with Slony-I. I.e.
 
  1) pgpool does the load balance and sends query to Slony-I's slave and
 master if the query is SELECT.
 
  2) pgpool sends query only to the master if the query is other than
 SELECT.
 
  Remaining problem is that Slony-I is not a sync replication
  solution. Thus you need to prepare that the load balanced query
  results might differ among servers.
 
 Yes, please, some of us are already doing the above ad-hoc.
 
 The simple workaround to replication lag is to calculate the longest likely 
 lag (3 seconds if Slony is tuned right) and have the dispatcher (pgpool) 
 send all requests from that connection to the master for that period.   Then 
 it switches back to pool mode where the slaves may be used.

Can I ask a question?

Suppose table A gets updated on the master at time 00:00. Until 00:03
pgpool needs to send all queries regarding A to the master only. My
question is, how can pgpool know a query is related to A?
--
Tatsuo Ishii

 Of course, all of the above is only useful if you're doing a web app where 
 96% 
 of query activity is selects.   For additional scalability, put all of your 
 session maintenance in memcached, so that you're not doing database writes 
 every time a page loads.
 
 -- 
 Josh Berkus
 Aglio Database Solutions
 San Francisco
 

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-21 Thread Matt Clark

Presumably it can't _ever_ know without being explicitly told, because 
even for a plain SELECT there might be triggers involved that update 
tables, or it might be a select of a stored proc, etc.  So in the 
general case, you can't assume that a select doesn't cause an update, 
and you can't be sure that the table list in an update is a complete 
list of the tables that might be updated.


Tatsuo Ishii wrote:
Can I ask a question?
Suppose table A gets updated on the master at time 00:00. Until 00:03
pgpool needs to send all queries regarding A to the master only. My
question is, how can pgpool know a query is related to A?
--
Tatsuo Ishii
 

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-21 Thread Joshua D. Drake

Matt Clark wrote:
Presumably it can't _ever_ know without being explicitly told, because 
even for a plain SELECT there might be triggers involved that update 
tables, or it might be a select of a stored proc, etc.  So in the 
general case, you can't assume that a select doesn't cause an update, 
and you can't be sure that the table list in an update is a complete 
list of the tables that might be updated.
Uhmmm no :) There is no such thing as a select trigger. The closest you 
would get
is a function that is called via select which could be detected by 
making sure
you are prepending with a BEGIN or START Transaction. Thus yes pgPool 
can be made
to do this.

Sincerely,
Joshua D. Drake


Tatsuo Ishii wrote:
Can I ask a question?
Suppose table A gets updated on the master at time 00:00. Until 00:03
pgpool needs to send all queries regarding A to the master only. My
question is, how can pgpool know a query is related to A?
--
Tatsuo Ishii
 

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send unregister YourEmailAddressHere to [EMAIL PROTECTED])

--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - [EMAIL PROTECTED] - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL
begin:vcard
fn:Joshua Drake
n:Drake;Joshua
org:Command Prompt, Inc.
adr:;;PO Box 215 ;Cascade Locks;OR;97014;US
email;internet:[EMAIL PROTECTED]
title:Consultant
tel;work:503-667-4564
tel;fax:503-210-0334
x-mozilla-html:FALSE
url:http://www.commandprompt.com
version:2.1
end:vcard


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-21 Thread Bjoern Metzdorf

Joshua D. Drake wrote:
Matt Clark wrote:
Presumably it can't _ever_ know without being explicitly told, because 
even for a plain SELECT there might be triggers involved that update 
tables, or it might be a select of a stored proc, etc.  So in the 
general case, you can't assume that a select doesn't cause an update, 
and you can't be sure that the table list in an update is a complete 
list of the tables that might be updated.

Uhmmm no :) There is no such thing as a select trigger. The closest you 
would get
is a function that is called via select which could be detected by 
making sure
you are prepending with a BEGIN or START Transaction. Thus yes pgPool 
can be made
to do this.
SELECT SETVAL() is another case.
I'd really love to see pgpool do this.
I am also curious about Slony-II development, Tom mentioned a first 
meeting about it :)

Regards,
Bjoern
---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faq

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-21 Thread Matt Clark

Yes, I wasn't really choosing my examples particularly carefully, but I 
think the conclusion stands: pgpool (or anyone/thing except for the 
server) cannot in general tell from the SQL it is handed by the client 
whether an update will occur, nor which tables might be affected.

That's not to say that pgpool couldn't make a good guess in the majority 
of cases!

M
Joshua D. Drake wrote:
Matt Clark wrote:
Presumably it can't _ever_ know without being explicitly told, 
because even for a plain SELECT there might be triggers involved that 
update tables, or it might be a select of a stored proc, etc.  So in 
the general case, you can't assume that a select doesn't cause an 
update, and you can't be sure that the table list in an update is a 
complete list of the tables that might be updated.

Uhmmm no :) There is no such thing as a select trigger. The closest 
you would get
is a function that is called via select which could be detected by 
making sure
you are prepending with a BEGIN or START Transaction. Thus yes pgPool 
can be made
to do this.

Sincerely,
Joshua D. Drake

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-21 Thread Josh Berkus

Tatsuo,

 Suppose table A gets updated on the master at time 00:00. Until 00:03
 pgpool needs to send all queries regarding A to the master only. My
 question is, how can pgpool know a query is related to A?

Well, I'm a little late to head off tangental discussion about this, but 

The systems where I've implemented something similar are for web applications.  
In the case of the web app, you don't care if a most users see data which is 
2 seconds out of date; with caching and whatnot, it's often much more than 
that!

The one case where it's not permissable for a user to see old data is the 
case where the user is updating the data.   Namely:

(1) 00:00 User A updates My Profile
(2) 00:01 My Profile UPDATE finishes executing.
(3) 00:02  User A sees My Profile re-displayed
(6) 00:04  My Profile:UserA cascades to the last Slave server

So in an application like the above, it would be a real problem if User A were 
to get switched over to a slave server immediately after the update; she 
would see the old data, assume that her update was not saved, and update 
again.  Or send angry e-mails to [EMAIL PROTECTED]   

However, it makes no difference what User B sees:

(1) 00:00 User A updates My Profilev1 Master
(2) 00:01 My Profile UPDATE finishes executing.   Master
(3) 00:02  User A sees My Profilev2 displayed Master
(4) 00:02  User B requests MyProfile:UserASlave2
(5) 00:03  User B sees My Profilev1   Slave2
(6) 00:04  My Profilev2 cascades to the last Slave server  Slave2

If the web application is structured properly, the fact that UserB is seeing 
UserA's information which is 2 seconds old is not a problem (though it might 
be for web auctions, where it could result in race conditions.   Consider 
memcached as a helper).   This means that pgPool only needs to monitor 
update switching by *connection* not by *table*.

Make sense?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-21 Thread Marty Scholes

This is probably a lot easier than you would think.  You say that your 
DB will have lots of data, lots of updates and lots of reads.

Very likely the disk bottleneck is mostly index reads and writes, with 
some critical WAL fsync() calls.  In the grand scheme of things, the 
actual data is likely not accessed very often.

The indexes can be put on a RAM disk tablespace and that's the end of 
index problems -- just make sure you have enough memory available.  Also 
make sure that the machine can restart correctly after a crash: the 
tablespace is dropped and recreated, along with the indexes.  This will 
cause a machine restart to take some time.

After that, if the WAL fsync() calls are becoming a problem, put the WAL 
files on a fast RAID array, etiher a card or external enclosure, that 
has a good amount of battery-backed write cache.  This way, the WAL 
fsync() calls will flush quickly to the RAM and Pg can move on while the 
RAID controller worries about putting the data to disk.  With WAL, low 
access time is usually more important than total throughput.

The truth is that you could have this running for not much money.
Good Luck,
Marty
Le Jeudi 20 Janvier 2005 19:09, Bruno Almeida do Lago a écrit :
  Could you explain us what do you have in mind for that solution? I mean,
  forget the PostgreSQL (or any other database) restrictions and 
explain us
  how this hardware would be. Where the data would be stored?
 
  I've something in mind for you, but first I need to understand your 
needs!

I just want to make a big database as explained in my first mail ... At the
beginning we will have aprox. 150 000 000 records ... each month we will 
add
about 4/8 millions new rows in constant flow during the day ... and in same
time web users will access to the database in order to read those data.
Stored data are quite close to data stored by google ... (we are not 
making a
google clone ... just a lot of data many small values and some big ones ...
that's why I'm comparing with google for data storage).
Then we will have a search engine searching into those data ...

Dealing about the hardware, for the moment we have only a bi-pentium Xeon
2.8Ghz with 4 Gb of RAM ... and we saw we had bad performance results 
... so
we are thinking about a new solution with maybe several servers (server
design may vary from one to other) ... to get a kind of cluster to get 
better
performance ...

Am I clear ?
Regards,


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-21 Thread Merlin Moncure

 Technically, you can also set up a rule to do things on a select with
DO
 ALSO. However putting update statements in there would be considered
(at
 least by me) very bad form.  Note that this is not a trigger because
it
 does not operate at the row level [I know you knew that already :-)].
 
 
 
 Unfortunately, you can't.  Select operations only allow a single rule,
 and it must be a DO INSTEAD rule, unless this has changed in 8.0 and I
 missed it in the docs.  However, you can do this in a view by calling
a
 function either in the row definition or in the where clause.

You're right...forgot about that.  Heh, the do instead rule could be a
set returning function which could (besides returning the set) do almost
anything!  So in theory it makes no difference...diclaimer: never tried
doing this!

Merlin


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-21 Thread Peter Darley

Tatsuo,
What would happen with SELECT queries that, through a function or some
other mechanism, updates data in the database?  Would those need to be
passed to pgpool in some special way?
Thanks,
Peter Darley

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] Behalf Of Tatsuo Ishii
Sent: Thursday, January 20, 2005 5:40 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED];
pgsql-performance@postgresql.org
Subject: Re: [PERFORM] PostgreSQL clustering VS MySQL clustering


 On January 20, 2005 06:49 am, Joshua D. Drake wrote:
  Stephen Frost wrote:
  * Herv? Piedvache ([EMAIL PROTECTED]) wrote:
  Le Jeudi 20 Janvier 2005 15:30, Stephen Frost a écrit :
  * Herv? Piedvache ([EMAIL PROTECTED]) wrote:
  Is there any solution with PostgreSQL matching these needs ... ?
  
  You might look into pg_pool.  Another possibility would be slony,
though
  I'm not sure it's to the point you need it at yet, depends on if you
can
  handle some delay before an insert makes it to the slave select
systems.
  
  I think not ... pgpool or slony are replication solutions ... but as I
   have said to Christopher Kings-Lynne how I'll manage the scalabilty
of
   the database ? I'll need several servers able to load a database
growing
   and growing to get good speed performance ...
  
  They're both replication solutions, but they also help distribute the
  load.  For example:
  
  pg_pool will distribute the select queries amoung the servers.  They'll
  all get the inserts, so that hurts, but at least the select queries are
  distributed.
  
  slony is similar, but your application level does the load distribution
  of select statements instead of pg_pool.  Your application needs to
know
  to send insert statements to the 'main' server, and select from the
  others.
 
  You can put pgpool in front of replicator or slony to get load
  balancing for reads.

 Last time I checked load ballanced reads was only available in pgpool if
you
 were using pgpools's internal replication.  Has something changed
recently?

Yes. However it would be pretty easy to modify pgpool so that it could
cope with Slony-I. I.e.

1) pgpool does the load balance and sends query to Slony-I's slave and
   master if the query is SELECT.

2) pgpool sends query only to the master if the query is other than
   SELECT.

Remaining problem is that Slony-I is not a sync replication
solution. Thus you need to prepare that the load balanced query
results might differ among servers.

If there's enough demand, I would do such that enhancements to pgpool.
--
Tatsuo Ishii

  Is there any other solution than a Cluster for our problem ?
  
  Bigger server, more CPUs/disks in one box.  Try to partition up your
  data some way such that it can be spread across multiple machines,
then
  if you need to combine the data have it be replicated using slony to
a
  big box that has a view which joins all the tables and do your big
  queries against that.
  
  But I'll arrive to limitation of a box size quickly I thing a 4
   processors with 64 Gb of RAM ... and after ?
 
  Opteron.

 IBM Z-series, or other big iron.

 
  Go to non-x86 hardware after if you're going to continue to increase
the
  size of the server.  Personally I think your better bet might be to
  figure out a way to partition up your data (isn't that what google
  does anyway?).
  
 Stephen

 --
 Darcy Buskermolen
 Wavefire Technologies Corp.
 ph: 250.717.0200
 fx:  250.763.1759
 http://www.wavefire.com

 ---(end of broadcast)---
 TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-21 Thread Josh Berkus

Peter, Tatsuo:

would happen with SELECT queries that, through a function or some
 other mechanism, updates data in the database?  Would those need to be
 passed to pgpool in some special way?

Oh, yes, that reminds me.  It would be helpful if pgPool accepted a control 
string ... perhaps one in a SQL comment ... which indicated that the 
statement to follow was, despite appearances, an update.  For example:
--STATEMENT_IS_UPDATE\n

The alternative is, of course, that pgPool direct all explicit transactions to 
the master ... which is a good idea anyway.So you could do:

BEGIN;
SELECT some_update_function();
COMMIT;

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-21 Thread Tatsuo Ishii

 Tatsuo,
 
  Suppose table A gets updated on the master at time 00:00. Until 00:03
  pgpool needs to send all queries regarding A to the master only. My
  question is, how can pgpool know a query is related to A?
 
 Well, I'm a little late to head off tangental discussion about this, but 
 
 The systems where I've implemented something similar are for web 
 applications.  
 In the case of the web app, you don't care if a most users see data which is 
 2 seconds out of date; with caching and whatnot, it's often much more than 
 that!
 
 The one case where it's not permissable for a user to see old data is the 
 case where the user is updating the data.   Namely:
 
 (1) 00:00 User A updates My Profile
 (2) 00:01 My Profile UPDATE finishes executing.
 (3) 00:02  User A sees My Profile re-displayed
 (6) 00:04  My Profile:UserA cascades to the last Slave server
 
 So in an application like the above, it would be a real problem if User A 
 were 
 to get switched over to a slave server immediately after the update; she 
 would see the old data, assume that her update was not saved, and update 
 again.  Or send angry e-mails to [EMAIL PROTECTED]   
 
 However, it makes no difference what User B sees:
 
 (1) 00:00 User A updates My Profilev1   Master
 (2) 00:01 My Profile UPDATE finishes executing. Master
 (3) 00:02  User A sees My Profilev2 displayed   Master
 (4) 00:02  User B requests MyProfile:UserA  Slave2
 (5) 00:03  User B sees My Profilev1 Slave2
 (6) 00:04  My Profilev2 cascades to the last Slave server  Slave2
 
 If the web application is structured properly, the fact that UserB is seeing 
 UserA's information which is 2 seconds old is not a problem (though it might 
 be for web auctions, where it could result in race conditions.   Consider 
 memcached as a helper).   This means that pgPool only needs to monitor 
 update switching by *connection* not by *table*.
 
 Make sense?

I'm not clear what pgPool only needs to monitor update switching by
*connection* not by *table* means. In your example:

 (1) 00:00 User A updates My Profile
 (2) 00:01 My Profile UPDATE finishes executing.
 (3) 00:02  User A sees My Profile re-displayed
 (6) 00:04  My Profile:UserA cascades to the last Slave server

I think (2) and (3) are on different connections, thus pgpool cannot
judge if SELECT in (3) should go only to the master or not.

To solve the problem you need to make pgpool understand web sessions
not database connections and it seems impossible for pgpool to
understand sessions.
--
Tatsuo Ishii

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-21 Thread Tatsuo Ishii

IMO the bottle neck is not WAL but table/index bloat. Lots of updates
on large tables will produce lots of dead tuples. Problem is, There'
is no effective way to reuse these dead tuples since VACUUM on huge
tables takes longer time. 8.0 adds new vacuum delay
paramters. Unfortunately this does not help. It just make the
execution time of VACUUM longer, that means more and more dead tuples
are being made while updating.

Probably VACUUM works well for small to medium size tables, but not
for huge ones. I'm considering about to implement on the spot
salvaging dead tuples.
--
Tatsuo Ishii

 This is probably a lot easier than you would think.  You say that your 
 DB will have lots of data, lots of updates and lots of reads.
 
 Very likely the disk bottleneck is mostly index reads and writes, with 
 some critical WAL fsync() calls.  In the grand scheme of things, the 
 actual data is likely not accessed very often.
 
 The indexes can be put on a RAM disk tablespace and that's the end of 
 index problems -- just make sure you have enough memory available.  Also 
 make sure that the machine can restart correctly after a crash: the 
 tablespace is dropped and recreated, along with the indexes.  This will 
 cause a machine restart to take some time.
 
 After that, if the WAL fsync() calls are becoming a problem, put the WAL 
 files on a fast RAID array, etiher a card or external enclosure, that 
 has a good amount of battery-backed write cache.  This way, the WAL 
 fsync() calls will flush quickly to the RAM and Pg can move on while the 
 RAID controller worries about putting the data to disk.  With WAL, low 
 access time is usually more important than total throughput.
 
 The truth is that you could have this running for not much money.
 
 Good Luck,
 Marty
 
  Le Jeudi 20 Janvier 2005 19:09, Bruno Almeida do Lago a écrit :
Could you explain us what do you have in mind for that solution? I mean,
forget the PostgreSQL (or any other database) restrictions and 
  explain us
how this hardware would be. Where the data would be stored?
   
I've something in mind for you, but first I need to understand your 
  needs!
  
  I just want to make a big database as explained in my first mail ... At the
  beginning we will have aprox. 150 000 000 records ... each month we will 
  add
  about 4/8 millions new rows in constant flow during the day ... and in same
  time web users will access to the database in order to read those data.
  Stored data are quite close to data stored by google ... (we are not 
  making a
  google clone ... just a lot of data many small values and some big ones ...
  that's why I'm comparing with google for data storage).
  Then we will have a search engine searching into those data ...
  
  Dealing about the hardware, for the moment we have only a bi-pentium Xeon
  2.8Ghz with 4 Gb of RAM ... and we saw we had bad performance results 
  ... so
  we are thinking about a new solution with maybe several servers (server
  design may vary from one to other) ... to get a kind of cluster to get 
  better
  performance ...
  
  Am I clear ?
  
  Regards,
 
 
 
 
 ---(end of broadcast)---
 TIP 2: you can get off all lists at once with the unregister command
 (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
 

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-21 Thread Tatsuo Ishii

 Peter, Tatsuo:
 
 would happen with SELECT queries that, through a function or some
  other mechanism, updates data in the database?  Would those need to be
  passed to pgpool in some special way?
 
 Oh, yes, that reminds me.  It would be helpful if pgPool accepted a control 
 string ... perhaps one in a SQL comment ... which indicated that the 
 statement to follow was, despite appearances, an update.  For example:
 --STATEMENT_IS_UPDATE\n

Actually the way judging if it's a pure SELECT or not in pgpool is
very simple. pgpool just checkes if the SQL statement exactly begins
with SELECT (case insensitive, of course). So, for example, you
could insert an SQL comment something like /*this SELECT has side
effect*/ at the beginning of line to indicate that pgpool should not
send this query to the slave.

 The alternative is, of course, that pgPool direct all explicit transactions 
 to 
 the master ... which is a good idea anyway.So you could do:
 
 BEGIN;
 SELECT some_update_function();
 COMMIT;

Yes. pgpool has already done this in load balancing. Expanding this
for Slony-I is pretty easy.
--
Tatsuo Ishii

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

[PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Herv Piedvache

Dear community,

My company, which I actually represent, is a fervent user of PostgreSQL.
We used to make all our applications using PostgreSQL for more than 5 years.
We usually do classical client/server applications under Linux, and Web 
interface (php, perl, C/C++). We used to manage also public web services with 
10/15 millions records and up to 8 millions pages view by month.

Now we are in front of a new need, but we do not find any good solution with 
PostgreSQL.
We need to make a sort of directory of millions of data growing about 4/8 
millions per month, and to be able to be used by many users from the web. In 
order to do this, our solution need to be able to run perfectly with many 
insert and many select access (done before each insert, and done by web site 
visitors). We will also need to make a search engine for the millions of data 
(140/150 millions records at the immediate beginning) ... No it's not google, 
but the kind of volume of data stored in the main table is similar.

Then ... we have made some tests, with the actual servers we have here, like a 
Bi-Pro Xeon 2.8 Ghz, with 4 Gb of RAM and the result of the cumulative 
inserts, and select access is slowing down the service really quickly ... 
(Load average is going up to 10 really quickly on the database).

We were at this moment thinking about a Cluster solution ... We saw on the 
Internet many solution talking about Cluster solution using MySQL ... but 
nothing about PostgreSQL ... the idea is to use several servers to make a 
sort of big virtual server using the disk space of each server as one, and 
having the ability to use the CPU and RAM of each servers in order to 
maintain good service performance ...one can imagin it is like a GFS but 
dedicated to postgreSQL...

Is there any solution with PostgreSQL matching these needs ... ?
Do we have to backport our development to MySQL for this kind of problem ?
Is there any other solution than a Cluster for our problem ?

Looking for your reply,

Regards,
-- 
Hervé

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Jean-Max Reymond

On Thu, 20 Jan 2005 15:03:31 +0100, Hervé Piedvache [EMAIL PROTECTED] wrote:

 We were at this moment thinking about a Cluster solution ... We saw on the
 Internet many solution talking about Cluster solution using MySQL ... but
 nothing about PostgreSQL ... the idea is to use several servers to make a
 sort of big virtual server using the disk space of each server as one, and
 having the ability to use the CPU and RAM of each servers in order to
 maintain good service performance ...one can imagin it is like a GFS but
 dedicated to postgreSQL...
 

forget mysql cluster for now.
We have a small database which size is 500 Mb.
It is not possible to load these base in a computer with 2 Mb of RAM
and loading the base in RAM is required.
So, we shrink the database and it is ok with 350 Mb to fit in the 2 Gb RAM.
First tests of performance on a basic request: 500x slower, yes 500x.
This issue is reported to mysql team  but no answer (and correction)

Actually, the solution is running with a replication database: 1 node
for write request and all the other nodes for read requests and the
load balancer is made with round robin solution.


-- 
Jean-Max Reymond
CKR Solutions
Nice France
http://www.ckr-solutions.com

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Christopher Kings-Lynne

Is there any solution with PostgreSQL matching these needs ... ?
You want: http://www.slony.info/
Do we have to backport our development to MySQL for this kind of problem ?
Is there any other solution than a Cluster for our problem ?
Well, Slony does replication which is basically what you want :)
Only master-slave though, so you will need to have all inserts go via 
the master server, but selects can come off any server.

Chris
---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faqs/FAQ.html

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Stephen Frost

* Herv? Piedvache ([EMAIL PROTECTED]) wrote:
 Is there any solution with PostgreSQL matching these needs ... ?

You might look into pg_pool.  Another possibility would be slony, though
I'm not sure it's to the point you need it at yet, depends on if you can
handle some delay before an insert makes it to the slave select systems.

 Do we have to backport our development to MySQL for this kind of problem ?

Well, hopefully not. :)

 Is there any other solution than a Cluster for our problem ?

Bigger server, more CPUs/disks in one box.  Try to partition up your
data some way such that it can be spread across multiple machines, then
if you need to combine the data have it be replicated using slony to a
big box that has a view which joins all the tables and do your big
queries against that.

Just some thoughts.

Stephen


signature.asc
Description: Digital signature

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Herv Piedvache

Le Jeudi 20 Janvier 2005 15:24, Christopher Kings-Lynne a écrit :
  Is there any solution with PostgreSQL matching these needs ... ?

 You want: http://www.slony.info/

  Do we have to backport our development to MySQL for this kind of problem
  ? Is there any other solution than a Cluster for our problem ?

 Well, Slony does replication which is basically what you want :)

 Only master-slave though, so you will need to have all inserts go via
 the master server, but selects can come off any server.

Sorry but I don't agree with this ... Slony is a replication solution ... I 
don't need replication ... what will I do when my database will grow up to 50 
Gb ... I'll need more than 50 Gb of RAM on each server ???
This solution is not very realistic for me ...

I need a Cluster solution not a replication one or explain me in details how I 
will do for managing the scalabilty of my database ...

regards,
-- 
Hervé

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Herv Piedvache

Le Jeudi 20 Janvier 2005 15:38, Christopher Kings-Lynne a écrit :
  Sorry but I don't agree with this ... Slony is a replication solution ...
  I don't need replication ... what will I do when my database will grow up
  to 50 Gb ... I'll need more than 50 Gb of RAM on each server ???
  This solution is not very realistic for me ...
 
  I need a Cluster solution not a replication one or explain me in details
  how I will do for managing the scalabilty of my database ...

 Buy Oracle

I think this is not my solution ... sorry I'm talking about finding a 
PostgreSQL solution ... 
-- 
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Herv Piedvache

Le Jeudi 20 Janvier 2005 15:30, Stephen Frost a écrit :
 * Herv? Piedvache ([EMAIL PROTECTED]) wrote:
  Is there any solution with PostgreSQL matching these needs ... ?

 You might look into pg_pool.  Another possibility would be slony, though
 I'm not sure it's to the point you need it at yet, depends on if you can
 handle some delay before an insert makes it to the slave select systems.

I think not ... pgpool or slony are replication solutions ... but as I have 
said to Christopher Kings-Lynne how I'll manage the scalabilty of the 
database ? I'll need several servers able to load a database growing and 
growing to get good speed performance ...

  Do we have to backport our development to MySQL for this kind of problem
  ?

 Well, hopefully not. :)

I hope so ;o)

  Is there any other solution than a Cluster for our problem ?

 Bigger server, more CPUs/disks in one box.  Try to partition up your
 data some way such that it can be spread across multiple machines, then
 if you need to combine the data have it be replicated using slony to a
 big box that has a view which joins all the tables and do your big
 queries against that.

But I'll arrive to limitation of a box size quickly I thing a 4 processors 
with 64 Gb of RAM ... and after ?

regards,
-- 
Hervé

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Christopher Kings-Lynne

Sorry but I don't agree with this ... Slony is a replication solution ... I 
don't need replication ... what will I do when my database will grow up to 50 
Gb ... I'll need more than 50 Gb of RAM on each server ???
This solution is not very realistic for me ...

I need a Cluster solution not a replication one or explain me in details how I 
will do for managing the scalabilty of my database ...
Buy Oracle
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Stephen Frost

* Herv? Piedvache ([EMAIL PROTECTED]) wrote:
 Le Jeudi 20 Janvier 2005 15:30, Stephen Frost a écrit :
  * Herv? Piedvache ([EMAIL PROTECTED]) wrote:
   Is there any solution with PostgreSQL matching these needs ... ?
 
  You might look into pg_pool.  Another possibility would be slony, though
  I'm not sure it's to the point you need it at yet, depends on if you can
  handle some delay before an insert makes it to the slave select systems.
 
 I think not ... pgpool or slony are replication solutions ... but as I have 
 said to Christopher Kings-Lynne how I'll manage the scalabilty of the 
 database ? I'll need several servers able to load a database growing and 
 growing to get good speed performance ...

They're both replication solutions, but they also help distribute the
load.  For example:

pg_pool will distribute the select queries amoung the servers.  They'll
all get the inserts, so that hurts, but at least the select queries are
distributed.

slony is similar, but your application level does the load distribution
of select statements instead of pg_pool.  Your application needs to know
to send insert statements to the 'main' server, and select from the
others.

   Is there any other solution than a Cluster for our problem ?
 
  Bigger server, more CPUs/disks in one box.  Try to partition up your
  data some way such that it can be spread across multiple machines, then
  if you need to combine the data have it be replicated using slony to a
  big box that has a view which joins all the tables and do your big
  queries against that.
 
 But I'll arrive to limitation of a box size quickly I thing a 4 processors 
 with 64 Gb of RAM ... and after ?

Go to non-x86 hardware after if you're going to continue to increase the
size of the server.  Personally I think your better bet might be to
figure out a way to partition up your data (isn't that what google
does anyway?).

Stephen


signature.asc
Description: Digital signature

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Joshua D. Drake

Hervé Piedvache wrote:
Dear community,
My company, which I actually represent, is a fervent user of PostgreSQL.
We used to make all our applications using PostgreSQL for more than 5 years.
We usually do classical client/server applications under Linux, and Web 
interface (php, perl, C/C++). We used to manage also public web services with 
10/15 millions records and up to 8 millions pages view by month.
 

Depending on your needs either:
Slony: www.slony.info
or
Replicator: www.commandprompt.com
Will both do what you want. Replicator is easier to setup but
Slony is free.
Sincerely,
Joshua D. Drake

Now we are in front of a new need, but we do not find any good solution with 
PostgreSQL.
We need to make a sort of directory of millions of data growing about 4/8 
millions per month, and to be able to be used by many users from the web. In 
order to do this, our solution need to be able to run perfectly with many 
insert and many select access (done before each insert, and done by web site 
visitors). We will also need to make a search engine for the millions of data 
(140/150 millions records at the immediate beginning) ... No it's not google, 
but the kind of volume of data stored in the main table is similar.

Then ... we have made some tests, with the actual servers we have here, like a 
Bi-Pro Xeon 2.8 Ghz, with 4 Gb of RAM and the result of the cumulative 
inserts, and select access is slowing down the service really quickly ... 
(Load average is going up to 10 really quickly on the database).

We were at this moment thinking about a Cluster solution ... We saw on the 
Internet many solution talking about Cluster solution using MySQL ... but 
nothing about PostgreSQL ... the idea is to use several servers to make a 
sort of big virtual server using the disk space of each server as one, and 
having the ability to use the CPU and RAM of each servers in order to 
maintain good service performance ...one can imagin it is like a GFS but 
dedicated to postgreSQL...

Is there any solution with PostgreSQL matching these needs ... ?
Do we have to backport our development to MySQL for this kind of problem ?
Is there any other solution than a Cluster for our problem ?
Looking for your reply,
Regards,
 


--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - [EMAIL PROTECTED] - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL
begin:vcard
fn:Joshua Drake
n:Drake;Joshua
org:Command Prompt, Inc.
adr:;;PO Box 215 ;Cascade Locks;OR;97014;US
email;internet:[EMAIL PROTECTED]
title:Consultant
tel;work:503-667-4564
tel;fax:503-210-0334
x-mozilla-html:FALSE
url:http://www.commandprompt.com
version:2.1
end:vcard


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Jeff

On Jan 20, 2005, at 9:36 AM, Hervé Piedvache wrote:
Sorry but I don't agree with this ... Slony is a replication solution 
... I
don't need replication ... what will I do when my database will grow 
up to 50
Gb ... I'll need more than 50 Gb of RAM on each server ???
Slony doesn't use much ram. The mysql clustering product, ndb I believe 
it is called, requires all data fit in RAM. (At least, it used to).  
What you'll need is disk space.

As for a cluster I think you are thinking of multi-master replication.
You should look into what others have said about trying to partiition 
data among several boxes and then join the results together.

Or you could fork over  hundreds of thousands of dollars for Oracle's 
RAC.

--
Jeff Trout [EMAIL PROTECTED]
http://www.jefftrout.com/
http://www.stuarthamm.net/
---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Joshua D. Drake

Stephen Frost wrote:
* Herv? Piedvache ([EMAIL PROTECTED]) wrote:
 

Le Jeudi 20 Janvier 2005 15:30, Stephen Frost a écrit :
   

* Herv? Piedvache ([EMAIL PROTECTED]) wrote:
 

Is there any solution with PostgreSQL matching these needs ... ?
   

You might look into pg_pool.  Another possibility would be slony, though
I'm not sure it's to the point you need it at yet, depends on if you can
handle some delay before an insert makes it to the slave select systems.
 

I think not ... pgpool or slony are replication solutions ... but as I have 
said to Christopher Kings-Lynne how I'll manage the scalabilty of the 
database ? I'll need several servers able to load a database growing and 
growing to get good speed performance ...
   

They're both replication solutions, but they also help distribute the
load.  For example:
pg_pool will distribute the select queries amoung the servers.  They'll
all get the inserts, so that hurts, but at least the select queries are
distributed.
slony is similar, but your application level does the load distribution
of select statements instead of pg_pool.  Your application needs to know
to send insert statements to the 'main' server, and select from the
others.
 

You can put pgpool in front of replicator or slony to get load
balancing for reads.
 

Is there any other solution than a Cluster for our problem ?
   

Bigger server, more CPUs/disks in one box.  Try to partition up your
data some way such that it can be spread across multiple machines, then
if you need to combine the data have it be replicated using slony to a
big box that has a view which joins all the tables and do your big
queries against that.
 

But I'll arrive to limitation of a box size quickly I thing a 4 processors 
with 64 Gb of RAM ... and after ?
   

Opteron.

Go to non-x86 hardware after if you're going to continue to increase the
size of the server.  Personally I think your better bet might be to
figure out a way to partition up your data (isn't that what google
does anyway?).
	Stephen
 


--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - [EMAIL PROTECTED] - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL
begin:vcard
fn:Joshua Drake
n:Drake;Joshua
org:Command Prompt, Inc.
adr:;;PO Box 215 ;Cascade Locks;OR;97014;US
email;internet:[EMAIL PROTECTED]
title:Consultant
tel;work:503-667-4564
tel;fax:503-210-0334
x-mozilla-html:FALSE
url:http://www.commandprompt.com
version:2.1
end:vcard


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Christopher Kings-Lynne

Sorry but I don't agree with this ... Slony is a replication solution ...
I don't need replication ... what will I do when my database will grow up
to 50 Gb ... I'll need more than 50 Gb of RAM on each server ???
This solution is not very realistic for me ...
I need a Cluster solution not a replication one or explain me in details
how I will do for managing the scalabilty of my database ...
Buy Oracle

I think this is not my solution ... sorry I'm talking about finding a 
PostgreSQL solution ... 
My point being is that there is no free solution.  There simply isn't. 
I don't know why you insist on keeping all your data in RAM, but the 
mysql cluster requires that ALL data MUST fit in RAM all the time.

PostgreSQL has replication, but not partitioning (which is what you want).
So, your only option is Oracle or another very expensive commercial 
database.

Chris
---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Herv Piedvache

Le Jeudi 20 Janvier 2005 15:48, Jeff a écrit :
 On Jan 20, 2005, at 9:36 AM, Hervé Piedvache wrote:
  Sorry but I don't agree with this ... Slony is a replication solution
  ... I
  don't need replication ... what will I do when my database will grow
  up to 50
  Gb ... I'll need more than 50 Gb of RAM on each server ???

 Slony doesn't use much ram. The mysql clustering product, ndb I believe
 it is called, requires all data fit in RAM. (At least, it used to).
 What you'll need is disk space.

Slony do not use RAM ... but PostgreSQL will need RAM for accessing a database 
of 50 Gb ... so having two servers with the same configuration replicated by 
slony do not slove the problem of the scalability of the database ...

 As for a cluster I think you are thinking of multi-master replication.

No I'm really thinking about a Cluster solution ... having several servers 
making one big virtual server to have several processors, and many RAM in 
many boxes ...

 You should look into what others have said about trying to partiition
 data among several boxes and then join the results together.

??? Who talk about this ?

 Or you could fork over  hundreds of thousands of dollars for Oracle's
 RAC.

No please do not talk about this again ... I'm looking about a PostgreSQL 
solution ... I know RAC ... and I'm not able to pay for a RAC certify 
hardware configuration plus a RAC Licence.

Regards,
-- 
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Herv Piedvache

Joshua,

Le Jeudi 20 Janvier 2005 15:44, Joshua D. Drake a écrit :
 Hervé Piedvache wrote:
 
 My company, which I actually represent, is a fervent user of PostgreSQL.
 We used to make all our applications using PostgreSQL for more than 5
  years. We usually do classical client/server applications under Linux,
  and Web interface (php, perl, C/C++). We used to manage also public web
  services with 10/15 millions records and up to 8 millions pages view by
  month.

 Depending on your needs either:

 Slony: www.slony.info

 or

 Replicator: www.commandprompt.com

 Will both do what you want. Replicator is easier to setup but
 Slony is free.

No ... as I have said ... how I'll manage a database getting a table of may be 
250 000 000 records ? I'll need incredible servers ... to get quick access or 
index reading ... no ?

So what we would like to get is a pool of small servers able to make one 
virtual server ... for that is called a Cluster ... no ?

I know they are not using PostgreSQL ... but how a company like Google do to 
get an incredible database in size and so quick access ?

regards,
-- 
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Herv Piedvache

Le Jeudi 20 Janvier 2005 15:51, Christopher Kings-Lynne a écrit :
 Sorry but I don't agree with this ... Slony is a replication solution
  ... I don't need replication ... what will I do when my database will
  grow up to 50 Gb ... I'll need more than 50 Gb of RAM on each server
  ??? This solution is not very realistic for me ...
 
 I need a Cluster solution not a replication one or explain me in details
 how I will do for managing the scalabilty of my database ...
 
 Buy Oracle
 
  I think this is not my solution ... sorry I'm talking about finding a
  PostgreSQL solution ...

 My point being is that there is no free solution.  There simply isn't.
 I don't know why you insist on keeping all your data in RAM, but the
 mysql cluster requires that ALL data MUST fit in RAM all the time.

I don't insist about have data in RAM  but when you use PostgreSQL with 
big database you know that for quick access just for reading the index file 
for example it's better to have many RAM as possible ... I just want to be 
able to get a quick access with a growing and growind database ...

 PostgreSQL has replication, but not partitioning (which is what you want).

:o(

 So, your only option is Oracle or another very expensive commercial
 database.

That's not a good news ...
-- 
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Joshua D. Drake


No please do not talk about this again ... I'm looking about a PostgreSQL 
solution ... I know RAC ... and I'm not able to pay for a RAC certify 
hardware configuration plus a RAC Licence.
 

What you want does not exist for PostgreSQL. You will either
have to build it yourself or pay somebody to build it for you.
Sincerely,
Joshua D. Drake

Regards,
 


--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - [EMAIL PROTECTED] - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL
begin:vcard
fn:Joshua Drake
n:Drake;Joshua
org:Command Prompt, Inc.
adr:;;PO Box 215 ;Cascade Locks;OR;97014;US
email;internet:[EMAIL PROTECTED]
title:Consultant
tel;work:503-667-4564
tel;fax:503-210-0334
x-mozilla-html:FALSE
url:http://www.commandprompt.com
version:2.1
end:vcard


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Stephen Frost

* Christopher Kings-Lynne ([EMAIL PROTECTED]) wrote:
 PostgreSQL has replication, but not partitioning (which is what you want).

It doesn't have multi-server partitioning..  It's got partitioning
within a single server (doesn't it?  I thought it did, I know it was
discussed w/ the guy from Cox Communications and I thought he was using
it :).

 So, your only option is Oracle or another very expensive commercial 
 database.

Or partition the data at the application layer.

Stephen


signature.asc
Description: Digital signature

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Joshua D. Drake


So what we would like to get is a pool of small servers able to make one 
virtual server ... for that is called a Cluster ... no ?

I know they are not using PostgreSQL ... but how a company like Google do to 
get an incredible database in size and so quick access ?
 

You could use dblink with multiple servers across data partitions
within PostgreSQL but I don't know how fast that would be.
J

regards,
 


--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - [EMAIL PROTECTED] - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL
begin:vcard
fn:Joshua Drake
n:Drake;Joshua
org:Command Prompt, Inc.
adr:;;PO Box 215 ;Cascade Locks;OR;97014;US
email;internet:[EMAIL PROTECTED]
title:Consultant
tel;work:503-667-4564
tel;fax:503-210-0334
x-mozilla-html:FALSE
url:http://www.commandprompt.com
version:2.1
end:vcard


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Joshua D. Drake

Christopher Kings-Lynne wrote:
Or you could fork over  hundreds of thousands of dollars for Oracle's
RAC.

No please do not talk about this again ... I'm looking about a 
PostgreSQL solution ... I know RAC ... and I'm not able to pay for a 
RAC certify hardware configuration plus a RAC Licence.

There is absolutely zero PostgreSQL solution...

I just replied the same thing but then I was thinking. Couldn't he use 
multiple databases
over multiple servers with dblink?

It is not exactly how I would want to do it, but it would provide what 
he needs I think???

Sincerely,
Joshua D. Drake

You may have to split the data yourself onto two independent db 
servers and combine the results somehow in your application.

Chris
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - [EMAIL PROTECTED] - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL
begin:vcard
fn:Joshua Drake
n:Drake;Joshua
org:Command Prompt, Inc.
adr:;;PO Box 215 ;Cascade Locks;OR;97014;US
email;internet:[EMAIL PROTECTED]
title:Consultant
tel;work:503-667-4564
tel;fax:503-210-0334
x-mozilla-html:FALSE
url:http://www.commandprompt.com
version:2.1
end:vcard


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Herv Piedvache

Le Jeudi 20 Janvier 2005 16:05, Joshua D. Drake a écrit :
 Christopher Kings-Lynne wrote:
  Or you could fork over  hundreds of thousands of dollars for Oracle's
  RAC.
 
  No please do not talk about this again ... I'm looking about a
  PostgreSQL solution ... I know RAC ... and I'm not able to pay for a
  RAC certify hardware configuration plus a RAC Licence.
 
  There is absolutely zero PostgreSQL solution...

 I just replied the same thing but then I was thinking. Couldn't he use
 multiple databases
 over multiple servers with dblink?

 It is not exactly how I would want to do it, but it would provide what
 he needs I think???

Yes seems to be the only solution ... but I'm a little disapointed about 
this ... could you explain me why there is not this kind of 
functionnality ... it seems to be a real need for big applications no ?

Thanks all for your answers ...
-- 
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Stephen Frost

* Herv? Piedvache ([EMAIL PROTECTED]) wrote:
 I know they are not using PostgreSQL ... but how a company like Google do to 
 get an incredible database in size and so quick access ?

They segment their data across multiple machines and have an algorithm
which tells the application layer which machine to contact for what
data.

Stephen


signature.asc
Description: Digital signature

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Steve Wampler

Hervé Piedvache wrote:
No ... as I have said ... how I'll manage a database getting a table of may be 
250 000 000 records ? I'll need incredible servers ... to get quick access or 
index reading ... no ?

So what we would like to get is a pool of small servers able to make one 
virtual server ... for that is called a Cluster ... no ?

I know they are not using PostgreSQL ... but how a company like Google do to 
get an incredible database in size and so quick access ?
Probably by carefully partitioning their data.  I can't imagine anything
being fast on a single table in 250,000,000 tuple range.  Nor can I
really imagine any database that efficiently splits a single table
across multiple machines (or even inefficiently unless some internal
partitioning is being done).
So, you'll have to do some work at your end and not just hope that
a magic bullet is available.
Once you've got the data partitioned, the question becomes one of
how to inhance performance/scalability.  Have you considered RAIDb?
--
Steve Wampler -- [EMAIL PROTECTED]
The gods that smiled on your birth are now laughing out loud.
---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Joshua D. Drake


then I was thinking. Couldn't he use
multiple databases
over multiple servers with dblink?
It is not exactly how I would want to do it, but it would provide what
he needs I think???
   

Yes seems to be the only solution ... but I'm a little disapointed about 
this ... could you explain me why there is not this kind of 
functionnality ... it seems to be a real need for big applications no ?
 

Because it is really, really hard to do correctly and hard
equals expensive.
Sincerely,
Joshua D. Drake

Thanks all for your answers ...
 


--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - [EMAIL PROTECTED] - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL
begin:vcard
fn:Joshua Drake
n:Drake;Joshua
org:Command Prompt, Inc.
adr:;;PO Box 215 ;Cascade Locks;OR;97014;US
email;internet:[EMAIL PROTECTED]
title:Consultant
tel;work:503-667-4564
tel;fax:503-210-0334
x-mozilla-html:FALSE
url:http://www.commandprompt.com
version:2.1
end:vcard


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Merlin Moncure

 No please do not talk about this again ... I'm looking about a PostgreSQL
 solution ... I know RAC ... and I'm not able to pay for a RAC certify
 hardware configuration plus a RAC Licence.

Are you totally certain you can't solve your problem with a single server 
solution?

How about:
Price out a 4 way Opteron 4u rackmount server with 64 bit linux, stuffed with 
hard drives (like 40) set up in a complex raid configuration (multiple raid 
controllers) allowing you (with tablespaces) to divide up your database.

You can drop in dual core opterons at some later point for an easy upgrade.  
Let's say this server costs 20k$...are you sure this will not be enough to 
handle your load?

Merlin

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Dave Cramer





Google uses something called the google filesystem, look it up in
google. It is a distributed file system.

Dave

Herv Piedvache wrote:

  Joshua,

Le Jeudi 20 Janvier 2005 15:44, Joshua D. Drake a crit :
  
  
Herv Piedvache wrote:


  My company, which I actually represent, is a fervent user of PostgreSQL.
We used to make all our applications using PostgreSQL for more than 5
years. We usually do classical client/server applications under Linux,
and Web interface (php, perl, C/C++). We used to manage also public web
services with 10/15 millions records and up to 8 millions pages view by
month.
  

Depending on your needs either:

Slony: www.slony.info

or

Replicator: www.commandprompt.com

Will both do what you want. Replicator is easier to setup but
Slony is free.

  
  
No ... as I have said ... how I'll manage a database getting a table of may be 
250 000 000 records ? I'll need incredible servers ... to get quick access or 
index reading ... no ?

So what we would like to get is a pool of small servers able to make one 
virtual server ... for that is called a Cluster ... no ?

I know they are not using PostgreSQL ... but how a company like Google do to 
get an incredible database in size and so quick access ?

regards,
  


-- 
Dave Cramer
http://www.postgresintl.com
519 939 0336
ICQ#14675561

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Herv Piedvache

Le Jeudi 20 Janvier 2005 16:14, Steve Wampler a écrit :
 Once you've got the data partitioned, the question becomes one of
 how to inhance performance/scalability.  Have you considered RAIDb?

No but I'll seems to be very interesting ... close to the explanation of 
Joshua ... but automaticly done ...

Thanks !
-- 
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Herv Piedvache

Le Jeudi 20 Janvier 2005 16:23, Dave Cramer a écrit :
 Google uses something called the google filesystem, look it up in
 google. It is a distributed file system.

Yes that's another point I'm working on ... make a cluster of server using 
GFS ... and making PostgreSQL running with it ...

But I have not finished my test ... and may be people could have experience 
with this ...

Regards,
-- 
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Herv Piedvache

Le Jeudi 20 Janvier 2005 16:16, Merlin Moncure a écrit :
  No please do not talk about this again ... I'm looking about a PostgreSQL
  solution ... I know RAC ... and I'm not able to pay for a RAC certify
  hardware configuration plus a RAC Licence.

 Are you totally certain you can't solve your problem with a single server
 solution?

 How about:
 Price out a 4 way Opteron 4u rackmount server with 64 bit linux, stuffed
 with hard drives (like 40) set up in a complex raid configuration (multiple
 raid controllers) allowing you (with tablespaces) to divide up your
 database.

 You can drop in dual core opterons at some later point for an easy upgrade.
  Let's say this server costs 20k$...are you sure this will not be enough to
 handle your load?

I'm not as I said ibn my mail I want to do a Cluster of servers ... :o)
-- 
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Steve Wampler

Hervé Piedvache wrote:
Le Jeudi 20 Janvier 2005 16:23, Dave Cramer a écrit :
Google uses something called the google filesystem, look it up in
google. It is a distributed file system.

Yes that's another point I'm working on ... make a cluster of server using 
GFS ... and making PostgreSQL running with it ...
A few years ago I played around with GFS, but not for postgresql.
I don't think it's going to help - logically there's no difference
between putting PG on GFS and putting PG on NFS - in both cases
the filesystem doesn't provide any support for distributing the
task at hand - and a PG database server isn't written to be
distributed across hosts regardless of the distribution of the
data across filesystems.

--
Steve Wampler -- [EMAIL PROTECTED]
The gods that smiled on your birth are now laughing out loud.
---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Christopher Kings-Lynne

Probably by carefully partitioning their data.  I can't imagine anything
being fast on a single table in 250,000,000 tuple range.  Nor can I
really imagine any database that efficiently splits a single table
across multiple machines (or even inefficiently unless some internal
partitioning is being done).
Ah, what about partial indexes - those might help.  As a kind of 
'semi-partition'.

Chris
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Joshua D. Drake

Christopher Kings-Lynne wrote:
Probably by carefully partitioning their data.  I can't imagine anything
being fast on a single table in 250,000,000 tuple range.  Nor can I
really imagine any database that efficiently splits a single table
across multiple machines (or even inefficiently unless some internal
partitioning is being done).

Ah, what about partial indexes - those might help.  As a kind of 
'semi-partition'.
He could also you schemas to partition out the information within the 
same database.

J
Chris

--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - [EMAIL PROTECTED] - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL
begin:vcard
fn:Joshua Drake
n:Drake;Joshua
org:Command Prompt, Inc.
adr:;;PO Box 215 ;Cascade Locks;OR;97014;US
email;internet:[EMAIL PROTECTED]
title:Consultant
tel;work:503-667-4564
tel;fax:503-210-0334
x-mozilla-html:FALSE
url:http://www.commandprompt.com
version:2.1
end:vcard


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Richard_D_Levine


I think maybe a SAN in conjunction with tablespaces might be the answer.
Still need one honking server.

Rick



  
  Stephen Frost 
  
  [EMAIL PROTECTED]   To:   Christopher 
Kings-Lynne [EMAIL PROTECTED]  
  Sent by:   cc:   Hervé 
Piedvache [EMAIL PROTECTED], pgsql-performance@postgresql.org
  [EMAIL PROTECTED]Subject:  Re: [PERFORM] 
PostgreSQL clustering VS MySQL clustering
  tgresql.org   
  

  

  
  01/20/2005 10:08 AM   
  

  

  




* Christopher Kings-Lynne ([EMAIL PROTECTED]) wrote:
 PostgreSQL has replication, but not partitioning (which is what you
want).

It doesn't have multi-server partitioning..  It's got partitioning
within a single server (doesn't it?  I thought it did, I know it was
discussed w/ the guy from Cox Communications and I thought he was using
it :).

 So, your only option is Oracle or another very expensive commercial
 database.

Or partition the data at the application layer.

 Stephen
(See attached file: signature.asc)


signature.asc
Description: Binary data

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Rod Taylor

On Thu, 2005-01-20 at 15:36 +0100, Hervé Piedvache wrote:
 Le Jeudi 20 Janvier 2005 15:24, Christopher Kings-Lynne a écrit :
   Is there any solution with PostgreSQL matching these needs ... ?
 
  You want: http://www.slony.info/
 
   Do we have to backport our development to MySQL for this kind of problem
   ? Is there any other solution than a Cluster for our problem ?
 
  Well, Slony does replication which is basically what you want :)
 
  Only master-slave though, so you will need to have all inserts go via
  the master server, but selects can come off any server.
 
 Sorry but I don't agree with this ... Slony is a replication solution ... I 
 don't need replication ... what will I do when my database will grow up to 50 
 Gb ... I'll need more than 50 Gb of RAM on each server ???
 This solution is not very realistic for me ...

Slony has some other issues with databases  200GB in size as well
(well, it hates long running transactions -- and pg_dump is a regular
long running transaction)

However, you don't need RAM one each server for this, you simply need
enough disk space.

Have a Master which takes writes, a replicator which you can consider
to be a hot-backup of the master, have N slaves replicate off of the
otherwise untouched replicator machine.

For your next trick, have the application send read requests for Clients
A-C to slave 1, D-F to slave 2, ...

You need enough memory to hold the index sections for clients A-C on
slave 1. The rest of the index can remain on disk. It's available should
it be required (D-F box crashed, so your application is now feeding
those read requests to the A-C machine)...

Go to more slaves and smaller segments as you require. Use the absolute
cheapest hardware you can find for the slaves that gives reasonable
performance. They don't need to be reliable, so RAID 0 on IDE drives is
perfectly acceptable.

PostgreSQL can do the replication portion quite nicely. You need to
implement the cluster part in the application side.
-- 


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread amrit

What you want is some kind of huge pararell computing , isn't it? I have heard
from many groups of Japanese Pgsql developer did it but they are talking in
japanese website and of course  in Japanese.
I can name one of them  Asushi Mitani  and his website
http://www.csra.co.jp/~mitani/jpug/pgcluster/en/index.html
and you may directly contact him.

Amrit
Thailand

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Stephen Frost

* [EMAIL PROTECTED] ([EMAIL PROTECTED]) wrote:
 I think maybe a SAN in conjunction with tablespaces might be the answer.
 Still need one honking server.

That's interesting- can a PostgreSQL partition be acress multiple
tablespaces?

Stephen


signature.asc
Description: Digital signature

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Greg Stark

Steve Wampler [EMAIL PROTECTED] writes:

 Hervé Piedvache wrote:
 
  No ... as I have said ... how I'll manage a database getting a table of may
  be 250 000 000 records ? I'll need incredible servers ... to get quick 
  access
  or index reading ... no ?
 
 Probably by carefully partitioning their data.  I can't imagine anything
 being fast on a single table in 250,000,000 tuple range.  

Why are you all so psyched out by the size of the table? That's what indexes
are for.

The size of the table really isn't relevant here. The important thing is the
size of the working set. Ie, How many of those records are required to respond
to queries.

As long as you tune your application so every query can be satisfied by
reading a (very) limited number of those records and have indexes to speed
access to those records you can have quick response time even if you have
terabytes of raw data. 

I would start by looking at the plans for the queries you're running and
seeing if you have any queries that are reading more than hundred records or
so. If so then you have to optimize them or rethink your application design.
You might need to restructure your data so you don't have to scan too many
records for any query.

No clustering system is going to help you if your application requires reading
through too much data. If every query is designed to not have to read more
than a hundred or so records then there's no reason you can't have sub-100ms
response time even if you had terabytes of raw data.

If the problem is just that each individual query is fast but there's too many
coming for a single server then something like slony is all you need. It'll
spread the load over multiple machines. If you spread the load in an
intelligent way you can even concentrate each server on certain subsets of the
data. But that shouldn't even really be necessary, just a nice improvement.

-- 
greg


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread William Yu

Hervé Piedvache wrote:
Sorry but I don't agree with this ... Slony is a replication solution ... I 
don't need replication ... what will I do when my database will grow up to 50 
Gb ... I'll need more than 50 Gb of RAM on each server ???
This solution is not very realistic for me ...
Have you confirmed you need a 1:1 RAM:data ratio? Of course more memory 
gets more speed but often at a diminishing rate of return. Unless every 
record of your 50GB is used in every query, only the most commonly used 
elements of your DB needs to be in RAM. This is the very idea of caching.

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Holger Hoffstaette

On Thu, 20 Jan 2005 16:32:27 +0100, Hervé Piedvache wrote:

 Le Jeudi 20 Janvier 2005 16:23, Dave Cramer a écrit :
 Google uses something called the google filesystem, look it up in
 google. It is a distributed file system.
 
 Yes that's another point I'm working on ... make a cluster of server using
 GFS ... and making PostgreSQL running with it ...

Did you read the GFS whitepaper? It really works differently from other
filesystems with regard to latency and consistency. You'll probably have
better success with Lustre (http://www.clusterfs.com/) or RedHat's Global
File System (http://www.redhat.com/software/rha/gfs/).
If you're looking for a 'cheap, free and easy' solution you can just as
well stop right now. :-)

-h



---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Alex Turner

The problem is very large ammounts of data that needs to be both read
and updated.  If you replicate a system, you will need to
intelligently route the reads to the server that has the data in RAM
or you will always be hitting DIsk which is slow. This kind of routing
AFAIK is not possible with current database technology, and you are
still stuck for writes.

Writes are always going to be the bane of any cluster.  Clustering can
give better parallel read performance i.e. large no. of clients
accessing data simultaneously, but your write performance is always
going to be bound by the underlying disk infrastructure, not even
Oracle RAC can get around this (It uses multiple read nodes accessing
the same set of database files underneath)

Google solved the problem by building this intelligence into the
middle tier, and using a distributed file system. Java Entity Beans
are supposed to solve this problem somewhat by distributing the data
across multiple servers in a cluster and allowing you to defer write
syncing, but it really doesn't work all that well.

The only way I know to solve this at the RDBMS layer is to configure a
very powerfull disk layer, which is basicaly going to a SAN mesh with
multiple cards on a single system with multiple IO boards, or an OS
that clusters at the base level, thinking HP Superdome or z900.  Even
Opteron w/PCI-X cards has a limit of about 400MB/sec throughput on a
single IO channel, and there are only two independent channels on any
boards I know about.

The other solution is to do what google did.  Implement your own
middle tier that knows how to route queries to the appropriate place. 
Each node can then have it's own independant database with it's own
independant disk subsystem, and your throughput is only limited by
your network interconnects, and your internet pipe.  This kind of
middle tier is really not that hard to if your data can easily be
segmented.  Each node runs it's own query sort and filter
independantly, and supplies the result to the central data broker,
which then collates the results and supplies them back to the user. 
Updated work in a similar fasion.  The update comes into the central
broker that decides which nodes it will affect, and then issues
updates to those nodes.

I've built this kind of architecture, if you want to do it, don't use
Java unless you want to pay top dollar for your programmers, because
it's hard to make it work well in Java (most JMS implementations suck,
look at MQueue or a custom queue impl, forget XML it's too slow to
serialize and deserialize requests).

Alex Turner
NetEconomist


On Thu, 20 Jan 2005 11:13:25 -0500, Stephen Frost [EMAIL PROTECTED] wrote:
 * [EMAIL PROTECTED] ([EMAIL PROTECTED]) wrote:
  I think maybe a SAN in conjunction with tablespaces might be the answer.
  Still need one honking server.
 
 That's interesting- can a PostgreSQL partition be acress multiple
 tablespaces?
 
 Stephen
 
 


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Darcy Buskermolen

On January 20, 2005 06:49 am, Joshua D. Drake wrote:
 Stephen Frost wrote:
 * Herv? Piedvache ([EMAIL PROTECTED]) wrote:
 Le Jeudi 20 Janvier 2005 15:30, Stephen Frost a écrit :
 * Herv? Piedvache ([EMAIL PROTECTED]) wrote:
 Is there any solution with PostgreSQL matching these needs ... ?
 
 You might look into pg_pool.  Another possibility would be slony, though
 I'm not sure it's to the point you need it at yet, depends on if you can
 handle some delay before an insert makes it to the slave select systems.
 
 I think not ... pgpool or slony are replication solutions ... but as I
  have said to Christopher Kings-Lynne how I'll manage the scalabilty of
  the database ? I'll need several servers able to load a database growing
  and growing to get good speed performance ...
 
 They're both replication solutions, but they also help distribute the
 load.  For example:
 
 pg_pool will distribute the select queries amoung the servers.  They'll
 all get the inserts, so that hurts, but at least the select queries are
 distributed.
 
 slony is similar, but your application level does the load distribution
 of select statements instead of pg_pool.  Your application needs to know
 to send insert statements to the 'main' server, and select from the
 others.

 You can put pgpool in front of replicator or slony to get load
 balancing for reads.

Last time I checked load ballanced reads was only available in pgpool if you 
were using pgpools's internal replication.  Has something changed recently?


 Is there any other solution than a Cluster for our problem ?
 
 Bigger server, more CPUs/disks in one box.  Try to partition up your
 data some way such that it can be spread across multiple machines, then
 if you need to combine the data have it be replicated using slony to a
 big box that has a view which joins all the tables and do your big
 queries against that.
 
 But I'll arrive to limitation of a box size quickly I thing a 4
  processors with 64 Gb of RAM ... and after ?

 Opteron.

IBM Z-series, or other big iron.


 Go to non-x86 hardware after if you're going to continue to increase the
 size of the server.  Personally I think your better bet might be to
 figure out a way to partition up your data (isn't that what google
 does anyway?).
 
  Stephen

-- 
Darcy Buskermolen
Wavefire Technologies Corp.
ph: 250.717.0200
fx:  250.763.1759
http://www.wavefire.com

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Darcy Buskermolen

On January 20, 2005 06:51 am, Christopher Kings-Lynne wrote:
 Sorry but I don't agree with this ... Slony is a replication solution
  ... I don't need replication ... what will I do when my database will
  grow up to 50 Gb ... I'll need more than 50 Gb of RAM on each server
  ??? This solution is not very realistic for me ...
 
 I need a Cluster solution not a replication one or explain me in details
 how I will do for managing the scalabilty of my database ...
 
 Buy Oracle
 
  I think this is not my solution ... sorry I'm talking about finding a
  PostgreSQL solution ...

 My point being is that there is no free solution.  There simply isn't.
 I don't know why you insist on keeping all your data in RAM, but the
 mysql cluster requires that ALL data MUST fit in RAM all the time.

 PostgreSQL has replication, but not partitioning (which is what you want).

 So, your only option is Oracle or another very expensive commercial
 database.

Another Option to consider would be pgmemcache.  that way you just build the 
farm out of lots of large memory, diskless boxes for keeping the whole 
database in memory in the whole cluster.  More information on it can be found 
at: http://people.freebsd.org/~seanc/pgmemcache/


 Chris

 ---(end of broadcast)---
 TIP 8: explain analyze is your friend

-- 
Darcy Buskermolen
Wavefire Technologies Corp.
ph: 250.717.0200
fx:  250.763.1759
http://www.wavefire.com

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Mitch Pirtle

On Thu, 20 Jan 2005 09:33:42 -0800, Darcy Buskermolen
[EMAIL PROTECTED] wrote:
 
 Another Option to consider would be pgmemcache.  that way you just build the
 farm out of lots of large memory, diskless boxes for keeping the whole
 database in memory in the whole cluster.  More information on it can be found
 at: http://people.freebsd.org/~seanc/pgmemcache/

Which brings up another question: why not just cluster at the hardware
layer? Get an external fiberchannel array, and cluster a bunch of dual
Opterons, all sharing that storage. In that sense you would be getting
one big PostgreSQL 'image' running across all of the servers.

Or is that idea too 90's?  ;-)

-- Mitch

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Darcy Buskermolen

On January 20, 2005 10:42 am, Mitch Pirtle wrote:
 On Thu, 20 Jan 2005 09:33:42 -0800, Darcy Buskermolen

 [EMAIL PROTECTED] wrote:
  Another Option to consider would be pgmemcache.  that way you just build
  the farm out of lots of large memory, diskless boxes for keeping the
  whole database in memory in the whole cluster.  More information on it
  can be found at: http://people.freebsd.org/~seanc/pgmemcache/

 Which brings up another question: why not just cluster at the hardware
 layer? Get an external fiberchannel array, and cluster a bunch of dual
 Opterons, all sharing that storage. In that sense you would be getting
 one big PostgreSQL 'image' running across all of the servers.

It dosn't quite work that way, thanks to shared memory, and kernel disk cache.  
(among other things)

 Or is that idea too 90's?  ;-)

 -- Mitch

 ---(end of broadcast)---
 TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

-- 
Darcy Buskermolen
Wavefire Technologies Corp.
ph: 250.717.0200
fx:  250.763.1759
http://www.wavefire.com

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread =?iso-8859-15?q?Herv=E9_Piedvache?=

Le Jeudi 20 Janvier 2005 19:09, Bruno Almeida do Lago a écrit :
 Could you explain us what do you have in mind for that solution? I mean,
 forget the PostgreSQL (or any other database) restrictions and explain us
 how this hardware would be. Where the data would be stored?

 I've something in mind for you, but first I need to understand your needs!

I just want to make a big database as explained in my first mail ... At the 
beginning we will have aprox. 150 000 000 records ... each month we will add 
about 4/8 millions new rows in constant flow during the day ... and in same 
time web users will access to the database in order to read those data.
Stored data are quite close to data stored by google ... (we are not making a 
google clone ... just a lot of data many small values and some big ones ... 
that's why I'm comparing with google for data storage).
Then we will have a search engine searching into those data ...

Dealing about the hardware, for the moment we have only a bi-pentium Xeon 
2.8Ghz with 4 Gb of RAM ... and we saw we had bad performance results ... so 
we are thinking about a new solution with maybe several servers (server 
design may vary from one to other) ... to get a kind of cluster to get better 
performance ...

Am I clear ?

Regards,
-- 
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Jean-Max Reymond

On Thu, 20 Jan 2005 12:13:17 -0700, Steve Wampler [EMAIL PROTECTED] wrote:
 Mitch Pirtle wrote:

 But that's not enough, because you're going to be running separate
 postgresql backends on the different hosts, and there are
 definitely consistency issues with trying to do that.  So far as
 I know (right, experts?) postgresql isn't designed with providing
 distributed consistency in mind (isn't shared memory used for
 consistency, which restricts all the backends to a single host?).

yes, you're right: you'll need a Distributed Lock Manager and an
application to manage it , Postgres ?

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Greg Stark

Hervé Piedvache [EMAIL PROTECTED] writes:

 Le Jeudi 20 Janvier 2005 19:09, Bruno Almeida do Lago a écrit :
  Could you explain us what do you have in mind for that solution? I mean,
  forget the PostgreSQL (or any other database) restrictions and explain us
  how this hardware would be. Where the data would be stored?
 
  I've something in mind for you, but first I need to understand your needs!
 
 I just want to make a big database as explained in my first mail ... At the 
 beginning we will have aprox. 150 000 000 records ... each month we will add 
 about 4/8 millions new rows in constant flow during the day ... and in same 
 time web users will access to the database in order to read those data.
 Stored data are quite close to data stored by google ... (we are not making a 
 google clone ... just a lot of data many small values and some big ones ... 
 that's why I'm comparing with google for data storage).
 Then we will have a search engine searching into those data ...

You're concentrating on the data within the database. That's only half the
picture. What are you going to *do* with the data in the database? You need to
analyze what we will have a search engine searching into those data means in
more detail.

Postgres is more than capable of storing 150Gb of data. There are people with
terabyte databases on this list. You need to define what types of queries you
need to perform, how many data they need to manipulate, and what your
performance requirements are for those queries.

-- 
greg


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

2005-01-20 Thread Dave Cramer





Two way xeon's are as fast as a single opteron, 150M rows isn't a big
deal.
Clustering isn't really the solution, I fail to see how clustering
actually helps since it has to slow down file access.

Dave

Herv Piedvache wrote:

  Le Jeudi 20 Janvier 2005 19:09, Bruno Almeida do Lago a crit :
  
  
Could you explain us what do you have in mind for that solution? I mean,
forget the PostgreSQL (or any other database) restrictions and explain us
how this hardware would be. Where the data would be stored?

I've something in mind for you, but first I need to understand your needs!

  
  
I just want to make a big database as explained in my first mail ... At the 
beginning we will have aprox. 150 000 000 records ... each month we will add 
about 4/8 millions new rows in constant flow during the day ... and in same 
time web users will access to the database in order to read those data.
Stored data are quite close to data stored by google ... (we are not making a 
google clone ... just a lot of data many small values and some big ones ... 
that's why I'm comparing with google for data storage).
Then we will have a search engine searching into those data ...

Dealing about the hardware, for the moment we have only a bi-pentium Xeon 
2.8Ghz with 4 Gb of RAM ... and we saw we had bad performance results ... so 
we are thinking about a new solution with maybe several servers (server 
design may vary from one to other) ... to get a kind of cluster to get better 
performance ...

Am I clear ?

Regards,
  


-- 
Dave Cramer
http://www.postgresintl.com
519 939 0336
ICQ#14675561

1 2 >

1 - 100 of 115 matches

Mail list logo