Markus,
Even here I think there is a common confusion between
performance and scalability. Most people think that by having
multiple nodes their queries will run faster, which is obviously wrong
if the original workload does not saturate a single node.
Sure. Do you think that should be made clearer?
Yes, I think so, because this is a very common belief that we encounter
among new users.
The replication mechanisms actually add overhead (usually
perceived as increased latency) to query execution. It is ONLY
when the workload increases that you see throughput going up
(ideally somewhat close to the workload increase) while query latency
remains stable. Unless you really have parallel query execution
(which is only efficient for big queries anyway), you will never see a
performance improvement on a single query execution, since in the end
it is always the same database engine that executes the query.
I don't quite agree with that statement, but probably I'm just
misreading it. If you have enough concurrent transactions to
spread among the nodes, you'll certainly see an improvement. After
all, it makes a huge difference whether your single node is processing
tens or hundreds of concurrent transactions.
Yes, but that already means that your single node was already somewhat
of a bottleneck. My point was that for low workloads (note that "low" is
relative here, since many users have dual-CPU machines with decent RAM
and disks, and it takes quite a number of concurrent transactions to
reach the peak point), you will not see any improvement, and you may
even see a slight degradation, especially from a latency perspective.
Below the peak point of a single machine, you will get the same
performance (from a client point of view), but the load on the various
machine resources will be divided by the number of machines in the
cluster (at best). For example, if I have a workload of 50
requests/second that generates 50% CPU load on 1 node, I will still get
my 50 req/s with 2 machines, but the CPU load will only be 25% on each node.
Now the contention can be elsewhere (disk, locks, ...) and exhibit
different scalability characteristics, but it usually conforms to the
model I described.
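To make the arithmetic of that load-distribution model concrete, here is a minimal sketch (the function name and the 1%-CPU-per-request cost are illustrative assumptions, not anything from Sequoia or Middle-R): below a single node's saturation point, adding nodes does not change client-visible throughput or latency, it only divides per-node resource usage.

```python
def per_node_load(workload_req_s, cpu_cost_per_req, nodes):
    """CPU load fraction on each node, assuming an ideal even spread
    of the workload across identical nodes (no replication overhead)."""
    return workload_req_s * cpu_cost_per_req / nodes

# 50 req/s generating 50% CPU on one node means each request costs 1% CPU.
single = per_node_load(50, 0.01, 1)  # 0.50 -> 50% load on the single node
dual   = per_node_load(50, 0.01, 2)  # 0.25 -> 25% load on each of two nodes
```

In practice replication overhead and contention elsewhere (disk, locks) make the division less than ideal, as noted above.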
Of course, the number of concurrent transactions limits how far a
replication solution can scale. Having more nodes than concurrent
transactions does not make sense. (With the exception, of course, of
parallel query execution.)
Yes, but don't underestimate the capability of a single node to execute
transactions in parallel as well. Oftentimes, sending 2 concurrent
transactions to a single node or to 2 different nodes does not make any
difference (obviously it depends on the nature of the transactions).
To summarize, clustering solutions provide performance scalability
(stable latency, throughput increasing almost linearly with load) but
no performance improvement in individual query execution time.
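That summary can be sketched as two simple functions (an idealized model of my own, with illustrative names, not code from any of the systems discussed): cluster throughput tracks the offered load up to the cluster's aggregate capacity, while per-query latency never drops below what one engine needs, plus the replication cost.

```python
def cluster_throughput(offered_load, node_capacity, nodes):
    """Ideal throughput: follows the offered load until the cluster
    saturates at nodes * node_capacity (linear scalability at best)."""
    return min(offered_load, nodes * node_capacity)

def query_latency(engine_latency, replication_overhead):
    """Per-query latency: the same engine executes the query in the end,
    so clustering can only add overhead, never speed up a single query."""
    return engine_latency + replication_overhead
```

For example, with nodes handling 100 req/s each, 2 nodes carry an offered load of 150 req/s in full, while a single node would cap at 100; yet each individual query still takes at least its single-engine execution time.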
Yes for write transactions, but no for read-only ones (queries?). Or
why do you have to add overhead to read-only queries?
In a middleware approach you have to proxy the read results as well, so
you add some latency there. When replication is integrated into the
database you can avoid this extra hop, but the replication logic still
adds some overhead to every query (which seems inevitable if you want to
ensure consistency).
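A rough latency-accounting sketch of that extra hop (the numbers and function name are hypothetical, not measurements of Sequoia or any middleware): with a proxy in the path, the request and the result set each cross the network twice instead of once, so a read-only query pays two extra hop delays on top of its execution time.

```python
def middleware_added_latency(exec_ms, hop_ms):
    """Extra latency a read query pays when routed through a middleware
    proxy, compared with a direct client-to-database connection."""
    direct  = 2 * hop_ms + exec_ms  # client -> database -> client
    proxied = 4 * hop_ms + exec_ms  # client -> proxy -> database -> proxy -> client
    return proxied - direct         # two extra network hops
```

Note the added cost is independent of the query's execution time, which is why it shows up as a fixed latency tax even on cheap reads.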
If the client application is not multithreaded, it is very unlikely
that any solution will improve the application's performance.
Ehm... I wouldn't refer to threading here. You can very well have
multiple single-process programs running on different nodes.
I'd keep referring to concurrency of transactions.
Yes, you are right. Talking about concurrent transactions is much clearer.
Yeah, I thought you meant that one. I don't know Middle-R at all,
sorry. It seems similar to Sequoia. Did you base your work on Middle-R?
No, Sequoia is much older than Middle-R. In fact, Sequoia is the
continuation of the C-JDBC project. We use different replication
techniques in Middle-R and Sequoia. But Sequoia can be used on top of
Middle-R to provide load balancing, transparent failover and caching,
which are missing in Middle-R.
What are your development plans for Postgres-R?
To make it work and production ready as soon as possible. ;-) I'm
currently working on initialization and recovery.
Good luck, this is the hardest part! You'll soon figure out that
replication was really the easy part!
Thanks again for your comments,
Emmanuel
--
Emmanuel Cecchet
Chief Scientific Officer, Continuent
Blog: http://emanux.blogspot.com/
Open source: http://www.continuent.org
Corporate: http://www.continuent.com
Skype: emmanuel_cecchet
Cell: +33 687 342 685
_______________________________________________
Sequoia mailing list
[email protected]
https://forge.continuent.org/mailman/listinfo/sequoia