Markus,
Even here I think there is a common confusion between
performance and scalability. Most people think that by having
multiple nodes their queries will run faster, which is obviously wrong
if the original workload does not saturate a single node.
Sure. Do you think that should be made clearer?
Yes, I think so, because this is a very common belief that we encounter
among new users.
The replication mechanisms actually add overhead (usually
perceived as increased latency) to query execution. It is ONLY
when the workload increases that you see throughput going up
(ideally somewhat close to the workload increase) while query latency
remains stable. Unless you really have parallel query execution
(which is only efficient for big queries anyway), you will never see a
performance improvement on a single query execution, since in the end
it is always the same database engine that executes the query.
I don't quite agree with that statement, but probably I'm just
misreading it. If you have enough concurrent transactions to
spread among the nodes, you'll certainly see an improvement. After
all, it makes a huge difference whether your single node is processing
tens or hundreds of concurrent transactions.
Yes, but that already means that your single node was already somewhat
of a bottleneck. My point was that for low workloads (note that "low" is
relative here, since many users have dual-CPU machines with decent RAM
and disks, and it takes quite a number of concurrent transactions to
reach the peak point), you will not see any improvement, and you may
even see a slight degradation, especially from a latency perspective.
Below the peak point of a single machine, you will get the same
performance (from a client point of view), but the load on the various
machine resources will be divided by the number of machines in the
cluster (at best). For example, if I have a workload of 50
requests/second that generates 50% CPU load on 1 node, I will still get
my 50 req/s with 2 machines, but the CPU load will only be 25% on each node.
Now the contention can be elsewhere (disk, locks, ...) and exhibit
different scalability characteristics, but it usually conforms to the
model I described.
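To make the arithmetic of that load-distribution model concrete, here is a minimal sketch (the function name and the 1%-CPU-per-request cost are illustrative assumptions, not anything from Sequoia or Middle-R): below a single node's saturation point, adding nodes does not change client-visible throughput or latency, it only divides per-node resource usage.

```python
def per_node_load(workload_req_s, cpu_cost_per_req, nodes):
    """CPU load fraction on each node, assuming an ideal even spread
    of the workload across identical nodes (no replication overhead)."""
    return workload_req_s * cpu_cost_per_req / nodes

# 50 req/s generating 50% CPU on one node means each request costs 1% CPU.
single = per_node_load(50, 0.01, 1)  # 0.50 -> 50% load on the single node
dual   = per_node_load(50, 0.01, 2)  # 0.25 -> 25% load on each of two nodes
```

In practice replication overhead and contention elsewhere (disk, locks) make the division less than ideal, as noted above.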
Of course, the number of concurrent transactions limits how far a
replication solution can scale. Having more nodes than concurrent
transactions does not make sense. (With the exception, of course, of
parallel query execution.)
Yes, but don't underestimate the capability of a single node to execute
transactions in parallel as well. Oftentimes, sending 2 concurrent
transactions to a single node or to 2 different nodes does not make any
difference (obviously it depends on the nature of the transactions).
To summarize, clustering solutions provide performance scalability
(stable latency, throughput increasing almost linearly with load) but
no performance improvement in individual query execution time.
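That summary can be sketched as two simple functions (an idealized model of my own, with illustrative names, not code from any of the systems discussed): cluster throughput tracks the offered load up to the cluster's aggregate capacity, while per-query latency never drops below what one engine needs, plus the replication cost.

```python
def cluster_throughput(offered_load, node_capacity, nodes):
    """Ideal throughput: follows the offered load until the cluster
    saturates at nodes * node_capacity (linear scalability at best)."""
    return min(offered_load, nodes * node_capacity)

def query_latency(engine_latency, replication_overhead):
    """Per-query latency: the same engine executes the query in the end,
    so clustering can only add overhead, never speed up a single query."""
    return engine_latency + replication_overhead
```

For example, with nodes handling 100 req/s each, 2 nodes carry an offered load of 150 req/s in full, while a single node would cap at 100; yet each individual query still takes at least its single-engine execution time.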
Yes for write transactions, but no for read-only ones (queries?). Or
why do you have to add overhead to read-only queries?
In a middleware approach you have to proxy the read results as well, so
you add some latency there. When replication is integrated into the
database you can avoid this extra hop, but the replication logic still
adds some overhead to every query (which seems inevitable if you want to
ensure consistency).
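A rough latency-accounting sketch of that extra hop (the numbers and function name are hypothetical, not measurements of Sequoia or any middleware): with a proxy in the path, the request and the result set each cross the network twice instead of once, so a read-only query pays two extra hop delays on top of its execution time.

```python
def middleware_added_latency(exec_ms, hop_ms):
    """Extra latency a read query pays when routed through a middleware
    proxy, compared with a direct client-to-database connection."""
    direct  = 2 * hop_ms + exec_ms  # client -> database -> client
    proxied = 4 * hop_ms + exec_ms  # client -> proxy -> database -> proxy -> client
    return proxied - direct         # two extra network hops
```

Note the added cost is independent of the query's execution time, which is why it shows up as a fixed latency tax even on cheap reads.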
If the client application is not multithreaded, it is very unlikely
that any solution will improve the application's performance.
Ehm... I wouldn't refer to threading here. You can very well have
multiple single-process programs running on different nodes.
I'd keep referring to concurrency of transactions.
Yes, you are right. Talking about concurrent transactions is much clearer.
Yeah, I thought you meant that one. I don't know Middle-R at all,
sorry. It seems similar to Sequoia. Did you base your work on Middle-R?
No, Sequoia is much older than Middle-R. In fact, Sequoia is the
continuation of the C-JDBC project. We use different replication
techniques in Middle-R and Sequoia. But Sequoia can be used on top of
Middle-R to provide load balancing, transparent failover and caching,
which are missing in Middle-R.
What are your development plans for Postgres-R?
To make it work and production ready as soon as possible. ;-) I'm
currently working on initialization and recovery.
Good luck, this is the hardest part! You'll soon figure out that
replication was really the easy part!
Thanks again for your comments,
Emmanuel
--
Emmanuel Cecchet
Chief Scientific Officer, Continuent
Blog: http://emanux.blogspot.com/
Open source: http://www.continuent.org
Corporate: http://www.continuent.com
Skype: emmanuel_cecchet
Cell: +33 687 342 685
_______________________________________________
Sequoia mailing list
[email protected]
https://forge.continuent.org/mailman/listinfo/sequoia