It is a pity that a fast geo-replicated database does not exist. The choice is plague or cholera:

1) Distributed sorted heap-tables with local-machine indexing: Couchbase, MongoDB, ...
2) Distributed hash-tables with local-machine indexing: Cassandra, ScyllaDB, Riak, Aerospike, ...
   Cons: not scalable; all nodes need to be queried (see the sketch below)
3) Distributed sorted heaps with global consistency: Spanner, CockroachDB, FoundationDB, TiKV, etcd, Yugabyte
   Cons: scalability; the more DCs, the more write amplification (easily 500 ms)
4) Fast AP databases: Google Bigtable, DynamoDB, HBase, Hypertable (dead, and flawed)
   Cons: proprietary, and the HBase API sucks if you're not on the JVM
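
A minimal sketch of the scatter problem behind the cons in 1) and 2), in toy Python (not any real database's API): placement hashes the primary key, so documents matching a secondary-index predicate can sit on any node, and the query must visit all of them.

NODES = 5

def node_for(primary_key: str) -> int:
    # Placement looks only at the primary key, never at secondary attributes.
    return hash(primary_key) % NODES

# Spread 100 toy documents across the cluster.
cluster = {n: [] for n in range(NODES)}
for i in range(100):
    doc = {"id": f"doc-{i}", "colour": "red" if i % 7 == 0 else "blue"}
    cluster[node_for(doc["id"])].append(doc)

# A query on the secondary attribute has no locality: matches are
# scattered, so every node must be consulted.
holders = {n for n, docs in cluster.items()
           if any(d["colour"] == "red" for d in docs)}
print(f"nodes holding matches: {len(holders)} of {NODES}")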
Regards,
Marcus

On Thursday, 13.06.2024 at 13:40 +0100, Robert Newson wrote:
> Hi,
>
> Right. A database is divided into Q shards, each of which covers part
> of a numeric range from 0 to 2^32-1. The doc id is run through a
> hashing function that outputs a number in that range, and the
> document is then stored inside the shard whose range contains that
> number.
>
> As this effectively randomises which doc goes in which shard, when we
> need to perform a query across the database we need to ask each shard
> range to perform the query and then combine the results.
>
> This is good in some ways (the shards can be stored across multiple
> machines and thus benefit from more cores/disks/memory/etc); it's bad
> in others: the number of workers that need to respond goes up as Q
> goes up.
>
> "Partition" is just our word (and arguably a misleading word) for
> allowing users to control which shard range their documents end up
> in. Only part of the doc id is passed through the hashing function
> (doc id "foo:bar" would be in partition "foo", and we'd hash only
> "foo"). Thus the user can ensure that a group of documents lands in
> the same shard range. The benefit is as Nick described: when querying
> _within_ a given partition we know that only one shard range is in
> play, so we do not consult shards for other ranges (we already know
> they will have no results).
>
> So, to your specific list of questions:
>
> - Yes.
> - Yes, but the reason to keep partitions small is the fixed
>   assignment to a given shard range. The default scheme would load
>   up each shard range fairly evenly. It's about disk space
>   management and predictable performance.
> - When querying a global view we need Q responses (and will issue
>   Q*N requests, but cancel Q*(N-1) of those when the first response
>   from each range has been received, though they may have already
>   done some, or all, of their work before that happens). For a
>   partitioned view we need 1 response (but will issue N requests).
> - Quorum applies to document operations (read, write, delete), but
>   not to view queries (R is inherently 1).
>
> B.
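
To make Robert's routing description concrete, a minimal sketch in Python. crc32 stands in for the hash function only because it outputs into the same 0..2^32-1 range; CouchDB's actual hash may differ, and Q=4 is arbitrary here.

import zlib

Q = 4                    # number of shard ranges (arbitrary for the sketch)
RANGE_SIZE = 2**32 // Q  # each shard covers an equal slice of 0..2^32-1

def shard_range_for(doc_id: str, partitioned: bool = False) -> int:
    # In a partitioned database only the part before ":" is hashed, so
    # "foo:bar" and "foo:baz" land in the same shard range; otherwise
    # the whole doc id is hashed and placement is effectively random.
    key = doc_id.split(":", 1)[0] if partitioned else doc_id
    h = zlib.crc32(key.encode("utf-8"))  # a number in 0..2^32-1
    return h // RANGE_SIZE

assert shard_range_for("foo:bar", partitioned=True) == \
       shard_range_for("foo:baz", partitioned=True)

A query scoped to partition "foo" therefore only ever needs the workers for that single range, which is the speed-up Nick described below.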
> On 13 Jun 2024, at 09:01, Matthieu Rakotojaona
> <matthieu.rakotojaona-rainimangav...@imt-atlantique.fr> wrote:
>
> > Hello, I'm still a bit confused with shards versus partitions. Is
> > this correct?
> >
> > - A database is composed of Q shards; shards split the database
> >   "automatically".
> > - A shard is composed of 1 or more partitions; partitions split
> >   the database according to the user's choice. A partition should
> >   thus be made so that it is not huge; in fact it must stay under
> >   (total size / Q).
> > - When querying a view, the whole database needs to be processed,
> >   that means all Q shards.
> > - When querying a view only on documents inside a partition, since
> >   one partition is on 1 shard, only one shard is queried.
> > - Whatever the number of shards, quorum still needs to be
> >   verified, i.e. for Q=2 and N=3 I still need to wait for
> >   (3+1)/2 = 2 replies for each shard.
> >
> > On 12/06/2024 19:54, Nick Vatamaniuc wrote:
> > > Another feature related to efficient view querying is partitioned
> > > databases:
> > > https://docs.couchdb.org/en/stable/partitioned-dbs/index.html.
> > > It's a bit of a niche, as you'd need to have a good partition
> > > key, but aside from that, it can speed up your queries, as
> > > responses would be coming from a single shard only instead of
> > > Q shards.
> > >
> > > On Wed, Jun 12, 2024 at 1:30 PM Markus Doppelbauer
> > > <doppelba...@gmx.net.invalid> wrote:
> > > > Hi Nick,
> > > >
> > > > Thank you very much for your reply. This is exactly what we
> > > > are looking for. There are so many DBs that store the
> > > > secondary index locally (Cassandra, Aerospike, ScyllaDB, ...).
> > > >
> > > > Thanks again for the answer
> > > > Marcus
> > > >
> > > > On Wednesday, 12.06.2024 at 13:23 -0400, Nick Vatamaniuc wrote:
> > > > > Hi Marcus,
> > > > >
> > > > > The node handling the request only queries the nodes with
> > > > > shard copies of that database. In a 100-node cluster the
> > > > > shards for that particular database might be present on only
> > > > > 6 nodes, depending on the Q and N sharding factors, so it
> > > > > will query 6 out of 100 nodes. For instance, for N=3 and Q=2
> > > > > sharding factors, it will first send N*Q=6 requests, and
> > > > > wait until it gets at least one response for each of the Q=2
> > > > > shard ranges. This happens very quickly. Then, for the
> > > > > duration of the response, it will only stream responses from
> > > > > those Q=2 workers. So, to summarize: for a Q=2 database, it
> > > > > will be a streaming response from 2 workers. For Q=4, from 4
> > > > > workers, etc.
> > > > >
> > > > > Cheers,
> > > > > -Nick
> > > > >
> > > > > On Wed, Jun 12, 2024 at 1:00 PM Markus Doppelbauer
> > > > > <doppelba...@gmx.net.invalid> wrote:
> > > > > > Hello,
> > > > > >
> > > > > > Is the CouchDB view a "global" or a "local" index? For
> > > > > > example, if a cluster has 100 nodes, would the query ask a
> > > > > > single node - or 100 nodes?
> > > > > >
> > > > > > /.../_view/posts?startkey="foobar"&endkey="foobaz"
> > > > > >
> > > > > > Best wishes
> > > > > > Marcus
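
To tie Nick's and Robert's numbers together, a rough sketch of the global-view fan-out (Python asyncio with invented latencies; CouchDB's real scheduling lives in its Erlang internals and may behave differently):

import asyncio
import random

Q, N = 2, 3  # shard ranges x copies, as in Nick's example: Q*N = 6 requests

async def worker(shard_range: int, copy: int):
    await asyncio.sleep(random.uniform(0.01, 0.1))  # stand-in for a real RPC
    return shard_range, copy

async def global_view_query():
    # Issue Q*N requests, keep the first responder per shard range,
    # and cancel the remaining Q*(N-1), as Robert describes.
    pending = {asyncio.create_task(worker(r, c))
               for r in range(Q) for c in range(N)}
    winners = {}
    while len(winners) < Q:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED)
        for t in done:
            r, c = t.result()
            winners.setdefault(r, c)  # first response per range wins
    for t in pending:
        t.cancel()  # the redundant workers
    return winners  # one streaming worker per shard range

print(asyncio.run(global_view_query()))

A partitioned view is the degenerate case (1 shard range, N requests, 1 response), and, per Robert, quorum applies only to document reads and writes; view reads are inherently R=1.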