Hi!

You might be interested in this open source project that we wrote at my
work for our daily backups: https://github.com/tolteck/couchcopy

It moves shards with a simple rsync, then uses [CouchDB shard management](
https://docs.couchdb.org/en/stable/cluster/sharding.html) to make the new
cluster aware of these shards.
If you need zero downtime, you can run a normal replication after couchcopy
finishes, to catch up on the changes accumulated during the copy.
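
For example, assuming placeholder hostnames, credentials and database name,
the catch-up step is a single POST to CouchDB's _replicate endpoint:

    curl -X POST http://admin:password@new-cluster:5984/_replicate \
         -H 'Content-Type: application/json' \
         -d '{"source": "http://admin:password@old-node:5984/mydb",
              "target": "http://admin:password@new-cluster:5984/mydb"}'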

There is also this upstream discussion where I asked whether it was a good
idea to write a tool like couchcopy:
https://github.com/apache/couchdb/discussions/3383

Everything that couchcopy does can be done manually with curl by following
the [CouchDB shard management](
https://docs.couchdb.org/en/stable/cluster/sharding.html) documentation. In
your case, with only one database, it might be quicker to do it manually.
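
Roughly, the manual route looks like this (node names, paths and
credentials below are examples, not taken from the docs; see the link above
for the full procedure, including the changelog entries):

    # 1. Copy the shard files and their view indexes to the new node.
    rsync -av /opt/couchdb/data/shards/  new-node:/opt/couchdb/data/shards/
    rsync -av /opt/couchdb/data/.shards/ new-node:/opt/couchdb/data/.shards/

    # 2. Fetch the database's shard map.
    curl http://admin:password@localhost:5984/_node/_local/_dbs/mydb > metadata.json

    # 3. Edit "by_node" and "by_range" in metadata.json to reference the
    #    new node, then write the updated map back.
    curl -X PUT http://admin:password@localhost:5984/_node/_local/_dbs/mydb \
         -d @metadata.json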

Don't hesitate to message me privately or to open an issue on couchcopy if
needed.


On Fri, 15 Mar 2024 at 08:39, Chris Bayliss
<christopher.bayl...@unimelb.edu.au.invalid> wrote:

> Hi all,
>
> I inherited a single-node CouchDB database that backs a medical research
> project. We've been using CouchDB for 10+ years, so that in itself isn't a
> concern. Then I spotted that it uses a single database to store billions
> (10^9 if we're being pedantic) of documents, 2B at the time and just over
> a TB of data, across the default 2 shards. Not ideal, but technically not
> a problem. Then I spotted it's ingesting ~30M documents a day and was
> continuously compacting and reindexing everything associated with this
> database.
>
> Skipping over months of trial and error: I'm currently replicating it to
> a 4-node, NVMe-backed cluster with n=3 and q=256. Everything is running
> 3.3.3 (the Erlang 24.3 version). I've read [1] and [2], and right now it's
> replicating at 2.25k documents a second, +/- 0.5k. This is acceptable, and
> it will catch up with the initial node eventually, but at the current rate
> that will take ~60 days.
>
> How can I speed this process up, if at all?
>
> I’d add the code that accesses this database isn’t mine either so
> splitting the database out into logical subsets isn’t an option at this
> time.
>
> Thanks
>
>     Chris
>
> 1 -
> https://blog.cloudant.com/2023/02/08/Replication-efficiency-improvements.html
> 2 - https://github.com/apache/couchdb/issues/4308
>
>
> --
> Christopher Bayliss
> Senior Software Engineer, Melbourne eResearch Group
>
> School of Computing and Information Systems
> Level 5, Melbourne Connect (Building 290)
> University of Melbourne, VIC, 3010, Australia
>
> Email: christopher.bayl...@unimelb.edu.au
>
>
>
