With a (somewhat. kinda. sorta. maybe.) similar requirement, I ended up
doing it as follows:
(1) created a 'daily' database that data got dumped into in very small
increments - approximately 5 docs/second
(2) uni-directionally replicated the documents out of this database
into a 'reporting' database that I could suck data out of (there's a quick
sketch of this just below the list)
(3) sucked the data out of the reporting database at 15-minute intervals,
processed it somewhat, and dumped all of *that* into one single (highly
sharded) bigcouch db
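The replication in (2), by the way, is nothing more than CouchDB's built-in
replicator pointed one way. Very roughly, in Python - the database names and
server URL here are made up, yours will obviously differ:

    import requests

    COUCH = "http://localhost:5984"

    # Kick off a continuous, one-way replication from the 'daily'
    # database into the 'reporting' database. CouchDB keeps this
    # running in the background until it is cancelled.
    resp = requests.post(
        COUCH + "/_replicate",
        json={"source": "daily", "target": "reporting", "continuous": True},
    )
    resp.raise_for_status()

(You can equally well drop a document into the _replicator database if you
want the replication to survive a server restart.)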
The advantages here were:
- My data was captured in the format best suited for the data-generating
events (minimum processing of the event data), thanx to (1)
- The processing of this data did not impact the writing of the data,
thanx to (2), allowing for maximum throughput
- I could compact and archive the 'daily' database every day, thus
significantly minimizing disk space, thanx to (1) (a compaction sketch is
after this list). Also, we only retain the 'daily' data for 3 months, since
anything beyond that is stale (for our purposes. YMMV)
- The collated data that ends up in bigcouch per (3) is much, *much*
smaller. But if we end up needing a different collation (and yes, that
happens every now and then), I can just rerun the reporting process (up to
the last 3 months, of course) - there's a rough sketch of that pass after
this list too. In fact, I can have multiple collations running in
parallel...
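The compaction in the third point is just a POST to the database's _compact
handler, something like this (db name illustrative):

    import requests

    # Compact the 'daily' database to reclaim disk space. CouchDB runs
    # the compaction in the background and returns immediately.
    requests.post(
        "http://localhost:5984/daily/_compact",
        headers={"Content-Type": "application/json"},
    )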
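And the 15-minute reporting pass in (3) is, in spirit, something like the
below. This is very much a sketch - the field names, the aggregation and the
'collated' db name are all made up for illustration - but it shows the shape
of it: read the new docs off the reporting db's _changes feed, roll them up,
and bulk-write the result.

    import requests
    from collections import defaultdict

    COUCH = "http://localhost:5984"

    def collate(since_seq):
        # Everything that landed in 'reporting' since the last run.
        changes = requests.get(
            COUCH + "/reporting/_changes",
            params={"since": since_seq, "include_docs": "true"},
        ).json()

        # Roll the raw events up into one bucket per metric per day.
        buckets = defaultdict(list)
        for row in changes["results"]:
            doc = row.get("doc")
            if not doc or doc.get("_deleted"):
                continue
            day = doc["timestamp"][:10]          # 'YYYY-MM-DD'
            buckets[(doc["metric"], day)].append(doc["value"])

        collated = [
            {"_id": "%s/%s" % (metric, day),
             "metric": metric,
             "day": day,
             "count": len(values),
             "sum": sum(values)}
            for (metric, day), values in buckets.items()
        ]

        # Ship the rolled-up docs into the (highly sharded) bigcouch db
        # in one request.
        requests.post(COUCH + "/collated/_bulk_docs",
                      json={"docs": collated})

        return changes["last_seq"]

Rerunning a different collation is just this same pass with a different
bucketing key.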
Hope this helps. If you need more info, just ping me...
Cheers
Mahesh Paolini-Subramanya
That Tall Bald Indian Guy...
Google+ | Blog | Twitter
On Jan 11, 2012, at 4:13 AM, Martin Hewitt wrote:
> Hi all,
>
> I'm currently scoping a project which will measure a variety of indicators
> over a long period, and I'm trying to work out where to strike the balance of
> document number vs document size.
>
> I could have one document per metric, leading to a small number of documents,
> but with each document containing ticks for every 5-second interval of any
> given day, these documents would quickly become huge.
>
> Clearly, I could decompose these huge per-metric documents down into smaller
> documents, and I'm in the fortunate position that, because I'm dealing with
> time, I can decompose by year, month, day, hour, minute or even second.
>
> Going all the way to second-level would clearly create a huge number of
> documents, but all of very small size, so that's the other extreme.
>
> I'm aware the usual response to this is "somewhere in the middle", which is
> my working hypothesis (decomposing to a "day" level), but I was wondering if
> there was a) anything in CouchDB's architecture that would make one side of
> the "middle" more suited, or b) if someone has experience architecting
> something like this.
>
> Any help gratefully appreciated.
>
> Martin