With a (somewhat. kinda. sorta. maybe.) similar requirement, I ended up 
doing it as follows:
        (1) created a 'daily' database that data got dumped into in very small 
increments - approximately 5 docs/second
        (2) uni-directionally replicated the documents out of this database 
into a 'reporting' database that I could suck data out of (a rough sketch of 
this replication setup follows this list)
        (3) sucked data out of the reporting database at 15-minute intervals, 
processed it somewhat, and dumped all of *those* into one single (highly 
sharded) bigcouch db
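
For the curious, here's a rough sketch (mine, not production code) of what the 
one-way replication in (2) could look like, in Python over plain HTTP against 
CouchDB's _replicator database. The server URL and credentials are made up; 
the db names are the ones above.

import requests

COUCH = "http://localhost:5984"          # assumed CouchDB/BigCouch endpoint
AUTH = ("admin", "secret")               # assumed admin credentials

def start_daily_to_reporting_replication():
    """Create a continuous, one-way replication doc: daily -> reporting."""
    repl_doc = {
        "source": COUCH + "/daily",
        "target": COUCH + "/reporting",
        "continuous": True,              # keep pushing new docs as they arrive
    }
    resp = requests.put(COUCH + "/_replicator/daily_to_reporting",
                        json=repl_doc, auth=AUTH)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(start_daily_to_reporting_replication())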
        
The advantages here were:
        - My data was captured in the format best suited for the 
data-generating events (minimum processing of the event data), thanx to (1)
        - The processing of this data did not impact the writing of the data, 
thanx to (2), allowing for maximum write throughput
        - I could compact and archive the 'daily' database every day, thus 
significantly reducing disk space, thanx to (1). Also, we only retain the 
'daily' data for 3 months, since anything beyond that is stale (for our 
purposes; YMMV)
        - The collated data that ends up in bigcouch per (3) is much *much* 
smaller. And if we end up needing a different collation (and yes, that happens 
every now and then), I can just rerun the reporting process (going back up to 
the last 3 months, of course). In fact, I can have multiple collations running 
in parallel... (a rough sketch of the 15-minute collation step follows)
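
And here's an equally rough sketch of the 15-minute collation step in (3), 
plus the daily compaction: pull whatever is new out of 'reporting' via 
_changes, collate it, and bulk-write the (much smaller) results into the 
sharded bigcouch db. The collation function, checkpoint file, and the bigcouch 
URL are all placeholders - yours will look different.

import requests

COUCH = "http://localhost:5984"          # assumed CouchDB endpoint
BIGCOUCH = "http://bigcouch:5984"        # assumed BigCouch endpoint
AUTH = ("admin", "secret")
CHECKPOINT = "reporting.since"           # remembers the last _changes seq seen

def load_since():
    try:
        with open(CHECKPOINT) as f:
            return f.read().strip()
    except IOError:
        return "0"

def save_since(seq):
    with open(CHECKPOINT, "w") as f:
        f.write(str(seq))

def collate(docs):
    # Placeholder collation: bucket doc counts per 15-minute window, assuming
    # each doc carries a 'ts' epoch-seconds field. Rerunning with the same ids
    # would need _rev handling, which is omitted here.
    buckets = {}
    for doc in docs:
        window = int(doc.get("ts", 0)) // 900 * 900
        buckets[window] = buckets.get(window, 0) + 1
    return [{"_id": "count:%d" % w, "window": w, "count": c}
            for w, c in sorted(buckets.items())]

def run_once():
    since = load_since()
    # Grab everything written to 'reporting' since the last run.
    resp = requests.get(COUCH + "/reporting/_changes",
                        params={"since": since, "include_docs": "true"},
                        auth=AUTH)
    resp.raise_for_status()
    changes = resp.json()
    docs = [row["doc"] for row in changes["results"] if "doc" in row]
    if docs:
        # Bulk-write the collated docs into the (highly sharded) bigcouch db.
        requests.post(BIGCOUCH + "/collated/_bulk_docs",
                      json={"docs": collate(docs)},
                      auth=AUTH).raise_for_status()
    save_since(changes["last_seq"])

def compact_daily():
    # The once-a-day compaction of the 'daily' db mentioned above; archival of
    # data older than 3 months is not shown.
    requests.post(COUCH + "/daily/_compact",
                  headers={"Content-Type": "application/json"},
                  auth=AUTH).raise_for_status()

if __name__ == "__main__":
    run_once()       # in practice, cron'd every 15 minutes
    # compact_daily()  # and this, once a day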

Hope this helps. If you need more info, just ping me...

Cheers

Mahesh Paolini-Subramanya
That Tall Bald Indian Guy...
Google+  | Blog   | Twitter

On Jan 11, 2012, at 4:13 AM, Martin Hewitt wrote:

> Hi all,
> 
> I'm currently scoping a project which will measure a variety of indicators 
> over a long period, and I'm trying to work out where to strike the balance of 
> document number vs document size.
> 
> I could have one document per metric, leading to a small number of documents, 
> but with each document containing ticks for every 5-second interval of any 
> given day, these documents would quickly become huge. 
> 
> Clearly, I could decompose these huge per-metric documents down into smaller 
> documents, and I'm in the fortunate position that, because I'm dealing with 
> time, I can decompose by year, month, day, hour, minute or even second.
> 
> Going all the way to second-level would clearly create a huge number of 
> documents, but all of very small size, so that's the other extreme.
> 
> I'm aware the usual response to this is "somewhere in the middle", which is 
> my working hypothesis (decomposing to a "day" level), but I was wondering if 
> there was a) anything in CouchDB's architecture that would make one side of 
> the "middle" more suited, or b) if someone has experience architecting 
> something like this.
> 
> Any help gratefully appreciated.
> 
> Martin
