FW: Am I doing something fundamentally wrong?

Mike Kimber Wed, 23 May 2012 23:33:12 -0700

Didn't seem to get there first time so having another go

Mike

From: Mike Kimber
Sent: 23 May 2012 12:08
To: [email protected]
Subject: Am I doing something fundamentally wrong??!!

I have been working with CouchDB for a short while now (I'm traditionally a DBA 
and inherited this CouchDB project, and yes, I know it's not SQL!!!).
We use CouchDB to store Maven build statistics. Every time a build is run a 
statistics report is generated and uploaded to CouchDB. Our builds are big and 
we are aiming to bring them down in size, hence the collection of statistics 
for analysis to identify areas to focus on, demonstrate improvement and confirm 
that developers are adopting new practices as we roll them out. Now I've 
enjoyed working with Couch: JavaScript is powerful, replication is magic, 
schema-less datastore, RESTful API, incremental map reduce etc. However, I am 
increasingly thinking CouchDB does not fit our use case and I've been asking 
myself the following set of questions:

 *   Are we doing something wrong?
 *   Is CouchDB the correct data store for our use case?
 *   Is this really big data? It seems relatively small to me.
 *   Are our documents bigger and more complex than the average CouchDB use 
case?
 *   Would BigCouch make a difference?
 *   Are people really prepared to continue to throw hardware at a problem like 
this? Is that cheaper than developer time or software licenses?

A few statistics etc (last 6 months) that put our CouchDB implementation into 
perspective:

 *   Number of Documents: 96,848
 *   Total Size of Documents: 52GB (627 docs over 10MB, largest 
16MB) (compressed, it's 8.5GB)
 *   Average Size of Documents: 0.5MB
 *   Total Number of Array Elements in all docs: 256 Million
 *   Number of Array Element Types: 37 (i.e. each has a different structure 
which we have to handle)
 *   Example Document Structure (cut down, as Gist could not cope!): 
https://gist.github.com/2774454
 *   Views (no reduce just maps): https://gist.github.com/2774491 and  
https://gist.github.com/2774485
 *   Analytics Server: 4 CPUs and 8GB of RAM running on VMware farm

So what's the issue that's making me question our choice of CouchDB? Well, a 
single NVP (name/value pair) and null Key map with no reduce view build takes 
6 hours to process and burns a full CPU for all that time, i.e. it does not 
seem to be IO bound or short of memory (it only seems able to use a single 
CPU/core, which is odd, Erlang and all). The "Build Profile Detail" map 
referenced above takes up to 15 hours to build. Now once I know what I want 
that's not necessarily a major issue, but it is when I need to 
discover/explore the data that I need to analyse. The feedback loop for ad-hoc 
analysis is not practical. Now I know we live in the world of the clever 
compromise/workaround, so people will say "use a smaller subset". I have; it's 
19 documents and they are not representative, so I create a map, think I have 
what I need, apply it to the main data set, wait 16 hours and then find that 
I've missed something. Also, if I want to change the order by (key) or the 
type of grouping (reduce), I have to change the view and wait 16 hours again.
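
For a rough sense of scale, the build times above can be turned into 
throughput figures. The sketch below is only back-of-envelope arithmetic on 
the numbers quoted in this post. It assumes the view build reads every 
document once, that 1GB = 1024MB, and that "256 Million" is a round figure; 
the function name `rates` is purely illustrative.

```python
# Figures quoted in the post (last 6 months of build statistics).
DOCS = 96_848
TOTAL_MB = 52 * 1024          # 52GB, assuming 1GB = 1024MB
ARRAY_ELEMENTS = 256_000_000  # "256 Million", taken as a round number


def rates(build_hours):
    """Return (docs/s, MB/s, array elements/s) for a view build of this length.

    Assumes the build touches every document and array element exactly once.
    """
    seconds = build_hours * 3600
    return DOCS / seconds, TOTAL_MB / seconds, ARRAY_ELEMENTS / seconds


# 6 hours is the simple NVP map, 15 hours the "Build Profile Detail" map.
for hours in (6, 15):
    docs_s, mb_s, elems_s = rates(hours)
    print(f"{hours:>2}h build: {docs_s:.1f} docs/s, "
          f"{mb_s:.2f} MB/s, {elems_s:,.0f} elements/s")
```

On these assumptions the 6 hour build works out to roughly 4.5 documents, or 
about 2.5MB of JSON, per second on the one busy core.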

To shorten the feedback loop I've hooked up LucidDB using its CouchDB 
connector and loaded the data into it. This gives me a significantly shorter 
feedback loop, i.e. 51 seconds to change a grouping (reduce) on 256 million 
rows rather than 16 hours to rebuild a view for instant access.

However, this also highlighted how much disk space CouchDB takes. The two 
views take up 480MB and 5.6GB respectively, but when I load them into LucidDB 
(column oriented) the same data (minus the name part of the pair) takes up 
655MB (with indexes added). What's in a Couch view (we have CouchDB 1.2, so 
they should be compressed; the data can't be that big)? Which leads me back to 
my set of questions above.
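
To put the disk usage side by side, here is a small sketch of the arithmetic, 
using only the sizes quoted above and assuming 1GB = 1024MB. It is not a 
like-for-like comparison, since the LucidDB copy drops the name part of each 
pair; the variable names are purely illustrative.

```python
# Sizes quoted in the post; 5.6GB converted assuming 1GB = 1024MB.
SMALL_VIEW_MB = 480
LARGE_VIEW_MB = 5.6 * 1024
LUCIDDB_MB = 655  # same data in LucidDB, indexes included, names dropped

couch_total_mb = SMALL_VIEW_MB + LARGE_VIEW_MB
ratio = couch_total_mb / LUCIDDB_MB

print(f"CouchDB views: {couch_total_mb:.0f}MB, "
      f"LucidDB: {LUCIDDB_MB}MB, ratio: {ratio:.1f}x")
```

On these figures the two views together occupy somewhere between 9 and 10 
times the space of the LucidDB copy.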

This isn't aiming to be a CouchDB-bashing post; in fact I'll be continuing to 
use it. I'm just looking to see if I'm doing something fundamentally wrong, or 
have just picked the wrong horse for our course, or just need to throw some 
hardware at it etc. CouchDB/LucidDB is a pretty decent combo, so if I could 
bring down the view build time in Couch then I'd be happy, but on the flip 
side it seems to be a bit of an anti-pattern if I have to throw a load of 
hardware at it.

Thanks
Mike
