Didn't seem to get there the first time, so having another go.

Mike

From: Mike Kimber
Sent: 23 May 2012 12:08
To: [email protected]
Subject: Am I doing something fundamentally wrong??!!

I have been working with CouchDB for a short while now (I'm traditionally a DBA and inherited this CouchDB project, and yes, I know it's not SQL!!!). We use CouchDB to store Maven build statistics. Every time a build is run, a statistics report is generated and uploaded to CouchDB. Our builds are big and we are aiming to bring them down in size, hence the collection of statistics for analysis: to identify areas to focus on, demonstrate improvement, and confirm that developers are adopting new practices as we roll them out.

Now, I've enjoyed working with Couch: JavaScript is powerful, replication is magic, and there's the schema-less datastore, the RESTful API, incremental map/reduce, etc. However, I am increasingly thinking CouchDB does not fit our use case, and I've been asking myself the following set of questions:

* Are we doing something wrong?
* Is CouchDB the correct data store for our use case?
* Is this really big data? It seems relatively small to me.
* Are our documents bigger and more complex than the average CouchDB use case?
* Would BigCouch make a difference?
* Are people really prepared to continue to throw hardware at a problem like this? Is that cheaper than developer time or software licenses?

A few statistics (last 6 months) that put our CouchDB implementation into perspective:

* Number of documents: 96,848
* Total size of documents: 52GB (627 docs over 10MB, largest 16MB); compressed, it's 8.5GB
* Average size of documents: 0.5MB
* Total number of array elements in all docs: 256 million
* Number of array element types: 37 (i.e. each has a different structure which we have to handle)
* Example document structure (cut down, as Gist could not cope!): https://gist.github.com/2774454
* Views (no reduce, just maps): https://gist.github.com/2774491 and https://gist.github.com/2774485
* Analytics server: 4 CPUs and 8GB of RAM, running on a VMware farm

So what's the issue that's making me question our choice of CouchDB? Well, building a single view whose map emits a null key and one name-value pair (NVP), with no reduce, takes 6 hours and burns a full CPU for all that time, i.e. it does not seem to be IO bound or short of memory (it only seems to be able to use a single CPU/core, which is odd, Erlang and all). The "Build Profile Detail" map referenced above takes up to 15 hours to build.

Now, once I know what I want, that's not necessarily a major issue, but it is when I need to discover/explore the data that I need to analyse. The feedback loop for ad-hoc analysis is not practical. I know we live in the world of the clever compromise/workaround, so people will say use a smaller subset. I have one; it's 19 documents and they are not representative. So I create a map, think I have what I need, apply it to the main data set, wait 16 hours, and then find that I've missed something. Also, if I want to change the ordering (key) or the type of grouping (reduce), I have to change the view and wait 16 hours again.

To reduce the feedback loop I've hooked up LucidDB using its CouchDB connector and loaded the data into it. This gives me a significantly shorter feedback loop, i.e. 51 seconds to change a grouping (reduce) on 256 million rows, rather than 16 hours to rebuild a view for instant access.

However, this also highlighted how much disk space CouchDB takes. The two views take up 480MB and 5.6GB respectively, but when I load them into LucidDB (column-oriented) the same data (minus the name part of the pair) takes up 655MB (with indexes added). What's in a Couch view? (We have CouchDB 1.2, so views should be compressed; the data can't be that big.)

Which leads me back to my set of questions above.

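One way to shorten the explore-and-rebuild loop described above is to dry-run a map function outside CouchDB against a sample of exported documents before committing to a full view build. The following is a minimal sketch under stated assumptions, not the views from the gists: the field names (`stats`, `type`, `name`, `value`), the helper names (`runMap`, `countTypes`) and the sample documents are all hypothetical. It records every `emit` call and counts the array element types it sees, so unhandled types show up in seconds rather than after a rebuild.

```javascript
// Dry-run a CouchDB-style map function in plain Node.js.
// ASSUMPTION: the document shape (stats[], type, name, value) is hypothetical.

// Run mapFn over docs and collect one row per emit(key, value) call.
function runMap(mapFn, docs) {
  const rows = [];
  // CouchDB supplies emit() to map functions; this stand-in just records calls.
  globalThis.emit = (key, value) => rows.push({ key: key, value: value });
  try {
    docs.forEach((doc) => mapFn(doc));
  } finally {
    delete globalThis.emit;
  }
  return rows;
}

// Count how often each array element type occurs across the sample,
// to spot types the map function does not handle.
function countTypes(docs) {
  const counts = {};
  docs.forEach((doc) => {
    (doc.stats || []).forEach((entry) => {
      counts[entry.type] = (counts[entry.type] || 0) + 1;
    });
  });
  return counts;
}

// Hypothetical map: null key, one name-value pair per "plugin" element.
// Kept to older function syntax, unlike the harness above.
function map(doc) {
  if (!Array.isArray(doc.stats)) return;
  doc.stats.forEach(function (entry) {
    if (entry.type === "plugin") {
      emit(null, { name: entry.name, value: entry.value });
    }
  });
}

// Hypothetical sample documents standing in for exported build reports.
const sampleDocs = [
  { _id: "build-1", stats: [
    { type: "plugin", name: "compile", value: 1200 },
    { type: "test", name: "unit", value: 300 }
  ] },
  { _id: "build-2", stats: [{ type: "plugin", name: "package", value: 800 }] },
  { _id: "build-3" }
];

const rows = runMap(map, sampleDocs);
console.log(rows.length, "rows emitted");
console.log(countTypes(sampleDocs));
```

A dry run like this only checks the logic of the map function; it says nothing about how long the real view build will take, and it is only as good as the sample is representative.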
This isn't aiming to be a CouchDB-bashing post; in fact, I'll be continuing to use it. I'm just looking to see if I'm doing something fundamentally wrong, have just picked the wrong horse for our course, or just need to throw some hardware at it. CouchDB/LucidDB is a pretty decent combo, so if I could bring down the view build time in Couch then I'd be happy, but on the flip side it seems to be a bit of an anti-pattern if I have to throw a load of hardware at it.

Thanks

Mike
