Hi, I'm looking into CouchDB for a data mining application. I'm a noob, so I'm just getting an appreciation for the new (and very creative) approach taken with Couch. Please let me first verify that I have a few things straight:
A view is a lot more like an index than a query in SQL terms. The keys emitted from the mapper are used to construct a b-tree. Aggregate values computed in the reducer may be hung on the higher nodes of the tree. Constructing this tree is an expensive operation, but read access is fast and the tree can be updated incrementally as the underlying data changes. (Baron Schwartz's A Gentle Introduction to CouchDB for Relational Practitioners explains this nicely.)
A view is formulated using the map-reduce (MR) pattern, which essentially divides a big job into lots of small independent subtasks. In Hadoop and Google's MR, that independence is used for parallelism in a distributed environment. Couch's use of MR is very different. I'm not sure how parallelism comes into play in Couch, but it seems to me that a key feature of Couch is that the independence of MR is exploited to compute and cache partial results in the b-tree and to update them incrementally. The target here is the "shit-loads of users" scenario, where the cost of building and maintaining the view can be amortized over lots of read operations.
Now, if that's all more-or-less right, how does that apply to data mining? In a data mining app, you typically have lots of ad-hoc queries. You'll read that Couch doesn't do ad-hoc queries, but I have a feeling that, if you're smart about it, you can create views that will serve as the basis for whole classes of queries. The view will do part of the work and your client code will have to do part as well. I haven't quite gotten my head around how this is done, nor around how Couch's list functions might fit into the picture.
It would be great to have an example data mining app for Couch. The classic textbook example is co-occurrence of items in a large database of grocery store shopping baskets. You ask questions like, "If a customer buys diapers, do they also buy beer?" It will come as little surprise to any new parents that, in fact, they do.
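To make the basket question concrete, here is a rough Python sketch of how I imagine the split between view and client working. It is not CouchDB code: the basket documents, the "items" field, and the names map_basket, build_view, and confidence are all made up for illustration, and the dictionary only stands in for what a real view with a summing reduce (queried with grouping) would return.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical basket documents; the "items" field name is an assumption.
baskets = [
    {"_id": "b1", "items": ["diapers", "beer", "milk"]},
    {"_id": "b2", "items": ["diapers", "beer"]},
    {"_id": "b3", "items": ["milk", "bread"]},
    {"_id": "b4", "items": ["diapers", "wipes"]},
]

def map_basket(doc):
    """Play the role of a map function: emit (key, 1) per item and per item pair."""
    items = sorted(set(doc["items"]))
    for item in items:
        yield (item,), 1
    # Pairs are emitted in sorted order so (beer, diapers) has a single key.
    for a, b in combinations(items, 2):
        yield (a, b), 1

def build_view(docs):
    """Stand-in for the view index: sum the emitted values for each key."""
    counts = defaultdict(int)
    for doc in docs:
        for key, value in map_basket(doc):
            counts[key] += value
    return dict(counts)

def confidence(view, antecedent, consequent):
    """Client-side step: estimate P(consequent | antecedent) from two lookups."""
    pair = tuple(sorted((antecedent, consequent)))
    base = view.get((antecedent,), 0)
    return view.get(pair, 0) / base if base else 0.0

view = build_view(baskets)
print(view[("beer", "diapers")])            # baskets containing both items
print(confidence(view, "diapers", "beer"))  # share of diaper baskets with beer
```

The precomputed counts answer a whole class of "does X go with Y" questions with two key lookups each, and only the final division happens in client code. Emitting every pair grows quadratically with basket size, so whether this is practical on real data is an open question for me.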
In this case, your documents would consist of a set of purchased items and associated information like customer demographics, geographic information, sales and promotions, etc., which are usually modeled in terms of a star schema in an RDBMS. The task is then to ask the same basic questions about what people buy, sliced and diced by or conditioned on the associated data, like, "Do males in the Pacific Northwest buy diapers and beer when beer is on sale?" Is something like that an appropriate use case for Couch?
It would be awesome to have some guidance from the gurus on applications like this, which are very different from either transaction processing or the highly-available, eventual-consistency use cases often associated with NoSQL.
Thanks!
-- Chris
