Hi Nick, couch-incarnate author here.
You described the way it works pretty well. By the way, I suspect Cloudant also do their chaining incrementally.

I started working on incarnate following my own frustration with the lack of chained map/reduce. I'm a bit biased here, but I find the end solution pretty relaxed and generally with the grain of Couch: the service is HTTP-based, eventually consistent, and persistent (its state is kept only in CouchDB).

Unfortunately, my work took me away from Couch, and aside from a single issue [1] there was little interest in the project from others. It'd be nice to hear whether dbcopy or any other chaining solution is finally going to become part of Couch. If not, and if you or anyone else wish to see some effort put into couch-incarnate (see the todos and caveats in the README), feel free to contact me.

Cheers,
Alon

[1] https://github.com/afters/couch-incarnate/issues/1

On 12 December 2012 19:50, nicholas a. evans <[email protected]> wrote:
> I've got some views that simply must use iterative map/reduce. The greatest need is simply to sort based on the value of the first reduction. I'm looking over my options, and I'm going to list them here. I'm looking for someone to tell me that I've missed an option, or to recommend a modification to one of the approaches below.
>
> 0) Just do the second map/reduce in client code...
> 1) Use Cloudant's chained map/reduce feature
> 2) Use couch-incarnate
> 3) Skip CouchDB and learn Hadoop or Storm...
> 4) Learn me some Erlang for great patches...
> 5) Roll my own by copying reduced view results into another DB...
>
> (tl;dr: options 0-4 don't work for me; I'll detail several approaches to option 5, but none of them let me relax.)
>
> Option 0 is what we're doing now. For our small users, it's great; for medium users, it's okay. But for some of our more complicated views on our larger DBs, "GET /user_123/_design/foo/_view/bar?reduce=true&group=true" downloads several MB of data and takes 2 minutes or more, which just doesn't work. We can play some neat tricks with Etags and caching strategies that mimic "stale=update_after" (i.e. immediately serve the latest result from the cache and fire off a background process to update the cache), and we can then also cache the results of the second mapping. But this is annoyingly elaborate and shares the same latency downside as the naive approach (mentioned later).
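> For illustration, here's a minimal sketch of that caching trick (hypothetical names; Etag handling and errors omitted):
>
>     var http = require('http');
>     var cache = {};  // viewUrl -> last known parsed JSON body
>
>     function getView(viewUrl, callback) {
>       var hit = cache[viewUrl];
>       if (hit) callback(hit);              // serve the stale result instantly
>       http.get(viewUrl, function (res) {   // refresh the cache in the background
>         var body = '';
>         res.on('data', function (chunk) { body += chunk; });
>         res.on('end', function () {
>           cache[viewUrl] = JSON.parse(body);
>           if (!hit) callback(cache[viewUrl]);  // cold cache: wait for fresh data
>         });
>       });
>     }
>
> (A real version would also send If-None-Match with the cached Etag so an unchanged view isn't re-downloaded.)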
> As a tangential point, I do wonder why it takes 2 minutes to download several MB of data, even when run via curl to localhost. It's not bandwidth-limited, so why does it take so long to construct and send the JSON?
>
> For reasons I won't bore you with, Cloudant and couch-incarnate are not (currently) options. But both approaches seem basically reasonable... Anyway, it's a real shame that Apache CouchDB can't do this out of the box.
>
> My one boss is of the opinion that an RDBMS can handle this and everything else, so why are we wasting time with this NoSQL nonsense? (Ignoring that CouchDB made most of the other bits easier and faster, excepting this one glaring issue.) And my other boss is of the opinion that we should switch to using Storm (which I don't have any experience with, so I can't properly assess what it would make easier or harder). Rewriting our current code base to use PostgreSQL, Hadoop, or Storm sounds like a lot of work to me, when all I want from CouchDB is efficient sorting based on the map/reduce values.
>
> It seems to me that it should be relatively simple to modify CouchDB to chain views within a single design document. You'd need to:
> 1) Choose a syntax to say that one view depends on another, e.g. "sorted_by_foo": {"depends": "bar_view", "map": "function (doc) { ... }"}, or maybe even a simpler "sorted_by_foo": {"depends": "bar_view", "sort_by": "function (k, v) { return [v.foo, k]; }"}.
> 2) Construct a DAG to run the views in the correct order (or detect cycles and throw an error).
> 3) Collect view changes in order to run the chained views incrementally.
> 4) Feed the changes from one view into the map (or sort) function of the dependent view(s).
>
> It *seems* simple to me... but I know *nothing* about the internals of CouchDB or Erlang, so I'm probably oversimplifying or missing some steps. :) Of course, if I had a view changes feed, I could put it to work for other efficient chaining mechanisms, too.
>
> Thinking of more elaborate changes to Couch: since sorting by value is probably the most common chained map/reduce use case, it would be really nice to simply have multiple sorts on a single map/reduce function, with no need to duplicate view data or go to the extra work of creating chained views. E.g. "foo_view": {"map": "...", "reduce": "...", "sort_by": {"bar": "function (k, v) { ... }", "baz": "function (k, v) { ... }"}}, which could then be queried by an API like "/db/_design/blah/_view/foo/_sort_by/bar?reduce=true&group=true". From a user's point of view, this would be most relaxing... but I imagine it would require far more changes to Couch internals and the view file structure than my previous proposal.
>
> So it looks like I'm stuck with rolling my own with the tools that CouchDB 1.2 provides. Here is what I've come up with (none of which leaves me feeling particularly relaxed):
>
> Simple (naive) approach:
> SourceDB => ChainedDB
>
> 1) GET all keys from the reduced/grouped view in SourceDB.
> 2) GET all docs in ChainedDB.
> 3) Bulk update all docs in ChainedDB.
> 4) Place secondary views on ChainedDB.
>
> This feels conceptually similar to what Cloudant is doing (with "dbcopy").
>
> Downside: high latency. It's okay for the *initial* load to take a while, but 2 minutes or more is *way* too long (and wastes far too much CPU/IO) when waiting for minor changes in SourceDB to trickle through to ChainedDB. Most of the time (for my use case), a single row change in SourceDB will result in only a single row change in ChainedDB, so this can and should be done more intelligently.
>
> Also, steps 2+3 are there to delete missing rows and to update only the rows that have changed, with the correct _rev. It might be sped up a little bit by creating a list function that returns bulk_docs format, posting bulk_docs with all_or_nothing, and running a separate delete + conflict-cleanup phase. But that doesn't address the core issue of the slow get/update-*everything* approach.
>
> Is there a faster way of doing this approach that preserves its simplicity?
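> As a concrete (hypothetical) sketch of steps 1-3, assuming databases named "source" and "chained" on localhost:5984:
>
>     var http = require('http');
>
>     // Tiny JSON-over-HTTP helper for CouchDB requests.
>     function req(method, path, body, cb) {
>       var r = http.request({ host: 'localhost', port: 5984, path: path,
>                              method: method,
>                              headers: { 'Content-Type': 'application/json' } },
>         function (res) {
>           var data = '';
>           res.on('data', function (c) { data += c; });
>           res.on('end', function () { cb(JSON.parse(data)); });
>         });
>       if (body) r.write(JSON.stringify(body));
>       r.end();
>     }
>
>     // 1) all rows from the reduced/grouped view in SourceDB
>     req('GET', '/source/_design/foo/_view/bar?reduce=true&group=true', null,
>       function (view) {
>         // 2) all docs currently in ChainedDB (to learn _revs and find deletions)
>         req('GET', '/chained/_all_docs?include_docs=true', null, function (all) {
>           var old = {};
>           all.rows.forEach(function (r) { old[r.id] = r.doc; });
>
>           // 3) one bulk update: upsert every view row, then delete whatever is
>           //    left over (an obvious refinement: skip rows whose value is unchanged)
>           var docs = view.rows.map(function (r) {
>             var doc = { _id: JSON.stringify(r.key), key: r.key, value: r.value };
>             if (old[doc._id]) { doc._rev = old[doc._id]._rev; delete old[doc._id]; }
>             return doc;
>           });
>           Object.keys(old).forEach(function (id) {
>             docs.push({ _id: id, _rev: old[id]._rev, _deleted: true });
>           });
>           req('POST', '/chained/_bulk_docs', { docs: docs }, function () {});
>         });
>       });
>
> Even this sketch highlights the problem: every update pays for a full read of both the view and ChainedDB, which is exactly the 2-minute cost I'm trying to avoid.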
> Incremental changes approach (the couch-incarnate way, as I understand it):
> SourceDB => ChainedMapDB => ChainedReduceDB
>
> 1) GET changes to SourceDB from the changes feed.
> 2) In my own process (not CouchDB), run the map function on the changed docs. Include the SourceDB _id as metadata in the new rows.
> 3) Use a special view in ChainedMapDB to find the rows affected by the changed source docs.
> 4) Bulk create/update/delete the affected rows in ChainedMapDB.
> 5) GET changes to ChainedMapDB from the changes feed.
> 6) It will be obvious which rows need to change in ChainedReduceDB.
> 7) Place the reduce function onto ChainedMapDB (with a trivial identity map function).
> 8) GET the changed rows from the reduced/grouped view in ChainedMapDB.
> 9) Bulk create/update/delete the changed rows in ChainedReduceDB.
> 10) Place secondary views on ChainedReduceDB.
>
> Major downsides:
> * Complicated code to manage syncing the source => map => reduce DBs.
> * Managing the map function external to Couch.
> * Three databases for one map/reduce/map chain.
> * This is *not* relaxing. It does make me want to learn Hadoop or Storm.
>
> Is there a way to simplify this?
>
> Modified incremental approach:
> SourceDB => ChainedDB
>
> Like the couch-incarnate approach, but the map/reduce function lives in SourceDB, modified to include metadata (a concrete sketch follows below):
>
> 0) The original map function is modified so that 'emit(key, value);' transforms into 'emit(["data", key], value); emit(["metadata", doc._id], key);', and the original reduce function is modified to skip metadata rows and use the original key for data rows.
> 1) GET changes to SourceDB.
> 2) Query the view using ["metadata", changed.id] keys; this tells you which rows in ChainedDB need to be updated and which rows to grab from the view.
> 3) Query the reduced/grouped view using the ["data", ...] keys from the previous step to get the new values.
> 4) Bulk create/update/delete ChainedDB.
> 5) Place secondary views on ChainedDB.
>
> Downsides:
> * Need to modify the original map function (although it's a relatively trivial substitution).
> * Need to modify the original reduce function (although a relatively trivial insertion).
> * As far as I can tell, no one has used this approach yet. Untested.
>
> If no one has any better ideas, this is what I'll wind up trying.
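> To make step 0 concrete, here's a hypothetical before/after for a map/reduce that sums doc.amount grouped by doc.group (names invented):
>
>     // Original:
>     //   map:    function (doc) { emit(doc.group, doc.amount); }
>     //   reduce: function (keys, values) { return sum(values); }
>
>     // Modified map: namespace the data rows, and add one
>     // back-pointer row per source doc.
>     function (doc) {
>       emit(["data", doc.group], doc.amount);
>       emit(["metadata", doc._id], doc.group);
>     }
>
>     // Modified reduce: skip metadata rows. Each keys[i] is a
>     // [key, docid] pair, so keys[i][0][0] is "data" or "metadata".
>     function (keys, values, rereduce) {
>       if (rereduce) return sum(values);
>       var total = 0;
>       for (var i = 0; i < keys.length; i++) {
>         if (keys[i][0][0] === "data") total += values[i];
>       }
>       return total;
>     }
>
> Step 2 would then be a reduce=false query on the ["metadata", ...] keys, and step 3 a reduce=true&group=true query on the ["data", ...] keys.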
>
> It seems to me that a view changes feed would make this much simpler. You could easily get the simplicity of the naive approach with the speed of the incremental approaches. I thought I read about a patch for view changes floating around, but some quick searching didn't turn it up. Does it apply cleanly to 1.2 (or 1.3)? I'll consider using it... but is there a reason it hasn't already been accepted into master?
>
> Thanks for reading this far, and thanks for any input you might have.
> --
> Nick Evans