Hi Nick, couch-incarnate author here.
You described the way it works pretty well. By the way, I suspect Cloudant also do their chaining incrementally.

I started working on incarnate following my own frustration with the lack of chained map/reduce. I'm a bit biased here, but I find the end solution pretty relaxed and generally with the grain of Couch: the service is HTTP-based, eventually consistent, and persistent (its state is kept only in CouchDB).

Unfortunately, my work took me away from Couch, and aside from a single issue [1] there was little interest in the project from others. It'd be nice to hear whether dbcopy or any other chaining solution is finally going to become part of Couch. If not, and if you or anyone else wish to see some effort put into couch-incarnate (see the todos and caveats in the README), feel free to contact me.

Cheers,
Alon

[1] https://github.com/afters/couch-incarnate/issues/1

On 12 December 2012 19:50, nicholas a. evans <[email protected]> wrote:
> I've got some views that simply must use iterative map/reduce. The greatest need is simply to sort based on the value of the first reduction. I'm looking over my options, and I'm going to list them here. I'm looking for someone to tell me that I've missed an option, or to recommend a modification to one of the approaches below.
>
> 0) Just do the second map/reduce in client code...
> 1) Use Cloudant's chained map/reduce feature
> 2) Use couch-incarnate
> 3) Skip CouchDB and learn Hadoop or Storm...
> 4) Learn me some Erlang for great patches...
> 5) Roll my own by copying reduced view results into another DB...
>
> (tl;dr: options 0-4 don't work for me; I'll detail several approaches to option 5, but none of them let me relax.)
>
> Option 0 is what we're doing now. For our small users, it's great; for medium users, it's okay. But for some of our more complicated views on our larger DBs, "GET /user_123/_design/foo/_view/bar?reduce=true&group=true" downloads several MB of data and takes 2 minutes or more, which just doesn't work. We can play some neat tricks with Etags and caching strategies that mimic "stale=update_after" (i.e. immediately serve the latest result from the cache and fire off a background process to update the cache), and we can then also cache the results of the second mapping. But this is annoyingly elaborate and shares the same latency downside as the naive approach (mentioned later).
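> For illustration, here's a minimal sketch of that caching trick (hypothetical names; Etag handling and errors omitted):
>
>     var http = require('http');
>     var cache = {};  // viewUrl -> last known parsed JSON body
>
>     function getView(viewUrl, callback) {
>       var hit = cache[viewUrl];
>       if (hit) callback(hit);              // serve the stale result instantly
>       http.get(viewUrl, function (res) {   // refresh the cache in the background
>         var body = '';
>         res.on('data', function (chunk) { body += chunk; });
>         res.on('end', function () {
>           cache[viewUrl] = JSON.parse(body);
>           if (!hit) callback(cache[viewUrl]);  // cold cache: wait for fresh data
>         });
>       });
>     }
>
> (A real version would also send If-None-Match with the cached Etag so an unchanged view isn't re-downloaded.)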
> As a tangential point, I do wonder why it takes 2 minutes to download several MB of data, even when run via curl to localhost. It's not bandwidth-limited, so why does it take so long to construct and send the JSON?
>
> For reasons I won't bore you with, Cloudant and couch-incarnate are not (currently) options. But both approaches seem basically reasonable... Anyway, it's a real shame that Apache CouchDB can't do this out of the box.
>
> My one boss is of the opinion that an RDBMS can handle this and everything else, so why are we wasting time with this NoSQL nonsense? (Ignoring that CouchDB made most of the other bits easier and faster, excepting this one glaring issue.) And my other boss is of the opinion that we should switch to using Storm (which I don't have any experience with, so I can't properly assess what it would make easier or harder). Rewriting our current code base to use PostgreSQL, Hadoop, or Storm sounds like a lot of work to me, when all I want from CouchDB is efficient sorting based on the map/reduce values.
>
> It seems to me that it should be relatively simple to modify CouchDB to chain views within a single design document. You'd need to:
> 1) Choose a syntax to say that one view depends on another, e.g. "sorted_by_foo": {"depends": "bar_view", "map": "function (doc) { ... }"}, or maybe even a simpler "sorted_by_foo": {"depends": "bar_view", "sort_by": "function (k, v) { return [v.foo, k]; }"}.
> 2) Construct a DAG to run the views in the correct order (or detect cycles and throw an error).
> 3) Collect view changes in order to run the chained views incrementally.
> 4) Feed the changes from one view into the map (or sort) function of the dependent view(s).
>
> It *seems* simple to me... but I know *nothing* about the internals of CouchDB or Erlang, so I'm probably oversimplifying or missing some steps. :) Of course, if I had a view changes feed, I could put it to work for other efficient chaining mechanisms, too.
>
> Thinking of more elaborate changes to Couch: since sorting by value is probably the most common chained map/reduce use case, it would be really nice to simply have multiple sorts on a single map/reduce function, with no need to duplicate view data or go to the extra work of creating chained views. E.g. "foo_view": {"map": "...", "reduce": "...", "sort_by": {"bar": "function (k, v) { ... }", "baz": "function (k, v) { ... }"}}, which could then be queried by an API like "/db/_design/blah/_view/foo/_sort_by/bar?reduce=true&group=true". From a user's point of view, this would be most relaxing... but I imagine it would require far more changes to Couch internals and the view file structure than my previous proposal.
>
> So it looks like I'm stuck with rolling my own with the tools that CouchDB 1.2 provides. Here is what I've come up with (none of which leaves me feeling particularly relaxed):
>
> Simple (naive) approach:
> SourceDB => ChainedDB
>
> 1) GET all keys from the reduced/grouped view in SourceDB.
> 2) GET all docs in ChainedDB.
> 3) Bulk update all docs in ChainedDB.
> 4) Place secondary views on ChainedDB.
>
> This feels conceptually similar to what Cloudant is doing (with "dbcopy").
>
> Downside: high latency. It's okay for the *initial* load to take a while, but 2 minutes or more is *way* too long (and wastes far too much CPU/IO) when waiting for minor changes in SourceDB to trickle through to ChainedDB. Most of the time (for my use case), a single row change in SourceDB will result in only a single row change in ChainedDB, so this can and should be done more intelligently.
>
> Also, steps 2+3 are there to delete missing rows and to update only the rows that have changed, with the correct _rev. It might be sped up a little bit by creating a list function that returns bulk_docs format, posting bulk_docs with all_or_nothing, and running a separate delete + conflict-cleanup phase. But that doesn't address the core issue of the slow get/update-*everything* approach.
>
> Is there a faster way of doing this approach that preserves its simplicity?
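> As a concrete (hypothetical) sketch of steps 1-3, assuming databases named "source" and "chained" on localhost:5984:
>
>     var http = require('http');
>
>     // Tiny JSON-over-HTTP helper for CouchDB requests.
>     function req(method, path, body, cb) {
>       var r = http.request({ host: 'localhost', port: 5984, path: path,
>                              method: method,
>                              headers: { 'Content-Type': 'application/json' } },
>         function (res) {
>           var data = '';
>           res.on('data', function (c) { data += c; });
>           res.on('end', function () { cb(JSON.parse(data)); });
>         });
>       if (body) r.write(JSON.stringify(body));
>       r.end();
>     }
>
>     // 1) all rows from the reduced/grouped view in SourceDB
>     req('GET', '/source/_design/foo/_view/bar?reduce=true&group=true', null,
>       function (view) {
>         // 2) all docs currently in ChainedDB (to learn _revs and find deletions)
>         req('GET', '/chained/_all_docs?include_docs=true', null, function (all) {
>           var old = {};
>           all.rows.forEach(function (r) { old[r.id] = r.doc; });
>
>           // 3) one bulk update: upsert every view row, then delete whatever is
>           //    left over (an obvious refinement: skip rows whose value is unchanged)
>           var docs = view.rows.map(function (r) {
>             var doc = { _id: JSON.stringify(r.key), key: r.key, value: r.value };
>             if (old[doc._id]) { doc._rev = old[doc._id]._rev; delete old[doc._id]; }
>             return doc;
>           });
>           Object.keys(old).forEach(function (id) {
>             docs.push({ _id: id, _rev: old[id]._rev, _deleted: true });
>           });
>           req('POST', '/chained/_bulk_docs', { docs: docs }, function () {});
>         });
>       });
>
> Even this sketch highlights the problem: every update pays for a full read of both the view and ChainedDB, which is exactly the 2-minute cost I'm trying to avoid.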
> Incremental changes approach (the couch-incarnate way, as I understand it):
> SourceDB => ChainedMapDB => ChainedReduceDB
>
> 1) GET changes to SourceDB from the changes feed.
> 2) In my own process (not CouchDB), run the map function on the changed docs. Include the SourceDB _id as metadata in the new rows.
> 3) Use a special view in ChainedMapDB to find the rows affected by the changed source docs.
> 4) Bulk create/update/delete the affected rows in ChainedMapDB.
> 5) GET changes to ChainedMapDB from the changes feed.
> 6) It will be obvious which rows need to change in ChainedReduceDB.
> 7) Place the reduce function onto ChainedMapDB (with a trivial identity map function).
> 8) GET the changed rows from the reduced/grouped view in ChainedMapDB.
> 9) Bulk create/update/delete the changed rows in ChainedReduceDB.
> 10) Place secondary views on ChainedReduceDB.
>
> Major downsides:
> * Complicated code to manage syncing the source => map => reduce DBs.
> * Managing the map function external to Couch.
> * Three databases for one map/reduce/map chain.
> * This is *not* relaxing. It does make me want to learn Hadoop or Storm.
>
> Is there a way to simplify this?
>
> Modified incremental approach:
> SourceDB => ChainedDB
>
> Like the couch-incarnate approach, but the map/reduce function lives in SourceDB, modified to include metadata (a concrete sketch follows below):
>
> 0) The original map function is modified so that 'emit(key, value);' transforms into 'emit(["data", key], value); emit(["metadata", doc._id], key);', and the original reduce function is modified to skip metadata rows and use the original key for data rows.
> 1) GET changes to SourceDB.
> 2) Query the view using ["metadata", changed.id] keys; this tells you which rows in ChainedDB need to be updated and which rows to grab from the view.
> 3) Query the reduced/grouped view using the ["data", ...] keys from the previous step to get the new values.
> 4) Bulk create/update/delete ChainedDB.
> 5) Place secondary views on ChainedDB.
>
> Downsides:
> * Need to modify the original map function (although it's a relatively trivial substitution).
> * Need to modify the original reduce function (although a relatively trivial insertion).
> * As far as I can tell, no one has used this approach yet. Untested.
>
> If no one has any better ideas, this is what I'll wind up trying.
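> To make step 0 concrete, here's a hypothetical before/after for a map/reduce that sums doc.amount grouped by doc.group (names invented):
>
>     // Original:
>     //   map:    function (doc) { emit(doc.group, doc.amount); }
>     //   reduce: function (keys, values) { return sum(values); }
>
>     // Modified map: namespace the data rows, and add one
>     // back-pointer row per source doc.
>     function (doc) {
>       emit(["data", doc.group], doc.amount);
>       emit(["metadata", doc._id], doc.group);
>     }
>
>     // Modified reduce: skip metadata rows. Each keys[i] is a
>     // [key, docid] pair, so keys[i][0][0] is "data" or "metadata".
>     function (keys, values, rereduce) {
>       if (rereduce) return sum(values);
>       var total = 0;
>       for (var i = 0; i < keys.length; i++) {
>         if (keys[i][0][0] === "data") total += values[i];
>       }
>       return total;
>     }
>
> Step 2 would then be a reduce=false query on the ["metadata", ...] keys, and step 3 a reduce=true&group=true query on the ["data", ...] keys.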
>
> It seems to me that a view changes feed would make this much simpler. You could easily get the simplicity of the naive approach with the speed of the incremental approaches. I thought I read about a patch for view changes floating around, but some quick searching didn't turn it up. Does it apply cleanly to 1.2 (or 1.3)? I'll consider using it... but is there a reason it hasn't already been accepted into master?
>
> Thanks for reading this far, and thanks for any input you might have.
> --
> Nick Evans