I feel your pain, but can't offer much help. I also use your option 5: I use node.js to manually store view output into a separate db, with the doc _ids equal to the keys of the view output, so that I can limit updates to only the things that actually change.
This is one feature I really want in CouchDB... the ability to map/reduce
view output, and/or to treat view output like a regular CouchDB database.
Like a flag or something that says: hey CouchDB, save this view as a
db automatically, but also update it when the source db changes (not
on view read, but on edit).
Regards,
James Marca
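In case it helps, the heart of my node.js updater is a diff like the one below (a simplified sketch; `rowsToDocs` and the field names are my own inventions, and the real code also handles errors and pagination). It builds a _bulk_docs payload that only touches docs whose view value actually changed:

```javascript
// Diff reduced view rows against the docs already in the chained db, so a
// subsequent POST to /chained_db/_bulk_docs only writes what changed.
function rowsToDocs(viewRows, existingDocs) {
  // existingDocs: { _id: { _rev, value } } for every doc currently in the chained db
  const docs = [];
  const seen = new Set();
  for (const row of viewRows) {
    const id = JSON.stringify(row.key); // doc _id == the view key
    seen.add(id);
    const current = existingDocs[id];
    if (!current) {
      docs.push({ _id: id, value: row.value }); // brand new row
    } else if (JSON.stringify(current.value) !== JSON.stringify(row.value)) {
      docs.push({ _id: id, _rev: current._rev, value: row.value }); // changed
    } // unchanged rows are skipped entirely
  }
  // rows that vanished from the view get deleted from the chained db
  for (const [id, doc] of Object.entries(existingDocs)) {
    if (!seen.has(id)) docs.push({ _id: id, _rev: doc._rev, _deleted: true });
  }
  return docs;
}
```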
On Wed, Dec 12, 2012 at 12:50:52PM -0500, nicholas a. evans wrote:
> I've got some views that simply must use iterative map reduce. The
> greatest need is simply to sort based on the value of the first reduction.
> I'm looking over my options, and I'm going to list them here. I'm looking
> for someone to tell me that I've missed an option, or recommend a
> modification to one of the below approaches.
>
> 0) Just do the second map reduce in client code...
> 1) use Cloudant's Chained Map Reduce feature
> 2) use couch-incarnate
> 3) Skip CouchDB and learn Hadoop or Storm...
> 4) learn me some Erlang for great patches...
> 5) roll my own by copying reduced view results into another DB...
>
> (tl;dr: options 0-4 don't work for me; I'll detail several approaches to
> option 5, but none of them let me relax.)
>
>
> Option 0 is what we're doing now. For our small users, it's great; for
> medium users, it's okay. But for some of our more complicated views on our
> larger DBs, "GET /user_123/_design/foo/_view/bar?reduce=true&group=true"
> downloads several MB of data and takes 2 minutes or more, which just
> doesn't work. We can play some neat tricks with Etags and caching
> strategies that mimic "stale=update_after" (i.e. immediately grab the
> latest from the cache and fire off a background process to update the
> cache). We can then also cache the results of the second mapping. But
> this is annoyingly elaborate and shares the same latency downside as the
> naive approach (mentioned later).
>
> As a tangential point, I do wonder why it takes 2 minutes to download
> several MB of data, even when run via curl to localhost. It's not
> bandwidth limited, so why does it take so long to construct and send the
> JSON?
>
> For reasons I won't bore you with, Cloudant or couch-incarnate are not
> (currently) options. But both approaches seem basically reasonable...
> Anyway, it's just a real shame that Apache CouchDB can't do this out of the
> box.
>
> My one boss is of the opinion that a RDBMS can handle this and everything
> else, so why are we wasting time with this NoSQL nonsense? (Ignoring that
> CouchDB made most other bits easier and faster, excepting this one glaring
> issue.) And my other boss is of the opinion that we should switch to
> using Storm (which I don't have any experience with, so I can't properly
> assess what it would make easier or harder). Rewriting our current code
> base to use PostgreSQL, Hadoop, or Storm sounds like a lot of work for me,
> when all I want from CouchDB is efficient sorting based on the map/reduce
> values.
>
>
> It seems to me that it should be relatively simple to modify CouchDB to
> chain views within a single design document. You'd need to:
> 1) choose a syntax to say that one view depends on another, e.g.
> "sorted_by_foo": {"depends": "bar_view", "map": "function(doc) { ... }"},
> or maybe even a simpler "sorted_by_foo": { "depends": "bar_view",
> "sort_by": "function (k,v) { return [v.foo, k]; }"}
> 2) construct a DAG to run the views in the correct order (or detect cycles
> and throw an error).
> 3) collect view changes in order to run the chained views incrementally.
> 4) feed the changes from one view into the map (or sort) function for the
> dependent view(s).
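For what it's worth, here is how I picture a design document under that hypothetical syntax (to be clear: no released CouchDB understands a "depends" or "sort_by" key; this is purely what the proposal above might look like):

```json
{
  "_id": "_design/foo",
  "views": {
    "bar_view": {
      "map": "function (doc) { emit(doc.group, doc.amount); }",
      "reduce": "_sum"
    },
    "sorted_by_foo": {
      "depends": "bar_view",
      "sort_by": "function (k, v) { return [v, k]; }"
    }
  }
}
```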
>
> It *seems* simple to me... but I know *nothing* about the internals of
> CouchDB or Erlang, so I'm probably oversimplifying or missing some steps.
> :) Of course, if I had a view changes feed, I could put that to work for
> other efficient chaining mechanisms, too.
>
> Thinking of more elaborate changes to couch, since sorting by value is
> probably the most common chained map/reduce use case, it would really be
> nice to simply have multiple sorts on a single map/reduce function, with no
> need to duplicate view data or go to the extra work of creating chained
> views. e.g. "foo_view": { "map": "...", "reduce": "...", "sort_by":
> {"bar": "function (k,v) { ... }", "baz": "function (k,v) { ... }"}}; which
> could then be queried by an API like
> "/db/_design/blah/_view/foo/_sort_by/bar?reduce=true&group=true". From a
> user's point of view, this would be most relaxing... but I imagine that
> this would require far more changes to couch internals and the view file
> structure than my previous proposal.
>
>
> So it looks like I'm stuck with rolling my own with the tools that CouchDB
> 1.2 provides. Here is what I've come up with (none of which leaves me
> feeling particularly relaxed):
>
>
> Simple (Naive) approach:
> SourceDB => ChainedDB
>
> 1) GET all keys from reduced/grouped view in SourceDB.
> 2) GET all docs in ChainedDB.
> 3) Bulk update all docs in ChainedDB.
> 4) Place secondary views on ChainedDB.
>
> This feels conceptually similar to what Cloudant is doing (with "dbcopy").
>
> Downside: high latency. It's okay for the *initial* load to take a
> while, but 2 minutes or more is *way* too long (and wasting far too much
> CPU/IO) to wait for minor changes in SourceDB to trickle through to
> ChainedDB. Most of the time (for my use case), a single row change in
> SourceDB will result in only a single row change in ChainedDB, so this can
> and should be done more intelligently.
>
> Also, steps 2+3 exist so that we can delete missing rows and update only
> the rows that have changed, with the correct _rev. It might be sped up a bit by
> creating a list function that returns bulk_docs format, posting bulk_docs
> with all_or_nothing, and a separate delete+conflict cleanup phase. But
> that doesn't address the core issue of the slow get/update *everything*
> approach.
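That list-function trick might look roughly like this (an untested sketch, with a made-up name; in a real design doc the function is stored as a string, and start/send/getRow are the usual CouchDB list-function API):

```javascript
// Sketch of a _list function that reformats view rows as a _bulk_docs body.
// start(), send(), and getRow() are provided by the CouchDB query server.
const asBulkDocs = function (head, req) {
  start({ headers: { "Content-Type": "application/json" } });
  send('{"all_or_nothing": true, "docs": [');
  var row, first = true;
  while ((row = getRow())) {
    if (!first) send(",");
    first = false;
    // doc _id is the JSON-encoded view key, matching the chained-db convention
    send(JSON.stringify({ _id: JSON.stringify(row.key), value: row.value }));
  }
  send("]}");
};
```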
>
> Is there a faster way of doing this approach, that preserves its simplicity?
>
>
> Incremental changes approach (the couch-incarnate way, as I understand it):
> SourceDB => ChainedMapDB => ChainedReduceDB
>
> 1) GET changes to SourceDB from changes feed.
> 2) In my own process (not CouchDB), run the map function on the changed
> docs. Include SourceDB _id as metadata in new rows.
> 3) Use a special view in ChainedMapDB to find rows affected by the
> changed source docs.
> 4) Bulk create/update/delete the affected rows in ChainedMapDB.
> 5) GET changes to ChainedMapDB from changes feed.
> 6) It will be obvious which rows need to change in ChainedReduceDB.
> 7) Place the reduce function onto ChainedMapDB (with a trivial identity
> map function).
> 8) GET changed rows from reduced/grouped view in ChainedMapDB.
> 9) Bulk create/update/delete changed ChainedReduceDB.
> 10) Place secondary views on ChainedReduceDB.
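The "special view" in step 3 can be tiny, assuming each row written to ChainedMapDB carries its SourceDB _id in a (hypothetically named) source_id field:

```javascript
// View on ChainedMapDB: index every mapped row by the SourceDB doc it came
// from, so the ids from the changes feed can be looked up with ?keys=[...].
const bySourceId = function (doc) {
  if (doc.source_id) emit(doc.source_id, null);
};
```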
>
> Major downsides:
> * Complicated code to manage syncing source=>map=>reduce DBs.
> * Managing the map function external to Couch.
> * Three databases for one map/reduce/map chain.
> * This is *not* relaxing. It does make me want to learn Hadoop or
> Storm.
>
> Is there a way to simplify this?
>
>
> Modified incremental approach:
> SourceDB => ChainedDB
>
> Like the couch-incarnate approach, but map/reduce function lives in
> SourceDB, modified to include metadata.
>
> 0) Original map function is modified so that 'emit(key, value);'
> transforms into 'emit(["data", key], value); emit(["metadata", doc._id],
> key);' and original reduce function is modified to skip metadata rows and
> use the original key for data rows.
> 1) GET changes to SourceDB.
> 2) query view using ["metadata", changed.id] keys; which tells you which
> rows in the ChainedDB need to be updated and which rows to grab from this
> view.
> 3) query reduced/grouped view using ["data", ...] keys from previous step
> to get new values
> 4) Bulk create/update/delete ChainedDB.
> 5) Place secondary views on ChainedDB.
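Spelled out with a made-up source doc shape ({_id, group, amount}) and a plain sum standing in for the "original" reduce, step 0's rewrite would look something like:

```javascript
// Step 0: the original map, mechanically rewritten. Where it used to say
//   emit(key, value);
// it now emits a prefixed data row plus a metadata row per source doc.
function map(doc) {
  const key = doc.group, value = doc.amount; // stand-in for the original map logic
  emit(["data", key], value);                // the original row, under ["data", key]
  emit(["metadata", doc._id], key);          // remembers which key this doc produced
}

// The original reduce (a sum here) wrapped to skip metadata rows. With
// group=true the two row types never share a reduce call, but the guard
// keeps ungrouped queries from choking on string-valued metadata rows.
function reduce(keys, values, rereduce) {
  if (rereduce) return values.reduce((a, b) => a + b, 0);
  let total = 0;
  for (let i = 0; i < keys.length; i++) {
    // keys[i] is [emitted_key, doc_id]; only ["data", ...] rows count
    if (keys[i][0][0] === "data") total += values[i];
  }
  return total;
}
```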
>
> Downsides:
> * need to modify original map function (although a relatively trivial
> substitution).
> * need to modify original reduce function (although relatively trivial
> insertion).
> * As far as I can tell, no one has used this approach yet. Untested.
>
> If no one has any better ideas, this is what I'll wind up trying.
>
>
> It seems to me that a view changes feed would make this much simpler. You
> could easily get the simplicity of the naive approach with the speed of the
> incremental approaches. I thought I read about a patch for view changes
> floating around, but some quick searching didn't turn it up. Does it apply
> cleanly to 1.2 (or 1.3)? I'll consider using it... but is there a reason
> it hasn't already been accepted into master?
>
>
> Thanks for reading this far, and thanks for any input you might have.
> --
> Nick Evans