I have some summation data that was very easy to generate using some
relatively simple map/reduce views. But we want to sort the data
based on the group-reduced view *values* (not the keys). It was
suggested that we could use couchdb-lucene to do this. But how? It's
not clear to me how to use a full text index to quickly rank this sort
of data.
**What we already have**
An oversimplified example view looks something like the following:
    by_sender: {
      map: "function(doc) { emit(doc.sender, 1); }",
      reduce: "function(keys, values, rereduce) { return sum(values); }"
    }
Which returns results somewhat like the following (when run with `group=true`):
    {"rows":[
      {"key":"[email protected]","value":2},
      {"key":"[email protected]","value":1},
      {"key":"[email protected]","value":34},
      {"key":"[email protected]","value":1},
      ... thousands or tens of thousands of rows ...
    ]}
**What we want**
Those are sorted by the key, but I need to sort the data according to
the values, like so:
    {"rows":[
      {"key":"[email protected]","value":847},
      {"key":"[email protected]","value":345},
      {"key":"[email protected]","value":99},
      {"key":"[email protected]","value":34},
      ... thousands or tens of thousands of rows ...
    ]}
**More context: what we already tried**
The best answer on
http://stackoverflow.com/questions/2817703/sorting-couchdb-views-by-value
gives four viable options, which we've tried in increasing order of
difficulty:
1. First we sorted the results client side, but that was *way* too slow.
2. Next we created a list function which sorts the data. A little
   faster, but still too slow.
3. Chained Map-Reduce Views should handle this problem easily.
   - Someone pointed out Cloudant's Chained Map-Reduce Views. They
     are not in BigCouch but are part of Cloudant's services, which are
     unfortunately not in our budget at this time.
   - I started an application layer implementation using the
     _bulk_docs API. It is tricky if you want to keep updates as snappy as
     possible while avoiding race conditions, etc. I can continue with
     this approach, but it is *not* relaxing. :(
4. The answer suggested using couchdb-lucene. But I'm not nearly
   familiar enough with full-text search to understand how to get it to
   do anything more sophisticated than index the document and return a
   search result. I don't even know where to start.
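
To make option 3 concrete, here is a rough sketch of the application-layer chaining step I have in mind. It is only an illustration, not a finished implementation: the helper name `rowsToBulkDocs`, the `sender:` id prefix, the `type` field, and the example addresses are all made up, and it assumes a separate "summary" database that holds one document per sender. It does nothing about the race conditions mentioned above.

```javascript
// Hypothetical helper: turn the group=true rows from by_sender into a
// payload for POST /<summary-db>/_bulk_docs.
// `existingRevs` maps _id -> current _rev for docs already in the summary db.
function rowsToBulkDocs(rows, existingRevs = {}) {
  const docs = rows.map((row) => {
    const id = "sender:" + row.key; // deterministic id: one doc per sender
    const doc = { _id: id, type: "sender_count", sender: row.key, count: row.value };
    if (existingRevs[id]) doc._rev = existingRevs[id]; // needed to update an existing doc
    return doc;
  });
  return { docs };
}

// Map function for a second view, stored as a string in a design doc of the
// summary database. The count is the key, so the view is sorted by count.
const byCountMap =
  "function(doc) { if (doc.type === 'sender_count') emit(doc.count, doc.sender); }";

// Example with made-up rows; the second sender already exists in the summary db.
const payload = rowsToBulkDocs(
  [
    { key: "alice@example.com", value: 34 },
    { key: "bob@example.com", value: 2 },
  ],
  { "sender:bob@example.com": "1-abc" }
);
console.log(JSON.stringify(payload, null, 2));
```

Querying the second view with `descending=true` (plus a `limit`) would then return the senders with the highest counts first.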
I also posted this at
http://stackoverflow.com/questions/9893759/sorting-couchdb-data-using-couch-lucene
Is it bad form to post the question in both places? I hope not. :)
Has someone already shared an open source implementation of map-reduce
chaining? Are there other good approaches? Or is this a
hammer/screwdriver problem: should we be looking outside of couchdb to
handle this particular type of data analysis? E.g. monitor the
changes feed and run something like Redis's
"zincrby messages:by_sender 1 $sender" for every new row.
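
A rough sketch of that changes-feed idea, to show the shape of it. To keep it runnable without a Redis server, an in-memory class stands in for the sorted set; `SortedCounter`, `applyChange`, and the sample changes are hypothetical names and data. It assumes the feed is read with `include_docs=true` and that each message document is written once and never updated, since an updated document would show up in the feed again and be counted twice.

```javascript
// In-memory stand-in for a Redis sorted set, with only the two
// operations this sketch needs.
class SortedCounter {
  constructor() {
    this.scores = new Map();
  }
  // Plays the role of: ZINCRBY key increment member
  incrBy(member, increment) {
    const next = (this.scores.get(member) || 0) + increment;
    this.scores.set(member, next);
    return next;
  }
  // Plays the role of: ZREVRANGE key 0 n-1 WITHSCORES
  // (this version sorts on read; Redis keeps the set ordered on write)
  top(n) {
    return [...this.scores.entries()]
      .sort((a, b) => b[1] - a[1])
      .slice(0, n)
      .map(([key, value]) => ({ key, value }));
  }
}

// Hypothetical handler for one row of the _changes feed.
function applyChange(counter, change) {
  if (change.deleted || !change.doc || !change.doc.sender) return;
  counter.incrBy(change.doc.sender, 1);
}

// Made-up changes, shaped like _changes rows with include_docs=true.
const changes = [
  { id: "m1", doc: { sender: "alice@example.com" } },
  { id: "m2", doc: { sender: "bob@example.com" } },
  { id: "m3", doc: { sender: "alice@example.com" } },
  { id: "m4", deleted: true },
  { id: "m5", doc: { sender: "carol@example.com" } },
  { id: "m6", doc: { sender: "carol@example.com" } },
  { id: "m7", doc: { sender: "carol@example.com" } },
];

const bySender = new SortedCounter();
changes.forEach((change) => applyChange(bySender, change));
const topTwo = bySender.top(2);
console.log(topTwo);
```

The ranking query is then cheap, but the counts live outside CouchDB, so the consumer would also have to remember the last `seq` it processed in order to resume after a restart.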
Thanks for your consideration!
--
Nick Evans