Sorting CouchDB data using couch-lucene

nicholas a. evans Tue, 27 Mar 2012 10:05:27 -0700

I have some summation data that was very easy to generate using some
relatively simple map/reduce views.  But we want to sort the data
based on the group-reduced view *values* (not the keys).  It was
suggested that we could use couchdb-lucene to do this.  But how?  It's
not clear to me how to use a full text index to quickly rank this sort
of data.


**What we already have**

An oversimplified example view looks something like the following:

    by_sender: {
      map: "function(doc) { emit(doc.sender, 1); }",
      reduce: "function(keys, values, rereduce) { return sum(values); }"
    }

Which returns results somewhat like the following (when run with `group=true`):

     {"rows":[
     {"key":"[email protected]","value":2},
     {"key":"[email protected]","value":1},
     {"key":"[email protected]","value":34},
     {"key":"[email protected]","value":1},
     ... thousands or tens of thousands of rows ...
     ]}

**What we want**

Those are sorted by key, but I need to sort the data according to the
values, like so:

     {"rows":[
     {"key":"[email protected]","value":847},
     {"key":"[email protected]","value":345},
     {"key":"[email protected]","value":99},
     {"key":"[email protected]","value":34},
     ... thousands or tens of thousands of rows ...
     ]}

**More context: what we already tried**

The best answer on
http://stackoverflow.com/questions/2817703/sorting-couchdb-views-by-value
gives four viable options, which we've tried in increasing order of
difficulty:

 1. First we sorted the results client side, but that was *way* too slow.
 2. Next we created a list function that sorts the data.  A little
faster, but still too slow.
 3. Chained Map-Reduce Views should handle this problem easily.
    - Someone pointed out Cloudant's Chained Map-Reduce Views.  They
are not in BigCouch but are part of Cloudant's services, which are
unfortunately not in our budget at this time.
    - I started an application layer implementation using the
_bulk_docs API.  It is tricky if you want to keep updates as snappy as
possible while avoiding race conditions, etc.  I can continue with
this approach, but it is *not* relaxing.  :(
 4. The answer suggested using couchdb-lucene.  But I'm not nearly
familiar enough with full-text search to understand how to get it to
do anything more sophisticated than index the document and return a
search result.  I don't even know where to start.
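For reference, option 1 amounts to something like the sketch below.  The
function name and the sample rows are made up for illustration (this is
not our actual code); it assumes `rows` is the "rows" array from a
`group=true` response like the one shown earlier:

```javascript
// Hypothetical helper: sort group-reduced rows by value, largest first.
// `rows` is assumed to be the "rows" array of a group=true view response.
function sortRowsByValue(rows) {
  // Copy first so the original response array is left untouched.
  return rows.slice().sort(function (a, b) {
    return b.value - a.value;
  });
}

// Made-up sample rows, standing in for a real view response.
var sample = [
  { key: "a", value: 2 },
  { key: "b", value: 34 },
  { key: "c", value: 1 }
];

var sorted = sortRowsByValue(sample);
console.log(sorted);
```

The sort itself is trivial; the cost is in fetching and re-sorting every
group-reduced row on each request, which is what made it too slow for us.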

I also posted this at
http://stackoverflow.com/questions/9893759/sorting-couchdb-data-using-couch-lucene
Is it bad form to post the question in both places?  I hope not.  :)

Has someone already shared an open source implementation of map-reduce
chaining?  Are there other good approaches?  Or is this a
hammer/screwdriver problem: should we be looking outside of CouchDB to
handle this particular type of data analysis?  For example, we could
monitor the changes feed and run the Redis command
"zincrby messages:by_sender 1 $sender" for every new row.
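
To make the changes-feed idea concrete, here is a rough sketch of the
bookkeeping involved.  It is only a sketch under several assumptions: a
plain in-memory Map stands in for the Redis sorted set (real code would
issue the zincrby instead), the names `applyChange` and `topSenders` are
hypothetical, each change row is assumed to carry the document (i.e. the
feed was requested with `include_docs=true`), and documents are assumed
to be written once and never updated, since an edited document would
otherwise be counted again:

```javascript
// In-memory stand-in for a sorted set keyed by sender (assumption:
// real code would call ZINCRBY on Redis instead of updating a Map).
var counts = new Map();

// Hypothetical handler for one row of the _changes feed.
// Assumes the row has a `doc` property (include_docs=true) and that
// each document appears in the feed exactly once.
function applyChange(change) {
  if (change.deleted || !change.doc || !change.doc.sender) {
    return; // skip deletions and documents without a sender
  }
  var sender = change.doc.sender;
  counts.set(sender, (counts.get(sender) || 0) + 1);
}

// Return the n senders with the highest counts, largest first,
// shaped like view rows for easy comparison.
function topSenders(n) {
  return Array.from(counts.entries())
    .sort(function (a, b) { return b[1] - a[1]; })
    .slice(0, n)
    .map(function (entry) { return { key: entry[0], value: entry[1] }; });
}

// Made-up change rows for illustration.
[
  { id: "1", doc: { sender: "x" } },
  { id: "2", doc: { sender: "y" } },
  { id: "3", doc: { sender: "x" } },
  { id: "4", deleted: true }
].forEach(applyChange);

console.log(topSenders(1));
```

With a real sorted set the ranking would be maintained incrementally, so
reading the top senders would not require re-sorting everything as
`topSenders` does here.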

Thanks for your consideration!
-- 
Nick Evans
