Re: Writing Advanced Aggregate Queries -- Equivalent of SELECT COUNT(DISTINCT field) in CouchDB?

Alon Keren Wed, 16 Nov 2011 01:00:09 -0800

Hi,

Here's an incremental alternative, using Couch Incarnate
(https://github.com/afters/Couch-Incarnate). Keep in mind that the project
is still experimental.


I'd first use Couch Incarnate to create a DB holding only the unique
[user, year, month, day, song] combinations:
{
  "source": YOUR_ORIGINAL_DB_URL,
  "map": "function (doc) { emit([doc.user_id, doc.date_played[0], doc.date_played[1], doc.date_played[2], doc.song_id], null); }",
  "reduces": {
    "count": {
      "function": "function (key, values, rereduce) { return null; }",
      "group_levels": ['exact']
    }
  }
}

Documents in this DB would look like this:
{
  key: ["boris", 2011, 11, 14, "happy birthday"],
  value: null
}
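
To make the deduplication step concrete, here is a minimal sketch in plain
JavaScript, run outside CouchDB and Couch Incarnate. It only mimics the idea
of grouping at the exact key; the sample plays and the helper names (dayKey,
uniqueKeys) are made up for illustration and are not part of either project.

```javascript
// Hypothetical sample plays, shaped like the documents in the original question.
const plays = [
  { user_id: "boris", song_id: "happy birthday", date_played: [2011, 11, 14, 0, 12, 55] },
  { user_id: "boris", song_id: "happy birthday", date_played: [2011, 11, 14, 9, 30, 0] },
  { user_id: "boris", song_id: "yesterday", date_played: [2011, 11, 14, 10, 0, 0] },
  { user_id: "boris", song_id: "yesterday", date_played: [2011, 11, 15, 8, 0, 0] },
];

// Same key shape as the example document above: [user, year, month, day, song].
function dayKey(doc) {
  return [doc.user_id, doc.date_played[0], doc.date_played[1], doc.date_played[2], doc.song_id];
}

// Grouping at the exact key leaves one entry per distinct key,
// no matter how many plays produced that key.
function uniqueKeys(docs) {
  const seen = new Map();
  for (const doc of docs) {
    const key = dayKey(doc);
    seen.set(JSON.stringify(key), key);
  }
  return [...seen.values()];
}

// One derived document per distinct key, with a null value.
const uniqueDocs = uniqueKeys(plays).map((key) => ({ key: key, value: null }));
console.log(uniqueDocs.length); // 3
```

The two plays of "happy birthday" on the 14th collapse into a single
document, which is what lets a plain _count work in the next step.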

Then I'd simply count the docs for a specific date-range with this view:

  map: function (doc) {
    emit(doc.key, null);
  }

  reduce: _count

querying with:

  /?start_key='["boris", 2011, 11, 14]'&end_key='["boris", 2011, 11, 14, {}]'&group_level=4
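
As a rough illustration of what that query computes, here is an in-memory
sketch in plain JavaScript. It assumes the view rows are the unique
[user, year, month, day, song] keys; the sample rows and the helper
countByGroupLevel are hypothetical, and the prefix filter is only a crude
stand-in for CouchDB's collation-based key range.

```javascript
// Hypothetical view rows: one per unique [user, year, month, day, song] key.
const rows = [
  ["boris", 2011, 11, 13, "yesterday"],
  ["boris", 2011, 11, 14, "happy birthday"],
  ["boris", 2011, 11, 14, "yesterday"],
  ["boris", 2011, 11, 15, "yesterday"],
];

// Mimics _count with group_level: truncate each key to `level` elements
// and count how many rows share each truncated key.
function countByGroupLevel(keys, level) {
  const counts = new Map();
  for (const key of keys) {
    const group = JSON.stringify(key.slice(0, level));
    counts.set(group, (counts.get(group) || 0) + 1);
  }
  return counts;
}

// Stand-in for the key range: keep rows whose first four elements match.
const prefix = ["boris", 2011, 11, 14];
const inRange = rows.filter((key) => prefix.every((part, i) => key[i] === part));

const distinctSongs = countByGroupLevel(inRange, 4).get(JSON.stringify(prefix));
console.log(distinctSongs); // 2
```

Because every row is already a distinct song for that day, counting rows
at group_level=4 gives the number of distinct songs rather than the number
of plays.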


The above would work when the time frame is days. If you want to do the
same with months (or years), you'd have to create another DB using
Couch-Incarnate.
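
A small sketch of why the day-level DB can't simply be reused for months:
a song played on two different days of the same month shows up under two
day-level keys, so adding up per-day counts would count it twice. The
month-level key below (the day-level key with the day dropped) is an
assumption about what the second DB's map would emit, and the sample data
and helper names are made up.

```javascript
// Hypothetical plays: "yesterday" is played on two different days in November.
const plays = [
  { user_id: "boris", song_id: "yesterday", date_played: [2011, 11, 14, 10, 0, 0] },
  { user_id: "boris", song_id: "yesterday", date_played: [2011, 11, 15, 8, 0, 0] },
  { user_id: "boris", song_id: "happy birthday", date_played: [2011, 11, 14, 0, 12, 55] },
];

// Day-level key: [user, year, month, day, song].
function dayKey(doc) {
  return [doc.user_id, doc.date_played[0], doc.date_played[1], doc.date_played[2], doc.song_id];
}

// Assumed month-level key: the same, minus the day.
function monthKey(doc) {
  return [doc.user_id, doc.date_played[0], doc.date_played[1], doc.song_id];
}

// Number of distinct keys produced by keyFn over the given docs.
function countDistinct(docs, keyFn) {
  return new Set(docs.map((doc) => JSON.stringify(keyFn(doc)))).size;
}

const perDayTotal = countDistinct(plays, dayKey); // sum of per-day distinct counts
const perMonth = countDistinct(plays, monthKey);  // distinct songs in the month
console.log(perDayTotal, perMonth); // 3 2
```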

  Alon

On 15 November 2011 22:03, Rob Crowell <[email protected]> wrote:

> Hey Zach,
>
> That looks really good!  Thanks for the pointer, I'd read about list
> functions but never thought about using them in this way.
>
> From what I read, it seems there is no way to build this list function
> incrementally?
>
> --Rob
>
>
> On Tue, Nov 15, 2011 at 1:37 PM, Zachary Zolton
> <[email protected]> wrote:
> > Rob
> >
> > Since you've already got a view that'll give you Boris's 50k unique
> > songs, you could use a _list function to return the number of rows.
> >
> > Something like this should do the trick:
> >
> > function() {
> >  var count = 0;
> >  while(getRow()) count++;
> >  return JSON.stringify({count: count});
> > }
> >
> > If you query this list function, with the same view, key range and
> > group level, it'll just respond with a bit of JSON, such as:
> > {"count":50612}
> >
> > Is that more like what you're looking for?
> >
> > You can read up more here:
> > http://guide.couchdb.org/draft/transforming.html
> > http://wiki.apache.org/couchdb/Formatting_with_Show_and_List
> >
> > --Zach
> >
> > On Tue, Nov 15, 2011 at 11:39 AM, Rob Crowell <[email protected]> wrote:
> >> I don't think this works, unless I am misunderstanding.
> >>
> >> If our user "boris" listened to the same song 20 times, and only
> >> listened to that one song, the _count reduce would return 20 would it
> >> not?  I would like the value 1 instead (only 1 distinct song listened
> >> to).
> >>
> >> --Rob
> >>
> >> On Tue, Nov 15, 2011 at 12:21 PM, Robert Newson <[email protected]> wrote:
> >>> then just emit(doc.user_id, null) and use _count?
> >>>
> >>> B.
> >>>
> >>> On 15 November 2011 17:17, Rob Crowell <[email protected]> wrote:
> >>>> Hello everyone,
> >>>>
> >>>> I'm writing some log-parsing code which is currently running on a
> >>>> MySQL backend.  We're doing a huge amount of aggregates on this data
> >>>> right now, but performance is suffering and I'm looking for
> >>>> alternatives.  The idea of incremental map/reduce initially seemed
> >>>> like the exact right thing, but I can't seem to express some of the
> >>>> most important queries we are currently running in our production
> >>>> system.
> >>>>
> >>>> We're running a lot of queries of the SELECT COUNT(DISTINCT song_id)
> >>>> WHERE user_id = "boris" AND created >= "2010-01-01" AND created <
> >>>> "2010-02-01" variety.  Currently in MySQL-land we've got a cron job to
> >>>> pre-compute these aggregates (it checks modified timestamps and pulls
> >>>> in only new records) and write them to a summary table.  I initially
> >>>> believed I could use CouchDB's incremental map/reduce to effortlessly
> >>>> build and update our "summary information" as it changes, but I'm
> >>>> stuck.  I'm trying to relax, but I can't figure out exactly how :)
> >>>>
> >>>> In our example, our user "boris" listens to the same song many times
> >>>> each month, and we're interested in the number of distinct songs he's
> >>>> listened to during a specified time period (NOT the number of song
> >>>> plays, but the number of distinct songs played).  In CouchDB it isn't
> >>>> much trouble to get all of the unique songs that he's listened to
> >>>> during a period.  Here's our document:
> >>>>
> >>>> {
> >>>>  song_id: "happy birthday",
> >>>>  user_id: "boris",
> >>>>  date_played: [2011, 11, 14, 00, 12, 55],
> >>>>  _id: ...
> >>>> }
> >>>>
> >>>> To get the unique values, all we need to do is emit([doc.user_id,
> >>>> doc.date_played, doc.song_id], null), reduce with _count, and query
> >>>> with a startkey=["boris", "2011-01-01"]&endkey=["boris",
> >>>> "2011-02-01"]&group_level=1.  This query will yield results like:
> >>>>
> >>>> ["boris", "happy birthday"], 20
> >>>> ["boris", "yesterday"], 14
> >>>> ...
> >>>>
> >>>> However, if our user has listened to 50,000 songs during the date
> >>>> range, we'll get back 50,000 rows which seems expensive.  What I want
> >>>> is just the scalar 50,000.  I've tried writing a reduce that returns
> >>>> the set of distinct song_ids for each user (turning the values list
> >>>> into a dictionary and back again), but CouchDB complains that I am not
> >>>> reducing my values fast enough :-/  I'm also not sure how to reduce
> >>>> this list to a scalar at the end without returning the whole thing to
> >>>> my client (which defeats the purpose of all this anyways).
> >>>>
> >>>> Is this possible to do in CouchDB today?  If not, is it something that
> >>>> is on the roadmap, or does the internal structure of CouchDB's b-tree
> >>>> make this really hard to do?  It would of course be possible for me to
> >>>> implement this myself (subscribe to the update notifications and
> >>>> update my counts as appropriate in a custom script), but I wanted to
> >>>> move to CouchDB so that I wouldn't have to do all this myself.
> >>>>
> >>>> Thanks for any advice!
> >>>>
> >>>> --Rob
> >>>>
> >>>
> >>
> >
>
