Hi,
Here's an incremental alternative, using Couch Incarnate (
https://github.com/afters/Couch-Incarnate). Keep in mind that the project
is still experimental.
I'd first use Couch Incarnate to create a DB holding only the unique names:
{
"source": YOUR_ORIGINAL_DB_URL,
"map": "function (doc) { emit([doc.user_id, doc.date, date_played[0],
date_played[1], date_played[2], doc.song_id], null); }",
"reduces": {
"count": {
"function": "function (key, values, rereduce) { return null; }",
"group_levels": ['exact']
}
}
}
Documents in this DB would look like this:
{
key: ["boris", 2011, 11, 14, "happy birthday"],
value: null,
}
Then I'd simply count the docs for a specific date-range with this view:
map: function (doc) {
emit(doc.key, null);
}
reduce: _count
querying with:
/?start_key='["boris", 2011, 11, 14]'&end_key='["boris", 2011, 11, 14,
{}]'&group_level=4
The above would work when the time frame is days. If you want to do the
same with months (or years), you'd have to create another DB using
Couch-Incarnate.
Alon
On 15 November 2011 22:03, Rob Crowell <[email protected]> wrote:
> Hey Zach,
>
> That looks really good! Thanks for the pointer, I'd read about list
> functions but never thought about using them in this way.
>
> From what I read, it seems there is no way to build this list function
> incrementally?
>
> --Rob
>
>
> On Tue, Nov 15, 2011 at 1:37 PM, Zachary Zolton
> <[email protected]> wrote:
> > Rob
> >
> > Since you've already got a view that'll give you Boris's 50k unique
> > songs, you could use a _list function to return the number of rows.
> >
> > Something like this should do the trick:
> >
> > function() {
> > var count = 0;
> > while(getRow()) count++;
> > return JSON.stringify({count: count});
> > }
> >
> > If you query this list function, with the same view, key range and
> > group level, it'll just respond with a bit of JSON, such as:
> > {"count":"50612"}
> >
> > Is that more like what you're looking for?
> >
> > You can read up more here:
> > http://guide.couchdb.org/draft/transforming.html
> > http://wiki.apache.org/couchdb/Formatting_with_Show_and_List
> >
> > --Zach
> >
> > On Tue, Nov 15, 2011 at 11:39 AM, Rob Crowell <[email protected]>
> wrote:
> >> I don't think this works, unless I am misunderstanding.
> >>
> >> If our user "boris" listened to the same song 20 times, and only
> >> listened to that one song, the _count reduce would return 20 would it
> >> not? I would like the value 1 instead (only 1 distinct song listened
> >> to).
> >>
> >> --Rob
> >>
> >> On Tue, Nov 15, 2011 at 12:21 PM, Robert Newson <[email protected]>
> wrote:
> >>> then just emit(doc.user_id, null) and use _count?
> >>>
> >>> B.
> >>>
> >>> On 15 November 2011 17:17, Rob Crowell <[email protected]> wrote:
> >>>> Hello everyone,
> >>>>
> >>>> I'm writing some log-parsing code which is currently running on a
> >>>> MySQL backend. We're doing a huge amount of aggregates on this data
> >>>> right now, but performance is suffering and I'm looking for
> >>>> alternatives. The idea of incremental map/reduce initially seemed
> >>>> like the exact right thing, but I can't seem to express some of the
> >>>> most important queries we are currently running in our production
> >>>> system.
> >>>>
> >>>> We're running a lot of queries of the SELECT COUNT(DISTINCT song_id)
> >>>> WHERE user_id = "boris" AND created >= "2010-01-01" AND created <
> >>>> "2010-02-01" variety. Currently in MySQL-land we've got a cron job
> to
> >>>> pre-compute these aggregates (it checks modified timestamps and pulls
> >>>> in only new records) and write them to a summary table. I initially
> >>>> believed I could use CouchDB's incremental map/reduce to effortlessly
> >>>> build and update our "summary information" as it changes, but I'm
> >>>> stuck. I'm trying to relax, but I can't figure out exactly how :)
> >>>>
> >>>> In our example, our user "boris" listens to the same song many times
> >>>> each month, and we're interested in the number of distinct songs he's
> >>>> listened to during a specified time period (NOT the number of song
> >>>> plays, but the number of distinct songs played). In CouchDB it isn't
> >>>> much trouble to get all of the unique songs that he's listened to
> >>>> during a period. Here's our document:
> >>>>
> >>>> {
> >>>> song_id: "happy birthday",
> >>>> user_id: "boris",
> >>>> date_played: [2011, 11, 14, 00, 12, 55],
> >>>> _id: ...
> >>>> }
> >>>>
> >>>> To get the unique values, all we need to do is emit([doc.user_id,
> >>>> doc.date_played, doc.song_id], null), reduce with _count, and query
> >>>> with a startkey=["boris", "2011-01-01"]&endkey=["boris",
> >>>> "2011-02-01"]&group_level=1. This query will yield results like:
> >>>>
> >>>> ["boris", "happy birthday"], 20
> >>>> ["boris", "yesterday"], 14
> >>>> ...
> >>>>
> >>>> However, if our user has listened to 50,000 songs during the date
> >>>> range, we'll get back 50,000 rows which seems expensive. What I want
> >>>> is just the scalar 50,000. I've tried writing a reduce that returns
> >>>> the set of distinct song_ids for each user (turning the values list
> >>>> into a dictionary and back again), but CouchDB complains that I am not
> >>>> reducing my values fast enough :-/ I'm also not sure how to reduce
> >>>> this list to a scalar at the end without returning the whole thing to
> >>>> my client (which defeats the purpose of all this anyways).
> >>>>
> >>>> Is this possible to do in CouchDB today? If not, is it something that
> >>>> is on the roadmap, or does the internal structure of CouchDB's b-tree
> >>>> make this really hard to do? It would of course be possible for me to
> >>>> implement this myself (subscribe to the update notifications and
> >>>> update my counts as appropriate in a custom script), but I wanted to
> >>>> move to CouchDB so that I wouldn't have to do all this myself.
> >>>>
> >>>> Thanks for any advice!
> >>>>
> >>>> --Rob
> >>>>
> >>>
> >>
> >
>