I don't think this works, unless I'm misunderstanding. If our user "boris" listened to the same song 20 times, and only listened to that one song, the _count reduce would return 20, would it not? I would like the value 1 instead (only 1 distinct song listened to).
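To make that concrete, here is a minimal Node.js simulation of the suggested pipeline (plain JavaScript standing in for CouchDB's map and _count machinery; the dataset is invented for illustration):

```javascript
// "boris" plays the same song 20 times and nothing else.
const docs = Array.from({ length: 20 }, (_, i) => ({
  song_id: "happy birthday",
  user_id: "boris",
  date_played: [2011, 11, 14, 0, 12, i],
}));

// map step: emit(doc.user_id, null), as suggested
const rows = docs.map((doc) => ({ key: doc.user_id, value: null }));

// _count reduce: counts emitted rows per key, i.e. total plays
const counts = {};
for (const row of rows) {
  counts[row.key] = (counts[row.key] || 0) + 1;
}

// what we actually want: the number of distinct songs per user
const distinct = new Set(docs.map((d) => d.song_id)).size;

console.log(counts.boris, distinct); // 20 plays vs. 1 distinct song
```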
--Rob

On Tue, Nov 15, 2011 at 12:21 PM, Robert Newson <[email protected]> wrote:
> then just emit(doc.user_id, null) and use _count?
>
> B.
>
> On 15 November 2011 17:17, Rob Crowell <[email protected]> wrote:
>> Hello everyone,
>>
>> I'm writing some log-parsing code which is currently running on a
>> MySQL backend. We're doing a huge amount of aggregates on this data
>> right now, but performance is suffering and I'm looking for
>> alternatives. The idea of incremental map/reduce initially seemed
>> like exactly the right thing, but I can't seem to express some of the
>> most important queries we are currently running in our production
>> system.
>>
>> We're running a lot of queries of the SELECT COUNT(DISTINCT song_id)
>> WHERE user_id = "boris" AND created >= "2010-01-01" AND created <
>> "2010-02-01" variety. Currently in MySQL-land we've got a cron job to
>> pre-compute these aggregates (it checks modified timestamps and pulls
>> in only new records) and write them to a summary table. I initially
>> believed I could use CouchDB's incremental map/reduce to effortlessly
>> build and update our "summary information" as it changes, but I'm
>> stuck. I'm trying to relax, but I can't figure out exactly how :)
>>
>> In our example, our user "boris" listens to the same song many times
>> each month, and we're interested in the number of distinct songs he's
>> listened to during a specified time period (NOT the number of song
>> plays, but the number of distinct songs played). In CouchDB it isn't
>> much trouble to get all of the unique songs he's listened to during a
>> period. Here's our document:
>>
>> {
>>   song_id: "happy birthday",
>>   user_id: "boris",
>>   date_played: [2011, 11, 14, 00, 12, 55],
>>   _id: ...
>> }
>>
>> To get the unique values, all we need to do is
>> emit([doc.user_id, doc.date_played, doc.song_id], null), reduce with
>> _count, and query with startkey=["boris", "2011-01-01"]&endkey=["boris",
>> "2011-02-01"]&group_level=1. This query will yield results like:
>>
>> ["boris", "happy birthday"], 20
>> ["boris", "yesterday"], 14
>> ...
>>
>> However, if our user has listened to 50,000 songs during the date
>> range, we'll get back 50,000 rows, which seems expensive. What I want
>> is just the scalar 50,000. I've tried writing a reduce that returns
>> the set of distinct song_ids for each user (turning the values list
>> into a dictionary and back again), but CouchDB complains that I am not
>> reducing my values fast enough :-/ I'm also not sure how to reduce
>> this list to a scalar at the end without returning the whole thing to
>> my client (which defeats the purpose of all this anyway).
>>
>> Is this possible to do in CouchDB today? If not, is it something that
>> is on the roadmap, or does the internal structure of CouchDB's b-tree
>> make this really hard to do? It would of course be possible for me to
>> implement this myself (subscribe to the update notifications and
>> update my counts as appropriate in a custom script), but I wanted to
>> move to CouchDB so that I wouldn't have to do all this myself.
>>
>> Thanks for any advice!
>>
>> --Rob
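For reference, the set-based reduce described in the quoted message might look roughly like this (a hypothetical sketch, not the actual code from the thread, runnable here as plain JavaScript). It shows why CouchDB complains: the reduce output is a list that grows with the number of distinct songs, rather than shrinking toward a scalar.

```javascript
// Hypothetical sketch of a reduce that collects distinct song_ids by
// turning the values list into a dictionary and back again.
function reduce(keys, values, rereduce) {
  const songs = {};
  for (const v of values) {
    // On rereduce, each value is a partial list of song_ids from an
    // earlier reduce pass; otherwise it is a single emitted song_id.
    const ids = rereduce ? v : [v];
    for (const id of ids) songs[id] = true;
  }
  // Output grows with the number of distinct songs -- this is what
  // triggers CouchDB's "not reducing your values fast enough" complaint.
  return Object.keys(songs);
}

// First-pass reduce over emitted song_ids, then a rereduce of partials:
const pass1 = reduce(null, ["happy birthday", "yesterday", "happy birthday"], false);
const merged = reduce(null, [pass1, ["yesterday", "let it be"]], true);
console.log(merged.length); // 3 distinct songs, but the list itself keeps growing
```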
