Except that the OP specified the need to query for these counts within a date range. So you have to collate by listening time, not by song.
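
With the timestamp leading the key, the rows inside the date range are no
longer sorted by song, so the change-counting trick alone won't dedupe them.
One way to still get just the scalar out of CouchDB is a _list function that
streams the map rows and counts distinct songs server-side. A minimal,
untested sketch, assuming a view keyed as [user_id, date_played, song_id]
(the design-doc, view, and function names below are made up for
illustration):

    function (head, req) {
        // Streams the date-ordered rows and counts distinct songs,
        // returning only the scalar. Rows in the range are not grouped
        // by song, so we have to remember every song seen; memory grows
        // with the number of distinct songs, not the number of plays.
        var row, seen = {}, count = 0;
        while ((row = getRow())) {
            var song = row.key[2]; // key is [user_id, date_played, song_id]
            if (!seen.hasOwnProperty(song)) {
                seen[song] = true;
                count += 1;
            }
        }
        send(JSON.stringify({ distinct_songs: count }));
    }

Queried with reduce=false and the usual range (URL-encoding omitted for
readability):

    GET /db/_design/plays/_list/distinct_songs/by_user_date_song?reduce=false&startkey=["boris",[2011,1,1]]&endkey=["boris",[2011,2,1]]
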
On Wed, Nov 16, 2011 at 2:45 PM, Marcello Nuccio <[email protected]> wrote:
> Hi Rob,
> I remember doing something similar a while ago, but I cannot find the
> code and I don't have time to rewrite it right now... however, the
> trick is to count only when the song name changes. This works because
> view rows are sorted.
>
> HTH,
> Marcello
>
> 2011/11/15 Rob Crowell <[email protected]>:
>> Hello everyone,
>>
>> I'm writing some log-parsing code which is currently running on a
>> MySQL backend. We're doing a huge number of aggregations on this
>> data right now, but performance is suffering and I'm looking for
>> alternatives. The idea of incremental map/reduce initially seemed
>> like exactly the right thing, but I can't seem to express some of
>> the most important queries we are currently running in our
>> production system.
>>
>> We're running a lot of queries of the SELECT COUNT(DISTINCT song_id)
>> WHERE user_id = "boris" AND created >= "2010-01-01" AND created <
>> "2010-02-01" variety. Currently in MySQL-land we've got a cron job
>> to pre-compute these aggregates (it checks modified timestamps and
>> pulls in only new records) and write them to a summary table. I
>> initially believed I could use CouchDB's incremental map/reduce to
>> effortlessly build and update our "summary information" as it
>> changes, but I'm stuck. I'm trying to relax, but I can't figure out
>> exactly how :)
>>
>> In our example, our user "boris" listens to the same song many times
>> each month, and we're interested in the number of distinct songs
>> he's listened to during a specified time period (NOT the number of
>> song plays, but the number of distinct songs played). In CouchDB it
>> isn't much trouble to get all of the unique songs that he's listened
>> to during a period. Here's our document:
>>
>> {
>>   song_id: "happy birthday",
>>   user_id: "boris",
>>   date_played: [2011, 11, 14, 0, 12, 55],
>>   _id: ...
>> }
>>
>> To get the unique values, all we need to do is emit([doc.user_id,
>> doc.date_played, doc.song_id], null), reduce with _count, and query
>> with startkey=["boris", [2011, 1, 1]]&endkey=["boris", [2011, 2,
>> 1]]&group_level=1. This query will yield results like:
>>
>> ["boris", "happy birthday"], 20
>> ["boris", "yesterday"], 14
>> ...
>>
>> However, if our user has listened to 50,000 songs during the date
>> range, we'll get back 50,000 rows, which seems expensive. What I
>> want is just the scalar 50,000. I've tried writing a reduce that
>> returns the set of distinct song_ids for each user (turning the
>> values list into a dictionary and back again), but CouchDB complains
>> that I am not reducing my values fast enough :-/ I'm also not sure
>> how to reduce this list to a scalar at the end without returning the
>> whole thing to my client (which defeats the purpose of all this
>> anyway).
>>
>> Is this possible to do in CouchDB today? If not, is it something
>> that is on the roadmap, or does the internal structure of CouchDB's
>> b-tree make this really hard to do? It would of course be possible
>> for me to implement this myself (subscribe to the update
>> notifications and update my counts as appropriate in a custom
>> script), but I wanted to move to CouchDB so that I wouldn't have to
>> do all this myself.
>>
>> Thanks for any advice!
>>
>> --Rob
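
For reference, Marcello's trick maps naturally onto a _list function when
the view is keyed so that plays of the same song are adjacent, e.g.
[user_id, song_id]. Another untested sketch (names made up, as before):

    function (head, req) {
        // View rows arrive sorted by key, so with [user_id, song_id]
        // keys each distinct song forms one contiguous run of rows;
        // counting the transitions counts the songs, in O(1) memory.
        var row, prev = null, count = 0;
        while ((row = getRow())) {
            var song = row.key[1];
            if (song !== prev) {
                count += 1;
                prev = song;
            }
        }
        send(JSON.stringify({ distinct_songs: count }));
    }

The catch, as noted at the top, is that with the song in the key ahead of
the date there is no way to also restrict the same query to a date range,
which is why the range query has to collate by listening time instead.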
