Hi Rob, I remember doing something similar a while ago, but I can't find the code and don't have time to rewrite it right now. However, the trick is to count only when the song name changes. This works because view rows are sorted.
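Roughly, the idea looks like this (a sketch, not tested code -- the function names and the row/key shape are my assumptions, and it presumes a key order in which rows for the same song sort adjacently):

```javascript
// Because CouchDB returns view rows sorted by key, a distinct count only
// needs to compare each row's song component with the previous row's.
// `rows` is assumed to look like the `rows` array of a view response,
// with keys shaped like [user_id, song_id, ...].
function countDistinctSongs(rows, songIndex = 1) {
  let count = 0;
  let previous = null;
  for (const row of rows) {
    const song = row.key[songIndex];
    if (song !== previous) {  // song changed -> one more distinct value
      count += 1;
      previous = song;
    }
  }
  return count;
}

// The same scan fits naturally in a CouchDB _list function, so only the
// scalar leaves the server (getRow/send are the list-function API):
function distinctList(head, req) {
  let count = 0, previous = null, row;
  while ((row = getRow())) {
    if (row.key[1] !== previous) { count += 1; previous = row.key[1]; }
  }
  send(String(count));
}
```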
HTH,
Marcello

2011/11/15 Rob Crowell <[email protected]>:
> Hello everyone,
>
> I'm writing some log-parsing code which is currently running on a
> MySQL backend. We're doing a huge amount of aggregates on this data
> right now, but performance is suffering and I'm looking for
> alternatives. The idea of incremental map/reduce initially seemed
> like the exact right thing, but I can't seem to express some of the
> most important queries we are currently running in our production
> system.
>
> We're running a lot of queries of the SELECT COUNT(DISTINCT song_id)
> WHERE user_id = "boris" AND created >= "2010-01-01" AND created <
> "2010-02-01" variety. Currently in MySQL-land we've got a cron job to
> pre-compute these aggregates (it checks modified timestamps and pulls
> in only new records) and write them to a summary table. I initially
> believed I could use CouchDB's incremental map/reduce to effortlessly
> build and update our "summary information" as it changes, but I'm
> stuck. I'm trying to relax, but I can't figure out exactly how :)
>
> In our example, our user "boris" listens to the same song many times
> each month, and we're interested in the number of distinct songs he's
> listened to during a specified time period (NOT the number of song
> plays, but the number of distinct songs played). In CouchDB it isn't
> much trouble to get all of the unique songs that he's listened to
> during a period. Here's our document:
>
> {
>   song_id: "happy birthday",
>   user_id: "boris",
>   date_played: [2011, 11, 14, 00, 12, 55],
>   _id: ...
> }
>
> To get the unique values, all we need to do is emit([doc.user_id,
> doc.date_played, doc.song_id], null), reduce with _count, and query
> with a startkey=["boris", "2011-01-01"]&endkey=["boris",
> "2011-02-01"]&group_level=1. This query will yield results like:
>
> ["boris", "happy birthday"], 20
> ["boris", "yesterday"], 14
> ...
>
> However, if our user has listened to 50,000 songs during the date
> range, we'll get back 50,000 rows which seems expensive. What I want
> is just the scalar 50,000. I've tried writing a reduce that returns
> the set of distinct song_ids for each user (turning the values list
> into a dictionary and back again), but CouchDB complains that I am not
> reducing my values fast enough :-/ I'm also not sure how to reduce
> this list to a scalar at the end without returning the whole thing to
> my client (which defeats the purpose of all this anyways).
>
> Is this possible to do in CouchDB today? If not, is it something that
> is on the roadmap, or does the internal structure of CouchDB's b-tree
> make this really hard to do? It would of course be possible for me to
> implement this myself (subscribe to the update notifications and
> update my counts as appropriate in a custom script), but I wanted to
> move to CouchDB so that I wouldn't have to do all this myself.
>
> Thanks for any advice!
>
> --Rob
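P.S. One more note on the map function: for the change-counting trick the song needs to sort adjacently, so I would emit the song before the date -- the opposite of the key order in your message (this is my assumption about your view, and it does trade away a single startkey/endkey date-range query). A standalone sketch, with a stand-in for CouchDB's emit so it can run outside the server:

```javascript
// Stand-in for CouchDB's emit: collect rows so we can sort them the way
// the view engine would (CouchDB sorts view rows by key).
const rows = [];
function emit(key, value) { rows.push({ key, value }); }

// Map function with song_id ahead of date_played, so that rows for the
// same song are adjacent in key order and duplicates can be skipped by
// comparing neighbours.
function map(doc) {
  emit([doc.user_id, doc.song_id, doc.date_played], null);
}

// Hypothetical sample documents in the shape from Rob's message:
[
  { user_id: "boris", song_id: "yesterday",      date_played: [2011, 11, 16, 9, 0, 0] },
  { user_id: "boris", song_id: "happy birthday", date_played: [2011, 11, 14, 0, 12, 55] },
  { user_id: "boris", song_id: "happy birthday", date_played: [2011, 11, 15, 8, 30, 0] },
].forEach(map);

// Crude key sort, good enough for these string/number keys (CouchDB's
// real collation rules are more involved):
rows.sort((a, b) => (JSON.stringify(a.key) < JSON.stringify(b.key) ? -1 : 1));
```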
