On 16 April 2012 22:25, James Marca <[email protected]> wrote:
> On Sun, Apr 15, 2012 at 12:00:38PM +0300, Alon Keren wrote:
> > On 15 April 2012 09:13, James Marca <[email protected]> wrote:
> > >
> > > CouchDB will compute reduced values for what you select. If you just
> > > ask for values from A to B, it *only* will compute the reduced values
> > > over that range. So you can get "clever" with the key value, using
> > > something like
> > >
> > >     map: emit([user, game, trynumber], score);
> > >
> > > where trynumber is some value that is guaranteed to increase with each
> > > completed game score stored.
> > >
> > > Your reduce could use the built-in Erlang _sum.
> > >
> > > Then you can just request something like...hmm
> > >
> > >     startkey=[user,game,BIGNUMBER]&descending=true&limit=10&reduce=false
> > >
> > > (where BIGNUMBER is something bigger than the highest try number of the
> > > game).
> > >
> > > This will give 10 values, and you can do the average lickety-split
> > > client side, OR you can do one query to get the highest try number, then
> > > another to get the range between that game and ten back to let couch
> > > compute the sum for you.
> >
> > Thanks!
> >
> > I think a simpler alternative to 'trynumber' is the game's timestamp, and
> > BIGNUMBER could be replaced by '{}' (see:
> > http://wiki.apache.org/couchdb/View_collation). That's what I'm doing at
> > the moment :)
> > Unfortunately, as the numbers of games and game-types grow, this would
> > become pretty demanding in CPU time and number of calls to couch.
>
> I thought about timestamps first, but you said you wanted the last 10,
> and I wanted to be able to pipe the request through reduce.
>
> With timestamps you have to do two requests to get the current score and
> the 10 prior, or a single request without reducing.
>
> At the risk of stating the obvious, if you ask for "limit=10" in a
> request *and* the request goes through reduce, you will get 10
> reduced values, not 10 values that get reduced to one. By using an
> integer value, you can do the simple request I settled on above (give
> me ten values, no reduce), OR in a real application you probably know
> the current last game number, so you can pass start and end keys (the end
> key is ten less than the current game number) and force just 10 results to
> get piped through reduce.
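For concreteness, here is a minimal sketch of the view and the two query
styles discussed above. The database name "games", the design-document
name "scores", the view name "by_try", and the example keys are
illustrative assumptions, not from the thread:

    // _design/scores in a hypothetical "games" database
    {
      "views": {
        "by_try": {
          "map": "function(doc) { emit([doc.user, doc.game, doc.trynumber], doc.score); }",
          "reduce": "_sum"
        }
      }
    }

    // last 10 raw scores, newest first; average them client-side
    // ({} collates after any number, so it works as the high bound):
    GET /games/_design/scores/_view/by_try?startkey=["alice","chess",{}]&endkey=["alice","chess",0]&descending=true&limit=10&reduce=false

    // or, if the client already knows the latest try number N, let couch
    // return one reduced row with the sum of the last 10:
    GET /games/_design/scores/_view/by_try?startkey=["alice","chess",N-9]&endkey=["alice","chess",N]

(Keys are shown unencoded for readability; a real request would URL-encode
them, and the client substitutes literal numbers for N-9 and N.)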
Ah, I think I see now what you're getting at - thanks for clarifying.
It seems to me that even with this approach, if I want to use the db's
reduce, I would have to make a separate query for each game type. Or am
I missing something?

> Also, I really don't think there is any load at all on the CPU with
> this approach. Or, to be more accurate, no more than any active
> database processing a view. Again, apologies for stating the obvious,
> but CouchDB does incremental updates of views, so if you keep adding
> data, it only processes the new data. Once you have processed the
> data into a view, querying it (without reduce) takes almost no CPU.
> Reducing it can be expensive if you do something in JavaScript, but
> isn't as expensive if you stick with the built-in native Erlang reduce
> functions (sum, count, etc.).

Reduces in couchdb should be incremental, unlike when doing them outside
of couch.

> But one thing to keep in mind is that you can probably use multiple
> databases. Is there any reason you *have* to put all the games and all
> the users in a single database? Can you have a database per game? Or
> a database per user? Then the views are only updated when a
> particular user is adding results and querying results.
>
> I do data collection from sensors with CouchDB. I use one database
> per sensor per year of data, roughly a thousand or so DBs per year. I
> do this so I can eventually spread the pain across multiple machines (I
> haven't really had to yet), and because Erlang does a really good job
> of maxing out a multicore machine if it has a lot of jobs to run. With
> just one database, I was only getting two cores busy, but with
> thousands (when processing historical data) all 8 cores on my two
> servers were very busy.
>
> I also keep one database to do aggregations across all the detectors
> at the hourly level (I have 30-second data). Each db has a view that
> generates the hourly summaries I need, and I have a node process that
> polls all the databases at the end of each day to collect hourly
> documents and writes them to the collation database, which has other
> views. Kind of a manual version of your chained map-reduce project
> (incarnate, right?), but it suits the data better than automating it.
>
> For your app, suppose there are a million users all playing any of a
> thousand games. If every user posts a new score every second, ideally
> I would only want to make each player wait for their own data to get
> processed, not the data from the other 999,999 players. So that calls
> for a database per user. If users have to wait for Erlang to finish
> other jobs before it can schedule the user's job on a CPU, then you
> need more CPUs. With just one database, you don't get that choice: you
> have to wait for all of the data to get processed (unless you allow
> stale views).

Actually, several users can participate in each game, but their scores
are individual. However, there should be enough user-specific data
derived from these games that it may be a good optimization down the
line to put at least this kind of data in user-specific databases.

> As with my app, across users you can have a separate database that
> queries each db for that user's last 10 once every minute or so (the
> changes feed would probably work really well here... a change adds a
> callback to get data from that database when the periodic process is
> run) and updates a collating db with username_game_average type
> documents, to get the user's standings compared to other players.
>
> Regards,
> james
>
> PS, sorry for the long reply. I've had too much coffee today.

Nothing to be sorry about - thanks a lot for giving it so much attention,
James!

Alon
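A rough sketch of the periodic collation loop described above, in node
(which the thread already mentions): poll each per-user database's
_changes feed and re-collate only the users whose data actually changed.
The database names, port, and the collation step itself are hypothetical:

    // collate.js - a minimal sketch; db names and the collation step
    // are assumptions, not from the thread
    var http = require('http');

    // ask a database what has changed since the last sequence we saw
    function changesSince(db, since, cb) {
      http.get('http://localhost:5984/' + db + '/_changes?since=' + since, function (res) {
        var body = '';
        res.on('data', function (chunk) { body += chunk; });
        res.on('end', function () {
          var r = JSON.parse(body);
          cb(r.results.length > 0, r.last_seq);
        });
      });
    }

    // once a minute, re-collate only the users whose databases changed
    var lastSeq = {};
    setInterval(function () {
      ['user-alice', 'user-bob'].forEach(function (db) {
        changesSince(db, lastSeq[db] || 0, function (changed, seq) {
          lastSeq[db] = seq;
          if (changed) {
            // here: query that db's last-10 view and write a
            // username_game_average document into the collating db
          }
        });
      });
    }, 60 * 1000);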
