On Sun, Apr 15, 2012 at 12:00:38PM +0300, Alon Keren wrote:
> On 15 April 2012 09:13, James Marca <[email protected]> wrote:
>
> > CouchDB will compute reduced values for what you select. If you just
> > ask for values from A to B, it will *only* compute the reduced values
> > over that range. So you can get "clever" with the key value, using
> > something like
> >
> >     map: emit([user, game, trynumber], score);
> >
> > where trynumber is some value that is guaranteed to increase with each
> > completed game score stored.
> >
> > Your reduce could use the built-in Erlang _sum.
> >
> > Then you can just request something like...hmm
> >
> >     startkey=[user,game,BIGNUMBER]&descending=true&limit=10&reduce=false
> >
> > (where BIGNUMBER is something bigger than the highest try number of the
> > game).
> >
> > This will give 10 values, and you can do the average lickety-split
> > client side, OR you can do one query to get the highest try number, then
> > another to get between that game and ten back to let Couch compute the
> > sum for you.
>
> Thanks!
> I think a simpler alternative to 'trynumber' is the game's timestamp, and
> BIGNUMBER could be replaced by '{}' (see:
> http://wiki.apache.org/couchdb/View_collation). That's what I'm doing at
> the moment :)
> Unfortunately, as the numbers of games and game-types grow, this would
> become pretty demanding in CPU time and number of calls to Couch.
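(For concreteness, the quoted view could be stored as a design document along these lines; the name "_design/scores" and the field names are just illustrative, not something from your app:)

```javascript
// Sketch of a design document implementing the view quoted above.
// "_design/scores", "by_try", and the doc field names are made up.
var designDoc = {
  _id: "_design/scores",
  views: {
    by_try: {
      // Emits [user, game, trynumber] -> score
      map: "function (doc) { emit([doc.user, doc.game, doc.trynumber], doc.score); }",
      // Built-in Erlang reduce: sums the emitted scores
      reduce: "_sum"
    }
  }
};
```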
I thought about a timestamp first, but you said you wanted the last 10, and I wanted to be able to pipe the request through reduce. With timestamps you have to do two requests to get the current game and the 10 prior, or a single request without reducing.

At the risk of stating the obvious: if you ask for limit=10 in a request *and* the request goes through reduce, you will get 10 reduced values, not 10 values reduced to one.

By using an integer value, you can do the simple request I settled on above (give me ten values, no reduce), OR, in a real application, you probably know the current last game number, so you can pass start and end keys (the end key being ten less than the current game number) and force just those 10 results through reduce.

Also, I really don't think there is any load at all on the CPU with this approach. Or, to be more accurate, no more than any active database processing a view. Again, apologies for stating the obvious, but CouchDB does incremental updates of views, so if you keep adding data, it only processes the new data. Once you have processed the data into a view, querying it (without reduce) takes almost no CPU. Reducing can be expensive if you do it in JavaScript, but much less so if you stick with the built-in native Erlang reduce functions (_sum, _count, etc.).

But one thing to keep in mind is that you can probably use multiple databases. Is there any reason you *have* to put all the games and all the users in a single database? Can you have a database per game? Or a database per user? Then the views are only updated when a particular user is adding and querying results.

I do data collection from sensors with CouchDB. I use one database per sensor per year of data, roughly a thousand or so DBs per year. I do this so I can eventually spread the pain over multiple machines (I haven't really had to yet), and because Erlang does a really good job of maxing out a multicore machine if it has a lot of jobs to run.
With just one database, I was only getting two cores busy, but with thousands (when processing historical data), all 8 cores on my two servers were very busy.

I also keep one database to do aggregations across all the detectors at the hourly level (I have 30-second data). Each db has a view that generates the hourly summaries I need, and I have a node process that polls all the databases at the end of each day to collect the hourly documents and write them to the collation database, which has other views. Kind of a manual version of your chained map-reduce project (incarnate, right?), but it suits the data better than automating it.

For your app, suppose there are a million users all playing any of a thousand games. If every user posts a new score every second, ideally I would only want to make each player wait for their own data to get processed, not the data from the other 999,999 players. So that calls for a database per user. If users have to wait for Erlang to finish other jobs before it can schedule their job on a CPU, then you need more CPUs. With just one database you don't get that choice: you have to wait for all of the data to get processed (unless you allow stale views).

As with my app, across users you can have a separate database that queries each user's db for their last 10 once every minute or so (the changes feed would probably work really well here: a change registers a callback to get data from that database when the periodic process runs) and updates a collating db with username_game_average-type documents, to get the user's standings compared to other players.

Regards,
james

PS: sorry for the long reply. I've had too much coffee today.
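(A sketch of that collation step. The document shape and the _id scheme here are assumptions for illustration, not an existing API; the periodic process would write one such document per user per game:)

```javascript
// Hypothetical builder for a "username_game_average" collation document,
// computed from a user's last-10 view rows ({ key: [...], value: score }).
function collationDoc(user, game, lastTenRows) {
  var sum = lastTenRows.reduce(function (acc, row) { return acc + row.value; }, 0);
  return {
    _id: [user, game, "average"].join("_"), // made-up _id scheme
    user: user,
    game: game,
    games_counted: lastTenRows.length,
    average: lastTenRows.length ? sum / lastTenRows.length : null
  };
}

var doc = collationDoc("alice", "pong", [
  { key: ["alice", "pong", 2], value: 60 },
  { key: ["alice", "pong", 1], value: 40 }
]);
console.log(doc._id, doc.average); // alice_pong_average 50
```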
