I just saw a video on how ShareThis implements its analytics (on MongoDB, though) using buffer queues: http://www.10gen.com/presentation/mongosv2010/sharethis
Coming back to the problem at hand, I did a quick test:

- _revs_limit = 1000 (the default)
- the counter started from 0 and was incremented to 44,000
- I added a couple of attachments to the document (~80 KB) to simulate a real-world document size

I managed to generate 1 GB of disk usage in the database. 1 GB of disk usage just for incrementing a counter is a bit too large...

--
Mayank
http://adomado.com

On Mon, Sep 12, 2011 at 11:32 PM, Sam Bisbee <[email protected]> wrote:
> Hi,
>
> The first part of my answer is not CouchDB specific. All of the big
> analytics systems that I have ever built or seen at my clients' have
> used queues. Since, as you know, analytics can have such a high write
> rate, you would be crazy to try to persist each transaction to disk
> (which is what databases do). Instead, send them to a queue where they
> can sit and you can consume them at your leisure.
>
> If you don't want to host your own queue, then take a look at Amazon
> Simple Queue Service.
>
> Now, for the CouchDB part.
>
> Have each transaction be its own document. Yes, even if you are
> tracking the same type of action for the same resource (URL). You no
> longer live in a locking world, so this is the most straightforward
> approach. Now you can build views that use actions, resources, or
> whatever other piece of data you want. More information at
> http://guide.couchdb.org/draft/recipes.html
>
> Given the write rate of analytics systems, you would be right to worry
> about view build time. That's why you have the queue: you can control
> the write rate into CouchDB. You can also just build views once per
> night (or whatever), and ALWAYS query with ?stale=ok so you don't kick
> off a view build at read time.
>
> There are a bunch more land mines, but these are the basics and should
> get you on your way.
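For anyone reading along: the "send hits to a queue, then write each transaction as its own document" pattern Sam describes can be sketched roughly like this. This is a minimal sketch, not a production consumer -- the in-memory queue, the field names, and the `events` database name are all assumptions, and the actual POST to CouchDB's `_bulk_docs` endpoint is left as a comment since it needs a running server:

```python
import json
import time
import uuid
from collections import deque

# Stand-in for a real queue (Amazon SQS, RabbitMQ, etc.) -- assumption
# for this sketch so it runs without any external service.
queue = deque()

def track(action, resource):
    """Producer side: push the raw event onto the queue instead of
    persisting every single hit to the database."""
    queue.append({"action": action, "resource": resource, "ts": time.time()})

def drain(batch_size=100):
    """Consumer side: pull events off the queue at our own pace and turn
    each one into its own CouchDB document, batched for _bulk_docs."""
    docs = []
    while queue and len(docs) < batch_size:
        event = queue.popleft()
        event["_id"] = uuid.uuid4().hex  # one document per transaction
        docs.append(event)
    return {"docs": docs}

# Example: simulate a few hits, then build one _bulk_docs payload.
track("pageview", "/index.html")
track("pageview", "/about.html")
payload = json.dumps(drain())
# A real consumer would now POST `payload` to
#   http://localhost:5984/events/_bulk_docs
# with Content-Type: application/json.
```

Because documents are never updated in place here, there is no long revision chain per counter and no locking, which is the point of Sam's advice.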
> :)
>
> --
> Sam Bisbee
>
> On Thu, Jun 2, 2011 at 5:34 AM, [email protected] <[email protected]> wrote:
> > Hi everyone,
> >
> > I came across CouchDB a couple of weeks back and got really excited by
> > the fundamental change it brings by simply taking the app server out
> > of the picture. Must say, kudos to the dev team!
> >
> > I am planning to write a quick analytics solution for my website --
> > something along the lines of Google Analytics -- which will measure
> > certain properties of the visitors hitting our site.
> >
> > Since this is my first attempt at a JSON-style document store, I
> > thought I'd share the architecture and see if I can make it better
> > (or correct my mistakes before I make them) :-)
> >
> > - For each unique visitor, create a document with his session_id as
> >   the doc._id
> > - For each property I need to track about this visitor, create a
> >   key-value pair in the doc created for this visitor
> > - If the visitor is a returning user, use the session_id to re-open
> >   his doc and keep modifying the properties
> > - At the end of each calculation time period (say 1 hour or 24
> >   hours), run a cron job which fires the map-reduce jobs by
> >   requesting the views over curl/HTTP.
> >
> > A couple of questions based on the above architecture...
> > We see concurrent traffic ranging from 2k to 5k users.
> > - Would a CouchDB instance running on a good machine (say a High-CPU
> >   EC2 medium instance) work well with simultaneous writes happening
> >   (visitors browsing, properties changing or getting created)?
> > - With a couple of million documents, would I be able to process my
> >   views without causing any significant impact to write performance?
> >
> > I think my questions might be biased by the fact that I come from a
> > MySQL/Rails background... :-)
> >
> > Let me know what you guys think about this.
> >
> > Thanks in advance,
> > --
> > Mayank
> > http://adomado.com
> >
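One more note on Sam's "build views nightly, always read with ?stale=ok" advice -- it boils down to issuing two kinds of requests against the same view. A small sketch; the host, database name, and design-document/view names here are made up for illustration, and the code only builds the URLs rather than sending the requests:

```python
from urllib.parse import urlencode

COUCH = "http://localhost:5984"   # assumed local CouchDB instance
DB = "analytics"                  # hypothetical database name

def view_url(ddoc, view, **params):
    """Build a CouchDB view query URL from a design doc and view name,
    appending any query parameters (stale, group, etc.)."""
    qs = urlencode(params)
    return f"{COUCH}/{DB}/_design/{ddoc}/_view/{view}" + (f"?{qs}" if qs else "")

# What the website's read path should always use: stale=ok means a read
# never kicks off an index build.
read_url = view_url("stats", "hits_by_resource", stale="ok", group="true")

# What the nightly cron job requests (no stale=ok), forcing the index
# to catch up once per night instead of on every read.
build_url = view_url("stats", "hits_by_resource")
```

The cron job can hit `build_url` with plain curl; readers only ever see `read_url`, so view build cost is paid on the schedule you choose, not at request time.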
