I just saw a video on how ShareThis implements its analytics (on MongoDB, though) using buffer queues: http://www.10gen.com/presentation/mongosv2010/sharethis
Coming back to the problem at hand, I did a quick test:

- _revs_limit = 1000 (the default)
- the counter started from 0 and was incremented to 44,000
- I added a couple of attachments to the document (~80 KB) to simulate a real-world document size

I managed to generate 1 GB of disk usage in the database. 1 GB of disk usage just for incrementing a counter is a bit too large...

--
Mayank
http://adomado.com

On Mon, Sep 12, 2011 at 11:32 PM, Sam Bisbee <[email protected]> wrote:
> Hi,
>
> The first part of my answer is not CouchDB specific. All of the big
> analytics systems that I have ever built or seen at my clients' have
> used queues. Since, as you know, analytics can have such a high write
> rate, you would be crazy to try to persist each transaction to disk
> (which is what databases do). Instead, send them to a queue where they
> can sit and you can consume them at your leisure.
>
> If you don't want to host your own queue, then take a look at Amazon
> Simple Queue Service.
>
> Now, for the CouchDB part.
>
> Have each transaction be its own document. Yes, even if you are
> tracking the same type of action for the same resource (URL). You no
> longer live in a locking world, so this is the most straightforward
> approach. Now you can build views that use actions, resources, or
> whatever other piece of data you want. More information at
> http://guide.couchdb.org/draft/recipes.html
>
> Given the write rate of analytics systems, you would be right to worry
> about view build time. That's why you have the queue: you can control
> the write rate into CouchDB. You can also just build views once per
> night (or whatever), and ALWAYS query with ?stale=ok so you don't kick
> off a view build at read time.
>
> There are a bunch more land mines, but these are the basics and should
> get you on your way.
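For anyone reading along: the "send hits to a queue, then write each transaction as its own document" pattern Sam describes can be sketched roughly like this. This is a minimal sketch, not a production consumer -- the in-memory queue, the field names, and the `events` database name are all assumptions, and the actual POST to CouchDB's `_bulk_docs` endpoint is left as a comment since it needs a running server:

```python
import json
import time
import uuid
from collections import deque

# Stand-in for a real queue (Amazon SQS, RabbitMQ, etc.) -- assumption
# for this sketch so it runs without any external service.
queue = deque()

def track(action, resource):
    """Producer side: push the raw event onto the queue instead of
    persisting every single hit to the database."""
    queue.append({"action": action, "resource": resource, "ts": time.time()})

def drain(batch_size=100):
    """Consumer side: pull events off the queue at our own pace and turn
    each one into its own CouchDB document, batched for _bulk_docs."""
    docs = []
    while queue and len(docs) < batch_size:
        event = queue.popleft()
        event["_id"] = uuid.uuid4().hex  # one document per transaction
        docs.append(event)
    return {"docs": docs}

# Example: simulate a few hits, then build one _bulk_docs payload.
track("pageview", "/index.html")
track("pageview", "/about.html")
payload = json.dumps(drain())
# A real consumer would now POST `payload` to
#   http://localhost:5984/events/_bulk_docs
# with Content-Type: application/json.
```

Because documents are never updated in place here, there is no long revision chain per counter and no locking, which is the point of Sam's advice.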
> :)
>
> --
> Sam Bisbee
>
> On Thu, Jun 2, 2011 at 5:34 AM, [email protected] <[email protected]> wrote:
> > Hi everyone,
> >
> > I came across CouchDB a couple of weeks back and got really excited by
> > the fundamental change it brings by simply taking the app server out
> > of the picture. Must say, kudos to the dev team!
> >
> > I am planning to write a quick analytics solution for my website --
> > something along the lines of Google Analytics -- which will measure
> > certain properties of the visitors hitting our site.
> >
> > Since this is my first attempt at a JSON-style document store, I
> > thought I'd share the architecture and see if I can make it better
> > (or correct my mistakes before I make them) :-)
> >
> > - For each unique visitor, create a document with his session_id as
> >   the doc._id
> > - For each property I need to track about this visitor, create a
> >   key-value pair in the doc created for this visitor
> > - If the visitor is a returning user, use the session_id to re-open
> >   his doc and keep modifying the properties
> > - At the end of each calculation time period (say 1 hour or 24
> >   hours), run a cron job which fires the map-reduce jobs by
> >   requesting the views over curl/HTTP.
> >
> > A couple of questions based on the above architecture...
> > We see concurrent traffic ranging from 2k to 5k users.
> > - Would a CouchDB instance running on a good machine (say a High-CPU
> >   EC2 medium instance) work well with simultaneous writes happening
> >   (visitors browsing, properties changing or getting created)?
> > - With a couple of million documents, would I be able to process my
> >   views without causing any significant impact to write performance?
> >
> > I think my questions might be biased by the fact that I come from a
> > MySQL/Rails background... :-)
> >
> > Let me know what you guys think about this.
> >
> > Thanks in advance,
> > --
> > Mayank
> > http://adomado.com
> >
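One more note on Sam's "build views nightly, always read with ?stale=ok" advice -- it boils down to issuing two kinds of requests against the same view. A small sketch; the host, database name, and design-document/view names here are made up for illustration, and the code only builds the URLs rather than sending the requests:

```python
from urllib.parse import urlencode

COUCH = "http://localhost:5984"   # assumed local CouchDB instance
DB = "analytics"                  # hypothetical database name

def view_url(ddoc, view, **params):
    """Build a CouchDB view query URL from a design doc and view name,
    appending any query parameters (stale, group, etc.)."""
    qs = urlencode(params)
    return f"{COUCH}/{DB}/_design/{ddoc}/_view/{view}" + (f"?{qs}" if qs else "")

# What the website's read path should always use: stale=ok means a read
# never kicks off an index build.
read_url = view_url("stats", "hits_by_resource", stale="ok", group="true")

# What the nightly cron job requests (no stale=ok), forcing the index
# to catch up once per night instead of on every read.
build_url = view_url("stats", "hits_by_resource")
```

The cron job can hit `build_url` with plain curl; readers only ever see `read_url`, so view build cost is paid on the schedule you choose, not at request time.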
