Hi Kasper,

I'd suggest taking a look at Spark, Storm, or Samza (all Apache projects) for a possible approach. Depending on your needs and your existing infrastructure, one of them may suit you better than the others.
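
On the histogram idea from your message: since the score is an int, a single pass that counts entries per distinct score needs memory proportional to the number of distinct scores, not the number of rows. Below is a minimal sketch of that idea, not a tested production approach. The function names are made up for illustration, `scores` stands for any iterable of ints (how you stream it out of Cassandra is left open), and I am assuming "cut-off for the top p" means the score of the entry ranked ceil(p * N) from the top.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Sequence


def build_histogram(scores: Iterable[int]) -> Counter:
    """Count how many entries have each score, in one pass over the data."""
    hist: Counter = Counter()
    for score in scores:
        hist[score] += 1
    return hist


def top_fraction_cutoffs(hist: Counter, fractions: Sequence[float]) -> Dict[float, int]:
    """For each fraction p, return the lowest score that is still in the top p.

    Assumption: the cut-off is the score of the entry ranked ceil(p * N)
    when all N entries are ordered from highest to lowest score.
    """
    total = sum(hist.values())
    if total == 0:
        raise ValueError("histogram is empty")

    # Convert each fraction to a rank from the top, smallest rank first.
    targets = sorted((max(1, math.ceil(p * total)), p) for p in fractions)

    cutoffs: Dict[float, int] = {}
    seen = 0      # entries counted so far, walking from the highest score down
    nxt = 0       # index of the next unresolved target
    for score in sorted(hist, reverse=True):
        seen += hist[score]
        while nxt < len(targets) and seen >= targets[nxt][0]:
            cutoffs[targets[nxt][1]] = score
            nxt += 1
        if nxt == len(targets):
            break
    return cutoffs


if __name__ == "__main__":
    # Toy data standing in for the (id, score) rows.
    sample_scores = [10, 20, 20, 30, 40, 50, 50, 60]
    hist = build_histogram(sample_scores)
    print(top_fraction_cutoffs(hist, [0.25, 0.5]))
```

If the scores span so many distinct values that even this is too big, the same walk works over fixed-width buckets instead of exact scores, at the cost of cut-offs that are only accurate to the bucket width.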
Steve

On Tue, Apr 1, 2014 at 2:51 AM, Kasper Petersen <[email protected]> wrote:

> Hi,
>
> I have a large amount (can be >100 million) of (id uuid, score int)
> entries in Cassandra. I need to, at regular intervals of let's say 30-60
> minutes, find the cut-off points for the score needed to be in the top
> 0.1%, 33% and 66% of all scores.
>
> What would a good approach be to this problem?
>
> All the data won't fit into memory, thus using regular sorting on the
> application side won't be possible (unless I do it using a merge sort
> algorithm with files, which feels like a bad solution).
>
> Iterating over the data once and building a histogram would cut down the
> required memory usage quite significantly, but I'm afraid this could still
> end up being "too big". Are there any easier ways to do these computations?
>
> Lastly, I've thought about the possibility of using analytics tools to
> compute these things for me - would setting up hadoop and/or pig help me do
> this in a manner that could make the results accessible to the application
> servers once done? I've had a hard time finding any guides on how to set it
> up and what exactly I'd be able to do with it afterwards. Any pointers
> would be much appreciated.
>
>
> Best regards,
> Kasper
>

--
Steve Robenalt
Software Architect
HighWire | Stanford University
425 Broadway St, Redwood City, CA 94063
[email protected]
http://highwire.stanford.edu
