Re: Finding cut-off points

Steven A Robenalt Tue, 01 Apr 2014 08:55:36 -0700

Hi Kasper,

I'd suggest taking a look at Spark, Storm, or Samza (all are Apache
projects) for a possible approach. Depending on your needs and your
existing infrastructure, one of those may work better than others for you.


Steve





On Tue, Apr 1, 2014 at 2:51 AM, Kasper Petersen <[email protected]> wrote:

> Hi,
>
> I have a large amount (can be >100 million) of (id uuid, score int)
> entries in Cassandra. I need to, at regular intervals of let's say 30-60
> minutes, find the cut-off points for the score needed to be in the top
> 0.1%, 33% and 66% of all scores.
>
> What would a good approach be to this problem?
>
> All the data won't fit into memory, so regular sorting on the
> application side won't be possible (unless I do it using a merge sort
> algorithm with files, which feels like a bad solution).
>
> Iterating over the data once and building a histogram would cut down the
> required memory usage quite significantly, but I'm afraid this could still
> end up being "too big". Are there any easier ways to do these computations?
>
> Lastly I've thought about the possibility to use analytics tools to
> compute these things for me - would setting up hadoop and/or pig help me do
> this in a manner that could make the results accessible to the application
> servers once done? I've had a hard time finding any guides on how to set it
> up and what exactly I'd be able to do with it afterwards. Any pointers
> would be much appreciated.
>
>
> Best regards,
> Kasper
>
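
For what it's worth, the single-pass histogram idea from the quoted message
could be sketched roughly as below. This is only an illustration, not anything
proposed in the thread: the function names build_histogram and cutoffs are
made up for the example, and it assumes the scores arrive as a plain Python
iterable (in practice they would be streamed from Cassandra with a paged
query, which is not shown). Memory use grows with the number of distinct
score values, not with the number of rows.

```python
import math
from collections import Counter


def build_histogram(scores):
    """Count occurrences of each integer score in a single pass.

    `scores` is assumed to be any iterable of ints (hypothetical input;
    a real job would stream rows from the database instead).
    """
    hist = Counter()
    for score in scores:
        hist[score] += 1
    return hist


def cutoffs(hist, fractions):
    """Return {fraction: lowest score that is still in the top fraction}.

    For each fraction p, the top p of N entries is taken to be the
    ceil(p * N) highest-scoring entries (at least one entry).
    """
    total = sum(hist.values())
    if total == 0:
        raise ValueError("histogram is empty")

    # Number of entries making up each top group, smallest group first.
    targets = sorted((max(1, math.ceil(p * total)), p) for p in fractions)

    result = {}
    seen = 0      # entries counted so far, walking from the highest score down
    nxt = 0       # index of the next target to satisfy
    for score in sorted(hist, reverse=True):
        seen += hist[score]
        while nxt < len(targets) and seen >= targets[nxt][0]:
            result[targets[nxt][1]] = score
            nxt += 1
        if nxt == len(targets):
            break
    return result


if __name__ == "__main__":
    # Toy data only; the fractions mirror the ones asked about in the thread.
    hist = build_histogram(range(1, 100001))
    print(cutoffs(hist, [0.001, 0.33, 0.66]))
```

If the number of distinct scores is itself too large, the same walk works on
fixed-width score buckets instead of exact values, at the cost of only
knowing each cut-off to within one bucket width.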



-- 
Steve Robenalt
Software Architect
HighWire | Stanford University
425 Broadway St, Redwood City, CA 94063

[email protected]
http://highwire.stanford.edu
