Hey Lucas,

Could you provide some rough pseudo-code for your job? One question is: are
you loading the data from Cassandra every time you perform an action, or do
you cache() the dataset first? If you have a dataset that's already in an
RDD, it's very hard for me to imagine that filters and aggregations could
possibly take 4 minutes... should be more like seconds.
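
To make the question concrete, here is a rough sketch of the pattern I mean,
written against the PySpark RDD API. It assumes `events` is an RDD of dicts
that you have already built from Cassandra (the loading step is not shown),
and the field names "type" and "user" are made up for illustration, so adjust
to whatever your job actually does.

```python
def run_analytics(events):
    """Run several actions against one cached RDD.

    `events` is assumed to be an RDD of dicts already loaded from
    Cassandra; the "type" and "user" fields are hypothetical.
    """
    # Mark the RDD for in-memory caching. This is lazy: nothing is read yet.
    events.cache()

    # First action: reads from the source and populates the cache.
    total = events.count()

    # Later actions can reuse the cached partitions instead of going
    # back to Cassandra (for whatever fraction actually fit in memory).
    clicks = events.filter(lambda e: e["type"] == "click").count()
    per_user = (
        events.map(lambda e: (e["user"], 1))
        .reduceByKey(lambda a, b: a + b)
        .collect()
    )
    return total, clicks, dict(per_user)
```

If instead each action rebuilds the RDD from Cassandra, every count pays the
full read cost again, which would look a lot like what you're describing.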

- Patrick


On Mon, Oct 28, 2013 at 9:11 AM, Lucas Fernandes Brunialti <
[email protected]> wrote:

> Hello,
>
> We're using Spark to run analytics and ML jobs against Cassandra. Our
> analytics jobs are simple (filters and counts) and we're trying to improve
> the performance. These jobs take around 4 minutes querying 160 GB (the size
> of our dataset). We use 5 workers and 1 master, EC2 m1.xlarge with 8 GB
> of JVM heap.
>
> We tried to increase the JVM heap to 12 GB, but we saw no gain in
> performance. We're using CACHE_ONLY (after some tests we found it works
> better), but it's not caching everything, just around 1000 of 2500 blocks.
> Maybe the cache is not what limits performance, just the Cassandra I/O (?)
>
> I saw that people from Ooyala can do analytics jobs in milliseconds (
> http://www.youtube.com/watch?v=6kHlArorzvs), any advice?
>
> Appreciate the help!
>
> Lucas.
>
> --
>
> Lucas Fernandes Brunialti
>
> *Dev/Ops Software Engineer*
>
> *+55 9 6512 4514*
>
> *[email protected]* <[email protected]>
>
