Well, he did mention that not everything was staying in the cache, so even
with an ongoing job they're probably re-reading from Cassandra. It
sounds to me like the first issue to address is why things are being
evicted.
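
As a rough sanity check on the eviction question, here is a small
back-of-envelope sketch using only the numbers quoted in this thread. The
0.66 storage fraction is my assumption about how much of each heap is
available for cached blocks (check the configuration for your Spark
version), and the helper names are made up for illustration. It also
treats the 160 GB figure as the in-memory size, which may well be off.

```python
# Back-of-envelope cache sizing using the figures quoted in this thread.
# STORAGE_FRACTION is an assumption, not something stated in the thread.

DATASET_GB = 160         # dataset size reported by Lucas
WORKERS = 5              # worker count reported by Lucas
CACHED_BLOCKS = 1000     # blocks that stayed in the cache
TOTAL_BLOCKS = 2500      # total blocks in the dataset
STORAGE_FRACTION = 0.66  # assumed share of each heap usable for cached blocks


def cache_capacity_gb(heap_gb, workers=WORKERS, fraction=STORAGE_FRACTION):
    """Approximate cluster-wide memory available for cached blocks."""
    return heap_gb * workers * fraction


def cached_share(cached=CACHED_BLOCKS, total=TOTAL_BLOCKS):
    """Fraction of blocks that were reported as cached."""
    return cached / total


if __name__ == "__main__":
    for heap_gb in (8, 12):
        capacity = cache_capacity_gb(heap_gb)
        print(f"{heap_gb} GB heap: ~{capacity:.1f} GB of cache "
              f"for a {DATASET_GB} GB dataset")
    print(f"Share of blocks cached: {cached_share():.0%}")
```

Under that assumption the cluster has roughly 26 GB of cache space with
8 GB heaps and roughly 40 GB with 12 GB heaps, both well below 160 GB, so
some eviction would be expected at either heap size.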

-Ewen

-----
Ewen Cheslack-Postava
StraightUp | http://readstraightup.com
[email protected]
(201) 286-7785


On Mon, Oct 28, 2013 at 9:24 AM, Patrick Wendell <[email protected]> wrote:

> Hey Lucas,
>
> Could you provide some rough pseudo-code for your job? One question is:
> are you loading the data from Cassandra every time you perform an action,
> or do you cache() the dataset first? If you have a dataset that's already
> in an RDD, it's very hard for me to imagine that filters and aggregations
> could possibly take 4 minutes... should be more like seconds.
>
> - Patrick
>
>
> On Mon, Oct 28, 2013 at 9:11 AM, Lucas Fernandes Brunialti <
> [email protected]> wrote:
>
>> Hello,
>>
>> We're using Spark to run analytics and ML jobs against Cassandra. Our
>> analytics jobs are simple (filters and counts) and we're trying to improve
>> their performance: these jobs take around 4 minutes to query 160 GB (the
>> size of our dataset). We use 5 workers and 1 master, all EC2 m1.xlarge
>> with 8 GB of JVM heap.
>>
>> We tried increasing the JVM heap to 12 GB, but saw no gain in
>> performance. We're using CACHE_ONLY (after some tests we found it works
>> better), but it's not caching everything, just around 1000 of 2500 blocks.
>> Maybe the cache is not what affects performance, just the Cassandra IO (?)
>>
>> I saw that people from Ooyala can run analytics jobs in milliseconds (
>> http://www.youtube.com/watch?v=6kHlArorzvs). Any advice?
>>
>> Appreciate the help!
>>
>> Lucas.
>>
>> --
>>
>> Lucas Fernandes Brunialti
>>
>> *Dev/Ops Software Engineer*
>>
>> *+55 9 6512 4514*
>>
>> *[email protected]* <[email protected]>
>>
>
>
