Well, he did mention that not everything was staying in the cache, so even with an ongoing job they're probably re-reading from Cassandra. It sounds to me like the first issue to address is why things are being evicted.
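
For a rough sense of scale, here is a back-of-the-envelope sketch in Python that uses only the figures from Lucas's message (1000 of 2500 blocks cached, 5 workers, 8 GB or 12 GB heaps, a 160 GB dataset). The function name and the returned keys are made up for illustration. It also assumes the in-memory size is comparable to the 160 GB figure, which may well not hold, and it ignores the fact that only part of each heap is available for cached blocks.

```python
def cache_summary(cached_blocks, total_blocks, workers, heap_gb, dataset_gb):
    """Summarize how much of the dataset is cached and whether it could fit at all.

    All inputs are plain numbers; nothing here talks to Spark or Cassandra.
    """
    # Share of blocks that stayed in the cache.
    cached_share = cached_blocks / total_blocks
    # Share of blocks that must be re-read from the source on every action.
    reread_share = (total_blocks - cached_blocks) / total_blocks
    # Upper bound on cache space: the whole heap of every worker.
    total_heap_gb = workers * heap_gb
    return {
        "cached_share": cached_share,
        "reread_share": reread_share,
        "total_heap_gb": total_heap_gb,
        # Assumption: in-memory size is roughly the quoted dataset size.
        "could_fit_in_heap": total_heap_gb >= dataset_gb,
    }


# Figures quoted in the thread, with the original and the increased heap size.
for heap_gb in (8, 12):
    print(heap_gb, cache_summary(1000, 2500, workers=5, heap_gb=heap_gb, dataset_gb=160))
```

Under those assumptions, about 60% of the blocks would be re-read on every action, and the combined heap stays well below 160 GB with either heap size, which would be consistent with the heap increase making no difference.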

-Ewen

-----
Ewen Cheslack-Postava
StraightUp | http://readstraightup.com
[email protected]
(201) 286-7785


On Mon, Oct 28, 2013 at 9:24 AM, Patrick Wendell <[email protected]> wrote:

> Hey Lucas,
>
> Could you provide some rough pseudo-code for your job? One question is:
> are you loading the data from cassandra every time you perform an action,
> or do you cache() the dataset first? If you have a dataset that's already
> in an RDD, it's very hard for me to imagine that filters and aggregations
> could possibly take 4 minutes... should be more like seconds.
>
> - Patrick
>
>
> On Mon, Oct 28, 2013 at 9:11 AM, Lucas Fernandes Brunialti <
> [email protected]> wrote:
>
>> Hello,
>>
>> We're using Spark to run analytics and ML jobs against Cassandra. Our
>> analytics jobs are simple (filters and counts) and we're trying to improve
>> the performance, these jobs take around 4 minutes querying 160Gb (size of
>> our dataset). Also, we use 5 workers and 1 master, EC2 m1.xlarge with 8gb
>> in jvm heap.
>>
>> We tried to increase the jvm heap to 12gb, but we had no gain in
>> performance. We're using CACHE_ONLY (after some tests we've found it
>> better), also it's not caching everything, just around 1000 of 2500 blocks.
>> Maybe the cache is not impacting on performance, just the cassandra IO (?)
>>
>> I saw that people from ooyala can do analytics jobs in milliseconds (
>> http://www.youtube.com/watch?v=6kHlArorzvs), any advice?
>>
>> Appreciate the help!
>>
>> Lucas.
>>
>> --
>>
>> Lucas Fernandes Brunialti
>>
>> *Dev/Ops Software Engineer*
>>
>> *+55 9 6512 4514*
>>
>> *[email protected]* <[email protected]>
>>
>
>
