Hey Lucas,

Could you provide some rough pseudo-code for your job? One question is: are you loading the data from Cassandra every time you perform an action, or do you cache() the dataset first? If you have a dataset that's already in an RDD, it's very hard for me to imagine that filters and aggregations could possibly take 4 minutes... it should be more like seconds.
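To make the question concrete, here is a toy sketch in plain Python of the difference I'm asking about. It is not Spark code: `LazyDataset`, `make_loader` and `run_actions` are made-up stand-ins, and the loader only pretends to be a full scan of the Cassandra table. The point is just that with lazy evaluation every action re-reads the source unless the dataset is cached first.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (NOT the Spark API)."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument callable that produces the rows
        self._want_cache = False     # set by cache(); rows are kept after first use
        self._cached = None

    def cache(self):
        # Only marks the dataset; nothing is loaded until an action runs.
        self._want_cache = True
        return self

    def _rows(self):
        if self._cached is not None:
            return self._cached
        rows = self._compute()
        if self._want_cache:
            self._cached = rows
        return rows

    def filter(self, pred):
        # Lazy transformation: the parent is only evaluated when an action runs.
        return LazyDataset(lambda: [r for r in self._rows() if pred(r)])

    def count(self):
        # Action: forces evaluation of the whole lineage.
        return len(self._rows())


def make_loader():
    """Hypothetical stand-in for an expensive scan of the source table."""
    calls = {"n": 0}

    def load():
        calls["n"] += 1              # count how often the source is re-read
        return list(range(1000))

    return load, calls


def run_actions(dataset):
    # Two independent "analytics jobs" over the same dataset.
    evens = dataset.filter(lambda x: x % 2 == 0).count()
    large = dataset.filter(lambda x: x >= 900).count()
    return evens, large


if __name__ == "__main__":
    load, calls = make_loader()
    run_actions(LazyDataset(load))
    print("source scans without cache:", calls["n"])

    load, calls = make_loader()
    run_actions(LazyDataset(load).cache())
    print("source scans with cache:", calls["n"])
```

If your job looks like the first case, each filter/count pays the full Cassandra read again, which would explain minutes rather than seconds.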
- Patrick

On Mon, Oct 28, 2013 at 9:11 AM, Lucas Fernandes Brunialti <[email protected]> wrote:
> Hello,
>
> We're using Spark to run analytics and ML jobs against Cassandra. Our
> analytics jobs are simple (filters and counts) and we're trying to improve
> the performance, these jobs takes around 4 minutes querying 160Gb (size of
> our dataset). Also, we use 5 workers and 1 master, EC2 m1.xlarge with 8gb
> in jvm heap.
>
> We tried to increase the jvm heap to 12gb, but we had no gain in
> performance. We're using CACHE_ONLY (after some tests we've found it
> better), also it's not caching everything, just around 1000 of 2500 blocks.
> Maybe the cache is not impacting on performance, just the cassandra IO (?)
>
> I saw that people from ooyala can do analytics jobs in milliseconds (
> http://www.youtube.com/watch?v=6kHlArorzvs), any advices?
>
> Appreciate the help!
>
> Lucas.
>
> --
>
> Lucas Fernandes Brunialti
>
> *Dev/Ops Software Engineer*
>
> *+55 9 6512 4514*
>
> *[email protected]* <[email protected]>
