Gerard,

Strings in particular are very inefficient because they're stored in a two-byte format by the JVM. If you use the Kryo serializer and StorageLevel.MEMORY_ONLY_SER, then Kryo stores Strings in UTF-8, which for ASCII-like strings takes half the space.
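A quick way to see the difference on a plain JVM, with no Spark involved (the string content below is illustrative, modeled on the benchmark rows):

```scala
// Compare the character payload of a JVM String (UTF-16, 2 bytes per char)
// with its UTF-8 encoding, which is what a UTF-8-based serializer stores.
val s = "devy12345,aggr,1000,sum" // ASCII-only, like the rows in the gist

val utf16Bytes = s.length * 2               // char[] payload inside the JVM
val utf8Bytes  = s.getBytes("UTF-8").length // bytes after UTF-8 encoding

println(s"UTF-16 payload: $utf16Bytes bytes, UTF-8: $utf8Bytes bytes")
// For pure-ASCII text, UTF-8 is exactly half the UTF-16 payload.
```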
Andrew

On Tue, Jun 17, 2014 at 8:54 AM, Gerard Maas <gerard.m...@gmail.com> wrote:
> Hi Rohit,
>
> Thanks a lot for looking at this. The intention of calculating the data
> upfront is to benchmark only the storage time in records/sec, eliminating
> the generation factor (which will be different in the real scenario,
> reading from HDFS).
> I used a profiler today and indeed it's not the storage part but the
> generation that's bloating the memory. Objects in memory take surprisingly
> more space than one would expect based on the data they hold. In my case it
> was 2.1x the size of the original data.
>
> Now that we are talking about this, do you have some figures on how
> Calliope compares -performance wise- to a classic Cassandra driver
> (DataStax / Astyanax)? That would be awesome.
>
> Thanks again!
>
> -kr, Gerard.
>
> On Tue, Jun 17, 2014 at 4:27 PM, tj opensource <opensou...@tuplejump.com> wrote:
>> Dear Gerard,
>>
>> I just tried the code you posted in the gist
>> (https://gist.github.com/maasg/68de6016bffe5e71b78c) and it does give an
>> OOM. It is caused by the data being generated locally and then
>> parallelized:
>>
>>     val entries = for (i <- 1 to total) yield {
>>       Array(s"devy$i", "aggr", "1000", "sum", (i to i+10).mkString(","))
>>     }
>>
>>     val rdd = sc.parallelize(entries, 8)
>>
>> This will generate all the data on the local system and then try to
>> partition it.
>>
>> Instead, we should parallelize the keys (i <- 1 to total) and generate
>> the data in the map tasks. This is *closer* to what you will get if you
>> distribute a file on a DFS like HDFS/SnackFS.
>> I have made the change in the script here
>> (https://gist.github.com/milliondreams/aac52e08953949057e7d):
>>
>>     val rdd = sc.parallelize(1 to total, 8).map(i =>
>>       Array(s"devy$i", "aggr", "1000", "sum", (i to i+10).mkString(",")))
>>
>> I was able to insert 50M records using just over 350M RAM. Attaching the
>> log and screenshot.
>>
>> Let me know if you still face this issue... we can do a screen share and
>> resolve the issue there.
>>
>> And thanks for using Calliope. I hope it serves your needs.
>>
>> Cheers,
>> Rohit
>>
>> On Mon, Jun 16, 2014 at 9:57 PM, Gerard Maas <gerard.m...@gmail.com> wrote:
>>> Hi,
>>>
>>> I've been doing some testing with Calliope as a way to do batch loads
>>> from Spark into Cassandra.
>>> My initial results are promising on the performance side, but worrisome
>>> on the memory footprint side.
>>>
>>> I'm generating N records of about 50 bytes each and using the UPDATE
>>> mutator to insert them into C*. I get an OOM if my memory is below 1GB
>>> per million records, or about 50MB of raw data (without counting any
>>> RDD/structural overhead). (See code [1])
>>>
>>> (So, to avoid confusion: e.g. I need 4GB RAM to save 4M 50-byte records
>>> to Cassandra.) That's an order of magnitude more than the raw data.
>>>
>>> I understand that Calliope builds on top of the Hadoop support in
>>> Cassandra, which builds on top of SSTables and sstableloader.
>>>
>>> I would like to know the memory usage factor of Calliope and what
>>> parameters I could use to control/tune it.
>>>
>>> Any experience/advice on that?
>>>
>>> -kr, Gerard.
>>>
>>> [1] https://gist.github.com/maasg/68de6016bffe5e71b78c
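The two fixes proposed in this thread (generate data inside the map tasks rather than on the driver, and switch to Kryo with serialized in-memory storage) can be combined into roughly the following sketch. This is illustrative only: the app name and record count are placeholders, and it assumes a Spark 1.x-style setup like the one in the gists; it is not the thread authors' exact code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Use Kryo so Strings are serialized as UTF-8 instead of two-byte JVM chars.
val conf = new SparkConf()
  .setAppName("calliope-load-benchmark") // placeholder name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

val total = 1000000 // placeholder record count

// Parallelize only the keys and generate each record inside the map task,
// so the data never materializes on the driver (Rohit's fix).
val rdd = sc.parallelize(1 to total, 8).map(i =>
  Array(s"devy$i", "aggr", "1000", "sum", (i to i + 10).mkString(",")))

// Keep the RDD serialized in memory; with Kryo this roughly halves the
// footprint of ASCII-heavy string data (Andrew's suggestion).
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
```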