Hi all,

For several reasons which I won't elaborate on (yet), we're using Spark in local mode as an in-memory SQL engine: we retrieve data from Cassandra, execute SQL queries against it, and return the results to the client. So no cluster, no worker nodes. I'm well aware local mode has always been considered a testing mode, but it does fit our purposes at the moment.
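
To make the setup concrete, here is a minimal sketch of this kind of local-mode usage. It is an illustration under assumptions, not the production code: it uses PySpark for brevity (the same builder and SQL calls exist in the Scala and Java APIs), the rows are passed in directly instead of being read from Cassandra, and the application name, view name (`events`), column names and the `spark.memory.fraction` value are all hypothetical.

```python
# Hypothetical settings for a single-JVM, local-mode session. The values are
# illustrative assumptions, not tuned recommendations.
LOCAL_CONF = {
    "spark.master": "local[*]",            # no cluster, no worker nodes
    "spark.app.name": "local-sql-engine",  # hypothetical application name
    "spark.memory.fraction": "0.8",        # heap share for execution + storage
}


def build_session(conf=LOCAL_CONF):
    """Create (or reuse) a local SparkSession from the settings above."""
    # Imported lazily so this module loads even where pyspark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def refresh_and_query(spark, rows):
    """Replace the in-memory table with `rows` and run one GROUP BY over it.

    `rows` is a non-empty list of (category, value) tuples standing in for
    the data that would really come from Cassandra.
    """
    df = spark.createDataFrame(rows, ["category", "value"])
    df.createOrReplaceTempView("events")  # hypothetical view name
    result = spark.sql(
        "SELECT category, COUNT(*) AS n FROM events GROUP BY category"
    )
    return result.collect()  # pull the aggregated rows back to the caller
```

With PySpark installed, calling `build_session()` once and then `refresh_and_query(spark, [("a", 1), ("a", 2), ("b", 3)])` on each refresh mirrors the refresh-then-query cycle described in this post.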

We're on Spark 2.0.0. I'm running into several challenges on which I'd appreciate comments, if possible:

1 - For group-by SQL queries, shuffle disk spills happen constantly, to the point where after a couple of days I have 9GB of disk filled in the block manager folder with broadcast files. My understanding is that disk spills only occur during the lifetime of an RDD: once the RDD is gone from memory, the files should be gone too. This doesn't seem to be happening. Is there any way to completely disable the disk spills? I've tweaked the memory fraction configuration to maximize execution memory and avoid the spills, but it doesn't seem to have made much difference.

2 - GC overhead is overwhelming. When refreshing a DataFrame (even with empty data!) and executing one group-by query every second on around 1MB of data, Young Gen usage goes up to 2GB every 10 seconds. I've started profiling the JVM and can see a considerable number of HashMap objects, which I assume are created internally by Spark.

3 - I'm really looking for low-latency, multithreaded refreshes and collection of DataFrames, with query execution and data collection on the order of milliseconds within this local node, and I'm afraid that goes against the nature of Spark. Spark partitions all data as blocks and uses the scheduler for its multi-node design, which is great for multi-node execution. For local-node execution it adds considerable overhead. I'm aware of this constraint; the hope is that we could optimize it to the point where this kind of usage becomes a possibility: an efficient in-memory SQL engine within the same JVM where the data lives.

Any suggestions are very welcome!

Thanks in advance,
Rod

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-in-local-mode-as-SQL-engine-what-to-optimize-tp27815.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.