How many nodes do you have in the Cassandra cluster?

On Thu, 18 May 2017 at 1:33 am, Jörn Franke <jornfra...@gmail.com> wrote:

> The issue might be group by, which under certain circumstances can cause
> a lot of traffic to one node. This transfer is of course obsolete the less
> nodes you have.
> Have you checked in the UI what it reports?
>
> On 17. May 2017, at 17:13, Junaid Nasir <jna...@an10.io> wrote:
>
> I have a large data set of 1B records and want to run analytics using
> Apache spark because of the scaling it provides, but I am seeing an anti
> pattern here. The more nodes I add to spark cluster, completion time
> increases. Data store is Cassandra, and queries are run by Zeppelin. I have
> tried many different queries but even a simple query of `dataframe.count()`
> behaves like this.
>
> Here is the zeppelin notebook, temp table has 18M records
>
> val df = sqlContext
>   .read
>   .format("org.apache.spark.sql.cassandra")
>   .options(Map( "table" -> "temp", "keyspace" -> "mykeyspace"))
>   .load().cache()
>
> df.registerTempTable("table")
>
> %sql
>
> SELECT first(devid),date,count(1) FROM table group by date,rtu order by date
>
> when tested against different no. of spark worker nodes these were the
> results
>
> Spark nodes   Time
> 4 nodes       22 min 58 sec
> 3 nodes       15 min 49 sec
> 2 nodes       12 min 51 sec
> 1 node        17 min 59 sec
>
> Increasing the no. of nodes decreases performance. which should not happen
> as it defeats the purpose of using Spark.
>
> If you want me to run any query or further info about the setup please ask.
> Any cues on why this is happening would be very helpful, been stuck on
> this for two days now. Thank you for your time.
>
> ***versions***
>
> Zeppelin: 0.7.1
> Spark: 2.1.0
> Cassandra: 2.2.9
> Connector: datastax:spark-cassandra-connector:2.0.1-s_2.11
>
> *Spark cluster specs*
>
> 6 vCPUs, 32 GB memory = 1 node
>
> *Cassandra + Zeppelin server specs*
> 8 vCPUs, 52 GB memory

--
Best Regards,
Ayan Guha