How many nodes do you have in the Cassandra cluster?

On Thu, 18 May 2017 at 1:33 am, Jörn Franke <jornfra...@gmail.com> wrote:

> The issue might be the group by, which under certain circumstances can cause
> a lot of traffic to one node. Of course, the fewer nodes you have, the less
> of this transfer is needed.
> Have you checked in the UI what it reports?
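>
> One way to check whether a single partition is receiving most of the data
> is to count rows per partition and look for skew. A minimal sketch, assuming
> `df` is the cached DataFrame from the notebook below:
>
> ```scala
> // Rows per partition: a heavily skewed distribution suggests
> // most of the data is landing on one executor.
> val counts = df.rdd
>   .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
>   .collect()
>
> counts.sortBy(-_._2).take(10).foreach { case (idx, n) =>
>   println(s"partition $idx: $n rows")
> }
> ```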
>
> On 17. May 2017, at 17:13, Junaid Nasir <jna...@an10.io> wrote:
>
> I have a large data set of 1B records and want to run analytics with
> Apache Spark because of the scaling it provides, but I am seeing an
> anti-pattern here: the more nodes I add to the Spark cluster, the longer
> completion time gets. The data store is Cassandra, and queries are run
> through Zeppelin. I have tried many different queries, but even a simple
> `dataframe.count()` behaves like this.
>
> Here is the Zeppelin notebook; the temp table has 18M records:
>
> val df = sqlContext
>   .read
>   .format("org.apache.spark.sql.cassandra")
>   .options(Map("table" -> "temp", "keyspace" -> "mykeyspace"))
>   .load().cache()
>
> df.registerTempTable("table")
>
> %sql
>
> SELECT first(devid), date, count(1) FROM table GROUP BY date, rtu ORDER BY date
>
>
> When tested against different numbers of Spark worker nodes, these were the
> results:
>
> Spark nodes | Time
> 4 nodes     | 22 min 58 sec
> 3 nodes     | 15 min 49 sec
> 2 nodes     | 12 min 51 sec
> 1 node      | 17 min 59 sec
>
> Increasing the number of nodes decreases performance, which should not
> happen; it defeats the purpose of using Spark.
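>
> One thing I may try next (a sketch, not a confirmed fix): Spark's default
> of 200 shuffle partitions (`spark.sql.shuffle.partitions`) can leave work
> unevenly distributed at this scale, and repartitioning by the grouping
> columns before the aggregate sometimes spreads the shuffle more evenly.
> The value 48 below is an arbitrary illustration, not a recommendation:
>
> ```scala
> import org.apache.spark.sql.functions.{first, count}
>
> // Tune shuffle parallelism and pre-partition by the grouping
> // columns so each group's rows land together.
> sqlContext.setConf("spark.sql.shuffle.partitions", "48")
>
> val grouped = df
>   .repartition(df("date"), df("rtu"))
>   .groupBy("date", "rtu")
>   .agg(first("devid"), count("*"))
>   .orderBy("date")
> ```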
>
> If you want me to run any query or need further info about the setup,
> please ask. Any clues on why this is happening would be very helpful; I
> have been stuck on this for two days now. Thank you for your time.
>
>
> ***versions***
>
> Zeppelin: 0.7.1
> Spark: 2.1.0
> Cassandra: 2.2.9
> Connector: datastax:spark-cassandra-connector:2.0.1-s_2.11
>
> *Spark cluster specs*
>
> 6 vCPUs, 32 GB memory per node
>
> *Cassandra + Zeppelin server specs*
> 8 vCPUs, 52 GB memory
>
> --
Best Regards,
Ayan Guha
