Some accurate numbers here: it took me 1 hr 30 min to count 698,705,723
rows (~700 million),

and my code is just this:

sc.cassandraTable("cuneiform", "blocks").cassandraCount
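
The distinction matters here: cassandraCount pushes the counting down to
Cassandra, while a plain RDD count drags every row into Spark first just to
count it. A minimal sketch of the two, assuming the connector implicits are
in scope in spark-shell:

    import com.datastax.spark.connector._

    // Pushed down: each Spark task runs SELECT COUNT(*) for its own token
    // range, so only per-range counts cross the network.
    val pushedDown = sc.cassandraTable("cuneiform", "blocks").cassandraCount

    // Not pushed down: every row is fetched and deserialized before counting.
    val fetched = sc.cassandraTable("cuneiform", "blocks").count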



On Thu, Nov 24, 2016 at 10:48 AM, kant kodali <kanth...@gmail.com> wrote:

> Take a look at this https://github.com/brianmhess/cassandra-count
>
> Now it is just a matter of incorporating it into the
> spark-cassandra-connector, I guess.
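>
> The trick in that tool, as I understand it, is to split the token ring and
> run one small COUNT per range, so no single coordinator has to scan the
> whole table. A rough sketch of the idea with the DataStax Java driver; the
> partition key name pk and the split count are made up:
>
>     import com.datastax.driver.core.Cluster
>
>     val cluster = Cluster.builder().addContactPoint("170.99.99.134").build()
>     val session = cluster.connect("cuneiform")
>
>     // Carve the Murmur3 token ring into half-open ranges (lo, hi] and
>     // count each range with its own small query.
>     val numSplits = 100
>     val span = BigInt(Long.MaxValue) - BigInt(Long.MinValue)
>     def bound(i: Int): Long =
>       if (i == numSplits) Long.MaxValue
>       else (BigInt(Long.MinValue) + span * i / numSplits).toLong
>
>     val total = (0 until numSplits).map { i =>
>       session.execute(
>         s"SELECT COUNT(*) FROM blocks WHERE token(pk) > ${bound(i)} AND token(pk) <= ${bound(i + 1)}"
>       ).one().getLong(0)
>     }.sum
>
>     cluster.close()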
>
> On Thu, Nov 24, 2016 at 1:01 AM, kant kodali <kanth...@gmail.com> wrote:
>
>> According to this link: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md
>>
>> I tried the following, but it still looks like it is taking forever:
>>
>> sc.cassandraTable(keyspace, table).cassandraCount
>>
>>
>> On Thu, Nov 24, 2016 at 12:56 AM, kant kodali <kanth...@gmail.com> wrote:
>>
>>> I would be glad if SELECT COUNT(*) FROM hello could return any value for a
>>> table of that size :) I can say for sure it didn't return anything for 30
>>> minutes, and I would probably need to build more patience to sit for a few
>>> more hours after that! Cassandra recommends nodetool cfstats
>>> (ColumnFamilyStats), which gives a pretty good estimate but not an
>>> accurate value.
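>>>
>>> If an estimate is good enough, the numbers cfstats reports are also
>>> queryable from the system.size_estimates table. A hedged sketch through
>>> the connector's session (keyspace/table names are illustrative; this asks
>>> only whichever node coordinates the query, and it estimates partitions,
>>> not rows, so treat it as a ballpark):
>>>
>>>     import com.datastax.spark.connector.cql.CassandraConnector
>>>     import scala.collection.JavaConverters._
>>>
>>>     // Sum the per-range partition-count estimates the node keeps locally.
>>>     val estimate = CassandraConnector(sc.getConf).withSessionDo { session =>
>>>       session.execute(
>>>         "SELECT partitions_count FROM system.size_estimates " +
>>>         "WHERE keyspace_name = 'cuneiform' AND table_name = 'blocks'"
>>>       ).all().asScala.map(_.getLong("partitions_count")).sum
>>>     }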
>>>
>>> On Thu, Nov 24, 2016 at 12:48 AM, Anastasios Zouzias <zouz...@gmail.com>
>>> wrote:
>>>
>>>> How fast is Cassandra without Spark on the count operation?
>>>>
>>>> cqlsh> SELECT COUNT(*) FROM hello;
>>>>
>>>> (this is not equivalent to what you are doing, but it might help you find
>>>> the root cause)
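>>>>
>>>> Also, the progress readout you pasted is Spark's console progress bar:
>>>> (0 + 2) / 13 means 0 tasks completed plus 2 running out of 13 total. A
>>>> guess, not verified against your cluster: 13 tasks for a table this size
>>>> suggests the splits are enormous. spark.cassandra.input.split.size_in_mb
>>>> is read in megabytes, so 67108864 asks for ~64 TB per split; something
>>>> in the tens of MB keeps the scan parallel. A sketch of setting it in code
>>>> (64 here is illustrative, not a verified default):
>>>>
>>>>     import org.apache.spark.SparkConf
>>>>
>>>>     val conf = new SparkConf()
>>>>       .set("spark.cassandra.connection.host", "170.99.99.134")
>>>>       .set("spark.cassandra.input.split.size_in_mb", "64") // MB, not bytes
>>>>
>>>> (In spark-shell you would pass the same key via --conf instead.)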
>>>>
>>>> On Thu, Nov 24, 2016 at 9:03 AM, kant kodali <kanth...@gmail.com>
>>>> wrote:
>>>>
>>>>> I have the following code
>>>>>
>>>>> I invoke spark-shell as follows
>>>>>
>>>>>     ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134 \
>>>>>       --executor-memory 15G --executor-cores 12 \
>>>>>       --conf spark.cassandra.input.split.size_in_mb=67108864
>>>>>
>>>>> code
>>>>>
>>>>>     scala> val df = spark.sql("SELECT test FROM hello") // a billion
>>>>> rows in hello, and the test column is 1 KB
>>>>>
>>>>>     df: org.apache.spark.sql.DataFrame = [test: binary]
>>>>>
>>>>>     scala> df.count
>>>>>
>>>>>     [Stage 0:>   (0 + 2) / 13] // I don't know what these numbers mean
>>>>> precisely.
>>>>>
>>>>> If I invoke spark-shell as follows
>>>>>
>>>>>     ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134
>>>>>
>>>>> code
>>>>>
>>>>>
>>>>>     val df = spark.sql("SELECT test FROM hello") // this has about a
>>>>> billion rows
>>>>>
>>>>>     scala> df.count
>>>>>
>>>>>
>>>>>     [Stage 0:=>  (686 + 2) / 24686] // What are these numbers
>>>>> precisely?
>>>>>
>>>>>
>>>>> Neither of these versions worked; Spark keeps running forever, and I
>>>>> have been waiting for more than 15 minutes with no response. Any ideas
>>>>> on what could be wrong and how to fix it?
>>>>>
>>>>> I am using Spark 2.0.2
>>>>> and spark-cassandra-connector_2.11-2.0.0-M3.jar
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> -- Anastasios Zouzias
>>>> <a...@zurich.ibm.com>
>>>>
>>>
>>>
>>
>
