We have an 8-node Cassandra cluster. Replication factor: 3, consistency
level: QUORUM. Data spread: I can let you know once I get access to our
production cluster.

The use case for a simple count is more for internal use than for end
clients/customers; however, there are many customer use cases that require a
full table scan.

I believe the count can be made faster, even though my Spark nodes and
Cassandra nodes are not co-located, and it can be done in many ways. Here are
a couple of ways I can think of:

1) Since a Cassandra cluster is all about token ranges, there should be a way
to break the billion records into smaller subsets by token range, count each
subset in parallel, and finally sum the partial counts; a rough sketch of this
idea follows below.
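
For illustration, here is a rough, untested sketch of that idea using the
DataStax Java driver directly, similar in spirit to the cassandra-count tool
mentioned below. It assumes driver 3.x is on the classpath; the contact point
and keyspace/table are the ones from this thread, and the partition-key column
name "id" is just a placeholder for whatever the real key is:

    import com.datastax.driver.core.Cluster
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Assumed contact point and schema; "id" stands in for the partition key.
    val cluster = Cluster.builder().addContactPoint("170.99.99.134").build()
    val session = cluster.connect()

    // Split the full Murmur3 token ring [-2^63, 2^63 - 1] into equal sub-ranges.
    val numSplits = 256
    val minToken  = BigInt(Long.MinValue)
    val maxToken  = BigInt(Long.MaxValue)
    val step      = (maxToken - minToken) / numSplits
    val bounds    = (0 until numSplits).map(i => (minToken + step * i).toLong) :+ Long.MaxValue

    // One range-restricted COUNT per sub-range, run concurrently, then summed.
    // (A row sitting exactly at the minimum token, if any, is skipped by the
    // strict lower bound; good enough for a sketch.)
    val partialCounts = bounds.sliding(2).map { case Seq(lo, hi) =>
      Future {
        session.execute(
          s"SELECT COUNT(*) FROM cuneiform.blocks WHERE token(id) > $lo AND token(id) <= $hi"
        ).one().getLong(0)
      }
    }.toSeq

    val total = Await.result(Future.sequence(partialCounts), Duration.Inf).sum
    println(s"total rows: $total")
    cluster.close()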

2) I do think ideas like whole-stage code generation could be implemented on
the Cassandra server side (or in general for most databases). Underneath, this
is essentially just reading rows from files on disk, so if my table has 1B rows
and each row is about 1KB, that amounts to reading roughly 1TB, and I wouldn't
expect that to take 1 hr 30 mins. That is only if you read all columns; since
Cassandra lets you select individual columns, you could read just the one
column you need, in which case the data size would be about 250 GB, so it
should be a lot faster. A small example of projecting a single column through
the connector is below.
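
As a small, hedged illustration of the single-column point (reusing the
keyspace/table and column names from this thread, and assuming the
spark-cassandra-connector is on the spark-shell classpath):

    import com.datastax.spark.connector._

    // Only the "test" column is requested from Cassandra; the count runs in Spark.
    val projectedCount = sc.cassandraTable("cuneiform", "blocks").select("test").count()

    // cassandraCount instead pushes a COUNT down to Cassandra per token range,
    // so no column values are shipped to Spark at all.
    val pushedDownCount = sc.cassandraTable("cuneiform", "blocks").cassandraCount()

The second form is what the connector's selection docs suggest for counts,
since it never transfers column values to Spark.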


We have several use cases where we need real-time ad hoc querying, and those
would require table scans/going through all the records, and so on.

Thanks







On Thu, Nov 24, 2016 at 1:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> I am not sure what use case you want to demonstrate with select count in
> general. Maybe you can elaborate more what your use case is.
>
> Aside from this: this is a Cassandra issue. What is the setup of
> Cassandra? Dedicated nodes? How many? Replication strategy? Consistency
> configuration? How is the data spread on nodes?
> Cassandra is more for use cases where you have a lot of data, but select
> only a subset from it or where you have a lot of single writes.
>
> If you want to analyze it you have to export it once to parquet, orc etc
> and then run queries on it. Depending on your use case you may want to go
> for that on hive2+tez+ldap or spark.
>
> On 24 Nov 2016, at 20:52, kant kodali <kanth...@gmail.com> wrote:
>
> some accurate numbers here. so it took me 1hr:30 mins to count  698705723
> rows (~700 Million)
>
> and my code is just this
>
> sc.cassandraTable("cuneiform", "blocks").cassandraCount
>
>
>
> On Thu, Nov 24, 2016 at 10:48 AM, kant kodali <kanth...@gmail.com> wrote:
>
>> Take a look at this https://github.com/brianmhess/cassandra-count
>>
>> Now It is just matter of incorporating it into spark-cassandra-connector
>> I guess.
>>
>> On Thu, Nov 24, 2016 at 1:01 AM, kant kodali <kanth...@gmail.com> wrote:
>>
>>> According to this link https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md
>>>
>>> I tried the following but it still looks like it is taking forever
>>>
>>> sc.cassandraTable(keyspace, table).cassandraCount
>>>
>>>
>>> On Thu, Nov 24, 2016 at 12:56 AM, kant kodali <kanth...@gmail.com>
>>> wrote:
>>>
>>>> I would be glad if SELECT COUNT(*) FROM hello can return any value for
>>>> that size :) I can say for sure it didn't return anything for 30 mins and I
>>>> probably need to build more patience to sit for few more hours after that!
>>>> Cassandra recommends to use ColumnFamilyStats using nodetool cfstats which
>>>> will give a pretty good estimate but not an accurate value.
>>>>
>>>> On Thu, Nov 24, 2016 at 12:48 AM, Anastasios Zouzias <zouz...@gmail.com
>>>> > wrote:
>>>>
>>>>> How fast is Cassandra without Spark on the count operation?
>>>>>
>>>>> cqlsh> SELECT COUNT(*) FROM hello
>>>>>
>>>>> (this is not equivalent with what you are doing but might help you
>>>>> find the root of the cause)
>>>>>
>>>>> On Thu, Nov 24, 2016 at 9:03 AM, kant kodali <kanth...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I have the following code
>>>>>>
>>>>>> I invoke spark-shell as follows
>>>>>>
>>>>>>     ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134
>>>>>> --executor-memory 15G --executor-cores 12 --conf
>>>>>> spark.cassandra.input.split.size_in_mb=67108864
>>>>>>
>>>>>> code
>>>>>>
>>>>>>     scala> val df = spark.sql("SELECT test from hello") // Billion
>>>>>> rows in hello and test column is 1KB
>>>>>>
>>>>>>     df: org.apache.spark.sql.DataFrame = [test: binary]
>>>>>>
>>>>>>     scala> df.count
>>>>>>
>>>>>>     [Stage 0:>   (0 + 2) / 13] // I dont know what these numbers mean
>>>>>> precisely.
>>>>>>
>>>>>> If I invoke spark-shell as follows
>>>>>>
>>>>>>     ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134
>>>>>>
>>>>>> code
>>>>>>
>>>>>>
>>>>>>     val df = spark.sql("SELECT test from hello") // This has about
>>>>>> billion rows
>>>>>>
>>>>>>     scala> df.count
>>>>>>
>>>>>>
>>>>>>     [Stage 0:=>  (686 + 2) / 24686] // What are these numbers
>>>>>> precisely?
>>>>>>
>>>>>>
>>>>>> Both of these versions didn't work Spark keeps running forever and I
>>>>>> have been waiting for more than 15 mins and no response. Any ideas on what
>>>>>> could be wrong and how to fix this?
>>>>>>
>>>>>> I am using Spark 2.0.2
>>>>>> and spark-cassandra-connector_2.11-2.0.0-M3.jar
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> -- Anastasios Zouzias
>>>>> <a...@zurich.ibm.com>
>>>>>
>>>>
>>>>
>>>
>>
>
