Hi Ben,

That makes sense. I also read about "read repairs": once an inconsistent record is read, Cassandra synchronizes its replicas on the other nodes as well. I ran the same Spark query again, this time with the default consistency level (LOCAL_ONE), and the result was correct.
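In case it helps someone who finds this thread later: in cqlsh I set the level with the CONSISTENCY command (e.g. CONSISTENCY ALL;), and for the Spark connector I used the spark.cassandra.input.consistency.level property from the connector's reference.md. A minimal pyspark sketch of the latter (the property name is documented; the rest is illustrative):

    from pyspark.sql import SparkSession

    # Ask the Cassandra connector to read at consistency level ALL
    # instead of its LOCAL_ONE default. Note that all replicas must
    # be available for reads to succeed at this level.
    spark = (SparkSession.builder
             .appName("Datacopy App")
             .config("spark.cassandra.input.consistency.level", "ALL")
             .getOrCreate())

The same property can also be passed to spark-submit with --conf.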
Thanks again for the help.

Thanks,
Faraz

On Wed, Mar 7, 2018 at 7:13 AM, Ben Slater <ben.sla...@instaclustr.com> wrote:

> Hi Faraz
>
> Yes, it likely does mean there is inconsistency in the replicas. However, you shouldn't be too freaked out about it - Cassandra is designed to allow for this inconsistency to occur, and the consistency levels let you achieve consistent results despite the replicas not being consistent. To keep your replicas as consistent as possible (which is still a good thing), you do need to run repairs regularly (once a week is the standard recommendation for full repairs). Inconsistency can result from a whole range of conditions, from nodes being down, to the cluster being overloaded, to network issues.
>
> Cheers
> Ben
>
> On Tue, 6 Mar 2018 at 22:18 Faraz Mateen <fmat...@an10.io> wrote:
>
>> Thanks a lot for the response.
>>
>> Setting consistency to ALL/TWO started giving me consistent count results on both cqlsh and spark. As expected, my query time has increased by about 1.5x (before, it took ~1.6 hours; with consistency level ALL, the same query takes ~2.4 hours to complete).
>>
>> Does this mean my replicas are out of sync? When I first started pushing data to Cassandra, I had a single-node setup. Then I added two more nodes, changed the replication factor to 2, and ran nodetool repair to distribute the data to all the nodes. So, according to my understanding, the nodes should have passively replicated data among themselves to remain in sync.
>>
>> Do I need to run repairs repeatedly to keep the data in sync?
>> How can I further debug why my replicas were not in sync before?
>>
>> Thanks,
>> Faraz
>>
>> On Sun, Mar 4, 2018 at 9:46 AM, Ben Slater <ben.sla...@instaclustr.com> wrote:
>>
>>> Both cqlsh and the Spark Cassandra connector query at consistency level ONE (LOCAL_ONE for the Spark connector) by default, so any inconsistency in your replicas can result in inconsistent query results.
>>>
>>> See http://cassandra.apache.org/doc/latest/tools/cqlsh.html and https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md for info on how to change consistency. If you are unsure of how consistent the on-disk replicas are (e.g. if you have been writing at CL ONE or haven't run repairs), then using consistency level ALL should give you the most consistent results, but it requires all replicas to be available for the query to succeed. If you are using QUORUM for your writes, then querying at QUORUM or LOCAL_QUORUM as appropriate should give you consistent results.
>>>
>>> Cheers
>>> Ben
>>>
>>> On Sun, 4 Mar 2018 at 00:59 Kant Kodali <k...@peernova.com> wrote:
>>>
>>>> The fact that cqlsh itself gives different results tells me that this has nothing to do with Spark. Moreover, the Spark results are monotonically increasing, which seems more consistent than the cqlsh results, so I believe Spark can be taken out of the equation.
>>>>
>>>> Now, while you are running these queries, is there another process or thread that is also writing at the same time? If yes, then your results are fine; if not, you may want to try nodetool flush first and then run these iterations again.
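>>>> For example (a sketch - the keyspace and table names here are placeholders; with no arguments nodetool flushes all keyspaces), run on each node so that all memtables are written out to SSTables before counting:
>>>>
>>>>     nodetool flush <keyspace> <table>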
>>>> Thanks!
>>>>
>>>> On Fri, Mar 2, 2018 at 11:17 PM, Faraz Mateen <fmat...@an10.io> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I am trying to use Spark to process a large Cassandra table (~402 million entries and 84 columns), but I am getting inconsistent results.
>>>>>
>>>>> Initially the requirement was to copy some columns from this table to another table. After copying the data, I noticed that some entries in the new table were missing. To verify, I took a count of the large source table, but I got different values each time. I tried the same queries on a smaller table (~7 million records) and the results were fine.
>>>>>
>>>>> I first attempted to take the count using pyspark. Here is my pyspark script:
>>>>>
>>>>>     from pyspark.sql import SparkSession
>>>>>
>>>>>     # sourcetable and sourcekeyspace are defined earlier in the script
>>>>>     spark = SparkSession.builder.appName("Datacopy App").getOrCreate()
>>>>>     df = (spark.read.format("org.apache.spark.sql.cassandra")
>>>>>           .options(table=sourcetable, keyspace=sourcekeyspace)
>>>>>           .load()
>>>>>           .cache())
>>>>>     df.createOrReplaceTempView("data")
>>>>>     # Count every row in the source table
>>>>>     vgDF = spark.sql("select count(1) from data")
>>>>>     vgDF.show(10)
>>>>>
>>>>> The spark-submit command is as follows:
>>>>>
>>>>>     ~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit \
>>>>>       --master spark://10.128.0.18:7077 \
>>>>>       --packages datastax:spark-cassandra-connector:2.0.1-s_2.11 \
>>>>>       --conf spark.cassandra.connection.host="10.128.1.1,10.128.1.2,10.128.1.3" \
>>>>>       --conf "spark.storage.memoryFraction=1" \
>>>>>       --conf spark.local.dir=/media/db/ \
>>>>>       --executor-memory 10G --num-executors=6 --executor-cores=2 \
>>>>>       --total-executor-cores 18 \
>>>>>       pyspark_script.py
>>>>>
>>>>> The above spark-submit process takes ~90 minutes to complete. I ran it three times and here are the counts I got:
>>>>>
>>>>> Spark iteration 1: 402273852
>>>>> Spark iteration 2: 402273884
>>>>> Spark iteration 3: 402274209
>>>>>
>>>>> Spark does not show any error or exception during the entire process. I ran the same query in cqlsh three times and again got different results:
>>>>>
>>>>> Cqlsh iteration 1: 402273598
>>>>> Cqlsh iteration 2: 402273499
>>>>> Cqlsh iteration 3: 402273515
>>>>>
>>>>> I am unable to find out why I am getting different outcomes from the same query. The Cassandra system log (*/var/log/cassandra/system.log*) showed the following error message just once:
>>>>>
>>>>> ERROR [SSTableBatchOpen:3] 2018-02-27 09:48:23,592 CassandraDaemon.java:226 - Exception in thread Thread[SSTableBatchOpen:3,5,main]
>>>>> java.lang.AssertionError: Stats component is missing for sstable /media/db/datakeyspace/sensordata1-acfa7880acba11e782fd9bf3ae460699/mc-58617-big
>>>>>     at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:460) ~[apache-cassandra-3.9.jar:3.9]
>>>>>     at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:375) ~[apache-cassandra-3.9.jar:3.9]
>>>>>     at org.apache.cassandra.io.sstable.format.SSTableReader$4.run(SSTableReader.java:536) ~[apache-cassandra-3.9.jar:3.9]
>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_131]
>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_131]
>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_131]
>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
>>>>>     at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]
>>>>>
>>>>> *Versions:*
>>>>>
>>>>> - Cassandra 3.9
>>>>> - Spark 2.1.0
>>>>> - DataStax spark-cassandra-connector 2.0.1
>>>>> - Scala 2.11
>>>>>
>>>>> *Cluster:*
>>>>>
>>>>> - Spark setup with 3 workers and 1 master node.
>>>>> - The 3 worker nodes also have a Cassandra cluster installed.
>>>>> - Each worker node has 8 CPU cores and 40 GB RAM.
>>>>>
>>>>> Any help will be greatly appreciated.
>>>>>
>>>>> Thanks,
>>>>> Faraz