Re: Data loss in spark job

2018-02-27 Thread Faraz Mateen
Hi,

I saw the following error message in the executor logs:

*Java HotSpot(TM) 64-Bit Server VM warning: INFO:
os::commit_memory(0x000662f0, 520093696, 0) failed; error='Cannot
allocate memory' (errno=12)*

By increasing the RAM of my nodes to 40 GB each, I was able to get rid of the
RPC connection failures. However, the results I am getting after copying the
data are still incorrect.

Just before termination, the executor logs show this error message:

*ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM*

I believe the executors are not shutting down gracefully, and that is causing
Spark to lose some data.

Can anyone please explain how I can further debug this?

Thanks,
Faraz


Data loss in spark job

2018-02-26 Thread Faraz Mateen
 Hi,

I think I have a situation where Spark is silently failing to write data to
my Cassandra table. Let me explain my current setup.

I have a table with around 402 million records and 84 columns. The table
schema is something like this:


*id (text)  |  datetime (timestamp)  |  field1 (text)  |  ...  |  field84 (text)*


To optimize queries on the data, I am splitting it into multiple tables using
the Spark job mentioned below. Each separate table must hold the data from
just one field of the source table. Each new table has the following structure:


*id (text)  |   datetime (timestamp)  |   day (date)  |   value (text)*


where the "value" column contains one field column from the source table.
The source table has around *402 million* records, which is around *85 GB* of
data distributed across *3 nodes (27 GB + 32 GB + 26 GB)*. The new table being
populated is supposed to have the same number of records, but it is missing
some data.

Initially, I assumed there was some problem with the data in the source table.
So, I copied one week of data from the source table into another table with
the same schema. Then I split the data like I did before, but this time the
field-specific table had the same number of records as the source table. I
repeated this with another data set from another time period, and again the
number of records in the field-specific table was equal to the number of
records in the source table.
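
For reference, this is roughly how I compare record counts between the source
table and a field-specific table (a minimal sketch; the keyspace and table
names are placeholders, not my real ones):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count_check").getOrCreate()

def cassandra_count(keyspace, table):
    # Read the table through the DataStax Spark Cassandra connector and count rows.
    return (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace=keyspace, table=table)
            .load()
            .count())

src = cassandra_count("my_keyspace", "data")         # source table (placeholder)
dst = cassandra_count("my_keyspace", "data_field1")  # one field-specific table
print("source=%d  destination=%d  missing=%d" % (src, dst, src - dst))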

This has led me to believe that there is some problem with Spark's handling of
the larger data set. Here is my spark-submit command to separate the data:

~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit \
  --master spark://10.128.0.18:7077 \
  --packages datastax:spark-cassandra-connector:2.0.1-s_2.11 \
  --conf spark.cassandra.connection.host="10.128.1.1,10.128.1.2,10.128.1.3" \
  --conf "spark.storage.memoryFraction=1" \
  --conf spark.local.dir=/media/db/ \
  --executor-memory 10G \
  --num-executors=6 \
  --executor-cores=3 \
  --total-executor-cores 18 \
  split_data.py


*split_data.py* is the name of my PySpark application. It essentially executes
the following query:

("select id, datetime, DATE_FORMAT(datetime, 'yyyy-MM-dd') as day, " + field +
 " as value from data")
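
For context, the job is structured roughly like this (a minimal sketch with
placeholder keyspace/table names; everything not relevant to the question is
trimmed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split_data").getOrCreate()

field = "field1"  # the source column being split out into its own table

# Load the source table through the DataStax Spark Cassandra connector.
source = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="my_keyspace", table="data")  # placeholder names
          .load())
source.createOrReplaceTempView("data")

split = spark.sql(
    "select id, datetime, DATE_FORMAT(datetime, 'yyyy-MM-dd') as day, "
    + field + " as value from data")

# Append the result into the pre-created field-specific Cassandra table.
(split.write
 .format("org.apache.spark.sql.cassandra")
 .options(keyspace="my_keyspace", table="data_" + field)  # placeholder names
 .mode("append")
 .save())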

The Spark job does not crash after these errors and warnings. However, when I
check the number of records in the new table, it is always less than the
number of records in the source table. Moreover, the number of records in the
destination table is not the same after each run of the query. I changed the
logging level for spark-submit to WARN and saw the following warnings and
errors on the console:

https://gist.github.com/anonymous/e05f1aaa131348c9a5a9a2db6d141f8c#file-gistfile1-txt

My cluster consists of *3 gcloud VMs*. A Spark node and a Cassandra node are
deployed on each VM.
Each VM has *8 CPU cores* and *30 GB* of RAM. Spark is deployed in standalone
cluster mode.
Spark version is *2.1.0*.
I am using the DataStax spark-cassandra-connector version *2.0.1*.
Cassandra version is *3.9*.
Each Spark executor is allowed 10 GB of RAM and there are 2 executors running
on each node, so up to 20 GB of each 30 GB VM can be committed to executor
heaps, alongside the co-located Cassandra node.

Is the problem related to my machine resources? How can I root cause or fix
this?
Any help will be greatly appreciated.

Thanks,
Faraz