Re: Spark DStream Data stored out of order in Cassandra

2015-11-30 Thread PhuDuc Nguyen
Kafka only guarantees ordering within a single partition of a topic, not
across an entire topic. Unless you're creating topics in Kafka with only a
single partition (you probably shouldn't be doing this), messages won't be
served to consumers in FIFO order. As for Spark, there are many operations
that will change the "ordering" of the consumed messages (e.g. cluster-wide
shuffles like groupByKey). Ask yourself what "ordering" means to you in a
distributed system that processes events in parallel. Generally speaking,
you should not rely on arrival time for ordering.
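
To make the per-partition guarantee concrete, here is a small simulation in plain Python. It is only a sketch, not Kafka client code: the partition count, the "sensor-<n>" key format, and the partition_for helper are all made up for illustration, and a real producer hashes the key bytes rather than parsing a number out of the key.

```python
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count for the topic


def partition_for(key, num_partitions=NUM_PARTITIONS):
    # Stand-in partitioner for keys shaped like "sensor-<n>"; a real Kafka
    # producer hashes the key bytes instead.
    return int(key.rsplit("-", 1)[1]) % num_partitions


def produce(events, num_partitions=NUM_PARTITIONS):
    """Append each (key, seq) event to the partition chosen by its key."""
    partitions = defaultdict(list)
    for key, seq in events:
        partitions[partition_for(key, num_partitions)].append((key, seq))
    return partitions


def consume(partitions):
    """Drain the partitions one after another (one possible read order)."""
    consumed = []
    for pid in sorted(partitions):
        consumed.extend(partitions[pid])
    return consumed


def per_key(events):
    """Group sequence numbers by key, keeping the order they were seen in."""
    grouped = defaultdict(list)
    for key, seq in events:
        grouped[key].append(seq)
    return dict(grouped)


# Three sensors emit readings in lockstep: seq 0 for all, then seq 1, ...
produced = [(f"sensor-{s}", seq) for seq in range(3) for s in range(3)]
consumed = consume(produce(produced))

print(consumed != produced)                    # topic-wide order is not kept
print(per_key(consumed) == per_key(produced))  # order per key is kept
```

Each sensor's readings stay in sequence because they all land in the same partition, but the interleaving across sensors depends on how the consumer happens to read the partitions.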

A common approach to solving this problem is to maintain ordering in the
external store, not in Kafka or Spark. This means designing your Cassandra
schema to maintain ordering - e.g. use a clustering column to save event
time (not arrival time) in your C* schema, so that rows are kept in order
as they are inserted. This way you don't have to worry about solving
ordering in the processing engine.
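
As a rough illustration of the clustering-column idea, the sketch below pairs a hypothetical CQL table definition (the table and column names are assumptions, not from your application) with a toy in-memory model of how rows inside one partition stay sorted by event time regardless of insertion order. It does not talk to a real Cassandra cluster.

```python
import bisect
from collections import defaultdict

# Hypothetical schema: sensor_id is the partition key, event_time the
# clustering column, so rows within a sensor are stored sorted by event_time.
SCHEMA_CQL = """
CREATE TABLE sensor_readings (
    sensor_id  text,
    event_time timestamp,
    value      double,
    PRIMARY KEY ((sensor_id), event_time)
) WITH CLUSTERING ORDER BY (event_time ASC);
"""


class ClusteredTable:
    """Toy model: each partition keeps its rows sorted by the clustering key."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, sensor_id, event_time, value):
        rows = self._partitions[sensor_id]
        times = [t for t, _ in rows]
        i = bisect.bisect_left(times, event_time)
        if i < len(rows) and rows[i][0] == event_time:
            rows[i] = (event_time, value)  # same primary key: overwrite
        else:
            rows.insert(i, (event_time, value))

    def select(self, sensor_id):
        """Return the partition's rows in clustering order."""
        return list(self._partitions[sensor_id])


table = ClusteredTable()
# Writes arrive out of order (event times 3, 1, 2) ...
table.insert("sensor-1", 3, 21.7)
table.insert("sensor-1", 1, 20.5)
table.insert("sensor-1", 2, 21.0)
# ... but reads come back ordered by event time.
rows = table.select("sensor-1")
print(rows)
```

The point of the design is that the order is a property of the stored data, so the order in which Spark tasks happen to write no longer matters.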

HTH,
Duc


On Mon, Nov 30, 2015 at 4:37 AM, Prateek .  wrote:

> Hi,
>
> I have a time-critical Spark application that takes sensor data from a
> Kafka stream, stores it in a case class, applies transformations, and then
> stores it in a Cassandra schema. The data needs to be stored in the schema
> in FIFO order.
>
> The order is maintained in the Kafka queue, but I am observing out-of-order
> data in the Cassandra schema. Does Spark Streaming provide any
> functionality to retain order, or do we need to implement some sorting
> based on the timestamp of arrival?
>
> Regards,
>
> Prateek
> "DISCLAIMER: This message is proprietary to Aricent and is intended solely
> for the use of the individual to whom it is addressed. It may contain
> privileged or confidential information and should not be circulated or used
> for any purpose other than for what it is intended. If you have received
> this message in error, please notify the originator immediately. If you are
> not the intended recipient, you are notified that you are strictly
> prohibited from using, copying, altering, or disclosing the contents of
> this message. Aricent accepts no responsibility for loss or damage arising
> from the use of the information transmitted by this email including damage
> from virus."
>


Re: Spark DStream Data stored out of order in Cassandra

2015-11-30 Thread Gerard Maas
Spark Streaming will consume and process data in parallel. So the order of
the output will depend not only on the order of the input but also on the
time it takes for each task to process. Operations such as repartitions,
sorts and shuffles at the Spark level will also affect ordering, so the
best way would be to rely on the schema in Cassandra to ensure the
ordering expected by the application.
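
A tiny model of why task duration matters, in plain Python rather than Spark. The record ids, start times and durations are invented for illustration; the only assumption is that a record is written out when its task finishes.

```python
def completion_order(tasks):
    """tasks: list of (record_id, start_time, duration).

    Returns record ids in the order their tasks finish, which is the order
    the results would be written out in this simplified model.
    """
    finished = sorted(tasks, key=lambda t: (t[1] + t[2], t[0]))
    return [record_id for record_id, _, _ in finished]


# Three records arrive in order r1, r2, r3 and start processing together,
# but their tasks take different amounts of time (hypothetical numbers).
tasks = [("r1", 0.0, 5.0), ("r2", 0.0, 1.0), ("r3", 0.0, 3.0)]
written = completion_order(tasks)
print(written)
```

Here the input order is r1, r2, r3 while the write order follows the finish times, so anything downstream that assumes arrival order will see the records shuffled.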

What schema are you using on the Cassandra side? And how is the data
going to be queried? That last question should drive the required
ordering.

-kr, Gerard.



Spark DStream Data stored out of order in Cassandra

2015-11-30 Thread Prateek .
Hi,

I have a time-critical Spark application that takes sensor data from a
Kafka stream, stores it in a case class, applies transformations, and then
stores it in a Cassandra schema. The data needs to be stored in the schema
in FIFO order.

The order is maintained in the Kafka queue, but I am observing out-of-order
data in the Cassandra schema. Does Spark Streaming provide any
functionality to retain order, or do we need to implement some sorting
based on the timestamp of arrival?


Regards,
Prateek