Re: Spark DStream Data stored out of order in Cassandra
Kafka only guarantees ordering within a single partition of a topic, not across the entire topic. Unless you create your topics with only a single partition (you probably shouldn't be doing this), messages won't be served to consumers in FIFO order. As for Spark, many operations change the "ordering" of the consumed messages (e.g. cluster-wide shuffles like groupByKey).

Ask yourself what "ordering" means to you in a distributed system that processes events in parallel. Generally speaking, you should not rely on arrival time for ordering.

A common approach to this problem is to maintain ordering in the external store, not in Kafka or Spark. This means designing your Cassandra schema to maintain ordering, e.g. use a clustering column that holds the event time (not the arrival time) so that C* keeps rows in order on insertion. That way you don't have to solve ordering in the processing engine.

HTH,
Duc

On Mon, Nov 30, 2015 at 4:37 AM, Prateek . wrote:
> Hi,
>
> I have a time-critical Spark application which takes sensor data
> from a Kafka stream, stores it in a case class, applies transformations and
> then stores it in a Cassandra schema. The data needs to be stored in the
> schema in FIFO order.
>
> The order is maintained in the Kafka queue, but I am observing out-of-order
> data in the Cassandra schema. Does Spark Streaming provide any functionality
> to retain order? Or do we need to implement some sorting based on timestamp
> of arrival?
>
> Regards,
> Prateek
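The reply above can be illustrated with a small, self-contained sketch. It is not from the original thread. The table name `sensor_readings`, its columns, the fixed `PARTITION_OF` map (a stand-in for Kafka's key-hash partitioner) and the sample readings are all assumptions made for illustration. The CQL string is shown only to make the clustering-column idea concrete and is never executed; the store is modelled in plain Python.

```python
import random

# Hypothetical CQL for the target table (illustration only, not executed):
# sensor_id is the partition key, event_time is the clustering column.
SCHEMA_CQL = """
CREATE TABLE sensor_readings (
    sensor_id  text,
    event_time timestamp,
    value      double,
    PRIMARY KEY ((sensor_id), event_time)
) WITH CLUSTERING ORDER BY (event_time ASC);
"""

# Assumed key -> partition assignment, standing in for a key-hash partitioner.
PARTITION_OF = {"sensor-a": 0, "sensor-b": 1, "sensor-c": 0}

# (sensor_id, event_time, value) in the order the producer sent them.
sent = [
    ("sensor-a", 1, 20.1),
    ("sensor-b", 1, 7.4),
    ("sensor-a", 2, 20.3),
    ("sensor-c", 1, 3.3),
    ("sensor-b", 2, 7.9),
    ("sensor-a", 3, 20.2),
]

# Each partition is an append-only log: order is kept per partition only.
partitions = {0: [], 1: []}
for record in sent:
    partitions[PARTITION_OF[record[0]]].append(record)

# A consumer may drain partitions in any interleaving; here partition 1 first.
arrival = partitions[1] + partitions[0]


def per_key_in_order(records):
    """True if event_time never goes backwards for any single sensor."""
    last = {}
    for sensor_id, event_time, _ in records:
        if event_time < last.get(sensor_id, event_time):
            return False
        last[sensor_id] = event_time
    return True


# A shuffle inside the processing engine can reorder records even further.
shuffled = arrival[:]
random.Random(7).shuffle(shuffled)


def store(records):
    """Toy model of the table: upsert by (sensor_id, event_time), then
    return each partition's rows sorted by the clustering column."""
    table = {}
    for sensor_id, event_time, value in records:
        table.setdefault(sensor_id, {})[event_time] = value
    return {key: sorted(rows.items()) for key, rows in table.items()}


print(store(shuffled)["sensor-a"])
```

Whatever order the records arrive in, the modelled table returns the rows of each sensor sorted by event time, which is the property the clustering column is meant to provide.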
Re: Spark DStream Data stored out of order in Cassandra
Spark Streaming will consume and process data in parallel, so the order of the output depends not only on the order of the input but also on the time each task takes to process. Operations such as repartitions, sorts and shuffles at the Spark level will also affect ordering, so the best way is to rely on the schema in Cassandra to ensure the ordering expected by the application.

What is the schema you're using on the Cassandra side? And how is the data going to be queried? That last question should drive the required ordering.

-kr, Gerard.

On Mon, Nov 30, 2015 at 12:37 PM, Prateek . wrote:
> Hi,
>
> I have a time-critical Spark application which takes sensor data
> from a Kafka stream, stores it in a case class, applies transformations and
> then stores it in a Cassandra schema. The data needs to be stored in the
> schema in FIFO order.
>
> The order is maintained in the Kafka queue, but I am observing out-of-order
> data in the Cassandra schema. Does Spark Streaming provide any functionality
> to retain order? Or do we need to implement some sorting based on timestamp
> of arrival?
>
> Regards,
> Prateek
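To make the two points of this reply concrete, here is a rough sketch, again not part of the original thread. It assumes three parallel tasks with made-up durations and a toy in-memory class, `ClusteredTable`, that mimics one partition key plus one clustering column. The class, its method names and the sample data are hypothetical; a real table would be defined in CQL and written through a Cassandra driver or connector.

```python
import bisect
from collections import defaultdict


class ClusteredTable:
    """Toy model: rows grouped by sensor_id, kept sorted by event_time."""

    def __init__(self):
        self._keys = defaultdict(list)  # sensor_id -> sorted event times
        self._rows = defaultdict(dict)  # sensor_id -> {event_time: value}

    def upsert(self, sensor_id, event_time, value):
        # Same (sensor_id, event_time) overwrites, so retried writes are harmless.
        if event_time not in self._rows[sensor_id]:
            bisect.insort(self._keys[sensor_id], event_time)
        self._rows[sensor_id][event_time] = value

    def latest(self, sensor_id, n):
        """Newest n readings first, e.g. for a dashboard-style query."""
        keys = self._keys[sensor_id][-n:][::-1]
        return [(t, self._rows[sensor_id][t]) for t in keys]

    def between(self, sensor_id, start, end):
        """Readings with start <= event_time <= end, oldest first."""
        keys = self._keys[sensor_id]
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_right(keys, end)
        return [(t, self._rows[sensor_id][t]) for t in keys[lo:hi]]


# (task_id, assumed processing duration, records) for tasks started together.
tasks = [
    (0, 5, [("sensor-a", 1, 20.1), ("sensor-a", 2, 20.3)]),
    (1, 1, [("sensor-a", 3, 20.2), ("sensor-a", 4, 20.6)]),
    (2, 3, [("sensor-a", 5, 20.4)]),
]

# Tasks finish in order of duration, so writes land out of input order.
finished = sorted(tasks, key=lambda task: task[1])
write_order = [rec for _, _, records in finished for rec in records]

table = ClusteredTable()
for sensor_id, event_time, value in write_order:
    table.upsert(sensor_id, event_time, value)

print([event_time for _, event_time, _ in write_order])  # write order
print(table.latest("sensor-a", 2))
print(table.between("sensor-a", 2, 4))
```

The choice between `latest` and `between` reflects the question of how the data will be queried: a "most recent readings" query would favour a descending clustering order, while a time-range scan reads naturally in ascending order.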
Spark DStream Data stored out of order in Cassandra
Hi,

I have a time-critical Spark application which takes sensor data from a Kafka stream, stores it in a case class, applies transformations and then stores it in a Cassandra schema. The data needs to be stored in the schema in FIFO order.

The order is maintained in the Kafka queue, but I am observing out-of-order data in the Cassandra schema. Does Spark Streaming provide any functionality to retain order? Or do we need to implement some sorting based on timestamp of arrival?

Regards,
Prateek

"DISCLAIMER: This message is proprietary to Aricent and is intended solely for the use of the individual to whom it is addressed. It may contain privileged or confidential information and should not be circulated or used for any purpose other than for what it is intended. If you have received this message in error, please notify the originator immediately. If you are not the intended recipient, you are notified that you are strictly prohibited from using, copying, altering, or disclosing the contents of this message. Aricent accepts no responsibility for loss or damage arising from the use of the information transmitted by this email including damage from virus."