RE: Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-15 Thread Wolfgang Buchner
Hi Nimrod, I am also interested in your first point: what exactly does "false alarm" mean? Today I had the following scenario, which in my opinion is a false alarm. Example: - Topic contains 'N' messages - Spark Streaming application consumed all 'N' messages successfully - Checkpoints of s

Re: Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-14 Thread Khalid Mammadov
1. I think "false alarm" in this context means you are OK to lose data, as in Dev and Test envs. 2. Not sure. 3. Sorry, not sure again, but my guess would be that during your failover the checkpoint got out of sync. Sorry, that is all I used this feature for. If you think you can smoothly fail over to other clust

Re: Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-13 Thread Nimrod Ofek
Thanks Khalid, some follow-ups: 1. I'm still unsure what would count as "false alarms". 2. When there is data loss on some partitions, will that cause all partitions to be reset? 3. I had a case where I set failOnDataLoss to false and set the policy to earliest (which was about 24 h

Re: Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-10 Thread Khalid Mammadov
I use this option in development environments where jobs are not actively running and the Kafka topic has a retention policy enabled. This means that when a streaming job runs, it may find that the last offset it read is no longer there, and in this case it falls back to the starting position (i.e. earliest or latest) sp
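
For readers following the thread, here is a minimal sketch of how the options discussed above are typically passed to the Kafka source in PySpark. The helper names `kafka_source_options` and `build_stream`, the broker address, and the topic name are hypothetical and not taken from the thread; only the option keys (`kafka.bootstrap.servers`, `subscribe`, `startingOffsets`, `failOnDataLoss`) are the Kafka source's documented ones. The sketch only shows how the options are set and makes no claim about what Spark does when offsets are missing.

```python
def kafka_source_options(bootstrap_servers, topic,
                         fail_on_data_loss=False,
                         starting_offsets="earliest"):
    """Build the option map for Spark's Kafka source (hypothetical helper).

    Values are kept as strings, which is how Spark source options are
    ultimately interpreted.
    """
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Used when a query starts without checkpointed offsets.
        "startingOffsets": starting_offsets,
        # "false" means the query is not failed on possible data loss.
        "failOnDataLoss": str(fail_on_data_loss).lower(),
    }


def build_stream(spark, options):
    """Create a streaming DataFrame from an existing SparkSession.

    Assumes the spark-sql-kafka connector is on the classpath.
    """
    return spark.readStream.format("kafka").options(**options).load()


# Placeholder broker and topic, for illustration only.
dev_options = kafka_source_options("localhost:9092", "events")
```

With a live session, `build_stream(spark, dev_options)` would return the streaming DataFrame. Keeping the options in a plain dict makes it easy to use `failOnDataLoss=false` in Dev and Test while leaving the default in production.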

Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-10 Thread Nimrod Ofek
Hi everyone, I'm currently working with Spark Structured Streaming integrated with Kafka and had some questions regarding the failOnDataLoss option. The current documentation states: *"Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of