Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Anil Dasari
You are right. Another question on migration: is there a way to get the micro-batch id during the micro-batch dataset `transform` operation, like in RDD transform? I am attempting to implement the following pseudo functionality with structured streaming. In this approach, recordCategoriesMetadata is
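[Structured Streaming does not surface a batch id inside Dataset transformations the way the DStreams transform closure could; the id is only exposed in foreachBatch. A minimal sketch of tagging each micro-batch with its id that way, assuming a rate source and a hypothetical output path:]

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.lit

    object BatchIdTag {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("batch-id-sketch").getOrCreate()

        val stream = spark.readStream
          .format("rate")                 // stand-in source for illustration
          .option("rowsPerSecond", 10)
          .load()

        val query = stream.writeStream
          .foreachBatch { (batch: DataFrame, batchId: Long) =>
            // batchId is stable across recomputations of the same micro-batch,
            // so it can stand in for the id used in the old RDD transform.
            batch.withColumn("batch_id", lit(batchId))
              .write.mode("append")
              .parquet(s"/tmp/out/batch=$batchId") // hypothetical path
          }
          .start()

        query.awaitTermination()
      }
    }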

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
With regard to this sentence *Offset Tracking with Structured Streaming:* While storing offsets in external storage was necessary with DStreams, SSS handles this automatically through checkpointing. The checkpoints include the offsets processed by each micro-batch. However, you can still acces
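[For reference, the offsets Spark checkpoints for the most recent micro-batch can be read off the query's progress object. A small sketch, assuming a running StreamingQuery named query:]

    val progress = query.lastProgress     // null until the first batch completes
    if (progress != null) {
      progress.sources.foreach { s =>
        // startOffset/endOffset are JSON strings, e.g. Kafka partition -> offset maps
        println(s"source=${s.description} start=${s.startOffset} end=${s.endOffset}")
      }
    }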

Remote File change detection in S3 when spark queries are running and parquet files in S3 changes

2024-05-22 Thread Raghvendra Yadav
Hello, We are hoping someone can help us understand the Spark behavior for the scenarios listed below. Q. *Will running Spark queries fail when an S3 Parquet object changes underneath, with S3A remote file change detection enabled? Is it 100%?* Our understanding is that S3A has a featur
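[For context, S3A's change detection is driven by Hadoop configuration keys (available in Hadoop 3.x). A sketch of enabling ETag-based, server-side detection from Spark, assuming these settings suit your Hadoop version:]

    val spark = org.apache.spark.sql.SparkSession.builder
      .appName("s3a-change-detection-sketch")
      // Compare the object's ETag on each read of the open file...
      .config("spark.hadoop.fs.s3a.change.detection.source", "etag")
      // ...and enforce it server-side, failing the read if the object changed
      .config("spark.hadoop.fs.s3a.change.detection.mode", "server")
      .getOrCreate()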

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
Hi Anil, OK, let us put the complete picture here. *Current DStreams Setup:*
- Data Source: Kafka
- Processing Engine: Spark DStreams
- Data Transformation with Spark
- Sink: S3
- Data Format: Parquet
- Exactly-Once Delivery (Attempted): You're attempting exactly-once

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Tathagata Das
The right way to associate micro-batches when committing to external storage is to use the micro-batch id that you can get in foreachBatch. That micro-batch id guarantees that the data produced in the batch is always the same regardless of recomputations (assuming all processing logic is determin
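[A minimal sketch of that pattern: key the external commit on batchId so a replayed batch becomes a no-op. Here streamingDF is assumed to be the streaming DataFrame being written, and alreadyCommitted/markCommitted are hypothetical stand-ins for whatever external store tracks committed batches:]

    import org.apache.spark.sql.DataFrame

    // Hypothetical external bookkeeping; replace with your store (JDBC, DynamoDB, ...)
    def alreadyCommitted(batchId: Long): Boolean = false
    def markCommitted(batchId: Long): Unit = ()

    streamingDF.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        if (!alreadyCommitted(batchId)) {
          // Deterministic batch contents + batchId-keyed writes => idempotent sink
          batch.write.mode("overwrite").parquet(s"s3a://bucket/out/batch=$batchId")
          markCommitted(batchId)
        }
      }
      .start()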

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Anil Dasari
Thanks Das, Mich. Mich, we process data from Kafka and write it to S3 in Parquet format using DStreams. To ensure exactly-once delivery and prevent data loss, our process records micro-batch offsets to an external storage at the end of each micro-batch in foreachRDD, which is then used when the
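[For readers following along, this is the classic DStreams pattern the thread is migrating away from: cast each micro-batch RDD to HasOffsetRanges, write the data, then persist the offsets. A sketch, assuming stream is a Kafka direct stream:]

    import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

    stream.foreachRDD { rdd =>
      // Capture offsets before any transformation re-partitions the RDD
      val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // 1. Write the batch to S3 as Parquet (elided)
      // 2. Then record the offsets in external storage, e.g.:
      offsetRanges.foreach { o =>
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
      }
    }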

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Tathagata Das
If you want to find what offset ranges are present in a microbatch in Structured Streaming, you have to look at StreamingQuery.lastProgress or use a StreamingQueryListener. Both of these
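[A sketch of the listener route, using the StreamingQueryListener API; the event callback carries the same progress object that lastProgress returns:]

    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(e: QueryStartedEvent): Unit = ()
      override def onQueryTerminated(e: QueryTerminatedEvent): Unit = ()
      override def onQueryProgress(e: QueryProgressEvent): Unit = {
        // One event per completed micro-batch, with offset ranges per source
        e.progress.sources.foreach { s =>
          println(s"batch=${e.progress.batchId} start=${s.startOffset} end=${s.endOffset}")
        }
      }
    })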

[ANNOUNCE] Apache Celeborn 0.4.1 available

2024-05-22 Thread Nicholas Jiang
Hi all, The Apache Celeborn community is glad to announce the new release of Apache Celeborn 0.4.1. Celeborn is dedicated to improving the efficiency and elasticity of different map-reduce engines and provides an elastic, highly efficient service for intermediate data including shuffle data, spilled da

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
OK, to understand better: your current model relies on streaming data input through a Kafka topic, Spark does some ETL, and you send to a sink, a database or file storage like HDFS etc.? Your current architecture relies on Direct Streams (DStream) and RDDs and you want to move to Spark Structured Stre

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread ashok34...@yahoo.com.INVALID
Hello, what options are you considering yourself? On Wednesday 22 May 2024 at 07:37:30 BST, Anil Dasari wrote: Hello, We are on Spark 3.x and using Spark DStream + Kafka and planning to use Structured Streaming + Kafka. Is there an equivalent of DStream HasOffsetRanges in structure s