Chaoqin Li created SPARK-47793:
----------------------------------

Summary: Implement SimpleDataSourceStreamReader for python streaming data source
Key: SPARK-47793
URL: https://issues.apache.org/jira/browse/SPARK-47793
Project: Spark
Issue Type: New Feature
Components: PySpark, SS
Affects Versions: 3.5.1
Reporter: Chaoqin Li

SimpleDataSourceStreamReader is a simplified version of the DataSourceStreamReader interface.
# It doesn’t require developers to reason about data partitioning.
# It doesn’t require getting the latest offset before reading data.

There are 2 functions that need to be defined:
# Read data and return the end offset.
_def read(self, start: Offset) -> (Iterator[Tuple], Offset)_
# Read data between the start and end offsets; this is required for exactly-once reads.
_def read2(self, start: Offset, end: Offset) -> Iterator[Tuple]_

Implementation: Wrap the SimpleDataSourceStreamReader instance in a DataSourceStreamReader internally, so that prefetching and caching are transparent to the data source developer. Records prefetched in the Python process are sent to the JVM as Arrow record batches.
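
The two functions and the wrapping idea described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the proposed PySpark implementation: CounterStreamReader, PrefetchingWrapper, initial_offset, latest_offset, read_range and commit are hypothetical names, offsets are assumed to be small dicts, and the Arrow transfer to the JVM is left out.

```python
from typing import Dict, Iterator, List, Tuple

# Assumption: an offset is a small JSON-like dict; the issue does not fix its shape.
Offset = Dict[str, int]


class CounterStreamReader:
    """Hypothetical simple reader that emits consecutive integers."""

    BATCH_SIZE = 3  # illustrative: rows returned per read() call

    def initial_offset(self) -> Offset:
        # Assumption: the reader can name a starting point (not in the issue text).
        return {"value": 0}

    def read(self, start: Offset) -> Tuple[Iterator[Tuple], Offset]:
        # Read what is available from `start` and report where reading stopped.
        begin = start["value"]
        end = begin + self.BATCH_SIZE
        return iter([(i,) for i in range(begin, end)]), {"value": end}

    def read2(self, start: Offset, end: Offset) -> Iterator[Tuple]:
        # Deterministic replay of a known range, needed for exactly-once reads.
        return iter([(i,) for i in range(start["value"], end["value"])])


class PrefetchingWrapper:
    """Sketch of a wrapper that prefetches rows and caches them by offset range."""

    def __init__(self, reader: CounterStreamReader) -> None:
        self.reader = reader
        self.current = reader.initial_offset()
        self.cache: List[Tuple[Offset, Offset, List[Tuple]]] = []

    def latest_offset(self) -> Offset:
        # Prefetch: reading is how the wrapper learns the latest offset.
        rows, end = self.reader.read(self.current)
        self.cache.append((self.current, end, list(rows)))
        self.current = end
        return end

    def read_range(self, start: Offset, end: Offset) -> List[Tuple]:
        # Serve from the cache when the range was prefetched.
        for cached_start, cached_end, rows in self.cache:
            if cached_start == start and cached_end == end:
                return rows
        # Cache miss (for example after a restart): replay the range.
        return list(self.reader.read2(start, end))

    def commit(self, end: Offset) -> None:
        # Drop cached batches up to and including the committed end offset.
        for i, (_, cached_end, _) in enumerate(self.cache):
            if cached_end == end:
                self.cache = self.cache[i + 1:]
                return
```

In this sketch the developer of CounterStreamReader never computes a latest offset or a partition plan. The wrapper obtains the end offset as a side effect of read, and falls back to read2 only when the requested range is no longer cached.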