Chaoqin Li created SPARK-47793:
----------------------------------

             Summary: Implement SimpleDataSourceStreamReader for python 
streaming data source
                 Key: SPARK-47793
                 URL: https://issues.apache.org/jira/browse/SPARK-47793
             Project: Spark
          Issue Type: New Feature
          Components: PySpark, SS
    Affects Versions: 3.5.1
            Reporter: Chaoqin Li


 SimpleDataSourceStreamReader is a simplified version of the DataStreamReader 
interface.
 # It doesn’t require developers to reason about data partitioning.
 # It doesn’t require getting the latest offset before reading data.

There are 2 functions that needs to be defined 
 # Read data and return the end offset.

_def read(self, start: Offset) -> (Iterator[Tuple], Offset)_
 # Read data between start and end offset, this is required for exactly once 
read.

_def read2(self, start: Offset, end: Offset) -> Iterator[Tuple]_

 

Implementation: Wrap the SimpleDataSourceStreamReader instance in a 
DataSourceStreamReader internally and make the prefetching and caching 
transparent to the data source developer. The record prefetched in python 
process will be sent to JVM as arrow record batches.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to