[ 
https://issues.apache.org/jira/browse/SPARK-10734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903106#comment-14903106
 ] 

Cody Koeninger commented on SPARK-10734:
----------------------------------------

As I explained in SPARK-10732, Kafka's getOffsetsBefore API is limited to the 
timestamps on log file segments, so its granularity is quite poor and it doesn't 
really behave as one might expect.
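
To make the granularity point concrete, here is a minimal toy model in Python. 
It is not Kafka's code or API: the Segment type, the offset_before function and 
the sample numbers are all made up for illustration, and it assumes a lookup 
can only resolve a timestamp to the start of a log segment whose file 
modification time is at or before that timestamp.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Segment(NamedTuple):
    """Hypothetical stand-in for one log file segment of a partition."""
    base_offset: int       # first message offset stored in the segment
    last_modified_ms: int  # modification time of the segment file


def offset_before(segments: List[Segment], timestamp_ms: int) -> Optional[int]:
    """Return the base offset of the newest segment whose modification time
    is <= timestamp_ms, or None if no segment qualifies.

    Assumes `segments` is sorted by base_offset with non-decreasing
    modification times (an assumption of this toy model).
    """
    mtimes = [s.last_modified_ms for s in segments]
    idx = bisect_right(mtimes, timestamp_ms) - 1
    return segments[idx].base_offset if idx >= 0 else None


# Made-up data: three segments of 500 messages each.
segments = [
    Segment(base_offset=0, last_modified_ms=1_000),
    Segment(base_offset=500, last_modified_ms=2_000),
    Segment(base_offset=1_000, last_modified_ms=3_000),
]

# Two "batch times" almost a second apart land in the same gap between
# segment timestamps, so they resolve to the same offset.
print(offset_before(segments, 2_001))
print(offset_before(segments, 2_999))
# A timestamp older than every segment resolves to nothing at all.
print(offset_before(segments, 999))
```

In this model every timestamp between two segment modification times maps to 
the same offset, so batches keyed on batch time would be either empty or a 
whole segment long, rather than tracking the messages that actually arrived 
in each interval.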


> DirectKafkaInputDStream uses the OffsetRequest.LatestTime to find the latest 
> offset, however using the batch time would be more desirable.
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10734
>                 URL: https://issues.apache.org/jira/browse/SPARK-10734
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>            Reporter: Bijay Singh Bisht
>
> DirectKafkaInputDStream uses OffsetRequest.LatestTime to find the latest 
> offset. However, since OffsetRequest.LatestTime is relative, the result 
> depends on when the batch is scheduled. One would expect that, given an input 
> data set, the data in the batches should be predictable, irrespective of the 
> system conditions. Using the batch time implies that the stream processing 
> will produce the same batches irrespective of when the processing was 
> started and of the load conditions on the system.
> This, along with [SPARK-10732], provides for nice regression scenarios.



