[ https://issues.apache.org/jira/browse/SPARK-10734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903106#comment-14903106 ]
Cody Koeninger commented on SPARK-10734:
----------------------------------------

As I explained in SPARK-10732, Kafka's getOffsetsBefore API is limited to the timestamps on log file segments, so its granularity is quite poor and it doesn't really behave as one might expect.

> DirectKafkaInputDStream uses the OffsetRequest.LatestTime to find the latest offset, however using the batch time would be more desirable.
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-10734
>                 URL: https://issues.apache.org/jira/browse/SPARK-10734
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>            Reporter: Bijay Singh Bisht
>
> DirectKafkaInputDStream uses the OffsetRequest.LatestTime to find the latest
> offset, however since OffsetRequest.LatestTime is a relative thing, it
> depends on when the batch is scheduled. One would imagine that given an input
> data set the data in the batches should be predictable, irrespective of the
> system conditions. Using the batch time implies that the stream processing
> will have the same batches irrespective of when the processing was
> started and of the load conditions on the system.
> This along with [SPARK-10732] provides for nice regression scenarios.
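To make the granularity point in the comment concrete, here is a toy model of a time-based offset lookup that can only see one timestamp per log segment. It is a sketch under stated assumptions, not Kafka's implementation: the `Segment` type, the `offset_before` function, and the sample segment data are all hypothetical, and the real broker logic has more cases than this.

```python
from typing import List, NamedTuple, Optional


class Segment(NamedTuple):
    """Hypothetical stand-in for a log segment: its first offset and file mtime."""
    base_offset: int
    last_modified_ms: int


def offset_before(segments: List[Segment], timestamp_ms: int) -> Optional[int]:
    """Return the base offset of the newest segment last modified at or before
    timestamp_ms, or None if no segment qualifies.

    Only one timestamp per segment is known, so every query time that falls
    between two segment timestamps resolves to the same offset.
    """
    candidates = [s.base_offset for s in segments if s.last_modified_ms <= timestamp_ms]
    return max(candidates) if candidates else None


# Assumed sample data: three segments holding offsets 0-499, 500-1199, 1200+.
SEGMENTS = [
    Segment(base_offset=0, last_modified_ms=1_000),
    Segment(base_offset=500, last_modified_ms=2_000),
    Segment(base_offset=1200, last_modified_ms=3_000),
]

# Two batch times almost a full segment apart resolve to the same offset.
print(offset_before(SEGMENTS, 2_000))  # 500
print(offset_before(SEGMENTS, 2_999))  # 500
print(offset_before(SEGMENTS, 3_000))  # 1200
print(offset_before(SEGMENTS, 999))    # None
```

In this model, batch boundaries derived from batch time can only ever land on segment boundaries, which is why a time-based lookup would not give the per-batch predictability the issue asks for.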