[jira] [Commented] (SPARK-21378) Spark Poll timeout when specific offsets are passed
[ https://issues.apache.org/jira/browse/SPARK-21378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092088#comment-16092088 ]

Shixiong Zhu commented on SPARK-21378:
--------------------------------------

The data must already be in Kafka when the executors try to poll. So if an executor cannot poll any record, it means either the timeout is too short or something is wrong. If it's just a timeout issue, you can simply increase the timeout. Otherwise, IMO, blindly retrying on an unknown failure is pretty bad. One example: if you type a wrong broker address, the Kafka consumer will just run forever.

> Spark Poll timeout when specific offsets are passed
> ---------------------------------------------------
>
>                 Key: SPARK-21378
>                 URL: https://issues.apache.org/jira/browse/SPARK-21378
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams
>    Affects Versions: 2.0.0, 2.0.2
>            Reporter: Ambud Sharma
>
> Kafka direct stream fails with a poll timeout:
> {code:java}
> JavaInputDStream<ConsumerRecord<String, String>> stream =
>     KafkaUtils.createDirectStream(ssc,
>         LocationStrategies.PreferConsistent(),
>         ConsumerStrategies.<String, String>Subscribe(topicsCollection, kafkaParams, fromOffsets));
> {code}
> Digging deeper shows that there's an assert statement such that if no records
> are returned (which is a valid case) then a failure will happen:
> https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L75
> With the fix from https://issues.apache.org/jira/browse/SPARK-19275 the job keeps
> logging "Added jobs for time" and eventually fails with "Failed to get records
> for spark-x after polling for 3000"; in this case the batch interval is 3 seconds.
> We can increase the timeout to an even bigger number, which leads to OOM.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
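The objection to blind retries can be sketched in isolation. The helper below is hypothetical (not part of Spark or Kafka): it retries a poll a bounded number of times and then fails fast, which is the behavior being argued for here, as opposed to an unbounded retry-on-empty loop that would spin forever on e.g. a mistyped broker address. The consumer is stubbed as a plain `Supplier` so the sketch is self-contained.

```java
import java.util.List;
import java.util.function.Supplier;

public class BoundedPoll {
    // Poll up to maxAttempts times; return the first non-empty batch.
    // On persistent emptiness, throw instead of retrying forever --
    // an unknown failure (bad broker address, missing data) should
    // surface as an error, not an infinite loop.
    static List<String> pollWithBound(Supplier<List<String>> poll, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            List<String> batch = poll.get();
            if (!batch.isEmpty()) {
                return batch;
            }
        }
        throw new IllegalStateException(
            "No records after " + maxAttempts + " polls; check broker address and offsets");
    }
}
```

A real consumer would pass a lambda wrapping `KafkaConsumer.poll(...)` here; the bound and exception message are illustrative choices, not Spark behavior.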
[jira] [Commented] (SPARK-21378) Spark Poll timeout when specific offsets are passed
[ https://issues.apache.org/jira/browse/SPARK-21378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085930#comment-16085930 ]

Ambud Sharma commented on SPARK-21378:
--------------------------------------

[~zsxwing] https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#your-own-data-store shows that this is not an offset range. It is *"fromOffset"*, similar to how the standard KafkaConsumer (0.9 onwards) works. In the standard consumer, if the poll loop returns no records you are supposed to keep polling, not throw an exception. Kafka may get data later on, especially for topics that are not streaming hundreds of records every second. Thoughts?
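The plain-consumer pattern referred to above can be sketched as follows. In a standard KafkaConsumer loop, an empty batch is not an error; the caller simply polls again. The consumer is stubbed as a `Supplier` so the sketch stands alone (a real loop would call `KafkaConsumer.poll` and process `ConsumerRecords`); the `pollUntil` helper and its stop condition are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class PollLoop {
    // Standard consumer pattern: an empty batch just means "no data yet",
    // so the loop polls again rather than failing. Records may arrive
    // later, especially on low-traffic topics.
    static List<String> pollUntil(Supplier<List<String>> poll, int wanted) {
        List<String> received = new ArrayList<>();
        while (received.size() < wanted) {
            received.addAll(poll.get()); // empty result: loop simply continues
        }
        return received;
    }
}
```

Note the trade-off this thread is about: this loop tolerates slow topics but, with no bound, it also never surfaces a misconfiguration.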
[jira] [Commented] (SPARK-21378) Spark Poll timeout when specific offsets are passed
[ https://issues.apache.org/jira/browse/SPARK-21378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084772#comment-16084772 ]

Shixiong Zhu commented on SPARK-21378:
--------------------------------------

bq. Digging deeper shows that there's an assert statement such that if no records are returned (which is a valid case) then a failure will happen.

That's actually not a valid case. CachedKafkaConsumer.scala uses the offset range generated in the driver, so the records are supposed to be in Kafka. If they are not returned, it means either a timeout or missing data. If it's just a timeout, you can increase "spark.streaming.kafka.consumer.poll.ms".
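For reference, "spark.streaming.kafka.consumer.poll.ms" is a Spark setting (set on the SparkConf before creating the streaming context), not one of the Kafka consumer params. A minimal sketch of the values involved, using a plain map so it stands alone; the key name comes from this thread, while the 60000 ms value is only an illustration and should be tuned to the cluster:

```java
import java.util.HashMap;
import java.util.Map;

public class PollTimeoutConfig {
    static Map<String, String> sparkKafkaConf() {
        Map<String, String> conf = new HashMap<>();
        // How long an executor-side cached consumer waits for records
        // before the fetch is treated as failed; raise it when brokers
        // are slow to respond (illustrative value, not a recommendation).
        conf.put("spark.streaming.kafka.consumer.poll.ms", "60000");
        return conf;
    }
}
```

In a real job this would be `sparkConf.set("spark.streaming.kafka.consumer.poll.ms", "60000")` before constructing the JavaStreamingContext.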