[jira] [Commented] (SPARK-21378) Spark Poll timeout when specific offsets are passed

2017-07-18 Thread Shixiong Zhu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-21378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092088#comment-16092088 ]

Shixiong Zhu commented on SPARK-21378:
--

The data must already be in Kafka when the executors try to poll, so if a poll 
returns no records, it means either the timeout is too short or something is 
wrong. If it's just a timeout issue, you can simply increase the timeout. 
Otherwise, IMO, blindly retrying on an unknown error is pretty bad. For example, 
if you type a wrong broker address, the Kafka consumer will just run forever.
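To make the broker-address example concrete, here is a minimal sketch (the address, topic, and group names are all illustrative; "localhost:19092" stands in for a typo'd address where no broker is listening): poll() reports no error and keeps returning empty batches, so a blind retry-on-empty policy never terminates or surfaces the misconfiguration.

{code:java}
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:19092"); // typo'd port: no broker here
props.put("group.id", "example-group");
props.put("key.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");

// Construction does not fail on the bad address; the client only tries to
// connect lazily and retries in the background.
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("some-topic"));
while (true) {
    consumer.poll(1000); // returns an empty batch every time; no exception surfaces
}
{code}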

> Spark Poll timeout when specific offsets are passed
> ---
>
> Key: SPARK-21378
> URL: https://issues.apache.org/jira/browse/SPARK-21378
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.2
>Reporter: Ambud Sharma
>
> Kafka direct stream fails with poll timeout:
> {code:java}
> JavaInputDStream<ConsumerRecord<String, String>> stream = 
> KafkaUtils.createDirectStream(ssc,
>   LocationStrategies.PreferConsistent(),
>   ConsumerStrategies.<String, String>Subscribe(topicsCollection, kafkaParams, fromOffsets));
> {code}
> Digging deeper shows that there's an assert statement such that if no records 
> are returned (which is a valid case) then a failure will happen.
> https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L75
> Applying the solution from https://issues.apache.org/jira/browse/SPARK-19275 just 
> keeps logging "Added jobs for time" and eventually leads to "Failed to get records 
> for spark-x after polling for 3000"; the batch interval in this case is 3 seconds.
> We can increase the poll timeout to an even bigger number, but that leads to an OOM.
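For context, a rough Java paraphrase of the check behind that error message (the real implementation is Scala inside CachedKafkaConsumer.scala; the class and field names here are approximations, not Spark's API): the executor is handed an exact offset range by the driver, polls once if its buffer is empty, and then fails outright instead of polling again.

{code:java}
import java.util.Collections;
import java.util.Iterator;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Hypothetical paraphrase of the executor-side fetch, not the actual code.
class CachedConsumerSketch {
    private final KafkaConsumer<String, String> consumer;
    private Iterator<ConsumerRecord<String, String>> buffer =
        Collections.emptyIterator();

    CachedConsumerSketch(KafkaConsumer<String, String> consumer) {
        this.consumer = consumer;
    }

    ConsumerRecord<String, String> get(long timeoutMs) {
        if (!buffer.hasNext()) {
            buffer = consumer.poll(timeoutMs).iterator();
        }
        // The assert the reporter hit: an empty poll result fails the task
        // outright rather than being treated as "no data yet, poll again".
        if (!buffer.hasNext()) {
            throw new AssertionError(
                "Failed to get records after polling for " + timeoutMs);
        }
        return buffer.next();
    }
}
{code}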




[jira] [Commented] (SPARK-21378) Spark Poll timeout when specific offsets are passed

2017-07-13 Thread Ambud Sharma (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-21378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085930#comment-16085930 ]

Ambud Sharma commented on SPARK-21378:
--

[~zsxwing] 
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#your-own-data-store
 this shows it's not an offset range. It is a *"fromOffset"*, similar to how the 
standard KafkaConsumer (0.9 onwards) works. 

In the standard consumer, if the poll loop returns no records you are supposed to 
keep polling, not throw an exception. Kafka may get data later on, especially 
for topics that are not streaming hundreds of records every second.
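A minimal sketch of that plain-consumer pattern (broker address, topic, and group are illustrative): an empty poll() result is treated as a normal outcome for a quiet topic, and the loop simply polls again.

{code:java}
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "example-group");
props.put("key.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("low-volume-topic"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(1000);
    // An empty result set is not an error: the topic may simply have
    // nothing new yet, so the loop falls through and polls again.
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
    }
}
{code}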

Thoughts?





[jira] [Commented] (SPARK-21378) Spark Poll timeout when specific offsets are passed

2017-07-12 Thread Shixiong Zhu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-21378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084772#comment-16084772 ]

Shixiong Zhu commented on SPARK-21378:
--

bq. Digging deeper shows that there's an assert statement such that if no 
records are returned (which is a valid case) then a failure will happen.

That's actually not a valid case. CachedKafkaConsumer.scala uses the offset 
range generated in the driver, so the records are supposed to be in Kafka 
already. If they are not, it means either a timeout or missing data. If it's 
just a timeout, you can increase "spark.streaming.kafka.consumer.poll.ms".
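For concreteness, a minimal sketch of raising that knob before building the streaming context (the app name and timeout value are illustrative; the 3-second batch interval echoes the report above):

{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf()
    .setAppName("kafka-direct-stream")
    // give executors more time per poll before the fetch gives up
    .set("spark.streaming.kafka.consumer.poll.ms", "30000");
JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(3));
{code}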






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org