Re: Setting startingOffsets to earliest in structured streaming never catches up
+1 to Ryan's suggestion of setting maxOffsetsPerTrigger. This way you can at least see how quickly it is making progress towards catching up. On Sun, Jan 22, 2017 at 7:02 PM, Timothy Chanwrote: > I'm using version 2.02. > > The difference I see between using latest and earliest is a series of jobs > that take less than a second vs. one job that goes on for over 24 hours. > > On Sun, Jan 22, 2017 at 6:54 PM Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Which Spark version are you using? If you are using 2.1.0, could you use >> the monitoring APIs (http://spark.apache.org/docs/ >> latest/structured-streaming-programming-guide.html# >> monitoring-streaming-queries) to check the input rate and the processing >> rate? One possible issue is that the Kafka source launched a pretty large >> batch and it took too long to finish it. If so, you can use >> "maxOffsetsPerTrigger" option to limit the data size in a batch in order to >> observe the progress. >> >> On Sun, Jan 22, 2017 at 10:22 AM, Timothy Chan >> wrote: >> >> I'm running my structured streaming jobs in EMR. We were thinking a worst >> case scenario recovery situation would be to spin up another cluster and >> set startingOffsets to earliest (our Kafka cluster has a retention policy >> of 7 days). >> >> My observation is that the job never catches up to latest. This is not >> acceptable. I've set the number of partitions for the topic to 6. I've >> tried using a cluster of 4 in EMR. >> >> The producer rate for this topic is 4 events/second. Does anyone have any >> suggestions on what I can do to have my consumer catch up to latest faster? >> >> >>
Re: Setting startingOffsets to earliest in structured streaming never catches up
I'm using version 2.02. The difference I see between using latest and earliest is a series of jobs that take less than a second vs. one job that goes on for over 24 hours. On Sun, Jan 22, 2017 at 6:54 PM Shixiong(Ryan) Zhuwrote: > Which Spark version are you using? If you are using 2.1.0, could you use > the monitoring APIs ( > http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries) > to check the input rate and the processing rate? One possible issue is that > the Kafka source launched a pretty large batch and it took too long to > finish it. If so, you can use "maxOffsetsPerTrigger" option to limit the > data size in a batch in order to observe the progress. > > On Sun, Jan 22, 2017 at 10:22 AM, Timothy Chan > wrote: > > I'm running my structured streaming jobs in EMR. We were thinking a worst > case scenario recovery situation would be to spin up another cluster and > set startingOffsets to earliest (our Kafka cluster has a retention policy > of 7 days). > > My observation is that the job never catches up to latest. This is not > acceptable. I've set the number of partitions for the topic to 6. I've > tried using a cluster of 4 in EMR. > > The producer rate for this topic is 4 events/second. Does anyone have any > suggestions on what I can do to have my consumer catch up to latest faster? > > >
Re: Setting startingOffsets to earliest in structured streaming never catches up
Which Spark version are you using? If you are using 2.1.0, could you use the monitoring APIs ( http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries) to check the input rate and the processing rate? One possible issue is that the Kafka source launched a pretty large batch and it took too long to finish it. If so, you can use "maxOffsetsPerTrigger" option to limit the data size in a batch in order to observe the progress. On Sun, Jan 22, 2017 at 10:22 AM, Timothy Chanwrote: > I'm running my structured streaming jobs in EMR. We were thinking a worst > case scenario recovery situation would be to spin up another cluster and > set startingOffsets to earliest (our Kafka cluster has a retention policy > of 7 days). > > My observation is that the job never catches up to latest. This is not > acceptable. I've set the number of partitions for the topic to 6. I've > tried using a cluster of 4 in EMR. > > The producer rate for this topic is 4 events/second. Does anyone have any > suggestions on what I can do to have my consumer catch up to latest faster? >