Re: Setting startingOffsets to earliest in structured streaming never catches up

2017-01-23 Thread Michael Armbrust
+1 to Ryan's suggestion of setting maxOffsetsPerTrigger.  This way you can
at least see how quickly it is making progress towards catching up.
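For anyone following along, a minimal sketch of what setting "maxOffsetsPerTrigger" might look like. The broker address, topic name, and limit value below are illustrative assumptions, not from this thread:

```python
# Hedged sketch: Kafka source options for a Structured Streaming query.
# Broker, topic, and the 50000 limit are hypothetical placeholders.
kafka_options = {
    "kafka.bootstrap.servers": "broker:9092",  # hypothetical broker
    "subscribe": "events",                     # hypothetical topic
    "startingOffsets": "earliest",             # replay from start of retention
    # Cap each micro-batch so per-batch progress is visible, instead of
    # one giant batch that runs for hours:
    "maxOffsetsPerTrigger": "50000",
}

# With a SparkSession in scope this would be wired up roughly as:
# df = (spark.readStream.format("kafka")
#       .options(**kafka_options)
#       .load())
```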

On Sun, Jan 22, 2017 at 7:02 PM, Timothy Chan  wrote:

> I'm using version 2.0.2.
>
> The difference I see between using latest and earliest is a series of jobs
> that take less than a second vs. one job that goes on for over 24 hours.
>
> On Sun, Jan 22, 2017 at 6:54 PM Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Which Spark version are you using? If you are using 2.1.0, could you use
>> the monitoring APIs (
>> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries)
>> to check the input rate and the processing
>> rate? One possible issue is that the Kafka source launched a pretty large
>> batch and it took too long to finish it. If so, you can use the
>> "maxOffsetsPerTrigger" option to limit the data size in a batch in order to
>> observe the progress.
>>
>> On Sun, Jan 22, 2017 at 10:22 AM, Timothy Chan 
>> wrote:
>>
>> I'm running my structured streaming jobs in EMR. We were thinking a
>> worst-case recovery scenario would be to spin up another cluster and
>> set startingOffsets to earliest (our Kafka cluster has a retention policy
>> of 7 days).
>>
>> My observation is that the job never catches up to latest. This is not
>> acceptable. I've set the number of partitions for the topic to 6. I've
>> tried using a cluster of 4 in EMR.
>>
>> The producer rate for this topic is 4 events/second. Does anyone have any
>> suggestions on what I can do to have my consumer catch up to latest faster?


Re: Setting startingOffsets to earliest in structured streaming never catches up

2017-01-22 Thread Timothy Chan
I'm using version 2.0.2.

The difference I see between using latest and earliest is a series of jobs
that take less than a second vs. one job that goes on for over 24 hours.
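Rough back-of-the-envelope math using the figures from this thread (4 events/second producer rate, 7-day retention). The 100 events/second consumer rate is a hypothetical number for illustration:

```python
# Catch-up arithmetic from the thread's figures.
producer_rate = 4                      # events/second (stated in the thread)
retention_seconds = 7 * 24 * 60 * 60   # 7-day Kafka retention window

backlog = producer_rate * retention_seconds
print(backlog)  # 2419200 events to replay when starting from earliest

# To catch up, the consumer must sustain a rate above the producer's.
# Assuming a hypothetical processing rate of 100 events/second, the net
# drain rate is 96 events/second:
consume_rate = 100
catch_up_hours = backlog / (consume_rate - producer_rate) / 3600
print(round(catch_up_hours, 1))  # 7.0 hours to reach "latest"
```

At these volumes the backlog itself is small, which suggests the bottleneck is per-batch overhead rather than raw data size.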

On Sun, Jan 22, 2017 at 6:54 PM Shixiong(Ryan) Zhu 
wrote:

> Which Spark version are you using? If you are using 2.1.0, could you use
> the monitoring APIs (
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries)
> to check the input rate and the processing rate? One possible issue is that
> the Kafka source launched a pretty large batch and it took too long to
> finish it. If so, you can use the "maxOffsetsPerTrigger" option to limit the
> data size in a batch in order to observe the progress.
>
> On Sun, Jan 22, 2017 at 10:22 AM, Timothy Chan 
> wrote:
>
> I'm running my structured streaming jobs in EMR. We were thinking a
> worst-case recovery scenario would be to spin up another cluster and
> set startingOffsets to earliest (our Kafka cluster has a retention policy
> of 7 days).
>
> My observation is that the job never catches up to latest. This is not
> acceptable. I've set the number of partitions for the topic to 6. I've
> tried using a cluster of 4 in EMR.
>
> The producer rate for this topic is 4 events/second. Does anyone have any
> suggestions on what I can do to have my consumer catch up to latest faster?


Re: Setting startingOffsets to earliest in structured streaming never catches up

2017-01-22 Thread Shixiong(Ryan) Zhu
Which Spark version are you using? If you are using 2.1.0, could you use
the monitoring APIs (
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries)
to check the input rate and the processing rate? One possible issue is that
the Kafka source launched a pretty large batch and it took too long to
finish it. If so, you can use the "maxOffsetsPerTrigger" option to limit the
data size in a batch in order to observe the progress.
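As a sketch of reading those monitoring APIs: in Spark 2.1.0, query.lastProgress exposes a progress report that includes "inputRowsPerSecond" and "processedRowsPerSecond". The sample values below are made up for illustration:

```python
# Hedged sketch: deciding from a progress report whether the stream is
# gaining on the backlog. The sample dict stands in for query.lastProgress.
def is_catching_up(progress):
    """True if rows are processed faster than they arrive (backlog shrinking)."""
    return progress["processedRowsPerSecond"] > progress["inputRowsPerSecond"]

sample_progress = {  # hypothetical snapshot of query.lastProgress
    "inputRowsPerSecond": 4.0,
    "processedRowsPerSecond": 250.0,
}
print(is_catching_up(sample_progress))  # True
```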

On Sun, Jan 22, 2017 at 10:22 AM, Timothy Chan  wrote:

> I'm running my structured streaming jobs in EMR. We were thinking a
> worst-case recovery scenario would be to spin up another cluster and
> set startingOffsets to earliest (our Kafka cluster has a retention policy
> of 7 days).
>
> My observation is that the job never catches up to latest. This is not
> acceptable. I've set the number of partitions for the topic to 6. I've
> tried using a cluster of 4 in EMR.
>
> The producer rate for this topic is 4 events/second. Does anyone have any
> suggestions on what I can do to have my consumer catch up to latest faster?