Thanks. I could repro myself as well. How long was the network down?

Trying to get the fix into 2.7 RC2.

On Fri, Sep 14, 2018 at 12:25 PM Eduardo Soldera <
[email protected]> wrote:

> Just to make myself clear, I'm not sure how to use the patch but if you
> could send us some guidance would be great.
>
> Em sex, 14 de set de 2018 às 16:24, Eduardo Soldera <
> [email protected]> escreveu:
>
>> Hi Raghu, yes, it is feasible, would you do that for us? I'm not sure how
>> we'd use the patch. We're using SBT and Spotify's Scio with Scala.
>>
>> Thanks
>>
>> Em sex, 14 de set de 2018 às 16:07, Raghu Angadi <[email protected]>
>> escreveu:
>>
>>> Is is feasible for you to verify the fix in your dev job? I can make a
>>> patch against Beam 2.4 branch if you like.
>>>
>>> Raghu.
>>>
>>> On Fri, Sep 14, 2018 at 11:14 AM Eduardo Soldera <
>>> [email protected]> wrote:
>>>
>>>> Hi Raghu, thank you very much for the pull request.
>>>> We'll wait for the 2.7 Beam release.
>>>>
>>>> Regards!
>>>>
>>>> Em qui, 13 de set de 2018 às 18:19, Raghu Angadi <[email protected]>
>>>> escreveu:
>>>>
>>>>> Fix: https://github.com/apache/beam/pull/6391
>>>>>
>>>>> On Wed, Sep 12, 2018 at 3:30 PM Raghu Angadi <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Filed BEAM-5375 <https://issues.apache.org/jira/browse/BEAM-5375>. I
>>>>>> will fix it later this week.
>>>>>>
>>>>>> On Wed, Sep 12, 2018 at 12:16 PM Raghu Angadi <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 12, 2018 at 12:11 PM Raghu Angadi <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for the job id, I looked at the worker logs (following usual
>>>>>>>> support oncall access protocol that provides temporary access to things
>>>>>>>> like logs in GCP):
>>>>>>>>
>>>>>>>> Root issue looks like consumerPollLoop() mentioned earlier needs to
>>>>>>>> handle unchecked exception. In your case it is clear that poll thread
>>>>>>>> exited with a runtime exception. The reader does not check for it and
>>>>>>>> continues to wait for poll thread to enqueue messages. A fix should 
>>>>>>>> result
>>>>>>>> in an IOException for read from the source. The runners will handle 
>>>>>>>> that
>>>>>>>> appropriately after that.  I will file a jira.
>>>>>>>>
>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java#L678
>>>>>>>>
>>>>>>>
>>>>>>> Ignore the link.. was pasted here by mistake.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> From the logs (with a comment below each one):
>>>>>>>>
>>>>>>>>    - 2018-09-12 06:13:07.345 PDT Reader-0: reading from
>>>>>>>>    kafka_topic-0 starting at offset 2
>>>>>>>>       - Implies the reader is initialized and poll thread is
>>>>>>>>       started.
>>>>>>>>    - 2018-09-12 06:13:07.780 PDT Reader-0: first record offset 2
>>>>>>>>       - The reader actually got a message received by the poll
>>>>>>>>       thread from Kafka.
>>>>>>>>    - 2018-09-12 06:16:48.771 PDT Reader-0: exception while
>>>>>>>>    fetching latest offset for partition kafka_topic-0. will be retried.
>>>>>>>>       - This must have happened around the time when network was
>>>>>>>>       disrupted. This is from. Actual log is from another periodic 
>>>>>>>> task that
>>>>>>>>       fetches latest offsets for partitions.
>>>>>>>>
>>>>>>>> The poll thread must have died around the time network was
>>>>>>>> disrupted.
>>>>>>>>
>>>>>>>> The following log comes from kafka client itself and is printed
>>>>>>>> every second when KafkaIO fetches latest offset. This log seems to be 
>>>>>>>> added
>>>>>>>> in recent versions. It is probably an unintentional log. I don't think
>>>>>>>> there is any better to fetch latest offsets than how KafkaIO does now. 
>>>>>>>> This
>>>>>>>> is logged inside consumer.position() called at [1].
>>>>>>>>
>>>>>>>>    - 2018-09-12 06:13:11.786 PDT [Consumer clientId=consumer-2,
>>>>>>>>    groupId=Reader-0_offset_consumer_1735388161_genericPipe] Resetting 
>>>>>>>> offset
>>>>>>>>    for partition com.arquivei.dataeng.andre-0 to offset 3.
>>>>>>>>
>>>>>>>> [1]:
>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java#L678
>>>>>>>>
>>>>>>>
>>>>>>> This 'Resetting offset' is harmless, but is quite annoying to see in
>>>>>>> the worker logs. One way to avoid is to set kafka consumer's log level 
>>>>>>> to
>>>>>>> WARNING. Ideally KafkaIO itself should do something to avoid it without
>>>>>>> user option.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Wed, Sep 12, 2018 at 10:27 AM Eduardo Soldera <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Raghu! The job_id of our dev job is
>>>>>>>>> 2018-09-12_06_11_48-5600553605191377866.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> Em qua, 12 de set de 2018 às 14:18, Raghu Angadi <
>>>>>>>>> [email protected]> escreveu:
>>>>>>>>>
>>>>>>>>>> Thanks for debugging.
>>>>>>>>>> Can you provide the job_id of your dev job? The stacktrace shows
>>>>>>>>>> that there is no thread running 'consumerPollLoop()' which can 
>>>>>>>>>> explain
>>>>>>>>>> stuck reader. You will likely find a logs at line 594 & 587 [1].  
>>>>>>>>>> Dataflow
>>>>>>>>>> caches its readers and DirectRunner may not. That can explain 
>>>>>>>>>> DirectRunner
>>>>>>>>>> resume reads. The expectation in KafkaIO is that Kafka client 
>>>>>>>>>> library takes
>>>>>>>>>> care of retrying in case of connection problems (as documented). It 
>>>>>>>>>> is
>>>>>>>>>> possible that in some cases poll() throws and we need to restart the 
>>>>>>>>>> client
>>>>>>>>>> in KafkaIO.
>>>>>>>>>>
>>>>>>>>>> [1]:
>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java#L594
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 12, 2018 at 9:59 AM Eduardo Soldera <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Raghu, thanks for your help.
>>>>>>>>>>> Just answering your previous question, the following logs were
>>>>>>>>>>> the same as before the error, as if the pipeline were still getting 
>>>>>>>>>>> the
>>>>>>>>>>> messages, for example:
>>>>>>>>>>>
>>>>>>>>>>> (...)
>>>>>>>>>>> Resetting offset for partition com.arquivei.dataeng.andre-0 to
>>>>>>>>>>> offset 10.
>>>>>>>>>>> Resetting offset for partition com.arquivei.dataeng.andre-0 to
>>>>>>>>>>> offset 15.
>>>>>>>>>>> ERROR
>>>>>>>>>>> Resetting offset for partition com.arquivei.dataeng.andre-0 to
>>>>>>>>>>> offset 22.
>>>>>>>>>>> Resetting offset for partition com.arquivei.dataeng.andre-0 to
>>>>>>>>>>> offset 30.
>>>>>>>>>>> (...)
>>>>>>>>>>>
>>>>>>>>>>> But when checking the Kafka Consumer Group, the current offset
>>>>>>>>>>> stays at 15, the commited offset from the last processed message, 
>>>>>>>>>>> before
>>>>>>>>>>> the error.
>>>>>>>>>>>
>>>>>>>>>>> We'll file a bug, but we could now reproduce the issue in a Dev
>>>>>>>>>>> scenario.
>>>>>>>>>>> We started the same pipeline using the direct runner, without
>>>>>>>>>>> Google Dataflow. We blocked the Kafka Broker network and the same 
>>>>>>>>>>> error was
>>>>>>>>>>> thrown. Then we unblocked the network and the pipeline was able to
>>>>>>>>>>> successfully process the subsequent messages.
>>>>>>>>>>> When we started the same pipeline in the Dataflow runner and did
>>>>>>>>>>> the same test, the same problem from our production scenario 
>>>>>>>>>>> happened,
>>>>>>>>>>> Dataflow couldn't process the new messages. Unfortunately, we've 
>>>>>>>>>>> stopped
>>>>>>>>>>> the dataflow job in production, but the problematic dev job is still
>>>>>>>>>>> running and the log file of the VM is attached. Thank you very much.
>>>>>>>>>>> Best regards
>>>>>>>>>>>
>>>>>>>>>>> Em ter, 11 de set de 2018 às 18:28, Raghu Angadi <
>>>>>>>>>>> [email protected]> escreveu:
>>>>>>>>>>>
>>>>>>>>>>>> Specifically, I am interested if you have any thread running
>>>>>>>>>>>> 'consumerPollLoop()' [1]. There should always be one (if a worker 
>>>>>>>>>>>> is
>>>>>>>>>>>> assigned one of the partitions). It is possible that KafkaClient 
>>>>>>>>>>>> itself is
>>>>>>>>>>>> hasn't recovered from the group coordinator error (though 
>>>>>>>>>>>> unlikely).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java#L570
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 11, 2018 at 12:31 PM Raghu Angadi <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Eduardo,
>>>>>>>>>>>>>
>>>>>>>>>>>>> In case of any error, the pipeline should keep on trying to
>>>>>>>>>>>>> fetch. I don't know about this particular error. Do you see any 
>>>>>>>>>>>>> others
>>>>>>>>>>>>> afterwards in the log?
>>>>>>>>>>>>> Couple of things you could try if the logs are not useful :
>>>>>>>>>>>>>  - login to one of the VMs and get stacktrace of java worker
>>>>>>>>>>>>> (look for a container called java-streaming)
>>>>>>>>>>>>>  - file a support bug or stackoverflow question with jobid so
>>>>>>>>>>>>> that Dataflow oncall can take a look.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Raghu.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 11, 2018 at 12:10 PM Eduardo Soldera <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> We have a Apache Beam pipeline running in Google Dataflow
>>>>>>>>>>>>>> using KafkaIO. Suddenly the pipeline stop fetching Kafka 
>>>>>>>>>>>>>> messages at all,
>>>>>>>>>>>>>> as our other workers from other pipelines continued to get Kafka 
>>>>>>>>>>>>>> messages.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> At the moment it stopped we got these messages:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I  [Consumer clientId=consumer-1, groupId=genericPipe] Error 
>>>>>>>>>>>>>> sending fetch request (sessionId=1396189203, epoch=2431598) to 
>>>>>>>>>>>>>> node 3: org.apache.kafka.common.errors.DisconnectException.
>>>>>>>>>>>>>> I  [Consumer clientId=consumer-1, groupId=genericPipe] Group 
>>>>>>>>>>>>>> coordinator 10.0.52.70:9093 (id: 2147483646 rack: null) is 
>>>>>>>>>>>>>> unavailable or invalid, will attempt rediscovery
>>>>>>>>>>>>>> I  [Consumer clientId=consumer-1, groupId=genericPipe] 
>>>>>>>>>>>>>> Discovered group coordinator 10.0.52.70:9093 (id: 2147483646 
>>>>>>>>>>>>>> rack: null)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And then the pipeline stopped reading the messages.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is the KafkaIO setup  we have:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> KafkaIO.read[String,String]()
>>>>>>>>>>>>>>   .withBootstrapServers(server)
>>>>>>>>>>>>>>   .withTopic(topic)
>>>>>>>>>>>>>>   .withKeyDeserializer(classOf[StringDeserializer])
>>>>>>>>>>>>>>   .withValueDeserializer(classOf[StringDeserializer])
>>>>>>>>>>>>>>   .updateConsumerProperties(properties)
>>>>>>>>>>>>>>   .commitOffsetsInFinalize()
>>>>>>>>>>>>>>   .withoutMetadata()
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  Any help will be much appreciated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Eduardo Soldera Garcia
>>>>>>>>>>>>>> Data Engineer
>>>>>>>>>>>>>> (16) 3509-5555 | www.arquivei.com.br
>>>>>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
>>>>>>>>>>>>>> [image: Arquivei.com.br – Inteligência em Notas Fiscais]
>>>>>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
>>>>>>>>>>>>>> [image: Google seleciona Arquivei para imersão e mentoria no
>>>>>>>>>>>>>> Vale do Silício]
>>>>>>>>>>>>>> <https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad>
>>>>>>>>>>>>>> <https://www.facebook.com/arquivei>
>>>>>>>>>>>>>> <https://www.linkedin.com/company/arquivei>
>>>>>>>>>>>>>> <https://www.youtube.com/watch?v=sSUUKxbXnxk>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Eduardo Soldera Garcia
>>>>>>>>>>> Data Engineer
>>>>>>>>>>> (16) 3509-5555 | www.arquivei.com.br
>>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
>>>>>>>>>>> [image: Arquivei.com.br – Inteligência em Notas Fiscais]
>>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
>>>>>>>>>>> [image: Google seleciona Arquivei para imersão e mentoria no
>>>>>>>>>>> Vale do Silício]
>>>>>>>>>>> <https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad>
>>>>>>>>>>> <https://www.facebook.com/arquivei>
>>>>>>>>>>> <https://www.linkedin.com/company/arquivei>
>>>>>>>>>>> <https://www.youtube.com/watch?v=sSUUKxbXnxk>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Eduardo Soldera Garcia
>>>>>>>>> Data Engineer
>>>>>>>>> (16) 3509-5555 | www.arquivei.com.br
>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
>>>>>>>>> [image: Arquivei.com.br – Inteligência em Notas Fiscais]
>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
>>>>>>>>> [image: Google seleciona Arquivei para imersão e mentoria no Vale
>>>>>>>>> do Silício]
>>>>>>>>> <https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad>
>>>>>>>>> <https://www.facebook.com/arquivei>
>>>>>>>>> <https://www.linkedin.com/company/arquivei>
>>>>>>>>> <https://www.youtube.com/watch?v=sSUUKxbXnxk>
>>>>>>>>>
>>>>>>>>
>>>>
>>>> --
>>>> Eduardo Soldera Garcia
>>>> Data Engineer
>>>> (16) 3509-5555 | www.arquivei.com.br
>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
>>>> [image: Arquivei.com.br – Inteligência em Notas Fiscais]
>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
>>>> [image: Google seleciona Arquivei para imersão e mentoria no Vale do
>>>> Silício]
>>>> <https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad>
>>>> <https://www.facebook.com/arquivei>
>>>> <https://www.linkedin.com/company/arquivei>
>>>> <https://www.youtube.com/watch?v=sSUUKxbXnxk>
>>>>
>>>
>>
>> --
>> Eduardo Soldera Garcia
>> Data Engineer
>> (16) 3509-5555 | www.arquivei.com.br
>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
>> [image: Arquivei.com.br – Inteligência em Notas Fiscais]
>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
>> [image: Google seleciona Arquivei para imersão e mentoria no Vale do
>> Silício]
>> <https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad>
>> <https://www.facebook.com/arquivei>
>> <https://www.linkedin.com/company/arquivei>
>> <https://www.youtube.com/watch?v=sSUUKxbXnxk>
>>
>
>
> --
> Eduardo Soldera Garcia
> Data Engineer
> (16) 3509-5555 | www.arquivei.com.br
> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
> [image: Arquivei.com.br – Inteligência em Notas Fiscais]
> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
> [image: Google seleciona Arquivei para imersão e mentoria no Vale do
> Silício]
> <https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad>
> <https://www.facebook.com/arquivei>
> <https://www.linkedin.com/company/arquivei>
> <https://www.youtube.com/watch?v=sSUUKxbXnxk>
>

Reply via email to