Hi Raghu, we're not sure how long the network was down. According to the
logs, it was no longer than one minute. A 30-second shutdown should be
enough for the tests.

Regards

On Fri, Sep 14, 2018 at 9:41 PM, Raghu Angadi <[email protected]>
wrote:

> Thanks. I could repro myself as well. How long was the network down?
>
> Trying to get the fix into 2.7 RC2.
>
> On Fri, Sep 14, 2018 at 12:25 PM Eduardo Soldera <
> [email protected]> wrote:
>
>> Just to make myself clear, I'm not sure how to use the patch, but if you
>> could send us some guidance, that would be great.
>>
>> On Fri, Sep 14, 2018 at 4:24 PM, Eduardo Soldera <
>> [email protected]> wrote:
>>
>>> Hi Raghu, yes, it is feasible, would you do that for us? I'm not sure
>>> how we'd use the patch. We're using SBT and Spotify's Scio with Scala.
>>>
>>> Thanks
>>>
>>> On Fri, Sep 14, 2018 at 4:07 PM, Raghu Angadi <[email protected]>
>>> wrote:
>>>
>>>> Is it feasible for you to verify the fix in your dev job? I can make a
>>>> patch against Beam 2.4 branch if you like.
>>>>
>>>> Raghu.
>>>>
>>>> On Fri, Sep 14, 2018 at 11:14 AM Eduardo Soldera <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Raghu, thank you very much for the pull request.
>>>>> We'll wait for the 2.7 Beam release.
>>>>>
>>>>> Regards!
>>>>>
>>>>> On Thu, Sep 13, 2018 at 6:19 PM, Raghu Angadi <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Fix: https://github.com/apache/beam/pull/6391
>>>>>>
>>>>>> On Wed, Sep 12, 2018 at 3:30 PM Raghu Angadi <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Filed BEAM-5375 <https://issues.apache.org/jira/browse/BEAM-5375>.
>>>>>>> I will fix it later this week.
>>>>>>>
>>>>>>> On Wed, Sep 12, 2018 at 12:16 PM Raghu Angadi <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Sep 12, 2018 at 12:11 PM Raghu Angadi <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the job id, I looked at the worker logs (following
>>>>>>>>> usual support oncall access protocol that provides temporary access to
>>>>>>>>> things like logs in GCP):
>>>>>>>>>
>>>>>>>>> The root issue looks like consumerPollLoop(), mentioned earlier,
>>>>>>>>> needs to handle unchecked exceptions. In your case it is clear that the
>>>>>>>>> poll thread exited with a runtime exception. The reader does not check
>>>>>>>>> for it and continues to wait for the poll thread to enqueue messages. A
>>>>>>>>> fix should result in an IOException for reads from the source. The
>>>>>>>>> runners will handle that appropriately after that. I will file a jira.
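
The failure mode described above can be illustrated with a minimal sketch. It assumes a simplified reader that owns a background poll thread; the names PollingReaderSketch and advance are hypothetical, and this is neither the actual KafkaUnboundedReader code nor the fix in the Beam pull request. The idea is only that the poll loop records an unchecked exception instead of dying silently, and the reader surfaces it as an IOException rather than waiting forever.

```java
import java.io.IOException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical stand-in for a reader fed by a background poll thread. */
class PollingReaderSketch {
  private final SynchronousQueue<String> available = new SynchronousQueue<>();
  // Holds the unchecked exception that ended the poll loop, if any.
  private final AtomicReference<Throwable> pollFailure = new AtomicReference<>();
  private final Thread pollThread;

  PollingReaderSketch(Supplier<String> poller) {
    pollThread = new Thread(() -> {
      try {
        while (!Thread.currentThread().isInterrupted()) {
          String record = poller.get(); // may throw an unchecked exception
          available.put(record);        // blocks until the reader takes it
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // normal shutdown path
      } catch (RuntimeException e) {
        pollFailure.set(e); // remember why the loop exited
      }
    }, "poll-loop-sketch");
    pollThread.setDaemon(true);
    pollThread.start();
  }

  /** Returns the next record, or null if none arrived within the timeout. */
  String advance(long timeoutMillis) throws IOException {
    String record;
    try {
      record = available.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted while waiting for records", e);
    }
    Throwable failure = pollFailure.get();
    if (record == null && failure != null) {
      // Without this check the reader would keep waiting on a dead thread.
      throw new IOException("Poll thread exited unexpectedly", failure);
    }
    return record;
  }

  void close() {
    pollThread.interrupt();
  }
}
```

With a check like this, a caller sees an IOException on the first read after the poll thread has died, and can decide whether to recreate the reader.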
>>>>>>>>>
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java#L678
>>>>>>>>>
>>>>>>>>
>>>>>>>> Ignore the link.. was pasted here by mistake.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> From the logs (with a comment below each one):
>>>>>>>>>
>>>>>>>>>    - 2018-09-12 06:13:07.345 PDT Reader-0: reading from
>>>>>>>>>    kafka_topic-0 starting at offset 2
>>>>>>>>>       - Implies the reader is initialized and poll thread is
>>>>>>>>>       started.
>>>>>>>>>    - 2018-09-12 06:13:07.780 PDT Reader-0: first record offset 2
>>>>>>>>>       - The reader actually got a message received by the poll
>>>>>>>>>       thread from Kafka.
>>>>>>>>>    - 2018-09-12 06:16:48.771 PDT Reader-0: exception while
>>>>>>>>>    fetching latest offset for partition kafka_topic-0. will be retried.
>>>>>>>>>       - This must have happened around the time when the network was
>>>>>>>>>       disrupted. The actual log is from another periodic task that
>>>>>>>>>       fetches latest offsets for partitions.
>>>>>>>>>
>>>>>>>>> The poll thread must have died around the time network was
>>>>>>>>> disrupted.
>>>>>>>>>
>>>>>>>>> The following log comes from the Kafka client itself and is printed
>>>>>>>>> every second when KafkaIO fetches the latest offset. This log seems to
>>>>>>>>> have been added in recent versions. It is probably an unintentional
>>>>>>>>> log. I don't think there is any better way to fetch latest offsets than
>>>>>>>>> how KafkaIO does now. This is logged inside consumer.position() called
>>>>>>>>> at [1].
>>>>>>>>>
>>>>>>>>>    - 2018-09-12 06:13:11.786 PDT [Consumer clientId=consumer-2,
>>>>>>>>>    groupId=Reader-0_offset_consumer_1735388161_genericPipe] Resetting
>>>>>>>>>    offset for partition com.arquivei.dataeng.andre-0 to offset 3.
>>>>>>>>>
>>>>>>>>> [1]:
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java#L678
>>>>>>>>>
>>>>>>>>
>>>>>>>> This 'Resetting offset' is harmless, but is quite annoying to see
>>>>>>>> in the worker logs. One way to avoid it is to set the Kafka consumer's
>>>>>>>> log level to WARNING. Ideally KafkaIO itself should do something to
>>>>>>>> avoid it without a user option.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Wed, Sep 12, 2018 at 10:27 AM Eduardo Soldera <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Raghu! The job_id of our dev job is
>>>>>>>>>> 2018-09-12_06_11_48-5600553605191377866.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 12, 2018 at 2:18 PM, Raghu Angadi <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for debugging.
>>>>>>>>>>> Can you provide the job_id of your dev job? The stacktrace shows
>>>>>>>>>>> that there is no thread running 'consumerPollLoop()', which can
>>>>>>>>>>> explain the stuck reader. You will likely find logs at lines 594 & 587
>>>>>>>>>>> [1]. Dataflow caches its readers and DirectRunner may not. That can
>>>>>>>>>>> explain why DirectRunner resumes reads. The expectation in KafkaIO is
>>>>>>>>>>> that the Kafka client library takes care of retrying in case of
>>>>>>>>>>> connection problems (as documented). It is possible that in some cases
>>>>>>>>>>> poll() throws and we need to restart the client in KafkaIO.
>>>>>>>>>>>
>>>>>>>>>>> [1]:
>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java#L594
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 12, 2018 at 9:59 AM Eduardo Soldera <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Raghu, thanks for your help.
>>>>>>>>>>>> Just answering your previous question, the following logs were
>>>>>>>>>>>> the same as before the error, as if the pipeline were still getting
>>>>>>>>>>>> the messages, for example:
>>>>>>>>>>>>
>>>>>>>>>>>> (...)
>>>>>>>>>>>> Resetting offset for partition com.arquivei.dataeng.andre-0 to
>>>>>>>>>>>> offset 10.
>>>>>>>>>>>> Resetting offset for partition com.arquivei.dataeng.andre-0 to
>>>>>>>>>>>> offset 15.
>>>>>>>>>>>> ERROR
>>>>>>>>>>>> Resetting offset for partition com.arquivei.dataeng.andre-0 to
>>>>>>>>>>>> offset 22.
>>>>>>>>>>>> Resetting offset for partition com.arquivei.dataeng.andre-0 to
>>>>>>>>>>>> offset 30.
>>>>>>>>>>>> (...)
>>>>>>>>>>>>
>>>>>>>>>>>> But when checking the Kafka Consumer Group, the current offset
>>>>>>>>>>>> stays at 15, the committed offset from the last processed message
>>>>>>>>>>>> before the error.
>>>>>>>>>>>>
>>>>>>>>>>>> We'll file a bug, but we could now reproduce the issue in a Dev
>>>>>>>>>>>> scenario.
>>>>>>>>>>>> We started the same pipeline using the direct runner, without
>>>>>>>>>>>> Google Dataflow. We blocked the Kafka Broker network and the same
>>>>>>>>>>>> error was thrown. Then we unblocked the network and the pipeline was
>>>>>>>>>>>> able to successfully process the subsequent messages.
>>>>>>>>>>>> When we started the same pipeline in the Dataflow runner and
>>>>>>>>>>>> did the same test, the same problem from our production scenario
>>>>>>>>>>>> happened: Dataflow couldn't process the new messages. Unfortunately,
>>>>>>>>>>>> we've stopped the Dataflow job in production, but the problematic dev
>>>>>>>>>>>> job is still running and the log file of the VM is attached. Thank you
>>>>>>>>>>>> very much.
>>>>>>>>>>>> Best regards
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 11, 2018 at 6:28 PM, Raghu Angadi <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Specifically, I am interested in whether you have any thread running
>>>>>>>>>>>>> 'consumerPollLoop()' [1]. There should always be one (if a worker is
>>>>>>>>>>>>> assigned one of the partitions). It is possible that KafkaClient
>>>>>>>>>>>>> itself hasn't recovered from the group coordinator error (though
>>>>>>>>>>>>> unlikely).
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]:
>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java#L570
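
The check for a thread running 'consumerPollLoop()' is normally done by reading a stack dump of the worker. As a rough in-process equivalent, the sketch below scans the live threads of the current JVM for a stack frame with a given method name. The class name PollThreadFinder is hypothetical and is not part of Beam or Kafka; it only works when run inside the same JVM as the reader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: finds threads whose stack contains a given method. */
class PollThreadFinder {
  /** Returns the names of live threads with a frame named methodName. */
  static List<String> threadsRunning(String methodName) {
    List<String> names = new ArrayList<>();
    for (Map.Entry<Thread, StackTraceElement[]> entry
        : Thread.getAllStackTraces().entrySet()) {
      for (StackTraceElement frame : entry.getValue()) {
        if (frame.getMethodName().equals(methodName)) {
          names.add(entry.getKey().getName());
          break; // one match per thread is enough
        }
      }
    }
    return names;
  }
}
```

An empty result from PollThreadFinder.threadsRunning("consumerPollLoop") would correspond to the "no poll thread" situation discussed in this thread.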
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 11, 2018 at 12:31 PM Raghu Angadi <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Eduardo,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In case of any error, the pipeline should keep on trying to
>>>>>>>>>>>>>> fetch. I don't know about this particular error. Do you see any
>>>>>>>>>>>>>> others afterwards in the log?
>>>>>>>>>>>>>> A couple of things you could try if the logs are not useful:
>>>>>>>>>>>>>>  - log in to one of the VMs and get a stacktrace of the java worker
>>>>>>>>>>>>>> (look for a container called java-streaming)
>>>>>>>>>>>>>>  - file a support bug or stackoverflow question with the jobid so
>>>>>>>>>>>>>> that Dataflow oncall can take a look.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Raghu.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Sep 11, 2018 at 12:10 PM Eduardo Soldera <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>> We have an Apache Beam pipeline running in Google Dataflow
>>>>>>>>>>>>>>> using KafkaIO. Suddenly the pipeline stopped fetching Kafka messages
>>>>>>>>>>>>>>> at all, while our other workers from other pipelines continued to get
>>>>>>>>>>>>>>> Kafka messages.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> At the moment it stopped we got these messages:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I  [Consumer clientId=consumer-1, groupId=genericPipe] Error sending fetch request (sessionId=1396189203, epoch=2431598) to node 3: org.apache.kafka.common.errors.DisconnectException.
>>>>>>>>>>>>>>> I  [Consumer clientId=consumer-1, groupId=genericPipe] Group coordinator 10.0.52.70:9093 (id: 2147483646 rack: null) is unavailable or invalid, will attempt rediscovery
>>>>>>>>>>>>>>> I  [Consumer clientId=consumer-1, groupId=genericPipe] Discovered group coordinator 10.0.52.70:9093 (id: 2147483646 rack: null)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And then the pipeline stopped reading the messages.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is the KafkaIO setup we have:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> KafkaIO.read[String,String]()
>>>>>>>>>>>>>>>   .withBootstrapServers(server)
>>>>>>>>>>>>>>>   .withTopic(topic)
>>>>>>>>>>>>>>>   .withKeyDeserializer(classOf[StringDeserializer])
>>>>>>>>>>>>>>>   .withValueDeserializer(classOf[StringDeserializer])
>>>>>>>>>>>>>>>   .updateConsumerProperties(properties)
>>>>>>>>>>>>>>>   .commitOffsetsInFinalize()
>>>>>>>>>>>>>>>   .withoutMetadata()
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  Any help will be much appreciated.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Eduardo Soldera Garcia
>>>>>>>>>>>>>>> Data Engineer
>>>>>>>>>>>>>>> (16) 3509-5555 | www.arquivei.com.br
>>>>>>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
>>>>>>>>>>>>>>> [image: Arquivei.com.br – Inteligência em Notas Fiscais]
>>>>>>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
>>>>>>>>>>>>>>> [image: Google seleciona Arquivei para imersão e mentoria no
>>>>>>>>>>>>>>> Vale do Silício]
>>>>>>>>>>>>>>> <https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad>
>>>>>>>>>>>>>>> <https://www.facebook.com/arquivei>
>>>>>>>>>>>>>>> <https://www.linkedin.com/company/arquivei>
>>>>>>>>>>>>>>> <https://www.youtube.com/watch?v=sSUUKxbXnxk>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>>
>

-- 
Eduardo Soldera Garcia
Data Engineer
(16) 3509-5555 | www.arquivei.com.br
