[jira] [Commented] (KAFKA-7121) Intermittently, Connectors fail to assign tasks and keep retrying every second forever.

2018-07-10 Thread Gwen Shapira (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539264#comment-16539264
 ] 

Gwen Shapira commented on KAFKA-7121:
-

Oh, sorry [~yuzhih...@gmail.com], I forgot to update:
We used the 1.1.0 release.
We resolved the issue by setting advertised.host for the connect workers. The 
real issue was that connect workers couldn't reach the leader over HTTP.
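
For readers hitting the same symptom, here is a minimal sketch of the fix in 
Java. It assumes that "advertised.host" above refers to the Connect worker 
settings rest.advertised.host.name and rest.advertised.port. The class name, 
helper method, and host name are hypothetical and exist only for illustration. 
The wildcard check mirrors the "worker_id":"0.0.0.0:8083" value in the status 
output quoted below, which other workers cannot route to.

```java
import java.util.Properties;

// Hypothetical helper (not part of Kafka): builds worker properties with an
// explicit advertised REST address so other workers can reach this one.
final class WorkerRestConfig {
    // Kafka Connect worker settings assumed to be what "advertised.host" means.
    static final String ADVERTISED_HOST = "rest.advertised.host.name";
    static final String ADVERTISED_PORT = "rest.advertised.port";

    private WorkerRestConfig() {}

    static Properties withAdvertisedAddress(Properties base, String host, int port) {
        // A wildcard bind address is not something peers can connect to.
        if (host == null || host.isEmpty() || host.equals("0.0.0.0")) {
            throw new IllegalArgumentException("advertised host must be routable: " + host);
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("invalid port: " + port);
        }
        Properties props = new Properties();
        props.putAll(base); // copy so the caller's properties stay untouched
        props.setProperty(ADVERTISED_HOST, host);
        props.setProperty(ADVERTISED_PORT, Integer.toString(port));
        return props;
    }

    public static void main(String[] args) {
        // Illustrative host name only.
        Properties props = withAdvertisedAddress(
                new Properties(), "connect-worker-1.example.internal", 8083);
        System.out.println(props.getProperty(ADVERTISED_HOST));
    }
}
```

The same two keys can be set directly in the worker properties file; the 
helper only makes the validation step explicit.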

There are a few layers of problems here:
1. When the advertised host isn't set, workers end up picking the wrong IP to 
advertise.
2. When workers can't talk to the leader, the error is completely misleading (we 
assume that the only reason you can't find the leader is a rebalance, but this 
is a distributed system; there are 500 reasons why 2 nodes can't talk to each 
other).
3. We keep retrying forever in this scenario (and logging 10 times per second). 
I'm not sure this is the right thing to do.
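
One possible alternative to the retry-forever behaviour in point 3 is a bounded 
retry with capped exponential backoff. The sketch below is illustrative only: 
it is not how DistributedHerder is implemented and not a proposed patch. The 
class name, method names, and all parameters are assumptions made up for this 
example.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: retry a request a limited number of times, doubling
// the wait between attempts up to a cap, then give up with the last error.
final class BoundedRetry {
    private BoundedRetry() {}

    // Delay before the retry that follows attempt number `attempt` (0-based):
    // initialMs doubled `attempt` times, never more than maxMs.
    static long backoffMs(int attempt, long initialMs, long maxMs) {
        long delay = initialMs;
        for (int i = 0; i < attempt && delay < maxMs; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMs);
    }

    static <T> T callWithRetry(Callable<T> request, int maxAttempts,
                               long initialMs, long maxMs) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (Exception e) {
                last = e;
                // Sleep only if another attempt will follow.
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(backoffMs(attempt, initialMs, maxMs));
                }
            }
        }
        // Surface the underlying cause instead of looping silently.
        throw new IllegalStateException(
                "Gave up after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        // Print the wait schedule for the first few attempts.
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println(attempt + ": " + backoffMs(attempt, 250, 60_000) + " ms");
        }
    }
}
```

Giving up after a bounded number of attempts and attaching the last exception 
as the cause would also help with point 2, since the real connection error 
stays visible rather than being reported as a rebalance conflict.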

> Intermittently, Connectors fail to assign tasks and keep retrying every 
> second forever.
> ---
>
> Key: KAFKA-7121
> URL: https://issues.apache.org/jira/browse/KAFKA-7121
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Reporter: Gwen Shapira
>Assignee: Konstantine Karantasis
>Priority: Major
>
> We started a connector, and even though it is in RUNNING status, tasks are 
> not getting assigned:
> {"name":"prod-xxx-v2","connector":{"state":"RUNNING","worker_id":"0.0.0.0:8083"},"tasks":[],"type":"sink"}
> Other connectors are running without issues.
> Attempt to restart the connector returned 409 status.
> Logs show the following messages, keep repeating for hours:
> [2018-06-29 20:23:19,288] ERROR Task reconfiguration for prod-xxx-v2 failed 
> unexpectedly, this connector will not be properly reconfigured unless 
> manually triggered. 
> (org.apache.kafka.connect.runtime.distributed.DistributedHerder:956)
> [2018-06-29 20:23:19,289] INFO 10.200.149.201 - - [29/Jun/2018:20:23:19 
> +] "POST /connectors/prod-xxx-v2/tasks?forward=false HTTP/1.1" 409 113 0 
> (org.apache.kafka.connect.runtime.rest.RestServer:60)
> [2018-06-29 20:23:19,289] INFO 10.200.149.201 - - [29/Jun/2018:20:23:19 
> +] "POST /connectors/prod-xxx-v2/tasks?forward=true HTTP/1.1" 409 113 1 
> (org.apache.kafka.connect.runtime.rest.RestServer:60)
> [2018-06-29 20:23:19,289] INFO 10.200.149.201 - - [29/Jun/2018:20:23:19 
> +] "POST /connectors/prod-xxx-v2/tasks HTTP/1.1" 409 113 1 
> (org.apache.kafka.connect.runtime.rest.RestServer:60)
> [2018-06-29 20:23:19,289] ERROR Request to leader to reconfigure connector 
> tasks failed 
> (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1018)
> org.apache.kafka.connect.runtime.rest.errors.ConnectRestException: Cannot 
> complete request because of a conflicting operation (e.g. worker rebalance)
>  at 
> org.apache.kafka.connect.runtime.rest.RestServer.httpRequest(RestServer.java:229)
>  at 
> org.apache.kafka.connect.runtime.distributed.DistributedHerder$18.run(DistributedHerder.java:1015)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7121) Intermittently, Connectors fail to assign tasks and keep retrying every second forever.

2018-06-29 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528494#comment-16528494
 ] 

Ted Yu commented on KAFKA-7121:
---

Which release of Connect were you using?

Looking at 
./connect/runtime/src/main/java/org/apache/kafka/connect/runtime/rest/RestServer.java, 
here is what I see around line 229:
{code}
try {
    connectRestExtension.close();
} catch (IOException e) {
    log.warn("Error while invoking close on " + connectRestExtension.getClass(), e);
}
{code}
nit: backticks are used on PRs.
On JIRA, please enclose log snippets within a pair of code tags:
{code}
{code}



