[jira] [Comment Edited] (KAFKA-16603) Data loss when kafka connect sending data to Kafka

2024-05-13 Thread Chris Egerton (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845927#comment-17845927
 ] 

Chris Egerton edited comment on KAFKA-16603 at 5/13/24 1:34 PM:


[~dasarianil] ah, nice! That takes advantage of existing logic we use in Kafka 
Connect to prevent connector source offsets from being prematurely committed, 
should save you work.

It shouldn't be too expensive to read source offsets at runtime–have you 
encountered a noticeable performance impact from this? Connect workers are 
constantly polling the offsets topic regardless of whether connectors have 
issued read requests; the only additional workload incurred by requesting 
offsets from the runtime should be a negligible bump in CPU from converting the 
offsets to/from byte arrays, and possibly a bit of load on your Kafka cluster 
caused by fetching the latest stable offsets for the offsets topic.

I think we could possibly add some kind of callback-based API for 
{{SourceConnector}} implementations to be notified whenever new offsets are 
read from the offsets topic, but I'd be hesitant to do so if there isn't 
convincing evidence that the existing API is insufficient. We almost certainly 
can't do something that would notify {{SourceConnector}} instances of when 
records from their tasks are successfully produced to Kafka.


was (Author: chrisegerton):
[~dasarianil] ah, nice! That takes advantage of existing logic we use in Kafka 
Connect to prevent connector source offsets from being prematurely committed, 
should save you work.

It shouldn't be too expensive to read source offsets at runtime–have you 
encountered a noticeable performance impact from this? Connect workers are 
constantly polling the offsets topic regardless of whether connectors have 
issued read requests; the only additional workload incurred by requesting 
offsets from the runtime should be a negligible bump in CPU from converting the 
offsets to/from byte arrays, and possibly a bit of load on your Kafka cluster 
caused by fetching the latest stable offsets for the offsets topic.

I think we could possibly add some kind of callback-based API for 
{{SourceConnector}} implementations to be notified whenever new offsets are 
read from the offsets topic, but I'd be hesitant to do so if there isn't 
convincing evidence that the existing API is insufficient.

> Data loss when kafka connect sending data to Kafka
> --
>
> Key: KAFKA-16603
> URL: https://issues.apache.org/jira/browse/KAFKA-16603
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, producer 
>Affects Versions: 3.3.1
>Reporter: Anil Dasari
>Priority: Major
>
> We are experiencing a data loss when Kafka Source connector is failed to send 
> data to Kafka topic and offset topic. 
> Kafka cluster and Kafka connect details:
>  # Kafka connect version i.e client : Confluent community version 7.3.1 i.e 
> Kafka 3.3.1
>  # Kafka version: 0.11.0 (server)
>  # Cluster size : 3 brokers
>  # Number of partitions in all topics = 3
>  # Replication factor = 3
>  # Min ISR set 2
>  # Uses no transformations in Kafka connector
>  # Use default error tolerance i.e None.
> Our connector checkpoints the offsets info received in 
> SourceTask#commitRecord and resume the data process from the persisted 
> checkpoint.
> The data loss is noticed when broker is unresponsive for few mins due to high 
> load and kafka connector was restarted. Also, Kafka connector graceful 
> shutdown failed.
> Logs:
>  
> {code:java}
> [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Discovered group 
> coordinator 10.75.100.176:31000 (id: 2147483647 rack: null)
> Apr 22, 2024 @ 15:56:16.152 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Group coordinator 
> 10.75.100.176:31000 (id: 2147483647 rack: null) is unavailable or invalid due 
> to cause: coordinator unavailable. isDisconnected: false. Rediscovery will be 
> attempted.
> Apr 22, 2024 @ 15:56:16.153 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Requesting disconnect from 
> last known coordinator 10.75.100.176:31000 (id: 2147483647 rack: null)
> Apr 22, 2024 @ 15:56:16.514 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Node 0 disconnected.
> Apr 22, 2024 @ 15:56:16.708 [Producer 
> clientId=connector-producer-d094a5d7bbb046b99d62398cb84d648c-0] Node 0 
> disconnected.
> Apr 22, 2024 @ 15:56:16.710 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Node 2147483647 
> disconnected.
> Apr 22, 2024 @ 15:56:16.731 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Group coordinator 
> 10.75.100.176:31000 (id: 2147483647 rac

[jira] [Comment Edited] (KAFKA-16603) Data loss when kafka connect sending data to Kafka

2024-05-13 Thread Chris Egerton (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845927#comment-17845927
 ] 

Chris Egerton edited comment on KAFKA-16603 at 5/13/24 1:33 PM:


[~dasarianil] ah, nice! That takes advantage of existing logic we use in Kafka 
Connect to prevent connector source offsets from being prematurely committed, 
should save you work.

It shouldn't be too expensive to read source offsets at runtime–have you 
encountered a noticeable performance impact from this? Connect workers are 
constantly polling the offsets topic regardless of whether connectors have 
issued read requests; the only additional workload incurred by requesting 
offsets from the runtime should be a negligible bump in CPU from converting the 
offsets to/from byte arrays, and possibly a bit of load on your Kafka cluster 
caused by fetching the latest stable offsets for the offsets topic.

I think we could possibly add some kind of callback-based API for 
{{SourceConnector}} implementations to be notified whenever new offsets are 
read from the offsets topic, but I'd be hesitant to do so if there isn't 
convincing evidence that the existing API is insufficient.


was (Author: chrisegerton):
[~dasarianil] ah, nice! That takes advantage of existing logic we use in Kafka 
Connect to prevent connector source offsets from being prematurely committed, 
should save you work.

It shouldn't be too expensive to read source offsets at runtime–have you 
encountered a noticeable performance impact from this? Connect workers are 
constantly polling the offsets topic regardless of whether connectors have 
issued read requests; the only additional workload incurred by requesting 
offsets from the runtime should be a negligible bump in CPU from converting the 
offsets to/from byte arrays, and possibly a bit of load on your Kafka cluster 
caused by fetching the latest stable offsets for the offsets topic.

> Data loss when kafka connect sending data to Kafka
> --
>
> Key: KAFKA-16603
> URL: https://issues.apache.org/jira/browse/KAFKA-16603
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, producer 
>Affects Versions: 3.3.1
>Reporter: Anil Dasari
>Priority: Major
>
> We are experiencing a data loss when Kafka Source connector is failed to send 
> data to Kafka topic and offset topic. 
> Kafka cluster and Kafka connect details:
>  # Kafka connect version i.e client : Confluent community version 7.3.1 i.e 
> Kafka 3.3.1
>  # Kafka version: 0.11.0 (server)
>  # Cluster size : 3 brokers
>  # Number of partitions in all topics = 3
>  # Replication factor = 3
>  # Min ISR set 2
>  # Uses no transformations in Kafka connector
>  # Use default error tolerance i.e None.
> Our connector checkpoints the offsets info received in 
> SourceTask#commitRecord and resume the data process from the persisted 
> checkpoint.
> The data loss is noticed when broker is unresponsive for few mins due to high 
> load and kafka connector was restarted. Also, Kafka connector graceful 
> shutdown failed.
> Logs:
>  
> {code:java}
> [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Discovered group 
> coordinator 10.75.100.176:31000 (id: 2147483647 rack: null)
> Apr 22, 2024 @ 15:56:16.152 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Group coordinator 
> 10.75.100.176:31000 (id: 2147483647 rack: null) is unavailable or invalid due 
> to cause: coordinator unavailable. isDisconnected: false. Rediscovery will be 
> attempted.
> Apr 22, 2024 @ 15:56:16.153 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Requesting disconnect from 
> last known coordinator 10.75.100.176:31000 (id: 2147483647 rack: null)
> Apr 22, 2024 @ 15:56:16.514 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Node 0 disconnected.
> Apr 22, 2024 @ 15:56:16.708 [Producer 
> clientId=connector-producer-d094a5d7bbb046b99d62398cb84d648c-0] Node 0 
> disconnected.
> Apr 22, 2024 @ 15:56:16.710 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Node 2147483647 
> disconnected.
> Apr 22, 2024 @ 15:56:16.731 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Group coordinator 
> 10.75.100.176:31000 (id: 2147483647 rack: null) is unavailable or invalid due 
> to cause: coordinator unavailable. isDisconnected: true. Rediscovery will be 
> attempted.
> Apr 22, 2024 @ 15:56:19.103 == Trying to sleep while stop == (** custom log 
> **)
> Apr 22, 2024 @ 15:56:19.755 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Broker coordinator was 
> unreachable for 3000ms. Revoking previous assignment Assignment{error=0, 
> lead

[jira] [Comment Edited] (KAFKA-16603) Data loss when kafka connect sending data to Kafka

2024-04-29 Thread Anil Dasari (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842023#comment-17842023
 ] 

Anil Dasari edited comment on KAFKA-16603 at 4/29/24 3:10 PM:
--

Thanks [~ChrisEgerton] . will look into exact-once support. Exactly-once 
processing usually impacts throughput to some extent as all records are 
processed in order.


was (Author: JIRAUSER283879):
Thanks [~ChrisEgerton] . will look into exact-once support. Exactly-once 
processing usually impacts throughput to some extent as all records are 
processed in order. 

 

 

 

> Data loss when kafka connect sending data to Kafka
> --
>
> Key: KAFKA-16603
> URL: https://issues.apache.org/jira/browse/KAFKA-16603
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, producer 
>Affects Versions: 3.3.1
>Reporter: Anil Dasari
>Priority: Major
>
> We are experiencing a data loss when Kafka Source connector is failed to send 
> data to Kafka topic and offset topic. 
> Kafka cluster and Kafka connect details:
>  # Kafka connect version i.e client : Confluent community version 7.3.1 i.e 
> Kafka 3.3.1
>  # Kafka version: 0.11.0 (server)
>  # Cluster size : 3 brokers
>  # Number of partitions in all topics = 3
>  # Replication factor = 3
>  # Min ISR set 2
>  # Uses no transformations in Kafka connector
>  # Use default error tolerance i.e None.
> Our connector checkpoints the offsets info received in 
> SourceTask#commitRecord and resume the data process from the persisted 
> checkpoint.
> The data loss is noticed when broker is unresponsive for few mins due to high 
> load and kafka connector was restarted. Also, Kafka connector graceful 
> shutdown failed.
> Logs:
>  
> {code:java}
> [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Discovered group 
> coordinator 10.75.100.176:31000 (id: 2147483647 rack: null)
> Apr 22, 2024 @ 15:56:16.152 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Group coordinator 
> 10.75.100.176:31000 (id: 2147483647 rack: null) is unavailable or invalid due 
> to cause: coordinator unavailable. isDisconnected: false. Rediscovery will be 
> attempted.
> Apr 22, 2024 @ 15:56:16.153 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Requesting disconnect from 
> last known coordinator 10.75.100.176:31000 (id: 2147483647 rack: null)
> Apr 22, 2024 @ 15:56:16.514 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Node 0 disconnected.
> Apr 22, 2024 @ 15:56:16.708 [Producer 
> clientId=connector-producer-d094a5d7bbb046b99d62398cb84d648c-0] Node 0 
> disconnected.
> Apr 22, 2024 @ 15:56:16.710 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Node 2147483647 
> disconnected.
> Apr 22, 2024 @ 15:56:16.731 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Group coordinator 
> 10.75.100.176:31000 (id: 2147483647 rack: null) is unavailable or invalid due 
> to cause: coordinator unavailable. isDisconnected: true. Rediscovery will be 
> attempted.
> Apr 22, 2024 @ 15:56:19.103 == Trying to sleep while stop == (** custom log 
> **)
> Apr 22, 2024 @ 15:56:19.755 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Broker coordinator was 
> unreachable for 3000ms. Revoking previous assignment Assignment{error=0, 
> leader='connect-1-8f41a1d2-6cc9-4956-9be3-1fbae9c6d305', 
> leaderUrl='http://10.75.100.46:8083/', offset=4, 
> connectorIds=[d094a5d7bbb046b99d62398cb84d648c], 
> taskIds=[d094a5d7bbb046b99d62398cb84d648c-0], revokedConnectorIds=[], 
> revokedTaskIds=[], delay=0} to avoid running tasks while not being a member 
> the group
> Apr 22, 2024 @ 15:56:19.866 Stopping connector 
> d094a5d7bbb046b99d62398cb84d648c
> Apr 22, 2024 @ 15:56:19.874 Stopping task d094a5d7bbb046b99d62398cb84d648c-0
> Apr 22, 2024 @ 15:56:19.880 Scheduled shutdown for 
> WorkerConnectorWorkerConnector{id=d094a5d7bbb046b99d62398cb84d648c}
> Apr 22, 2024 @ 15:56:24.105 Connector 'd094a5d7bbb046b99d62398cb84d648c' 
> failed to properly shut down, has become unresponsive, and may be consuming 
> external resources. Correct the configuration for this connector or remove 
> the connector. After fixing the connector, it may be necessary to restart 
> this worker to release any consumed resources.
> Apr 22, 2024 @ 15:56:24.110 [Producer 
> clientId=connector-producer-d094a5d7bbb046b99d62398cb84d648c-0] Closing the 
> Kafka producer with timeoutMillis = 0 ms.
> Apr 22, 2024 @ 15:56:24.110 [Producer 
> clientId=connector-producer-d094a5d7bbb046b99d62398cb84d648c-0] Proceeding to 
> force close the producer since pending requests could not be completed within 
> timeout 0 ms.
> Apr 

[jira] [Comment Edited] (KAFKA-16603) Data loss when kafka connect sending data to Kafka

2024-04-24 Thread Anil Dasari (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840652#comment-17840652
 ] 

Anil Dasari edited comment on KAFKA-16603 at 4/25/24 5:04 AM:
--

Issue is because of out of order acks. So, Not sure if this is kafka connect 
issue or connector implementation. Please find the detailed conversation 
[here|https://debezium.zulipchat.com/#narrow/stream/348249-community-postgresql/topic/Data.20loss.20on.20connector.20restart].

This 
[doc|https://github.com/apache/kafka/blob/864744ffd4ddc3b0d216a3049ee0c61e9c0d3ad1/clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java#L901]
 says that acks order is guaranteed for records that are send to same 
partition. 

IIUC, KafkaProducer#send doesn't the guarantee the order of acks for the 
records that belong to different partitions. Could someone confirm or clarify ? 
thanks in advance.

 


was (Author: JIRAUSER283879):
Issue is because of out of order acks. So, Not sure if this is kafka connect 
issue or connector implementation. Please find the detailed conversation 
[here|https://debezium.zulipchat.com/#narrow/stream/348249-community-postgresql/topic/Data.20loss.20on.20connector.20restart].

This 
[doc|https://github.com/apache/kafka/blob/864744ffd4ddc3b0d216a3049ee0c61e9c0d3ad1/clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java#L901]
 says that acks order is guaranteed for records that are send to same 
partition. 

IIUC, KafkaProducer#send doesn't the guarantee the order of acks for the 
records that belong to different partitions. Could someone confirm or clarify ? 
thanks in advance. 

 

 

 

> Data loss when kafka connect sending data to Kafka
> --
>
> Key: KAFKA-16603
> URL: https://issues.apache.org/jira/browse/KAFKA-16603
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, producer 
>Affects Versions: 3.3.1
>Reporter: Anil Dasari
>Priority: Major
>
> We are experiencing a data loss when Kafka Source connector is failed to send 
> data to Kafka topic and offset topic. 
> Kafka cluster and Kafka connect details:
>  # Kafka connect version i.e client : Confluent community version 7.3.1 i.e 
> Kafka 3.3.1
>  # Kafka version: 0.11.0 (server)
>  # Cluster size : 3 brokers
>  # Number of partitions in all topics = 3
>  # Replication factor = 3
>  # Min ISR set 2
>  # Uses no transformations in Kafka connector
>  # Use default error tolerance i.e None.
> Our connector checkpoints the offsets info received in 
> SourceTask#commitRecord and resume the data process from the persisted 
> checkpoint.
> The data loss is noticed when broker is unresponsive for few mins due to high 
> load and kafka connector was restarted. Also, Kafka connector graceful 
> shutdown failed.
> Logs:
>  
> {code:java}
> [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Discovered group 
> coordinator 10.75.100.176:31000 (id: 2147483647 rack: null)
> Apr 22, 2024 @ 15:56:16.152 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Group coordinator 
> 10.75.100.176:31000 (id: 2147483647 rack: null) is unavailable or invalid due 
> to cause: coordinator unavailable. isDisconnected: false. Rediscovery will be 
> attempted.
> Apr 22, 2024 @ 15:56:16.153 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Requesting disconnect from 
> last known coordinator 10.75.100.176:31000 (id: 2147483647 rack: null)
> Apr 22, 2024 @ 15:56:16.514 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Node 0 disconnected.
> Apr 22, 2024 @ 15:56:16.708 [Producer 
> clientId=connector-producer-d094a5d7bbb046b99d62398cb84d648c-0] Node 0 
> disconnected.
> Apr 22, 2024 @ 15:56:16.710 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Node 2147483647 
> disconnected.
> Apr 22, 2024 @ 15:56:16.731 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Group coordinator 
> 10.75.100.176:31000 (id: 2147483647 rack: null) is unavailable or invalid due 
> to cause: coordinator unavailable. isDisconnected: true. Rediscovery will be 
> attempted.
> Apr 22, 2024 @ 15:56:19.103 == Trying to sleep while stop == (** custom log 
> **)
> Apr 22, 2024 @ 15:56:19.755 [Worker clientId=connect-1, 
> groupId=pg-group-adf06ea08abb4394ad4f2787481fee17] Broker coordinator was 
> unreachable for 3000ms. Revoking previous assignment Assignment{error=0, 
> leader='connect-1-8f41a1d2-6cc9-4956-9be3-1fbae9c6d305', 
> leaderUrl='http://10.75.100.46:8083/', offset=4, 
> connectorIds=[d094a5d7bbb046b99d62398cb84d648c], 
> taskIds=[d094a5d7bbb046b99d62398cb84d648c-0], revokedConnectorIds=[], 
> revokedTaskIds=[], delay=0} to avoid run