[jira] [Commented] (KAFKA-15657) Unexpected errors when producing transactionally in 3.6

2023-10-20 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1859#comment-1859
 ] 

Justine Olshan commented on KAFKA-15657:


Thanks [~twmb] and [~ijuma]. I will follow along with KAFKA-15653 and see if 
there are still issues after that is fixed.

> Unexpected errors when producing transactionally in 3.6
> ---
>
> Key: KAFKA-15657
> URL: https://issues.apache.org/jira/browse/KAFKA-15657
> Project: Kafka
> Issue Type: Bug
> Components: producer
> Affects Versions: 3.6.0
> Reporter: Travis Bischel
> Priority: Major
>
> In loop-testing the franz-go client, I am frequently receiving INVALID_RECORD 
> (which I created a separate issue for), and INVALID_TXN_STATE and 
> UNKNOWN_SERVER_ERROR.
> INVALID_TXN_STATE is being returned even though the partitions have been 
> added to the transaction (AddPartitionsToTxn). Nothing about the code has 
> changed between 3.5 and 3.6, and I have loop-integration-tested this code 
> against 3.5 thousands of times. 3.6 is newly - and always - returning 
> INVALID_TXN_STATE. If I change the code to retry on INVALID_TXN_STATE, I 
> always (and quickly) receive UNKNOWN_SERVER_ERROR. In looking at the 
> broker logs, the broker indicates that sequence numbers are out of order - 
> but (a) I am repeating requests that were in order (so maybe something on the 
> broker got a little haywire, or maybe this is due to me ignoring 
> INVALID_TXN_STATE?), _and_ (b) I am not receiving 
> OUT_OF_ORDER_SEQUENCE_NUMBER, I am receiving UNKNOWN_SERVER_ERROR.
> I think the main problem is the client unexpectedly receiving 
> INVALID_TXN_STATE, but a second problem here is that 
> OUT_OF_ORDER_SEQUENCE_NUMBER is being mapped to UNKNOWN_SERVER_ERROR on 
> return for some reason.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15657) Unexpected errors when producing transactionally in 3.6

2023-10-19 Thread Ismael Juma (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1567#comment-1567
 ] 

Ismael Juma commented on KAFKA-15657:

I was wondering the same. We should fix KAFKA-15653 and see if it's the source 
of the issues you have been seeing. I am not aware of any other change that 
would result in that sort of problem.



[jira] [Commented] (KAFKA-15657) Unexpected errors when producing transactionally in 3.6

2023-10-19 Thread Travis Bischel (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1565#comment-1565
 ] 

Travis Bischel commented on KAFKA-15657:


I'm beginning to suspect that KAFKA-15653 may eventually lead to this: I never 
experience this bug without first experiencing the NPEs while appending. I'll 
wait until KAFKA-15653 is addressed and then loop-test to see whether this 
still occurs.



[jira] [Commented] (KAFKA-15657) Unexpected errors when producing transactionally in 3.6

2023-10-19 Thread Travis Bischel (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1537#comment-1537
 ] 

Travis Bischel commented on KAFKA-15657:


re: first comment – the client doesn't advance to producing unless 
AddPartitionsToTxn succeeds. If the request partially succeeds, failed 
partitions are stripped and only successfully added partitions are produced. 
The logic is definitely hard to follow if you're not familiar with the code, 
but the issuing/stripping is 
[here|https://github.com/twmb/franz-go/blob/ae169a1f35c2ee6b130c4e520632b33e6c491e0b/pkg/kgo/sink.go#L442-L498], 
and the request itself is issued (in the same function as producing, before 
the produce request is issued) 
[here|https://github.com/twmb/franz-go/blob/ae169a1f35c2ee6b130c4e520632b33e6c491e0b/pkg/kgo/sink.go#L316-L357].
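The stripping step described above can be sketched roughly as follows. This is only an illustration, not franz-go's actual code: the types `partitionResult` and `topicPartition` and the function `stripFailed` are hypothetical names invented for this example, and a partition missing from the response is assumed to count as not added.

```go
package main

import "fmt"

// partitionResult is a hypothetical, simplified view of one partition's
// outcome in an AddPartitionsToTxn response.
type partitionResult struct {
	Topic     string
	Partition int32
	ErrorCode int16 // 0 means the partition was added to the transaction
}

// topicPartition identifies a partition the client wants to produce to.
type topicPartition struct {
	Topic     string
	Partition int32
}

// stripFailed keeps only the pending partitions that the response reports
// as successfully added. Partitions with a non-zero error code, or with no
// result at all (an assumption of this sketch), are dropped.
func stripFailed(pending []topicPartition, results []partitionResult) []topicPartition {
	added := make(map[topicPartition]bool, len(results))
	for _, r := range results {
		if r.ErrorCode == 0 {
			added[topicPartition{r.Topic, r.Partition}] = true
		}
	}
	var keep []topicPartition
	for _, tp := range pending {
		if added[tp] {
			keep = append(keep, tp)
		}
	}
	return keep
}

func main() {
	pending := []topicPartition{{"foo", 0}, {"foo", 1}, {"bar", 0}}
	results := []partitionResult{
		{"foo", 0, 0},
		{"foo", 1, 48}, // 48 is INVALID_TXN_STATE; this partition is stripped
		{"bar", 0, 0},
	}
	fmt.Println(stripFailed(pending, results)) // prints [{foo 0} {bar 0}]
}
```

Only the partitions that survive this filter would then go into the produce request.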

Also wrt race condition – these tests also pass against the redpanda binary, 
which has always had KIP-890 semantics / has never allowed transactional 
produce requests unless the partition has been added to the transaction (in 
fact this is part of how I caught some early redpanda bugs with _that_ 
implementation).

 

re: second comment, I'll capture some debug logs so you can see both the client 
logs and the container logs. The tests are currently using v3. I'm currently 
running this in a loop:

```
docker compose down; sleep 1; docker compose up -d; sleep 5
while go test -run Txn/cooperative > logs; do
  echo whoo
  docker compose down; sleep 1; docker compose up -d; sleep 5
done
```

Once this fails, I'll upload the logs. This is currently ignoring 
INVALID_RECORD, which I run into more regularly. I may stop restricting this to 
just the cooperative test and instead run it against all balancers at once (it 
seems heavier load runs into the problem more frequently).

 

This also reminds me: somebody had a feature request that deliberately abused 
the ability to produce before AddPartitionsToTxn was done, and I need to remove 
support for this on 3.6+. This _is_ exercised in franz-go's CI right now and 
will fail CI for 3.6+ (see the doc comment on 
[EndBeginTxnUnsafe|https://pkg.go.dev/github.com/twmb/franz-go/pkg/kgo#EndBeginTxnHow]).



[jira] [Commented] (KAFKA-15657) Unexpected errors when producing transactionally in 3.6

2023-10-19 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1526#comment-1526
 ] 

Justine Olshan commented on KAFKA-15657:


[~twmb] Can you confirm whether the AddPartitionsToTxn calls are succeeding, 
and which version they are using? I am concerned the partitions might not be 
added correctly.

 



[jira] [Commented] (KAFKA-15657) Unexpected errors when producing transactionally in 3.6

2023-10-19 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1524#comment-1524
 ] 

Justine Olshan commented on KAFKA-15657:


Hey Travis. INVALID_TXN_STATE likely indicates there was a race condition or a 
bug in the client. In this case, the transaction should abort. This is part of 
the work of KIP-890. 
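
As a rough client-side illustration of "the transaction should abort", the sketch below maps a produce error code to a next step. The error code values are the Kafka protocol's; the `action` type, its constants, and `actionFor` are hypothetical names for this example, and treating every other non-zero code as a plain failure is a simplifying assumption, not how any particular client behaves.

```go
package main

import "fmt"

// Kafka protocol error codes mentioned in this thread.
const (
	errUnknownServerError       int16 = -1
	errOutOfOrderSequenceNumber int16 = 45
	errInvalidTxnState          int16 = 48
	errInvalidRecord            int16 = 87
)

// action is a hypothetical description of what the client does next.
type action int

const (
	actionNone     action = iota // no error, keep producing
	actionAbortTxn               // abort the current transaction
	actionFail                   // surface the error to the caller
)

// actionFor maps a produce error code to the next step. Only the
// INVALID_TXN_STATE case comes from the discussion above; failing on
// everything else is an assumption made to keep the sketch small.
func actionFor(code int16) action {
	switch code {
	case 0:
		return actionNone
	case errInvalidTxnState:
		return actionAbortTxn
	default:
		return actionFail
	}
}

func main() {
	fmt.Println(actionFor(errInvalidTxnState) == actionAbortTxn) // prints true
}
```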

I wonder if there is a bug in the client that previously caused hanging (or 
late messages getting through) and is only being caught now.

If you want to disable transaction verification, you can do so by setting 
transaction.partition.verification.enable to false in your server config files.
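
For example, the broker config entry would look roughly like this (assuming the usual server.properties file; the file name and location depend on your deployment):

```
# Broker config: turn off the KIP-890 partition verification on produce
transaction.partition.verification.enable=false
```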
