[ https://issues.apache.org/jira/browse/KAFKA-15657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17777537#comment-17777537 ]

Travis Bischel edited comment on KAFKA-15657 at 10/20/23 2:35 AM:
------------------------------------------------------------------

re: first comment – the client doesn't advance to producing unless 
AddPartitionsToTxn succeeds. If the request partially succeeds, the failed 
partitions are stripped and only the successfully added partitions are produced 
to. The logic is admittedly hard to follow if you're not familiar with the 
code. The issuing/stripping is 
[here|https://github.com/twmb/franz-go/blob/ae169a1f35c2ee6b130c4e520632b33e6c491e0b/pkg/kgo/sink.go#L442-L498], 
and the request is issued (in the same function as producing, before the 
produce request goes out) 
[here|https://github.com/twmb/franz-go/blob/ae169a1f35c2ee6b130c4e520632b33e6c491e0b/pkg/kgo/sink.go#L316-L357].
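
To make the stripping behavior concrete, here is a minimal sketch of the pattern (the types and names below are illustrative only, not franz-go's actual code): only partitions the broker successfully added to the transaction survive into the produce step.

```go
package main

import "fmt"

// addPartitionsResult is a hypothetical per-partition result from an
// AddPartitionsToTxn response; errCode 0 means the partition was added.
type addPartitionsResult struct {
	partition int32
	errCode   int16
}

// stripFailed keeps only partitions that were successfully added to the
// transaction; the client then produces solely to the returned set.
func stripFailed(results []addPartitionsResult) []int32 {
	var ok []int32
	for _, r := range results {
		if r.errCode == 0 {
			ok = append(ok, r.partition)
		}
	}
	return ok
}

func main() {
	results := []addPartitionsResult{
		{partition: 0, errCode: 0},
		{partition: 1, errCode: 48}, // e.g. a partition the broker rejected
		{partition: 2, errCode: 0},
	}
	// Only partitions 0 and 2 would be produced to.
	fmt.Println(stripFailed(results)) // [0 2]
}
```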

Also, w.r.t. the race condition – these tests also pass against the redpanda 
binary, which has always had KIP-890 semantics: it has never allowed 
transactional produce requests unless the partition has been added to the 
transaction (in fact, this is part of how I caught some early redpanda bugs 
with _that_ implementation).

 

re: second comment, I'll capture some debug logs so you can see both the client 
logs and the container logs. The tests are currently using v3. I'm currently 
running this in a loop:

```
docker compose down; sleep 1; docker compose up -d; sleep 5
while go test -run Txn/cooperative > logs; do
  echo whoo
  docker compose down; sleep 1; docker compose up -d; sleep 5
done
```

Once this fails, I'll upload the logs. This run currently ignores 
INVALID_RECORD, which I run into more regularly. I may stop gating this to just 
the cooperative test and instead run it against all balancers at once (heavier 
load seems to hit the problem more frequently).

 

Also, this reminds me: somebody had a feature request that deliberately abused 
the ability to produce before AddPartitionsToTxn was done, and I need to remove 
support for this for 3.6+. This _is_ exercised in franz-go's CI right now and 
will fail CI for 3.6+ (see the doc comment on 
[EndBeginTxnUnsafe|https://pkg.go.dev/github.com/twmb/franz-go/pkg/kgo#EndBeginTxnHow]).

Edit: KAFKA-15653 may be complicating the investigation here, too.



> Unexpected errors when producing transactionally in 3.6
> -------------------------------------------------------
>
>                 Key: KAFKA-15657
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15657
>             Project: Kafka
>          Issue Type: Bug
>          Components: producer 
>    Affects Versions: 3.6.0
>            Reporter: Travis Bischel
>            Priority: Major
>
> In loop-testing the franz-go client, I am frequently receiving INVALID_RECORD 
> (which I created a separate issue for), and INVALID_TXN_STATE and 
> UNKNOWN_SERVER_ERROR.
> INVALID_TXN_STATE is being returned even though the partitions have been 
> added to the transaction (AddPartitionsToTxn). Nothing about the code has 
> changed between 3.5 and 3.6, and I have loop-integration-tested this code 
> against 3.5 thousands of times. 3.6 is newly - and always - returning 
> INVALID_TXN_STATE. If I change the code to retry on INVALID_TXN_STATE, I 
> quickly (and always) receive UNKNOWN_SERVER_ERROR. In looking at the 
> broker logs, the broker indicates that sequence numbers are out of order - 
> but (a) I am repeating requests that were in order (so something on the 
> broker got a little haywire maybe? or maybe this is due to me ignoring 
> invalid_txn_state?), _and_ I am not receiving OUT_OF_ORDER_SEQUENCE_NUMBER, I 
> am receiving UNKNOWN_SERVER_ERROR.
> I think the main problem is the client unexpectedly receiving 
> INVALID_TXN_STATE, but a second problem here is that OOOSN is being mapped to 
> USE on return for some reason.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
