Re: Transactions, delivery timeout and changing transactional producer behavior

2022-09-11 Thread Colt McNealy
Hi all—

I'm not a committer so I can't review this PR (or is that not true?).
However, I'd like to bump this as well. I believe that I've encountered
this bug during chaos testing with the transactional producer. I can
sometimes reproduce this error by killing a broker during a long-running
transaction, which causes a batch to hit the delivery timeout as
described in the Jira. I have observed some inconsistencies with
the consumer offset being advanced prematurely (i.e. perhaps after the
delivery of the EndTxnRequest).
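
A minimal sketch of the kind of read_committed verification consumer such a
chaos test might use (bootstrap address, group id, and topic are placeholders,
not the actual test harness):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ReadCommittedChecker {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // placeholder
            props.put("group.id", "txn-verifier");              // placeholder
            props.put("isolation.level", "read_committed");     // only committed data is visible
            props.put("enable.auto.commit", "false");           // commit only after verification
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("chaos-test-topic"));  // placeholder topic
                while (true) {
                    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                        // Log what is actually readable, to compare against what the
                        // transactional producer believes it committed.
                        System.out.printf("%s-%d@%d key=%s%n",
                                rec.topic(), rec.partition(), rec.offset(), rec.key());
                    }
                    consumer.commitSync();
                }
            }
        }
    }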

Daniel, thank you for the PR.

Cheers,
Colt McNealy
*Founder, LittleHorse.io*

On Fri, Sep 9, 2022 at 9:54 AM Dániel Urbán  wrote:

> Hi all,
>
> I would like to bump this and bring some attention to the issue.
> This is a nasty bug in the transactional producer; it would be nice if I could
> get some feedback on the PR: https://github.com/apache/kafka/pull/12392
>
> Thanks in advance,
> Daniel
>
> Viktor Somogyi-Vass wrote (on Mon, Jul 25, 2022, 15:28):
>
> > Hi Luke & Artem,
> >
> > We have prepared the fix; could you please help us find a committer to
> > review it so this issue can be resolved?
> >
> > Thanks,
> > Viktor
> >
> > On Fri, Jul 8, 2022 at 12:57 PM Dániel Urbán wrote:
> >
> > > Submitted a PR with the fix: https://github.com/apache/kafka/pull/12392
> > > In the PR I tried keeping the producer in a usable state after the forced
> > > bump. I understand that it might not be the cleanest solution, but the only
> > > other option I know of is to transition into a fatal state, meaning that
> > > the producer has to be recreated after a delivery timeout. I think that is
> > > still fine compared to the out-of-order messages.
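
As an illustration of that fatal-state alternative (not the approach taken in
the PR), a minimal sketch of what callers would have to do if the producer
became unusable after a delivery timeout. The topic name is a placeholder, and
`props` is assumed to already contain bootstrap.servers, transactional.id and
the serializers:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.KafkaException;

    class RecreateOnFatalError {
        static KafkaProducer<String, String> newProducer(Properties props) {
            KafkaProducer<String, String> p = new KafkaProducer<>(props);
            p.initTransactions();   // re-runs InitProducerId and bumps the epoch
            return p;
        }

        static void run(Properties props) {
            KafkaProducer<String, String> producer = newProducer(props);
            while (true) {
                try {
                    producer.beginTransaction();
                    producer.send(new ProducerRecord<>("placeholder-topic", "key", "value"));
                    producer.commitTransaction();
                } catch (KafkaException e) {
                    // Under the "fatal state" alternative, a delivery timeout would
                    // leave the producer unusable: close it and build a new one,
                    // which fences the old epoch's batches on the broker.
                    producer.close();
                    producer = newProducer(props);
                }
            }
        }
    }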
> > >
> > > Looking forward to your reviews,
> > > Daniel
> > >
> > > Dániel Urbán wrote (on Thu, Jul 7, 2022, 12:04):
> > >
> > > > Thanks for the feedback. I created
> > > > https://issues.apache.org/jira/browse/KAFKA-14053 and started working on
> > > > a PR.
> > > >
> > > > Luke, for the workaround, we used the transaction admin tool released in
> > > > 3.0 to "abort" these hanging batches manually.
> > > > Naturally, the cluster health should be stabilized. This issue popped up
> > > > most frequently around times when some partitions went into a few-minute
> > > > window of unavailability. The infinite retries on the producer side caused
> > > > a situation where the last retry was still in flight, but the delivery
> > > > timeout was triggered on the client side. We reduced the retries and
> > > > increased the delivery timeout to avoid such situations.
> > > > Still, the issue can occur in other scenarios, for example a client
> > > > queueing up many batches in the producer buffer, causing those batches
> > > > to spend most of the delivery timeout window in client memory.
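
For reference, a sketch of the producer-side tuning described above; the exact
values are only illustrative, not recommendations:

    import java.util.Properties;

    // Producer configuration used as the workaround; values are illustrative.
    class WorkaroundProducerConfig {
        static Properties build() {
            Properties props = new Properties();
            props.put("transactional.id", "my-tx-id");      // placeholder
            props.put("bootstrap.servers", "broker:9092");  // placeholder
            // Fewer retries, so the last retry is less likely to still be in flight
            // when the client-side delivery timeout fires (the default is
            // effectively unlimited, Integer.MAX_VALUE)...
            props.put("retries", "5");
            // ...and a larger delivery timeout overall (the default is 120000 ms).
            // Note: delivery.timeout.ms must be >= request.timeout.ms + linger.ms.
            props.put("delivery.timeout.ms", "300000");
            return props;
        }
    }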
> > > >
> > > > Thanks,
> > > > Daniel
> > > >
> > > > Luke Chen wrote (on Thu, Jul 7, 2022, 5:13):
> > > >
> > > >> Hi Daniel,
> > > >>
> > > >> Thanks for reporting the issue, and the investigation.
> > > >> I'm curious, so, what's your workaround for this issue?
> > > >>
> > > >> I agree with Artem, it makes sense. Please file a bug in JIRA.
> > > >> And looking forward to your PR! :)
> > > >>
> > > >> Thank you.
> > > >> Luke
> > > >>
> > > >> On Thu, Jul 7, 2022 at 3:07 AM Artem Livshits wrote:
> > > >>
> > > >> > Hi Daniel,
> > > >> >
> > > >> > What you say makes sense. Could you file a bug and put this info there
> > > >> > so that it's easier to track?
> > > >> >
> > > >> > -Artem
> > > >> >
> > > >> > On Wed, Jul 6, 2022 at 8:34 AM Dániel Urbán <urb.dani...@gmail.com> wrote:
> > > >> >
> > > >> > > Hello everyone,
> > > >> > >
> > > >> > > I've been investigating some transaction-related issues in a very
> > > >> > > problematic cluster. Besides finding some interesting issues, I had
> > > >> > > some ideas about how transactional producer behavior could be improved.
> > > >> > >
> > > >> > > My suggestion, in short, is: when the transactional producer encounters
> > > >> > > an error which doesn't necessarily mean that the in-flight request was
> > > >> > > processed (for example a client-side timeout), the producer should not
> > > >> > > send an EndTxnRequest on abort, but instead it should bump the producer
> > > >> > > epoch.
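
To make the suggestion concrete, a rough sketch of the decision it implies in
the producer's abort path; this is purely illustrative, not the actual
KafkaProducer/TransactionManager internals, and all names below are made up:

    // Hypothetical illustration of the proposed abort behavior.
    class ProposedAbortPath {
        enum EndTxnResult { ABORT }

        // 'outcomeUnknown' models errors like a client-side delivery timeout,
        // where the broker may still process the timed-out batch later.
        void onAbort(boolean outcomeUnknown) {
            if (outcomeUnknown) {
                // Sending EndTxnRequest(ABORT) now could let the late batch land
                // inside the *next* transaction, out of order - so bump the epoch
                // instead, which fences all older batches on the broker.
                bumpProducerEpoch();
            } else {
                sendEndTxnRequest(EndTxnResult.ABORT);   // the current behavior
            }
        }

        void bumpProducerEpoch() { /* placeholder */ }
        void sendEndTxnRequest(EndTxnResult result) { /* placeholder */ }
    }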
> > > >> > >
> > > >> > > The longer description of the issue I found, and how I came to the
> > > >> > > suggestion:
> > > >> > >
> > > >> > > First, the description of the issue. When I say that the cluster is "very
> > > >> > > problematic", I mean all kinds of different issues, be it infra (disks and
> > > >> > > network) or throughput (high-volume producers without fine tuning).
> > > >> > > In this cluster, Kafka transactions are widely used by many producers. And
> > > >> > > in this cluster, partitions 

Re: Problem with Kafka KRaft 3.1.X

2022-09-11 Thread Colin McCabe
Thanks, Paul. I would be really curious to see the talk when you're done :)

BTW, David Arthur recently posted a KIP that, once it's done, should remove the
upper limit on the number of elements in a batch for CreateTopics or
CreatePartitions.

best,
Colin


On Fri, Sep 9, 2022, at 17:22, Paul Brebner wrote:
> Colin, hi, the current max partitions reached is about 600,000 - I had to
> increase Linux file descriptors, mmap, and tweak the JVM heap settings a
> bit - heap error again.
> This is a bit of a hack too, as RF=1 and only a single EC2 instance - a
> proper 3-node cluster would in theory give >1M partitions, which was what I
> really wanted to test out. I think I was also hitting this error attempting
> to create a single topic with lots of partitions:
> https://github.com/apache/kafka/pull/12595
> The current approach is to create multiple topics with 1000 partitions each,
> or a single topic and increase the number of partitions.
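
For reference, a minimal sketch of the multi-topic approach with the Java
AdminClient; the topic naming, batch size, and counts are placeholders:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.NewTopic;

    public class BulkTopicCreator {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // placeholder
            try (Admin admin = Admin.create(props)) {
                List<NewTopic> topics = new ArrayList<>();
                for (int i = 0; i < 600; i++) {                  // 600 x 1000 = 600,000 partitions
                    topics.add(new NewTopic("bench-" + i, 1000, (short) 1)); // RF=1, as in the test
                }
                // Create in smaller batches to keep individual requests modest.
                for (int i = 0; i < topics.size(); i += 50) {
                    admin.createTopics(topics.subList(i, Math.min(i + 50, topics.size())))
                         .all().get();
                }
            }
        }
    }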
> I've also got some good numbers around the speed of metadata operations in
> ZooKeeper vs. KRaft mode (KRaft is lots faster = O(1) c.f. O(n) for ZK) etc.
> Anyway, I'm happy I've got some numbers to report for my talk now - thanks
> for the info.
>
> Regards, Paul
>
> On Sat, 10 Sept 2022 at 02:43, Colin McCabe  wrote:
>
>> Hi Paul,
>>
>> As Keith wrote, it does sound like you are hitting a separate Linux limit
>> like the max mmap count.
>>
>> I'm curious how many partitions you can create if you change that config!
>>
>> best,
>> Colin
>>
>>
>> On Tue, Sep 6, 2022, at 14:02, Keith Paulson wrote:
>> > I've had similar errors caused by mmap counts; try with
>> > vm.max_map_count=262144
>> >
>> >
>> > On 2022/09/01 23:57:54 Paul Brebner wrote:
>> >> Hi all,
>> >>
>> >> I've been attempting to benchmark the Kafka KRaft version for an ApacheCon
>> >> talk and have identified 2 problems:
>> >>
>> >> 1 - it's still impossible to create a large number of partitions/topics - I
>> >> can create more than the comparable ZooKeeper version, but still not
>> >> "millions" - this is with RF=1 only (as anything higher needs huge clusters
>> >> to cope with the replication CPU overhead), and no load on the clusters
>> >> yet (i.e. purely a topic/partition creation experiment).
>> >>
>> >> 2 - eventually the topic/partition creation command causes the Kafka
>> >> process to fail - looks like a memory error -
>> >>
>> >> java.lang.OutOfMemoryError: Metaspace
>> >> OpenJDK 64-Bit Server VM warning: INFO:
>> >> os::commit_memory(0x7f4f554f9000, 65536, 1) failed; error='Not enough
>> >> space' (errno=12)
>> >>
>> >> or similar error
>> >>
>> >> This seems to happen consistently around 30,000+ partitions - this is on a
>> >> test EC2 instance with 32 GB RAM, 500,000 file descriptors (increased from
>> >> the default) and a 64 GB disk (plenty spare). I'm not an OS expert, but the
>> >> Kafka process and the OS both seem to have plenty of RAM when this error
>> >> occurs.
>> >>
>> >> So there are really 3 questions: What's going wrong exactly? How to achieve
>> >> more partitions? And should the topic create command (just using the CLI at
>> >> present to create topics) really be capable of killing the Kafka instance,
>> >> or should it fail and throw an error, and the Kafka instance still continue
>> >> working...
>> >>
>> >> Regards, Paul Brebner
>> >>
>>


Re: Problem with Kafka KRaft 3.1.X

2022-09-11 Thread Paul Brebner
Thanks, that fix would be nice :-) Paul

On Mon, 12 Sept 2022 at 10:41, Colin McCabe  wrote:

> Thanks, Paul. I would be really curious to see the talk when you're done :)
>
> BTW, David Arthur recently posted a KIP that, once it's done, should remove
> the upper limit on the number of elements in a batch for CreateTopics or
> CreatePartitions.
>
> best,
> Colin
>
>
> On Fri, Sep 9, 2022, at 17:22, Paul Brebner wrote:
> > Colin, hi, current max partitions reached is about 600,000 - I had to
> > increase Linux file descriptors, mmap, and tweak the JVM heap settings a
> > bit - heap error again.
> > This is a bit of a hack too, as RF=1 and only a single EC2 instance - a
> > proper 3-node cluster would in theory give >1M partitions, which was what I
> > really wanted to test out. I think I was also hitting this error attempting
> > to create a single topic with lots of partitions:
> > https://github.com/apache/kafka/pull/12595
> > The current approach is to create multiple topics with 1000 partitions each,
> > or a single topic and increase the number of partitions.
> > I've also got some good numbers around the speed of metadata operations in
> > ZooKeeper vs. KRaft mode (KRaft is lots faster = O(1) c.f. O(n) for ZK) etc.
> > Anyway I'm happy I've got some numbers to report for my talk now, thanks
> > for the info.
> >
> > Regards, Paul
> >
> > On Sat, 10 Sept 2022 at 02:43, Colin McCabe  wrote:
> >
> >> Hi Paul,
> >>
> >> As Keith wrote, it does sound like you are hitting a separate Linux
> limit
> >> like the max mmap count.
> >>
> >> I'm curious how many partitions you can create if you change that
> config!
> >>
> >> best,
> >> Colin
> >>
> >>
> >> On Tue, Sep 6, 2022, at 14:02, Keith Paulson wrote:
> >> > I've had similar errors caused by mmap counts; try with
> >> > vm.max_map_count=262144
> >> >
> >> >
> >> > On 2022/09/01 23:57:54 Paul Brebner wrote:
> >> >> Hi all,
> >> >>
> >> >> I've been attempting to benchmark the Kafka KRaft version for an ApacheCon
> >> >> talk and have identified 2 problems:
> >> >>
> >> >> 1 - it's still impossible to create a large number of partitions/topics - I
> >> >> can create more than the comparable ZooKeeper version, but still not
> >> >> "millions" - this is with RF=1 only (as anything higher needs huge clusters
> >> >> to cope with the replication CPU overhead), and no load on the clusters
> >> >> yet (i.e. purely a topic/partition creation experiment).
> >> >>
> >> >> 2 - eventually the topic/partition creation command causes the Kafka
> >> >> process to fail - looks like a memory error -
> >> >>
> >> >> java.lang.OutOfMemoryError: Metaspace
> >> >> OpenJDK 64-Bit Server VM warning: INFO:
> >> >> os::commit_memory(0x7f4f554f9000, 65536, 1) failed; error='Not enough
> >> >> space' (errno=12)
> >> >>
> >> >> or similar error
> >> >>
> >> >> This seems to happen consistently around 30,000+ partitions - this is on a
> >> >> test EC2 instance with 32 GB RAM, 500,000 file descriptors (increased from
> >> >> the default) and a 64 GB disk (plenty spare). I'm not an OS expert, but the
> >> >> Kafka process and the OS both seem to have plenty of RAM when this error
> >> >> occurs.
> >> >>
> >> >> So there are really 3 questions: What's going wrong exactly? How to achieve
> >> >> more partitions? And should the topic create command (just using the CLI at
> >> >> present to create topics) really be capable of killing the Kafka instance,
> >> >> or should it fail and throw an error, and the Kafka instance still continue
> >> >> working...
> >> >>
> >> >> Regards, Paul Brebner
> >> >>
> >>
>


[jira] [Created] (KAFKA-14218) replace temp file handler with JUnit 5 Temporary Directory Support

2022-09-11 Thread Luke Chen (Jira)
Luke Chen created KAFKA-14218:
-

Summary: replace temp file handler with JUnit 5 Temporary Directory Support
Key: KAFKA-14218
URL: https://issues.apache.org/jira/browse/KAFKA-14218
Project: Kafka
Issue Type: Improvement
Components: unit tests
Reporter: Luke Chen


We create many temp files in tests, and sometimes we forget to delete them
after usage. Instead of polluting @AfterEach for each test, we should consider
using the JUnit 5 TempDirectory extension.

 

REF:
1. https://github.com/apache/kafka/pull/12591#issuecomment-1243001431
2. https://www.baeldung.com/junit-5-temporary-directory
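
For illustration, a minimal example of the extension in question (not an
existing Kafka test; class and file names are made up):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import org.junit.jupiter.api.Test;
    import org.junit.jupiter.api.io.TempDir;
    import static org.junit.jupiter.api.Assertions.assertTrue;

    class TempDirExampleTest {

        @TempDir
        Path tempDir;   // created before each test, deleted automatically afterwards

        @Test
        void writesIntoManagedTempDir() throws IOException {
            Path logFile = Files.createFile(tempDir.resolve("segment.log"));
            assertTrue(Files.exists(logFile));
            // No @AfterEach cleanup needed: JUnit removes tempDir and its contents.
        }
    }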

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)