RE: Re: Kafka trunk test & build stability

2024-02-03 Thread kafka
I wonder if we've considered adding a Gradle task timeout [0] on unitTest and integrationTest tasks. The timeout applies separately for each subproject and marks the currently running test as SKIPPED on timeout. This helped me find a test which stalls builds [1]. [0]
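
A minimal sketch of the task timeout described above, assuming a Gradle Kotlin DSL root build script (Kafka's own build uses the Groovy DSL, so this is an illustration rather than a patch). The task names `unitTest` and `integrationTest` come from the message; the two-hour limit is a made-up value.

```kotlin
// build.gradle.kts (root project). Sketch only, not Kafka's actual build script.
import java.time.Duration

subprojects {
    // Each subproject owns its own Test tasks, so the limit applies per subproject.
    tasks.withType<Test>()
        .matching { it.name == "unitTest" || it.name == "integrationTest" }
        .configureEach {
            // Illustrative value, not a figure from the thread.
            timeout.set(Duration.ofHours(2))
        }
}
```

Because `timeout` is a property of every Gradle task, the same setting could be applied to other long-running tasks if that proved useful.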

Re: Kafka trunk test & build stability

2024-01-25 Thread Justine Olshan
It looks like there was some server maintenance that shut down Jenkins. Upon coming back up, the builds were expired but unable to stop. They all had similar logs:
Cancelling nested steps due to timeout
Cancelling nested steps due to timeout
Body did not finish within grace period; terminating

Re: Kafka trunk test & build stability

2024-01-25 Thread Justine Olshan
Hey folks -- I noticed some builds have been running for a day or more. I thought we limited builds to 8 hours. Any ideas why this is happening? https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/activity/ I tried to abort the build for PR-15257, and it also still seems to

Re: Kafka trunk test & build stability

2024-01-14 Thread Qichao Chu
Hi Divij and all, Regarding the speeding up of the build & de-flaking tests, LinkedIn has done some great work which we probably can borrow ideas from. In the LinkedIn/Kafka repo, we can see one of their most recent PRs only took < 9 min(unit

Re: Kafka trunk test & build stability

2024-01-10 Thread Divij Vaidya
Hey folks, we seem to have a handle on the OOM issues thanks to the multiple fixes community members made. In https://issues.apache.org/jira/browse/KAFKA-16052, you can compare the "before" profile in the description with the "after" profile in the latest comment. To prevent future

Re: Kafka trunk test & build stability

2024-01-09 Thread Colin McCabe
Sorry, but to put it bluntly, the current build setup isn't good enough at partial rebuilds for build caching to make sense. All Kafka devs have had the experience of needing to clean the build directory in order to get a valid build. The Scala code especially seems to have this issue.

Re: Kafka trunk test & build stability

2024-01-02 Thread Nick Telford
Addendum: I've opened a PR with what I believe are the changes necessary to enable Remote Build Caching, if you choose to go that route: https://github.com/apache/kafka/pull/15109 On Tue, 2 Jan 2024 at 14:31, Nick Telford wrote: > Hi everyone, > > Regarding building a "dependency graph"...

Re: Kafka trunk test & build stability

2024-01-02 Thread Nick Telford
Hi everyone, Regarding building a "dependency graph"... Gradle already has this information, albeit fairly coarse-grained. You might be able to get some considerable improvement by configuring the Gradle Remote Build Cache. It looks like it's currently disabled explicitly:
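
For readers unfamiliar with the feature, here is a rough sketch of what enabling a Gradle remote build cache can look like in a Kotlin DSL settings script. The cache URL and the `CI` environment variable check are hypothetical placeholders; nothing here is taken from Kafka's actual settings or from the PR mentioned in this thread.

```kotlin
// settings.gradle.kts. Sketch only; URL and environment variable are assumptions.
buildCache {
    local {
        // Keep the local cache on for developer machines.
        isEnabled = true
    }
    remote<HttpBuildCache> {
        // Hypothetical cache endpoint; replace with a real cache node.
        url = uri("https://cache.example.invalid/cache/")
        // Only CI populates the cache; everyone else reads from it.
        isPush = System.getenv("CI") != null
    }
}
```

Caching still has to be switched on for a build, for example with `--build-cache` on the command line or `org.gradle.caching=true` in `gradle.properties`.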

Re: Kafka trunk test & build stability

2024-01-02 Thread Lucas Brutschy
Thanks for all the work that has already been done on this in the past days! Have we considered running our test suite with -XX:+HeapDumpOnOutOfMemoryError and uploading the heap dumps as Jenkins build artifacts? This could speed up debugging. Even if we store them only for a day and do it only
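
One possible way to wire up the JVM flag mentioned above, sketched in the Gradle Kotlin DSL. The `heap-dumps` directory name is an assumption chosen for illustration; the two `-XX` options are standard HotSpot flags.

```kotlin
// build.gradle.kts. Sketch only; the output directory name is hypothetical.
val heapDumpDir = layout.buildDirectory.dir("heap-dumps")

tasks.withType<Test>().configureEach {
    // Make sure the target directory exists before the test JVMs start.
    doFirst { heapDumpDir.get().asFile.mkdirs() }
    jvmArgs(
        "-XX:+HeapDumpOnOutOfMemoryError",
        // With a directory path, the JVM writes java_pid<pid>.hprof files into it.
        "-XX:HeapDumpPath=${heapDumpDir.get().asFile.absolutePath}"
    )
}
```

A CI step such as Jenkins' `archiveArtifacts` could then collect the resulting `.hprof` files, subject to whatever retention limits the infrastructure allows.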

Re: Kafka trunk test & build stability

2023-12-27 Thread Divij Vaidya
I have started to perform an analysis of the OOM at https://issues.apache.org/jira/browse/KAFKA-16052. Please feel free to contribute to the investigation. -- Divij Vaidya On Wed, Dec 27, 2023 at 1:23 AM Justine Olshan wrote: > I am still seeing quite a few OOM errors in the builds and I was

Re: Kafka trunk test & build stability

2023-12-26 Thread Justine Olshan
I am still seeing quite a few OOM errors in the builds and I was curious if folks had any ideas on how to identify the cause and fix the issue. I was looking in gradle enterprise and found some info about memory usage, but nothing detailed enough to help figure the issue out. OOMs sometimes fail

Re: Kafka trunk test & build stability

2023-12-26 Thread David Arthur
S2. We’ve looked into this before, and it wasn’t possible at the time with JUnit. We commonly set a timeout on each test class (especially integration tests). It is probably worth looking at this again and seeing if something has changed with JUnit (or our usage of it) that would allow a global

Re: Kafka trunk test & build stability

2023-12-26 Thread Greg Harris
Hey Stan & Sophie, About the 90-day view: That was restricted to only trunk builds. If we include PR builds, there are 100 builds longer than 5h20m in the last 90 days, which is a significant number. It may still be caused by environmental factors that S-3 would address, but we might be able to find a test

Re: Kafka trunk test & build stability

2023-12-26 Thread Sophie Blee-Goldman
Regarding: S-4. Separate tests ran depending on what module is changed. > - This makes sense although is tricky to implement successfully, as > unrelated tests may expose problems in an unrelated change (e.g changing > core stuff like clients, the server, etc) Imo this avenue could provide a

Re: Kafka trunk test & build stability

2023-12-26 Thread Stanislav Kozlovski
Great discussion! Greg, that was a good call out regarding the two long-running builds. I missed that 90d view. My takeaway from that is that our average build time for tests is between 3 and 4 hours, which in and of itself seems large. But then reconciling this with Sophie's statement - is it

Re: Kafka trunk test & build stability

2023-12-22 Thread Justine Olshan
Thanks David! I think this should help a lot! While we should include these improvements, I think it is also good to remind folks that a lot of these issues come from merging on builds that regress the CI. I know I'm not perfect at this (and have merged on flaky and failing tests), but let's all

Re: Kafka trunk test & build stability

2023-12-22 Thread David Jacot
I just merged both PRs. Cheers, David On Fri, 22 Dec 2023 at 14:38, David Jacot wrote: > Hey folks, > > I believe that my two PRs will fix most of the issues. I have also tweaked > the configuration of Jenkins to fix the issues relating to cloning the > repo. There may be other issues but

Re: Kafka trunk test & build stability

2023-12-22 Thread David Jacot
Hey folks, I believe that my two PRs will fix most of the issues. I have also tweaked the configuration of Jenkins to fix the issues relating to cloning the repo. There may be other issues but the overall situation should be much better when I merge those two. I will update this thread when I

Re: Kafka trunk test & build stability

2023-12-22 Thread Divij Vaidya
Hey folks I think David (dajac) has some fixes lined-up to improve CI such as https://github.com/apache/kafka/pull/15063 and https://github.com/apache/kafka/pull/15062. I have some bandwidth for the next two days to work on fixing the CI. Let me start by taking a look at the list that Sophie

Re: Kafka trunk test & build stability

2023-12-22 Thread Luke Chen
Hi Sophie and Philip and all, I share the same pain as you. I've been waiting for a CI build result in a PR for days. Unfortunately, I can only get 1 result each day because it takes 8 hours for each run, and with failed results. :( I've looked into the 8 hour timeout build issue and would like

Re: Kafka trunk test & build stability

2023-12-21 Thread Philip Nee
Hey Sophie - I've gotten 2 inflight PRs each with more than 15 retries... Namely: https://github.com/apache/kafka/pull/15023 and https://github.com/apache/kafka/pull/15035 justin filed a flaky test report here though: https://issues.apache.org/jira/browse/KAFKA-16045 P On Thu, Dec 21, 2023 at

Re: Kafka trunk test & build stability

2023-12-21 Thread Sophie Blee-Goldman
On a related note, has anyone else had trouble getting even a single run with no build failures lately? I've had multiple pure-docs PRs blocked for days or even weeks because of miscellaneous infra, test, and timeout failures. I know we just had a discussion about whether it's acceptable to ever

Re: Kafka trunk test & build stability

2023-12-19 Thread David Jacot
The slowness of the CI is definitely causing us a lot of pain. I wonder if we should move to a dedicated CI infrastructure for Kafka. Our integration tests are quite heavy and ASF's CI is not really tuned for them. We could tune it for our needs and this would also allow external companies to

Re: Kafka trunk test & build stability

2023-12-19 Thread Stanislav Kozlovski
Hey Николай, Apologies about this - I wasn't aware of this behavior. I have made all the gists public. On Wed, Dec 20, 2023 at 12:09 AM Greg Harris wrote: > Hey Stan, > > Thanks for opening the discussion. I haven't been looking at overall > build duration recently, so it's good that you are

Re: Kafka trunk test & build stability

2023-12-19 Thread Greg Harris
Hey Stan, Thanks for opening the discussion. I haven't been looking at overall build duration recently, so it's good that you are calling it out. I worry about us over-indexing on this one build, which itself appears to be an outlier. I only see one other build [1] above 6h overall in the last

Re: Kafka trunk test & build stability

2023-12-19 Thread Николай Ижиков
Hello, Stanislav. Can you please make the gist public? Private gists are not available to some GitHub users even if the link is known. > On 19 Dec 2023, at 17:33, Stanislav Kozlovski > wrote: > > Hey everybody, > I've heard various complaints that build times in trunk are taking too > long,

Re: Kafka trunk test & build stability

2023-12-19 Thread Viktor Somogyi-Vass
Hey Stan, I also experienced this; some of the tests indeed take a long time. As an immediate workaround, do you think we can enforce a global timeout of, let's say, 10 minutes? I don't know if these are taking a long time because of some race condition or because of the lack of resources and
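
A sketch of one way a global per-test timeout like this could be expressed, assuming the tests run on JUnit Jupiter 5.5 or later, where the `junit.jupiter.execution.timeout.default` configuration parameter is available. Whether this fits Kafka's test setup is not established by the thread; the 10-minute value mirrors the suggestion above.

```kotlin
// build.gradle.kts. Sketch only; assumes tests run on the JUnit Platform.
tasks.withType<Test>().configureEach {
    useJUnitPlatform()
    // Default timeout for every test and lifecycle method that does not
    // declare its own @Timeout. JUnit reads this from JVM system properties.
    systemProperty("junit.jupiter.execution.timeout.default", "10 m")
}
```

Unlike a Gradle task timeout, which bounds the whole task, this limit applies to each test method individually, so one stalled test fails on its own instead of consuming the entire build's time budget.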

Kafka trunk test & build stability

2023-12-19 Thread Stanislav Kozlovski
Hey everybody, I've heard various complaints that build times in trunk are taking too long, some taking as much as 8 hours (the timeout) - and this is slowing us down from being able to meet the code freeze deadline for 3.7. I took it upon myself to gather up some data in Gradle Enterprise to see