Hi Stefano,

On Wed, 20 Oct 2021 at 03:55, Stefano Babic <sba...@denx.de> wrote:
>
> On 20.10.21 05:42, Simon Glass wrote:
> > Hi,
> >
> > On Tue, 19 Oct 2021 at 17:01, Tom Rini <tr...@konsulko.com> wrote:
> >>
> >> On Tue, Oct 19, 2021 at 04:59:15PM -0600, Simon Glass wrote:
> >>> Hi Tom,
> >>>
> >>> On Tue, 19 Oct 2021 at 16:53, Tom Rini <tr...@konsulko.com> wrote:
> >>>>
> >>>> On Tue, Oct 19, 2021 at 05:39:12PM +0200, Stefano Babic wrote:
> >>>>> Hi Simon,
> >>>>>
> >>>>> On 07.10.21 15:43, Simon Glass wrote:
> >>>>>> Hi Stefano,
> >>>>>>
> >>>>>> On Thu, 7 Oct 2021 at 04:37, Stefano Babic <sba...@denx.de> wrote:
> >>>>>>>
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> CI stops while building aarch64 without notice, for reference:
> >>>>>>>
> >>>>>>> https://source.denx.de/u-boot/custodians/u-boot-imx/-/jobs/332319
> >>>>>>>
> >>>>>>> There is no error, the process is just killed. It looks like it
> >>>>>>> stops at xilinx_zynqmp_virt,
> >>>>>>>
> >>>>>>> ./tools/buildman/buildman -o /tmp -P -E -W aarch64
> >>>>>>>
> >>>>>>> but the board can be built without issues.
> >>>>>>>
> >>>>>>> If I build on my host (not in docker, anyway), it generally builds
> >>>>>>> fine - but it crashes sometimes, too. On the gitlab instance, it
> >>>>>>> crashes. The issue does not seem to depend on merged patches, and
> >>>>>>> the introduced boards were already built successfully. Any hint?
> >>>>>>> I also have no idea what I should look at, as what I see is just
> >>>>>>>
> >>>>>>> "usr/bin/bash: line 104: 24 Killed
> >>>>>>> ./tools/buildman/buildman -o /tmp -P -E -W aarch64"
> >>>>>>
> >>>>>> I cannot see that link. I am not sure what is going on. Does it say
> >>>>>> what signal killed it?
> >>>>>
> >>>>> Pipelines on our server were not public - I have enabled them now for
> >>>>> u-boot-imx.
> >>>>>
> >>>>>>
> >>>>>> Does it sit there for an hour and timeout? If so, then I did see that
> >>>>>> myself once recently, when the Kconfig needed stdin, but I could not
> >>>>>> quite tie it down.
> >>>>>> I think buildman would provide it, but sometimes
> >>>>>> not, apparently. So it can happen when there is an existing build
> >>>>>> there and your new one adds Kconfig options that don't have
> >>>>>> defaults, or something like that?
> >>>>>>
> >>>>>
> >>>>> I have investigated further, and I can reproduce it on my host outside
> >>>>> the gitlab server. buildman causes an OOM, but I cannot find the cause.
> >>>>>
> >>>>> Strangely enough, this happens with the "aarch64" target, and I cannot
> >>>>> reproduce it with Tom's master. So it seems that -master is ok, and
> >>>>> something on u-boot-imx generates the OOM.
> >>>>>
> >>>>> However....
> >>>>>
> >>>>> The OOM always happens when -2 (two boards remain) appears. I can see
> >>>>> with htop that buildman starts to allocate memory until it is
> >>>>> exhausted (64GB RAM + 8 GB swap). Then the kernel decides that it is
> >>>>> enough and kills buildman - this is what I see on CI.
> >>>>>
> >>>>> You can see now the pipelines:
> >>>>>
> >>>>> https://source.denx.de/u-boot/custodians/u-boot-imx/-/pipelines/9520
> >>>>>
> >>>>> I have then split aarch64 and I built imx8 separately - same result.
> >>>>> The pipeline stops with a xilinx board, but they have nothing to do
> >>>>> with it. In fact, I can build all xilinx boards separately. If I run
> >>>>> buildman -W aarch64 -x xilinx, the OOM is shown by another board.
> >>>>>
> >>>>> Strangely enough, I can build each single board with buildman without
> >>>>> issues, neither errors nor warnings. Only when buildman runs them all
> >>>>> together (aarch64, 308 boards) is the OOM generated.
> >>>>>
> >>>>> Bisect does not help: I started a bisect, and at the end this commit
> >>>>> was presented:
> >>>>>
> >>>>> commit 53a24dee86fb72ae41e7579607bafe13442616f2
> >>>>> Author: Fabio Estevam <feste...@denx.de>
> >>>>> Date: Mon Aug 23 21:11:09 2021 -0300
> >>>>>
> >>>>>     imx8mm-cl-iot-gate: Split the defconfigs
> >>>>
> >>>> I strongly suspect what's going on here is that these new defconfigs are
> >>>> out of sync with changes now in Kconfig. The build itself will just sit
> >>>> there, waiting for the "oldconfig" prompt to be answered.
> >>>>
> >>>> I want to say the problem here is that stdin is open, rather than
> >>>> pointing to something closed, which would lead to the build failing
> >>>> immediately, rather than once a timeout is hit, or OOM kicks in due to
> >>>> kconfig chewing up all the memory.
> >>>
> >>> Yes that's exactly what I saw...
> >>>
> >>> In fact, see this commit:
> >>>
> >>> e62a24ce27a buildman: Avoid hanging when the config changes
> >>>
> >>> But that was 3 years ago.
> >>
> >> Looks like something else needs to be changed then, I've bisected down
> >> similar failures here before very recently.
> >
> > I dug into this a bit and I think buildman can detect this situation.
> > I'll send a little series.
> >
>
> The patches definitely help ;-)
>
> They break the build (on CI when build-tools runs), but I get much more
> details when I build single boards locally. For kontron-sl-mx8mm I can
> find several errors due to:
>
> - CONFIG_SYS_LOAD_ADDR not defined in configs, but in header
> - CONFIG_SYS_EXTRA_OPTIONS instead of CONFIG_IMX_CONFIG
> - CONFIG_SYS_MALLOC_LEN not defined in config, but in header
>
> Your patches are a valuable tool (CI drove me crazy), I can now follow
> what happens. I will send a patch for kontron, and go on with the rest (I
> guess kontron is not the only board causing this deadlock). Many thanks!
>
> Tom, I apply Simon's patches on my tree, I cannot work without them...

OK good. Unfortunately this is likely to throw up problems that are
only fatal 1% of the time.

Regards,
Simon