Yocto Technical Team Minutes, Engineering Sync, for June 22, 2021 archive: https://docs.google.com/document/d/1ly8nyhO14kDNnFcW2QskANXW3ZT7QwKC5wWVDg9dDH4/edit
== disclaimer ==
Best efforts are made to ensure the below is accurate and valid. However, errors sometimes happen. If any errors or omissions are found, please feel free to reply to this email with any corrections.

== attendees ==
Trevor Woerner, Stephen Jolley, Michael Halstead, Paul Barker, Richard Purdie, Scott Murray, Steve Sakoman, Tony Tascioglu, Trevor Gamblin, Alejandro Hernandez, Alexandre Belloni, Jan-Simon Möller, Randy MacLeod

== notes ==
- 3.4 m1 (honister) through QA, a couple of issues found, in review by TSC to decide on release
- 3.1.9 (dunfell) currently being built, then will be sent to QA
- thanks to PaulG for tracking down the null pointer in the cgroups mount code (and other fixes) that have led to solving some AB hangs
- the rcu dump AB vm hangs are still occurring
- 2 new manual sections: reproducible builds, and YP compatible
- multiconfig changes continue to cause issues
- still lots of AB-INT issues; we’re working on trying to define the load pattern that causes them

== general ==
RP: special thanks to Paul Gortmaker for finding the null pointer issue to fix the LTP builds
PaulB: still working on the pr-server changes. my changes work on my setup but don’t seem to do well on the AB. if i enable parallelism in my builds then the oom-killer gets busy. it’s annoying that the output from a test is buffered until the test is finished, so if a test hangs then i can’t find out which one is hanging. it looks like we’ll have to dig into async-io or python parallelism to see if there’s something wrong there, or at least figure out how to get more output before a test finishes
RP: did you look at the trace output i found?
PaulB: yes, but i’m not exactly sure what i’m looking at
RP: if the bitbake server starts one of the servers, then it’s responsible for taking it out when it shuts down. so they run individually, but there is some sharing. that traceback suggests to me that it’s trying to shut down the pr server at shutdown time but not succeeding.
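A minimal sketch of the parent/child shutdown pattern discussed above: the server that spawned a PR-server child is responsible for taking it down, and a child that blocks (for example on a slow sqlite sync to disk) leaves the join hanging. All names here (run_prserv, SHUTDOWN_TIMEOUT) are illustrative, not bitbake's actual API.

```python
import multiprocessing
import time

SHUTDOWN_TIMEOUT = 5  # seconds; illustrative, not a real bitbake setting

def run_prserv(stop_event):
    # Stand-in for the PR server loop; a real server would also
    # sync its sqlite database to disk before exiting, which is
    # where the blocking under high io can happen.
    while not stop_event.is_set():
        time.sleep(0.1)

def shutdown(proc, stop_event):
    """Ask the child to exit; escalate if it never acknowledges."""
    stop_event.set()
    proc.join(SHUTDOWN_TIMEOUT)
    if proc.is_alive():
        # The child is stuck past the deadline -- this is the state
        # the autobuilder traceback suggests.
        proc.terminate()
        proc.join()
    return proc.exitcode

if __name__ == "__main__":
    stop = multiprocessing.Event()
    p = multiprocessing.Process(target=run_prserv, args=(stop,))
    p.start()
    print(shutdown(p, stop))
```

A bounded `join()` like this at least turns a silent hang into a diagnosable timeout.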
sometimes the sqlite database takes time to sync to disk, for example (especially when there is high io), but bitbake ends up blocked. it looks like some contention between the parse threads and the locks. when doing python parallelism there are 2 systems and we have to make sure not to mix the two.
PaulB: in one case it looks like one of the comm sockets is already closed, but the server is waiting for it to close again; i would think we’d get the same traceback every time
RP: maybe it’s stuck somewhere it shouldn’t be, which makes it skip some error/exit handling. or maybe add more debug code
PaulB: yes, i think adding timeouts to every wait/callback to see which ones are timing out
RP: can python give you a dump of what’s in every wait state?
PaulB: not sure, i’ll have to look at what python provides. bitbake isn’t what’s telling prserv to exit on the cluster, but it is on my builds
RP: that seems unlikely
PaulB: i can’t generate the same conditions locally as what we’re seeing on the AB. because of thread contention there seems to be, perhaps, a resource exhaustion that i can’t reproduce
RP: put a “sleep 5” in the shutdown path in the prserv. it would cause hashserve to exist longer than bitbake
PaulB: or back out the use of multiprocessing but keep the new json rpc system, and do a more traditional fork() to start the process up
RP: i suspect we did the fork() for a reason
PaulB: multiprocessing didn’t exist back then
RP: it did, but it was broken. i sent a cooker daemon log which should show what was running at the time, so we should be able to reverse engineer what was going on. i’m getting good at reading those, so i can probably read them and tell you what tests were running at the time
PaulB: if we could just run that one test
RP: i would run just the prserv selftest and put that sleep in there. i’m convinced it’s one of them
PaulB: i’m afraid i’m going to have to hand this off to someone else.
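On the question of whether python can dump what every thread is waiting on: the stdlib can, via sys._current_frames() (and faulthandler at the C level). A small sketch, with illustrative names:

```python
import faulthandler
import io
import sys
import threading
import time
import traceback

def dump_all_threads():
    """Return a formatted stack for every live thread, showing
    exactly which wait each one is parked in."""
    buf = io.StringIO()
    for tid, frame in sys._current_frames().items():
        buf.write("Thread %d:\n" % tid)
        traceback.print_stack(frame, file=buf)
    return buf.getvalue()

def waiter(event):
    event.wait()  # parked here; the dump should show this line

if __name__ == "__main__":
    ev = threading.Event()
    t = threading.Thread(target=waiter, args=(ev,), daemon=True)
    t.start()
    time.sleep(0.2)  # let the thread park in wait()
    print(dump_all_threads())
    # faulthandler can also arm a watchdog that dumps every stack
    # if the process is still running after a deadline:
    faulthandler.dump_traceback_later(60, exit=False)
    faulthandler.cancel_dump_traceback_later()
    ev.set()
```

Arming `faulthandler.dump_traceback_later()` at test start is one cheap way to see which wait a hung selftest is stuck in without instrumenting every callback.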
i’ll give it one more push, but then i’m out of ideas
RP: okay. it’d be good to get that in because i’m looking forward to it
TrevorW: move to SPDX license identifiers?
RP: we have
TrevorW: classes and ptests yes, but not recipes and other places in oecore
RP: any place where we’re writing python we have. it’s a nice cleanup thing and we’ve been embracing it; we’ve done it with code. the problem with recipes is that if we put that at the top of the recipe it’ll get confused with the LICENSE field, so we’re not quite sure what to do in that case
PaulB: i've been using a linting tool that checks for these identifiers. again, there's the problem of what license to put on the recipe, and for patches etc. (https://reuse.software/spec)
RP: licenses in patches is something to think about and would make upstreaming easier. we’ve done the low-hanging fruit but need more thought for other cases. there’s also a standard form for copyright headers too
PaulB: yes, there’s been an update to the copyright format as well
RP: YP has now attained silver status for core infrastructure in the LF. to get to gold we need copyright on all files, and many of our recipes don’t have it
ScottM: if there’s a statement at the top of the repo, doesn’t that help?
RP: i don’t think it accounts for that. and then we have to decide if these are oe contributors or yocto contributors, so i left it at that. if anyone wants to move this forward i’m definitely interested; there are some questions we need to answer. our code is good at having copyright, but the problem is with recipes
RP: there are a couple of things we need for gold status: 2-factor auth for git pushes, some https issues (best practices changed), and licensing
TrevorG: working on the job server. i create a dir in tmp to create the fifo, but i'm not seeing any difference in performance.
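The kind of check the linting tool mentioned above performs (per the reuse.software spec) can be sketched in a few lines. This is an illustrative scanner, not the actual tool; the helper names and suffix list are assumptions:

```python
import re
from pathlib import Path

# Matches the standard tag form, e.g. "# SPDX-License-Identifier: MIT"
SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*(\S+)")

def spdx_identifier(path, head_lines=10):
    """Return the SPDX identifier from the first few lines, or None."""
    try:
        text = Path(path).read_text(errors="replace")
    except OSError:
        return None
    for line in text.splitlines()[:head_lines]:
        m = SPDX_RE.search(line)
        if m:
            return m.group(1)
    return None

def missing_spdx(root, suffixes=(".py", ".bb", ".bbclass", ".inc")):
    """List files under root that carry no SPDX identifier."""
    return [p for p in Path(root).rglob("*")
            if p.suffix in suffixes and spdx_identifier(p) is None]
```

Scanning only the head of each file is what keeps a recipe's own LICENSE field from being mistaken for a file-level identifier, which is the ambiguity raised above.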
i’m not sure kea is the right way to go
RP: it was deleting the fifo
TrevorG: yes, at the end of the day it’s using the one that’s tied to the user process
RP: that’s the correct way to do it. if it already exists then it should get an EEXIST error and just go ahead and use it, as long as it’s not deleting it
TrevorG: i’ve probably left something out, so i’ll look through it and see
RP: it was a POC i had written, but i can't remember how much testing it got. these rcu issues seem to be related to a peak-load issue, so it’d be nice to dig in and get to the core of the issue
TrevorG: tl;dr: no definitive results yet
Randy: make has the ability to look at the system load and use that to decide whether or not to add more load; is there a reason we’re not considering that?
RP: good question. we’ve talked about it before, but i can't remember why we decided not to go that route
RP: any updates on the rcu things, AlexB? we had another one over the weekend
AlexB: right now i don’t have anything to say because i believe this is something that is stuck in userspace. so i’m debugging, but unfortunately the kernel we’re using doesn’t have debugging enabled, so i want to build another kernel with debugging enabled so we can test with it. maybe we can carry a patch for the qemu kernels on the AB which will enable more debugging. we’re also missing a few important symbols that would help us see more, so i don't think these core dumps are important as-is.
RP: which debug symbols are you missing? our dumps have a lot of things enabled
AlexB: debuginfo. there’s the risk that a new kernel will change behaviour, but we should try
RP: because we’re so reproducible, we should be able to debug more easily. would a complete kernel build help?
AlexB: which config?
RP: x86-64 musl. we should change the qmp code to run the commands that we found
AlexB: yes.
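Two ideas from the jobserver discussion above can be sketched together: create the fifo, treat EEXIST as "reuse it, never delete it", and gate new work on the system load the way `make -l` does. This is an illustrative sketch, not the actual POC; the function names are assumptions:

```python
import errno
import os

def open_jobserver_fifo(path):
    """Create the jobserver fifo if needed; if another process
    already made it, EEXIST is fine and we reuse it (and we never
    delete it, since it is shared)."""
    try:
        os.mkfifo(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise
    # O_RDWR so the open doesn't block waiting for a peer;
    # O_NONBLOCK so reads return immediately when no token is free.
    return os.open(path, os.O_RDWR | os.O_NONBLOCK)

def under_load(max_load):
    """make -l style gate: defer spawning more work when the
    1-minute load average exceeds max_load."""
    return os.getloadavg()[0] > max_load
```

Token passing then reduces to writing and reading single bytes on the fifo, which is essentially the GNU make jobserver protocol.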
i’m carrying the 2nd qmp socket patch (instead of going through the close)
RP: it’d be nice to always have this
AlexB: how do we detect the hang?
RP: there are timeouts, so if we hit the timeouts then we assume there’s a hang
AlexB: we do that before killing…
RP: yes. if you have notes about how to set this up with gdb i’d like to try replicating it
AlexB: okay. i was hoping this was a problem in the kernel and that the answer would be right there. we’ve also seen the load of qemu going from 300 to 400, then back to 300, then 400.
RP: it’s like something is waiting for something else to happen
AlexB: you straced it?
RP: yes, while pinging it. and we could see the straces going into the ping
AlexB: yes, which means the kernel is still alive, but the rcu is saying that the kernel is not alive, so something is confused. something seems to be waiting for something to happen that might never happen
Randy: so no progress on the rcu stalls?
RP: we’ve made progress at poking qemu, and we have this core dump, but no insight into what is causing it
Randy: has anyone tried to reproduce it? i’ve got a busy system and i’ve run stress-ng
RP: i haven’t tried that.
Randy: where does it happen?
RP: ubuntu 18.04
AlexB: but maybe that’s because that’s what most of our AB is running?
RP: that’s the info we have. if we can reproduce it then that would be wonderful, since that would make it easier to debug. making qmp better is worth it in any case.
Randy: what about the kernel change PaulG made? i’d like to hammer on it
RP: i’ve done some hammering and haven’t seen any issues yet
Randy: i’ll do some hammering tonight so we can add some stats
RP: we’ve seen one arm AB issue, a hanging read of a message in /proc, but we’ve only seen it once
Randy: we don’t have any arm builders. what are you using?
RP: an ampere (something) and a huawei (something). not sure what the trigger is; no idea how to reproduce. it’s a hanging read(), but you can still ssh into the system.
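The hang-detection heuristic described above (hit a timeout with no output, assume a hang, collect diagnostics before killing) can be sketched roughly as follows. This is not the autobuilder's actual code; the function name and the choice of a per-read silence window are assumptions:

```python
import os
import select
import subprocess
import sys

def run_with_hang_detection(cmd, timeout):
    """Run cmd; if its stdout goes silent for `timeout` seconds,
    treat it as hung. This is the point where the AB would fire
    qmp/gdb diagnostics before killing the qemu process."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    output = b""
    hung = False
    while True:
        ready, _, _ = select.select([proc.stdout], [], [], timeout)
        if not ready:
            # No output within the window: assume a hang.
            proc.kill()
            proc.wait()
            hung = True
            break
        chunk = os.read(proc.stdout.fileno(), 4096)
        if not chunk:  # EOF: process finished normally
            proc.wait()
            break
        output += chunk
    return output, hung

if __name__ == "__main__":
    out, hung = run_with_hang_detection(
        [sys.executable, "-c", "print('hello')"], timeout=30)
    print(hung)
```

A refinement in the real setup is to keep the target alive after the timeout fires (e.g. via the second qmp socket) so state can be inspected instead of lost.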
we may end up disabling the test
PaulB: back to copyright headers. i’ve pasted the output of a run i’ve just done looking for licenses. there’s probably some low-hanging fruit there and some false positives.