Re: [git pull] drm for 6.9-rc1
On Tue, 12 Mar 2024 at 21:07, Dave Airlie wrote: > > I've done a trial merge into your tree from a few hours ago, there > are definitely some slighty messy conflicts, I've pushed a sample > branch here: I appreciate your sample merges since I like verifying my end result, but I think your merge is wrong. I got two differences when I did the merge. The one in intel_dp_detect() I think is just syntactic - I ended up placing the if (!intel_dp_is_edp(intel_dp)) intel_psr_init_dpcd(intel_dp); differently than you did (I did it *after* the tunnel_detect()). I don't _think,_ that placement matters, but somebody more familiar with the code should check it out. Added Animesh and Jani to the participants. But I think your merge gets the TP_printk() for the xe_bo_move trace event is actively wrong. You don't have the destination for the move in the printk. Or maybe I got it wrong. Our merges end up _close_, but not identical. Linus
Re: [PATCH 1/3] kci-gitlab: Introducing GitLab-CI Pipeline for Kernel Testing
On Fri, 1 Mar 2024 at 02:27, Nikolai Kondrashov wrote: > > I agree, it's hard to imagine even a simple majority agreeing on how GitLab CI > should be done. Still, we would like to help people, who are interested in > this kind of thing, to set it up. How about we reframe this contribution as a > sort of template, or a reference for people to start their setup with, > assuming that most maintainers would want to tweak it? We would also be glad > to stand by for questions and help, as people try to use it. Ack. I think seeing it as a library for various gitlab CI models would be a lot more palatable. Particularly if you can then show that yes, it is also relevant to our currently existing drm case. So I'm not objecting to having (for example) some kind of CI helper templates - I think a logical place would be in tools/ci/ which is kind of alongside our tools/testing subdirectory. (And then perhaps have a 'gitlab' directory under that. I'm not sure whether - and how much - commonality there might be between the different CI models of different hosts). Just to clarify: when I say "a logical place", I very much want to emphasize the "a" - maybe there are better places, and I'm not saying that is the only possible place. But it sounds more logical to me than some. Linus
Re: [PATCH 1/3] kci-gitlab: Introducing GitLab-CI Pipeline for Kernel Testing
On Thu, 29 Feb 2024 at 01:23, Nikolai Kondrashov wrote: > > However, I think a better approach would be *not* to add the .gitlab-ci.yaml > file in the root of the source tree, but instead change the very same repo > setting to point to a particular entry YAML, *inside* the repo (somewhere > under "ci" directory) instead. I really don't want some kind of top-level CI for the base kernel project. We already have the situation that the drm people have their own ci model. II'm ok with that, partly because then at least the maintainers of that subsystem can agree on the rules for that one subsystem. I'm not at all interested in having something that people will then either fight about, or - more likely - ignore, at the top level because there isn't some global agreement about what the rules are. For example, even just running checkpatch is often a stylistic thing, and not everybody agrees about all the checkpatch warnings. I would suggest the CI project be separate from the kernel. And having that slack channel that is restricted to particular companies is just another sign of this whole disease. If you want to make a google/microsoft project to do kernel CI, then more power to you, but don't expect it to be some kind of agreed-upon kernel project when it's a closed system. Linus
Re: [PATCH next v2 08/11] minmax: Add min_const() and max_const()
On Sun, 25 Feb 2024 at 08:53, David Laight wrote: > > The expansions of min() and max() contain statement expressions so are > not valid for static intialisers. > min_const() and max_const() are expressions so can be used for static > initialisers. I hate the name. Naming shouldn't be about an implementation detail, particularly not an esoteric one like the "C constant expression" rule. That can be useful for some internal helper functions or macros, but not for something that random people are supposed to USE. Telling some random developer that inside an array size declaration or a static initializer you need to use "max_const()" because it needs to syntactically be a constant expression, and our regular "max()" function isn't that, is just *horrid*. No, please just use the traditional C model of just using ALL CAPS for macro names that don't act like a function. Yes, yes, that may end up requiring getting rid of some current users of #define MIN(a,b) ((a)<(b) ? (a):(b)) but dammit, we don't actually have _that_ many of them, and why should we have random drivers doing that anyway? Linus
Re: [PATCH 2/9] lib/cmdline: Fix an invalid format specifier in an assertion msg
On Thu, 22 Feb 2024 at 09:36, Daniel Latypov wrote: > > Copying the line for context, it's about `p-r` where > p = memchr_inv([1], 0, sizeof(r) - sizeof(r[0])); > `p-r` should never be negative unless something has gone horribly > horribly wrong. Sure it would - if 'p' is NULL. Of course, then a negative value wouldn't be helpful either, and in this case that's what the EXPECT_PTR_EQ checking is testing in the first place, so it's a non-issue. IOW, in practice clearly the sign should simply not matter here. I do think that the default case for pointer differences should be that they are signed, because they *can* be. Just because of that "default case", unless there's some actual reason to use '%tu', I think '%td' should be seen as the normal case to use. That said, just as a quick aside: be careful with pointer differences in the kernel. For this particular case, when we're talking about just 'char *', it's not a big deal, but we've had code where people didn't think about what it means to do a pointer difference in C, and how it can be often unnecessarily expensive due to the implied "divide by the size of the pointed object". Sometimes it's actually worth writing the code in ways that avoids pointer differences entirely (which might involve passing around indexes instead of pointers). Linus
Re: [PATCH 8/9] drm/xe/tests: Fix printf format specifiers in xe_migrate test
On Wed, 21 Feb 2024 at 21:05, Lucas De Marchi wrote: > > this has a potential to cause conflicts with upcoming work, so I think > it's better to apply this through drm-xe-next. Let me know if you agree. I disagree. Violently. For this to be fixed, we need to have the printf format checking enabled. And we can't enable it until all the problems have been fixed. Which means that we should *not* have to wait for [N] different trees to fix their issues separately. This should get fixed in the Kunit tree, so that the Kunit tree can just send a pull request to me to enable format checking for the KUnit tests, together with all the fixes. Trying to spread those fixes out to different git branches will only result in pain and pointless dependencies between different trees. Honestly, the reason I noticed the problem in the first place was that the drm tree had a separate bug, that had been apparently noted in linux-next, and *despite* that it made it into a pull request to me and caused new build failures in rc5. So as things are, I am not IN THE LEAST interested in some kind of "let us fix this in the drm tree separately" garbage. We're not making things worse by trying to fix this in different trees. We're fixing this in the Kunit tree, and I do not want to get *more* problems from the drm side. I've had enough. Linus
Re: [PATCH next 10/11] block: Use a boolean expression instead of max() on booleans
On Sun, 28 Jan 2024 at 14:22, David Laight wrote: > > H blame gcc :-) I do agree that the gcc warning quoting is unnecessarily ugly (even just visually), but.. > The error message displays as '0' but is e2:80:98 30 e2:80:99 > I HATE UTF-8, it wouldn't be as bad if it were a bijection. No, that's not the problem. The UTF-8 that gcc emits is fine. And your email was also UTF-8: Content-Type: text/plain; charset=UTF-8 The problem is that you clearly then used some other tool in between that took the UTF-8 byte stream, and used it as (presumably) Latin1, which is bogus. If you just make everything use and stay as UTF-8, it all works out beautifully. But I suspect you have an editor or a MUA that is fixed in some 1980s mindset, and when you cut-and-pasted the UTF-8, it treated it as Latin1. Just make all your environment be utf-8, like it should be. It's not the 80s any more. We don't do mullets, and we don't do Latin1, ok? Linus
Re: [PATCH next 10/11] block: Use a boolean expression instead of max() on booleans
On Sun, 28 Jan 2024 at 11:36, David Laight wrote: > > However it generates: > error: comparison of constant ‘0’ with boolean expression is always true > [-Werror=bool-compare] > inside the signedness check that max() does unless a '+ 0' is added. Please fix your locale. You have random garbage characters there, presumably because you have some incorrect locale setting somewhere in your toolchain. Linus
Re: [BUG] BUG: kernel NULL pointer dereference at ttm_device_init+0xb4
On Mon, 22 Jan 2024 at 16:56, Bhardwaj, Rajneesh wrote: > > I think a fix might already be in flight. Please see Linux-Kernel Archive: > Re: [PATCH] drm/ttm: fix ttm pool initialization for no-dma-device drivers > (iu.edu) Please use lore.kernel.org that doesn't corrupt whitespace in patches or lose header information: https://lore.kernel.org/lkml/20240113213347.9562-1-pchel...@ispras.ru/ although that seems to be a strange definition of "in flight". It was sent out 8 days ago, and apparently nobody thought to include it in the drm fixes pile that came in last Friday. So it made it into rc1, even though it was reported a week before. It also looks like some mailing list there is mangling emails - if you use 'all' instead of 'lkml', lore reports multiple emails with the same message-id, and it all looks messier as a result. I assume it's dri-devel@lists.freedesktop.org that messes up, mainly because I don't tend to see this behaviour when only the usual kernel.org mailing lists are involved. Linus
Re: [BUG] BUG: kernel NULL pointer dereference at ttm_device_init+0xb4
On Mon, 22 Jan 2024 at 15:17, Steven Rostedt wrote: > > Perhaps this is the real fix? If you send a signed-off version, I'll apply it asap. Thanks, Linus
Re: [git pull] drm for 6.8
On Wed, 10 Jan 2024 at 11:49, Dave Airlie wrote: > > Let me know if there are any issues, Your testing is seriously lacking. This doesn't even build. The reason seems to be that commit b49e894c3fd8 ("drm/i915: Replace custom intel runtime_pm tracker with ref_tracker library") changed the 'intel_wakeref_t' type from a 'depot_stack_handle_t' to 'unsigned long', and as a result did this: - drm_dbg(>drm, "async_put_wakeref %u\n", + drm_dbg(>drm, "async_put_wakeref %lu\n", power_domains->async_put_wakeref); meanwhile, the Xe driver has this: drivers/gpu/drm/xe/compat-i915-headers/intel_wakeref.h: typedef bool intel_wakeref_t; which has never been valid, but now the build fails with drivers/gpu/drm/i915/display/intel_display_power.c: In function ‘print_async_put_domains_state’: drivers/gpu/drm/i915/display/intel_display_power.c:408:29: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 5 has type ‘int’ [-Werror=format=] because the drm header files have this disgusting thing where a *header* file includes a *C* file: In file included from ./include/drm/drm_mm.h:51, from drivers/gpu/drm/xe/xe_bo_types.h:11, from drivers/gpu/drm/xe/xe_bo.h:11, from ./drivers/gpu/drm/xe/compat-i915-headers/gem/i915_gem_object.h:11, from ./drivers/gpu/drm/xe/compat-i915-headers/i915_drv.h:15, from drivers/gpu/drm/i915/display/intel_display_power.c:8: nasty. I made it build by fixing that broken Xe compat header file, but this is definitely *NOT* how things should have worked. How did this ever get to me without any kind of build testing? And why the %^!@$% does a header file include a C file? That's wrong regardless of this bug. Linus
Re: [git pull] drm fixes for 6.8
On Wed, 3 Jan 2024 at 18:30, Dave Airlie wrote: > > These were from over the holiday period, mainly i915, a couple of > qaic, bridge and an mgag200. > > I have a set of nouveau fixes that I'll send after this, that might be > too rich for you at this point. > > I expect there might also be some more regular fixes before 6.8, but > they should be minor. I'm assuming you're just confused about the numbering, and meant 6.7 here and in the subject line. This seems to be too small of a pull to be an early pull request for the 6.8 merge window. Linus
Re: [GIT PULL] fbdev fixes and updates for v6.7-rc3
On Sat, 25 Nov 2023 at 22:58, Helge Deller wrote: > > please pull some small fbdev fixes for 6.7-rc3. These all seem to be pure cleanups, not bug fixes. Please resend during the merge window. Linus
github version complaints about the gitlab CI requirements.txt
So every time I push to my github mirror, github now ends up having a 'dependabot' thing that warns about some of the CI version requirements for the gitlab automated testing file. It wants to update the pip requirements from 23.2.1 to 23.3 - When installing a package from a Mercurial VCS URL, e.g. pip install hg+..., with pip prior to v23.3, the specified Mercurial revision could be used to inject arbitrary configuration options to the hg clone call (e.g. --config). Controlling the Mercurial configuration can modify how and which repository is installed. This vulnerability does not affect users who aren't installing from Mercurial. and upgrade the urllib3 requirements from 2.0.4 to 2.0.7: - urllib3's request body not stripped after redirect from 303 status changes request method to GET - `Cookie` HTTP header isn't stripped on cross-origin redirects And it's not like any of this looks like a big deal, but I'd like to shut up the messages I get. I can either just close those issues, or I can apply a patch something like the attached (which also adds a missing newline at the end). I thought I should ask the people who actually set this up. Comments? Linus drivers/gpu/drm/ci/xfails/requirements.txt | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/ci/xfails/requirements.txt b/drivers/gpu/drm/ci/xfails/requirements.txt index d8856d1581fd..e9994c9db799 100644 --- a/drivers/gpu/drm/ci/xfails/requirements.txt +++ b/drivers/gpu/drm/ci/xfails/requirements.txt @@ -5,7 +5,7 @@ termcolor==2.3.0 certifi==2023.7.22 charset-normalizer==3.2.0 idna==3.4 -pip==23.2.1 +pip==23.3 python-gitlab==3.15.0 requests==2.31.0 requests-toolbelt==1.0.0 @@ -13,5 +13,5 @@ ruamel.yaml==0.17.32 ruamel.yaml.clib==0.2.7 setuptools==68.0.0 tenacity==8.2.3 -urllib3==2.0.4 -wheel==0.41.1 \ No newline at end of file +urllib3==2.0.7 +wheel==0.41.1
Re: drm/vkms: deadlock between dev->event_lock and timer
On Wed, 13 Sept 2023 at 07:21, Tetsuo Handa wrote: > > Hello. A deadlock was reported in drivers/gpu/drm/vkms/ . > It looks like this locking pattern remains as of 6.6-rc1. Please fix. > > void drm_crtc_vblank_off(struct drm_crtc *crtc) { > spin_lock_irq(>event_lock); > drm_vblank_disable_and_save(dev, pipe) { > __disable_vblank(dev, pipe) { > crtc->funcs->disable_vblank(crtc) == vkms_disable_vblank { > hrtimer_cancel(>vblank_hrtimer) { // Retries with > dev->event_lock held until lock_hrtimer_base() succeeds. > ret = hrtimer_try_to_cancel(timer) { > base = lock_hrtimer_base(timer, ); // Fails forever because > vkms_vblank_simulate() is in progress. Heh. Ok. This is clearly a bug, but it does seem to be limited to just the vkms driver, and literally only to the "simulate vblank" case. The worst part about it is that it's so subtle and not obvious. Some light grepping seems to show that amdgpu has almost the exact same pattern in its own vkms thing, except it uses hrtimer_try_to_cancel(_crtc->vblank_timer); directly, which presumably fixes the livelock, but means that the cancel will fail if it's currently running. So just doing the same thing in the vkms driver probably fixes things. Maybe the vkms people need to add a flag to say "it's canceled" so that it doesn't then get re-enabled? Or maybe it doesn't matter and/or already happens for some reason I didn't look into. Linus
Re: [git pull] drm CI integration
On Wed, 30 Aug 2023 at 18:00, Dave Airlie wrote: > > This is a PR to add drm-ci support files to the upstream tree. So I finally had no other pull requests pending, and spent some time looking at this, and I see nothing offensive. I did wonder how this then expands to having more than one subsystem using this (and mixing them possibly on the same CI system), but that's more about my ignorance about how the gitlab CI works than anything else, so that's certainly not a real concern. The other side of that "I do wonder" coin is for when others want to use the same tests but using some other CI infrastructure, whether it's some AWS or google cloud thing, or github or whatever. Anyway, considering that both of my idle curiosity reactions were about "if this is successful", I think me having those questions only means that I should pull this, rather than questioning the pull itself. If it works out so well that others want to actually do this and integrate our other selftests in similar manners, I think that would be lovely. And if - as you say - this is a failure and the whole thing gets deleted a year from now as "this didn't go anywhere", it doesn't look like it should cause a ton of problems either. Anyway, it's in my tree now, let's see where it goes. Linus
Re: [git pull] drm fixes for 6.6-rc1
On Thu, 7 Sept 2023 at 19:45, Dave Airlie wrote: > > Just a poke about the outstanding drm CI support pull request since I > haven't see any movement on that in the week, hopefully it's not as > difficult a problem as bcachefs :-) I was assuming that it wouldn't interfere with anything else... and that I could just ignore it until I have all my "real" pulls done. I didn't want to even look at it until I was "done". Yes, it's past the mid-way point of the second week of the merge window, and I mostly _should_ be "done". But I still keep finding new pull requests in my inbox that aren't just fixes and updates for previous main pull requests. Linus
Re: mainline build failure due to 501126083855 ("fbdev/g364fb: Use fbdev I/O helpers")
On Thu, 31 Aug 2023 at 11:48, Sudip Mukherjee (Codethink) wrote: > The latest mainline kernel branch fails to build mips jazz_defconfig with > the error: > > drivers/video/fbdev/g364fb.c:115:9: error: 'FB_DEFAULT_IOMEM_HELPERS' > undeclared here (not in a function); did you mean 'FB_DEFAULT_IOMEM_OPS'? > 115 | FB_DEFAULT_IOMEM_HELPERS, > | ^~~~ > | FB_DEFAULT_IOMEM_OPS > > > git bisect pointed to 501126083855 ("fbdev/g364fb: Use fbdev I/O helpers"). > > Reverting the commit has fixed the build failure. > > I will be happy to test any patch or provide any extra log if needed. Would you mind testing the exact thing that the compiler suggested? So instead of the revert, just replace FB_DEFAULT_IOMEM_HELPERS with FB_DEFAULT_IOMEM_OPS. I think it's just a typo / confusion with the config variable (which is called FB_IOMEM_HELPERS). Linus
Re: [git pull] drm fixes for 6.5-rc4
On Thu, 27 Jul 2023 at 19:20, Dave Airlie wrote: > > Regular scheduled fixes, msm and amdgpu leading the way, with some > i915 and a single misc fbdev, all seems fine. Pulled. Tangentially related: where do you keep your pgp key? The one I have is long expired, and doing a refresh doesn't get any updates... Linus
Re: [pull] amdgpu drm-fixes-6.4
On Fri, 23 Jun 2023 at 14:18, Alex Deucher wrote: > > are available in the Git repository at: > > https://gitlab.freedesktop.org/agd5f/linux.git > tags/amd-drm-fixes-6.4-2023-06-23 That's not a valid signed tag. Yes, it's a tag. But it doesn't actually have any cryptographic signing, and I'm not willing to pull random content from git sites that I can't verify. In fact, these days I ask even kernel.org pull requests to be proper signed tags, although I haven't really gotten to the point where I *require* it. So please sign your tags - use "git tag -s" (or "-u keyname" if you have some specific key you want to use, rather than one described by your regular git config file). Linus
Re: mainline build failure due to 322458c2bb1a ("drm/ttm: Reduce the number of used allocation orders for TTM pages")
On Wed, Apr 26, 2023 at 10:44 AM Sudip Mukherjee (Codethink) wrote: > > drivers/gpu/drm/ttm/ttm_pool.c:73:29: error: variably modified > 'global_write_combined' at file scope >73 | static struct ttm_pool_type global_write_combined[TTM_DIM_ORDER]; > | ^ Ugh. This is because we have #define TTM_DIM_ORDER (__TTM_DIM_ORDER <= MAX_ORDER ? __TTM_DIM_ORDER : MAX_ORDER) which looks perfectly fine as a constant ("pick the smaller of MAX_ORDER and __TTM_DIM_ORDER"). But: #define __TTM_DIM_ORDER (TTM_MAX_ORDER + 1) #define TTM_MAX_ORDER (PMD_SHIFT - PAGE_SHIFT) which still _looks_ fine, but on 64-bit powerpc, we then have #define PTE_INDEX_SIZE __pte_index_size so that __TTM_DIM_ORDER constant isn't actually a constant at all. I would suggest that the DRM code just make those fixed-size arrays use "MAX_ORDER", because even though it might be a bit bigger than necessary, that's still not a very big constant (ie typically it's "11", but it could be up to 64). It's a bit sad how that macro that _looks_ like a constant (and is one pretty much everywhere else) isn't actually constant on powerpc, but looking around it looks fairly unavoidable. Maybe the DRM code could try to avoid using things like PMD_SHIFT entirely? Linus
Disabling -Warray-bounds for gcc-13 too
Kees, I made the mistake of upgrading my M2 Macbook Air to Fedora-38, and in the process I got gcc-13 which is not WERROR-clean because we only limited the 'array-bounds' warning to gcc-11 and gcc-12. But gcc-13 has all the same issues. And I want to be able to do my arm64 builds with WERROR on still... I guess it never made much sense to hope it was going to go away without having a confirmation, so I just changed it to be gcc-11+. A lot of the warnings seem just crazy, with gcc just not getting the bounds right, and then being upset about us going backwards with 'container_of()' etc. Ok, so the kernel is special. We do odd things. I get it, gcc ends up being confused. But before I disabled it, I did take a look at a couple of warnings that didn't look like the sea of crazy. And one of them is from you. In particular, commit 4076ea2419cf ("drm/nouveau/disp: Fix nvif_outp_acquire_dp() argument size") cannot possibly be right, It changes nvif_outp_acquire_dp(struct nvif_outp *outp, u8 dpcd[16], to nvif_outp_acquire_dp(struct nvif_outp *outp, u8 dpcd[DP_RECEIVER_CAP_SIZE], and then does memcpy(args.dp.dpcd, dpcd, sizeof(args.dp.dpcd)); where that 'args.dp.dpcd' is a 16-byte array, and DP_RECEIVER_CAP_SIZE is 15. So yeah, it's copying 16 bytes from an argument that claims to be 15 bytes in size. I think that commit was wrong, and the problem is that the 'dpcd' array is something 15 and sometimes 16. For example, we have struct nouveau_encoder { ... union { struct { ... u8 dpcd[DP_RECEIVER_CAP_SIZE]; } dp; }; so there it's indeed 15 bytes, but then we have union nvif_outp_acquire_args { struct nvif_outp_acquire_v0 { ... union { ... struct { ... __u8 dpcd[16]; } dp; where it's 16. I think it's all entirely harmless from a code generation standpoint, because the 15-byte field will be padded out to 16 bytes in the structure that contains it, but it's most definitely buggy. So that warning does find real cases of wrong code. But when those real cases are hidden by hundreds of lines of unfixable false positives, we don't have much choice. But could the Nouveau driver *please* pick a size for the dhcp[] array and stick with it? The other driver where the warnings didn't look entirely crazy was the ath/carl9170 wireless driver, but I didn't look closer at that one. Linus
Re: Linux 6.3-rc3
On Wed, Mar 22, 2023 at 9:40 AM Sedat Dilek wrote: > > You have to pass `make LLVM=1` in any case... to `oldconfig` or when > adding any MAKEFLAGS like -j${number-of-available-cpus}. I actually think we should look (again) at just making the compiler choice (and the prefix) be a Kconfig option. That would simplify *so* many use cases. It used to be that gcc was "THE compiler" and anything else was just an odd toy special case, but that's clearly not true any more. So it would be lovely to make the kernel choice a Kconfig choice - so you'd set it only at config time, and then after that a kernel build wouldn't need special flags any more, and you'd never need to play games with GNUmakefile or anything like that. Yes, you'd still use environment variables (or make arguments) for that initial Kconfig, but that's no different from the other environment variables we already have, like KCONFIG_SEED that kconfig uses internally, but also things like "$(ARCH)" that we already use *inside* the Kconfig files themselves. I really dislike how you have to set ARCH and CROSS_COMPILE etc externally, and can't just have them *in* the config file. So when you do cross-compiles, right now you have to do something like make ARCH=i386 allmodconfig to build the .config file, but then you have to *repeat* that ARCH=i386 when you actually build things: make ARCH=i386 because the ARCH choice ends up being in the .config file, but the makefiles themselves always take it from the environment. There are good historical reasons for our behavior (and probably a number of extant practical reasons too), but it's a bit annoying, and it would be lovely if we could start moving away from this model. Linus
Re: Linux 6.3-rc3
On Mon, Mar 20, 2023 at 3:06 PM Nathan Chancellor wrote: > > Right, this seems like a subtle difference in semantics between > -Wuninitialized between clang and GCC. I guess it's a bit ambiguous whether it's "X may be USED uninitialized" or whether it is "X may BE uninitialized" and then depending on how you see that ambiguity, the control flow matters. In this case, there is absolutely no question that the variable is uninitialized (since there is no write to it at all). So it is very clearly and unambiguously uninitialized. And I do think that as a result, "-Wuninitialized" should warn. But at the same time, whether it is *used* or not depends on that conditional, so I can see how it could be confusing and not be so clear an unambiguous. On the whole, I do wish that the logic would be "after dead code removal, if some pseudo has no initializer, it should always warn, regardless of any remaining dynamic conditoinals". That "after dead code removal" might matter, because I could see where config things (#ifdef's etc) would just remove the initialization of some variable, and if the use is behind some static "if (0)", then warning about it is all kinds of silly. Linus
Re: Linux 6.3-rc3
On Mon, Mar 20, 2023 at 1:05 PM Guenter Roeck wrote: > > I have noticed that gcc doesn't always warn about uninitialized variables > in most architectures. Yeah, I'm getting the feeling that when the gcc people were trying to make -Wmaybe-uninitialized work better (when moving it into "-Wall"), they ended up moving a lot of "clearly uninitialized" cases into it. So then because we disable the "maybe" case (with -Wno-maybe-uninitialized) because it had too many random false positives, we end up not seeing the obvious cases either. Linus
Re: Linux 6.3-rc3
On Mon, Mar 20, 2023 at 11:56 AM Nathan Chancellor wrote: > > I did see a patch fly by to fix that: > > https://lore.kernel.org/20230316082035.567520-3-christian.koe...@amd.com/ > > It seems like the DRM_TEGRA half of it is broken though: > > https://lore.kernel.org/202303170635.a2rsq1wu-...@intel.com/ Hmm. x86-64 has 'vmap()' too. So I think that DRM_TEGRA breakage is likely just due to a missing header file include that then (by luck and mistake) gets included on arm. You need for 'vmap()'. There might be something else going on, I didn't look deeply at it. Linus
Re: Linux 6.3-rc3
On Mon, Mar 20, 2023 at 11:26 AM Linus Torvalds wrote: > > Hmm. I do my arm64 allmodconfig builds with gcc, and I'm surprised > that gcc doesn't warn about this. Side note: I'm also wondering why that TEGRA_HOST1X config has that ARM dependency in depends on ARCH_TEGRA || (ARM && COMPILE_TEST) because it seems to build just fine at least on x86-64 if I change it to be just depends on ARCH_TEGRA || COMPILE_TEST ie there seems to be nothing ARM-specific in there. Limiting it to just the tegra platform by default makes 100% sense, but that "only do compile-testing on ARM" seems a bit bogus. That limit goes back to forever (commit 6f44c2b5280f: "gpu: host1x: Increase compile test coverage" back in Nov 2013), so maybe things didn't use to work as well back in the dark ages? None of this explains why gcc didn't catch it, but at least allowing the build on x86-64 would likely have made it easier for people to see clang catching this. Linus
Re: Linux 6.3-rc3
On Mon, Mar 20, 2023 at 11:05 AM Nathan Chancellor wrote: > > On the clang front, I am still seeing the following warning turned error > for arm64 allmodconfig at least: > > drivers/gpu/host1x/dev.c:520:6: error: variable 'syncpt_irq' is > uninitialized when used here [-Werror,-Wuninitialized] > if (syncpt_irq < 0) > ^~ Hmm. I do my arm64 allmodconfig builds with gcc, and I'm surprised that gcc doesn't warn about this. That syncpt_irq thing isn't written to anywhere, so that's pretty egregious. We use -Wno-maybe-uninitialized because gcc gets it so wrong, but that's different from the "-Wuninitialized" thing (without the "maybe"). I've seen gcc mess this up when there is one single assignment, because then the SSA format makes it *so* easy to just use that assignment out-of-order (or unconditionally), but this case looks unusually clear-cut. So the fact that gcc doesn't warn about it is outright odd. > If that does not come to you through other means before -rc4, could you > just apply it directly so that I can stop applying it to our CI? :) Bah. I took it now, there's no excuse for that thing. Do we have any gcc people around that could explain why gcc failed so miserably at this trivial case? Linus
Re: [Intel-gfx] [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()
On Wed, Mar 15, 2023 at 5:22 PM Steven Rostedt wrote: > > I hope that this gets in by -rc3, as I want to start basing my next branch > on that tag. My tree should have it now as commit c00133a9e87e ("drm/ttm: drop extra ttm_bo_put in ttm_bo_cleanup_refs"). Linus
Re: [PATCH 0/2] docs & checkpatch: allow Closes tags with links
On Thu, Mar 16, 2023 at 4:43 AM Matthieu Baerts wrote: > > @Linus: in short, we would like to continue using the "Closes:" tag (or > similar, see below) with a URL in commit messages. They are useful to > have public bug trackers doing automated actions like closing a specific > ticket. Any objection from your side? As long as it's a public link, I guess that just documents what the drm people have been doing. I'm not convinced "Closes" is actually any better than just "Link:", though. I would very much hope and expect that the actual closing of any bug report is actually done separately and verified, rather than some kind of automated "well, the commit says it closes it, so.." So honestly, I feel like "Link:" is just a better thing, and I worry that "Closes:" is then going to be used for random internal crap. We've very much seen people wanting to do that - having their own private bug trackers, and then using the commit message to refer to them, which I am *violently* against. If it's only useful to some closed community, it shouldn't be in the public commits. And while the current GPU people seem to use "Closes:" the right way (and maybe some other groups do too - but it does seem to be mostly a freedesktop thing), I really think it is amenable to mis-use in ways "Link:" is not. The point of "Link:" is explicitly two-fold: - it makes it quite obvious that you expect an actual valid web-link, not some internal garbage - random people always want random extensions, and "Link:" is _designed_ to counter-act that creeping "let's add a random new tag" disease. It's very explicitly "any relevant link". and I really question the value of adding new types of tags, particularly ones that seem almost designed to be mis-used. So I'm not violently against it, and 99% of the existing uses seem fine. But I do note that some of the early "Closes:" tags in the kernel were very much complete garbage, and exactly the kind of thing that I absolutely detest. What does Closes: 10437 mean? That's crazy talk. (And yes, in that case it was a kernel.bugzilla.org number, which is perfectly fine, but I'm using it as a very real example of how "Closes:" ends up being very naturally to mis-use). End result: I don't hate our current "Closes:" uses. But I'm very wary of it. I'm not at all convinced that it really adds a lot of value over "Link:", and I am, _very_ aware of how easily it can be then taken to be a "let's use our own bug tracker cookies here". So I will neither endorse nor condemn it, but if I see people using it wrong, I will absolutely put my foot down. Linus
Re: drm next for 6.3-rc1
On Fri, Feb 24, 2023 at 5:30 PM Dave Airlie wrote: > > Any issues with this? I get nervous around 48hrs :-) It was merged on Wednesday evening. See commit a5c95ca18a98. If you were waiting for a pr-tracker-bot reply, I think you need to put "{GIT PULL]" in the subject line for the automation to trigger on it. Linus
Re: [PATCH RFC v7 00/23] DEPT(Dependency Tracker)
[ Back from travel, so trying to make sense of this series.. ] On Sun, Jan 8, 2023 at 7:33 PM Byungchul Park wrote: > > I've been developing a tool for detecting deadlock possibilities by > tracking wait/event rather than lock(?) acquisition order to try to > cover all synchonization machanisms. It's done on v6.2-rc2. Ugh. I hate how this adds random patterns like if (timeout == MAX_SCHEDULE_TIMEOUT) sdt_might_sleep_strong(NULL); else sdt_might_sleep_strong_timeout(NULL); ... sdt_might_sleep_finish(); to various places, it seems so very odd and unmaintainable. I also recall this giving a fair amount of false positives, are they all fixed? Anyway, I'd really like the lockdep people to comment and be involved. We did have a fairly recent case of "lockdep doesn't track page lock dependencies because it fundamentally cannot" issue, so DEPT might fix those kinds of missing dependency analysis. See https://lore.kernel.org/lkml/60d41f05f139a...@google.com/ for some context to that one, but at teh same time I would *really* want the lockdep people more involved and acking this work. Maybe I missed the email where you reported on things DEPT has found (and on the lack of false positives)? Linus
Re: [git pull] drm for 6.2-rc1
On Wed, Dec 14, 2022 at 12:05 AM Christian König wrote: > > Anyway we need to re-apply b09d6acba1d9 which should be trivial. Note that my resolution did exactly that (*), it's just that when I double-checked against Dave's suggested merge that I noticed I'd done things differently than he did. (*) Well, when I say "did exactly that" I don't actually know some of the other fencing changes that happened, so there may be a reason why something further should still be done. So I can only point to my merge commit a594533df0f6 and ask people to verify. It does all at least work for me. Knock wood. Linus
Re: [git pull] drm for 6.2-rc1
On Mon, Dec 12, 2022 at 6:56 PM Dave Airlie wrote: > > There are a bunch of conflicts, one in amdgpu is a bit nasty, I've > cc'ed Christian/Alex to make sure they know to check whatever > resolution you find. The one I have is what we have in drm-tip tree. Hmm. My merge resolution is slightly different from yours. You seem to have basically dropped commit b09d6acba1d9 ("drm/amdgpu: handle gang submit before VMID"). Now, there are other fence changes in the drm tree that may mean that that commit *should* be dropped, so it's entirely possible that my resolution which kept that ordering change might be wrong and your resolution that just took the drm tip code is the right one. Christian? Alex? Can you please double-check what I just pushed out? Linus
Re: [PATCH RFC 16/19] mm/frame-vector: remove FOLL_FORCE usage
On Tue, Nov 22, 2022 at 4:25 AM Hans Verkuil wrote: > > I tracked the use of 'force' all the way back to the first git commit > (2.6.12-rc1) in the very old video-buf.c. So it is very, very old and the > reason is lost in the mists of time. Well, not entirely. For archaeology reasons, I went back to the old BK history, which exists as a git conversion in https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/ and there you can actually find it. Not with a lot of explanations, though - it's commit b7649ef789 ("[PATCH] videobuf update"): This updates the video-buf.c module (helper module for video buffer management). Some memory management fixes, also some adaptions to the final v4l2 api. but it went from err = get_user_pages(current,current->mm, -data, dma->nr_pages, -rw == READ, 0, /* don't force */ +data & PAGE_MASK, dma->nr_pages, +rw == READ, 1, /* force */ dma->pages, NULL); in that commit. So it goes back to October 2002. > Looking at this old LWN article https://lwn.net/Articles/28548/ suggests > that it might be related to calling get_user_pages for write buffers The timing roughly matches. > I assume that removing FOLL_FORCE from 'FOLL_FORCE|FOLL_WRITE' will still > allow drivers to read from the buffer? The issue with some of the driver hacks has been that - they only want to read, and the buffer may be read-only - they then used FOLL_WRITE despite that, because they want to break COW (due to the issue that David is now fixing with his series) - but that means that the VM layer says "nope, you can't write to this read-only user mapping" - ... and then they use FOLL_FORCE to say "yes, I can". iOW, the FOLL_FORCE may be entirely due to an (incorrect, but historically needed) FOLL_WRITE. Linus
Re: [PATCH mm-unstable v1 20/20] mm: rename FOLL_FORCE to FOLL_PTRACE
On Thu, Nov 17, 2022 at 2:58 PM Kees Cook wrote: > > Oh, er, why does get_arg_page() even need FOLL_FORCE? This is writing the > new stack contents to the nascent brpm->vma, which was newly allocated > with VM_STACK_FLAGS, which an arch can override, but they all appear to > include > VM_WRITE | VM_MAYWRITE. Yeah, it does seem entirely superfluous. It's been there since the very beginning (although in that original commit b6a2fea39318 it was there as a '1' to the 'force' argument to get_user_pages()). I *think* it can be just removed. But as long as it exists, it should most definitely not be renamed to FOLL_PTRACE. There's a slight worry that it currently hides some other setup issue that makes it matter, since it's been that way so long, but I can't see what it is. Linus
Re: [PATCH mm-unstable v1 20/20] mm: rename FOLL_FORCE to FOLL_PTRACE
On Wed, Nov 16, 2022 at 2:30 AM David Hildenbrand wrote: > > Let's make it clearer that functionality provided by FOLL_FORCE is > really only for ptrace access. I'm not super-happy about this one. I do understand the "let's rename the bit so that no new user shows up". And it's true that the main traditional use is ptrace. But from the patch itself it becomes obvious that no, it's not *just* ptrace. At least not yet. It's used for get_arg_page(), which uses it to basically look up (and install) pages in the newly created VM. Now, I'm not entirely sure why it even uses FOLL_FORCE, - I think it might be historical, because the target should always be the new stack vma. Following the history of it is a big of a mess, because there's a number of renamings and re-organizations, but it seems to go back to 2007 and commit b6a2fea39318 ("mm: variable length argument support"). Before that commit, we kept our own array of "this is the set of pages that I will install in the new VM". That commit basically just inserts the pages directly into the VM instead, getting rid of the array size limitation. So at a minimum, I think that FOLL_FORCE would need to be removed before any renaming to FOLL_PTRACE, because that's not some kind of small random case. It *might* be as simple as just removing it, but maybe there's some reason for having it that I don't immediately see. There _are_ also small random cases too, like get_cmdline(). Maybe that counts as ptrace, but the execve() case most definitely does not. Linus
Re: [PATCH RFC 00/19] mm/gup: remove FOLL_FORCE usage from drivers (reliable R/O long-term pinning)
On Mon, Nov 7, 2022 at 8:18 AM David Hildenbrand wrote: > > So instead, make R/O long-term pinning work as expected, by breaking COW > in a COW mapping early, such that we can remove any FOLL_FORCE usage from > drivers. Nothing makes me unhappy from a quick scan through these patches. And I'd really love to just have this long saga ended, and FOLL_FORCE finally relegated to purely ptrace accesses. So an enthusiastic Ack from me. Linus
Re: [PATCH v4a 00/38] timers: Use timer_shutdown*() before freeing timers
On Sat, Nov 5, 2022 at 2:03 PM Jason A. Donenfeld wrote: > > Something that might help here is changing the `...` into > `... when exists` or into `... when != ptr` or similar. I actually tried that. You don't want "when exists", you'd want "when forall", but that seems to be the default. And trying "when != ptr->timer" actually does the right thing in that it gets rid of the case where the timer is modified outside of the del_timer() case, *but* it also causes odd other changes to the output. Look at what it generates for that drivers/media/usb/pvrusb2/pvrusb2-hdw.c file, which finds a lot of triggers with the "when != ptr->timer", but only does one without it. So I gave up, just because I clearly don't understand the rules. (Comparing output is also fun because the ordering of the patches is random, so consecutive runs with the same rule will give different patches. I assume that it's just because it's done in parallel, but it doesn't help the "try to see what changes when you change the script" ;) Linus
Re: [PATCH v4a 00/38] timers: Use timer_shutdown*() before freeing timers
On Sat, Nov 5, 2022 at 11:04 AM Steven Rostedt wrote: > > Here's the changes I made after running the script Please. No. What part of "I don't want extra crud" was I unclear on? I'm not interested in converting everything. That's clearly a 6.,2 issue, possibly even longer considering how complicated the networking side has been. I'm not AT ALL interested in "oh, I then added my own small cleanups on top to random files because I happened to notice them". Repeat after me: "If the script didn't catch them, they weren't trivially obvious". And it does seem that right now the script itself is a bit too generous, which is why it didn't notice that sometimes there wasn't a kfree after all because of a goto around it. So clearly that "..." doesn't really work, I think it accepts "_any_ path leads to the second situation" rather than "_all_ paths lead to the second situation". But yeah, my coccinelle-foo is very weak too, and maybe there's no pattern for "no flow control". I would also like the coccinelle script to notice the "timer is used afterwards", so that it does *not* modify that case that does del_timer(>timer); dch->timer.function = NULL; since now the timer is modified in between the del_timer() and the kfree. Again, that timer modification is then made pointless by changing the del_timer() to a "timer_shutdown()", but at that point it is no longer a "so obvious non-semantic change that it should be scripted". At that point it's a manual thing. So I think the "..." in your script should be "no flow control, and no access to the timer", but do not know how to do that in coccinelle. Julia? And this thread has way too many participants, I suspect some email systems will just mark it as spam as a result. Which is partly *why* I would like to get rid of noisy changes that really don't matter - but I would like it to be truly mindlessly obvious that there are *zero* questions about it, and absolutely no manual intervention because the patch is so strict that it's just unquestionably correct. Linus
Re: [PATCH v4a 00/38] timers: Use timer_shutdown*() before freeing timers
On Fri, Nov 4, 2022 at 11:01 PM Steven Rostedt wrote: > > Patch 1 fixes an issue with sunrpc/xprt where it incorrectly uses > del_singleshot_timer_sync() for something that is not a oneshot timer. As this > will be converted to shutdown, this needs to be fixed first. So this is the kind of thing that I would *not* want to get eartly. I really would want to get just the infrastructure in to let people start doing conversions. And then the "mindlessly obvious patches that are done by scripting and can not possibly matter". The kinds that do not *need* review, because they are mechanical, and that just cause pointless noise for the rest of the patches that *do* want review. Not this kind of thing that is so subtle that you have to explain it. That's not a "scripted patch for no semantic change". So leave the del_singleshot_timer_sync() cases alone, they are irrelevant for the new infrastructure and for the "mindless scripted conversion" patches. > Patches 2-4 changes existing timer_shutdown() functions used locally in ARM > and > some drivers to better namespace names. Ok, these are relevant. > Patch 5 implements the new timer_shutdown() and timer_shutdown_sync() > functions > that disable re-arming the timer after they are called. This is obviously what I'd want early so that people can start doign this in their trees. > Patches 6-28 change all the locations where there's a kfree(), kfree_rcu(), > kmem_cache_free() and one call_rcu() call where the RCU function frees the > timer (the workqueue patch) in the same function as the del_timer{,_sync}() is > called on that timer, and there's no extra exit path between the del_timer and > freeing of the timer. So honestly, I was literally hoping for a "this is the coccinelle script" kind of patch. Now there seems to be a number of patches here that are actualyl really hard to see that they are "obviously correct" and I can't tell if they are actually scripted or not. They don't *look* scripted, but I can't really tell. I looked at the patches with ten lines of context, and I didn't see the immediately following kfree() even in that expanded patch context, so it's fairly far away. Others in the series were *definitely* not scripted, doing clearly manual cleanups: -if (dch->timer.function) { -del_timer(>timer); -dch->timer.function = NULL; -} +timer_shutdown(>timer); so no, this does *not* make me feel "ok, this is all trivial". IOW, I'd really want *just* the infrastructure and *just* the provably trivial stuff. If it wasn't some scripted really obvious thing that cannot possibly change anything and that wasn't then edited manually for some reason, I really don't want it early. IOW, any early conversions I'd take are literally about removing pure mindless noise. Not about doing conversions. And I wouldn't mind it as a single conversion patch that has the coccinelle script as the explanation. Really just THAT kind of "100% mindless conversion". Linus
Re: [RFC][PATCH v3 00/33] timers: Use timer_shutdown*() before freeing timers
On Fri, Nov 4, 2022 at 12:42 PM Steven Rostedt wrote: > > Linus, should I also add any patches that has already been acked by the > respective maintainer? No, I'd prefer to keep only the ones that are 100% unambiguously not changing any semantics. Linus
Re: [RFC][PATCH v3 00/33] timers: Use timer_shutdown*() before freeing timers
On Thu, Nov 3, 2022 at 10:48 PM Steven Rostedt wrote: > > Ideally, I would have the first patch go into this rc cycle, which is mostly > non functional as it will allow the other patches to come in via the > respective > subsystems in the next merge window. Ack. I also wonder if we could do the completely trivially correct conversions immediately. I'm talking about the scripted ones where it's currently a "del_timer_sync()", and the very next action is freeing whatever data structure the timer is in (possibly with something like free_irq() in between - my point is that there's an unconditional free that is very clear and unambiguous), so that there is absolutely no question about whether they should use "timer_shutdown_sync()" or not. IOW, things like patches 03, 17 and 31, and at least parts others in this series. This series clearly has several much more complex cases that need actual real code review, and I think it would help to have the completely unambiguous cases out of the way, just to get rid of noise. So I'd take that first patch, and a scripted set of "this cannot change any semantics" patches early. Linus
Re: [PATCH] drm/amd/display: Fix build breakage with CONFIG_DEBUG_FS=n
On Fri, Oct 14, 2022 at 8:22 AM Nathan Chancellor wrote: > > After commit 8799c0be89eb ("drm/amd/display: Fix vblank refcount in vrr > transition"), a build with CONFIG_DEBUG_FS=n is broken due to a > misplaced brace, along the lines of: Thanks, applied. Linus
Re: [git pull] drm fixes for 6.1-rc1
On Thu, Oct 13, 2022 at 5:29 PM Dave Airlie wrote: > > Round of fixes for the merge window stuff, bunch of amdgpu and i915 > changes, this should have the gcc11 warning fix, amongst other > changes. Some of those amd changes aren't "fixes". They are some major code changes. We're still in the merge window, so I'm letting it slide, but calling then "fixes" really stretches things. They are fixes exactly the same way completely new development can "fix" things. Linus
Re: mainline build failure due to 5d8c3e836fc2 ("drm/amd/display: fix array-bounds error in dc_stream_remove_writeback()")
On Thu, Oct 6, 2022 at 1:50 PM Sudip Mukherjee wrote: > > > And it looks like Sudip's proposed fix for this particular code is > > additionally fixing unsigned vs signed as well. I think -Warray-bounds > > did its job (though, with quite a confusing index range in the report). > > Not my. Linus's. I just tested. :) I suspect Kees meant Stephen's other patch that Hamza pointed at, and that is perhaps the cleaner version. That said, I hate how this forces us to write random code changes just to make a compiler just randomly _happen_ to not complain about it. Linus
Re: [git pull] drm for 6.1-rc1
On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie wrote: > > > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0088 > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched] As far as I can tell, that's the line struct drm_gpu_scheduler *sched = s_fence->sched; where 's_fence' is NULL. The code is 0: 0f 1f 44 00 00nopl 0x0(%rax,%rax,1) 5: 41 54push %r12 7: 55push %rbp 8: 53push %rbx 9: 48 89 fb mov%rdi,%rbx c:* 48 8b af 88 00 00 00 mov0x88(%rdi),%rbp <-- trapping instruction 13: f0 ff 8d f0 00 00 00 lock decl 0xf0(%rbp) 1a: 48 8b 85 80 01 00 00 mov0x180(%rbp),%rax and that next 'lock decl' instruction would have been the atomic_dec(>hw_rq_count); at the top of drm_sched_job_done(). Now, as to *why* you'd have a NULL s_fence, it would seem that drm_sched_job_cleanup() was called with an active job. Looking at that code, it does if (kref_read(>s_fence->finished.refcount)) { /* drm_sched_job_arm() has been called */ dma_fence_put(>s_fence->finished); ... but then it does job->s_fence = NULL; anyway, despite the job still being active. The logic of that kind of "fake refcount" escapes me. The above looks fundamentally racy, not to say pointless and wrong (a refcount is a _count_, not a flag, so there could be multiple references to it, what says that you can just decrement one of them and say "I'm done"). Now, _why_ any of that happens, I have no idea. I'm just looking at the immediate "that pointer is NULL" thing, and reacting to what looks like a completely bogus refcount pattern. But that odd refcount pattern isn't new, so it's presumably some user on the amd gpu side that changed. The problem hasn't happened again for me, but that's not saying a lot, since it was very random to begin with. Linus
Re: [git pull] drm for 6.1-rc1
On Thu, Oct 6, 2022 at 12:28 PM Alex Deucher wrote: > > Maybe you are seeing this which is an issue with GPU TLB flushes which > is kind of sporadic: > https://gitlab.freedesktop.org/drm/amd/-/issues/2113 Well, that seems to be 5.19, and while timing changes (or whatever other software updates) could have made it start trigger, this machine has been pretty solid otgerwise. > Are you seeing any GPU page faults in your kernel log? Nothing even remotely like that "no-retry page fault" in that issue report. Of course, if it happens just before the whole thing locks up... Linus
Re: [git pull] drm for 6.1-rc1
On Thu, Oct 6, 2022 at 12:30 PM Dave Airlie wrote: > > netconsole? I've never been really successful with that in the past, and haven't used it for decades. I guess I could try if nothing else works. Linus
Re: mainline build failure due to 5d8c3e836fc2 ("drm/amd/display: fix array-bounds error in dc_stream_remove_writeback()")
On Thu, Oct 6, 2022 at 1:51 AM Sudip Mukherjee (Codethink) wrote: > > This is only seen with gcc-11, gcc-12 builds are ok. Hmm. This seems to be some odd gcc issue. I *think* that what is going on is that the test j = 0 ; j < MAX_DWB_PIPES makes gcc decide that "hey, j is in the range [0,MAX_DWB_PIPES[", and then since MAX_DWB_PIPES is 1, that simplifies to "j must be zero". Good range analysis so far. But for 'i', we start off with that lower bound of 0, but the upper bound is not fixed (the test for "i" is: "i < stream->num_wb_info"). And then that "if (i != j)", so now gcc decides that it can simplify that to "if (i != 0)", and then simplifies *that* to "oh, the lower bound of 'i' in that code is actually 1. So then it decides that "stream->writeback_info[i]" must be out of bounds. Of course, the *reality* is that stream->num_wb_info should be <= MAX_DWB_PIPES, and as such (with the current MAX_DWB_PIPES value of 1) it's not that 'i' can be 1, it's that the code in question cannot be reached at all. What confuses me is that error message ("array subscript [0, 0] is outside array bounds of 'struct dc_writeback_info[1]') which seems to be aware that the value is actually 0. So this seems to be some gcc-11 range analysis bug, but I don't know what the fix is. I suspect some random code change would magically just make gcc realize it's ok after all, but since it all depends on random gcc confusion, I don't know what the random code change would be. The fix *MAY* be to just add a '&& i < MAX_DWB_PIPES' to that loop too, and then gcc will see that both i and j are always 0, and that the code is unreachable and not warn about it. Hmm? Can you test that? And the reason gcc-12 builds are ok probably isn't that gcc-12 gets this right, it's simply that gcc-12 gets so many *opther* things wrong that we already disabled -Warray-bounds with gcc-12 entirely. If somebody cannot come up with a fix, I suspect the solution is "gcc array bounds analysis is terminally buggy" and we just need to disable it for gcc-11 too. Kees, any idea? Who else might be interested in fixing a -Warray-bounds issue? Linus
Re: [git pull] drm for 6.1-rc1
On Tue, Oct 4, 2022 at 8:42 PM Dave Airlie wrote: > > Lots of stuff all over, some new AMD IP support and gang > submit support [..] Hmm. I have now had my main desktop lock up twice after pulling this. Nothing in the dmesg after a reboot, and nothing in particular that seems to trigger it, so I have a hard time even guessing what's up, but the drm changes are the primary suspect. I will try to see if I can get any information out of the machine, but with the symptom being just a dead machine ... This is the same (old) Radeon device: 49:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7) with dual 4k monitors, running on my good old Threadripper setup. Again, there's no explicit reason to blame the drm pull, except that it started after that merge (that machine ran the kernel with the networking pull for a day with no problems, and while there were other pull requests in between them, they seem to be fairly unrelated to the hardware I have). But the lockup is so sporadic (twice in the last day) that I really can't bisect it, so I'm afraid I have very very little info. Any suggestions? Linus
Re: [git pull] drm for 6.1-rc1
On Tue, Oct 4, 2022 at 8:42 PM Dave Airlie wrote: > > This is very conflict heavy, mostly the correct answer is picking > the version from drm-next. Ugh, yes, that was a bit annoying. I get the same end result as you did, but I do wonder if the drm people should try to keep some kind of separate "fixes" branches for things that go both into the development tree and then get sent to me for fixes pulls? Hopefully this "lots of pointless noise" was a one-off, but it might be due to how you guys work.. Linus
Re: [PATCH v13 5/9] drm/i915: Check for integer truncation on scatterlist creation
On Wed, Sep 28, 2022 at 1:15 AM Gwan-gyeong Mun wrote: > > + if (check_assign(obj->base.size >> PAGE_SHIFT, )) > + return -E2BIG; I have to say, I find that new "check_assign()" macro use to be disgusting. It's one thing to check for overflows. It's another thing entirely to just assign something to a local variable. This disgusting "let's check and assign" needs to die. It makes the code a completely unreadable mess. The "user" wersion is even worse. If you worry about overflow, then use a mix of (a) use a sufficiently large type to begin with (b) check for value range separately and in this particular case, I also suspect that the whole range check should have been somewhere else entirely - at the original creation of that "obj" structure, not at one random end-point where it is used. In other words, THIS WHOLE PATCH is just end-points checking the size requirements of that "base.size" thing much too late, when it should have been checked originally for some "maximum acceptable base size" instead. And that "maximum acceptable base size" should *not* be about "this is the size of the variables we use". It should be a sanity check of "this value is sane and fits in sane use cases". Because "let's plug security checks" is most definitely not about picking random assignments and saying "let's check this one". It's about trying to catch things earlier than that. Kees, you need to reign in the craziness in overflow.h. Linus
Re: mainline build failure for x86_64 allmodconfig with clang
On Sun, Aug 7, 2022 at 10:36 AM David Laight wrote: > > Or just shoot the software engineer who thinks 100 arguments > is sane. :-) I suspect the issue is that it's not primarily a software engineer who wrote that code. Hardware people writing code are about as scary as software engineers with a soldering iron. Linus
Re: mainline build failure for x86_64 allmodconfig with clang
On Thu, Aug 4, 2022 at 1:43 PM Nathan Chancellor wrote: > > I do note that commit 1b54a0121dba ("drm/amd/display: Reduce stack size > in the mode support function") did have a workaround for GCC. It appears > clang will still inline mode_support_configuration(). If I mark it as > 'noinline', the warning disappears in that file. That sounds like probably the best option for now. Gcc does not inline that function (at least for allmodconfig builds in my testing), so if that makes clang match what gcc does, it seems a reasonable thing to do. Linus
Re: mainline build failure for x86_64 allmodconfig with clang
On Thu, Aug 4, 2022 at 11:37 AM Sudip Mukherjee (Codethink) wrote: > > git bisect points to 3876a8b5e241 ("drm/amd/display: Enable building new > display engine with KCOV enabled"). Ahh. So that was presumably why it was disabled before - because it presumably does disgusting things that make KCOV generate even bigger stack frames than it already has. Those functions do seem to have fairly big stack footprints already (I didn't try to look into why, I assume it's partly due to aggressive inlining, and probably some automatic structures on stack). But gcc doesn't seem to make it all that much worse with KCOV (and my clang build doesn't enable KCOV). So it's presumably some KCOV-vs-clang thing. Nathan? Linus
Re: [PATCH] drm/amd/display: restore plane with no modifiers code.
On Wed, Aug 3, 2022 at 10:50 PM Dave Airlie wrote: > > With this applied, I get gdm back. I can confirm that it makes thing work again for me too. Applied. Linus
Re: mainline build failure due to 6fdd2077ec03 ("drm/amd/amdgpu: add memory training support for PSP_V13")
On Thu, Aug 4, 2022 at 12:35 AM Sudip Mukherjee (Codethink) wrote: > > I will be happy to test any patch or provide any extra log if needed. It sounds like that file just needs to get a #include there, and for some reason architectures other than alpha and mips end up getting it accidentally through other headers. Mind testing just adding that header file, and perhaps even sending a patch if (when) that works for you? Linus
Re: [git pull] drm for 5.20/6.0
On Wed, Aug 3, 2022 at 9:27 PM Linus Torvalds wrote: > > I'll do a few more. It's close enough already that it should be just > four more reboots to pinpoint exactly which commit breaks. commit 5d945cbcd4b16a29d6470a80dfb19738f9a4319f is the first bad commit. I think it's supposed to make no semantic changes, but it clearly does. What a pain to figure out what's wrong in there, and I assume it doesn't revert cleanly either. Bringing in the guilty parties. See https://lore.kernel.org/all/CAHk-=wj+yzaunxiewhfcrkbdlsqkizdr1q3yjlaqpo6avq2...@mail.gmail.com/ for the beginning of this thread. Linus
Re: [git pull] drm for 5.20/6.0
On Wed, Aug 3, 2022 at 9:24 PM Dave Airlie wrote: > > I've reproduced it, I'll send you a revert pile when I confirm it is > the buddy allocator. I've bisected it to 86bd6706c404..074293dd9f61 and don't see "buddy" in any of those commits. I'll do a few more. It's close enough already that it should be just four more reboots to pinpoint exactly which commit breaks. Linus
Re: [git pull] drm for 5.20/6.0
On Wed, Aug 3, 2022 at 8:53 PM Dave Airlie wrote: > > > It works on my intel laptop, so it's amdgpu somewhere. > > I'll spin my ryzen up to see if I can reproduce, and test against the > drm-next pre-merge tree as well. So it's not my merge - I've had a bad result in the middle of the DRM history too. On a positive note, my arm64 machine works fine, but that's just using fbdev so ... But another datapoint to say that it's amdgpu-specific. Not that that was really in doubt. Linus
Re: [git pull] drm for 5.20/6.0
On Wed, Aug 3, 2022 at 8:37 PM Dave Airlie wrote: > > Actually I did miss that so that looks good. .. I wish it did, but I just actually test-booted my desktop with the result, and it crashes the X server. This seems to be the splat in Xorg.0.log: (II) Initializing extension DRI2 (II) AMDGPU(0): Setting screen physical size to 2032 x 571 (EE) (EE) Backtrace: (EE) 0: /usr/libexec/Xorg (OsLookupColor+0x13d) [0x55b1dc61258d] (EE) 1: /lib64/libc.so.6 (__sigaction+0x50) [0x7f7972a3ea70] (EE) 2: /usr/lib64/xorg/modules/drivers/amdgpu_drv.so (AMDGPUCreateWindow_oneshot+0x101) [0x7f797207ddd1] (EE) 3: /usr/libexec/Xorg (compIsAlternateVisual+0xdc4) [0x55b1dc545fa4] (EE) 4: /usr/libexec/Xorg (InitRootWindow+0x17) [0x55b1dc4e0047] (EE) 5: /usr/libexec/Xorg (miPutImage+0xd4c) [0x55b1dc49e60b] (EE) 6: /lib64/libc.so.6 (__libc_start_call_main+0x80) [0x7f7972a29550] (EE) 7: /lib64/libc.so.6 (__libc_start_main+0x89) [0x7f7972a29609] (EE) 8: /usr/libexec/Xorg (_start+0x25) [0x55b1dc49f2c5] (EE) (EE) Segmentation fault at address 0x4 (EE) Fatal server error: (EE) Caught signal 11 (Segmentation fault). Server aborting so something is going horribly wrong. No kernel oops, though. It works on my intel laptop, so it's amdgpu somewhere. I guess I will start bisecting. Oy vey. Linus
Re: [git pull] drm for 5.20/6.0
On Wed, Aug 3, 2022 at 7:46 PM Linus Torvalds wrote: > > I think I have it resolved, am still doing a full build test, and will > then compare against what your suggested merge is. Hmm. I end up with *almost* the same thing. Except I ended up with a select DRM_BUDDY for the DRM_AMDGPU config entry, and you don't have that. I *think* my version is correct, in that clearly the amdgpu driver now uses that buddy logic (just doing a random "grep drm_buddy_block" to see). But this was messy enough to resolve that I think people should double-check my end, and maybe I just got confused at some point in the process. And while I seem to have gotten the same result as you did on the i915 firmware side too, again, I'd like people to re-verify. Linus
Re: [git pull] drm for 5.20/6.0
On Tue, Aug 2, 2022 at 10:38 PM Dave Airlie wrote: > > This is a conflicty one. The late revert in 5.19 of the amdgpu buddy > allocator causes major conflict fallout. The buddy allocator code in > this one works, so the resolutions are usually just to take stuff from > this. It might actually be cleaner if you revert > 925b6e59138cefa47275c67891c65d48d3266d57 (Revert "drm/amdgpu: add drm > buddy support to amdgpu") first in your tree then merge this. Ugh, what a pain. The other conflicts are also due to just randomly duplicated commits, with *usually* your drm tree having the superset (so "just take yours" is the easy resolution), but not always (ie the Intel firmware-69 mess was apparently not dealt with in the development tree). Nasty. I think I have it resolved, am still doing a full build test, and will then compare against what your suggested merge is. Linus
Re: [git pull] drm fixes for 5.19-rc7
On Sat, Jul 16, 2022 at 2:35 PM Linus Torvalds wrote: > > That said, even those type simplifications do not fix the fundamental > issue. That "DIV_ROUND_UP()" still ends up being a 64-bit divide, > although now it's at least a "64-by-32" bit divide. Hmm. The "DIV_ROUND_UP()" issue could be solved by just making the rule be that the max_segment size is always a power of two. Then you don't need the (expensive!) DIV_ROUND_UP(), and can just use the regular "round_up()" that works on powers-of-two. And the simplest way to do that is to just make "max_segments" be 2GB. The whole "round_down(UINT_MAX, page_alignment)" seems entirely pointless. Do you really want segments that are some odd number just under the 4GB mark, and force expensive divides? For consistency, I used the same value in i915_rsgt_from_buddy_resource(). I have no idea if that makes sense. Anyway, the attached patch is COMPLETELY UNTESTED. But it at least seems to compile. Maybe. Linus drivers/gpu/drm/i915/i915_scatterlist.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_scatterlist.c b/drivers/gpu/drm/i915/i915_scatterlist.c index f63b50b71e10..830dcd833d4e 100644 --- a/drivers/gpu/drm/i915/i915_scatterlist.c +++ b/drivers/gpu/drm/i915/i915_scatterlist.c @@ -81,7 +81,9 @@ struct i915_refct_sgt *i915_rsgt_from_mm_node(const struct drm_mm_node *node, u64 region_start, u64 page_alignment) { - const u64 max_segment = round_down(UINT_MAX, page_alignment); + // Make sure max_segment (and thus segment_pages) is + // a power of two that fits in 32 bits. + const u64 max_segment = 1ul << 31; u64 segment_pages = max_segment >> PAGE_SHIFT; u64 block_size, offset, prev_end; struct i915_refct_sgt *rsgt; @@ -96,7 +98,7 @@ struct i915_refct_sgt *i915_rsgt_from_mm_node(const struct drm_mm_node *node, i915_refct_sgt_init(rsgt, node->size << PAGE_SHIFT); st = >table; - if (sg_alloc_table(st, DIV_ROUND_UP(node->size, segment_pages), + if (sg_alloc_table(st, round_up(node->size, segment_pages), GFP_KERNEL)) { i915_refct_sgt_put(rsgt); return ERR_PTR(-ENOMEM); @@ -159,7 +161,7 @@ struct i915_refct_sgt *i915_rsgt_from_buddy_resource(struct ttm_resource *res, { struct i915_ttm_buddy_resource *bman_res = to_ttm_buddy_resource(res); const u64 size = res->num_pages << PAGE_SHIFT; - const u64 max_segment = round_down(UINT_MAX, page_alignment); + const u64 max_segment = 1u << 31; struct drm_buddy *mm = bman_res->mm; struct list_head *blocks = _res->blocks; struct drm_buddy_block *block;
Re: [git pull] drm fixes for 5.19-rc7
On Fri, Jul 15, 2022 at 2:09 PM Nathan Chancellor wrote: > > On Fri, Jul 15, 2022 at 01:36:17PM +1000, Dave Airlie wrote: > > Matthew Auld (1): > > drm/i915/ttm: fix sg_table construction > > This patch breaks i386_defconfig with both GCC and clang: > > ld: drivers/gpu/drm/i915/i915_scatterlist.o: in function > `i915_rsgt_from_mm_node': > i915_scatterlist.c:(.text+0x1a7): undefined reference to `__udivdi3' Yeah, we definitely don't want arbitrary 64x64 divides in the kernel, and the fact that we don't include libgcc.a once again caught this horrid issue. The offending code is if (sg_alloc_table(st, DIV_ROUND_UP(node->size, segment_pages), GFP_KERNEL)) { and I have to say that all of those games with "u64 page_alignment" that commit aff1e0b09b54 ("drm/i915/ttm: fix sg_table construction") introduces are absolutely disgusting. And they are just *pointlessly* disgusting. Why is that "page_alignment" a "u64"? And why is it a "size", instead of being a "number of bits"? The code literally does things like const u64 max_segment = round_down(UINT_MAX, page_alignment); which means that (a) page_alignment must be a power-of-two for this to work (round_down() only works in powers of two) (b) the result obviously must fit in an "unsigned int", since it's rounding down a UINT_MAX! So (a) makes it doubtful that "page_alignment" should have been a value (as opposed to mask), and (b) then questions why was that made an "u64" value when it cannot have a u64 range? And if max_segments cannot have a 64-bit range, then segment_pages here: u64 segment_pages = max_segment >> PAGE_SHIFT; sure cannot. Fixing those then uncovers other things: len = min(block_size, max_segment - sg->length); now complains about mixing types ("max_segment - sg->length" being u32), because 'block_size' is 64, bit, and that does seem to make some amount of sense: block_size = node->size << PAGE_SHIFT; with the 'node->size' being from drm_mm_node, and that size is a 'u64'. That I *could* see being more than 32 bits on a 64-bit architecture. Ok. But then that means that 'len' cannot be a 64-bit value either, and it should probably have been u32 len; .. len = min_t(u64, block_size, max_segment - sg->length); and that would just have been a lot nicer on 32-bit x86, avoiding a lot of pointlessly 64-bit things. That said, even those type simplifications do not fix the fundamental issue. That "DIV_ROUND_UP()" still ends up being a 64-bit divide, although now it's at least a "64-by-32" bit divide. Which needs to be handled by "do_div()" rather than the generic DIV_ROUND_UP() helper, because sadly, at least gcc still generates a full __udivdi3() even for the 64-by-32 divides. Can Intel GPU people please take a look? Linus
Re: Annoying AMDGPU boot-time warning due to simplefb / amdgpu resource clash
On Mon, Jun 27, 2022 at 1:02 AM Javier Martinez Canillas wrote: > > The flag was dropped because it was causing drivers that requested their > memory resource with pci_request_region() to fail with -EBUSY (e.g: the > vmwgfx driver): > > https://www.spinics.net/lists/dri-devel/msg329672.html See, *that* link would have been useful in the commit. Rather than the useless link it has. Anyway, removing the busy bit just made things worse. > > If simplefb is actually still using that frame buffer, it's a problem. > > If it isn't, then maybe that resource should have been released? > > It's supposed to be released once amdgpu asks for conflicting framebuffers > to be removed calling drm_aperture_remove_conflicting_pci_framebuffers(). That most definitely doesn't happen. This is on a running system: [torvalds@ryzen linux]$ cat /proc/iomem | grep BOOTFB - : BOOTFB so I suspect that the BUSY bit was never the problem - even for vmwgfx). The problem was that simplefb doesn't remove its resource. Guys, the *reason* for resource management is to catch people that trample over each other's resources. You literally basically disabled the code that checked for it by removing the BUSY flag, and just continued to have conflicting resources. That isn't a "fix", that is literally "we are ignoring and breaking the whole reason that the resource tree exists, but we'll still use it for no good reason". Yeah, yeah, most modern drivers ignore the IO resource tree, because they end up working on another resource level entirely: they work on not the IO resources, but on the "driver level" instead, and just attach to PCI devices. So these days, few enough drivers even care about the IO resource tree, and it's mostly used for (a) legacy devices (think ISA) and (b) the actual bus resource handling (so the PCI code itself uses it to sort out resource use and avoid conflicts, but PCI drivers themselves generally then don't care, because the bus has "taken care of it". So that's why the amdgpu driver itself doesn't care about resource allocations, and we only get a warning for that memory type case, not for any deeper resource case. And apparently the vmwgfx driver still uses that legacy "let's claim all PCI resources in the resource tree" instead of just claiming the device itself. Which is why it hit this whole BOOTFB resource thing even harder. But the real bug is that BOOTFB seems to claim this resource even after it is done with it and other drivers want to take over. Not the BUSY bit. Linus
Annoying AMDGPU boot-time warning due to simplefb / amdgpu resource clash
So this has been going on for a while, and it's quite annoying. At bootup, my main desktop (Threadripper 3970X with radeon graphics) now complains about resource sanity check: requesting [mem 0xd000-0xdfff], which spans more than BOOTFB [mem 0xd000-0xd02f] and then gives me a nasty callchain that is basically the amdgpu probe sequence ending in amdgpu_bo_init() doing the arch_io_reserve_memtype_wc() which is then what complains. That "BOOTFB" resource is from sysfb_simplefb.c, and I think what started triggering this is commit c96898342c38 ("drivers/firmware: Don't mark as busy the simple-framebuffer IO resource"). Because it turns out that that removed the IORESOURCE_BUSY, which in turn is what makes the resource conflict code complain about it now, because /* * if a resource is "BUSY", it's not a hardware resource * but a driver mapping of such a resource; we don't want * to warn for those; some drivers legitimately map only * partial hardware resources. (example: vesafb) */ so the issue is that now the resource code - correctly - says "hey, you have *two* conflicting driver mappings". And that commit claims it did it because "which can lead to drivers requesting the same memory resource to fail", but - once again - the link in the commit that might actually tell more is just one of those useless patch submission links again. So who knows why that commit was actually done, but it's causing annoyance. If simplefb is actually still using that frame buffer, it's a problem. If it isn't, then maybe that resource should have been released? I really think that commit c96898342c38 is buggy. It talks about "let drivers to request it as busy instead", but then it registers a resource that isn't actually a proper real resource. It's just a random incomplete chunk of the actual real thing, so it will still interfere with resource allocation, and in fact now interferes even with that "set memtype" because of this valid warning. Linus
Re: [git pull] drm for 5.19-rc1
On Tue, Jun 7, 2022 at 3:23 AM Geert Uytterhoeven wrote: > > These header files are heavy users of large constants lacking the "U" > suffix e.g.: > > #define NB_ADAPTER_ID__SUBSYSTEM_ID_MASK 0xL As Andreas says, this is not undefined behavior. A hexadecimal integer constant will always get a type that fits the actual value. So on a 32-bit architecture, because 0x doesn't fit in 'long', it will automatically become 'unsigned long'. Now, a C compiler might still warn about such implicit type conversions, but I'd be a bit surprised if any version of gcc actually would do that, because this behavior for hex constants is *very* traditional, and very common. It's also true that the type of the constant - but not the value - will be different on 32-bit and 64-bit architectures (ie on 64-bit, it will be plain "long" and never extended to "unsigned long", because the hex value obviously fits just fine). I don't see any normal situation where that really matters, since any normal use will have the same result. The case you point to at https://lore.kernel.org/r/cak8p3a0qrihbr_2fq7uz5w2jmljv7czfrrarcmmjohvndj3...@mail.gmail.com is very different, because the constant "1" is always just a plain signed "int". So when you do "(1 << 31)", that is now a signed integer with the top bit set, and so it will have an actual negative value, and that can cause various problems (when right-shifted, or when compared to other values). But hexadecimal constants can be signed types, but they never have negative values. Linus
Re: mainline build failure due to f1e4c916f97f ("drm/edid: add EDID block count and size helpers")
On Wed, Jun 1, 2022 at 3:28 PM Keisuke Nishimura wrote: > > > I found 13 definitions of packed structure that contains: > > - spinlock_t > > - atomic_t > > - dma_addr_t > > - phys_addr_t > > - size_t > > - struct mutex > > - struct device > > - raw_spinlock_t Ok, so I don't think dma_addr_t/phys_addr_t/size_t are problematic, they are just regular integers. And 'struct device' is problematic only as it then contains any of the atomic types (which it does) > security/tomoyo/common.h: atomic_t in tomoyo_shared_acl_head > drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h: spinlock_t in > key_map > arch/s390/include/asm/kvm_host.h: atomic_t in kvm_s390_sie_block So these do look problematic. I'm not actually clear on why tomoyo_shared_acl_head would be packed. That makes no sense to me. Same goes for key_map, it's not clear what the reason for that __packed is, and it's clearly bogus. It might work, almost by mistake, but it's wrong to try to pack that spinlock_t. The s390 kvm use actually looks fine: the structure is packed, but it's also aligned, and the spin-lock is at the beginning, so the "packing" part is about the other members, not the first one. The two that look wrong look like they will probably work anyway (they'll presumably be effectively word-aligned, and that's sufficient for spinlocks in practice). But let's cc the tomoyo and chelsio people. Linus
Re: mainline build failure due to f1e4c916f97f ("drm/edid: add EDID block count and size helpers")
On Tue, May 31, 2022 at 1:04 AM Arnd Bergmann wrote: > > As an experiment: what kind of results would we get when looking > for packed structures and unions that contain any of these: Yeah, any atomics or locks should always be aligned, and won't even work (or might be *very* slow) on multiple architectures. Even x86 - which does very well on unaligned data - reacts badly to sufficiently unaligned atomics (ie cacheline crossing). I don't think we have that. Not only because it would already cause breakage, but simply because the kinds of structures that people pack aren't generally the kind that contain these kinds of things. That said, you might have a struct that is packed, but that intentionally aligns parts of itself, so it *could* be valid. But it would probably not be a bad idea to check that packed structures/unions don't have atomic types or locks in them. I _think_ we're all good, but who knows.. Linus
Re: mainline build failure due to f1e4c916f97f ("drm/edid: add EDID block count and size helpers")
On Sat, May 28, 2022 at 11:59 AM Arnd Bergmann wrote: > > It's CONFIG_ARM_AEABI, which is normally set everywhere. Without this > option, you the kernel is built for the old 'OABI' that forces all non-packed > struct members to be at least 16-bit aligned. Looks like forced word (32 bit) alignment to me. I wonder how many other structures that messes up, but I committed the EDID fix for now. This has presumably been broken for a long time, but maybe the affected targets don't typically use EDID and kernel modesetting, and only use some fixed display setup instead. Those structure definitions go back a _loong_ time (from a quick 'git blame' I see November 2008). But despite that, I did not mark my fix 'cc:stable' because I don't know if any of those machines affected by this bad arm ABI issue could possibly care. At least my tree hopefully now builds on them, with the BUILD_BUG_ON() that uncovered this. Linus
Re: mainline build failure due to f1e4c916f97f ("drm/edid: add EDID block count and size helpers")
On Sat, May 28, 2022 at 10:40 AM Linus Torvalds wrote: > > So digging a bit deeper - since I have am arm compiler after all - I > note that 'sizeof(detailed_timings)' is 88. Hmm. sizeof() both detailed_timings[0].data.other_data.data.range.formula.gtf2 and detailed_timings[0].data.other_data.data.range.formula.cvt is 7. But the *union* of those things is detailed_timings[0].data.other_data.data.range.formula and its size is 8 (despite having an alignment of just 1). The attached patch would seem to fix it for me. Not very much tested, and I have no idea what it is that triggers this only on spear3xx_defconfig. Some ARM ABI issue that is triggered by some very particular ARM compiler flag enabled by that config? Adding some ARM (and SPEAR, and SoC) people in case they have any idea. This smells like a compiler bug triggered by "there's a 16-bit member field in that gtf2 structure, and despite it being packed and aligned to 1, we somehow still align the size to 2". I dunno. But marking those unions packed too doesn't seem wrong, and does seem to fix it. Linus include/drm/drm_edid.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/include/drm/drm_edid.h b/include/drm/drm_edid.h index c3204a58fb09..b2756753370b 100644 --- a/include/drm/drm_edid.h +++ b/include/drm/drm_edid.h @@ -121,7 +121,7 @@ struct detailed_data_monitor_range { u8 supported_scalings; u8 preferred_refresh; } __attribute__((packed)) cvt; - } formula; + } __attribute__((packed)) formula; } __attribute__((packed)); struct detailed_data_wpindex { @@ -154,7 +154,7 @@ struct detailed_non_pixel { struct detailed_data_wpindex color; struct std_timing timings[6]; struct cvt_timing cvt[4]; - } data; + } __attribute__((packed)) data; } __attribute__((packed)); #define EDID_DETAIL_EST_TIMINGS 0xf7 @@ -172,7 +172,7 @@ struct detailed_timing { union { struct detailed_pixel_timing pixel_data; struct detailed_non_pixel other_data; - } data; + } __attribute__((packed)) data; } __attribute__((packed)); #define DRM_EDID_INPUT_SERRATION_VSYNC (1 << 0)
Re: mainline build failure due to f1e4c916f97f ("drm/edid: add EDID block count and size helpers")
On Sat, May 28, 2022 at 5:13 AM Sudip Mukherjee wrote: > > just tried this with > make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- drivers/gpu/drm/drm_edid.s > > size_of_edid: > mov r0, #144@, > ldmfd sp, {fp, sp, pc}@ So digging a bit deeper - since I have am arm compiler after all - I note that 'sizeof(detailed_timings)' is 88. Which is completely wrong. It should be 72 bytes (an array of 4 structures, each 18 bytes in size). I have not dug deeper, but that is clearly the issue. Now, why that only happens on that spear3xx_defconfig, I have no idea. Linus
Re: mainline build failure due to f1e4c916f97f ("drm/edid: add EDID block count and size helpers")
On Fri, May 27, 2022 at 4:41 PM Sudip Mukherjee wrote: > > I just tested with various values, sizeof(*edid) is 144 bytes at that place. Hmm. What compiler do you have? Because it seems very broken. You don't actually have to try with various sizes, you could have just done something like int size_of_edid(const struct edid *edid) { return sizeof(*edid); } and then "make drivers/gpu/drm/drm_edid.s" to generate assembly and see what it looks like (obviously removing the BUG_ON() in order to build). That obviously generates code like movl$128, %eax ret for me, and looking at the definition of that type I really can't see how it would ever generate anything else. But it's apparently not even close for you. I suspect some of the structs inside of that 'struct edid' end up getting aligned, despite the '__attribute__((packed))'. For example, 'struct est_timings' is supposed to be just 3 bytes, and it's at an odd offset too (byte offset 35 in the 'struct edid' if I did the math correctly). But it obviously doesn't happen for me or for most other people, so it's something in your setup. Unusual compiler? Linus
Re: mainline build failure due to f1e4c916f97f ("drm/edid: add EDID block count and size helpers")
On Fri, May 27, 2022 at 2:07 AM Sudip Mukherjee wrote: > > declared with attribute error: BUILD_BUG_ON failed: sizeof(*edid) != > EDID_LENGTH > > And, reverting it on top of mainline branch has fixed the build failure. Hmm. That BUILD_BUG_ON() looks entirely correct, and if that sizeof() doesn't work, then the code doesn't work. I'm not seeing what could go wrong in there, with all the structures I see being marked as __packed__. I wonder if the union in 'struct detailed_timing' also wants that "__attribute__((packed))" but I also wonder what it is that would make this fail on spear3xx but not elsewhere. Very strange. It would be interesting to know where that sizeof goes wrong, but it would seem to be something very peculiar to your build environment (either that config, or your compiler). Linus
Re: [git pull] drm for 5.19-rc1
On Tue, May 24, 2022 at 11:07 PM Dave Airlie wrote: > > AMD has started some new GPU support [...] Oh Christ. Which means another set of auto-generated monster headers. Lovely. Linus
Re: Adding CI results to the kernel tree was Re: [RFC v2] drm/msm: Add initial ci/ subdirectory
On Wed, May 11, 2022 at 12:08 PM Linus Torvalds wrote: > > The kernel tree might have just the expected *failures* listed, if > there are any. Presumably the ci tree has to have the expected results > anyway, so what's the advantage of listing non-failures? .. put another way: I think a list of "we are aware that these currently fail" is quite reasonable for a development tree, maybe even with a comment in the commit that created them about why they currently fail. That also ends up being very nice if you fix a problem, and the fix commit might then remove the failure for the list, and that all makes perfect sense. But having just the raw output of "these are the expected CI results" that is being done and specified by some other tree entirely - that seems pointless and just noise to me. There's no actual reason to have that kind of noise - and update that kind of noise - that I really see. Linus
Re: Adding CI results to the kernel tree was Re: [RFC v2] drm/msm: Add initial ci/ subdirectory
On Wed, May 11, 2022 at 11:40 AM Rob Clark wrote: > > It is missing in this revision of the RFC, but the intention is to > have the gitlab-ci.yml point to a specific commit SHA in the > gfx-ci/drm-ci[1] tree, to solve the problem of keeping the results in > sync with the expectations. Ie. a kernel commit would control moving > to a new version of i-g-t (and eventually deqp and/or piglit), and at > the same time make any necessary updates in the expectations files. Wouldn't it then be better to just have the expectation files in the ci tree too? The kernel tree might have just the expected *failures* listed, if there are any. Presumably the ci tree has to have the expected results anyway, so what's the advantage of listing non-failures? Linus
Re: Adding CI results to the kernel tree was Re: [RFC v2] drm/msm: Add initial ci/ subdirectory
On Tue, May 10, 2022 at 10:07 PM Dave Airlie wrote: > > > And use it to store expectations about what the drm/msm driver is > > supposed to pass in the IGT test suite. > > I wanted to loop in Linus/Greg to see if there are any issues raised > by adding CI results file to the tree in their minds, or if any other > subsystem has done this already, and it's all fine. > > I think this is a good thing after our Mesa experience, but Mesa has a > lot tighter integration here, so I want to get some more opinions > outside the group. Honestly, my immediate reaction is that I think it might be ok, but (a) are these things going to absolutely balloon over time? (b) should these not be separated out? Those two issues kind of interact. If it's a small and targeted test-suite, by all means keep it in the kernel, but why not make it part of "tools/testing/selftests" But if people expect this to balloon and we end up having megabytes of test output, then I really think it should be a separate git tree. A diffstat like this: > 7 files changed, 791 insertions(+) is not a problem at all. But I get the feeling that this is just the tip of the iceberg, and people will want to not just have the result files, but start adding actual *input* files that may be largely automated stuff and may be tens of megabytes in size. Because the result files on their own aren't really self-contained, and then people will want to keep them in sync with the test-files themselves, and start adding those, and now it *really* is likely very unwieldy. Or if that doesn't happen, and the actual input test files stay in a separate CI repo, and then you end up having random coherency issues with that CI repo, and it all gets to be either horribly messy, or the result files in the kernel end up really stale. So honestly, I personally don't see a good end result here. This particular small patch? *This* one looks fine to me, except I really think tools/testing/selftests/gpu would be a much more logical place for it. But I don't see a way forward that is sane. Can somebody argue otherwise? Linus
drm pull request (was Re: )
On Thu, May 5, 2022 at 9:07 PM Dave Airlie wrote: > > pretty quiet week, one fbdev, msm, kconfig, and 2 amdgpu fixes, about > what I'd expect for rc6. You're not getting the automated pr-tracker-bot response, because your subject line was missing... Just a "how did that happen" together with a "here's the manual response instead". Linus
Re: [PATCH RFC v6 00/21] DEPT(Dependency Tracker)
On Wed, May 4, 2022 at 1:19 AM Byungchul Park wrote: > > Hi Linus and folks, > > I've been developing a tool for detecting deadlock possibilities by > tracking wait/event rather than lock(?) acquisition order to try to > cover all synchonization machanisms. So what is the actual status of reports these days? Last time I looked at some reports, it gave a lot of false positives due to mis-understanding prepare_to_sleep(). For this all to make sense, it would need to not have false positives (or at least a very small number of them together with a way to sanely get rid of them), and also have a track record of finding things that lockdep doesn't. Maybe such reports have been sent out with the current situation, and I haven't seen them. Linus
Re: [git pull] drm fixes for 5.18-rc3
On Thu, Apr 14, 2022 at 2:33 PM Dave Airlie wrote: > > git://anongit.freedesktop.org/drm/drm tags/drm-fixes-2022-04-15 .. and since I'm now back home, I can confirm that my boot-time splat I saw before is all gone. It presumably went away with the previous pull request already, but I hadn't remembered to check the issue until this pull reminded me about the issue. Thanks, Linus
Re: [git pull] drm fixes for 5.18-rc2
On Thu, Apr 7, 2022 at 2:20 PM Dave Airlie wrote: > > I think this should fix the amdgpu splat you have been seeing since rc1. Not the machine I'm currently traveling with, but I'll double-check when I'm back home. Thanks, Linus
amdgpu link problem (was Re: [git pull] drm for 5.18-rc1)
I didn't notice this until now, probably because everything still _works_, but I get a new big warning splat at bootup on my main workstation these days as of the merge window changes. The full warning is attached, but it's basically the ASSERT(0) at line 938 in drivers/gpu/drm/amd/display/dc/core/dc_link.c and it looks to have been introduced by commit c282d9512cdd ("drm/amd/display: factor out dp detection link training and mst top detection"). This is the same old setup I've reported before, with some random Radeon card with two monitors attached (PCI ID 1002:67df rev e7, subsystem ID 1da2:e353). I think the card went by the name "Sapphire Pulse RX 580 8GB" in case any of that matters, but it's been working fine. It still works fine, it just has a big ugly boot-time splat. As mentioned, two 4K monitors attached, both over HDMI. If there is any particular info you want, just let me know where/how to find it, and I can provide. Linus warn Description: Binary data
Re: [git pull] drm fixes for 5.18-rc1
On Thu, Mar 24, 2022 at 7:13 PM Dave Airlie wrote: > > Some fixes were queued up in and in light of the fbdev regressions, > I've pulled those in as well, Thanks, pulled. It sounds (from the other thread) that the mediatek DT issue is also about to be fixed, even if it's not yet here. But that hopefully (probably?) affects fewer people and testing robots. Linus
Re: [git pull] drm for 5.18-rc1
On Wed, Mar 23, 2022 at 7:30 PM Dave Airlie wrote: > > This is the main drm pull request for 5.18. > > The summary changelog is below, lots of work all over, > Intel improving DG2 support, amdkfd CRIU support, msm > new hw support, and faster fbdev support. Ok, so this was annoying. I've merged it, but will note three things that I really hope get fixed / checked: (1) My merge resolution looked mostly trivial, except for an annoying conflict between commits 4ed545e7d100 ("dt-bindings: display: mediatek: disp: split each block to individual yaml") and 6d0990e6e844 ("media: dt-binding: mediatek: Get rid of mediatek,larb for multimedia HW") where one of them splits up a file that is modified by the other. I ended up just getting rid of all the "mediatek,larb" mentions in the split-up files, despite the fact that (a) those mentions can be found elsewhere and (b) the split-up did other changes too, so maybe it's wrong. (2) As Guenter reported, the fbdev performance improvement of cfb_imageblit() is broken. I was going to just revert it, but I see that there is a two-series patch to fix it at https://lore.kernel.org/all/20220313192952.12058-1-tzimmerm...@suse.de/ so I merged it in that broken form, in the hope that this set of fixes will be sent to me asap. (3) Very similarly to (2), but broken mediatek DT files. I hope my changes in (1) didn't make things worse, but there's a series of fixes as https://lore.kernel.org/all/20220309134702.9942-1-jason-jh@mediatek.com/ and again I hope I'll get those fixes from the proper places asap. I considered just delaying merging this all entirely, but it seems better to get this all in, with the known problems and known fixes, and see if we hit something _else_ too. Anyway, let's hope I didn't miss anything, and that those are the only major issues. Linus
Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr
On Wed, Mar 2, 2022 at 12:07 PM Kees Cook wrote: > > I've long wanted to change kfree() to explicitly set pointers to NULL on > free. https://github.com/KSPP/linux/issues/87 We've had this discussion with the gcc people in the past, and gcc actually has some support for it, but it's sadly tied to the actual function name (ie gcc has some special-casing for "free()") See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94527 for some of that discussion. Oh, and I see some patch actually got merged since I looked there last so that you can mark "deallocator" functions, but I think it's only for the context matching, not for actually killing accesses to the pointer afterwards. Linus
Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr
On Tue, Mar 1, 2022 at 3:19 PM David Laight wrote: > > Having said that there are so few users of list_entry_is_head() > it is reasonable to generate two new names. Well, the problem is that the users of list_entry_is_head() may be few - but there are a number of _other_ ways to check "was that the HEAD pointer", and not all of them are necessarily correct. IOW, different places do different random tests for "did we walk the whole loop without breaking out". And many of them happen to work. In fact, in practice, pretty much *all* of them happen to work, and you have to have the right struct layout and really really bad luck to hit a case of "type confusion ended up causing the test to not work". And *THAT* is the problem here. It's not the "there are 25ish places that current use list_entry_is_head()". It's the "there are ~480 places that use the type-confused HEAD entry that has been cast to the wrong type". And THAT is why I think we'd be better off with that bigger change that simply means that you can't use the iterator variable at all outside the loop, and try to make it something where the compiler can help catch mis-uses. Now, making the list_for_each_entry() thing force the iterator to NULL at the end of the loop does fix the problem. The issue I have with it is really just that you end up getting no warning at all from the compiler if you mix old-style and new-style semantics. Now, you *will* get an oops (if using a new-style iterator with an old-style check), but many of these things will be in odd driver code and may happen only for error cases. And if you use a new-style check with an old-style iterator (ie some backport problem), you will probably end up getting random memory corruption, because you'll decide "it's not a HEAD entry", and then you'll actually *use* the HEAD that has the wrong type cast associated with it. See what my worry is? With the "don't use iterator outside the loop" approach, the exact same code works in both the old world order and the new world order, and you don't have the semantic confusion. And *if* you try to use the iterator outside the loop, you'll _mostly_ (*) get a compiler warning about it not being initialized. Linus (*) Unless somebody initializes the iterator pointer pointlessly. Which clearly does happen. Thus the "mostly". It's not perfect, and that's most definitely not nice - but it should at least hopefully make it that much harder to mess up.
Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr
On Tue, Mar 1, 2022 at 2:58 PM David Laight wrote: > > Can it be resolved by making: > #define list_entry_is_head(pos, head, member) ((pos) == NULL) > and double-checking that it isn't used anywhere else (except in > the list macros themselves). Well, yes, except for the fact that then the name is entirely misleading... And somebody possibly uses it together with list_first_entry() etc, so it really is completely broken to mix that change with the list traversal change. Linus Linus
Re: [PATCH 6/6] treewide: remove check of list iterator against head past the loop body
So looking at this patch, I really reacted to the fact that quite often the "use outside the loop" case is all kinds of just plain unnecessary, but _used_ to be a convenience feature. I'll just quote the first chunk in it's entirely as an example - not because I think this chunk is particularly important, but because it's a good example: On Mon, Feb 28, 2022 at 3:09 AM Jakob Koschel wrote: > > diff --git a/arch/arm/mach-mmp/sram.c b/arch/arm/mach-mmp/sram.c > index 6794e2db1ad5..fc47c107059b 100644 > --- a/arch/arm/mach-mmp/sram.c > +++ b/arch/arm/mach-mmp/sram.c > @@ -39,19 +39,22 @@ static LIST_HEAD(sram_bank_list); > struct gen_pool *sram_get_gpool(char *pool_name) > { > struct sram_bank_info *info = NULL; > + struct sram_bank_info *tmp; > > if (!pool_name) > return NULL; > > mutex_lock(_lock); > > - list_for_each_entry(info, _bank_list, node) > - if (!strcmp(pool_name, info->pool_name)) > + list_for_each_entry(tmp, _bank_list, node) > + if (!strcmp(pool_name, tmp->pool_name)) { > + info = tmp; > break; > + } > > mutex_unlock(_lock); > > - if (>node == _bank_list) > + if (!info) > return NULL; > > return info->gpool; I realize this was probably at least auto-generated with coccinelle, but maybe that script could be taught to do avoid the "use after loop" by simply moving the code _into_ the loop. IOW, this all would be cleaner and clear written as if (!pool_name) return NULL; mutex_lock(_lock); list_for_each_entry(info, _bank_list, node) { if (!strcmp(pool_name, info->pool_name)) { mutex_unlock(_lock); return info; } } mutex_unlock(_lock); return NULL; Ta-daa - no use outside the loop, no need for new variables, just a simple "just do it inside the loop". Yes, we end up having that lock thing twice, but it looks worth it from a "make the code obvious" standpoint. Would it be even cleaner if the locking was done in the caller, and the loop was some simple helper function? It probably would. But that would require a bit more smarts than probably a simple coccinelle script would do. Linus
Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr
On Tue, Mar 1, 2022 at 11:06 AM Linus Torvalds wrote: > > So instead of that simple "if (!entry)", we'd effectively have to > continue to use something that still works with the old world order > (ie that "if (list_entry_is_head())" model). Just to prove my point about how this is painful, that doesn't work at all. If the loop iterator at the end is NULL (good, in theory), we can't use "list_entry_is_head()" to check whether we ended. We'd have to use a new thing entirely, to handle the "list_for_each_entry() has the old/new semantics" cases. That's largely why I was pushing for the "let's make it impossible to use the loop iterator at all outside the loop". It avoids the confusing case, and the patches to move to that stricter semantic can be merged independently (and before) doing the actual semantic change. I'm not saying my suggested approach is wonderful either. Honestly, it's painful that we have so nasty semantics for the end-of-loop case for list_for_each_entry(). The minimal patch would clearly be to keep those broken semantics, and just force everybody to use the list_entry_is_head() case. That's the "we know we messed up, we are too lazy to fix it, we'll just work around it and people need to be careful" approach. And laziness is a virtue. But bad semantics are bad semantics. So it's a question of balancing those two issues. Linus
Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr
On Mon, Feb 28, 2022 at 2:29 PM James Bottomley wrote: > > However, if the desire is really to poison the loop variable then we > can do > > #define list_for_each_entry(pos, head, member) \ > for (pos = list_first_entry(head, typeof(*pos), member);\ > !list_entry_is_head(pos, head, member) && ((pos = NULL) == NULL; > \ > pos = list_next_entry(pos, member)) > > Which would at least set pos to NULL when the loop completes. That would actually have been excellent if we had done that originally. It would not only avoid the stale and incorrectly typed head entry left-over turd, it would also have made it very easy to test for "did I find an entry in the loop". But I don't much like it in the situation we are now. Why? Mainly because it basically changes the semantics of the loop _without_ any warnings about it. And we don't actually get the advantage of the nicer semantics, because we can't actually make code do list_for_each_entry(entry, ) { .. } if (!entry) return -ESRCH; .. use the entry we found .. because that would be a disaster for back-porting, plus it would be a flag-day issue (ie we'd have to change the semantics of the loop at the same time we change every single user). So instead of that simple "if (!entry)", we'd effectively have to continue to use something that still works with the old world order (ie that "if (list_entry_is_head())" model). So we couldn't really take _advantage_ of the nicer semantics, and we'd not even get a warning if somebody does it wrong - the code would just silently do the wrong thing. IOW: I don't think you are wrong about that patch: it would solve the problem that Jakob wants to solve, and it would have absolutely been much better if we had done this from the beginning. But I think that in our current situation, it's actually a really fragile solution to the "don't do that then" problem we have. Linus
Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr
On Tue, Mar 1, 2022 at 10:14 AM Kees Cook wrote: > > The first big glitch with -Wshadow was with shadowed global variables. > GCC 4.8 fixed that, but it still yells about shadowed functions. What > _almost_ works is -Wshadow=local. Heh. Yeah, I just have long memories of "-Wshadow was a disaster". You looked into the details. > Another way to try to catch misused shadow variables is > -Wunused-but-set-varible, but it, too, has tons of false positives. That on the face of it should be an easy warning to get technically right for a compiler. So I assume the "false positives" are simply because we end up having various variables that really don't end up being used - and "intentionally" so). Or rather, they might only be used under some config option - perhaps the use is even syntactically there and parsed, but the compiler notices that it's turned off under some if (IS_ENABLED(..)) option? Because yeah, we have a lot of those. I think that's a common theme with a lot of compiler warnings: on the face of it they sound "obviously sane" and nobody should ever write code like that. A conditional that is always true? Sounds idiotic, and sounds like a reasonable thing for a compiler to warn about, since why would you have a conditional in the first place for that? But then you realize that maybe the conditional is a build config option, and "always true" suddenly makes sense. Or it's a test for something that is always true on _that_architecture_ but not in some general sense (ie testing "sizeof()"). Or it's a purely syntactic conditional, like "do { } while (0)". It's why I'm often so down on a lot of the odd warnings that are hiding under W=1 and friends. They all may make sense in the trivial case ("That is insane") but then in the end they happen for sane code. And yeah, -Wshadow has had tons of history with macro nesting, and just being badly done in the first place (eg "strlen" can be a perfectly fine local variable). That said, maybe people could ask the gcc and clan people for a way to _mark_ the places where we expect to validly see shadowing. For example, that "local variable in a macro expression statement" thing is absolutely horrendous to fix with preprocessor tricks to try to make for unique identifiers. But I think it would be much more syntactically reasonable to add (for example) a "shadow" attribute to such a variable exactly to tell the compiler "yeah, yeah, I know this identifier could shadow an outer one" and turn it off that way. Linus
Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr
On Mon, Feb 28, 2022 at 4:38 PM Segher Boessenkool wrote: > > In C its scope is the rest of the declaration and the entire loop, not > anything after it. This was the same in C++98 already, btw (but in > pre-standard versions of C++ things were like you remember, yes, and it > was painful). Yeah, the original C++ model was just unadulterated garbage, with no excuse for it, and the scope was not the loop, but the block the loop existed in. That would never have been acceptable for the kernel - it's basically just an even uglier version of "put variable declarations in the middle of code" (and we use "-Wdeclaration-after-statement" to disallow that for kernel code, although apparently some of our user space tooling code doesn't enforce or follow that rule). The actual C99 version is the sane one which actually makes it easier and clearer to have loop iterators that are clearly just in loop scope. That's a good idea in general, and I have wanted to start using that in the kernel even aside from some of the loop construct macros. Because putting variables in natural minimal scope is a GoodThing(tm). Of course, we shouldn't go crazy with it. Even after we do that -std=gnu11 thing, we'll have backports to worry about. And it's not clear that we necessarily want to backport that gnu11 thing - since people who run old stable kernels also may be still using those really old compilers... Linus
Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr
On Mon, Feb 28, 2022 at 4:45 PM Linus Torvalds wrote: > > Yeah, except that's ugly beyond belief, plus it's literally not what > we do in the kernel. (Of course, I probably shouldn't have used 'min()' as an example, because that is actually one of the few places where we do exactly that, using our __UNIQUE_ID() macros. Exactly because people _have_ tried to do -Wshadow when doing W=2). Linus
Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr
On Mon, Feb 28, 2022 at 3:26 PM Matthew Wilcox wrote: > > #define ___PASTE(a, b) a##b > #define __PASTE(a, b) ___PASTE(a, b) > #define _min(a, b, u) ({ \ Yeah, except that's ugly beyond belief, plus it's literally not what we do in the kernel. Really. The "-Wshadow doesn't work on the kernel" is not some new issue, because you have to do completely insane things to the source code to enable it. Just compare your uglier-than-sin version to my straightforward one. One does the usual and obvious "use a private variable to avoid the classic multi-use of a macro argument". And the other one is an abomination. Linus
Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr
On Mon, Feb 28, 2022 at 1:47 PM Jakob Koschel wrote: > > The goal of this is to get compiler warnings right? This would indeed be > great. Yes, so I don't mind having a one-time patch that has been gathered using some automated checker tool, but I don't think that works from a long-term maintenance perspective. So if we have the basic rule being "don't use the loop iterator after the loop has finished, because it can cause all kinds of subtle issues", then in _addition_ to fixing the existing code paths that have this issue, I really would want to (a) get a compiler warning for future cases and (b) make it not actually _work_ for future cases. Because otherwise it will just happen again. > Changing the list_for_each_entry() macro first will break all of those cases > (e.g. the ones using 'list_entry_is_head()). So I have no problems with breaking cases that we basically already have a patch for due to your automated tool. There were certainly more than a handful, but it didn't look _too_ bad to just make the rule be "don't use the iterator after the loop". Of course, that's just based on that patch of yours. Maybe there are a ton of other cases that your patch didn't change, because they didn't match your trigger case, so I may just be overly optimistic here. But basically to _me_, the important part is that the end result is maintainable longer-term. I'm more than happy to have a one-time patch to fix a lot of dubious cases if we can then have clean rules going forward. > I assumed it is better to fix those cases first and then have a simple > coccinelle script changing the macro + moving the iterator into the scope > of the macro. So that had been another plan of mine, until I actually looked at changing the macro. In the one case I looked at, it was ugly beyond belief. It turns out that just syntactically, it's really nice to give the type of the iterator from outside the way we do now. Yeah, it may be a bit odd, and maybe it's partly because I'm so used to the "list_for_each_list_entry()" syntax, but moving the type into the loop construct really made it nasty - either one very complex line, or having to split it over two lines which was even worse. Maybe the place I looked at just happened to have a long typename, but it's basically always going to be a struct, so it's never a _simple_ type. And it just looked very odd adn unnatural to have the type as one of the "arguments" to that list_for_each_entry() macro. So yes, initially my idea had been to just move the iterator entirely inside the macro. But specifying the type got so ugly that I think that typeof (pos) pos trick inside the macro really ends up giving us the best of all worlds: (a) let's us keep the existing syntax and code for all the nice cases that did everything inside the loop anyway (b) gives us a nice warning for any normal use-after-loop case (unless you explicitly initialized it like that sgx_mmu_notifier_release() function did for no good reason (c) also guarantees that even if you don't get a warning, non-converted (or newly written) bad code won't actually _work_ so you end up getting the new rules without any ambiguity or mistaken > With this you are no longer able to set the 'outer' pos within the list > iterator loop body or am I missing something? Correct. Any assignment inside the loop will be entirely just to the local loop case. So any "break;" out of the loop will have to set another variable - like your updated patch did. > I fail to see how this will make most of the changes in this > patch obsolete (if that was the intention). I hope my explanation above clarifies my thinking: I do not dislike your patch, and in fact your patch is indeed required to make the new semantics work. What I disliked was always the maintainability of your patch - making the rules be something that isn't actually visible in the source code, and letting the old semantics still work as well as they ever did, and having to basically run some verification pass to find bad users. (I also disliked your original patch that mixed up the "CPU speculation type safety" with the actual non-speculative problems, but that was another issue). Linus
Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr
On Mon, Feb 28, 2022 at 12:29 PM Johannes Berg wrote: > > If we're willing to change the API for the macro, we could do > > list_for_each_entry(type, pos, head, member) > > and then actually take advantage of -Wshadow? See my reply to Willy. There is no way -Wshadow will ever happen. I considered that (type, pos, head, member) kind of thing, to the point of trying it for one file, but it ends up as horrendous syntax. It turns out that declaring the type separately really helps, and avoids crazy long lines among other things. It would be unacceptable for another reason too - the amount of churn would just be immense. Every single use of that macro (and related macros) would change, even the ones that really don't need it or want it (ie the good kinds that already only use the variable inside the loop). So "typeof(pos) pos" may be ugly - but it's a very localized ugly. Linus
Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr
On Mon, Feb 28, 2022 at 12:16 PM Matthew Wilcox wrote: > > Then we can never use -Wshadow ;-( I'd love to be able to turn it on; > it catches real bugs. Oh, we already can never use -Wshadow regardless of things like this. That bridge hasn't just been burned, it never existed in the first place. The whole '-Wshadow' thing simply cannot work with local variables in macros - something that we've used since day 1. Try this (as a "p.c" file): #define min(a,b) ({ \ typeof(a) __a = (a);\ typeof(b) __b = (b);\ __a < __b ? __a : __b; }) int min3(int a, int b, int c) { return min(a,min(b,c)); } and now do "gcc -O2 -S t.c". Then try it with -Wshadow. In other words, -Wshadow is simply not acceptable. Never has been, never will be, and that has nothing to do with the typeof(pos) pos kind of thing. Your argument just isn't an argument. Linus