Re: [git pull] drm for 6.9-rc1

2024-03-13 Thread Linus Torvalds
On Tue, 12 Mar 2024 at 21:07, Dave Airlie  wrote:
>
> I've done a trial merge into your tree from a few hours ago, there
> are definitely some slighty messy conflicts, I've pushed a sample
> branch here:

I appreciate your sample merges since I like verifying my end result,
but I think your merge is wrong.

I got two differences when I did the merge. The one in
intel_dp_detect() I think is just syntactic - I ended up placing the

if (!intel_dp_is_edp(intel_dp))
intel_psr_init_dpcd(intel_dp);

differently than you did (I did it *after* the tunnel_detect()).

I don't _think,_ that placement matters, but somebody more familiar
with the code should check it out. Added Animesh and Jani to the
participants.

But I think your merge gets the TP_printk() for the xe_bo_move trace
event is actively wrong. You don't have the destination for the move
in the printk.

Or maybe I got it wrong. Our merges end up _close_, but not identical.

   Linus


Re: [PATCH 1/3] kci-gitlab: Introducing GitLab-CI Pipeline for Kernel Testing

2024-03-03 Thread Linus Torvalds
On Fri, 1 Mar 2024 at 02:27, Nikolai Kondrashov  wrote:
>
> I agree, it's hard to imagine even a simple majority agreeing on how GitLab CI
> should be done. Still, we would like to help people, who are interested in
> this kind of thing, to set it up. How about we reframe this contribution as a
> sort of template, or a reference for people to start their setup with,
> assuming that most maintainers would want to tweak it? We would also be glad
> to stand by for questions and help, as people try to use it.

Ack. I think seeing it as a library for various gitlab CI models would
be a lot more palatable. Particularly if you can then show that yes,
it is also relevant to our currently existing drm case.

So I'm not objecting to having (for example) some kind of CI helper
templates - I think a logical place would be in tools/ci/ which is
kind of alongside our tools/testing subdirectory.

(And then perhaps have a 'gitlab' directory under that. I'm not sure
whether - and how much - commonality there might be between the
different CI models of different hosts).

Just to clarify: when I say "a logical place", I very much want to
emphasize the "a" - maybe there are better places, and I'm not saying
that is the only possible place. But it sounds more logical to me than
some.

Linus


Re: [PATCH 1/3] kci-gitlab: Introducing GitLab-CI Pipeline for Kernel Testing

2024-03-01 Thread Linus Torvalds
On Thu, 29 Feb 2024 at 01:23, Nikolai Kondrashov  wrote:
>
> However, I think a better approach would be *not* to add the .gitlab-ci.yaml
> file in the root of the source tree, but instead change the very same repo
> setting to point to a particular entry YAML, *inside* the repo (somewhere
> under "ci" directory) instead.

I really don't want some kind of top-level CI for the base kernel project.

We already have the situation that the drm people have their own ci
model. II'm ok with that, partly because then at least the maintainers
of that subsystem can agree on the rules for that one subsystem.

I'm not at all interested in having something that people will then
either fight about, or - more likely - ignore, at the top level
because there isn't some global agreement about what the rules are.

For example, even just running checkpatch is often a stylistic thing,
and not everybody agrees about all the checkpatch warnings.

I would suggest the CI project be separate from the kernel.

And having that slack channel that is restricted to particular
companies is just another sign of this whole disease.

If you want to make a google/microsoft project to do kernel CI, then
more power to you, but don't expect it to be some kind of agreed-upon
kernel project when it's a closed system.

   Linus


Re: [PATCH next v2 08/11] minmax: Add min_const() and max_const()

2024-02-25 Thread Linus Torvalds
On Sun, 25 Feb 2024 at 08:53, David Laight  wrote:
>
> The expansions of min() and max() contain statement expressions so are
> not valid for static intialisers.
> min_const() and max_const() are expressions so can be used for static
> initialisers.

I hate the name.

Naming shouldn't be about an implementation detail, particularly not
an esoteric one like the "C constant expression" rule. That can be
useful for some internal helper functions or macros, but not for
something that random people are supposed to USE.

Telling some random developer that inside an array size declaration or
a static initializer you need to use "max_const()" because it needs to
syntactically be a constant expression, and our regular "max()"
function isn't that, is just *horrid*.

No, please just use the traditional C model of just using ALL CAPS for
macro names that don't act like a function.

Yes, yes, that may end up requiring getting rid of some current users of

  #define MIN(a,b) ((a)<(b) ? (a):(b))

but dammit, we don't actually have _that_ many of them, and why should
we have random drivers doing that anyway?

  Linus


Re: [PATCH 2/9] lib/cmdline: Fix an invalid format specifier in an assertion msg

2024-02-22 Thread Linus Torvalds
On Thu, 22 Feb 2024 at 09:36, Daniel Latypov  wrote:
>
> Copying the line for context, it's about `p-r` where
>   p = memchr_inv([1], 0, sizeof(r) - sizeof(r[0]));
> `p-r` should never be negative unless something has gone horribly
> horribly wrong.

Sure it would - if 'p' is NULL.

Of course, then a negative value wouldn't be helpful either, and in
this case that's what the EXPECT_PTR_EQ checking is testing in the
first place, so it's a non-issue.

IOW, in practice clearly the sign should simply not matter here.

I do think that the default case for pointer differences should be
that they are signed, because they *can* be.

Just because of that "default case", unless there's some actual reason
to use '%tu', I think '%td' should be seen as the normal case to use.

That said, just as a quick aside: be careful with pointer differences
in the kernel.

For this particular case, when we're talking about just 'char *', it's
not a big deal, but we've had code where people didn't think about
what it means to do a pointer difference in C, and how it can be often
unnecessarily expensive due to the implied "divide by the size of the
pointed object".

Sometimes it's actually worth writing the code in ways that avoids
pointer differences entirely (which might involve passing around
indexes instead of pointers).

 Linus


Re: [PATCH 8/9] drm/xe/tests: Fix printf format specifiers in xe_migrate test

2024-02-21 Thread Linus Torvalds
On Wed, 21 Feb 2024 at 21:05, Lucas De Marchi  wrote:
>
> this has a potential to cause conflicts with upcoming work, so I think
> it's better to apply this through drm-xe-next. Let me know if you agree.

I disagree. Violently.

For this to be fixed, we need to have the printf format checking enabled.

And we can't enable it until all the problems have been fixed.

Which means that we should *not* have to wait for [N] different trees
to fix their issues separately.

This should get fixed in the Kunit tree, so that the Kunit tree can
just send a pull request to me to enable format checking for the KUnit
tests, together with all the fixes.  Trying to spread those fixes out
to different git branches will only result in pain and pointless
dependencies between different trees.

Honestly, the reason I noticed the problem in the first place was that
the drm tree had a separate bug, that had been apparently noted in
linux-next, and *despite* that it made it into a pull request to me
and caused new build failures in rc5.

So as things are, I am not IN THE LEAST interested in some kind of
"let us fix this in the drm tree separately" garbage.  We're not
making things worse by trying to fix this in different trees.

We're fixing this in the Kunit tree, and I do not want to get *more*
problems from the drm side. I've had enough.

   Linus


Re: [PATCH next 10/11] block: Use a boolean expression instead of max() on booleans

2024-01-28 Thread Linus Torvalds
On Sun, 28 Jan 2024 at 14:22, David Laight  wrote:
>
> H blame gcc :-)

I do agree that the gcc warning quoting is unnecessarily ugly (even
just visually), but..

> The error message displays as '0' but is e2:80:98 30 e2:80:99
> I HATE UTF-8, it wouldn't be as bad if it were a bijection.

No, that's not the problem. The UTF-8 that gcc emits is fine.

And your email was also UTF-8:

Content-Type: text/plain; charset=UTF-8

The problem is that you clearly then used some other tool in between
that took the UTF-8 byte stream, and used it as (presumably) Latin1,
which is bogus.

If you just make everything use and stay as UTF-8, it all works out
beautifully. But I suspect you have an editor or a MUA that is fixed
in some 1980s mindset, and when you cut-and-pasted the UTF-8, it
treated it as Latin1.

Just make all your environment be utf-8, like it should be. It's not
the 80s any more. We don't do mullets, and we don't do Latin1, ok?

Linus


Re: [PATCH next 10/11] block: Use a boolean expression instead of max() on booleans

2024-01-28 Thread Linus Torvalds
On Sun, 28 Jan 2024 at 11:36, David Laight  wrote:
>
> However it generates:
> error: comparison of constant ‘0’ with boolean expression is always true 
> [-Werror=bool-compare]
> inside the signedness check that max() does unless a '+ 0' is added.

Please fix your locale. You have random garbage characters there,
presumably because you have some incorrect locale setting somewhere in
your toolchain.

   Linus


Re: [BUG] BUG: kernel NULL pointer dereference at ttm_device_init+0xb4

2024-01-22 Thread Linus Torvalds
On Mon, 22 Jan 2024 at 16:56, Bhardwaj, Rajneesh
 wrote:
>
> I think a fix might already be in flight. Please see  Linux-Kernel Archive: 
> Re: [PATCH] drm/ttm: fix ttm pool initialization for no-dma-device drivers 
> (iu.edu)

Please use lore.kernel.org that doesn't corrupt whitespace in patches
or lose header information:

  https://lore.kernel.org/lkml/20240113213347.9562-1-pchel...@ispras.ru/

although that seems to be a strange definition of "in flight". It was
sent out 8 days ago, and apparently nobody thought to include it in
the drm fixes pile that came in last Friday.

So it made it into rc1, even though it was reported a week before.

It also looks like some mailing list there is mangling emails - if you
use 'all' instead of 'lkml', lore reports multiple emails with the
same message-id, and it all looks messier as a result.

I assume it's dri-devel@lists.freedesktop.org that messes up, mainly
because I don't tend to see this behaviour when only the usual
kernel.org mailing lists are involved.

  Linus


Re: [BUG] BUG: kernel NULL pointer dereference at ttm_device_init+0xb4

2024-01-22 Thread Linus Torvalds
On Mon, 22 Jan 2024 at 15:17, Steven Rostedt  wrote:
>
> Perhaps this is the real fix?

If you send a signed-off version, I'll apply it asap.

Thanks,
 Linus


Re: [git pull] drm for 6.8

2024-01-12 Thread Linus Torvalds
On Wed, 10 Jan 2024 at 11:49, Dave Airlie  wrote:
>
> Let me know if there are any issues,

Your testing is seriously lacking.

This doesn't even build. The reason seems to be that commit
b49e894c3fd8 ("drm/i915: Replace custom intel runtime_pm tracker with
ref_tracker library") changed the 'intel_wakeref_t' type from a
'depot_stack_handle_t' to 'unsigned long', and as a result did this:

-   drm_dbg(>drm, "async_put_wakeref %u\n",
+   drm_dbg(>drm, "async_put_wakeref %lu\n",
power_domains->async_put_wakeref);

meanwhile, the Xe driver has this:

  drivers/gpu/drm/xe/compat-i915-headers/intel_wakeref.h:
typedef bool intel_wakeref_t;

which has never been valid, but now the build fails with

  drivers/gpu/drm/i915/display/intel_display_power.c: In function
‘print_async_put_domains_state’:
  drivers/gpu/drm/i915/display/intel_display_power.c:408:29: error:
format ‘%lu’ expects argument of type ‘long unsigned int’, but
argument 5 has type ‘int’ [-Werror=format=]

because the drm header files have this disgusting thing where a
*header* file includes a *C* file:

  In file included from ./include/drm/drm_mm.h:51,
 from drivers/gpu/drm/xe/xe_bo_types.h:11,
 from drivers/gpu/drm/xe/xe_bo.h:11,
 from
./drivers/gpu/drm/xe/compat-i915-headers/gem/i915_gem_object.h:11,
 from ./drivers/gpu/drm/xe/compat-i915-headers/i915_drv.h:15,
 from drivers/gpu/drm/i915/display/intel_display_power.c:8:

nasty.

I made it build by fixing that broken Xe compat header file, but this
is definitely *NOT* how things should have worked. How did this ever
get to me without any kind of build testing?

And why the %^!@$% does a header file include a C file? That's wrong
regardless of this bug.

   Linus


Re: [git pull] drm fixes for 6.8

2024-01-04 Thread Linus Torvalds
On Wed, 3 Jan 2024 at 18:30, Dave Airlie  wrote:
>
> These were from over the holiday period, mainly i915, a couple of
> qaic, bridge and an mgag200.
>
> I have a set of nouveau fixes that I'll send after this, that might be
> too rich for you at this point.
>
> I expect there might also be some more regular fixes before 6.8, but
> they should be minor.

I'm assuming you're just confused about the numbering, and meant 6.7
here and in the subject line.

This seems to be too small of a pull to be an early pull request for
the 6.8 merge window.

   Linus


Re: [GIT PULL] fbdev fixes and updates for v6.7-rc3

2023-11-26 Thread Linus Torvalds
On Sat, 25 Nov 2023 at 22:58, Helge Deller  wrote:
>
> please pull some small fbdev fixes for 6.7-rc3.

These all seem to be pure cleanups, not bug fixes.

Please resend during the merge window.

Linus


github version complaints about the gitlab CI requirements.txt

2023-11-12 Thread Linus Torvalds
So every time I push to my github mirror, github now ends up having a
'dependabot' thing that warns about some of the CI version
requirements for the gitlab automated testing file.

It wants to update the pip requirements from 23.2.1 to 23.3

 - When installing a package from a Mercurial VCS URL, e.g. pip install
   hg+..., with pip prior to v23.3, the specified Mercurial revision
   could be used to inject arbitrary configuration options to the hg
   clone call (e.g. --config). Controlling the Mercurial configuration
   can modify how and which repository is installed. This vulnerability
   does not affect users who aren't installing from Mercurial.

and upgrade the urllib3 requirements from 2.0.4 to 2.0.7:

 - urllib3's request body not stripped after redirect from 303 status
   changes request method to GET

 - `Cookie` HTTP header isn't stripped on cross-origin redirects

And it's not like any of this looks like a big deal, but I'd like to
shut up the messages I get.

I can either just close those issues, or I can apply a patch something
like the attached (which also adds a missing newline at the end).

I thought I should ask the people who actually set this up. Comments?

   Linus
 drivers/gpu/drm/ci/xfails/requirements.txt | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/ci/xfails/requirements.txt b/drivers/gpu/drm/ci/xfails/requirements.txt
index d8856d1581fd..e9994c9db799 100644
--- a/drivers/gpu/drm/ci/xfails/requirements.txt
+++ b/drivers/gpu/drm/ci/xfails/requirements.txt
@@ -5,7 +5,7 @@ termcolor==2.3.0
 certifi==2023.7.22
 charset-normalizer==3.2.0
 idna==3.4
-pip==23.2.1
+pip==23.3
 python-gitlab==3.15.0
 requests==2.31.0
 requests-toolbelt==1.0.0
@@ -13,5 +13,5 @@ ruamel.yaml==0.17.32
 ruamel.yaml.clib==0.2.7
 setuptools==68.0.0
 tenacity==8.2.3
-urllib3==2.0.4
-wheel==0.41.1
\ No newline at end of file
+urllib3==2.0.7
+wheel==0.41.1


Re: drm/vkms: deadlock between dev->event_lock and timer

2023-09-13 Thread Linus Torvalds
On Wed, 13 Sept 2023 at 07:21, Tetsuo Handa
 wrote:
>
> Hello. A deadlock was reported in drivers/gpu/drm/vkms/ .
> It looks like this locking pattern remains as of 6.6-rc1. Please fix.
>
> void drm_crtc_vblank_off(struct drm_crtc *crtc) {
>   spin_lock_irq(>event_lock);
>   drm_vblank_disable_and_save(dev, pipe) {
> __disable_vblank(dev, pipe) {
>   crtc->funcs->disable_vblank(crtc) == vkms_disable_vblank {
> hrtimer_cancel(>vblank_hrtimer) { // Retries with 
> dev->event_lock held until lock_hrtimer_base() succeeds.
>   ret = hrtimer_try_to_cancel(timer) {
> base = lock_hrtimer_base(timer, ); // Fails forever because 
> vkms_vblank_simulate() is in progress.

Heh. Ok. This is clearly a bug, but it does seem to be limited to just
the vkms driver, and literally only to the "simulate vblank" case.

The worst part about it is that it's so subtle and not obvious.

Some light grepping seems to show that amdgpu has almost the exact
same pattern in its own vkms thing, except it uses

hrtimer_try_to_cancel(_crtc->vblank_timer);

directly, which presumably fixes the livelock, but means that the
cancel will fail if it's currently running.

So just doing the same thing in the vkms driver probably fixes things.

Maybe the vkms people need to add a flag to say "it's canceled" so
that it doesn't then get re-enabled?  Or maybe it doesn't matter
and/or already happens for some reason I didn't look into.

   Linus


Re: [git pull] drm CI integration

2023-09-10 Thread Linus Torvalds
On Wed, 30 Aug 2023 at 18:00, Dave Airlie  wrote:
>
> This is a PR to add drm-ci support files to the upstream tree.

So I finally had no other pull requests pending, and spent some time
looking at this, and I see nothing offensive.

I did wonder how this then expands to having more than one subsystem
using this (and mixing them possibly on the same CI system), but
that's more about my ignorance about how the gitlab CI works than
anything else, so that's certainly not a real concern.

The other side of that "I do wonder" coin is for when others want to
use the same tests but using some other CI infrastructure, whether
it's some AWS or google cloud thing, or github or whatever.

Anyway, considering that both of my idle curiosity reactions were
about "if this is successful", I think me having those questions only
means that I should pull this, rather than questioning the pull
itself.

If it works out so well that others want to actually do this and
integrate our other selftests in similar manners, I think that would
be lovely.

And if - as you say - this is a failure and the whole thing gets
deleted a year from now as "this didn't go anywhere", it doesn't look
like it should cause a ton of problems either.

Anyway, it's in my tree now, let's see where it goes.

   Linus


Re: [git pull] drm fixes for 6.6-rc1

2023-09-07 Thread Linus Torvalds
On Thu, 7 Sept 2023 at 19:45, Dave Airlie  wrote:
>
> Just a poke about the outstanding drm CI support pull request since I
> haven't see any movement on that in the week, hopefully it's not as
> difficult a problem as bcachefs :-)

I was assuming that it wouldn't interfere with anything else... and
that I could just ignore it until I have all my "real" pulls done. I
didn't want to even look at it until I was "done".

Yes, it's past the mid-way point of the second week of the merge
window, and I mostly _should_ be "done".

But I still keep finding new pull requests in my inbox that aren't
just fixes and updates for previous main pull requests.

Linus


Re: mainline build failure due to 501126083855 ("fbdev/g364fb: Use fbdev I/O helpers")

2023-08-31 Thread Linus Torvalds
On Thu, 31 Aug 2023 at 11:48, Sudip Mukherjee (Codethink)
 wrote:
> The latest mainline kernel branch fails to build mips jazz_defconfig with
> the error:
>
> drivers/video/fbdev/g364fb.c:115:9: error: 'FB_DEFAULT_IOMEM_HELPERS' 
> undeclared here (not in a function); did you mean 'FB_DEFAULT_IOMEM_OPS'?
>   115 | FB_DEFAULT_IOMEM_HELPERS,
>   | ^~~~
>   | FB_DEFAULT_IOMEM_OPS
>
>
> git bisect pointed to 501126083855 ("fbdev/g364fb: Use fbdev I/O helpers").
>
> Reverting the commit has fixed the build failure.
>
> I will be happy to test any patch or provide any extra log if needed.

Would you mind testing the exact thing that the compiler suggested?

So instead of the revert, just replace FB_DEFAULT_IOMEM_HELPERS with
FB_DEFAULT_IOMEM_OPS.

I think it's just a typo / confusion with the config variable (which
is called FB_IOMEM_HELPERS).

   Linus


Re: [git pull] drm fixes for 6.5-rc4

2023-07-28 Thread Linus Torvalds
On Thu, 27 Jul 2023 at 19:20, Dave Airlie  wrote:
>
> Regular scheduled fixes, msm and amdgpu leading the way, with some
> i915 and a single misc fbdev, all seems fine.

Pulled.

Tangentially related: where do you keep your pgp key? The one I have
is long expired, and doing a refresh doesn't get any updates...

Linus


Re: [pull] amdgpu drm-fixes-6.4

2023-06-23 Thread Linus Torvalds
On Fri, 23 Jun 2023 at 14:18, Alex Deucher  wrote:
>
> are available in the Git repository at:
>
>   https://gitlab.freedesktop.org/agd5f/linux.git 
> tags/amd-drm-fixes-6.4-2023-06-23

That's not a valid signed tag.

Yes, it's a tag.  But it doesn't actually have any cryptographic
signing, and I'm not willing to pull random content from git sites
that I can't verify. In fact, these days I ask even kernel.org pull
requests to be proper signed tags, although I haven't really gotten to
the point where I *require* it.

So please sign your tags - use "git tag -s" (or "-u keyname" if you
have some specific key you want to use, rather than one described by
your regular git config file).

  Linus


Re: mainline build failure due to 322458c2bb1a ("drm/ttm: Reduce the number of used allocation orders for TTM pages")

2023-04-26 Thread Linus Torvalds
On Wed, Apr 26, 2023 at 10:44 AM Sudip Mukherjee (Codethink)
 wrote:
>
> drivers/gpu/drm/ttm/ttm_pool.c:73:29: error: variably modified 
> 'global_write_combined' at file scope
>73 | static struct ttm_pool_type global_write_combined[TTM_DIM_ORDER];
>   | ^

Ugh.

This is because we have

  #define TTM_DIM_ORDER (__TTM_DIM_ORDER <= MAX_ORDER ?
__TTM_DIM_ORDER : MAX_ORDER)

which looks perfectly fine as a constant ("pick the smaller of
MAX_ORDER and __TTM_DIM_ORDER").

But:

  #define __TTM_DIM_ORDER (TTM_MAX_ORDER + 1)
  #define TTM_MAX_ORDER (PMD_SHIFT - PAGE_SHIFT)

which still _looks_ fine, but  on 64-bit powerpc, we then have

   #define PTE_INDEX_SIZE  __pte_index_size

so that __TTM_DIM_ORDER constant isn't actually a constant at all.

I would suggest that the DRM code just make those fixed-size arrays
use "MAX_ORDER", because even though it might be a bit bigger than
necessary, that's still not a very big constant (ie typically it's
"11", but it could be up to 64).

It's a bit sad how that macro that _looks_ like a constant (and is one
pretty much everywhere else) isn't actually constant on powerpc, but
looking around it looks fairly unavoidable.

Maybe the DRM code could try to avoid using things like PMD_SHIFT entirely?

Linus


Disabling -Warray-bounds for gcc-13 too

2023-04-23 Thread Linus Torvalds
Kees,
  I made the mistake of upgrading my M2 Macbook Air to Fedora-38, and
in the process I got gcc-13 which is not WERROR-clean because we only
limited the 'array-bounds' warning to gcc-11 and gcc-12. But gcc-13
has all the same issues.

And I want to be able to do my arm64 builds with WERROR on still...

I guess it never made much sense to hope it was going to go away
without having a confirmation, so I just changed it to be gcc-11+.

A lot of the warnings seem just crazy, with gcc just not getting the
bounds right, and then being upset about us going backwards with
'container_of()' etc. Ok, so the kernel is special. We do odd things.
I get it, gcc ends up being confused.

But before I disabled it, I did take a look at a couple of warnings
that didn't look like the sea of crazy.

And one of them is from you.

In particular, commit 4076ea2419cf ("drm/nouveau/disp: Fix
nvif_outp_acquire_dp() argument size") cannot possibly be right, It
changes

 nvif_outp_acquire_dp(struct nvif_outp *outp, u8 dpcd[16],

to

 nvif_outp_acquire_dp(struct nvif_outp *outp, u8 dpcd[DP_RECEIVER_CAP_SIZE],

and then does

memcpy(args.dp.dpcd, dpcd, sizeof(args.dp.dpcd));

where that 'args.dp.dpcd' is a 16-byte array, and DP_RECEIVER_CAP_SIZE is 15.

So yeah, it's copying 16 bytes from an argument that claims to be 15
bytes in size.

I think that commit was wrong, and the problem is that the 'dpcd'
array is something 15 and sometimes 16. For example, we have

  struct nouveau_encoder {
...
union {
struct {
...
u8 dpcd[DP_RECEIVER_CAP_SIZE];
} dp;
};

so there it's indeed 15 bytes, but then we have

union nvif_outp_acquire_args {
struct nvif_outp_acquire_v0 {
...
union {
...
struct {
...
__u8 dpcd[16];
} dp;

where it's 16.

I think it's all entirely harmless from a code generation standpoint,
because the 15-byte field will be padded out to 16 bytes in the
structure that contains it, but it's most definitely buggy.

So that warning does find real cases of wrong code. But when those
real cases are hidden by hundreds of lines of unfixable false
positives, we don't have much choice.

But could the Nouveau driver *please* pick a size for the dhcp[] array
and stick with it?

The other driver where the warnings didn't look entirely crazy was the
ath/carl9170 wireless driver, but I didn't look closer at that one.

 Linus


Re: Linux 6.3-rc3

2023-03-22 Thread Linus Torvalds
On Wed, Mar 22, 2023 at 9:40 AM Sedat Dilek  wrote:
>
> You have to pass `make LLVM=1` in any case... to `oldconfig` or when
> adding any MAKEFLAGS like -j${number-of-available-cpus}.

I actually think we should look (again) at just making the compiler
choice (and the prefix) be a Kconfig option.

That would simplify *so* many use cases.

It used to be that gcc was "THE compiler" and anything else was just
an odd toy special case, but that's clearly not true any more.

So it would be lovely to make the kernel choice a Kconfig choice - so
you'd set it only at config time, and then after that a kernel build
wouldn't need special flags any more, and you'd never need to play
games with GNUmakefile or anything like that.

Yes, you'd still use environment variables (or make arguments) for
that initial Kconfig, but that's no different from the other
environment variables we already have, like KCONFIG_SEED that kconfig
uses internally, but also things like "$(ARCH)" that we already use
*inside* the Kconfig files themselves.

I really dislike how you have to set ARCH and CROSS_COMPILE etc
externally, and can't just have them *in* the config file.

So when you do cross-compiles, right now you have to do something like

make ARCH=i386 allmodconfig

to build the .config file, but then you have to *repeat* that
ARCH=i386 when you actually build things:

make ARCH=i386

because the ARCH choice ends up being in the .config file, but the
makefiles themselves always take it from the environment.

There are good historical reasons for our behavior (and probably a
number of extant practical reasons too), but it's a bit annoying, and
it would be lovely if we could start moving away from this model.

Linus


Re: Linux 6.3-rc3

2023-03-20 Thread Linus Torvalds
On Mon, Mar 20, 2023 at 3:06 PM Nathan Chancellor  wrote:
>
> Right, this seems like a subtle difference in semantics between
> -Wuninitialized between clang and GCC.

I guess it's a bit ambiguous whether it's

 "X may be USED uninitialized"

or whether it is

 "X may BE uninitialized"

and then depending on how you see that ambiguity, the control flow matters.

In this case, there is absolutely no question that the variable is
uninitialized (since there is no write to it at all).

So it is very clearly and unambiguously uninitialized. And I do think
that as a result, "-Wuninitialized" should warn.

But at the same time, whether it is *used* or not depends on that
conditional, so I can see how it could be confusing and not be so
clear an unambiguous.

On the whole, I do wish that the logic would be "after dead code
removal, if some pseudo has no initializer, it should always warn,
regardless of any remaining dynamic conditoinals".

That "after dead code removal" might matter, because I could see where
config things (#ifdef's etc) would just remove the initialization of
some variable, and if the use is behind some static "if (0)", then
warning about it is all kinds of silly.

 Linus


Re: Linux 6.3-rc3

2023-03-20 Thread Linus Torvalds
On Mon, Mar 20, 2023 at 1:05 PM Guenter Roeck  wrote:
>
> I have noticed that gcc doesn't always warn about uninitialized variables
> in most architectures.

Yeah, I'm getting the feeling that when the gcc people were trying to
make -Wmaybe-uninitialized work better (when moving it into "-Wall"),
they ended up moving a lot of "clearly uninitialized" cases into it.

So then because we disable the "maybe" case (with
-Wno-maybe-uninitialized) because it had too many random false
positives, we end up not seeing the obvious cases either.

  Linus


Re: Linux 6.3-rc3

2023-03-20 Thread Linus Torvalds
On Mon, Mar 20, 2023 at 11:56 AM Nathan Chancellor  wrote:
>
> I did see a patch fly by to fix that:
>
> https://lore.kernel.org/20230316082035.567520-3-christian.koe...@amd.com/
>
> It seems like the DRM_TEGRA half of it is broken though:
>
> https://lore.kernel.org/202303170635.a2rsq1wu-...@intel.com/

Hmm. x86-64 has 'vmap()' too.

So I think that DRM_TEGRA breakage is likely just due to a missing
header file include that then (by luck and mistake) gets included on
arm.

You need  for 'vmap()'.

There might be something else going on, I didn't look deeply at it.

   Linus


Re: Linux 6.3-rc3

2023-03-20 Thread Linus Torvalds
On Mon, Mar 20, 2023 at 11:26 AM Linus Torvalds
 wrote:
>
> Hmm. I do my arm64 allmodconfig builds with gcc, and I'm surprised
> that gcc doesn't warn about this.

Side note: I'm also wondering why that TEGRA_HOST1X config has that
ARM dependency in

depends on ARCH_TEGRA || (ARM && COMPILE_TEST)

because it seems to build just fine at least on x86-64 if I change it to be just

depends on ARCH_TEGRA || COMPILE_TEST

ie there seems to be nothing ARM-specific in there.

Limiting it to just the tegra platform by default makes 100% sense,
but that "only do compile-testing on ARM" seems a bit bogus.

That limit goes back to forever (commit 6f44c2b5280f: "gpu: host1x:
Increase compile test coverage" back in Nov 2013), so maybe things
didn't use to work as well back in the dark ages?

None of this explains why gcc didn't catch it, but at least allowing
the build on x86-64 would likely have made it easier for people to see
clang catching this.

   Linus


Re: Linux 6.3-rc3

2023-03-20 Thread Linus Torvalds
On Mon, Mar 20, 2023 at 11:05 AM Nathan Chancellor  wrote:
>
> On the clang front, I am still seeing the following warning turned error
> for arm64 allmodconfig at least:
>
>   drivers/gpu/host1x/dev.c:520:6: error: variable 'syncpt_irq' is 
> uninitialized when used here [-Werror,-Wuninitialized]
>   if (syncpt_irq < 0)
>   ^~

Hmm. I do my arm64 allmodconfig builds with gcc, and I'm surprised
that gcc doesn't warn about this.

That syncpt_irq thing isn't written to anywhere, so that's pretty egregious.

We use -Wno-maybe-uninitialized because gcc gets it so wrong, but
that's different from the "-Wuninitialized" thing (without the
"maybe").

I've seen gcc mess this up when there is one single assignment,
because then the SSA format makes it *so* easy to just use that
assignment out-of-order (or unconditionally), but this case looks
unusually clear-cut.

So the fact that gcc doesn't warn about it is outright odd.

> If that does not come to you through other means before -rc4, could you
> just apply it directly so that I can stop applying it to our CI? :)

Bah. I took it now, there's no excuse for that thing.

Do we have any gcc people around that could explain why gcc failed so
miserably at this trivial case?

   Linus


Re: [Intel-gfx] [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

2023-03-17 Thread Linus Torvalds
On Wed, Mar 15, 2023 at 5:22 PM Steven Rostedt  wrote:
>
> I hope that this gets in by -rc3, as I want to start basing my next branch
> on that tag.

My tree should have it now as commit c00133a9e87e ("drm/ttm: drop
extra ttm_bo_put in ttm_bo_cleanup_refs").

Linus


Re: [PATCH 0/2] docs & checkpatch: allow Closes tags with links

2023-03-16 Thread Linus Torvalds
On Thu, Mar 16, 2023 at 4:43 AM Matthieu Baerts
 wrote:
>
> @Linus: in short, we would like to continue using the "Closes:" tag (or
> similar, see below) with a URL in commit messages. They are useful to
> have public bug trackers doing automated actions like closing a specific
> ticket. Any objection from your side?

As long as it's a public link, I guess that just documents what the
drm people have been doing.

I'm not convinced "Closes" is actually any better than just "Link:",
though. I would very much hope and expect that the actual closing of
any bug report is actually done separately and verified, rather than
some kind of automated "well, the commit says it closes it, so.."

So honestly, I feel like "Link:" is just a better thing, and I worry
that "Closes:" is then going to be used for random internal crap.
We've very much seen people wanting to do that - having their own
private bug trackers, and then using the commit message to refer to
them, which I am *violently* against. If it's only useful to some
closed community, it shouldn't be in the public commits.

And while the current GPU people seem to use "Closes:" the right way
(and maybe some other groups do too - but it does seem to be mostly a
freedesktop thing), I really think it is amenable to mis-use in ways
"Link:" is not.

The point of "Link:" is explicitly two-fold:

 - it makes it quite obvious that you expect an actual valid web-link,
not some internal garbage

 - random people always want random extensions, and "Link:" is
_designed_ to counter-act that creeping "let's add a random new tag"
disease. It's very explicitly "any relevant link".

and I really question the value of adding new types of tags,
particularly ones that seem almost designed to be mis-used.

So I'm not violently against it, and 99% of the existing uses seem
fine. But I do note that some of the early "Closes:" tags in the
kernel were very much complete garbage, and exactly the kind of thing
that I absolutely detest.

What does

Closes: 10437

mean? That's crazy talk. (And yes, in that case it was a
kernel.bugzilla.org number, which is perfectly fine, but I'm using it
as a very real example of how "Closes:" ends up being very naturally
to mis-use).

End result: I don't hate our current "Closes:" uses. But I'm very wary of it.

I'm not at all convinced that it really adds a lot of value over
"Link:", and I am, _very_ aware of how easily it can be then taken to
be a "let's use our own bug tracker cookies here".

So I will neither endorse nor condemn it, but if I see people using it
wrong, I will absolutely put my foot down.

Linus


Re: drm next for 6.3-rc1

2023-02-24 Thread Linus Torvalds
On Fri, Feb 24, 2023 at 5:30 PM Dave Airlie  wrote:
>
> Any issues with this? I get nervous around 48hrs :-)

It was merged on Wednesday evening. See commit a5c95ca18a98.

If you were waiting for a pr-tracker-bot reply, I think you need to
put "{GIT PULL]" in the subject line for the automation to trigger on
it.

 Linus


Re: [PATCH RFC v7 00/23] DEPT(Dependency Tracker)

2023-01-16 Thread Linus Torvalds
[ Back from travel, so trying to make sense of this series.. ]

On Sun, Jan 8, 2023 at 7:33 PM Byungchul Park  wrote:
>
> I've been developing a tool for detecting deadlock possibilities by
> tracking wait/event rather than lock(?) acquisition order to try to
> cover all synchonization machanisms. It's done on v6.2-rc2.

Ugh. I hate how this adds random patterns like

if (timeout == MAX_SCHEDULE_TIMEOUT)
sdt_might_sleep_strong(NULL);
else
sdt_might_sleep_strong_timeout(NULL);
   ...
sdt_might_sleep_finish();

to various places, it seems so very odd and unmaintainable.

I also recall this giving a fair amount of false positives, are they all fixed?

Anyway, I'd really like the lockdep people to comment and be involved.
We did have a fairly recent case of "lockdep doesn't track page lock
dependencies because it fundamentally cannot" issue, so DEPT might fix
those kinds of missing dependency analysis. See

https://lore.kernel.org/lkml/60d41f05f139a...@google.com/

for some context to that one, but at teh same time I would *really*
want the lockdep people more involved and acking this work.

Maybe I missed the email where you reported on things DEPT has found
(and on the lack of false positives)?

   Linus


Re: [git pull] drm for 6.2-rc1

2022-12-14 Thread Linus Torvalds
On Wed, Dec 14, 2022 at 12:05 AM Christian König
 wrote:
>
> Anyway we need to re-apply b09d6acba1d9 which should be trivial.

Note that my resolution did exactly that (*), it's just that when I
double-checked against Dave's suggested merge that I noticed I'd done
things differently than he did.

(*) Well, when I say "did exactly that" I don't actually know some of
the other fencing changes that happened, so there may be a reason why
something further should still be done.  So I can only point to my
merge commit a594533df0f6 and ask people to verify.

It does all at least work for me. Knock wood.

  Linus


Re: [git pull] drm for 6.2-rc1

2022-12-13 Thread Linus Torvalds
On Mon, Dec 12, 2022 at 6:56 PM Dave Airlie  wrote:
>
> There are a bunch of conflicts, one in amdgpu is a bit nasty, I've
> cc'ed Christian/Alex to make sure they know to check whatever
> resolution you find. The one I have is what we have in drm-tip tree.

Hmm. My merge resolution is slightly different from yours.

You seem to have basically dropped commit b09d6acba1d9 ("drm/amdgpu:
handle gang submit before VMID").

Now, there are other fence changes in the drm tree that may mean that
that commit *should* be dropped, so it's entirely possible that my
resolution which kept that ordering change might be wrong and your
resolution that just took the drm tip code is the right one.

Christian? Alex? Can you please double-check what I just pushed out?

Linus


Re: [PATCH RFC 16/19] mm/frame-vector: remove FOLL_FORCE usage

2022-11-22 Thread Linus Torvalds
On Tue, Nov 22, 2022 at 4:25 AM Hans Verkuil  wrote:
>
> I tracked the use of 'force' all the way back to the first git commit
> (2.6.12-rc1) in the very old video-buf.c. So it is very, very old and the
> reason is lost in the mists of time.

Well, not entirely.

For archaeology reasons, I went back to the old BK history, which
exists as a git conversion in

https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/

and there you can actually find it.

Not with a lot of explanations, though - it's commit b7649ef789
("[PATCH] videobuf update"):

This updates the video-buf.c module (helper module for video buffer
management).  Some memory management fixes, also some adaptions to the
final v4l2 api.

but it went from

 err = get_user_pages(current,current->mm,
-data, dma->nr_pages,
-rw == READ, 0, /* don't force */
+data & PAGE_MASK, dma->nr_pages,
+rw == READ, 1, /* force */
 dma->pages, NULL);

in that commit.

So it goes back to October 2002.

> Looking at this old LWN article https://lwn.net/Articles/28548/ suggests
> that it might be related to calling get_user_pages for write buffers

The timing roughly matches.

> I assume that removing FOLL_FORCE from 'FOLL_FORCE|FOLL_WRITE' will still
> allow drivers to read from the buffer?

The issue with some of the driver hacks has been that

 - they only want to read, and the buffer may be read-only

 - they then used FOLL_WRITE despite that, because they want to break
COW (due to the issue that David is now fixing with his series)

 - but that means that the VM layer says "nope, you can't write to
this read-only user mapping"

 - ... and then they use FOLL_FORCE to say "yes, I can".

iOW, the FOLL_FORCE may be entirely due to an (incorrect, but
historically needed) FOLL_WRITE.

 Linus


Re: [PATCH mm-unstable v1 20/20] mm: rename FOLL_FORCE to FOLL_PTRACE

2022-11-17 Thread Linus Torvalds
On Thu, Nov 17, 2022 at 2:58 PM Kees Cook  wrote:
>
> Oh, er, why does get_arg_page() even need FOLL_FORCE? This is writing the
> new stack contents to the nascent brpm->vma, which was newly allocated
> with VM_STACK_FLAGS, which an arch can override, but they all appear to 
> include
> VM_WRITE | VM_MAYWRITE.

Yeah, it does seem entirely superfluous.

It's been there since the very beginning (although in that original
commit b6a2fea39318 it was there as a '1' to the 'force' argument to
get_user_pages()).

I *think* it can be just removed. But as long as it exists, it should
most definitely not be renamed to FOLL_PTRACE.

There's a slight worry that it currently hides some other setup issue
that makes it matter, since it's been that way so long, but I can't
see what it is.

 Linus


Re: [PATCH mm-unstable v1 20/20] mm: rename FOLL_FORCE to FOLL_PTRACE

2022-11-16 Thread Linus Torvalds
On Wed, Nov 16, 2022 at 2:30 AM David Hildenbrand  wrote:
>
> Let's make it clearer that functionality provided by FOLL_FORCE is
> really only for ptrace access.

I'm not super-happy about this one.

I do understand the "let's rename the bit so that no new user shows up".

And it's true that the main traditional use is ptrace.

But from the patch itself it becomes obvious that no, it's not *just*
ptrace. At least not yet.

It's used for get_arg_page(), which uses it to basically look up (and
install) pages in the newly created VM.

Now, I'm not entirely sure why it even uses FOLL_FORCE, - I think it
might be historical, because the target should always be the new stack
vma.

Following the history of it is a big of a mess, because there's a
number of renamings and re-organizations, but it seems to go back to
2007 and commit b6a2fea39318 ("mm: variable length argument support").

Before that commit, we kept our own array of "this is the set of pages
that I will install in the new VM". That commit basically just inserts
the pages directly into the VM instead, getting rid of the array size
limitation.

So at a minimum, I think that FOLL_FORCE would need to be removed
before any renaming to FOLL_PTRACE, because that's not some kind of
small random case.

It *might* be as simple as just removing it, but maybe there's some
reason for having it that I don't immediately see.

There _are_ also small random cases too, like get_cmdline(). Maybe
that counts as ptrace, but the execve() case most definitely does not.

Linus


Re: [PATCH RFC 00/19] mm/gup: remove FOLL_FORCE usage from drivers (reliable R/O long-term pinning)

2022-11-07 Thread Linus Torvalds
On Mon, Nov 7, 2022 at 8:18 AM David Hildenbrand  wrote:
>
> So instead, make R/O long-term pinning work as expected, by breaking COW
> in a COW mapping early, such that we can remove any FOLL_FORCE usage from
> drivers.

Nothing makes me unhappy from a quick scan through these patches.

And I'd really love to just have this long saga ended, and FOLL_FORCE
finally relegated to purely ptrace accesses.

So an enthusiastic Ack from me.

   Linus


Re: [PATCH v4a 00/38] timers: Use timer_shutdown*() before freeing timers

2022-11-05 Thread Linus Torvalds
On Sat, Nov 5, 2022 at 2:03 PM Jason A. Donenfeld  wrote:
>
> Something that might help here is changing the `...` into
> `... when exists` or into `... when != ptr` or similar.

I actually tried that.

You don't want "when exists", you'd want "when forall", but that seems
to be the default.

And trying "when != ptr->timer" actually does the right thing in that
it gets rid of the case where the timer is modified outside of the
del_timer() case, *but* it also causes odd other changes to the
output.

Look at what it generates for that

   drivers/media/usb/pvrusb2/pvrusb2-hdw.c

file, which finds a lot of triggers with the "when !=  ptr->timer",
but only does one without it.

So I gave up, just because I clearly don't understand the rules.

(Comparing output is also fun because the ordering of the patches is
random, so consecutive runs with the same rule will give different
patches. I assume that it's just because it's done in parallel, but it
doesn't help the "try to see what changes when you change the script"
;)

 Linus


Re: [PATCH v4a 00/38] timers: Use timer_shutdown*() before freeing timers

2022-11-05 Thread Linus Torvalds
On Sat, Nov 5, 2022 at 11:04 AM Steven Rostedt  wrote:
>
> Here's the changes I made after running the script

Please. No.

What part of "I don't want extra crud" was I unclear on?

I'm not interested in converting everything. That's clearly a 6.,2
issue, possibly even longer considering how complicated the networking
side has been.

I'm not AT ALL interested in "oh, I then added my own small cleanups
on top to random files because I happened to notice them".

Repeat after me: "If the script didn't catch them, they weren't
trivially obvious".

And it does seem that right now the script itself is a bit too
generous, which is why it didn't notice that sometimes there wasn't a
kfree after all because of a goto around it. So clearly that "..."
doesn't really work, I think it accepts "_any_ path leads to the
second situation" rather than "_all_ paths lead to the second
situation".

But yeah, my coccinelle-foo is very weak too, and maybe there's no
pattern for "no flow control".

I would also like the coccinelle script to notice the "timer is used
afterwards", so that it does *not* modify that case that does

del_timer(>timer);
dch->timer.function = NULL;

since now the timer is modified in between the del_timer() and the kfree.

Again, that timer modification is then made pointless by changing the
del_timer() to a "timer_shutdown()", but at that point it is no longer
a "so obvious non-semantic change that it should be scripted". At that
point it's a manual thing.

So I think the "..." in your script should be "no flow control, and no
access to the timer", but do not know how to do that in coccinelle.

Julia?

And this thread has way too many participants, I suspect some email
systems will just mark it as spam as a result. Which is partly *why* I
would like to get rid of noisy changes that really don't matter - but
I would like it to be truly mindlessly obvious that there are *zero*
questions about it, and absolutely no manual intervention because the
patch is so strict that it's just unquestionably correct.

  Linus


Re: [PATCH v4a 00/38] timers: Use timer_shutdown*() before freeing timers

2022-11-05 Thread Linus Torvalds
On Fri, Nov 4, 2022 at 11:01 PM Steven Rostedt  wrote:
>
> Patch 1 fixes an issue with sunrpc/xprt where it incorrectly uses
> del_singleshot_timer_sync() for something that is not a oneshot timer. As this
> will be converted to shutdown, this needs to be fixed first.

So this is the kind of thing that I would *not* want to get eartly.

I really would want to get just the infrastructure in to let people
start doing conversions.

And then the "mindlessly obvious patches that are done by scripting
and can not possibly matter".

The kinds that do not *need* review, because they are mechanical, and
that just cause pointless noise for the rest of the patches that *do*
want review.

Not this kind of thing that is so subtle that you have to explain it.
That's not a "scripted patch for no semantic change".

So leave the del_singleshot_timer_sync() cases alone, they are
irrelevant for the new infrastructure and for the "mindless scripted
conversion" patches.

> Patches 2-4 changes existing timer_shutdown() functions used locally in ARM 
> and
> some drivers to better namespace names.

Ok, these are relevant.

> Patch 5 implements the new timer_shutdown() and timer_shutdown_sync() 
> functions
> that disable re-arming the timer after they are called.

This is obviously what I'd want early so that people can start doign
this in their trees.

> Patches 6-28 change all the locations where there's a kfree(), kfree_rcu(),
> kmem_cache_free() and one call_rcu() call where the RCU function frees the
> timer (the workqueue patch) in the same function as the del_timer{,_sync}() is
> called on that timer, and there's no extra exit path between the del_timer and
> freeing of the timer.

So honestly, I was literally hoping for a "this is the coccinelle
script" kind of patch.

Now there seems to be a number of patches here that are actualyl
really hard to see that they are "obviously correct" and I can't tell
if they are actually scripted or not.

They don't *look* scripted, but I can't really tell.  I looked at the
patches with ten lines of context, and I didn't see the immediately
following kfree() even in that expanded patch context, so it's fairly
far away.

Others in the series were *definitely* not scripted, doing clearly
manual cleanups:

-if (dch->timer.function) {
-del_timer(>timer);
-dch->timer.function = NULL;
-}
+timer_shutdown(>timer);

so no, this does *not* make me feel "ok, this is all trivial".

IOW, I'd really want *just* the infrastructure and *just* the provably
trivial stuff. If it wasn't some scripted really obvious thing that
cannot possibly change anything and that wasn't then edited manually
for some reason, I really don't want it early.

IOW, any early conversions I'd take are literally about removing pure
mindless noise. Not about doing conversions.

And I wouldn't mind it as a single conversion patch that has the
coccinelle script as the explanation.

Really just THAT kind of "100% mindless conversion".

   Linus


Re: [RFC][PATCH v3 00/33] timers: Use timer_shutdown*() before freeing timers

2022-11-04 Thread Linus Torvalds
On Fri, Nov 4, 2022 at 12:42 PM Steven Rostedt  wrote:
>
> Linus, should I also add any patches that has already been acked by the
> respective maintainer?

No, I'd prefer to keep only the ones that are 100% unambiguously not
changing any semantics.

  Linus


Re: [RFC][PATCH v3 00/33] timers: Use timer_shutdown*() before freeing timers

2022-11-04 Thread Linus Torvalds
On Thu, Nov 3, 2022 at 10:48 PM Steven Rostedt  wrote:
>
> Ideally, I would have the first patch go into this rc cycle, which is mostly
> non functional as it will allow the other patches to come in via the 
> respective
> subsystems in the next merge window.

Ack.

I also wonder if we could do the completely trivially correct
conversions immediately.

I'm talking about the scripted ones where it's currently a
"del_timer_sync()", and the very next action is freeing whatever data
structure the timer is in (possibly with something like free_irq() in
between - my point is that there's an unconditional free that is very
clear and unambiguous), so that there is absolutely no question about
whether they should use "timer_shutdown_sync()" or not.

IOW, things like patches 03, 17 and 31, and at least parts others in
this series.

This series clearly has several much more complex cases that need
actual real code review, and I think it would help to have the
completely unambiguous cases out of the way, just to get rid of noise.

So I'd take that first patch, and a scripted set of "this cannot
change any semantics" patches early.

Linus


Re: [PATCH] drm/amd/display: Fix build breakage with CONFIG_DEBUG_FS=n

2022-10-14 Thread Linus Torvalds
On Fri, Oct 14, 2022 at 8:22 AM Nathan Chancellor  wrote:
>
> After commit 8799c0be89eb ("drm/amd/display: Fix vblank refcount in vrr
> transition"), a build with CONFIG_DEBUG_FS=n is broken due to a
> misplaced brace, along the lines of:

Thanks, applied.

  Linus


Re: [git pull] drm fixes for 6.1-rc1

2022-10-13 Thread Linus Torvalds
On Thu, Oct 13, 2022 at 5:29 PM Dave Airlie  wrote:
>
> Round of fixes for the merge window stuff, bunch of amdgpu and i915
> changes, this should have the gcc11 warning fix, amongst other
> changes.

Some of those amd changes aren't "fixes". They are some major code changes.

We're still in the merge window, so I'm letting it slide, but calling
then "fixes" really stretches things. They are fixes exactly the same
way completely new development can "fix" things.

  Linus


Re: mainline build failure due to 5d8c3e836fc2 ("drm/amd/display: fix array-bounds error in dc_stream_remove_writeback()")

2022-10-06 Thread Linus Torvalds
On Thu, Oct 6, 2022 at 1:50 PM Sudip Mukherjee
 wrote:
>
> > And it looks like Sudip's proposed fix for this particular code is
> > additionally fixing unsigned vs signed as well. I think -Warray-bounds
> > did its job (though, with quite a confusing index range in the report).
>
> Not my. Linus's. I just tested. :)

I suspect Kees meant Stephen's other patch that Hamza pointed at, and
that is perhaps the cleaner version.

That said, I hate how this forces us to write random code changes just
to make a compiler just randomly _happen_ to not complain about it.

   Linus


Re: [git pull] drm for 6.1-rc1

2022-10-06 Thread Linus Torvalds
On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie  wrote:
>
>
> [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0088
> [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]

As far as I can tell, that's the line

struct drm_gpu_scheduler *sched = s_fence->sched;

where 's_fence' is NULL. The code is

   0: 0f 1f 44 00 00nopl   0x0(%rax,%rax,1)
   5: 41 54push   %r12
   7: 55push   %rbp
   8: 53push   %rbx
   9: 48 89 fb  mov%rdi,%rbx
   c:* 48 8b af 88 00 00 00 mov0x88(%rdi),%rbp <-- trapping instruction
  13: f0 ff 8d f0 00 00 00 lock decl 0xf0(%rbp)
  1a: 48 8b 85 80 01 00 00 mov0x180(%rbp),%rax

and that next 'lock decl' instruction would have been the

atomic_dec(>hw_rq_count);

at the top of drm_sched_job_done().

Now, as to *why* you'd have a NULL s_fence, it would seem that
drm_sched_job_cleanup() was called with an active job. Looking at that
code, it does

if (kref_read(>s_fence->finished.refcount)) {
/* drm_sched_job_arm() has been called */
dma_fence_put(>s_fence->finished);
...

but then it does

job->s_fence = NULL;

anyway, despite the job still being active. The logic of that kind of
"fake refcount" escapes me. The above looks fundamentally racy, not to
say pointless and wrong (a refcount is a _count_, not a flag, so there
could be multiple references to it, what says that you can just
decrement one of them and say "I'm done").

Now, _why_ any of that happens, I have no idea. I'm just looking at
the immediate "that pointer is NULL" thing, and reacting to what looks
like a completely bogus refcount pattern.

But that odd refcount pattern isn't new, so it's presumably some user
on the amd gpu side that changed.

The problem hasn't happened again for me, but that's not saying a lot,
since it was very random to begin with.

 Linus


Re: [git pull] drm for 6.1-rc1

2022-10-06 Thread Linus Torvalds
On Thu, Oct 6, 2022 at 12:28 PM Alex Deucher  wrote:
>
> Maybe you are seeing this which is an issue with GPU TLB flushes which
> is kind of sporadic:
> https://gitlab.freedesktop.org/drm/amd/-/issues/2113

Well, that seems to be 5.19, and while timing changes (or whatever
other software updates) could have made it start trigger, this machine
has been pretty solid otgerwise.

> Are you seeing any GPU page faults in your kernel log?

Nothing even remotely like that "no-retry page fault" in that issue
report. Of course, if it happens just before the whole thing locks
up...

   Linus


Re: [git pull] drm for 6.1-rc1

2022-10-06 Thread Linus Torvalds
On Thu, Oct 6, 2022 at 12:30 PM Dave Airlie  wrote:
>
> netconsole?

I've never been really successful with that in the past, and haven't
used it for decades. I guess I could try if nothing else works.

   Linus


Re: mainline build failure due to 5d8c3e836fc2 ("drm/amd/display: fix array-bounds error in dc_stream_remove_writeback()")

2022-10-06 Thread Linus Torvalds
On Thu, Oct 6, 2022 at 1:51 AM Sudip Mukherjee (Codethink)
 wrote:
>
> This is only seen with gcc-11, gcc-12 builds are ok.

Hmm. This seems to be some odd gcc issue.

I *think* that what is going on is that the test

j = 0 ; j < MAX_DWB_PIPES

makes gcc decide that "hey, j is in the range [0,MAX_DWB_PIPES[", and
then since MAX_DWB_PIPES is 1, that simplifies to "j must be zero".
Good range analysis so far.

But for 'i', we start off with that lower bound of 0, but the upper
bound is not fixed (the test for "i" is: "i < stream->num_wb_info").

And then that "if (i != j)", so now gcc decides that it can simplify
that to "if (i != 0)", and then simplifies *that* to "oh, the lower
bound of 'i' in that code is actually 1.

So then it decides that "stream->writeback_info[i]" must be out of bounds.

Of course, the *reality* is that stream->num_wb_info should be <=
MAX_DWB_PIPES, and as such (with the current MAX_DWB_PIPES value of 1)
it's not that 'i' can be 1, it's that the code in question cannot be
reached at all.

What confuses me is that error message ("array subscript [0, 0] is
outside array bounds of 'struct dc_writeback_info[1]') which seems to
be aware that the value is actually 0.

So this seems to be some gcc-11 range analysis bug, but I don't know
what the fix is. I suspect some random code change would magically
just make gcc realize it's ok after all, but since it all depends on
random gcc confusion, I don't know what the random code change would
be.

The fix *MAY* be to just add a '&& i < MAX_DWB_PIPES' to that loop
too, and then gcc will see that both i and j are always 0, and that
the code is unreachable and not warn about it. Hmm? Can you test that?

And the reason gcc-12 builds are ok probably isn't that gcc-12 gets
this right, it's simply that gcc-12 gets so many *opther* things wrong
that we already disabled -Warray-bounds with gcc-12 entirely.

If somebody cannot come up with a fix, I suspect the solution is "gcc
array bounds analysis is terminally buggy" and we just need to disable
it for gcc-11 too.

Kees, any idea? Who else might be interested in fixing a -Warray-bounds issue?

 Linus


Re: [git pull] drm for 6.1-rc1

2022-10-06 Thread Linus Torvalds
On Tue, Oct 4, 2022 at 8:42 PM Dave Airlie  wrote:
>
> Lots of stuff all over, some new AMD IP support and gang
> submit support [..]

Hmm.

I have now had my main desktop lock up twice after pulling this.
Nothing in the dmesg after a reboot, and nothing in particular that
seems to trigger it, so I have a hard time even guessing what's up,
but the drm changes are the primary suspect.

I will try to see if I can get any information out of the machine, but
with the symptom being just a dead machine ...

This is the same (old) Radeon device:

   49:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
[AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)

with dual 4k monitors, running on my good old Threadripper setup.

Again, there's no explicit reason to blame the drm pull, except that
it started after that merge (that machine ran the kernel with the
networking pull for a day with no problems, and while there were other
pull requests in between them, they seem to be fairly unrelated to the
hardware I have).

But the lockup is so sporadic (twice in the last day) that I really
can't bisect it, so I'm afraid I have very very little info.

Any suggestions?

  Linus


Re: [git pull] drm for 6.1-rc1

2022-10-05 Thread Linus Torvalds
On Tue, Oct 4, 2022 at 8:42 PM Dave Airlie  wrote:
>
> This is very conflict heavy, mostly the correct answer is picking
> the version from drm-next.

Ugh, yes, that was a bit annoying.

I get the same end result as you did, but I do wonder if the drm
people should try to keep some kind of separate "fixes" branches for
things that go both into the development tree and then get sent to me
for fixes pulls?

Hopefully this "lots of pointless noise" was a one-off, but it might
be due to how you guys work..

  Linus


Re: [PATCH v13 5/9] drm/i915: Check for integer truncation on scatterlist creation

2022-09-28 Thread Linus Torvalds
On Wed, Sep 28, 2022 at 1:15 AM Gwan-gyeong Mun
 wrote:
>
> +   if (check_assign(obj->base.size >> PAGE_SHIFT, ))
> +   return -E2BIG;

I have to say, I find that new "check_assign()" macro use to be disgusting.

It's one thing to check for overflows.

It's another thing entirely to just assign something to a local variable.

This disgusting "let's check and assign" needs to die. It makes the
code a completely unreadable mess. The "user" wersion is even worse.

If you worry about overflow, then use a mix of

 (a) use a sufficiently large type to begin with

 (b) check for value range separately

and in this particular case, I also suspect that the whole range check
should have been somewhere else entirely - at the original creation of
that "obj" structure, not at one random end-point where it is used.

In other words, THIS WHOLE PATCH is just end-points checking the size
requirements of that "base.size" thing much too late, when it should
have been checked originally for some "maximum acceptable base size"
instead.

And that "maximum acceptable base size" should *not* be about "this is
the size of the variables we use". It should be a sanity check of
"this value is sane and fits in sane use cases".

Because "let's plug security checks" is most definitely not about
picking random assignments and saying "let's check this one". It's
about trying to catch things earlier than that.

Kees, you need to reign in the craziness in overflow.h.

 Linus


Re: mainline build failure for x86_64 allmodconfig with clang

2022-08-07 Thread Linus Torvalds
On Sun, Aug 7, 2022 at 10:36 AM David Laight  wrote:
>
> Or just shoot the software engineer who thinks 100 arguments
> is sane. :-)

I suspect the issue is that it's not primarily a software engineer who
wrote that code.

Hardware people writing code are about as scary as software engineers
with a soldering iron.

   Linus


Re: mainline build failure for x86_64 allmodconfig with clang

2022-08-04 Thread Linus Torvalds
On Thu, Aug 4, 2022 at 1:43 PM Nathan Chancellor  wrote:
>
> I do note that commit 1b54a0121dba ("drm/amd/display: Reduce stack size
> in the mode support function") did have a workaround for GCC. It appears
> clang will still inline mode_support_configuration(). If I mark it as
> 'noinline', the warning disappears in that file.

That sounds like probably the best option for now. Gcc does not inline
that function (at least for allmodconfig builds in my testing), so if
that makes clang match what gcc does, it seems a reasonable thing to
do.

Linus


Re: mainline build failure for x86_64 allmodconfig with clang

2022-08-04 Thread Linus Torvalds
On Thu, Aug 4, 2022 at 11:37 AM Sudip Mukherjee (Codethink)
 wrote:
>
> git bisect points to 3876a8b5e241 ("drm/amd/display: Enable building new 
> display engine with KCOV enabled").

Ahh. So that was presumably why it was disabled before - because it
presumably does disgusting things that make KCOV generate even bigger
stack frames than it already has.

Those functions do seem to have fairly big stack footprints already (I
didn't try to look into why, I assume it's partly due to aggressive
inlining, and probably some automatic structures on stack). But gcc
doesn't seem to make it all that much worse with KCOV (and my clang
build doesn't enable KCOV).

So it's presumably some KCOV-vs-clang thing. Nathan?

  Linus


Re: [PATCH] drm/amd/display: restore plane with no modifiers code.

2022-08-04 Thread Linus Torvalds
On Wed, Aug 3, 2022 at 10:50 PM Dave Airlie  wrote:
>
> With this applied, I get gdm back.

I can confirm that it makes thing work again for me too. Applied.

 Linus


Re: mainline build failure due to 6fdd2077ec03 ("drm/amd/amdgpu: add memory training support for PSP_V13")

2022-08-04 Thread Linus Torvalds
On Thu, Aug 4, 2022 at 12:35 AM Sudip Mukherjee (Codethink)
 wrote:
>
> I will be happy to test any patch or provide any extra log if needed.

It sounds like that file just needs to get a

#include 

there, and for some reason architectures other than alpha and mips end
up getting it accidentally through other headers.

Mind testing just adding that header file, and perhaps even sending a
patch if (when) that works for you?

Linus


Re: [git pull] drm for 5.20/6.0

2022-08-03 Thread Linus Torvalds
On Wed, Aug 3, 2022 at 9:27 PM Linus Torvalds
 wrote:
>
> I'll do a few more. It's close enough already that it should be just
> four more reboots to pinpoint exactly which commit breaks.

commit 5d945cbcd4b16a29d6470a80dfb19738f9a4319f is the first bad commit.

I think it's supposed to make no semantic changes, but it clearly does.

What a pain to figure out what's wrong in there, and I assume it
doesn't revert cleanly either.

Bringing in the guilty parties. See

  
https://lore.kernel.org/all/CAHk-=wj+yzaunxiewhfcrkbdlsqkizdr1q3yjlaqpo6avq2...@mail.gmail.com/

for the beginning of this thread.

Linus


Re: [git pull] drm for 5.20/6.0

2022-08-03 Thread Linus Torvalds
On Wed, Aug 3, 2022 at 9:24 PM Dave Airlie  wrote:
>
> I've reproduced it, I'll send you a revert pile when I confirm it is
> the buddy allocator.

I've bisected it to 86bd6706c404..074293dd9f61 and don't see "buddy"
in any of those commits.

I'll do a few more. It's close enough already that it should be just
four more reboots to pinpoint exactly which commit breaks.

  Linus


Re: [git pull] drm for 5.20/6.0

2022-08-03 Thread Linus Torvalds
On Wed, Aug 3, 2022 at 8:53 PM Dave Airlie  wrote:
>
> > It works on my intel laptop, so it's amdgpu somewhere.
>
> I'll spin my ryzen up to see if I can reproduce, and test against the
> drm-next pre-merge tree as well.

So it's not my merge - I've had a bad result in the middle of the DRM
history too.

On a positive note, my arm64 machine works fine, but that's just using
fbdev so ...

But another datapoint to say that it's amdgpu-specific. Not that that
was really in doubt.

Linus


Re: [git pull] drm for 5.20/6.0

2022-08-03 Thread Linus Torvalds
On Wed, Aug 3, 2022 at 8:37 PM Dave Airlie  wrote:
>
> Actually I did miss that so that looks good.

.. I wish it did, but I just actually test-booted my desktop with the
result, and it crashes the X server.  This seems to be the splat in
Xorg.0.log:

  (II) Initializing extension DRI2
  (II) AMDGPU(0): Setting screen physical size to 2032 x 571
  (EE)
  (EE) Backtrace:
  (EE) 0: /usr/libexec/Xorg (OsLookupColor+0x13d) [0x55b1dc61258d]
  (EE) 1: /lib64/libc.so.6 (__sigaction+0x50) [0x7f7972a3ea70]
  (EE) 2: /usr/lib64/xorg/modules/drivers/amdgpu_drv.so
(AMDGPUCreateWindow_oneshot+0x101) [0x7f797207ddd1]
  (EE) 3: /usr/libexec/Xorg (compIsAlternateVisual+0xdc4) [0x55b1dc545fa4]
  (EE) 4: /usr/libexec/Xorg (InitRootWindow+0x17) [0x55b1dc4e0047]
  (EE) 5: /usr/libexec/Xorg (miPutImage+0xd4c) [0x55b1dc49e60b]
  (EE) 6: /lib64/libc.so.6 (__libc_start_call_main+0x80) [0x7f7972a29550]
  (EE) 7: /lib64/libc.so.6 (__libc_start_main+0x89) [0x7f7972a29609]
  (EE) 8: /usr/libexec/Xorg (_start+0x25) [0x55b1dc49f2c5]
  (EE)
  (EE) Segmentation fault at address 0x4
  (EE)
Fatal server error:
  (EE) Caught signal 11 (Segmentation fault). Server aborting

so something is going horribly wrong. No kernel oops, though.

It works on my intel laptop, so it's amdgpu somewhere.

I guess I will start bisecting. Oy vey.

 Linus


Re: [git pull] drm for 5.20/6.0

2022-08-03 Thread Linus Torvalds
On Wed, Aug 3, 2022 at 7:46 PM Linus Torvalds
 wrote:
>
> I think I have it resolved, am still doing a full build test, and will
> then compare against what your suggested merge is.

Hmm.

I end up with *almost* the same thing.

Except I ended up with a

select DRM_BUDDY

for the DRM_AMDGPU config entry, and you don't have that.

I *think* my version is correct, in that clearly the amdgpu driver now
uses that buddy logic (just doing a random "grep drm_buddy_block" to
see).

But this was messy enough to resolve that I think people should
double-check my end, and maybe I just got confused at some point in
the process.

And while I seem to have gotten the same result as you did on the i915
firmware side too, again, I'd like people to re-verify.

   Linus


Re: [git pull] drm for 5.20/6.0

2022-08-03 Thread Linus Torvalds
On Tue, Aug 2, 2022 at 10:38 PM Dave Airlie  wrote:
>
> This is a conflicty one. The late revert in 5.19 of the amdgpu buddy
> allocator causes major conflict fallout. The buddy allocator code in
> this one works, so the resolutions are usually just to take stuff from
> this. It might actually be cleaner if you revert
> 925b6e59138cefa47275c67891c65d48d3266d57 (Revert "drm/amdgpu: add drm
> buddy support to amdgpu") first in your tree then merge this.

Ugh, what a pain. The other conflicts are also due to just randomly
duplicated commits, with *usually* your drm tree having the superset
(so "just take yours" is the easy resolution), but not always (ie the
Intel firmware-69 mess was apparently not dealt with in the
development tree).

Nasty.

I think I have it resolved, am still doing a full build test, and will
then compare against what your suggested merge is.

  Linus


Re: [git pull] drm fixes for 5.19-rc7

2022-07-16 Thread Linus Torvalds
On Sat, Jul 16, 2022 at 2:35 PM Linus Torvalds
 wrote:
>
> That said, even those type simplifications do not fix the fundamental
> issue. That "DIV_ROUND_UP()" still ends up being a 64-bit divide,
> although now it's at least a "64-by-32" bit divide.

Hmm. The "DIV_ROUND_UP()" issue could be solved by just making the
rule be that the max_segment size is always a power of two.

Then you don't need the (expensive!) DIV_ROUND_UP(), and can just use
the regular "round_up()" that works on powers-of-two.

And the simplest way to do that is to just make "max_segments" be 2GB.

The whole "round_down(UINT_MAX, page_alignment)" seems entirely
pointless. Do you really want segments that are some odd number just
under the 4GB mark, and force expensive divides?

For consistency, I used the same value in
i915_rsgt_from_buddy_resource(). I have no idea if that makes sense.

Anyway, the attached patch is COMPLETELY UNTESTED. But it at least
seems to compile. Maybe.

  Linus
 drivers/gpu/drm/i915/i915_scatterlist.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_scatterlist.c b/drivers/gpu/drm/i915/i915_scatterlist.c
index f63b50b71e10..830dcd833d4e 100644
--- a/drivers/gpu/drm/i915/i915_scatterlist.c
+++ b/drivers/gpu/drm/i915/i915_scatterlist.c
@@ -81,7 +81,9 @@ struct i915_refct_sgt *i915_rsgt_from_mm_node(const struct drm_mm_node *node,
 	  u64 region_start,
 	  u64 page_alignment)
 {
-	const u64 max_segment = round_down(UINT_MAX, page_alignment);
+	// Make sure max_segment (and thus segment_pages) is
+	// a power of two that fits in 32 bits.
+	const u64 max_segment = 1ul << 31;
 	u64 segment_pages = max_segment >> PAGE_SHIFT;
 	u64 block_size, offset, prev_end;
 	struct i915_refct_sgt *rsgt;
@@ -96,7 +98,7 @@ struct i915_refct_sgt *i915_rsgt_from_mm_node(const struct drm_mm_node *node,
 
 	i915_refct_sgt_init(rsgt, node->size << PAGE_SHIFT);
 	st = >table;
-	if (sg_alloc_table(st, DIV_ROUND_UP(node->size, segment_pages),
+	if (sg_alloc_table(st, round_up(node->size, segment_pages),
 			   GFP_KERNEL)) {
 		i915_refct_sgt_put(rsgt);
 		return ERR_PTR(-ENOMEM);
@@ -159,7 +161,7 @@ struct i915_refct_sgt *i915_rsgt_from_buddy_resource(struct ttm_resource *res,
 {
 	struct i915_ttm_buddy_resource *bman_res = to_ttm_buddy_resource(res);
 	const u64 size = res->num_pages << PAGE_SHIFT;
-	const u64 max_segment = round_down(UINT_MAX, page_alignment);
+	const u64 max_segment = 1u << 31;
 	struct drm_buddy *mm = bman_res->mm;
 	struct list_head *blocks = _res->blocks;
 	struct drm_buddy_block *block;


Re: [git pull] drm fixes for 5.19-rc7

2022-07-16 Thread Linus Torvalds
On Fri, Jul 15, 2022 at 2:09 PM Nathan Chancellor  wrote:
>
> On Fri, Jul 15, 2022 at 01:36:17PM +1000, Dave Airlie wrote:
> > Matthew Auld (1):
> >   drm/i915/ttm: fix sg_table construction
>
> This patch breaks i386_defconfig with both GCC and clang:
>
>   ld: drivers/gpu/drm/i915/i915_scatterlist.o: in function 
> `i915_rsgt_from_mm_node':
>   i915_scatterlist.c:(.text+0x1a7): undefined reference to `__udivdi3'

Yeah, we definitely don't want arbitrary 64x64 divides in the kernel,
and the fact that we don't include libgcc.a once again caught this
horrid issue.

The offending code is

if (sg_alloc_table(st, DIV_ROUND_UP(node->size, segment_pages),
   GFP_KERNEL)) {

and I have to say that all of those games with "u64 page_alignment"
that commit aff1e0b09b54 ("drm/i915/ttm: fix sg_table construction")
introduces are absolutely disgusting.

And they are just *pointlessly* disgusting.

Why is that "page_alignment" a "u64"?

And why is it a "size", instead of being a "number of bits"?

The code literally does things like

const u64 max_segment = round_down(UINT_MAX, page_alignment);

which means that

 (a) page_alignment must be a power-of-two for this to work
(round_down() only works in powers of two)

 (b) the result obviously must fit in an "unsigned int", since it's
rounding down a UINT_MAX!

So (a) makes it doubtful that "page_alignment" should have been a
value (as opposed to mask), and (b) then questions why was that made
an "u64" value when it cannot have a u64 range?

And if max_segments cannot have a 64-bit range, then segment_pages here:

u64 segment_pages = max_segment >> PAGE_SHIFT;

sure cannot.

Fixing those then uncovers other things:

len = min(block_size, max_segment - sg->length);

now complains about mixing types ("max_segment - sg->length" being
u32), because 'block_size' is 64, bit, and that does seem to make some
amount of sense:

block_size = node->size << PAGE_SHIFT;

with the 'node->size' being from drm_mm_node, and that size is a
'u64'. That I *could* see being more than 32 bits on a 64-bit
architecture. Ok.

But then that means that 'len' cannot be a 64-bit value either, and it
should probably have been

u32 len;
..
len = min_t(u64, block_size, max_segment - sg->length);

and that would just have been a lot nicer on 32-bit x86, avoiding a
lot of pointlessly 64-bit things.

That said, even those type simplifications do not fix the fundamental
issue. That "DIV_ROUND_UP()" still ends up being a 64-bit divide,
although now it's at least a "64-by-32" bit divide.

Which needs to be handled by "do_div()" rather than the generic
DIV_ROUND_UP() helper, because sadly, at least gcc still generates a
full __udivdi3() even for the 64-by-32 divides.

Can Intel GPU people please take a look?

 Linus


Re: Annoying AMDGPU boot-time warning due to simplefb / amdgpu resource clash

2022-06-27 Thread Linus Torvalds
On Mon, Jun 27, 2022 at 1:02 AM Javier Martinez Canillas
 wrote:
>
> The flag was dropped because it was causing drivers that requested their
> memory resource with pci_request_region() to fail with -EBUSY (e.g: the
> vmwgfx driver):
>
> https://www.spinics.net/lists/dri-devel/msg329672.html

See, *that* link would have been useful in the commit.

Rather than the useless link it has.

Anyway, removing the busy bit just made things worse.

> > If simplefb is actually still using that frame buffer, it's a problem.
> > If it isn't, then maybe that resource should have been released?
>
> It's supposed to be released once amdgpu asks for conflicting framebuffers
> to be removed calling drm_aperture_remove_conflicting_pci_framebuffers().

That most definitely doesn't happen. This is on a running system:

  [torvalds@ryzen linux]$ cat /proc/iomem | grep BOOTFB
- : BOOTFB

so I suspect that the BUSY bit was never the problem - even for
vmwgfx). The problem was that simplefb doesn't remove its resource.

Guys, the *reason* for resource management is to catch people that
trample over each other's resources.

You literally basically disabled the code that checked for it by
removing the BUSY flag, and just continued to have conflicting
resources.

That isn't a "fix", that is literally "we are ignoring and breaking
the whole reason that the resource tree exists, but we'll still use it
for no good reason".

Yeah, yeah, most modern drivers ignore the IO resource tree, because
they end up working on another resource level entirely: they work on
not the IO resources, but on the "driver level" instead, and just
attach to PCI devices.

So these days, few enough drivers even care about the IO resource
tree, and it's mostly used for (a) legacy devices (think ISA) and (b)
the actual bus resource handling (so the PCI code itself uses it to
sort out resource use and avoid conflicts, but PCI drivers themselves
generally then don't care, because the bus has "taken care of it".

So that's why the amdgpu driver itself doesn't care about resource
allocations, and we only get a warning for that memory type case, not
for any deeper resource case.

And apparently the vmwgfx driver still uses that legacy "let's claim
all PCI resources in the resource tree" instead of just claiming the
device itself. Which is why it hit this whole BOOTFB resource thing
even harder.

But the real bug is that BOOTFB seems to claim this resource even
after it is done with it and other drivers want to take over.

Not the BUSY bit.

 Linus


Annoying AMDGPU boot-time warning due to simplefb / amdgpu resource clash

2022-06-26 Thread Linus Torvalds
So this has been going on for a while, and it's quite annoying.

At bootup, my main desktop (Threadripper 3970X with radeon graphics)
now complains about

  resource sanity check: requesting [mem 0xd000-0xdfff], which
spans more than BOOTFB [mem 0xd000-0xd02f]

and then gives me a nasty callchain that is basically the amdgpu probe
sequence ending in amdgpu_bo_init() doing the
arch_io_reserve_memtype_wc() which is then what complains.

That "BOOTFB" resource is from sysfb_simplefb.c, and I think what
started triggering this is commit c96898342c38 ("drivers/firmware:
Don't mark as busy the simple-framebuffer IO resource").

Because it turns out that that removed the IORESOURCE_BUSY, which in
turn is what makes the resource conflict code complain about it now,
because

/*
 * if a resource is "BUSY", it's not a hardware resource
 * but a driver mapping of such a resource; we don't want
 * to warn for those; some drivers legitimately map only
 * partial hardware resources. (example: vesafb)
 */

so the issue is that now the resource code - correctly - says "hey,
you have *two* conflicting driver mappings".

And that commit claims it did it because "which can lead to drivers
requesting the same memory resource to fail", but - once again - the
link in the commit that might actually tell more is just one of those
useless patch submission links again.

So who knows why that commit was actually done, but it's causing annoyance.

If simplefb is actually still using that frame buffer, it's a problem.
If it isn't, then maybe that resource should have been released?

I really think that commit c96898342c38 is buggy. It talks about "let
drivers to request it as busy instead", but then it registers a
resource that isn't actually a proper real resource. It's just a
random incomplete chunk of the actual real thing, so it will still
interfere with resource allocation, and in fact now interferes even
with that "set memtype" because of this valid warning.

 Linus


Re: [git pull] drm for 5.19-rc1

2022-06-07 Thread Linus Torvalds
On Tue, Jun 7, 2022 at 3:23 AM Geert Uytterhoeven  wrote:
>
> These header files are heavy users of large constants lacking the "U"
> suffix e.g.:
>
> #define NB_ADAPTER_ID__SUBSYSTEM_ID_MASK 0xL

As Andreas says, this is not undefined behavior.

A hexadecimal integer constant will always get a type that fits the
actual value. So on a 32-bit architecture, because 0x doesn't
fit in 'long', it will automatically become 'unsigned long'.

Now, a C compiler might still warn about such implicit type
conversions, but I'd be a bit surprised if any version of gcc actually
would do that, because this behavior for hex constants is *very*
traditional, and very common.

It's also true that the type of the constant - but not the value -
will be different on 32-bit and 64-bit architectures (ie on 64-bit, it
will be plain "long" and never extended to "unsigned long", because
the hex value obviously fits just fine).

I don't see any normal situation where that really matters, since any
normal use will have the same result.

The case you point to at

  
https://lore.kernel.org/r/cak8p3a0qrihbr_2fq7uz5w2jmljv7czfrrarcmmjohvndj3...@mail.gmail.com

is very different, because the constant "1" is always just a plain
signed "int". So when you do "(1 << 31)", that is now a signed integer
with the top bit set, and so it will have an actual negative value,
and that can cause various problems (when right-shifted, or when
compared to other values).

But hexadecimal constants can be signed types, but they never have
negative values.

 Linus


Re: mainline build failure due to f1e4c916f97f ("drm/edid: add EDID block count and size helpers")

2022-06-01 Thread Linus Torvalds
On Wed, Jun 1, 2022 at 3:28 PM Keisuke Nishimura
 wrote:
>
>
> I found 13 definitions of packed structure that contains:
>  > - spinlock_t
>  > - atomic_t
>  > - dma_addr_t
>  > - phys_addr_t
>  > - size_t
>  > - struct mutex
>  > - struct device
>
> - raw_spinlock_t

Ok, so I don't think dma_addr_t/phys_addr_t/size_t are problematic,
they are just regular integers.

And 'struct device' is problematic only as it then contains any of the
atomic types (which it does)

> security/tomoyo/common.h: atomic_t in tomoyo_shared_acl_head
> drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h: spinlock_t in 
> key_map
> arch/s390/include/asm/kvm_host.h: atomic_t in kvm_s390_sie_block

So these do look problematic.

I'm not actually clear on why tomoyo_shared_acl_head would be packed.
That makes no sense to me.

Same goes for key_map, it's not clear what the reason for that
__packed is, and it's clearly bogus. It might work, almost by mistake,
but it's wrong to try to pack that spinlock_t.

The s390 kvm use actually looks fine: the structure is packed, but
it's also aligned, and the spin-lock is at the beginning, so the
"packing" part is about the other members, not the first one.

The two that look wrong look like they will probably work anyway
(they'll presumably be effectively word-aligned, and that's sufficient
for spinlocks in practice).

But let's cc the tomoyo and chelsio people.

 Linus


Re: mainline build failure due to f1e4c916f97f ("drm/edid: add EDID block count and size helpers")

2022-05-31 Thread Linus Torvalds
On Tue, May 31, 2022 at 1:04 AM Arnd Bergmann  wrote:
>
> As an experiment: what kind of results would we get when looking
> for packed structures and unions that contain any of these:

Yeah, any atomics or locks should always be aligned, and won't even
work (or might be *very* slow) on multiple architectures. Even x86 -
which does very well on unaligned data - reacts badly to sufficiently
unaligned atomics (ie cacheline crossing).

I don't think we have that. Not only because it would already cause
breakage, but simply because the kinds of structures that people pack
aren't generally the kind that contain these kinds of things.

That said, you might have a struct that is packed, but that
intentionally aligns parts of itself, so it *could* be valid.

But it would probably not be a bad idea to check that packed
structures/unions don't have atomic types or locks in them. I _think_
we're all good, but who knows..

Linus


Re: mainline build failure due to f1e4c916f97f ("drm/edid: add EDID block count and size helpers")

2022-05-28 Thread Linus Torvalds
On Sat, May 28, 2022 at 11:59 AM Arnd Bergmann  wrote:
>
> It's CONFIG_ARM_AEABI, which is normally set everywhere. Without this
> option, you the kernel is built for the old 'OABI' that forces all non-packed
> struct members to be at least 16-bit aligned.

Looks like forced word (32 bit) alignment to me.

I wonder how many other structures that messes up, but I committed the
EDID fix for now.

This has presumably been broken for a long time, but maybe the
affected targets don't typically use EDID and kernel modesetting, and
only use some fixed display setup instead.

Those structure definitions go back a _loong_ time (from a quick 'git
blame' I see November 2008).

But despite that, I did not mark my fix 'cc:stable' because I don't
know if any of those machines affected by this bad arm ABI issue could
possibly care.

At least my tree hopefully now builds on them, with the BUILD_BUG_ON()
that uncovered this.

   Linus


Re: mainline build failure due to f1e4c916f97f ("drm/edid: add EDID block count and size helpers")

2022-05-28 Thread Linus Torvalds
On Sat, May 28, 2022 at 10:40 AM Linus Torvalds
 wrote:
>
> So digging a bit deeper - since I have am arm compiler after all - I
> note that 'sizeof(detailed_timings)' is 88.

Hmm.

sizeof() both

detailed_timings[0].data.other_data.data.range.formula.gtf2

and

detailed_timings[0].data.other_data.data.range.formula.cvt

is 7.

But the *union* of those things is

detailed_timings[0].data.other_data.data.range.formula

and its size is 8 (despite having an alignment of just 1).

The attached patch would seem to fix it for me.

Not very much tested, and I have no idea what it is that triggers this
only on spear3xx_defconfig.

Some ARM ABI issue that is triggered by some very particular ARM
compiler flag enabled by that config?

Adding some ARM (and SPEAR, and SoC) people in case they have any idea.

This smells like a compiler bug triggered by "there's a 16-bit member
field in that gtf2 structure, and despite it being packed and aligned
to 1, we somehow still align the size to 2".

I dunno. But marking those unions packed too doesn't seem wrong, and
does seem to fix it.

  Linus
 include/drm/drm_edid.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/drm/drm_edid.h b/include/drm/drm_edid.h
index c3204a58fb09..b2756753370b 100644
--- a/include/drm/drm_edid.h
+++ b/include/drm/drm_edid.h
@@ -121,7 +121,7 @@ struct detailed_data_monitor_range {
 			u8 supported_scalings;
 			u8 preferred_refresh;
 		} __attribute__((packed)) cvt;
-	} formula;
+	} __attribute__((packed)) formula;
 } __attribute__((packed));
 
 struct detailed_data_wpindex {
@@ -154,7 +154,7 @@ struct detailed_non_pixel {
 		struct detailed_data_wpindex color;
 		struct std_timing timings[6];
 		struct cvt_timing cvt[4];
-	} data;
+	} __attribute__((packed)) data;
 } __attribute__((packed));
 
 #define EDID_DETAIL_EST_TIMINGS 0xf7
@@ -172,7 +172,7 @@ struct detailed_timing {
 	union {
 		struct detailed_pixel_timing pixel_data;
 		struct detailed_non_pixel other_data;
-	} data;
+	} __attribute__((packed)) data;
 } __attribute__((packed));
 
 #define DRM_EDID_INPUT_SERRATION_VSYNC (1 << 0)


Re: mainline build failure due to f1e4c916f97f ("drm/edid: add EDID block count and size helpers")

2022-05-28 Thread Linus Torvalds
On Sat, May 28, 2022 at 5:13 AM Sudip Mukherjee
 wrote:
>
> just tried this with
> make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- drivers/gpu/drm/drm_edid.s
>
> size_of_edid:
> mov r0, #144@,
> ldmfd   sp, {fp, sp, pc}@

So digging a bit deeper - since I have am arm compiler after all - I
note that 'sizeof(detailed_timings)' is 88.

Which is completely wrong. It should be 72 bytes (an array of 4
structures, each 18 bytes in size).

I have not dug deeper, but that is clearly the issue.

Now, why that only happens on that spear3xx_defconfig, I have no idea.

 Linus


Re: mainline build failure due to f1e4c916f97f ("drm/edid: add EDID block count and size helpers")

2022-05-27 Thread Linus Torvalds
On Fri, May 27, 2022 at 4:41 PM Sudip Mukherjee
 wrote:
>
> I just tested with various values, sizeof(*edid) is 144 bytes at that place.

Hmm. What compiler do you have? Because it seems very broken.

You don't actually have to try with various sizes, you could have just
done something like

 int size_of_edid(const struct edid *edid)
 {
return sizeof(*edid);
 }

and then "make drivers/gpu/drm/drm_edid.s" to generate assembly and
see what it looks like (obviously removing the BUG_ON() in order to
build).

That obviously generates code like

movl$128, %eax
ret

for me, and looking at the definition of that type I really can't see
how it would ever generate anything else. But it's apparently not even
close for you.

I suspect some of the structs inside of that 'struct edid' end up
getting aligned, despite the '__attribute__((packed))'. For example,
'struct est_timings' is supposed to be just 3 bytes, and it's at an
odd offset too (byte offset 35 in the 'struct edid' if I did the math
correctly).

But it obviously doesn't happen for me or for most other people, so
it's something in your setup. Unusual compiler?

  Linus


Re: mainline build failure due to f1e4c916f97f ("drm/edid: add EDID block count and size helpers")

2022-05-27 Thread Linus Torvalds
On Fri, May 27, 2022 at 2:07 AM Sudip Mukherjee
 wrote:
>
> declared with attribute error: BUILD_BUG_ON failed: sizeof(*edid) != 
> EDID_LENGTH
>
> And, reverting it on top of mainline branch has fixed the build failure.

Hmm. That BUILD_BUG_ON() looks entirely correct, and if that sizeof()
doesn't work, then the code doesn't work.

I'm not seeing what could go wrong in there, with all the structures I
see being marked as __packed__.

I wonder if the union in 'struct detailed_timing' also wants that
"__attribute__((packed))" but I also wonder what it is that would make
this fail on spear3xx but not elsewhere.

Very strange. It would be interesting to know where that sizeof goes
wrong, but it would seem to be something very peculiar to your build
environment (either that config, or your compiler).

 Linus


Re: [git pull] drm for 5.19-rc1

2022-05-25 Thread Linus Torvalds
On Tue, May 24, 2022 at 11:07 PM Dave Airlie  wrote:
>
> AMD has started some new GPU support [...]

Oh Christ. Which means another set of auto-generated monster headers. Lovely.

  Linus


Re: Adding CI results to the kernel tree was Re: [RFC v2] drm/msm: Add initial ci/ subdirectory

2022-05-11 Thread Linus Torvalds
On Wed, May 11, 2022 at 12:08 PM Linus Torvalds
 wrote:
>
> The kernel tree might have just the expected *failures* listed, if
> there are any. Presumably the ci tree has to have the expected results
> anyway, so what's the advantage of listing non-failures?

.. put another way: I think a list of "we are aware that these
currently fail" is quite reasonable for a development tree, maybe even
with a comment in the commit that created them about why they
currently fail.

That also ends up being very nice if you fix a problem, and the fix
commit might then remove the failure for the list, and that all makes
perfect sense.

But having just the raw output of "these are the expected CI results"
that is being done and specified by some other tree entirely - that
seems pointless and just noise to me. There's no actual reason to have
that kind of noise - and update that kind of noise - that I really
see.

Linus


Re: Adding CI results to the kernel tree was Re: [RFC v2] drm/msm: Add initial ci/ subdirectory

2022-05-11 Thread Linus Torvalds
On Wed, May 11, 2022 at 11:40 AM Rob Clark  wrote:
>
> It is missing in this revision of the RFC, but the intention is to
> have the gitlab-ci.yml point to a specific commit SHA in the
> gfx-ci/drm-ci[1] tree, to solve the problem of keeping the results in
> sync with the expectations.  Ie. a kernel commit would control moving
> to a new version of i-g-t (and eventually deqp and/or piglit), and at
> the same time make any necessary updates in the expectations files.

Wouldn't it then be better to just have the expectation files in the
ci tree too?

The kernel tree might have just the expected *failures* listed, if
there are any. Presumably the ci tree has to have the expected results
anyway, so what's the advantage of listing non-failures?

  Linus


Re: Adding CI results to the kernel tree was Re: [RFC v2] drm/msm: Add initial ci/ subdirectory

2022-05-11 Thread Linus Torvalds
On Tue, May 10, 2022 at 10:07 PM Dave Airlie  wrote:
>
> > And use it to store expectations about what the drm/msm driver is
> > supposed to pass in the IGT test suite.
>
> I wanted to loop in Linus/Greg to see if there are any issues raised
> by adding CI results file to the tree in their minds, or if any other
> subsystem has done this already, and it's all fine.
>
> I think this is a good thing after our Mesa experience, but Mesa has a
> lot tighter integration here, so I want to get some more opinions
> outside the group.

Honestly, my immediate reaction is that I think it might be ok, but

 (a) are these things going to absolutely balloon over time?

 (b) should these not be separated out?

Those two issues kind of interact.

If it's a small and targeted test-suite, by all means keep it in the
kernel, but why not make it part of "tools/testing/selftests"

But if people expect this to balloon and we end up having megabytes of
test output, then I really think it should be a separate git tree.

A diffstat like this:

>  7 files changed, 791 insertions(+)

is not a problem at all. But I get the feeling that this is just the
tip of the iceberg, and people will want to not just have the result
files, but start adding actual *input* files that may be largely
automated stuff and may be tens of megabytes in size.

Because the result files on their own aren't really self-contained,
and then people will want to keep them in sync with the test-files
themselves, and start adding those, and now it *really* is likely very
unwieldy.

Or if that doesn't happen, and the actual input test files stay in a
separate CI repo, and then you end up having random coherency issues
with that CI repo, and it all gets to be either horribly messy, or the
result files in the kernel end up really stale.

So honestly, I personally don't see a good end result here.  This
particular small patch? *This* one looks fine to me, except I really
think tools/testing/selftests/gpu would be a much more logical place
for it.

But I don't see a way forward that is sane.

Can somebody argue otherwise?

Linus


drm pull request (was Re: )

2022-05-06 Thread Linus Torvalds
On Thu, May 5, 2022 at 9:07 PM Dave Airlie  wrote:
>
> pretty quiet week, one fbdev, msm, kconfig, and 2 amdgpu fixes, about
> what I'd expect for rc6.

You're not getting the automated pr-tracker-bot response, because your
subject line was missing...

Just a "how did that happen" together with a "here's the manual
response instead".

  Linus


Re: [PATCH RFC v6 00/21] DEPT(Dependency Tracker)

2022-05-04 Thread Linus Torvalds
On Wed, May 4, 2022 at 1:19 AM Byungchul Park  wrote:
>
> Hi Linus and folks,
>
> I've been developing a tool for detecting deadlock possibilities by
> tracking wait/event rather than lock(?) acquisition order to try to
> cover all synchonization machanisms.

So what is the actual status of reports these days?

Last time I looked at some reports, it gave a lot of false positives
due to mis-understanding prepare_to_sleep().

For this all to make sense, it would need to not have false positives
(or at least a very small number of them together with a way to sanely
get rid of them), and also have a track record of finding things that
lockdep doesn't.

Maybe such reports have been sent out with the current situation, and
I haven't seen them.

 Linus


Re: [git pull] drm fixes for 5.18-rc3

2022-04-14 Thread Linus Torvalds
On Thu, Apr 14, 2022 at 2:33 PM Dave Airlie  wrote:
>
>   git://anongit.freedesktop.org/drm/drm tags/drm-fixes-2022-04-15

.. and since I'm  now back home, I can confirm that my boot-time splat
I saw before is all gone.

It presumably went away with the previous pull request already, but I
hadn't remembered to check the issue until this pull reminded me about
the issue.

Thanks,
Linus


Re: [git pull] drm fixes for 5.18-rc2

2022-04-07 Thread Linus Torvalds
On Thu, Apr 7, 2022 at 2:20 PM Dave Airlie  wrote:
>
> I think this should fix the amdgpu splat you have been seeing since rc1.

Not the machine I'm currently traveling with, but I'll double-check
when I'm back home.

Thanks,
Linus


amdgpu link problem (was Re: [git pull] drm for 5.18-rc1)

2022-03-28 Thread Linus Torvalds
I didn't notice this until now, probably because everything still
_works_, but I get a new big warning splat at bootup on my main
workstation these days as of the merge window changes.

The full warning is attached, but it's basically the ASSERT(0) at line 938 in

  drivers/gpu/drm/amd/display/dc/core/dc_link.c

and it looks to have been introduced by commit c282d9512cdd
("drm/amd/display: factor out dp detection link training and mst top
detection").

This is the same old setup I've reported before, with some random
Radeon card with two monitors attached (PCI ID 1002:67df rev e7,
subsystem ID 1da2:e353).

I think the card went by the name "Sapphire Pulse RX 580 8GB" in case
any of that matters, but it's been working fine.

It still works fine, it just has a big ugly boot-time splat.

As mentioned, two 4K monitors attached, both over HDMI.

If there is any particular info you want, just let me know where/how
to find it, and I can provide.

  Linus


warn
Description: Binary data


Re: [git pull] drm fixes for 5.18-rc1

2022-03-25 Thread Linus Torvalds
On Thu, Mar 24, 2022 at 7:13 PM Dave Airlie  wrote:
>
> Some fixes were queued up in and in light of the fbdev regressions,
> I've pulled those in as well,

Thanks, pulled.

It sounds (from the other thread) that the mediatek DT issue is also
about to be fixed, even if it's not yet here.

But that hopefully (probably?) affects fewer people and testing robots.

 Linus


Re: [git pull] drm for 5.18-rc1

2022-03-24 Thread Linus Torvalds
On Wed, Mar 23, 2022 at 7:30 PM Dave Airlie  wrote:
>
> This is the main drm pull request for 5.18.
>
> The summary changelog is below, lots of work all over,
> Intel improving DG2 support, amdkfd CRIU support, msm
> new hw support, and faster fbdev support.

Ok, so this was annoying.

I've merged it, but will note three things that I really hope get
fixed / checked:

 (1) My merge resolution looked mostly trivial, except for an annoying
conflict between commits

4ed545e7d100 ("dt-bindings: display: mediatek: disp: split
each block to individual yaml")

and

6d0990e6e844 ("media: dt-binding: mediatek: Get rid of
mediatek,larb for multimedia HW")

where one of them splits up a file that is modified by the other.

I ended up just getting rid of all the "mediatek,larb" mentions in the
split-up files, despite the fact that (a) those mentions can be found
elsewhere and (b) the split-up did other changes too, so maybe it's
wrong.

 (2) As Guenter reported, the fbdev performance improvement of
cfb_imageblit() is broken.

I was going to just revert it, but I see that there is a two-series
patch to fix it at

https://lore.kernel.org/all/20220313192952.12058-1-tzimmerm...@suse.de/

so I merged it in that broken form, in the hope that this set of fixes
will be sent to me asap.

 (3) Very similarly to (2), but broken mediatek DT files.

I hope my changes in (1) didn't make things worse, but there's a
series of fixes as

https://lore.kernel.org/all/20220309134702.9942-1-jason-jh@mediatek.com/

and again I hope I'll get those fixes from the proper places asap.

I considered just delaying merging this all entirely, but it seems
better to get this all in, with the known problems and known fixes,
and see if we hit something _else_ too.

Anyway, let's hope I didn't miss anything, and that those are the only
major issues.

  Linus


Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-03-02 Thread Linus Torvalds
On Wed, Mar 2, 2022 at 12:07 PM Kees Cook  wrote:
>
> I've long wanted to change kfree() to explicitly set pointers to NULL on
> free. https://github.com/KSPP/linux/issues/87

We've had this discussion with the gcc people in the past, and gcc
actually has some support for it, but it's sadly tied to the actual
function name (ie gcc has some special-casing for "free()")

See

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94527

for some of that discussion.

Oh, and I see some patch actually got merged since I looked there last
so that you can mark "deallocator" functions, but I think it's only
for the context matching, not for actually killing accesses to the
pointer afterwards.

   Linus


Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-03-01 Thread Linus Torvalds
On Tue, Mar 1, 2022 at 3:19 PM David Laight  wrote:
>
> Having said that there are so few users of list_entry_is_head()
> it is reasonable to generate two new names.

Well, the problem is that the users of list_entry_is_head() may be few
- but there are a number of _other_ ways to check "was that the HEAD
pointer", and not all of them are necessarily correct.

IOW, different places do different random tests for "did we walk the
whole loop without breaking out". And many of them happen to work. In
fact, in practice, pretty much *all* of them happen to work, and you
have to have the right struct layout and really really bad luck to hit
a case of "type confusion ended up causing the test to not work".

And *THAT* is the problem here. It's not the "there are 25ish places
that current use list_entry_is_head()".

It's the "there are ~480 places that use the type-confused HEAD entry
that has been cast to the wrong type".

And THAT is why I think we'd be better off with that bigger change
that simply means that you can't use the iterator variable at all
outside the loop, and try to make it something where the compiler can
help catch mis-uses.

Now, making the list_for_each_entry() thing force the iterator to NULL
at the end of the loop does fix the problem. The issue I have with it
is really just that you end up getting no warning at all from the
compiler if you mix old-style and new-style semantics. Now, you *will*
get an oops (if using a new-style iterator with an old-style check),
but many of these things will be in odd driver code and may happen
only for error cases.

And if you use a new-style check with an old-style iterator (ie some
backport problem), you will probably end up getting random memory
corruption, because you'll decide "it's not a HEAD entry", and then
you'll actually *use* the HEAD that has the wrong type cast associated
with it.

See what my worry is?

With the "don't use iterator outside the loop" approach, the exact
same code works in both the old world order and the new world order,
and you don't have the semantic confusion. And *if* you try to use the
iterator outside the loop, you'll _mostly_ (*) get a compiler warning
about it not being initialized.

 Linus

(*) Unless somebody initializes the iterator pointer pointlessly.
Which clearly does happen. Thus the "mostly". It's not perfect, and
that's most definitely not nice - but it should at least hopefully
make it that much harder to mess up.


Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-03-01 Thread Linus Torvalds
On Tue, Mar 1, 2022 at 2:58 PM David Laight  wrote:
>
> Can it be resolved by making:
> #define list_entry_is_head(pos, head, member) ((pos) == NULL)
> and double-checking that it isn't used anywhere else (except in
> the list macros themselves).

Well, yes, except for the fact that then the name is entirely misleading...

And somebody possibly uses it together with list_first_entry() etc, so
it really is completely broken to mix that change with the list
traversal change.

 Linus

   Linus


Re: [PATCH 6/6] treewide: remove check of list iterator against head past the loop body

2022-03-01 Thread Linus Torvalds
So looking at this patch, I really reacted to the fact that quite
often the "use outside the loop" case is all kinds of just plain
unnecessary, but _used_ to be a convenience feature.

I'll just quote the first chunk in it's entirely as an example - not
because I think this chunk is particularly important, but because it's
a good example:

On Mon, Feb 28, 2022 at 3:09 AM Jakob Koschel  wrote:
>
> diff --git a/arch/arm/mach-mmp/sram.c b/arch/arm/mach-mmp/sram.c
> index 6794e2db1ad5..fc47c107059b 100644
> --- a/arch/arm/mach-mmp/sram.c
> +++ b/arch/arm/mach-mmp/sram.c
> @@ -39,19 +39,22 @@ static LIST_HEAD(sram_bank_list);
>  struct gen_pool *sram_get_gpool(char *pool_name)
>  {
> struct sram_bank_info *info = NULL;
> +   struct sram_bank_info *tmp;
>
> if (!pool_name)
> return NULL;
>
> mutex_lock(_lock);
>
> -   list_for_each_entry(info, _bank_list, node)
> -   if (!strcmp(pool_name, info->pool_name))
> +   list_for_each_entry(tmp, _bank_list, node)
> +   if (!strcmp(pool_name, tmp->pool_name)) {
> +   info = tmp;
> break;
> +   }
>
> mutex_unlock(_lock);
>
> -   if (>node == _bank_list)
> +   if (!info)
> return NULL;
>
> return info->gpool;

I realize this was probably at least auto-generated with coccinelle,
but maybe that script could be taught to do avoid the "use after loop"
by simply moving the code _into_ the loop.

IOW, this all would be cleaner and clear written as

if (!pool_name)
return NULL;

mutex_lock(_lock);
list_for_each_entry(info, _bank_list, node) {
if (!strcmp(pool_name, info->pool_name)) {
mutex_unlock(_lock);
return info;
}
}
mutex_unlock(_lock);
return NULL;

Ta-daa - no use outside the loop, no need for new variables, just a
simple "just do it inside the loop". Yes, we end up having that lock
thing twice, but it looks worth it from a "make the code obvious"
standpoint.

Would it be even cleaner if the locking was done in the caller, and
the loop was some simple helper function? It probably would. But that
would require a bit more smarts than probably a simple coccinelle
script would do.

Linus


Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-03-01 Thread Linus Torvalds
On Tue, Mar 1, 2022 at 11:06 AM Linus Torvalds
 wrote:
>
> So instead of that simple "if (!entry)", we'd effectively have to
> continue to use something that still works with the old world order
> (ie that "if (list_entry_is_head())" model).

Just to prove my point about how this is painful, that doesn't work at all.

If the loop iterator at the end is NULL (good, in theory), we can't
use "list_entry_is_head()" to check whether we ended. We'd have to use
a new thing entirely, to handle the "list_for_each_entry() has the
old/new semantics" cases.

That's largely why I was pushing for the "let's make it impossible to
use the loop iterator at all outside the loop". It avoids the
confusing case, and the patches to move to that stricter semantic can
be merged independently (and before) doing the actual semantic change.

I'm not saying my suggested approach is wonderful either. Honestly,
it's painful that we have so nasty semantics for the end-of-loop case
for list_for_each_entry().

The minimal patch would clearly be to keep those broken semantics, and
just force everybody to use the list_entry_is_head() case. That's the
"we know we messed up, we are too lazy to fix it, we'll just work
around it and people need to be careful" approach.

And laziness is a virtue. But bad semantics are bad semantics. So it's
a question of balancing those two issues.

   Linus


Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-03-01 Thread Linus Torvalds
On Mon, Feb 28, 2022 at 2:29 PM James Bottomley
 wrote:
>
> However, if the desire is really to poison the loop variable then we
> can do
>
> #define list_for_each_entry(pos, head, member)  \
> for (pos = list_first_entry(head, typeof(*pos), member);\
>  !list_entry_is_head(pos, head, member) && ((pos = NULL) == NULL; 
>   \
>  pos = list_next_entry(pos, member))
>
> Which would at least set pos to NULL when the loop completes.

That would actually have been excellent if we had done that
originally. It would not only avoid the stale and incorrectly typed
head entry left-over turd, it would also have made it very easy to
test for "did I find an entry in the loop".

But I don't much like it in the situation we are now.

Why? Mainly because it basically changes the semantics of the loop
_without_ any warnings about it.  And we don't actually get the
advantage of the nicer semantics, because we can't actually make code
do

list_for_each_entry(entry, ) {
..
}
if (!entry)
return -ESRCH;
.. use the entry we found ..

because that would be a disaster for back-porting, plus it would be a
flag-day issue (ie we'd have to change the semantics of the loop at
the same time we change every single user).

So instead of that simple "if (!entry)", we'd effectively have to
continue to use something that still works with the old world order
(ie that "if (list_entry_is_head())" model).

So we couldn't really take _advantage_ of the nicer semantics, and
we'd not even get a warning if somebody does it wrong - the code would
just silently do the wrong thing.

IOW: I don't think you are wrong about that patch: it would solve the
problem that Jakob wants to solve, and it would have absolutely been
much better if we had done this from the beginning. But I think that
in our current situation, it's actually a really fragile solution to
the "don't do that then" problem we have.

  Linus


Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-03-01 Thread Linus Torvalds
On Tue, Mar 1, 2022 at 10:14 AM Kees Cook  wrote:
>
> The first big glitch with -Wshadow was with shadowed global variables.
> GCC 4.8 fixed that, but it still yells about shadowed functions. What
> _almost_ works is -Wshadow=local.

Heh. Yeah, I just have long memories of "-Wshadow was a disaster". You
looked into the details.

> Another way to try to catch misused shadow variables is
> -Wunused-but-set-varible, but it, too, has tons of false positives.

That on the face of it should be an easy warning to get technically
right for a compiler.

So I assume the "false positives" are simply because we end up having
various variables that really don't end up being used - and
"intentionally" so).

Or rather, they might only be used under some config option - perhaps
the use is even syntactically there and parsed, but the compiler
notices that it's turned off under some

if (IS_ENABLED(..))

option? Because yeah, we have a lot of those.

I think that's a common theme with a lot of compiler warnings: on the
face of it they sound "obviously sane" and nobody should ever write
code like that.

A conditional that is always true? Sounds idiotic, and sounds like a
reasonable thing for a compiler to warn about, since why would you
have a conditional in the first place for that?

But then you realize that maybe the conditional is a build config
option, and "always true" suddenly makes sense. Or it's a test for
something that is always true on _that_architecture_ but not in some
general sense (ie testing "sizeof()"). Or it's a purely syntactic
conditional, like "do { } while (0)".

It's why I'm often so down on a lot of the odd warnings that are
hiding under W=1 and friends. They all may make sense in the trivial
case ("That is insane") but then in the end they happen for sane code.

And yeah, -Wshadow has had tons of history with macro nesting, and
just being badly done in the first place (eg "strlen" can be a
perfectly fine local variable).

That said, maybe people could ask the gcc and clan people for a way to
_mark_ the places where we expect to validly see shadowing. For
example, that "local variable in a macro expression statement" thing
is absolutely horrendous to fix with preprocessor tricks to try to
make for unique identifiers.

But I think it would be much more syntactically reasonable to add (for
example) a "shadow" attribute to such a variable exactly to tell the
compiler "yeah, yeah, I know this identifier could shadow an outer
one" and turn it off that way.

   Linus


Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-02-28 Thread Linus Torvalds
On Mon, Feb 28, 2022 at 4:38 PM Segher Boessenkool
 wrote:
>
> In C its scope is the rest of the declaration and the entire loop, not
> anything after it.  This was the same in C++98 already, btw (but in
> pre-standard versions of C++ things were like you remember, yes, and it
> was painful).

Yeah, the original C++ model was just unadulterated garbage, with no
excuse for it, and the scope was not the loop, but the block the loop
existed in.

That would never have been acceptable for the kernel - it's basically
just an even uglier version of "put variable declarations in the
middle of code" (and we use "-Wdeclaration-after-statement" to
disallow that for kernel code, although apparently some of our user
space tooling code doesn't enforce or follow that rule).

The actual C99 version is the sane one which actually makes it easier
and clearer to have loop iterators that are clearly just in loop
scope.

That's a good idea in general, and I have wanted to start using that
in the kernel even aside from some of the loop construct macros.
Because putting variables in natural minimal scope is a GoodThing(tm).

Of course, we shouldn't go crazy with it. Even after we do that
-std=gnu11 thing, we'll have backports to worry about. And it's not
clear that we necessarily want to backport that gnu11 thing - since
people who run old stable kernels also may be still using those really
old compilers...

Linus


Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-02-28 Thread Linus Torvalds
On Mon, Feb 28, 2022 at 4:45 PM Linus Torvalds
 wrote:
>
> Yeah, except that's ugly beyond belief, plus it's literally not what
> we do in the kernel.

(Of course, I probably shouldn't have used 'min()' as an example,
because that is actually one of the few places where we do exactly
that, using our __UNIQUE_ID() macros. Exactly because people _have_
tried to do -Wshadow when doing W=2).

 Linus


Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-02-28 Thread Linus Torvalds
On Mon, Feb 28, 2022 at 3:26 PM Matthew Wilcox  wrote:
>
> #define ___PASTE(a, b)  a##b
> #define __PASTE(a, b) ___PASTE(a, b)
> #define _min(a, b, u) ({ \

Yeah, except that's ugly beyond belief, plus it's literally not what
we do in the kernel.

Really. The "-Wshadow doesn't work on the kernel" is not some new
issue, because you have to do completely insane things to the source
code to enable it.

Just compare your uglier-than-sin version to my straightforward one.
One does the usual and obvious "use a private variable to avoid the
classic multi-use of a macro argument". And the other one is an
abomination.

  Linus


Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-02-28 Thread Linus Torvalds
On Mon, Feb 28, 2022 at 1:47 PM Jakob Koschel  wrote:
>
> The goal of this is to get compiler warnings right? This would indeed be 
> great.

Yes, so I don't mind having a one-time patch that has been gathered
using some automated checker tool, but I don't think that works from a
long-term maintenance perspective.

So if we have the basic rule being "don't use the loop iterator after
the loop has finished, because it can cause all kinds of subtle
issues", then in _addition_ to fixing the existing code paths that
have this issue, I really would want to (a) get a compiler warning for
future cases and (b) make it not actually _work_ for future cases.

Because otherwise it will just happen again.

> Changing the list_for_each_entry() macro first will break all of those cases
> (e.g. the ones using 'list_entry_is_head()).

So I have no problems with breaking cases that we basically already
have a patch for due to  your automated tool. There were certainly
more than a handful, but it didn't look _too_ bad to just make the
rule be "don't use the iterator after the loop".

Of course, that's just based on that patch of yours. Maybe there are a
ton of other cases that your patch didn't change, because they didn't
match your trigger case, so I may just be overly optimistic here.

But basically to _me_, the important part is that the end result is
maintainable longer-term. I'm more than happy to have a one-time patch
to fix a lot of dubious cases if we can then have clean rules going
forward.

> I assumed it is better to fix those cases first and then have a simple
> coccinelle script changing the macro + moving the iterator into the scope
> of the macro.

So that had been another plan of mine, until I actually looked at
changing the macro. In the one case I looked at, it was ugly beyond
belief.

It turns out that just syntactically, it's really nice to give the
type of the iterator from outside the way we do now. Yeah, it may be a
bit odd, and maybe it's partly because I'm so used to the
"list_for_each_list_entry()" syntax, but moving the type into the loop
construct really made it nasty - either one very complex line, or
having to split it over two lines which was even worse.

Maybe the place I looked at just happened to have a long typename, but
it's basically always going to be a struct, so it's never a _simple_
type. And it just looked very odd adn unnatural to have the type as
one of the "arguments" to that list_for_each_entry() macro.

So yes, initially my idea had been to just move the iterator entirely
inside the macro. But specifying the type got so ugly that I think
that

typeof (pos) pos

trick inside the macro really ends up giving us the best of all worlds:

 (a) let's us keep the existing syntax and code for all the nice cases
that did everything inside the loop anyway

 (b) gives us a nice warning for any normal use-after-loop case
(unless you explicitly initialized it like that
sgx_mmu_notifier_release() function did for no good reason

 (c) also guarantees that even if you don't get a warning,
non-converted (or newly written) bad code won't actually _work_

so you end up getting the new rules without any ambiguity or mistaken

> With this you are no longer able to set the 'outer' pos within the list
> iterator loop body or am I missing something?

Correct. Any assignment inside the loop will be entirely just to the
local loop case. So any "break;" out of the loop will have to set
another variable - like your updated patch did.

> I fail to see how this will make most of the changes in this
> patch obsolete (if that was the intention).

I hope my explanation above clarifies my thinking: I do not dislike
your patch, and in fact your patch is indeed required to make the new
semantics work.

What I disliked was always the maintainability of your patch - making
the rules be something that isn't actually visible in the source code,
and letting the old semantics still work as well as they ever did, and
having to basically run some verification pass to find bad users.

(I also disliked your original patch that mixed up the "CPU
speculation type safety" with the actual non-speculative problems, but
that was another issue).

Linus


Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-02-28 Thread Linus Torvalds
On Mon, Feb 28, 2022 at 12:29 PM Johannes Berg
 wrote:
>
> If we're willing to change the API for the macro, we could do
>
>   list_for_each_entry(type, pos, head, member)
>
> and then actually take advantage of -Wshadow?

See my reply to Willy. There is no way -Wshadow will ever happen.

I considered that (type, pos, head, member) kind of thing, to the
point of trying it for one file, but it ends up as horrendous syntax.
It turns out that declaring the type separately really helps, and
avoids crazy long lines among other things.

It would be unacceptable for another reason too - the amount of churn
would just be immense. Every single use of that macro (and related
macros) would change, even the ones that really don't need it or want
it (ie the good kinds that already only use the variable inside the
loop).

So "typeof(pos) pos" may be ugly - but it's a very localized ugly.

Linus


Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-02-28 Thread Linus Torvalds
On Mon, Feb 28, 2022 at 12:16 PM Matthew Wilcox  wrote:
>
> Then we can never use -Wshadow ;-(  I'd love to be able to turn it on;
> it catches real bugs.

Oh, we already can never use -Wshadow regardless of things like this.
That bridge hasn't just been burned, it never existed in the first
place.

The whole '-Wshadow' thing simply cannot work with local variables in
macros - something that we've used since day 1.

Try this (as a "p.c" file):

#define min(a,b) ({ \
typeof(a) __a = (a);\
typeof(b) __b = (b);\
__a < __b ? __a : __b; })

int min3(int a, int b, int c)
{
return min(a,min(b,c));
}

and now do "gcc -O2 -S t.c".

Then try it with -Wshadow.

In other words, -Wshadow is simply not acceptable. Never has been,
never will be, and that has nothing to do with the

typeof(pos) pos

kind of thing.

Your argument just isn't an argument.

  Linus


  1   2   3   4   5   6   7   >