Re: [PATCH 02/12] mm/pgtable: add PAE safety to __pte_offset_map()

2023-06-01 Thread Hugh Dickins
On Wed, 31 May 2023, Jason Gunthorpe wrote:
> On Sun, May 28, 2023 at 11:16:16PM -0700, Hugh Dickins wrote:
> > There is a faint risk that __pte_offset_map(), on a 32-bit architecture
> > with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on
> > a pmdval assembled from a pmd_low and a pmd_high which never belonged
> > together: their combination not pointing to a page table at all, perhaps
> > not even a valid pfn.  pmdp_get_lockless() is not enough to prevent that.
> > 
> > Guard against that (on such configs) by local_irq_save() blocking TLB
> > flush between present updates, as linux/pgtable.h suggests.  It's only
> > needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
> > __pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
> > lock, would just send it back to __pte_offset_map() again.
> 
> What about the other places calling pmdp_get_lockless ? It seems like
> this is quietly making it part of the API that the caller must hold
> the IPIs off.

No, I'm making no judgment of other places where pmdp_get_lockless() is
used: examination might show that some need more care, but I'll just
assume that each is taking as much care as it needs.

But here where I'm making changes, I do see that we need this extra care.
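
For concreteness, the shape of the guard is roughly this (a sketch only;
the actual patch wraps it in helpers, and only compiles it in on configs
like X86_PAE where pmd_t is read in two halves):

	unsigned long flags;
	pmd_t pmdval;

	/*
	 * Disabling interrupts blocks the IPI-based TLB flush which
	 * must complete between "present" pmd updates, so the pmd_low
	 * and pmd_high read below cannot come from two different pmds.
	 */
	local_irq_save(flags);
	pmdval = pmdp_get_lockless(pmd);
	local_irq_restore(flags);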

> 
> And Jann had a note that this approach used by the lockless functions
> doesn't work anyhow:
> 
> https://lore.kernel.org/linux-mm/cag48ez3h-mnp9zfc10v+-bw_8nqvxbwbsmyjfp8jx31o0b1...@mail.gmail.com/

Thanks a lot for the link: I don't know why, but I never saw that mail
thread at all before.  I have not fully digested it yet, to be honest:
MADV_DONTNEED, doesn't flush TLB yet, etc - I'll have to get into the
right frame of mind for that.

> 
> Though we never fixed it, AFAIK..

I'm certainly depending very much on pmdp_get_lockless(): and hoping to
find its case is easier to defend than at the ptep_get_lockless() level.

Thanks,
Hugh


Re: [PATCH 10/12] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()

2023-06-01 Thread Hugh Dickins
On Wed, 31 May 2023, Jann Horn wrote:
> On Mon, May 29, 2023 at 8:26 AM Hugh Dickins  wrote:
> > Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> > It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> > nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> > paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
> 
> I think there's a weirdness in the existing code, and this change
> probably turns that into a UAF bug.
> 
> collapse_pte_mapped_thp() can be called on an address that might not
> be associated with a VMA anymore, and after this change, the page
> tables for that address might be in the middle of page table teardown
> in munmap(), right? The existing mmap_write_lock() guards against
> concurrent munmap() (so in the old code we are guaranteed to either
> see a normal VMA or not see the page tables anymore), but
> mmap_read_lock() only guards against the part of munmap() up to the
> mmap_write_downgrade() in do_vmi_align_munmap(), and unmap_region()
> (including free_pgtables()) happens after that.

Excellent point, thank you.  Don't let anyone overhear us, but I have
to confess to you that that mmap_write_downgrade() has never impinged
forcefully enough on my consciousness: it's still my habit to think of
mmap_lock as exclusive over free_pgtables(), and I've not encountered
this bug in my testing.
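
(Schematically, the munmap() ordering Jann describes:

	do_vmi_align_munmap()
	    ... detach the VMAs ...
	    mmap_write_downgrade(mm);	/* write lock -> read lock */
	    unmap_region()
	        ... unmap, flush TLB ...
	        free_pgtables();	/* tables freed under read lock */
	    mmap_read_unlock(mm);

so a racing mmap_read_lock() holder can indeed walk page tables that
are in the middle of being freed.)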

Right, I'll gladly incorporate your collapse_pte_mapped_thp()
rearrangement below.  And am reassured to realize that by removing
mmap_lock dependence elsewhere, I won't have got it wrong in other places.

Thanks,
Hugh

> 
> So we can now enter collapse_pte_mapped_thp() and race with concurrent
> free_pgtables() such that a PUD disappears under us while we're
> walking it or something like that:
> 
> 
> int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>   bool install_pmd)
> {
>   struct mmu_notifier_range range;
>   unsigned long haddr = addr & HPAGE_PMD_MASK;
>   struct vm_area_struct *vma = vma_lookup(mm, haddr); // <<< returns NULL
>   struct page *hpage;
>   pte_t *start_pte, *pte;
>   pmd_t *pmd, pgt_pmd;
>   spinlock_t *pml, *ptl;
>   int nr_ptes = 0, result = SCAN_FAIL;
>   int i;
> 
>   mmap_assert_locked(mm);
> 
>   /* Fast check before locking page if already PMD-mapped */
>   result = find_pmd_or_thp_or_none(mm, haddr, &pmd); // <<< PUD UAF in here
>   if (result == SCAN_PMD_MAPPED)
> return result;
> 
>   if (!vma || !vma->vm_file || // <<< bailout happens too late
>   !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> return SCAN_VMA_CHECK;
> 
> 
> I guess the right fix here is to make sure that at least the basic VMA
> revalidation stuff (making sure there still is a VMA covering this
> range) happens before find_pmd_or_thp_or_none()? Like:
> 
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 301c0e54a2ef..5db365587556 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1481,15 +1481,15 @@ int collapse_pte_mapped_thp(struct mm_struct
> *mm, unsigned long addr,
> 
>  mmap_assert_locked(mm);
> 
> +if (!vma || !vma->vm_file ||
> +!range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> +return SCAN_VMA_CHECK;
> +
>  /* Fast check before locking page if already PMD-mapped */
>  result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
>  if (result == SCAN_PMD_MAPPED)
>  return result;
> 
> -if (!vma || !vma->vm_file ||
> -!range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> -return SCAN_VMA_CHECK;
> -
>  /*
>   * If we are here, we've succeeded in replacing all the native pages
>   * in the page cache with a single hugepage. If a mm were to fault-in
> 

Re: [PATCH 00/12] mm: free retracted page table by RCU

2023-06-01 Thread Hugh Dickins
On Wed, 31 May 2023, Jann Horn wrote:
> On Mon, May 29, 2023 at 8:11 AM Hugh Dickins  wrote:
> > Here is the third series of patches to mm (and a few architectures), based
> > on v6.4-rc3 with the preceding two series applied: in which khugepaged
> > takes advantage of pte_offset_map[_lock]() allowing for pmd transitions.
> 
> To clarify: Part of the design here is that when you look up a user
> page table with pte_offset_map_nolock() or pte_offset_map() without
> holding mmap_lock in write mode, and you later lock the page table
> yourself, you don't know whether you actually have the real page table
> or a detached table that is currently in its RCU grace period, right?

Right.  (And I'd rather not assume anything of mmap_lock, but there are
one or two or three places that may still do so.)

> And detached tables are supposed to consist of only zeroed entries,
> and we assume that no relevant codepath will do anything bad if one of
> these functions spuriously returns a pointer to a page table full of
> zeroed entries?

(Nit that I expect you're well aware of: IIRC "zeroed" isn't 0 on s390.)

If someone is using pte_offset_map() without lock, they must be prepared
to accept page-table-like changes.  The limits of pte_offset_map_nolock()
with later spin_lock(ptl): I'm still exploring: there's certainly an
argument that one ought to do a pmd_same() check before proceeding,
but I don't think anywhere needs that at present.
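
For concreteness, the pmd_same() recheck being debated would look
roughly like this (a sketch, not code from the series):

	spinlock_t *ptl;
	pmd_t pmdval = pmdp_get_lockless(pmd);
	pte_t *pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);

	if (!pte)
		goto again;
	spin_lock(ptl);
	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
		pte_unmap_unlock(pte, ptl);
		goto again;	/* table may have been retracted under us */
	}
	/* pte now known to point into a still-attached page table */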

Whether the page table has to be full of zeroed entries when detached:
I believe it is always like that at present (by the end of the series,
when the collapse_pte_offset_map() oddity is fixed), but whether it needs
to be so I'm not sure.  Quite likely it will need to be; but I'm open to
the possibility that all it needs is to be still a page table, with
perhaps new entries from a new usage in it.

The most obvious vital thing (in the split ptlock case) is that it
remains a struct page with a usable ptl spinlock embedded in it.

The question becomes more urgent when/if extending to replacing the
pagetable pmd by huge pmd in one go, without any mmap_lock: powerpc
wants to deposit the page table for later use even in the shmem/file
case (and all arches in the anon case): I did work out the details once
before, but I'm not sure whether I would still agree with myself; and was
glad to leave replacement out of this series, to revisit some time later.

> 
> So in particular, in handle_pte_fault() we can reach the "if
> (unlikely(!pte_same(*vmf->pte, entry)))" with vmf->pte pointing to a
> detached zeroed page table, but we're okay with that because in that
> case we know that !pte_none(vmf->orig_pte) && pte_none(*vmf->pte),
> which implies !pte_same(*vmf->pte, entry), which means we'll bail
> out?

There is no current (even at end of series) circumstance in which we
could be pointing to a detached page table there; but yes, I want to
allow for that, and yes I agree with your analysis.  But with the
interesting unanswered question for the future, of what if the same
value could be found there: would that imply it's safe to proceed,
or would some further prevention be needed?
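
(For reference, the path in question reads, slightly abbreviated:

	spin_lock(vmf->ptl);
	entry = vmf->orig_pte;
	if (unlikely(!pte_same(*vmf->pte, entry))) {
		update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
		goto unlock;		/* bail out, no harm done */
	}

so a detached-and-zeroed table, where orig_pte was not none, can only
take the bail-out branch.)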

> 
> If that's the intent, it might be good to add some comments, because
> at least to me that's not very obvious.

That's a very fair request; but I shall have difficulty deciding where
to place such comments.  I shall have to try, then you redirect me.

And I think we approach this in opposite ways: my nature is to put some
infrastructure in place, and then look at it to see what we can get away
with; whereas your nature is to define upfront what the possibilities are.
We can expect some tussles!

Thanks,
Hugh

Re: [PATCH] ASoC: fsl_sai: Enable BCI bit if SAI works on synchronous mode with BYP asserted

2023-06-01 Thread Shengjiu Wang
On Tue, May 30, 2023 at 6:30 PM Chancel Liu  wrote:

> There's an issue in SAI synchronous mode: the TX/RX side can't get BCLK
> from the RX/TX side it syncs with if the BYP bit is asserted. As a
> workaround, enable SION in the IOMUX pad control and assert BCI.
>
> For example, if TX syncs with RX, both TX and RX use the clock from RX
> and BYP=1. TX can get BCLK only if the following two conditions hold:
> 1. SION of the RX BCLK IOMUX pad is set to 1
> 2. BCI of TX is set to 1
>
> Signed-off-by: Chancel Liu 
>

Acked-by: Shengjiu Wang 

Best regards
Wang Shengjiu

> ---
>  sound/soc/fsl/fsl_sai.c | 11 +--
>  sound/soc/fsl/fsl_sai.h |  1 +
>  2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/sound/soc/fsl/fsl_sai.c b/sound/soc/fsl/fsl_sai.c
> index d9344025dc16..5e09f634c61b 100644
> --- a/sound/soc/fsl/fsl_sai.c
> +++ b/sound/soc/fsl/fsl_sai.c
> @@ -491,14 +491,21 @@ static int fsl_sai_set_bclk(struct snd_soc_dai *dai,
> bool tx, u32 freq)
> regmap_update_bits(sai->regmap, reg, FSL_SAI_CR2_MSEL_MASK,
>FSL_SAI_CR2_MSEL(sai->mclk_id[tx]));
>
> -   if (savediv == 1)
> +   if (savediv == 1) {
> regmap_update_bits(sai->regmap, reg,
>FSL_SAI_CR2_DIV_MASK | FSL_SAI_CR2_BYP,
>FSL_SAI_CR2_BYP);
> -   else
> +   if (fsl_sai_dir_is_synced(sai, adir))
> +   regmap_update_bits(sai->regmap, FSL_SAI_xCR2(tx, ofs),
> +  FSL_SAI_CR2_BCI, FSL_SAI_CR2_BCI);
> +   else
> +   regmap_update_bits(sai->regmap, FSL_SAI_xCR2(tx, ofs),
> +  FSL_SAI_CR2_BCI, 0);
> +   } else {
> regmap_update_bits(sai->regmap, reg,
>FSL_SAI_CR2_DIV_MASK | FSL_SAI_CR2_BYP,
>savediv / 2 - 1);
> +   }
>
> if (sai->soc_data->max_register >= FSL_SAI_MCTL) {
> /* SAI is in master mode at this point, so enable MCLK */
> diff --git a/sound/soc/fsl/fsl_sai.h b/sound/soc/fsl/fsl_sai.h
> index 3eb994aef36a..8254c3547b87 100644
> --- a/sound/soc/fsl/fsl_sai.h
> +++ b/sound/soc/fsl/fsl_sai.h
> @@ -116,6 +116,7 @@
>
>  /* SAI Transmit and Receive Configuration 2 Register */
>  #define FSL_SAI_CR2_SYNC   BIT(30)
> +#define FSL_SAI_CR2_BCIBIT(28)
>  #define FSL_SAI_CR2_MSEL_MASK  (0x3 << 26)
>  #define FSL_SAI_CR2_MSEL_BUS   0
>  #define FSL_SAI_CR2_MSEL_MCLK1 BIT(26)
> --
> 2.25.1
>
>


Re: [PATCH 01/12] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s

2023-06-01 Thread Hugh Dickins
On Wed, 31 May 2023, Jann Horn wrote:
> On Mon, May 29, 2023 at 8:15 AM Hugh Dickins  wrote:
> > Before putting them to use (several commits later), add rcu_read_lock()
> > to pte_offset_map(), and rcu_read_unlock() to pte_unmap().  Make this a
> > separate commit, since it risks exposing imbalances: prior commits have
> > fixed all the known imbalances, but we may find some have been missed.
> [...]
> > diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> > index c7ab18a5fb77..674671835631 100644
> > --- a/mm/pgtable-generic.c
> > +++ b/mm/pgtable-generic.c
> > @@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, 
> > pmd_t *pmdvalp)
> >  {
> > pmd_t pmdval;
> >
> > -   /* rcu_read_lock() to be added later */
> > +   rcu_read_lock();
> > pmdval = pmdp_get_lockless(pmd);
> > if (pmdvalp)
> > *pmdvalp = pmdval;
> 
> It might be a good idea to document that this series assumes that the
> first argument to __pte_offset_map() is a pointer into a second-level
> page table (and not a local copy of the entry) unless the containing
> VMA is known to not be THP-eligible or the page table is detached from
> the page table hierarchy or something like that. Currently a bunch of
> places pass references to local copies of the entry, and while I think
> all of these are fine, it would probably be good to at least document
> why these are allowed to do it while other places aren't.

Thanks Jann: but I have to guess that here you are showing awareness of
an important issue that I'm simply ignorant of.

I have been haunted by a dim recollection that there is one architecture
(arm-32?) which is fussy about the placement of the pmdval being examined
(deduces info missing from the arch-independent interface, by following
up the address?), but I couldn't track it down when I tried.

Please tell me more; or better, don't spend your time explaining to me,
but please just send a link to a good reference on the issue.  I'll be
unable to document what you ask there, without educating myself first.

Thanks,
Hugh

> 
> $ vgrep 'pte_offset_map(&'
> Index File                   Line Content
> 0     arch/sparc/mm/tlb.c     151 pte = pte_offset_map(&pmd, vaddr);
> 1     kernel/events/core.c   7501 ptep = pte_offset_map(&pmd, addr);
> 2     mm/gup.c               2460 ptem = ptep = pte_offset_map(&pmd, addr);
> 3     mm/huge_memory.c       2057 pte = pte_offset_map(&_pmd, haddr);
> 4     mm/huge_memory.c       2214 pte = pte_offset_map(&_pmd, haddr);
> 5     mm/page_table_check.c   240 pte_t *ptep = pte_offset_map(&pmd, addr);

Re: [PATCH 00/13] mm: jit/text allocator

2023-06-01 Thread Song Liu
On Thu, Jun 1, 2023 at 3:13 AM Mike Rapoport  wrote:
>
> From: "Mike Rapoport (IBM)" 
>
> Hi,
>
> module_alloc() is used everywhere as a means to allocate memory for code.
>
> Besides being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF, to modules
> and puts the burden of code allocation on the modules code.
>
> Several architectures override module_alloc() because of various
> constraints on where the executable memory can be located, and this causes
> additional obstacles for improvements of code allocation.
>
> This set splits code allocation from modules by introducing
> jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> sites of module_alloc() and module_memfree() with the new APIs and
> implements core text and related allocation in a central place.
>
> Instead of architecture specific overrides for module_alloc(), the
> architectures that require non-default behaviour for text allocation must
> fill jit_alloc_params structure and implement jit_alloc_arch_params() that
> returns a pointer to that structure. If an architecture does not implement
> jit_alloc_arch_params(), the defaults compatible with the current
> modules::module_alloc() are used.
>
> The new jitalloc infrastructure allows decoupling of kprobes and ftrace
> from modules, and most importantly it enables ROX allocations for
> executable memory.

This set does look cleaner than my version [1]. However, this is
partially because this set only separates text and data; while [1]
also separates rw data, ro data, and ro_after_init data. We need
such separation to fully cover module usage, and to remove
VM_FLUSH_RESET_PERMS. Once we add this logic to this
set, the two versions will look similar.

OTOH, I do like the fact this version enables kprobes (and
potentially ftrace and bpf) without CONFIG_MODULES. And
mm/ seems a better home for the logic.

That being said, besides comments in a few patches, this
version looks good to me. With the fix I suggested for patch
12/13, it passed my tests on x86_64 with modules, kprobes,
ftrace, and BPF.

If we decided to ship this version, I would appreciate it if I
could get more credit for my work in [1] and research work
before that.

Thanks,
Song

[1] https://lore.kernel.org/lkml/20230526051529.3387103-1-s...@kernel.org/
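
(For reference, the per-arch override described in the cover letter
takes roughly this shape; the alignment/flags part is from the arm64
patch, while the range fields are illustrative guesses at the API, not
verified names:

	static struct jit_alloc_params jit_alloc_params = {
		.alignment	= MODULE_ALIGN,
		.flags		= JIT_ALLOC_KASAN_SHADOW,
	};

	struct jit_alloc_params *jit_alloc_arch_params(void)
	{
		/* hypothetical field names for main and fallback ranges */
		jit_alloc_params.text.start = module_alloc_base;
		jit_alloc_params.text.end   = module_alloc_base + MODULES_VSIZE;
		return &jit_alloc_params;
	}

Architectures that don't implement the hook get the module_alloc()
compatible defaults.)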


Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX

2023-06-01 Thread Song Liu
On Thu, Jun 1, 2023 at 4:07 AM Mike Rapoport  wrote:
>
> On Thu, Jun 01, 2023 at 12:30:50PM +0200, Peter Zijlstra wrote:
> > On Thu, Jun 01, 2023 at 01:12:56PM +0300, Mike Rapoport wrote:
> >
> > > +static void __init_or_module do_text_poke(void *addr, const void *opcode, size_t len)
> > > +{
> > > +   if (system_state < SYSTEM_RUNNING) {
> > > +   text_poke_early(addr, opcode, len);
> > > +   } else {
> > > +   mutex_lock(&text_mutex);
> > > +   text_poke(addr, opcode, len);
> > > +   mutex_unlock(&text_mutex);
> > > +   }
> > > +}
> >
> > So I don't much like do_text_poke(); why?
>
> I believe the idea was to keep memcpy for early boot before the kernel
> image is protected without going and adding if (is_module_text_address())
> all over the place.
>
> I think this can be used instead without updating all the call sites of
> text_poke_early():
>
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 91057de8e6bc..f994e63e9903 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -1458,7 +1458,7 @@ void __init_or_module text_poke_early(void *addr, const 
> void *opcode,
>  * code cannot be running and speculative code-fetches are
>  * prevented. Just change the code.
>  */
> -   memcpy(addr, opcode, len);
> +   text_poke_copy(addr, opcode, len);
> } else {
> local_irq_save(flags);
> memcpy(addr, opcode, len);
>

This alone doesn't work, as text_poke_early() is called
before addr is added to the list of module texts. So we
still use memcpy() here.

Thanks,
Song


Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX

2023-06-01 Thread Nadav Amit



> On Jun 1, 2023, at 1:50 PM, Edgecombe, Rick P wrote:
> 
> On Thu, 2023-06-01 at 14:38 -0400, Kent Overstreet wrote:
>> On Thu, Jun 01, 2023 at 06:13:44PM +, Edgecombe, Rick P wrote:
 text_poke() _does_ create a separate RW mapping.
>>> 
>>> Sorry, I meant a separate RW allocation.
>> 
>> Ah yes, that makes sense
>> 
>> 
>>> 
 
 The thing that sucks about text_poke() is that it always does a
 full
 TLB
 flush, and AFAICT that's not remotely needed. What it really
 wants to
 be
 doing is conceptually just
 
 kmap_local()
 mempcy()
 kunmap_loca()
 flush_icache();
 
 ...except that kmap_local() won't actually create a new mapping
 on
 non-highmem architectures, so text_poke() open codes it.
>>> 
>>> Text poke creates only a local CPU RW mapping. It's more secure
>>> because
>>> other threads can't write to it.
>> 
>> *nod*, same as kmap_local
> 
> It's only used and flushed locally, but it is accessible to all CPU's,
> right?
> 
>> 
>>> It also only needs to flush the local core when it's done since
>>> it's
>>> not using a shared MM.
>>  
>> Ahh! Thanks for that; perhaps the comment in text_poke() about IPIs
>> could be a bit clearer.
>> 
>> What is it (if anything) you don't like about text_poke() then? It
>> looks
>> like it's doing broadly similar things to kmap_local(), so should be
>> in the same ballpark from a performance POV?
> 
> The way text_poke() is used here, it is creating a new writable alias
> and flushing it for *each* write to the module (like for each write of
> an individual relocation, etc). I was just thinking it might warrant
> some batching or something.

I am not advocating to do so, but if you want to have many efficient
writes, perhaps you can just disable CR0.WP. Just saying that if you
are about to write all over the memory, text_poke() does not provide
too much security for the poking thread.
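
(For concreteness, the CR0.WP idea is conceptually:

	unsigned long cr0;

	preempt_disable();		/* keep the CR0 change CPU-local */
	cr0 = read_cr0();
	write_cr0(cr0 & ~X86_CR0_WP);	/* supervisor writes ignore RO PTEs */
	memcpy(dst, src, len);		/* batch of plain writes */
	write_cr0(cr0);
	preempt_enable();

though note the kernel pins CR0.WP these days, so this is an
illustration of the trade-off being described, not a drop-in.)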



Re: [PATCH v8 00/10] arm64: Add framework to turn an IPI as NMI

2023-06-01 Thread Doug Anderson
Hi,

On Wed, May 10, 2023 at 9:42 AM Doug Anderson  wrote:
>
> Hi,
>
> On Wed, May 10, 2023 at 9:30 AM Mark Rutland  wrote:
> >
> > On Wed, May 10, 2023 at 08:28:17AM -0700, Doug Anderson wrote:
> > > Hi,
> >
> > Hi Doug,
> >
> > > On Wed, Apr 19, 2023 at 3:57 PM Douglas Anderson wrote:
> > > > This is an attempt to resurrect Sumit's old patch series [1] that
> > > > allowed us to use the arm64 pseudo-NMI to get backtraces of CPUs and
> > > > also to round up CPUs in kdb/kgdb. The last post from Sumit that I
> > > > could find was v7, so I called this series v8. I haven't copied all of
> > > > his old changelogs here, but you can find them from the link.
> > > >
> > > > Since v7, I have:
> > > > * Addressed the small amount of feedback that was there for v7.
> > > > * Rebased.
> > > > * Added a new patch that prevents us from spamming the logs with idle
> > > >   tasks.
> > > > * Added an extra patch to gracefully fall back to regular IPIs if
> > > >   pseudo-NMIs aren't there.
> > > >
> > > > Since there appear to be a few different patches series related to
> > > > being able to use NMIs to get stack traces of crashed systems, let me
> > > > try to organize them to the best of my understanding:
> > > >
> > > > a) This series. On its own, a) will (among other things) enable stack
> > > >traces of all running processes with the soft lockup detector if
> > > >you've enabled the sysctl "kernel.softlockup_all_cpu_backtrace". On
> > > >its own, a) doesn't give a hard lockup detector.
> > > >
> > > > b) A different recently-posted series [2] that adds a hard lockup
> > > >detector based on perf. On its own, b) gives a stack crawl of the
> > > >locked up CPU but no stack crawls of other CPUs (even if they're
> > > >locked too). Together with a) + b) we get everything (full lockup
> > > >detect, full ability to get stack crawls).
> > > >
> > > > c) The old Android "buddy" hard lockup detector [3] that I'm
> > > >considering trying to upstream. If b) lands then I believe c) would
> > > >be redundant (at least for arm64). c) on its own is really only
> > > >useful on arm64 for platforms that can print CPU_DBGPCSR somehow
> > > >(see [4]). a) + c) is roughly as good as a) + b).
> >
> > > It's been 3 weeks and I haven't heard a peep on this series. That
> > > means nobody has any objections and it's all good to land, right?
> > > Right? :-P
> >
> > FWIW, there are still longstanding soundness issues in the arm64 pseudo-NMI
> > support (and fixing that requires an overhaul of our DAIF / IRQ flag
> > management, which I've been chipping away at for a number of releases), so I
> > hadn't looked at this in detail yet because the foundations are still 
> > somewhat
> > dodgy.
> >
> > I appreciate that this has been around for a while, and it's on my queue to
> > look at.
>
> Ah, thanks for the heads up! We've been thinking about turning this on
> in production in ChromeOS because it will help us track down a whole
> class of field-generated crash reports that are otherwise opaque to
> us. It sounds as if maybe that's not a good idea quite yet? Do you
> have any idea of how much farther along this needs to go?

I'm still very interested in this topic. Do you have anything concrete
that is broken? Your reply gives me the vibe that there have been a
bunch of bugs found recently and, while there are no known issues,
you're worried that there might be something lingering still. Is that
correct, or do you have something concrete that's broken? Is this
anything others could help with? I could make an attempt, or we might
be able to rope in some of the others who are interested in this topic
and more GIC-knowledgeable to help?

I still have a goal for turning this on for production but obviously
don't want to make the device crashier because of it.

Also: if this really has known bugs, should we put a big WARN_ON splat
if anyone tries to turn on pseudo NMIs? Without that, it feels like
it's a bit of a footgun.


> ...of
> course, we've also run into issues with Mediatek devices because they
> don't save/restore GICR registers properly [1]. In theory, we might be
> able to work around that in the kernel.

To followup with this, we've formulated a plan for dealing with
Mediatek Chromebooks. I doubt anyone is truly interested, but if
anyone is, the details are in the public Google bug tracker [1]. None
of that should block landing the NMI backtrace series.


> In any case, even if there are bugs that would prevent turning this on
> for production, it still seems like we could still land this series.
> It simply wouldn't do anything until someone turned on pseudo NMIs,
> which wouldn't happen till the kinks are worked out.
>
> ...actually, I guess I should say that if all the patches of the
> current series do land then it actually _would_ still do something,
> even without pseudo-NMI. Assuming the last patch looks OK, it would at
> least start falling back to using regular IPIs to do 

Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX

2023-06-01 Thread Song Liu
On Thu, Jun 1, 2023 at 3:15 AM Mike Rapoport  wrote:
>
> From: Song Liu 
>
> Replace direct memory writes to memory allocated for code with text poking
> to allow allocation of executable memory as ROX.
>
> The only exception is arch_prepare_bpf_trampoline() that cannot jit
> directly into module memory yet, so it uses set_memory calls to
> unprotect the memory before writing to it and to protect memory in the
> end.
>
> Signed-off-by: Song Liu 
> Co-developed-by: Mike Rapoport (IBM) 
> Signed-off-by: Mike Rapoport (IBM) 
> ---
>  arch/x86/kernel/alternative.c | 43 +++
>  arch/x86/kernel/ftrace.c  | 41 +
>  arch/x86/kernel/module.c  | 24 +--
>  arch/x86/kernel/static_call.c | 10 
>  arch/x86/kernel/unwind_orc.c  | 13 +++
>  arch/x86/net/bpf_jit_comp.c   | 22 +-

We need the following in this patch (or before this patch).
Otherwise, the system will crash at the VIRTUAL_BUG_ON()
in vmalloc_to_page().

Thanks,
Song

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index bf954d2721c1..4efa8a795ebc 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1084,7 +1084,7 @@ bpf_jit_binary_pack_alloc(unsigned int proglen,
u8 **image_ptr,
return NULL;
}

-   *rw_header = kvmalloc(size, GFP_KERNEL);
+   *rw_header = kvzalloc(size, GFP_KERNEL);
if (!*rw_header) {
bpf_arch_text_copy(&ro_header->size, &size, sizeof(size));
bpf_prog_pack_free(ro_header);
@@ -1092,8 +1092,6 @@ bpf_jit_binary_pack_alloc(unsigned int proglen,
u8 **image_ptr,
return NULL;
}

-   /* Fill space with illegal/arch-dep instructions. */
-   bpf_fill_ill_insns(*rw_header, size);
(*rw_header)->size = size;

hole = min_t(unsigned int, size - (proglen + sizeof(*ro_header)),


[PATCH v4 RESEND 3/3] block: sed-opal: keyring support for SED keys

2023-06-01 Thread gjoyce
From: Greg Joyce 

Extend the SED block driver so it can alternatively
obtain a key from a sed-opal kernel keyring. The SED
ioctls will indicate the source of the key, either
directly in the ioctl data or from the keyring.

This allows the use of SED commands in scripts such as
udev scripts so that drives may be automatically unlocked
as they become available.

Signed-off-by: Greg Joyce 
Reviewed-by: Jonathan Derrick 
---
 block/Kconfig |   2 +
 block/sed-opal.c  | 174 +-
 include/linux/sed-opal.h  |   3 +
 include/uapi/linux/sed-opal.h |   8 +-
 4 files changed, 184 insertions(+), 3 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index 941b2dca70db..76b23114fdeb 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -180,6 +180,8 @@ config BLK_DEBUG_FS_ZONED
 
 config BLK_SED_OPAL
bool "Logic for interfacing with Opal enabled SEDs"
+   depends on KEYS
+   select PSERIES_PLPKS if PPC_PSERIES
help
Builds Logic for interfacing with Opal enabled controllers.
Enabling this option enables users to setup/unlock/lock
diff --git a/block/sed-opal.c b/block/sed-opal.c
index 5b277eaabbc7..7f5f235a9048 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -20,6 +20,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 #include "opal_proto.h"
 
@@ -29,6 +32,8 @@
 /* Number of bytes needed by cmd_finalize. */
 #define CMD_FINALIZE_BYTES_NEEDED 7
 
+static struct key *sed_opal_keyring;
+
 struct opal_step {
int (*fn)(struct opal_dev *dev, void *data);
void *data;
@@ -265,6 +270,101 @@ static void print_buffer(const u8 *ptr, u32 length)
 #endif
 }
 
+/*
+ * Allocate/update a SED Opal key and add it to the SED Opal keyring.
+ */
+static int update_sed_opal_key(const char *desc, u_char *key_data, int keylen)
+{
+   key_ref_t kr;
+
+   if (!sed_opal_keyring)
+   return -ENOKEY;
+
+   kr = key_create_or_update(make_key_ref(sed_opal_keyring, true), "user",
+ desc, (const void *)key_data, keylen,
+ KEY_USR_VIEW | KEY_USR_SEARCH | KEY_USR_WRITE,
+ KEY_ALLOC_NOT_IN_QUOTA | KEY_ALLOC_BUILT_IN |
+   KEY_ALLOC_BYPASS_RESTRICTION);
+   if (IS_ERR(kr)) {
+   pr_err("Error adding SED key (%ld)\n", PTR_ERR(kr));
+   return PTR_ERR(kr);
+   }
+
+   return 0;
+}
+
+/*
+ * Read a SED Opal key from the SED Opal keyring.
+ */
+static int read_sed_opal_key(const char *key_name, u_char *buffer, int buflen)
+{
+   int ret;
+   key_ref_t kref;
+   struct key *key;
+
+   if (!sed_opal_keyring)
+   return -ENOKEY;
+
+   kref = keyring_search(make_key_ref(sed_opal_keyring, true),
+ &key_type_user, key_name, true);
+
+   if (IS_ERR(kref))
+   ret = PTR_ERR(kref);
+
+   key = key_ref_to_ptr(kref);
+   down_read(&key->sem);
+   ret = key_validate(key);
+   if (ret == 0) {
+   if (buflen > key->datalen)
+   buflen = key->datalen;
+
+   ret = key->type->read(key, (char *)buffer, buflen);
+   }
+   up_read(&key->sem);
+
+   key_ref_put(kref);
+
+   return ret;
+}
+
+static int opal_get_key(struct opal_dev *dev, struct opal_key *key)
+{
+   int ret = 0;
+
+   switch (key->key_type) {
+   case OPAL_INCLUDED:
+   /* the key is ready to use */
+   break;
+   case OPAL_KEYRING:
+   /* the key is in the keyring */
+   ret = read_sed_opal_key(OPAL_AUTH_KEY, key->key, OPAL_KEY_MAX);
+   if (ret > 0) {
+   if (ret > U8_MAX) {
+   ret = -ENOSPC;
+   goto error;
+   }
+   key->key_len = ret;
+   key->key_type = OPAL_INCLUDED;
+   }
+   break;
+   default:
+   ret = -EINVAL;
+   break;
+   }
+   if (ret < 0)
+   goto error;
+
+   /* must have a PEK by now or it's an error */
+   if (key->key_type != OPAL_INCLUDED || key->key_len == 0) {
+   ret = -EINVAL;
+   goto error;
+   }
+   return 0;
+error:
+   pr_debug("Error getting password: %d\n", ret);
+   return ret;
+}
+
 static bool check_tper(const void *data)
 {
const struct d0_tper_features *tper = data;
@@ -2271,6 +2371,9 @@ static int opal_secure_erase_locking_range(struct 
opal_dev *dev,
};
int ret;
 
+   ret = opal_get_key(dev, &opal_session->opal_key);
+   if (ret)
+   return ret;
mutex_lock(&dev->dev_lock);
setup_opal_dev(dev);
ret = execute_steps(dev, erase_steps, ARRAY_SIZE(erase_steps));
@@ -2304,6 +2407,9 @@ static int opal_revertlsp(struct opal_dev 

[PATCH v4 RESEND 2/3] block: sed-opal: Implement IOC_OPAL_REVERT_LSP

2023-06-01 Thread gjoyce
From: Greg Joyce 

This is used in conjunction with IOC_OPAL_REVERT_TPR to return a drive to
Original Factory State without erasing the data. If IOC_OPAL_REVERT_LSP
is called with opal_revert_lsp.options bit OPAL_PRESERVE set prior
to calling IOC_OPAL_REVERT_TPR, the drive global locking range will not
be erased.

Signed-off-by: Greg Joyce 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jonathan Derrick 
---
 block/opal_proto.h|  4 
 block/sed-opal.c  | 40 +++
 include/linux/sed-opal.h  |  1 +
 include/uapi/linux/sed-opal.h | 11 ++
 4 files changed, 56 insertions(+)

diff --git a/block/opal_proto.h b/block/opal_proto.h
index 7152aa1f1a49..c3b5bff0b9e4 100644
--- a/block/opal_proto.h
+++ b/block/opal_proto.h
@@ -215,6 +215,10 @@ enum opal_parameter {
OPAL_SUM_SET_LIST = 0x06,
 };
 
+enum opal_revertlsp {
+   OPAL_KEEP_GLOBAL_RANGE_KEY = 0x06,
+};
+
 /* Packets derived from:
  * TCG_Storage_Architecture_Core_Spec_v2.01_r1.00
  * Secion: 3.2.3 ComPackets, Packets & Subpackets
diff --git a/block/sed-opal.c b/block/sed-opal.c
index b55a8eb29f5b..5b277eaabbc7 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -1634,6 +1634,26 @@ static int internal_activate_user(struct opal_dev *dev, 
void *data)
return finalize_and_send(dev, parse_and_check_status);
 }
 
+static int revert_lsp(struct opal_dev *dev, void *data)
+{
+   struct opal_revert_lsp *rev = data;
+   int err;
+
+   err = cmd_start(dev, opaluid[OPAL_THISSP_UID],
+   opalmethod[OPAL_REVERTSP]);
+   add_token_u8(&err, dev, OPAL_STARTNAME);
+   add_token_u64(&err, dev, OPAL_KEEP_GLOBAL_RANGE_KEY);
+   add_token_u8(&err, dev, (rev->options & OPAL_PRESERVE) ?
+   OPAL_TRUE : OPAL_FALSE);
+   add_token_u8(&err, dev, OPAL_ENDNAME);
+   if (err) {
+   pr_debug("Error building REVERT SP command.\n");
+   return err;
+   }
+
+   return finalize_and_send(dev, parse_and_check_status);
+}
+
 static int erase_locking_range(struct opal_dev *dev, void *data)
 {
struct opal_session_info *session = data;
@@ -2275,6 +2295,23 @@ static int opal_get_discv(struct opal_dev *dev, struct 
opal_discovery *discv)
return discv->size; /* modified to actual length of data */
 }
 
+static int opal_revertlsp(struct opal_dev *dev, struct opal_revert_lsp *rev)
+{
+   /* controller will terminate session */
+   const struct opal_step steps[] = {
+   { start_admin1LSP_opal_session, &rev->key },
+   { revert_lsp, rev }
+   };
+   int ret;
+
+   mutex_lock(&dev->dev_lock);
+   setup_opal_dev(dev);
+   ret = execute_steps(dev, steps, ARRAY_SIZE(steps));
+   mutex_unlock(&dev->dev_lock);
+
+   return ret;
+}
+
 static int opal_erase_locking_range(struct opal_dev *dev,
struct opal_session_info *opal_session)
 {
@@ -2842,6 +2879,9 @@ int sed_ioctl(struct opal_dev *dev, unsigned int cmd, 
void __user *arg)
case IOC_OPAL_GET_STATUS:
ret = opal_get_status(dev, arg);
break;
+   case IOC_OPAL_REVERT_LSP:
+   ret = opal_revertlsp(dev, p);
+   break;
case IOC_OPAL_DISCOVERY:
ret = opal_get_discv(dev, p);
break;
diff --git a/include/linux/sed-opal.h b/include/linux/sed-opal.h
index 8e824ab908e9..bf2f9d7e11eb 100644
--- a/include/linux/sed-opal.h
+++ b/include/linux/sed-opal.h
@@ -46,6 +46,7 @@ static inline bool is_sed_ioctl(unsigned int cmd)
case IOC_OPAL_GENERIC_TABLE_RW:
case IOC_OPAL_GET_STATUS:
case IOC_OPAL_DISCOVERY:
+   case IOC_OPAL_REVERT_LSP:
return true;
}
return false;
diff --git a/include/uapi/linux/sed-opal.h b/include/uapi/linux/sed-opal.h
index 98f216cf2241..720725f1c42a 100644
--- a/include/uapi/linux/sed-opal.h
+++ b/include/uapi/linux/sed-opal.h
@@ -56,6 +56,10 @@ struct opal_key {
__u8 key[OPAL_KEY_MAX];
 };
 
+enum opal_revert_lsp_opts {
+   OPAL_PRESERVE = 0x01,
+};
+
 struct opal_lr_act {
struct opal_key key;
__u32 sum;
@@ -156,6 +160,12 @@ struct opal_discovery {
__u64 size;
 };
 
+struct opal_revert_lsp {
+   struct opal_key key;
+   __u32 options;
+   __u32 __pad;
+};
+
 #define IOC_OPAL_SAVE  _IOW('p', 220, struct opal_lock_unlock)
 #define IOC_OPAL_LOCK_UNLOCK   _IOW('p', 221, struct opal_lock_unlock)
 #define IOC_OPAL_TAKE_OWNERSHIP_IOW('p', 222, struct opal_key)
@@ -174,5 +184,6 @@ struct opal_discovery {
 #define IOC_OPAL_GENERIC_TABLE_RW   _IOW('p', 235, struct 
opal_read_write_table)
 #define IOC_OPAL_GET_STATUS _IOR('p', 236, struct opal_status)
 #define IOC_OPAL_DISCOVERY  _IOW('p', 237, struct opal_discovery)
+#define IOC_OPAL_REVERT_LSP _IOW('p', 238, struct opal_revert_lsp)
 
 #endif /* _UAPI_SED_OPAL_H */
-- 

[PATCH v4 RESEND 1/3] block: sed-opal: Implement IOC_OPAL_DISCOVERY

2023-06-01 Thread gjoyce
From: Greg Joyce 

Add IOC_OPAL_DISCOVERY ioctl to return raw discovery data to a SED Opal
application. This allows the application to display drive capabilities
and state.

Signed-off-by: Greg Joyce 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jonathan Derrick 
---
 block/sed-opal.c  | 38 ---
 include/linux/sed-opal.h  |  1 +
 include/uapi/linux/sed-opal.h |  6 ++
 3 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/block/sed-opal.c b/block/sed-opal.c
index c320093c14f1..b55a8eb29f5b 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -457,8 +457,11 @@ static int execute_steps(struct opal_dev *dev,
return error;
 }
 
-static int opal_discovery0_end(struct opal_dev *dev)
+static int opal_discovery0_end(struct opal_dev *dev, void *data)
 {
+   struct opal_discovery *discv_out = data; /* may be NULL */
+   u8 __user *buf_out;
+   u64 len_out;
bool found_com_id = false, supported = true, single_user = false;
const struct d0_header *hdr = (struct d0_header *)dev->resp;
const u8 *epos = dev->resp, *cpos = dev->resp;
@@ -474,6 +477,15 @@ static int opal_discovery0_end(struct opal_dev *dev)
return -EFAULT;
}
 
+   if (discv_out) {
+   buf_out = (u8 __user *)(uintptr_t)discv_out->data;
+   len_out = min_t(u64, discv_out->size, hlen);
+   if (buf_out && copy_to_user(buf_out, dev->resp, len_out))
+   return -EFAULT;
+
+   discv_out->size = hlen; /* actual size of data */
+   }
+
epos += hlen; /* end of buffer */
cpos += sizeof(*hdr); /* current position on buffer */
 
@@ -559,13 +571,13 @@ static int opal_discovery0(struct opal_dev *dev, void 
*data)
if (ret)
return ret;
 
-   return opal_discovery0_end(dev);
+   return opal_discovery0_end(dev, data);
 }
 
 static int opal_discovery0_step(struct opal_dev *dev)
 {
const struct opal_step discovery0_step = {
-   opal_discovery0,
+   opal_discovery0, NULL
};
 
return execute_step(dev, &discovery0_step, 0);
@@ -2247,6 +2259,22 @@ static int opal_secure_erase_locking_range(struct 
opal_dev *dev,
return ret;
 }
 
+static int opal_get_discv(struct opal_dev *dev, struct opal_discovery *discv)
+{
+   const struct opal_step discovery0_step = {
+   opal_discovery0, discv
+   };
+   int ret = 0;
+
+   mutex_lock(&dev->dev_lock);
+   setup_opal_dev(dev);
+   ret = execute_step(dev, &discovery0_step, 0);
+   mutex_unlock(&dev->dev_lock);
+   if (ret)
+   return ret;
+   return discv->size; /* modified to actual length of data */
+}
+
 static int opal_erase_locking_range(struct opal_dev *dev,
struct opal_session_info *opal_session)
 {
@@ -2814,6 +2842,10 @@ int sed_ioctl(struct opal_dev *dev, unsigned int cmd, 
void __user *arg)
case IOC_OPAL_GET_STATUS:
ret = opal_get_status(dev, arg);
break;
+   case IOC_OPAL_DISCOVERY:
+   ret = opal_get_discv(dev, p);
+   break;
+
default:
break;
}
diff --git a/include/linux/sed-opal.h b/include/linux/sed-opal.h
index 31ac562a17d7..8e824ab908e9 100644
--- a/include/linux/sed-opal.h
+++ b/include/linux/sed-opal.h
@@ -45,6 +45,7 @@ static inline bool is_sed_ioctl(unsigned int cmd)
case IOC_OPAL_WRITE_SHADOW_MBR:
case IOC_OPAL_GENERIC_TABLE_RW:
case IOC_OPAL_GET_STATUS:
+   case IOC_OPAL_DISCOVERY:
return true;
}
return false;
diff --git a/include/uapi/linux/sed-opal.h b/include/uapi/linux/sed-opal.h
index d7a1524023db..98f216cf2241 100644
--- a/include/uapi/linux/sed-opal.h
+++ b/include/uapi/linux/sed-opal.h
@@ -151,6 +151,11 @@ struct opal_status {
__u32 reserved;
 };
 
+struct opal_discovery {
+   __u64 data;
+   __u64 size;
+};
+
 #define IOC_OPAL_SAVE  _IOW('p', 220, struct opal_lock_unlock)
 #define IOC_OPAL_LOCK_UNLOCK   _IOW('p', 221, struct opal_lock_unlock)
 #define IOC_OPAL_TAKE_OWNERSHIP_IOW('p', 222, struct opal_key)
@@ -168,5 +173,6 @@ struct opal_status {
 #define IOC_OPAL_WRITE_SHADOW_MBR   _IOW('p', 234, struct opal_shadow_mbr)
 #define IOC_OPAL_GENERIC_TABLE_RW   _IOW('p', 235, struct 
opal_read_write_table)
 #define IOC_OPAL_GET_STATUS _IOR('p', 236, struct opal_status)
+#define IOC_OPAL_DISCOVERY  _IOW('p', 237, struct opal_discovery)
 
 #endif /* _UAPI_SED_OPAL_H */
-- 
gjo...@linux.vnet.ibm.com



[PATCH v4 RESEND 0/3] sed-opal: keyrings, discovery, revert, key store

2023-06-01 Thread gjoyce
From: Greg Joyce 

This patchset has gone through numerous rounds of review and
all comments/suggestions have been addressed. I believe that
this patchset is ready for inclusion.

TCG SED Opal is a specification from The Trusted Computing Group
that allows self encrypting storage devices (SED) to be locked at
power on and require an authentication key to unlock the drive.

The current SED Opal implementation in the block driver
requires that authentication keys be provided in an ioctl
so that they can be presented to the underlying SED
capable drive. Currently, the key is typically entered by
a user with an application like sedutil or sedcli. While
this process works, it does not lend itself to automation
like unlock by a udev rule.

The SED block driver has been extended so it can alternatively
obtain a key from a sed-opal kernel keyring. The SED ioctls
will indicate the source of the key, either directly in the
ioctl data or from the keyring.

Two new SED ioctls have also been added. These are:
  1) IOC_OPAL_REVERT_LSP to revert LSP state
  2) IOC_OPAL_DISCOVERY to discover drive capabilities/state
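
As an illustration, userspace would drive the two new ioctls roughly
like this (a sketch: error handling omitted, fd is an opened SED-capable
block device, and <linux/sed-opal.h> plus <sys/ioctl.h> are assumed):

	struct opal_revert_lsp rev = { 0 };

	rev.key.key_type = OPAL_KEYRING;	/* or OPAL_INCLUDED + inline key */
	rev.options = OPAL_PRESERVE;		/* keep global range data */
	ioctl(fd, IOC_OPAL_REVERT_LSP, &rev);

	char buf[4096];
	struct opal_discovery discv = {
		.data = (__u64)(uintptr_t)buf,	/* caller-supplied buffer */
		.size = sizeof(buf),
	};
	ioctl(fd, IOC_OPAL_DISCOVERY, &discv);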

change log v4:
- rebase to 6.3-rc7
- replaced "255" magic number with U8_MAX

change log:
- rebase to 6.x
- added latest reviews
- removed platform functions for persistent key storage
- replaced key update logic with key_create_or_update()
- minor bracing and padding changes
- add error returns
- opal_key structure is application provided but kernel
  verified
- added brief description of TCG SED Opal

Greg Joyce (3):
  block: sed-opal: Implement IOC_OPAL_DISCOVERY
  block: sed-opal: Implement IOC_OPAL_REVERT_LSP
  block: sed-opal: keyring support for SED keys

 block/Kconfig |   2 +
 block/opal_proto.h|   4 +
 block/sed-opal.c  | 252 +-
 include/linux/sed-opal.h  |   5 +
 include/uapi/linux/sed-opal.h |  25 +++-
 5 files changed, 282 insertions(+), 6 deletions(-)

Signed-off-by: Greg Joyce 
base-commit: 6a8f57ae2eb07ab39a6f0ccad60c760743051026
-- 
gjo...@linux.vnet.ibm.com



[PATCH v6 0/4] generic and PowerPC SED Opal keystore

2023-06-01 Thread gjoyce
From: Greg Joyce 

This patchset has gone through numerous rounds of review and
all comments/suggestions have been addressed. I believe that
this patchset is ready for inclusion.

TCG SED Opal is a specification from The Trusted Computing Group
that allows self encrypting storage devices (SED) to be locked at
power on and require an authentication key to unlock the drive.

Generic functions have been defined for accessing SED Opal keys.
The generic functions are defined as weak so that they may be superseded
by keystore specific versions.

PowerPC/pseries versions of these functions provide read/write access
to SED Opal keys in the PLPKS keystore.

The SED block driver has been modified to read the SED Opal
keystore to populate a key in the SED Opal keyring. Changes to the
SED Opal key will be written to the SED Opal keystore.

Patch 3 "keystore access for SED Opal keys" is dependent on:

https://lore.kernel.org/keyrings/20220818143045.680972-4-gjo...@linux.vnet.ibm.com/T/#u

Changelog
v6: - squashed two commits (suggested by Andrew Donnellan)

v5: - updated to reflect changes in PLPKS API

v4:
- scope reduced to cover just SED Opal keys
- base SED Opal keystore is now in SED block driver
- removed use of enum to indicate type
- refactored common code into common function that read and
  write use
- removed cast to void
- added use of SED Opal keystore functions to SED block driver

v3:
- No code changes, but per reviewer requests, adding additional
  mailing lists(keyring, EFI) for wider review.

v2:
- Include feedback from Gregory Joyce, Eric Richter and
  Murilo Opsfelder Araujo.
- Include suggestions from Michael Ellerman.
- Moved a dependency from generic SED code to this patchset.
  This patchset now builds of its own.



Greg Joyce (3):
  block:sed-opal: SED Opal keystore
  block: sed-opal: keystore access for SED Opal keys
  powerpc/pseries: PLPKS SED Opal keystore support

 arch/powerpc/platforms/pseries/Kconfig|   6 +
 arch/powerpc/platforms/pseries/Makefile   |   1 +
 .../powerpc/platforms/pseries/plpks_sed_ops.c | 114 ++
 block/Kconfig |   1 +
 block/Makefile|   2 +-
 block/sed-opal-key.c  |  24 
 block/sed-opal.c  |  18 ++-
 include/linux/sed-opal-key.h  |  15 +++
 8 files changed, 178 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/platforms/pseries/plpks_sed_ops.c
 create mode 100644 block/sed-opal-key.c
 create mode 100644 include/linux/sed-opal-key.h

-- 
gjo...@linux.vnet.ibm.com



[PATCH v6 1/3] block:sed-opal: SED Opal keystore

2023-06-01 Thread gjoyce
From: Greg Joyce 

TCG SED Opal is a specification from The Trusted Computing Group
that allows self encrypting storage devices (SED) to be locked at
power on and require an authentication key to unlock the drive.

Add read and write functions that allow SED Opal authentication keys
to be stored in a permanent keystore.

Signed-off-by: Greg Joyce 
Reviewed-by: Jonathan Derrick 
---
 block/Makefile   |  2 +-
 block/sed-opal-key.c | 24 
 include/linux/sed-opal-key.h | 15 +++
 3 files changed, 40 insertions(+), 1 deletion(-)
 create mode 100644 block/sed-opal-key.c
 create mode 100644 include/linux/sed-opal-key.h

diff --git a/block/Makefile b/block/Makefile
index 4e01bb71ad6e..464a9f209552 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -35,7 +35,7 @@ obj-$(CONFIG_BLK_DEV_ZONED)   += blk-zoned.o
 obj-$(CONFIG_BLK_WBT)  += blk-wbt.o
 obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o
 obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o
-obj-$(CONFIG_BLK_SED_OPAL) += sed-opal.o
+obj-$(CONFIG_BLK_SED_OPAL) += sed-opal.o sed-opal-key.o
 obj-$(CONFIG_BLK_PM)   += blk-pm.o
 obj-$(CONFIG_BLK_INLINE_ENCRYPTION)+= blk-crypto.o blk-crypto-profile.o \
   blk-crypto-sysfs.o
diff --git a/block/sed-opal-key.c b/block/sed-opal-key.c
new file mode 100644
index ..16f380164c44
--- /dev/null
+++ b/block/sed-opal-key.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * SED key operations.
+ *
+ * Copyright (C) 2022 IBM Corporation
+ *
+ * These are the accessor functions (read/write) for SED Opal
+ * keys. Specific keystores can provide overrides.
+ *
+ */
+
+#include 
+#include 
+#include 
+
+int __weak sed_read_key(char *keyname, char *key, u_int *keylen)
+{
+   return -EOPNOTSUPP;
+}
+
+int __weak sed_write_key(char *keyname, char *key, u_int keylen)
+{
+   return -EOPNOTSUPP;
+}
diff --git a/include/linux/sed-opal-key.h b/include/linux/sed-opal-key.h
new file mode 100644
index ..c9b1447986d8
--- /dev/null
+++ b/include/linux/sed-opal-key.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * SED key operations.
+ *
+ * Copyright (C) 2022 IBM Corporation
+ *
+ * These are the accessor functions (read/write) for SED Opal
+ * keys. Specific keystores can provide overrides.
+ *
+ */
+
+#include 
+
+int sed_read_key(char *keyname, char *key, u_int *keylen);
+int sed_write_key(char *keyname, char *key, u_int keylen);
-- 
gjo...@linux.vnet.ibm.com



[PATCH v6 3/3] powerpc/pseries: PLPKS SED Opal keystore support

2023-06-01 Thread gjoyce
From: Greg Joyce 

Define operations for SED Opal to read/write keys
from POWER LPAR Platform KeyStore(PLPKS). This allows
non-volatile storage of SED Opal keys.

Signed-off-by: Greg Joyce 
Reviewed-by: Jonathan Derrick 

---
 arch/powerpc/platforms/pseries/Kconfig|   6 +
 arch/powerpc/platforms/pseries/Makefile   |   1 +
 .../powerpc/platforms/pseries/plpks_sed_ops.c | 114 ++
 block/Kconfig |   1 +
 4 files changed, 122 insertions(+)
 create mode 100644 arch/powerpc/platforms/pseries/plpks_sed_ops.c

diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index 21b22bf16ce6..c2f8a29e7b9b 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -163,6 +163,12 @@ config PSERIES_PLPKS
# This option is selected by in-kernel consumers that require
# access to the PKS.
 
+config PSERIES_PLPKS_SED
+   depends on PPC_PSERIES
+   bool
+   # This option is selected by in-kernel consumers that require
+   # access to the SED PKS keystore.
+
 config PAPR_SCM
depends on PPC_PSERIES && MEMORY_HOTPLUG && LIBNVDIMM
tristate "Support for the PAPR Storage Class Memory interface"
diff --git a/arch/powerpc/platforms/pseries/Makefile 
b/arch/powerpc/platforms/pseries/Makefile
index 53c3b91af2f7..1476c5e4433c 100644
--- a/arch/powerpc/platforms/pseries/Makefile
+++ b/arch/powerpc/platforms/pseries/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_PPC_SVM) += svm.o
 obj-$(CONFIG_FA_DUMP)  += rtas-fadump.o
 obj-$(CONFIG_PSERIES_PLPKS)+= plpks.o
 obj-$(CONFIG_PPC_SECURE_BOOT)  += plpks-secvar.o
+obj-$(CONFIG_PSERIES_PLPKS_SED)+= plpks_sed_ops.o
 obj-$(CONFIG_SUSPEND)  += suspend.o
 obj-$(CONFIG_PPC_VAS)  += vas.o vas-sysfs.o
 
diff --git a/arch/powerpc/platforms/pseries/plpks_sed_ops.c 
b/arch/powerpc/platforms/pseries/plpks_sed_ops.c
new file mode 100644
index ..c1d08075e850
--- /dev/null
+++ b/arch/powerpc/platforms/pseries/plpks_sed_ops.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * POWER Platform specific code for non-volatile SED key access
+ * Copyright (C) 2022 IBM Corporation
+ *
+ * Define operations for SED Opal to read/write keys
+ * from POWER LPAR Platform KeyStore(PLPKS).
+ *
+ * Self Encrypting Drives(SED) key storage using PLPKS
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * structure that contains all SED data
+ */
+struct plpks_sed_object_data {
+   u_char version;
+   u_char pad1[7];
+   u_long authority;
+   u_long range;
+   u_int  key_len;
+   u_char key[32];
+};
+
+#define PLPKS_SED_OBJECT_DATA_V0 0
+#define PLPKS_SED_MANGLED_LABEL "/default/pri"
+#define PLPKS_SED_COMPONENT "sed-opal"
+#define PLPKS_SED_KEY   "opal-boot-pin"
+
+/*
+ * authority is admin1 and range is global
+ */
+#define PLPKS_SED_AUTHORITY  0x000900010001
+#define PLPKS_SED_RANGE  0x08020001
+
+void plpks_init_var(struct plpks_var *var, char *keyname)
+{
+   var->name = keyname;
+   var->namelen = strlen(keyname);
+   if (strcmp(PLPKS_SED_KEY, keyname) == 0) {
+   var->name = PLPKS_SED_MANGLED_LABEL;
+   var->namelen = strlen(keyname);
+   }
+   var->policy = PLPKS_WORLDREADABLE;
+   var->os = PLPKS_VAR_COMMON;
+   var->data = NULL;
+   var->datalen = 0;
+   var->component = PLPKS_SED_COMPONENT;
+}
+
+/*
+ * Read the SED Opal key from PLPKS given the label
+ */
+int sed_read_key(char *keyname, char *key, u_int *keylen)
+{
+   struct plpks_var var;
+   struct plpks_sed_object_data data;
+   int ret;
+   u_int len;
+
+   plpks_init_var(&var, keyname);
+   var.data = (u8 *)&data;
+   var.datalen = sizeof(data);
+
+   ret = plpks_read_os_var(&var);
+   if (ret != 0)
+   return ret;
+
+   len = min_t(u16, be32_to_cpu(data.key_len), var.datalen);
+   memcpy(key, data.key, len);
+   key[len] = '\0';
+   *keylen = len;
+
+   return 0;
+}
+
+/*
+ * Write the SED Opal key to PLPKS given the label
+ */
+int sed_write_key(char *keyname, char *key, u_int keylen)
+{
+   struct plpks_var var;
+   struct plpks_sed_object_data data;
+   struct plpks_var_name vname;
+
+   plpks_init_var(&var, keyname);
+
+   var.datalen = sizeof(struct plpks_sed_object_data);
+   var.data = (u8 *)&data;
+
+   /* initialize SED object */
+   data.version = PLPKS_SED_OBJECT_DATA_V0;
+   data.authority = cpu_to_be64(PLPKS_SED_AUTHORITY);
+   data.range = cpu_to_be64(PLPKS_SED_RANGE);
+   memset(&data.pad1, '\0', sizeof(data.pad1));
+   data.key_len = cpu_to_be32(keylen);
+   memcpy(data.key, (char *)key, keylen);
+
+   /*
+* Key update requires remove first. The return value
+* is ignored since it's okay if the key doesn't exist.
+*/

[PATCH v6 2/3] block/sed-opal: keystore access for SED Opal keys

2023-06-01 Thread gjoyce
From: Greg Joyce 

Allow for permanent SED Opal authentication keys by
reading/writing to the SED Opal non-volatile keystore.

Signed-off-by: Greg Joyce 
Reviewed-by: Jonathan Derrick 
---
 block/sed-opal.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/block/sed-opal.c b/block/sed-opal.c
index 7f5f235a9048..1e8cfa00b609 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2803,7 +2804,13 @@ static int opal_set_new_pw(struct opal_dev *dev, struct 
opal_new_pw *opal_pw)
if (ret)
return ret;
 
-   /* update keyring with new password */
+   /* update keyring and key store with new password */
+   ret = sed_write_key(OPAL_AUTH_KEY,
+   opal_pw->new_user_pw.opal_key.key,
+   opal_pw->new_user_pw.opal_key.key_len);
+   if (ret != -EOPNOTSUPP)
+   pr_warn("error updating SED key: %d\n", ret);
+
ret = update_sed_opal_key(OPAL_AUTH_KEY,
  opal_pw->new_user_pw.opal_key.key,
  opal_pw->new_user_pw.opal_key.key_len);
@@ -3050,6 +3057,8 @@ EXPORT_SYMBOL_GPL(sed_ioctl);
 static int __init sed_opal_init(void)
 {
struct key *kr;
+   char init_sed_key[OPAL_KEY_MAX];
+   int keylen = OPAL_KEY_MAX - 1;
 
kr = keyring_alloc(".sed_opal",
   GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, current_cred(),
@@ -3062,6 +3071,11 @@ static int __init sed_opal_init(void)
 
sed_opal_keyring = kr;
 
-   return 0;
+   if (sed_read_key(OPAL_AUTH_KEY, init_sed_key, &keylen) < 0) {
+   memset(init_sed_key, '\0', sizeof(init_sed_key));
+   keylen = OPAL_KEY_MAX - 1;
+   }
+
+   return update_sed_opal_key(OPAL_AUTH_KEY, init_sed_key, keylen);
 }
 late_initcall(sed_opal_init);
-- 
gjo...@linux.vnet.ibm.com



Re: [PATCH 04/13] mm/jitalloc, arch: convert remaining overrides of module_alloc to jitalloc

2023-06-01 Thread Song Liu
On Thu, Jun 1, 2023 at 3:13 AM Mike Rapoport  wrote:
>
> From: "Mike Rapoport (IBM)" 
>
> Extend jitalloc parameters to accommodate more complex overrides of
> module_alloc() by architectures.
>
> This includes specification of a fallback range required by arm, arm64
> and powerpc and support for allocation of KASAN shadow required by
> arm64, s390 and x86.
>
> The core implementation of jit_alloc() takes care of suppressing warnings
> when the initial allocation fails but there is a fallback range defined.
>
> Signed-off-by: Mike Rapoport (IBM) 

[...]

>
> diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> index 5af4975caeb5..ecf1f4030317 100644
> --- a/arch/arm64/kernel/module.c
> +++ b/arch/arm64/kernel/module.c
> @@ -17,56 +17,49 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
>  #include 
>
> -void *module_alloc(unsigned long size)
> +static struct jit_alloc_params jit_alloc_params = {
> +   .alignment  = MODULE_ALIGN,
> +   .flags  = JIT_ALLOC_KASAN_SHADOW,
> +};
> +
> +struct jit_alloc_params *jit_alloc_arch_params(void)
>  {
> u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;

module_alloc_base is initialized in kaslr_init(), which is called after
mm_core_init(). We will need some special logic for this.

Thanks,
Song

> -   gfp_t gfp_mask = GFP_KERNEL;
> -   void *p;
> -
> -   /* Silence the initial allocation */
> -   if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
> -   gfp_mask |= __GFP_NOWARN;
>

[...]


Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX

2023-06-01 Thread Edgecombe, Rick P
On Thu, 2023-06-01 at 14:38 -0400, Kent Overstreet wrote:
> On Thu, Jun 01, 2023 at 06:13:44PM +, Edgecombe, Rick P wrote:
> > > text_poke() _does_ create a separate RW mapping.
> > 
> > Sorry, I meant a separate RW allocation.
> 
> Ah yes, that makes sense
> 
> 
> > 
> > > 
> > > The thing that sucks about text_poke() is that it always does a
> > > full
> > > TLB
> > > flush, and AFAICT that's not remotely needed. What it really
> > > wants to
> > > be
> > > doing is conceptually just
> > > 
> > > kmap_local()
> > > mempcy()
> > > kunmap_loca()
> > > flush_icache();
> > > 
> > > ...except that kmap_local() won't actually create a new mapping
> > > on
> > > non-highmem architectures, so text_poke() open codes it.
> > 
> > Text poke creates only a local CPU RW mapping. It's more secure
> > because
> > other threads can't write to it.
> 
> *nod*, same as kmap_local

It's only used and flushed locally, but it is accessible to all CPU's,
right?

> 
> > It also only needs to flush the local core when it's done since
> > it's
> > not using a shared MM.
>  
> Ahh! Thanks for that; perhaps the comment in text_poke() about IPIs
> could be a bit clearer.
> 
> What is it (if anything) you don't like about text_poke() then? It
> looks
> like it's doing broadly similar things to kmap_local(), so should be
> in the same ballpark from a performance POV?

The way text_poke() is used here, it is creating a new writable alias
and flushing it for *each* write to the module (like for each write of
an individual relocation, etc). I was just thinking it might warrant
some batching or something.
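Something along these lines, say (a sketch only: text_poke_copy() does
exist for multi-byte copies, but the staging buffer, for_each_relocation()
and apply_relocation() here are hypothetical):

	/* Stage all fixups for a section in a plain RW buffer, then
	 * publish them with a single copy instead of one text_poke()
	 * (and one flush) per relocation. */
	void *buf = kvmalloc(sec_size, GFP_KERNEL);

	memcpy(buf, sec_base, sec_size);
	for_each_relocation(rel, sec)
		apply_relocation(buf + rel->offset, rel); /* plain writes */

	text_poke_copy(sec_addr, buf, sec_size);
	kvfree(buf);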



Re: [PATCH v2 23/25] iommu: Add ops->domain_alloc_paging()

2023-06-01 Thread Jason Gunthorpe
On Thu, Jun 01, 2023 at 08:17:56PM +0100, Robin Murphy wrote:
> On 2023-05-16 01:00, Jason Gunthorpe wrote:
> > This callback requests the driver to create only a __IOMMU_DOMAIN_PAGING
> > domain, so it saves a few lines in a lot of drivers needlessly checking
> > the type.
> > 
> > More critically, this allows us to sweep out all the
> > IOMMU_DOMAIN_UNMANAGED and IOMMU_DOMAIN_DMA checks from a lot of the
> > drivers, simplifying what is going on in the code and ultimately removing
> > the now-unused special cases in drivers where they did not support
> > IOMMU_DOMAIN_DMA.
> > 
> > domain_alloc_paging() should return a struct iommu_domain that is
> > functionally compatible with ARM_DMA_USE_IOMMU, dma-iommu.c and iommufd.
> > 
> > Be forward-looking and pass in a 'struct device *' argument. We can
> > provide this when allocating the default_domain. No drivers will look at
> > this.
> 
> As mentioned before, we already know we're going to need additional flags
> (and possibly data) to cover the existing set_pgtable_quirks use-case plus
> new stuff like the proposed dirty-tracking enable, so I'd be inclined to
> either add an extensible structure argument now to avoid future
> churn,

I think I said this doesn't really work out because you still have to
go into every driver and code up a 'does not support' check and fail
if any new extension is added. Basically the same churn as adding a
function argument.

Adding more ops is possibly a smoother way to support this. Keep
alloc_paging() for these simple drivers, and we can add an op with a
big signature and structure or whatever for the fewer fully featured
drivers.

This way the core code can do the 'does not support' checks in one place
based on the set ops, and we keep boilerplate code out of a lot of
drivers.
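Roughly this shape, say (op name and argument struct purely illustrative):

	struct iommu_domain_alloc_args {
		u32	flags;		/* e.g. dirty-tracking enable */
		u32	pgtable_quirks;
	};

	struct iommu_ops {
		/* ... */
		/* simple drivers implement only this: */
		struct iommu_domain *(*domain_alloc_paging)(struct device *dev);
		/* fully featured drivers implement this instead: */
		struct iommu_domain *(*domain_alloc_paging_args)(
				struct device *dev,
				const struct iommu_domain_alloc_args *args);
	};

Then the core rejects requests carrying args a driver can't honor, in one
place, instead of every driver open coding the check.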

> or just not bother adding the device argument either until drivers
> can actually use it.

I debated this, I figured we'd use it pretty quickly, but certainly it
can be taken out.

> > @@ -1995,14 +1995,25 @@ void iommu_set_fault_handler(struct iommu_domain 
> > *domain,
> >   EXPORT_SYMBOL_GPL(iommu_set_fault_handler);
> >   static struct iommu_domain *__iommu_domain_alloc(const struct iommu_ops 
> > *ops,
> > +struct device *dev,
> >  unsigned int type)
> >   {
> > struct iommu_domain *domain;
> > if (type == IOMMU_DOMAIN_IDENTITY && ops->identity_domain)
> > return ops->identity_domain;
> > +   else if ((type == IOMMU_DOMAIN_UNMANAGED || type == IOMMU_DOMAIN_DMA) &&
> > +ops->domain_alloc_paging) {
> > +   /*
> > +* For now exclude DMA_FQ since it is still a driver policy
> > +* decision through domain_alloc() if we can use FQ mode.
> > +*/
> 
> That's sorted now, so the type test can neatly collapse down to "type &
> __IOMMU_DOMAIN_PAGING".

Thanks, I'll rebase it on Joerg's tree with that series and fix it up

Jason


Re: [PATCH v3 03/34] s390: Use pt_frag_refcount for pagetables

2023-06-01 Thread Vishal Moola
On Thu, Jun 1, 2023 at 6:19 AM Gerald Schaefer
 wrote:
>
>  On Wed, 31 May 2023 14:30:01 -0700
> "Vishal Moola (Oracle)"  wrote:
>
> > s390 currently uses _refcount to identify fragmented page tables.
> > The page table struct already has a member pt_frag_refcount used by
> > powerpc, so have s390 use that instead of the _refcount field as well.
> > This improves the safety for _refcount and the page table tracking.
> >
> > This also allows us to simplify the tracking since we can once again use
> > the lower byte of pt_frag_refcount instead of the upper byte of _refcount.
>
> This would conflict with the s390 part of the pte_free_defer() work from
> Hugh Dickins
> https://lore.kernel.org/lkml/35e983f5-7ed3-b310-d949-9ae8b130c...@google.com/
> https://lore.kernel.org/lkml/6dd63b39-e71f-2e8b-7e0-83e02f3bc...@google.com/
>
> There he uses pt_frag_refcount, or rather pt_mm in the same union, to save
> the mm_struct for deferred pte_free().
>
> I still need to look closer into both of your patch series, but so far it
> seems that you have no hard functional requirement to switch from _refcount
> to pt_frag_refcount here, for s390.
>
> If this is correct, and you do not e.g. need this to make some other use
> of _refcount, I would suggest to drop this patch.

The goal of this preparation patch is to consolidate s390's usage of
struct page fields so that struct ptdesc can be smaller. It's not particularly
mandatory; leaving _refcount in ptdesc only increases the struct by
8 bytes and can always be changed later.

However it is a little annoying since s390 is the only architecture
that egregiously uses space throughout struct page for their page
tables, rather than just the page table struct. For example, s390
gmap uses page->index which also aliases with pt_mm and
pt_frag_refcount. I'm not sure if/how gmap page tables interact
with s390 process page tables at all, but if they do that could
potentially cause problems with Hugh's patch as well :(
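The aliasing in question looks roughly like this (abridged from struct
page in mm_types.h, from memory, so double-check the exact layout):

	struct {	/* Page table pages */
		unsigned long _pt_pad_1;	/* compound_head */
		pgtable_t pmd_huge_pte;
		unsigned long _pt_pad_2;	/* mapping */
		union {
			struct mm_struct *pt_mm;	/* same slot as page->index */
			atomic_t pt_frag_refcount;
		};
		/* ... ptl, etc. */
	};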

I can add _refcount to ptdesc if we would like, but I still
prefer if s390 could be simplified instead.


Re: [PATCH v2 25/25] iommu: Convert remaining simple drivers to domain_alloc_paging()

2023-06-01 Thread Jason Gunthorpe
On Thu, Jun 01, 2023 at 08:47:28PM +0100, Robin Murphy wrote:
> On 2023-05-16 01:00, Jason Gunthorpe wrote:
> > These drivers don't support IOMMU_DOMAIN_DMA, so this commit effectively
> > allows them to support that mode.
> > 
> > The prior work to require default_domains makes this safe because every
> > one of these drivers is either compilation incompatible with dma-iommu.c,
> > or already establishing a default_domain. In both cases alloc_domain()
> > will never be called with IOMMU_DOMAIN_DMA for these drivers so it is safe
> > to drop the test.
> > 
> > Removing these tests clarifies that the domain allocation path is only
> > about the functionality of a paging domain and has nothing to do with
> > policy of how the paging domain is used for UNMANAGED/DMA/DMA_FQ.
> > 
> > Tested-by: Niklas Schnelle 
> > Signed-off-by: Jason Gunthorpe 
> > ---
> >   drivers/iommu/fsl_pamu_domain.c | 7 ++-
> >   drivers/iommu/msm_iommu.c   | 7 ++-
> >   drivers/iommu/mtk_iommu_v1.c| 7 ++-
> >   drivers/iommu/omap-iommu.c  | 7 ++-
> >   drivers/iommu/s390-iommu.c  | 7 ++-
> >   5 files changed, 10 insertions(+), 25 deletions(-)
> > 
> > diff --git a/drivers/iommu/fsl_pamu_domain.c 
> > b/drivers/iommu/fsl_pamu_domain.c
> > index ca4f5ebf028783..8d5d6a3acf9dfd 100644
> > --- a/drivers/iommu/fsl_pamu_domain.c
> > +++ b/drivers/iommu/fsl_pamu_domain.c
> > @@ -192,13 +192,10 @@ static void fsl_pamu_domain_free(struct iommu_domain 
> > *domain)
> > kmem_cache_free(fsl_pamu_domain_cache, dma_domain);
> >   }
> > -static struct iommu_domain *fsl_pamu_domain_alloc(unsigned type)
> > +static struct iommu_domain *fsl_pamu_domain_alloc_paging(struct device 
> > *dev)
> 
> This isn't a paging domain - it doesn't support map/unmap, and AFAICT all it
> has ever been intended to do is "isolate" accesses to within an aperture
> which is never set to anything less than the entire physical address space
> :/

Uh.. Yes, I missed that.. Yikes

I'll drop that hunk

It would be horrifying if this driver were ever used with VFIO; we
should definitely block that as it doesn't meet VFIO's security model.

It kind of looks like all this is to support
drivers/soc/fsl/qbman/qman_portal.c which does call
iommu_domain_alloc()?

It looks like the only point is to call fsl_pamu_configure_l1_stash()

Thanks,
Jason


Re: [PATCH v2 09/25] iommu/fsl_pamu: Implement an IDENTITY domain

2023-06-01 Thread Jason Gunthorpe
On Thu, Jun 01, 2023 at 08:53:41PM +0100, Robin Murphy wrote:
> On 2023-06-01 20:46, Jason Gunthorpe wrote:
> > On Thu, Jun 01, 2023 at 08:37:45PM +0100, Robin Murphy wrote:
> > > On 2023-05-16 01:00, Jason Gunthorpe wrote:
> > > > Robin was able to check the documentation and what fsl_pamu has
> > > > historically called detach_dev() is really putting the IOMMU into an
> > > > IDENTITY mode.
> > > 
> > > Unfortunately it was the other way around - it's the call to
> > > fsl_setup_liodns() from fsl_pamu_probe() which leaves everything in bypass
> > > by default (the PAACE_ATM_NO_XLATE part, IIRC), whereas the 
> > > detach_device()
> > > call here ends up disabling the given device's LIODN altogether
> > 
> > Er, I see.. Let me think about it, you convinced me to change it from
> > PLATFORM, so maybe we should go back to that if it is all wonky.
> 
> FWIW I was thinking more along the lines of a token nominal identity domain
> where attach does nothing at all...

I'm worried that would create security problems for VFIO.. At least
the driver currently wipes out the VFIO installed translation which
sounds like the right thing to do.

So, I think my first patch was right, we should label this
PLATFORM/PRIVATE/whatever and just leave it as is with some comments
explaining this thread. Based on the same rationale as my prior email
that we should label things correctly and this detach_dev is doing
BLOCKING.

Jason


Re: [PATCH v2 04/25] iommu: Add IOMMU_DOMAIN_PLATFORM for S390

2023-06-01 Thread Jason Gunthorpe
On Thu, Jun 01, 2023 at 07:25:32PM +0100, Robin Murphy wrote:
> On 2023-05-16 01:00, Jason Gunthorpe wrote:
> > The PLATFORM domain will be set as the default domain and attached as
> > normal during probe. The driver will ignore the initial attach from a NULL
> > domain to the PLATFORM domain.
> > 
> > After this, the PLATFORM domain's attach_dev will be called whenever we
> > detach from an UNMANAGED domain (eg for VFIO). This is the same time the
> > original design would have called op->detach_dev().
> > 
> > This is temporary until the S390 dma-iommu.c conversion is merged.
> 
> If we do need a stopgap here, can we please just call the current situation
> an identity domain? 

I'd really rather not lie, especially since we need something for
PPC more permanently. I definitely don't want to call PPC IDENTITY.

> Then similarly for patch #3 - since we already know s390 is temporary, it
> seems an anathema to introduce a whole domain type with its own weird
> ops->default_domain mechanism solely for POWER to not actually use domains
> with.

A main point of this entire series is to remove the NULL default
domain, so power's weirdness has to be accommodated somehow.
 
> In terms of reasoning, I don't see that IOMMU_DOMAIN_PLATFORM is any more
> useful than a NULL default domain, it just renames the problem

Yes! Renaming is the whole point.

We are giving actual meaningful names to all the things that drivers
did with NULL. The ones that actually implemented IDENTITY correctly
get to be called IDENTITY, everyone else needs to get labeled
something else.

Then we can tell what is correct and what needs fixing at a glance.

That is the *whole point*.

> and gives us more code to maintain for the privilege.

The PLATFORM bit is only 3 lines of core code. Let's not overstate the
cost of a label please.

> As I say, though, we don't actually need to juggle the semantic of a
> "we don't know what's happening here" domain around any further,
> since it works out that a "we're not influencing anything here"
> domain actually suffices for what we want to reason about, and those
> are already well-defined. Sure, the platform DMA ops *might* be
> doing more, but that's beyond the scope of the IOMMU API either
> way. At that point, lo and behold, s390 and POWER now look just like
> ARM and the core code only needs a single special case for
> arch-specific default identity domains, lovely!

I'm really against mis-labeling things.

That is totally the wrong direction for this series. The point is to
label things correctly. Labeling something that is not an IDENTITY
operation as IDENTITY is just repeating the whole mistake of an
ill-defined NULL all over again.

Labels need to be correct, labels are cheap to add, so lets not be
afraid to use proper labels to actually describe the underlying
behavior.

If you don't like PLATFORM we can choose PRIVATE, OPAQUE or some other
label, but not IDENTITY.

Jason


Re: [PATCH v2 09/25] iommu/fsl_pamu: Implement an IDENTITY domain

2023-06-01 Thread Robin Murphy

On 2023-06-01 20:46, Jason Gunthorpe wrote:

On Thu, Jun 01, 2023 at 08:37:45PM +0100, Robin Murphy wrote:

On 2023-05-16 01:00, Jason Gunthorpe wrote:

Robin was able to check the documentation and what fsl_pamu has
historically called detach_dev() is really putting the IOMMU into an
IDENTITY mode.


Unfortunately it was the other way around - it's the call to
fsl_setup_liodns() from fsl_pamu_probe() which leaves everything in bypass
by default (the PAACE_ATM_NO_XLATE part, IIRC), whereas the detach_device()
call here ends up disabling the given device's LIODN altogether


Er, I see.. Let me think about it, you convinced me to change it from
PLATFORM, so maybe we should go back to that if it is all wonky.


FWIW I was thinking more along the lines of a token nominal identity 
domain where attach does nothing at all...



There doesn't appear to have ever been any code anywhere for putting
things *back* into bypass after using a VFIO domain, so as-is these
default domains would probably break all DMA :(


Sounds like it just never worked right.

ie going to VFIO mode was always a one way trip and you can't go back
to a kernel driver.


...on the assumption that doing so wouldn't really be any less broken 
than it always has been :)


Thanks,
Robin.


I don't think this patch makes it worse because we call the identity
attach_dev in all the same places we called detach_dev in the first
place.

We add an extra call at the start of time, but that call is NOP'd
by this:


if (domain == platform_domain || !domain)
+   return 0;
+


(bah, and the variable name needs updating too)

Honestly, I don't really want to fix FSL since it seems abandoned, so
either this patch or going back to PLATFORM seems like the best option.

Jason


Re: [PATCH v2 25/25] iommu: Convert remaining simple drivers to domain_alloc_paging()

2023-06-01 Thread Robin Murphy

On 2023-05-16 01:00, Jason Gunthorpe wrote:

These drivers don't support IOMMU_DOMAIN_DMA, so this commit effectively
allows them to support that mode.

The prior work to require default_domains makes this safe because every
one of these drivers is either compilation incompatible with dma-iommu.c,
or already establishing a default_domain. In both cases alloc_domain()
will never be called with IOMMU_DOMAIN_DMA for these drivers so it is safe
to drop the test.

Removing these tests clarifies that the domain allocation path is only
about the functionality of a paging domain and has nothing to do with
policy of how the paging domain is used for UNMANAGED/DMA/DMA_FQ.

Tested-by: Niklas Schnelle 
Signed-off-by: Jason Gunthorpe 
---
  drivers/iommu/fsl_pamu_domain.c | 7 ++-
  drivers/iommu/msm_iommu.c   | 7 ++-
  drivers/iommu/mtk_iommu_v1.c| 7 ++-
  drivers/iommu/omap-iommu.c  | 7 ++-
  drivers/iommu/s390-iommu.c  | 7 ++-
  5 files changed, 10 insertions(+), 25 deletions(-)

diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
index ca4f5ebf028783..8d5d6a3acf9dfd 100644
--- a/drivers/iommu/fsl_pamu_domain.c
+++ b/drivers/iommu/fsl_pamu_domain.c
@@ -192,13 +192,10 @@ static void fsl_pamu_domain_free(struct iommu_domain 
*domain)
kmem_cache_free(fsl_pamu_domain_cache, dma_domain);
  }
  
-static struct iommu_domain *fsl_pamu_domain_alloc(unsigned type)

+static struct iommu_domain *fsl_pamu_domain_alloc_paging(struct device *dev)


This isn't a paging domain - it doesn't support map/unmap, and AFAICT 
all it has ever been intended to do is "isolate" accesses to within an 
aperture which is never set to anything less than the entire physical 
address space :/


I hate to imagine what the VFIO userspace applications looked like...

Thanks,
Robin.


  {
struct fsl_dma_domain *dma_domain;
  
-	if (type != IOMMU_DOMAIN_UNMANAGED)

-   return NULL;
-
dma_domain = kmem_cache_zalloc(fsl_pamu_domain_cache, GFP_KERNEL);
if (!dma_domain)
return NULL;
@@ -476,7 +473,7 @@ static const struct iommu_ops fsl_pamu_ops = {
	.identity_domain = &fsl_pamu_identity_domain,
	.def_domain_type = &fsl_pamu_def_domain_type,
.capable= fsl_pamu_capable,
-   .domain_alloc   = fsl_pamu_domain_alloc,
+   .domain_alloc_paging = fsl_pamu_domain_alloc_paging,
.probe_device   = fsl_pamu_probe_device,
.device_group   = fsl_pamu_device_group,
.default_domain_ops = &(const struct iommu_domain_ops) {
diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 26ed81cfeee897..a163cee0b7242d 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -302,13 +302,10 @@ static void __program_context(void __iomem *base, int ctx,
SET_M(base, ctx, 1);
  }
  
-static struct iommu_domain *msm_iommu_domain_alloc(unsigned type)

+static struct iommu_domain *msm_iommu_domain_alloc_paging(struct device *dev)
  {
struct msm_priv *priv;
  
-	if (type != IOMMU_DOMAIN_UNMANAGED)

-   return NULL;
-
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv)
goto fail_nomem;
@@ -691,7 +688,7 @@ irqreturn_t msm_iommu_fault_handler(int irq, void *dev_id)
  
  static struct iommu_ops msm_iommu_ops = {

	.identity_domain = &msm_iommu_identity_domain,
-   .domain_alloc = msm_iommu_domain_alloc,
+   .domain_alloc_paging = msm_iommu_domain_alloc_paging,
.probe_device = msm_iommu_probe_device,
.device_group = generic_device_group,
.pgsize_bitmap = MSM_IOMMU_PGSIZES,
diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index 7c0c1d50df5f75..67e044c1a7d93b 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -270,13 +270,10 @@ static int mtk_iommu_v1_domain_finalise(struct 
mtk_iommu_v1_data *data)
return 0;
  }
  
-static struct iommu_domain *mtk_iommu_v1_domain_alloc(unsigned type)

+static struct iommu_domain *mtk_iommu_v1_domain_alloc_paging(struct device 
*dev)
  {
struct mtk_iommu_v1_domain *dom;
  
-	if (type != IOMMU_DOMAIN_UNMANAGED)

-   return NULL;
-
dom = kzalloc(sizeof(*dom), GFP_KERNEL);
if (!dom)
return NULL;
@@ -585,7 +582,7 @@ static int mtk_iommu_v1_hw_init(const struct 
mtk_iommu_v1_data *data)
  
  static const struct iommu_ops mtk_iommu_v1_ops = {

	.identity_domain = &mtk_iommu_v1_identity_domain,
-   .domain_alloc   = mtk_iommu_v1_domain_alloc,
+   .domain_alloc_paging = mtk_iommu_v1_domain_alloc_paging,
.probe_device   = mtk_iommu_v1_probe_device,
.probe_finalize = mtk_iommu_v1_probe_finalize,
.release_device = mtk_iommu_v1_release_device,
diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 34340ef15241bc..fcf99bd195b32e 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1580,13 

Re: [PATCH v2 09/25] iommu/fsl_pamu: Implement an IDENTITY domain

2023-06-01 Thread Jason Gunthorpe
On Thu, Jun 01, 2023 at 08:37:45PM +0100, Robin Murphy wrote:
> On 2023-05-16 01:00, Jason Gunthorpe wrote:
> > Robin was able to check the documentation and what fsl_pamu has
> > historically called detach_dev() is really putting the IOMMU into an
> > IDENTITY mode.
> 
> Unfortunately it was the other way around - it's the call to
> fsl_setup_liodns() from fsl_pamu_probe() which leaves everything in bypass
> by default (the PAACE_ATM_NO_XLATE part, IIRC), whereas the detach_device()
> call here ends up disabling the given device's LIODN altogether

Er, I see.. Let me think about it, you convinced me to change it from
PLATFORM, so maybe we should go back to that if it is all wonky.

> There doesn't appear to have ever been any code anywhere for putting
> things *back* into bypass after using a VFIO domain, so as-is these
> default domains would probably break all DMA :(

Sounds like it just never worked right.

ie going to VFIO mode was always a one way trip and you can't go back
to a kernel driver.

I don't think this patch makes it worse because we call the identity
attach_dev in all the same places we called detach_dev in the first
place.

We add an extra call at the start of time, but that call is NOP'd
by this:

> > if (domain == platform_domain || !domain)
> > +   return 0;
> > +

(bah, and the variable name needs updating too)

Honestly, I don't really want to fix FSL since it seems abandoned, so
either this patch or going back to PLATFORM seems like the best option.

Jason


Re: [PATCH v2 09/25] iommu/fsl_pamu: Implement an IDENTITY domain

2023-06-01 Thread Robin Murphy

On 2023-05-16 01:00, Jason Gunthorpe wrote:

Robin was able to check the documentation and what fsl_pamu has
historically called detach_dev() is really putting the IOMMU into an
IDENTITY mode.


Unfortunately it was the other way around - it's the call to 
fsl_setup_liodns() from fsl_pamu_probe() which leaves everything in 
bypass by default (the PAACE_ATM_NO_XLATE part, IIRC), whereas the 
detach_device() call here ends up disabling the given device's LIODN 
altogether. There doesn't appear to have ever been any code anywhere for 
putting things *back* into bypass after using a VFIO domain, so as-is 
these default domains would probably break all DMA :(


Thanks,
Robin.


Move to the new core support for ARM_DMA_USE_IOMMU by defining
ops->identity_domain. This is a ppc driver without any dma_ops, so ensure
the identity translation is the default domain.

Signed-off-by: Jason Gunthorpe 
---
  drivers/iommu/fsl_pamu_domain.c | 32 +---
  1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
index bce37229709965..ca4f5ebf028783 100644
--- a/drivers/iommu/fsl_pamu_domain.c
+++ b/drivers/iommu/fsl_pamu_domain.c
@@ -283,15 +283,21 @@ static int fsl_pamu_attach_device(struct iommu_domain 
*domain,
return ret;
  }
  
-static void fsl_pamu_set_platform_dma(struct device *dev)

+static int fsl_pamu_identity_attach(struct iommu_domain *platform_domain,
+   struct device *dev)
  {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-   struct fsl_dma_domain *dma_domain = to_fsl_dma_domain(domain);
+   struct fsl_dma_domain *dma_domain;
const u32 *prop;
int len;
struct pci_dev *pdev = NULL;
struct pci_controller *pci_ctl;
  
+	if (domain == platform_domain || !domain)

+   return 0;
+
+   dma_domain = to_fsl_dma_domain(domain);
+
/*
 * Use LIODN of the PCI controller while detaching a
 * PCI device.
@@ -312,8 +318,18 @@ static void fsl_pamu_set_platform_dma(struct device *dev)
detach_device(dev, dma_domain);
else
pr_debug("missing fsl,liodn property at %pOF\n", dev->of_node);
+   return 0;
  }
  
+static struct iommu_domain_ops fsl_pamu_identity_ops = {

+   .attach_dev = fsl_pamu_identity_attach,
+};
+
+static struct iommu_domain fsl_pamu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &fsl_pamu_identity_ops,
+};
+
  /* Set the domain stash attribute */
  int fsl_pamu_configure_l1_stash(struct iommu_domain *domain, u32 cpu)
  {
@@ -447,12 +463,22 @@ static struct iommu_device *fsl_pamu_probe_device(struct 
device *dev)
	return &pamu_iommu;
  }
  
+static int fsl_pamu_def_domain_type(struct device *dev)

+{
+   /*
+* This platform does not use dma_ops at all so normally the iommu
+* must be in identity mode
+*/
+   return IOMMU_DOMAIN_IDENTITY;
+}
+
  static const struct iommu_ops fsl_pamu_ops = {
+   .identity_domain = &fsl_pamu_identity_domain,
+   .def_domain_type = &fsl_pamu_def_domain_type,
.capable= fsl_pamu_capable,
.domain_alloc   = fsl_pamu_domain_alloc,
.probe_device   = fsl_pamu_probe_device,
.device_group   = fsl_pamu_device_group,
-   .set_platform_dma_ops = fsl_pamu_set_platform_dma,
.default_domain_ops = &(const struct iommu_domain_ops) {
.attach_dev = fsl_pamu_attach_device,
.iova_to_phys   = fsl_pamu_iova_to_phys,


Re: [PATCH v2 23/25] iommu: Add ops->domain_alloc_paging()

2023-06-01 Thread Robin Murphy

On 2023-05-16 01:00, Jason Gunthorpe wrote:

This callback requests the driver to create only a __IOMMU_DOMAIN_PAGING
domain, so it saves a few lines in a lot of drivers needlessly checking
the type.

More critically, this allows us to sweep out all the
IOMMU_DOMAIN_UNMANAGED and IOMMU_DOMAIN_DMA checks from a lot of the
drivers, simplifying what is going on in the code and ultimately removing
the now-unused special cases in drivers where they did not support
IOMMU_DOMAIN_DMA.

domain_alloc_paging() should return a struct iommu_domain that is
functionally compatible with ARM_DMA_USE_IOMMU, dma-iommu.c and iommufd.

Be forward-looking and pass in a 'struct device *' argument. We can
provide this when allocating the default_domain. No drivers will look at
this.


As mentioned before, we already know we're going to need additional 
flags (and possibly data) to cover the existing set_pgtable_quirks 
use-case plus new stuff like the proposed dirty-tracking enable, so I'd 
be inclined to either add an extensible structure argument now to avoid 
future churn, or just not bother adding the device argument either until 
drivers can actually use it.



Signed-off-by: Jason Gunthorpe 
---
  drivers/iommu/iommu.c | 18 +++---
  include/linux/iommu.h |  3 +++
  2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index c4cac1dcf80610..15aa51c356bd74 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1995,14 +1995,25 @@ void iommu_set_fault_handler(struct iommu_domain 
*domain,
  EXPORT_SYMBOL_GPL(iommu_set_fault_handler);
  
  static struct iommu_domain *__iommu_domain_alloc(const struct iommu_ops *ops,

+struct device *dev,
 unsigned int type)
  {
struct iommu_domain *domain;
  
  	if (type == IOMMU_DOMAIN_IDENTITY && ops->identity_domain)

return ops->identity_domain;
+   else if ((type == IOMMU_DOMAIN_UNMANAGED || type == IOMMU_DOMAIN_DMA) &&
+ops->domain_alloc_paging) {
+   /*
+* For now exclude DMA_FQ since it is still a driver policy
+* decision through domain_alloc() if we can use FQ mode.
+*/


That's sorted now, so the type test can neatly collapse down to "type & 
__IOMMU_DOMAIN_PAGING".


Thanks,
Robin.


+   domain = ops->domain_alloc_paging(dev);
+   } else if (ops->domain_alloc)
+   domain = ops->domain_alloc(type);
+   else
+   return NULL;
  
-	domain = ops->domain_alloc(type);

if (!domain)
return NULL;
  
@@ -2033,14 +2044,15 @@ __iommu_group_domain_alloc(struct iommu_group *group, unsigned int type)
  
  	lockdep_assert_held(&group->mutex);
  
-	return __iommu_domain_alloc(dev_iommu_ops(dev), type);

+   return __iommu_domain_alloc(dev_iommu_ops(dev), dev, type);
  }
  
  struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus)

  {
if (bus == NULL || bus->iommu_ops == NULL)
return NULL;
-   return __iommu_domain_alloc(bus->iommu_ops, IOMMU_DOMAIN_UNMANAGED);
+   return __iommu_domain_alloc(bus->iommu_ops, NULL,
+   IOMMU_DOMAIN_UNMANAGED);
  }
  EXPORT_SYMBOL_GPL(iommu_domain_alloc);
  
diff --git a/include/linux/iommu.h b/include/linux/iommu.h

index 387746f8273c99..18b0df42cc80d1 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -227,6 +227,8 @@ struct iommu_iotlb_gather {
   * struct iommu_ops - iommu ops and capabilities
   * @capable: check capability
   * @domain_alloc: allocate iommu domain
+ * @domain_alloc_paging: Allocate an iommu_domain that can be used for
+ *   UNMANAGED, DMA, and DMA_FQ domain types.
   * @probe_device: Add device to iommu driver handling
   * @release_device: Remove device from iommu driver handling
   * @probe_finalize: Do final setup work after the device is added to an IOMMU
@@ -258,6 +260,7 @@ struct iommu_ops {
  
  	/* Domain allocation and freeing by the iommu driver */

struct iommu_domain *(*domain_alloc)(unsigned iommu_domain_type);
+   struct iommu_domain *(*domain_alloc_paging)(struct device *dev);
  
  	struct iommu_device *(*probe_device)(struct device *dev);

void (*release_device)(struct device *dev);


Re: [PATCH v2 08/25] iommu: Allow an IDENTITY domain as the default_domain in ARM32

2023-06-01 Thread Robin Murphy

On 2023-05-16 01:00, Jason Gunthorpe wrote:

Even though dma-iommu.c and CONFIG_ARM_DMA_USE_IOMMU do approximately the
same stuff, the way they relate to the IOMMU core is quite different.

dma-iommu.c expects the core code to setup an UNMANAGED domain (of type
IOMMU_DOMAIN_DMA) and then configures itself to use that domain. This
becomes the default_domain for the group.

ARM_DMA_USE_IOMMU does not use the default_domain, instead it directly
allocates an UNMANAGED domain and operates it just like an external
driver. In this case group->default_domain is NULL.

If the driver provides a global static identity_domain then automatically
use it as the default_domain when in ARM_DMA_USE_IOMMU mode.

This allows drivers that implemented default_domain == NULL as an IDENTITY
translation to trivially get a properly labeled non-NULL default_domain on
ARM32 configs.

With this arrangement, when ARM_DMA_USE_IOMMU wants to disconnect from the
device the normal detach_domain flow will restore the IDENTITY domain as
the default domain. Overall this makes attach_dev() of the IDENTITY domain
called in the same places as detach_dev().

This effectively migrates these drivers to default_domain mode. For
drivers that support ARM64 they will gain support for the IDENTITY
translation mode for the dma_api and behave in a uniform way.

Drivers use this by setting ops->identity_domain to a static singleton
iommu_domain that implements the identity attach. If the core detects
ARM_DMA_USE_IOMMU mode then it automatically attaches the IDENTITY domain
during probe.

Drivers can continue to prevent the use of DMA translation by returning
IOMMU_DOMAIN_IDENTITY from def_domain_type, this will completely prevent
IOMMU_DMA from running but will not impact ARM_DMA_USE_IOMMU.

This allows removing the set_platform_dma_ops() from every remaining
driver.

Remove the set_platform_dma_ops from rockchip and mtk_v1 as all it does
is set an existing global static identity domain. mtk_v1 does not support
IOMMU_DOMAIN_DMA and it does not compile on ARM64 so this transformation
is safe.

Signed-off-by: Jason Gunthorpe 
---
  drivers/iommu/iommu.c  | 40 +-
  drivers/iommu/mtk_iommu_v1.c   | 12 --
  drivers/iommu/rockchip-iommu.c | 10 -
  3 files changed, 35 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 8ba90571449cec..bed7cb6e5ee65b 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1757,18 +1757,48 @@ static int iommu_get_default_domain_type(struct 
iommu_group *group,
int type;
  
  	lockdep_assert_held(&group->mutex);

+
+   /*
+* ARM32 drivers supporting CONFIG_ARM_DMA_USE_IOMMU can declare an
+* identity_domain and it will automatically become their default
+* domain. Later on ARM_DMA_USE_IOMMU will install its UNMANAGED domain.
+* Override the selection to IDENTITY if we are sure the driver supports
+* it.
+*/
+   if (IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU) && ops->identity_domain) {


If I cared about arm-smmu on 32-bit, I'd bring that up again, but 
honestly I'm not sure that I do... I think it might end up working after 
patch #21, and it's currently still broken for lack of .set_platform_dma 
anyway, so meh.



+   type = IOMMU_DOMAIN_IDENTITY;
+   if (best_type && type && best_type != type)
+   goto err;
+   best_type = target_type = IOMMU_DOMAIN_IDENTITY;
+   }
+
for_each_group_device(group, gdev) {
type = best_type;
if (ops->def_domain_type) {
type = ops->def_domain_type(gdev->dev);
-   if (best_type && type && best_type != type)
+   if (best_type && type && best_type != type) {
+   /* Stick with the last driver override we saw */
+   best_type = type;
goto err;
+   }
}
  
  		if (dev_is_pci(gdev->dev) && to_pci_dev(gdev->dev)->untrusted) {

-   type = IOMMU_DOMAIN_DMA;
-   if (best_type && type && best_type != type)
-   goto err;
+   /*
+* We don't have any way for the iommu core code to
+* force arm_iommu to activate so we can't enforce
+* trusted. Log it and keep going with the IDENTITY
+* default domain.
+*/
+   if (IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)) {
+   dev_warn(
+   gdev->dev,
+   "PCI device is untrusted but ARM32 does not 
support secure IOMMU operation, continuing anyway.\n");


To within experimental error, this is dead code. The ARM DMA ops don't 

Re: [PATCH 12/13] x86/jitalloc: prepare to allocate executable memory as ROX

2023-06-01 Thread Kent Overstreet
On Thu, Jun 01, 2023 at 06:13:44PM +, Edgecombe, Rick P wrote:
> > text_poke() _does_ create a separate RW mapping.
> 
> Sorry, I meant a separate RW allocation.

Ah yes, that makes sense


> 
> > 
> > The thing that sucks about text_poke() is that it always does a full
> > TLB
> > flush, and AFAICT that's not remotely needed. What it really wants to
> > be
> > doing is conceptually just
> > 
> > kmap_local()
> > memcpy()
> > kunmap_local()
> > flush_icache();
> > 
> > ...except that kmap_local() won't actually create a new mapping on
> > non-highmem architectures, so text_poke() open codes it.
> 
> Text poke creates only a local CPU RW mapping. It's more secure because
> other threads can't write to it.

*nod*, same as kmap_local

> It also only needs to flush the local core when it's done since it's
> not using a shared MM.
 
Ahh! Thanks for that; perhaps the comment in text_poke() about IPIs
could be a bit clearer.

What is it (if anything) you don't like about text_poke() then? It looks
like it's doing broadly similar things to kmap_local(), so should be
in the same ballpark from a performance POV?


Re: [PATCH v2 04/25] iommu: Add IOMMU_DOMAIN_PLATFORM for S390

2023-06-01 Thread Robin Murphy

On 2023-05-16 01:00, Jason Gunthorpe wrote:

The PLATFORM domain will be set as the default domain and attached as
normal during probe. The driver will ignore the initial attach from a NULL
domain to the PLATFORM domain.

After this, the PLATFORM domain's attach_dev will be called whenever we
detach from an UNMANAGED domain (eg for VFIO). This is the same time the
original design would have called op->detach_dev().

This is temporary until the S390 dma-iommu.c conversion is merged.


If we do need a stopgap here, can we please just call the current 
situation an identity domain? It's true enough in the sense that the 
IOMMU API is not offering any translation or guarantee of isolation, so 
the semantics of an identity domain - from the point of view of anything 
inside the IOMMU API that would be looking - are no weaker or less 
useful than a "platform" domain whose semantics are intentionally unknown.


Then similarly for patch #3 - since we already know s390 is temporary, 
it seems an anathema to introduce a whole domain type with its own weird 
ops->default_domain mechanism solely for POWER to not actually use 
domains with.


In terms of reasoning, I don't see that IOMMU_DOMAIN_PLATFORM is any 
more useful than a NULL default domain, it just renames the problem, and 
gives us more code to maintain for the privilege. As I say, though, we 
don't actually need to juggle the semantic of a "we don't know what's 
happening here" domain around any further, since it works out that a 
"we're not influencing anything here" domain actually suffices for what 
we want to reason about, and those are already well-defined. Sure, the 
platform DMA ops *might* be doing more, but that's beyond the scope of 
the IOMMU API either way. At that point, lo and behold, s390 and POWER 
now look just like ARM and the core code only needs a single special 
case for arch-specific default identity domains, lovely!


Thanks,
Robin.


Tested-by: Heiko Stuebner 
Tested-by: Niklas Schnelle 
Signed-off-by: Jason Gunthorpe 
---
  drivers/iommu/s390-iommu.c | 21 +++--
  1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index fbf59a8db29b11..f0c867c57a5b9b 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -142,14 +142,31 @@ static int s390_iommu_attach_device(struct iommu_domain 
*domain,
return 0;
  }
  
-static void s390_iommu_set_platform_dma(struct device *dev)

+/*
+ * Switch control over the IOMMU to S390's internal dma_api ops
+ */
+static int s390_iommu_platform_attach(struct iommu_domain *platform_domain,
+ struct device *dev)
  {
struct zpci_dev *zdev = to_zpci_dev(dev);
  
+	if (!zdev->s390_domain)

+   return 0;
+
__s390_iommu_detach_device(zdev);
zpci_dma_init_device(zdev);
+   return 0;
  }
  
+static struct iommu_domain_ops s390_iommu_platform_ops = {

+   .attach_dev = s390_iommu_platform_attach,
+};
+
+static struct iommu_domain s390_iommu_platform_domain = {
+   .type = IOMMU_DOMAIN_PLATFORM,
+   .ops = &s390_iommu_platform_ops,
+};
+
  static void s390_iommu_get_resv_regions(struct device *dev,
struct list_head *list)
  {
@@ -428,12 +445,12 @@ void zpci_destroy_iommu(struct zpci_dev *zdev)
  }
  
  static const struct iommu_ops s390_iommu_ops = {

+   .default_domain = &s390_iommu_platform_domain,
.capable = s390_iommu_capable,
.domain_alloc = s390_domain_alloc,
.probe_device = s390_iommu_probe_device,
.release_device = s390_iommu_release_device,
.device_group = generic_device_group,
-   .set_platform_dma_ops = s390_iommu_set_platform_dma,
.pgsize_bitmap = SZ_4K,
.get_resv_regions = s390_iommu_get_resv_regions,
.default_domain_ops = &(const struct iommu_domain_ops) {


Re: [PATCH 00/13] mm: jit/text allocator

2023-06-01 Thread Kent Overstreet
On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote:
> For a while I have wanted to give kprobes its own allocator so that it can 
> work
> even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in
> the modules area.
> 
> Given that, I think these should have their own allocator functions that can 
> be
> provided independently, even if those happen to use common infrastructure.

How much memory can kprobes conceivably use? I think we also want to try
to push back on a combinatorial growth of new allocators, if we can.

> > Several architectures override module_alloc() because of various
> > constraints where the executable memory can be located and this causes
> > additional obstacles for improvements of code allocation.
> > 
> > This set splits code allocation from modules by introducing
> > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> > sites of module_alloc() and module_memfree() with the new APIs and
> > implements core text and related allocation in a central place.
> > 
> > Instead of architecture specific overrides for module_alloc(), the
> > architectures that require non-default behaviour for text allocation must
> > fill jit_alloc_params structure and implement jit_alloc_arch_params() that
> > returns a pointer to that structure. If an architecture does not implement
> > jit_alloc_arch_params(), the defaults compatible with the current
> > modules::module_alloc() are used.
> 
> As above, I suspect that each of the callsites should probably be using common
> infrastructure, but I don't think that a single jit_alloc_arch_params() makes
> sense, since the parameters for each case may need to be distinct.

I don't see how that follows. The whole point of function parameters is
that they may be different :)

Can you give more detail on what parameters you need? If the only extra
parameter is just "does this allocation need to live close to kernel
text", that's not that big of a deal.


Re: [PATCH 12/13] x86/jitalloc: prepare to allocate executable memory as ROX

2023-06-01 Thread Edgecombe, Rick P
On Thu, 2023-06-01 at 14:00 -0400, Kent Overstreet wrote:
> On Thu, Jun 01, 2023 at 04:54:27PM +, Edgecombe, Rick P wrote:
> > It is just a local flush, but I wonder how much text_poke()ing is
> > too
> > much. A lot of them are even inside loops. Can't it do the batch
> > version
> > at least?
> > 
> > The other thing, and maybe this is in paranoia category, but it's
> > probably at least worth noting. Before the modules were not made
> > executable until all of the code was finalized. Now they are made
> > executable in an intermediate state and then patched later. It
> > might
> > weaken the CFI stuff, but also it just kind of seems a bit
> > unbounded
> > for dealing with executable code.
> 
> I believe bpf starts out by initializing new executable memory with
> illegal opcodes, maybe we should steal that and make it standard.

I was thinking of modules which have a ton of alternatives, errata
fixes, etc applied to them after the initial sections are written to
the to-be-executable mapping. I thought this had zeroed pages to start,
which seems ok.

> 
> > Preparing the modules in a separate RW mapping, and then
> > text_poke()ing
> > the whole thing in when you are done would resolve both of these.
> 
> text_poke() _does_ create a separate RW mapping.

Sorry, I meant a separate RW allocation.

> 
> The thing that sucks about text_poke() is that it always does a full
> TLB
> flush, and AFAICT that's not remotely needed. What it really wants to
> be
> doing is conceptually just
> 
> kmap_local()
> memcpy()
> kunmap_local()
> flush_icache();
> 
> ...except that kmap_local() won't actually create a new mapping on
> non-highmem architectures, so text_poke() open codes it.

Text poke creates only a local CPU RW mapping. It's more secure because
other threads can't write to it. It also only needs to flush the local
core when it's done since it's not using a shared MM. It used to use
the fixmap, which is similar to what you are describing I think.
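For reference, the rough shape of what text_poke() does internally on x86
(a simplified sketch of arch/x86/kernel/alternative.c, not the literal
code):

	/* Map the target page(s) RW at a fixed address inside a private
	 * poking_mm, copy, then tear the mapping down again. Only the
	 * local CPU ever activates poking_mm, so a local TLB flush is
	 * enough. */
	local_irq_save(flags);
	/* ... install temporary RW PTE(s) in poking_mm ... */
	prev = use_temporary_mm(poking_mm);
	memcpy(poking_addr, src, len);
	unuse_temporary_mm(prev);
	flush_tlb_mm_range(poking_mm, poking_addr, poking_addr + len,
			   PAGE_SHIFT, false);
	local_irq_restore(flags);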



Re: [PATCH 12/13] x86/jitalloc: prepare to allocate executable memory as ROX

2023-06-01 Thread Kent Overstreet
On Thu, Jun 01, 2023 at 04:54:27PM +, Edgecombe, Rick P wrote:
> It is just a local flush, but I wonder how much text_poke()ing is too
> much. A lot of the are even inside loops. Can't it do the batch version
> at least?
> 
> The other thing, and maybe this is in paranoia category, but it's
> probably at least worth noting. Before the modules were not made
> executable until all of the code was finalized. Now they are made
> executable in an intermediate state and then patched later. It might
> weaken the CFI stuff, but also it just kind of seems a bit unbounded
> for dealing with executable code.

I believe bpf starts out by initializing new executable memory with
illegal opcodes, maybe we should steal that and make it standard.
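For reference, roughly what the x86 BPF JIT does (from memory, so treat
this as a sketch): the JIT hands a fill callback to bpf_jit_binary_alloc(),
and unclaimed image bytes are filled with INT3 so stray execution traps
instead of running leftover garbage:

	static void jit_fill_hole(void *area, unsigned int size)
	{
		/* fill the whole space with INT3 instructions */
		memset(area, 0xcc, size);
	}

	/* ... bpf_jit_binary_alloc(proglen, &image, align, jit_fill_hole); */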

> Preparing the modules in a separate RW mapping, and then text_poke()ing
> the whole thing in when you are done would resolve both of these.

text_poke() _does_ create a separate RW mapping.

The thing that sucks about text_poke() is that it always does a full TLB
flush, and AFAICT that's not remotely needed. What it really wants to be
doing is conceptually just

kmap_local()
memcpy()
kunmap_local()
flush_icache();

...except that kmap_local() won't actually create a new mapping on
non-highmem architectures, so text_poke() open codes it.


Re: [PATCH 12/13] x86/jitalloc: prepare to allocate executable memory as ROX

2023-06-01 Thread Kent Overstreet
On Thu, Jun 01, 2023 at 12:30:50PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 01, 2023 at 01:12:56PM +0300, Mike Rapoport wrote:
> 
> > +static void __init_or_module do_text_poke(void *addr, const void *opcode, 
> > size_t len)
> > +{
> > +   if (system_state < SYSTEM_RUNNING) {
> > +   text_poke_early(addr, opcode, len);
> > +   } else {
> > +   mutex_lock(&text_mutex);
> > +   text_poke(addr, opcode, len);
> > +   mutex_unlock(&text_mutex);
> > +   }
> > +}
> 
> So I don't much like do_text_poke(); why?

Could you share why?

I think the implementation sucks but conceptually it's the right idea -
create a new temporary mapping to avoid the need for RWX mappings.


Re: [PATCH 12/13] x86/jitalloc: prepare to allocate executable memory as ROX

2023-06-01 Thread Edgecombe, Rick P
On Thu, 2023-06-01 at 13:12 +0300, Mike Rapoport wrote:
>  /*
>   * Are we looking at a near JMP with a 1 or 4-byte displacement.
> @@ -331,7 +344,7 @@ void __init_or_module noinline
> apply_alternatives(struct alt_instr *start,
>  
> DUMP_BYTES(insn_buff, insn_buff_sz, "%px: final_insn: ", instr);
>  
> -   text_poke_early(instr, insn_buff, insn_buff_sz);
> +   do_text_poke(instr, insn_buff, insn_buff_sz);
>  
>  next:
> optimize_nops(instr, a->instrlen);
> @@ -564,7 +577,7 @@ void __init_or_module noinline
> apply_retpolines(s32 *start, s32 *end)
> optimize_nops(bytes, len);
> DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
> DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
> -   text_poke_early(addr, bytes, len);
> +   do_text_poke(addr, bytes, len);
> }
> }
>  }
> @@ -638,7 +651,7 @@ void __init_or_module noinline apply_returns(s32
> *start, s32 *end)
> if (len == insn.length) {
> DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
> DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
> -   text_poke_early(addr, bytes, len);
> +   do_text_poke(addr, bytes, len);
> }
> }
>  }
> @@ -674,7 +687,7 @@ static void poison_endbr(void *addr, bool warn)
>  */
> DUMP_BYTES(((u8*)addr), 4, "%px: orig: ", addr);
> DUMP_BYTES(((u8*)&poison), 4, "%px: repl: ", addr);
> -   text_poke_early(addr, &poison, 4);
> +   do_text_poke(addr, &poison, 4);
>  }
>  
>  /*
> @@ -869,7 +882,7 @@ static int cfi_disable_callers(s32 *start, s32
> *end)
> if (!hash) /* nocfi callers */
> continue;
>  
> -   text_poke_early(addr, jmp, 2);
> +   do_text_poke(addr, jmp, 2);
> }
>  
> return 0;
> @@ -892,7 +905,7 @@ static int cfi_enable_callers(s32 *start, s32
> *end)
> if (!hash) /* nocfi callers */
> continue;
>  
> -   text_poke_early(addr, mov, 2);
> +   do_text_poke(addr, mov, 2);
> }
>  
> return 0;
> @@ -913,7 +926,7 @@ static int cfi_rand_preamble(s32 *start, s32
> *end)
> return -EINVAL;
>  
> hash = cfi_rehash(hash);
> -   text_poke_early(addr + 1, &hash, 4);
> +   do_text_poke(addr + 1, &hash, 4);
> }
>  
> return 0;
> @@ -932,9 +945,9 @@ static int cfi_rewrite_preamble(s32 *start, s32
> *end)
>  addr, addr, 5, addr))
> return -EINVAL;
>  
> -   text_poke_early(addr, fineibt_preamble_start, fineibt_preamble_size);
> +   do_text_poke(addr, fineibt_preamble_start, fineibt_preamble_size);
> WARN_ON(*(u32 *)(addr + fineibt_preamble_hash) != 0x12345678);
> -   text_poke_early(addr + fineibt_preamble_hash, &hash, 4);
> +   do_text_poke(addr + fineibt_preamble_hash, &hash, 4);
> }

It is just a local flush, but I wonder how much text_poke()ing is too
much. A lot of them are even inside loops. Can't it do the batch version
at least?

The other thing, and maybe this is in paranoia category, but it's
probably at least worth noting. Before the modules were not made
executable until all of the code was finalized. Now they are made
executable in an intermediate state and then patched later. It might
weaken the CFI stuff, but also it just kind of seems a bit unbounded
for dealing with executable code.

Preparing the modules in a separate RW mapping, and then text_poke()ing
the whole thing in when you are done would resolve both of these.


Re: [PATCH v2 05/25] iommu/tegra-gart: Remove tegra-gart

2023-06-01 Thread Thierry Reding
On Mon, May 15, 2023 at 09:00:38PM -0300, Jason Gunthorpe wrote:
> Thierry says this is not used anymore, and doesn't think it ever will
> be. The HW it supports is about 10 years old now and newer HW uses
> different IOMMU drivers.
> 
> As this is the only driver with a GART approach, and it doesn't really
> meet the driver expectations from the IOMMU core, let's just remove it
> so we don't have to think about how to make it fit in.
> 
> It has a number of identified problems:
>  - The assignment of iommu_groups doesn't match the HW behavior
> 
>  - It claims to have an UNMANAGED domain but it is really an IDENTITY
>domain with a translation aperture. This is inconsistent with the core
>expectation for security sensitive operations
> 
>  - It doesn't implement a SW page table under struct iommu_domain so
>* It can't accept a map until the domain is attached
>* It forgets about all maps after the domain is detached
>* It doesn't clear the HW of maps once the domain is detached
>  (made worse by having the wrong groups)
> 
> Cc: Thierry Reding 
> Cc: Dmitry Osipenko 
> Signed-off-by: Jason Gunthorpe 
> ---
>  arch/arm/configs/multi_v7_defconfig |   1 -
>  arch/arm/configs/tegra_defconfig|   1 -
>  drivers/iommu/Kconfig   |  11 -
>  drivers/iommu/Makefile  |   1 -
>  drivers/iommu/tegra-gart.c  | 371 
>  drivers/memory/tegra/mc.c   |  34 ---
>  drivers/memory/tegra/tegra20.c  |  28 ---
>  include/soc/tegra/mc.h  |  26 --
>  8 files changed, 473 deletions(-)
>  delete mode 100644 drivers/iommu/tegra-gart.c

To clarify, I think this hardware could very well be used again, but I
don't think it makes sense to use it in the context of the IOMMU
subsystem. If anybody wants to make use of this on Tegra20, it probably
makes more sense to move the GART programming into whatever driver ends
up using it instead of jumping through all these hoops just to make it
work like something it isn't.

Acked-by: Thierry Reding 




Re: [PATCH v8 0/7] Add pci_dev_for_each_resource() helper and update users

2023-06-01 Thread Andy Shevchenko
On Thu, Jun 01, 2023 at 07:25:46PM +0300, Andy Shevchenko wrote:
> On Wed, May 31, 2023 at 08:48:35PM +0200, Jonas Gorski wrote:
> > On Tue, 30 May 2023 at 23:34, Bjorn Helgaas  wrote:
> > > On Fri, May 12, 2023 at 02:48:51PM -0500, Bjorn Helgaas wrote:

...

> > > Where are we at?  Are we going to ignore this because some Coverity
> > > reports are false positives?
> > 
> > Looking at the code I understand where coverity is coming from:
> > 
> > #define __pci_dev_for_each_res0(dev, res, ...) \
> >for (unsigned int __b = 0;  \
> > res = pci_resource_n(dev, __b), __b < PCI_NUM_RESOURCES;   \
> > __b++)
> > 
> >  res will be assigned before __b is checked for being less than
> > PCI_NUM_RESOURCES, making it point one past the end of the array at
> > the end of the last loop iteration.
> 
> Which is fine and you stumbled over the same mistake I made, that's why the
> documentation has been added to describe why the heck this macro is written
> the way it's written.
> 
> Coverity sucks.
> 
> > Rewriting the test expression as
> > 
> > __b < PCI_NUM_RESOURCES && (res = pci_resource_n(dev, __b));
> > 
> > should avoid the (coverity) warning by making use of lazy evaluation.
> 
> Obviously NAK.
> 
> > It probably makes the code slightly less performant as res will now be
> > checked for being not NULL (which will always be true), but I doubt it
> > will be significant (or in any hot paths).

Oh my god, I mistakenly read this as the bus macro, sorry for my rant,
it's simply wrong.

-- 
With Best Regards,
Andy Shevchenko




Re: [PATCH v8 0/7] Add pci_dev_for_each_resource() helper and update users

2023-06-01 Thread Andy Shevchenko
On Wed, May 31, 2023 at 08:48:35PM +0200, Jonas Gorski wrote:
> On Tue, 30 May 2023 at 23:34, Bjorn Helgaas  wrote:
> > On Fri, May 12, 2023 at 02:48:51PM -0500, Bjorn Helgaas wrote:
> > > On Fri, May 12, 2023 at 01:56:29PM +0300, Andy Shevchenko wrote:
> > > > On Tue, May 09, 2023 at 01:21:22PM -0500, Bjorn Helgaas wrote:
> > > > > On Tue, Apr 04, 2023 at 11:11:01AM -0500, Bjorn Helgaas wrote:
> > > > > > On Thu, Mar 30, 2023 at 07:24:27PM +0300, Andy Shevchenko wrote:
> > > > > > > Provide two new helper macros to iterate over PCI device 
> > > > > > > resources and
> > > > > > > convert users.
> > > > >
> > > > > > Applied 2-7 to pci/resource for v6.4, thanks, I really like this!
> > > > >
> > > > > This is 09cc90063240 ("PCI: Introduce pci_dev_for_each_resource()")
> > > > > upstream now.
> > > > >
> > > > > Coverity complains about each use,
> > > >
> > > > It needs more clarification here. Use of the reduced variant of the
> > > > macro, or all of them? If the former, then I can speculate that
> > > > Coverity (famous for false positives) simply doesn't understand `for
> > > > (type var; var ...)` code.
> > >
> > > True, Coverity finds false positives.  It flagged every use in
> > > drivers/pci and drivers/pnp.  It didn't mention the arch/alpha, arm,
> > > mips, powerpc, sh, or sparc uses, but I think it just didn't look at
> > > those.
> > >
> > > It flagged both:
> > >
> > >   pbus_size_iopci_dev_for_each_resource(dev, r)
> > >   pbus_size_mem   pci_dev_for_each_resource(dev, r, i)
> > >
> > > Here's a spreadsheet with a few more details (unfortunately I don't
> > > know how to make it dump the actual line numbers or analysis like I
> > > pasted below, so "pci_dev_for_each_resource" doesn't appear).  These
> > > are mostly in the "Drivers-PCI" component.
> > >
> > > https://docs.google.com/spreadsheets/d/1ohOJwxqXXoDUA0gwopgk-z-6ArLvhN7AZn4mIlDkHhQ/edit?usp=sharing
> > >
> > > These particular reports are in the "High Impact Outstanding" tab.
> >
> > Where are we at?  Are we going to ignore this because some Coverity
> > reports are false positives?
> 
> Looking at the code I understand where coverity is coming from:
> 
> #define __pci_dev_for_each_res0(dev, res, ...) \
>for (unsigned int __b = 0;  \
> res = pci_resource_n(dev, __b), __b < PCI_NUM_RESOURCES;   \
> __b++)
> 
>  res will be assigned before __b is checked for being less than
> PCI_NUM_RESOURCES, making it point one past the end of the array at
> the end of the last loop iteration.

Which is fine and you stumbled over the same mistake I made, that's why the
documentation has been added to describe why the heck this macro is written
the way it's written.

Coverity sucks.

> Rewriting the test expression as
> 
> __b < PCI_NUM_RESOURCES && (res = pci_resource_n(dev, __b));
> 
> should avoid the (coverity) warning by making use of lazy evaluation.

Obviously NAK.

> It probably makes the code slightly less performant as res will now be
> checked for being not NULL (which will always be true), but I doubt it
> will be significant (or in any hot paths).

-- 
With Best Regards,
Andy Shevchenko




Re: [PATCH 8/9] powerpc: Add HOTPLUG_SMT support

2023-06-01 Thread Laurent Dufour
On 01/06/2023 15:27:30, Laurent Dufour wrote:
> On 24/05/2023 17:56:29, Michael Ellerman wrote:
>> Add support for HOTPLUG_SMT, which enables the generic sysfs SMT support
>> files in /sys/devices/system/cpu/smt, as well as the "nosmt" boot
>> parameter.
> 
> Hi Michael,
> 
> It seems that there is now a conflict with the PPC 'smt-enabled'
> boot option.
> 
> Booting the patched kernel with 'smt-enabled=4', later changes to the SMT
> level (for instance to 6) done through /sys/devices/system/cpu/smt/control
> are not applied. Nothing happens.
> Based on my early debugging, I think the reason is that cpu_smt_num_threads=8
> when entering __store_smt_control(). But I need to dig further.

I dug deeper.

FWIW, I think smt_enabled_at_boot should be passed to
cpu_smt_check_topology() in smp_prepare_cpus(), instead of
threads_per_core. But that's not enough to fix the issue because this value
is also used to set cpu_smt_max_threads.

To achieve that, cpu_smt_check_topology() should receive 2 parameters: the
current SMT level defined at boot time, and the maximum SMT level.
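Roughly this shape (the prototype below is only a sketch; the exact form
is in the attached patch):

	/* Pass both the boot-requested level and the hardware maximum. */
	void cpu_smt_check_topology(unsigned int num_threads,
				    unsigned int max_threads);

	/* powerpc, in smp_prepare_cpus(): */
	cpu_smt_check_topology(smt_enabled_at_boot, threads_per_core);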

The attached patch fixes the issue on my ppc64 test LPAR.
This patch does not address the x86 architecture (I didn't get time to
do it, but it should be doable) and should be spread among patches 3
and 8 of your series.

Hope this helps.

Cheers,
Laurent.

> 
> BTW, should the 'smt-enabled' PPC specific option remain?
> 
> Cheers,
> Laurent.
> 
>> Implement the recently added hooks to allow partial SMT states, allow
>> any number of threads per core.
>>
>> Tie the config symbol to HOTPLUG_CPU, which enables it on the major
>> platforms that support SMT. If there are other platforms that want the
>> SMT support that can be tweaked in future.
>>
>> Signed-off-by: Michael Ellerman 
>> ---
>>  arch/powerpc/Kconfig|  1 +
>>  arch/powerpc/include/asm/topology.h | 25 +
>>  arch/powerpc/kernel/smp.c   |  3 +++
>>  3 files changed, 29 insertions(+)
>>
>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> index 539d1f03ff42..5cf87ca10a9c 100644
>> --- a/arch/powerpc/Kconfig
>> +++ b/arch/powerpc/Kconfig
>> @@ -273,6 +273,7 @@ config PPC
>>  select HAVE_SYSCALL_TRACEPOINTS
>>  select HAVE_VIRT_CPU_ACCOUNTING
>>  select HAVE_VIRT_CPU_ACCOUNTING_GEN
>> +select HOTPLUG_SMT  if HOTPLUG_CPU
>>  select HUGETLB_PAGE_SIZE_VARIABLE   if PPC_BOOK3S_64 && HUGETLB_PAGE
>>  select IOMMU_HELPER if PPC64
>>  select IRQ_DOMAIN
>> diff --git a/arch/powerpc/include/asm/topology.h 
>> b/arch/powerpc/include/asm/topology.h
>> index 8a4d4f4d9749..1e9117a22d14 100644
>> --- a/arch/powerpc/include/asm/topology.h
>> +++ b/arch/powerpc/include/asm/topology.h
>> @@ -143,5 +143,30 @@ static inline int cpu_to_coregroup_id(int cpu)
>>  #endif
>>  #endif
>>  
>> +#ifdef CONFIG_HOTPLUG_SMT
>> +#include 
>> +#include 
>> +
>> +static inline bool topology_smt_supported(void)
>> +{
>> +return threads_per_core > 1;
>> +}
>> +
>> +static inline bool topology_smt_threads_supported(unsigned int num_threads)
>> +{
>> +return num_threads <= threads_per_core;
>> +}
>> +
>> +static inline bool topology_is_primary_thread(unsigned int cpu)
>> +{
>> +return cpu == cpu_first_thread_sibling(cpu);
>> +}
>> +
>> +static inline bool topology_smt_thread_allowed(unsigned int cpu)
>> +{
>> +return cpu_thread_in_core(cpu) < cpu_smt_num_threads;
>> +}
>> +#endif
>> +
>>  #endif /* __KERNEL__ */
>>  #endif  /* _ASM_POWERPC_TOPOLOGY_H */
>> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
>> index 265801a3e94c..eed20b9253b7 100644
>> --- a/arch/powerpc/kernel/smp.c
>> +++ b/arch/powerpc/kernel/smp.c
>> @@ -1154,6 +1154,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>>  
>>  if (smp_ops && smp_ops->probe)
>>  smp_ops->probe();
>> +
>> +// Initialise the generic SMT topology support
>> +cpu_smt_check_topology(threads_per_core);
>>  }
>>  
>>  void smp_prepare_boot_cpu(void)
> 
From 682e7d78fb98d6298926e88e5093e2172488ea6f Mon Sep 17 00:00:00 2001
From: Laurent Dufour 
Date: Thu, 1 Jun 2023 18:02:55 +0200
Subject: [PATCH] Consider the SMT level specified at boot time

This allows the PPC kernel to boot with an SMT level different from 1 or the
threads-per-core value.

Signed-off-by: Laurent Dufour 
---
 arch/powerpc/kernel/smp.c | 2 +-
 include/linux/cpu_smt.h   | 6 --
 kernel/cpu.c  | 9 +++--
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index eed20b9253b7..ec8ccf3d6294 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1156,7 +1156,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
smp_ops->probe();
 
// Initialise the generic SMT topology support
-   cpu_smt_check_topology(threads_per_core);
+   cpu_smt_check_topology(smt_enabled_at_boot, 

Re: [PATCH 00/13] mm: jit/text allocator

2023-06-01 Thread Mark Rutland
Hi Mike,

On Thu, Jun 01, 2023 at 01:12:44PM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> Hi,
> 
> module_alloc() is used everywhere as a means to allocate memory for code.
> 
> Besides being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF, to modules
> and puts the burden of code allocation on the modules code.

I agree this is a problem, and one key issue here is that these can have
different requirements. For example, on arm64 we need modules to be placed
within a 128M or 2G window containing the kernel, whereas it would be safe for
the kprobes XOL area to be placed arbitrarily far from the kernel image (since
we don't allow PC-relative insns to be stepped out-of-line). Likewise arm64
doesn't have ftrace trampolines, and DIRECT_CALL trampolines can safely be
placed arbitrarily far from the kernel image.

For a while I have wanted to give kprobes its own allocator so that it can work
even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in
the modules area.

Given that, I think these should have their own allocator functions that can be
provided independently, even if those happen to use common infrastructure.
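For example, the per-user entry points could be thin wrappers over shared
infrastructure; a sketch (all of these names are hypothetical, none of them
exist today):

	/* Hypothetical per-subsystem allocators with distinct constraints. */
	void *module_text_alloc(size_t len);	/* within the branch-range window */
	void *kprobes_insn_alloc(size_t len);	/* anywhere; no PC-relative insns stepped */
	void *ftrace_tramp_alloc(size_t len);	/* arch-specific placement rules */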

> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
> 
> This set splits code allocation from modules by introducing
> jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> sites of module_alloc() and module_memfree() with the new APIs and
> implements core text and related allocation in a central place.
> 
> Instead of architecture specific overrides for module_alloc(), the
> architectures that require non-default behaviour for text allocation must
> fill jit_alloc_params structure and implement jit_alloc_arch_params() that
> returns a pointer to that structure. If an architecture does not implement
> jit_alloc_arch_params(), the defaults compatible with the current
> modules::module_alloc() are used.

As above, I suspect that each of the callsites should probably be using common
infrastructure, but I don't think that a single jit_alloc_arch_params() makes
sense, since the parameters for each case may need to be distinct.

> The new jitalloc infrastructure allows decoupling of kprobes and ftrace
> from modules, and most importantly it enables ROX allocations for
> executable memory.
> 
> A centralized infrastructure for code allocation allows future
> optimizations for allocations of executable memory, caching large pages for
> better iTLB performance and providing sub-page allocations for users that
> only need small jit code snippets.

This sounds interesting, but I think this can be achieved without requiring a
single jit_alloc_arch_params() shared by all users?

Thanks,
Mark.

> 
> patches 1-5: split out the code allocation from modules and arch
> patch 6: add dedicated API for data allocations with constraints similar to
> code allocations
> patches 7-9: decouple dynamic ftrace and kprobes from CONFIG_MODULES
> patches 10-13: enable ROX allocations for executable memory on x86
> 
> Mike Rapoport (IBM) (11):
>   nios2: define virtual address space for modules
>   mm: introduce jit_text_alloc() and use it instead of module_alloc()
>   mm/jitalloc, arch: convert simple overrides of module_alloc to jitalloc
>   mm/jitalloc, arch: convert remaining overrides of module_alloc to jitalloc
>   module, jitalloc: drop module_alloc
>   mm/jitalloc: introduce jit_data_alloc()
>   x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
>   arch: make jitalloc setup available regardless of CONFIG_MODULES
>   kprobes: remove dependcy on CONFIG_MODULES
>   modules, jitalloc: prepare to allocate executable memory as ROX
>   x86/jitalloc: make memory allocated for code ROX
> 
> Song Liu (2):
>   ftrace: Add swap_func to ftrace_process_locs()
>   x86/jitalloc: prepare to allocate exectuatble memory as ROX
> 
>  arch/Kconfig |   5 +-
>  arch/arm/kernel/module.c |  32 --
>  arch/arm/mm/init.c   |  35 ++
>  arch/arm64/kernel/module.c   |  47 
>  arch/arm64/mm/init.c |  42 +++
>  arch/loongarch/kernel/module.c   |   6 -
>  arch/loongarch/mm/init.c |  16 +++
>  arch/mips/kernel/module.c|   9 --
>  arch/mips/mm/init.c  |  19 
>  arch/nios2/include/asm/pgtable.h |   5 +-
>  arch/nios2/kernel/module.c   |  24 ++--
>  arch/parisc/kernel/module.c  |  11 --
>  arch/parisc/mm/init.c|  21 +++-
>  arch/powerpc/kernel/kprobes.c|   4 +-
>  arch/powerpc/kernel/module.c |  37 ---
>  arch/powerpc/mm/mem.c|  41 +++
>  arch/riscv/kernel/module.c   |  10 --
>  arch/riscv/mm/init.c |  18 +++
>  arch/s390/kernel/ftrace.c|   4 +-
>  arch/s390/kernel/kprobes.c   |   4 +-
>  

Re: [PATCH 1/4] block:sed-opal: SED Opal keystore

2023-06-01 Thread Greg Joyce
On Thu, 2023-05-11 at 01:50 +0300, Jarkko Sakkinen wrote:
> On Fri May 5, 2023 at 10:43 PM EEST,  wrote:
> > From: Greg Joyce 
> > 
> > Add read and write functions that allow SED Opal keys to be stored
> > in a permanent keystore.
> 
> Please be more verbose starting from "Self-Encrypting Drive (SED)",
> instead of just "SED", and take time to explain what these keys are.

A further elaboration of SED and the keys will be in the next patchset.

> 
> > Signed-off-by: Greg Joyce 
> > Reviewed-by: Jonathan Derrick 
> > ---
> >  block/Makefile   |  2 +-
> >  block/sed-opal-key.c | 24 
> >  include/linux/sed-opal-key.h | 15 +++
> >  3 files changed, 40 insertions(+), 1 deletion(-)
> >  create mode 100644 block/sed-opal-key.c
> >  create mode 100644 include/linux/sed-opal-key.h
> > 
> > diff --git a/block/Makefile b/block/Makefile
> > index 4e01bb71ad6e..464a9f209552 100644
> > --- a/block/Makefile
> > +++ b/block/Makefile
> > @@ -35,7 +35,7 @@ obj-$(CONFIG_BLK_DEV_ZONED)   += blk-zoned.o
> >  obj-$(CONFIG_BLK_WBT)  += blk-wbt.o
> >  obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o
> >  obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o
> > -obj-$(CONFIG_BLK_SED_OPAL) += sed-opal.o
> > +obj-$(CONFIG_BLK_SED_OPAL) += sed-opal.o sed-opal-key.o
> >  obj-$(CONFIG_BLK_PM)   += blk-pm.o
> >  obj-$(CONFIG_BLK_INLINE_ENCRYPTION)+= blk-crypto.o blk-
> > crypto-profile.o \
> >blk-crypto-sysfs.o
> > diff --git a/block/sed-opal-key.c b/block/sed-opal-key.c
> > new file mode 100644
> > index ..16f380164c44
> > --- /dev/null
> > +++ b/block/sed-opal-key.c
> > @@ -0,0 +1,24 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * SED key operations.
> > + *
> > + * Copyright (C) 2022 IBM Corporation
> > + *
> > + * These are the accessor functions (read/write) for SED Opal
> > + * keys. Specific keystores can provide overrides.
> > + *
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +
> > +int __weak sed_read_key(char *keyname, char *key, u_int *keylen)
> > +{
> > +   return -EOPNOTSUPP;
> > +}
> > +
> > +int __weak sed_write_key(char *keyname, char *key, u_int keylen)
> > +{
> > +   return -EOPNOTSUPP;
> > +}
> > diff --git a/include/linux/sed-opal-key.h b/include/linux/sed-opal-
> > key.h
> > new file mode 100644
> > index ..c9b1447986d8
> > --- /dev/null
> > +++ b/include/linux/sed-opal-key.h
> > @@ -0,0 +1,15 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * SED key operations.
> > + *
> > + * Copyright (C) 2022 IBM Corporation
> > + *
> > + * These are the accessor functions (read/write) for SED Opal
> > + * keys. Specific keystores can provide overrides.
> > + *
> > + */
> > +
> > +#include 
> > +
> > +int sed_read_key(char *keyname, char *key, u_int *keylen);
> > +int sed_write_key(char *keyname, char *key, u_int keylen);
> > -- 
> > gjo...@linux.vnet.ibm.com
> 
> BR, Jarkko
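For context, a keystore takes precedence over these __weak stubs simply by
providing strong definitions of the same symbols; a minimal sketch (the body
here is hypothetical, see the pseries PLPKS implementation later in this
series for a real one):

	int sed_read_key(char *keyname, char *key, u_int *keylen)
	{
		/* Hypothetical helper: fetch the named key from the
		 * platform keystore and copy it into 'key'. */
		return my_keystore_read(keyname, key, keylen);
	}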



Re: [PATCH 4/4] powerpc/pseries: update SED for PLPKS api changes

2023-06-01 Thread Greg Joyce
On Mon, 2023-05-15 at 15:52 +1000, Andrew Donnellan wrote:
> On Fri, 2023-05-05 at 14:44 -0500, gjo...@linux.vnet.ibm.com wrote:
> > From: Greg Joyce 
> > 
> > Changes to the PLPKS API require minor updates to the SED Opal
> > PLPKS keystore code.
> > 
> > Signed-off-by: Greg Joyce 
> 
> [+ Nayna]
> 
> This patch will need to be squashed with patch 2.

Thanks. I've squashed the patches and will resend shortly.

> 
> > ---
> >  arch/powerpc/platforms/pseries/Kconfig|  6 +
> >  arch/powerpc/platforms/pseries/Makefile   |  2 +-
> >  .../powerpc/platforms/pseries/plpks_sed_ops.c | 22 +
> > 
> > --
> >  block/Kconfig |  1 +
> >  4 files changed, 13 insertions(+), 18 deletions(-)
> > 
> > diff --git a/arch/powerpc/platforms/pseries/Kconfig
> > b/arch/powerpc/platforms/pseries/Kconfig
> > index 21b22bf16ce6..c2f8a29e7b9b 100644
> > --- a/arch/powerpc/platforms/pseries/Kconfig
> > +++ b/arch/powerpc/platforms/pseries/Kconfig
> > @@ -163,6 +163,12 @@ config PSERIES_PLPKS
> > # This option is selected by in-kernel consumers that
> > require
> > # access to the PKS.
> >  
> > +config PSERIES_PLPKS_SED
> > +   depends on PPC_PSERIES
> > +   bool
> > +   # This option is selected by in-kernel consumers that
> > require
> > +   # access to the SED PKS keystore.
> > +
> >  config PAPR_SCM
> > depends on PPC_PSERIES && MEMORY_HOTPLUG && LIBNVDIMM
> > tristate "Support for the PAPR Storage Class Memory
> > interface"
> > diff --git a/arch/powerpc/platforms/pseries/Makefile
> > b/arch/powerpc/platforms/pseries/Makefile
> > index 4242aed0d5d3..1476c5e4433c 100644
> > --- a/arch/powerpc/platforms/pseries/Makefile
> > +++ b/arch/powerpc/platforms/pseries/Makefile
> > @@ -29,7 +29,7 @@ obj-$(CONFIG_PPC_SVM) += svm.o
> >  obj-$(CONFIG_FA_DUMP)  += rtas-fadump.o
> >  obj-$(CONFIG_PSERIES_PLPKS)+= plpks.o
> >  obj-$(CONFIG_PPC_SECURE_BOOT)  += plpks-secvar.o
> > -obj-$(CONFIG_PSERIES_PLPKS_SED)+= plpks-sed.o
> > +obj-$(CONFIG_PSERIES_PLPKS_SED)+= plpks_sed_ops.o
> 
> I think you could just use obj-$(CONFIG_BLK_SED_OPAL) and then there
> wouldn't be a need to introduce a new option? Unless there's going to
> be a second consumer.

I was following the model of CONFIG_PPC_SECURE_BOOT. That gives a
little finer control and flexibility for using SED and PLPKS. This also
confines use of CONFIG_BLK_SED_OPAL to the base SED OPAL code.

> 
> >  obj-$(CONFIG_SUSPEND)  += suspend.o
> >  obj-$(CONFIG_PPC_VAS)  += vas.o vas-sysfs.o
> >  
> > diff --git a/arch/powerpc/platforms/pseries/plpks_sed_ops.c
> > b/arch/powerpc/platforms/pseries/plpks_sed_ops.c
> > index 086934b319a9..c1d08075e850 100644
> > --- a/arch/powerpc/platforms/pseries/plpks_sed_ops.c
> > +++ b/arch/powerpc/platforms/pseries/plpks_sed_ops.c
> > @@ -14,7 +14,7 @@
> >  #include 
> >  #include 
> >  #include 
> > -#include "plpks.h"
> > +#include 
> >  
> >  /*
> >   * structure that contains all SED data
> > @@ -28,9 +28,6 @@ struct plpks_sed_object_data {
> > u_char key[32];
> >  };
> >  
> > -#define PLPKS_PLATVAR_POLICYWORLDREADABLE
> > -#define PLPKS_PLATVAR_OS_COMMON 4
> > -
> >  #define PLPKS_SED_OBJECT_DATA_V00
> >  #define PLPKS_SED_MANGLED_LABEL "/default/pri"
> >  #define PLPKS_SED_COMPONENT "sed-opal"
> > @@ -50,8 +47,8 @@ void plpks_init_var(struct plpks_var *var, char
> > *keyname)
> > var->name = PLPKS_SED_MANGLED_LABEL;
> > var->namelen = strlen(keyname);
> > }
> > -   var->policy = PLPKS_PLATVAR_POLICY;
> > -   var->os = PLPKS_PLATVAR_OS_COMMON;
> > +   var->policy = PLPKS_WORLDREADABLE;
> > +   var->os = PLPKS_VAR_COMMON;
> > var->data = NULL;
> > var->datalen = 0;
> > var->component = PLPKS_SED_COMPONENT;
> > @@ -64,28 +61,19 @@ int sed_read_key(char *keyname, char *key,
> > u_int
> > *keylen)
> >  {
> > struct plpks_var var;
> > struct plpks_sed_object_data data;
> > -   u_int offset;
> > int ret;
> > u_int len;
> >  
> > plpks_init_var(&var, keyname);
> > -   var.data = &data;
> > +   var.data = (u8 *)&data;
> > var.datalen = sizeof(data);
> >  
> > ret = plpks_read_os_var(&var);
> > if (ret != 0)
> > return ret;
> >  
> > -   offset = offsetof(struct plpks_sed_object_data, key);
> > -   if (offset > var.datalen) {
> > -   return -EINVAL;
> > -   }
> > -
> > -   len = min(be32_to_cpu(data.key_len), *keylen);
> > -
> > +   len = min_t(u16, be32_to_cpu(data.key_len), var.datalen);
> > memcpy(key, data.key, len);
> > -   kfree(var.data);
> > -
> > key[len] = '\0';
> > *keylen = len;
> >  
> > diff --git a/block/Kconfig b/block/Kconfig
> > index 76b23114fdeb..75d4db34df5a 100644
> > --- a/block/Kconfig
> > +++ b/block/Kconfig
> > @@ 

Re: [PATCH v3 25/34] m68k: Convert various functions to use ptdescs

2023-06-01 Thread kernel test robot
Hi Vishal,

kernel test robot noticed the following build errors:

[auto build test ERROR on next-20230531]
[cannot apply to akpm-mm/mm-everything s390/features powerpc/next powerpc/fixes 
geert-m68k/for-next geert-m68k/for-linus linus/master v6.4-rc4 v6.4-rc3 
v6.4-rc2 v6.4-rc4]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:
https://github.com/intel-lab-lkp/linux/commits/Vishal-Moola-Oracle/mm-Add-PAGE_TYPE_OP-folio-functions/20230601-053454
base:   next-20230531
patch link:
https://lore.kernel.org/r/20230531213032.25338-26-vishal.moola%40gmail.com
patch subject: [PATCH v3 25/34] m68k: Convert various functions to use ptdescs
config: m68k-randconfig-r002-20230531 
(https://download.01.org/0day-ci/archive/20230601/202306011704.i8xmwkpl-...@intel.com/config)
compiler: m68k-linux-gcc (GCC) 12.3.0
reproduce (this is a W=1 build):
mkdir -p ~/bin
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# 
https://github.com/intel-lab-lkp/linux/commit/915ab62dc3315fe0a0544fccb4ee5f3ee32694b5
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review 
Vishal-Moola-Oracle/mm-Add-PAGE_TYPE_OP-folio-functions/20230601-053454
git checkout 915ab62dc3315fe0a0544fccb4ee5f3ee32694b5
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.3.0 ~/bin/make.cross 
W=1 O=build_dir ARCH=m68k olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.3.0 ~/bin/make.cross 
W=1 O=build_dir ARCH=m68k SHELL=/bin/bash arch/m68k/mm/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202306011704.i8xmwkpl-...@intel.com/

All error/warnings (new ones prefixed by >>):

   In file included from arch/m68k/include/asm/pgalloc.h:12,
from arch/m68k/mm/init.c:26:
   arch/m68k/include/asm/mcf_pgalloc.h: In function 'pgd_alloc':
>> arch/m68k/include/asm/mcf_pgalloc.h:83:59: error: 'GFP_NOWARN' undeclared 
>> (first use in this function); did you mean 'GFP_NOWAIT'?
  83 | struct ptdesc *ptdesc = pagetable_alloc(GFP_DMA | 
GFP_NOWARN, 0);
 |   ^~
 |   GFP_NOWAIT
   arch/m68k/include/asm/mcf_pgalloc.h:83:59: note: each undeclared identifier 
is reported only once for each function it appears in
   arch/m68k/include/asm/mcf_pgalloc.h: At top level:
>> arch/m68k/include/asm/mcf_pgalloc.h:22:27: warning: 'ptdesc_address' is 
>> static but used in inline function 'pte_alloc_one_kernel' which is not static
  22 | return (pte_t *) (ptdesc_address(ptdesc));
 |   ^~
>> arch/m68k/include/asm/mcf_pgalloc.h:17:33: warning: 'pagetable_alloc' is 
>> static but used in inline function 'pte_alloc_one_kernel' which is not static
  17 | struct ptdesc *ptdesc = pagetable_alloc(GFP_DMA | 
__GFP_ZERO, 0);
 | ^~~
>> arch/m68k/include/asm/mcf_pgalloc.h:10:24: warning: 'virt_to_ptdesc' is 
>> static but used in inline function 'pte_free_kernel' which is not static
  10 | pagetable_free(virt_to_ptdesc(pte));
 |^~
>> arch/m68k/include/asm/mcf_pgalloc.h:10:9: warning: 'pagetable_free' is 
>> static but used in inline function 'pte_free_kernel' which is not static
  10 | pagetable_free(virt_to_ptdesc(pte));
 | ^~
--
   In file included from arch/m68k/mm/mcfmmu.c:21:
   arch/m68k/include/asm/mcf_pgalloc.h: In function 'pgd_alloc':
>> arch/m68k/include/asm/mcf_pgalloc.h:83:59: error: 'GFP_NOWARN' undeclared 
>> (first use in this function); did you mean 'GFP_NOWAIT'?
  83 | struct ptdesc *ptdesc = pagetable_alloc(GFP_DMA | 
GFP_NOWARN, 0);
 |   ^~
 |   GFP_NOWAIT
   arch/m68k/include/asm/mcf_pgalloc.h:83:59: note: each undeclared identifier 
is reported only once for each function it appears in
   arch/m68k/mm/mcfmmu.c: At top level:
   arch/m68k/mm/mcfmmu.c:36:13: warning: no previous prototype for 
'paging_init' [-Wmissing-prototypes]
  36 | void __init paging_init(void)
 | ^~~
   arch/m68k/mm/mcfmmu.c: In function 'paging_init':
   arch/m68k/mm/mcfmmu.c:41:37: warning: variable 'bootmem_end' set b

Re: [PATCH v4 22/23] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler

2023-06-01 Thread Jonathan Cameron
On Tue, 23 May 2023 18:22:13 -0500
Terry Bowman  wrote:

> From: Robert Richter 
> 
> In Restricted CXL Device (RCD) mode a CXL device is exposed as an
> RCiEP, but CXL downstream and upstream ports are not enumerated and
> not visible in the PCIe hierarchy. Protocol and link errors are sent
> to an RCEC.
> 
> Restricted CXL host (RCH) downstream port-detected errors are signaled
> as internal AER errors, either Uncorrectable Internal Error (UIE) or
> Corrected Internal Errors (CIE). The error source is the id of the
> RCEC. A CXL handler must then inspect the error status in various CXL
> registers residing in the dport's component register space (CXL RAS
> capability) or the dport's RCRB (PCIe AER extended capability). [1]
> 
> Errors showing up in the RCEC's error handler must be handled and
> connected to the CXL subsystem. Implement this by forwarding the error
> to all CXL devices below the RCEC. Since the entire CXL device is
> controlled only using PCIe Configuration Space of device 0, function
> 0, only pass it there [2]. The error handling is limited to currently
> supported devices with the Memory Device class code set
> (PCI_CLASS_MEMORY_CXL, 502h), where the handler can be implemented in
> the existing cxl_pci driver. Support of CXL devices (e.g. a CXL.cache
> device) can be enabled later.
> 
> In addition to errors directed to the CXL endpoint device, a handler
> must also inspect the CXL RAS and PCIe AER capabilities of the CXL
> downstream port that is connected to the device.
> 
> Since CXL downstream port errors are signaled using internal errors,
> the handler requires those errors to be unmasked. This is the subject of a
> follow-on patch.
> 
> The reason for choosing this implementation is that a CXL RCEC device
> is bound to the AER port driver, but the driver does not allow it to
> register a custom specific handler to support CXL. Connecting the RCEC
> hard-wired with a CXL handler does not work, as the CXL subsystem
> might not be present all the time. The alternative of adding an
> implementation to the portdrv to allow the registration of a custom
> RCEC error handler isn't worth doing, as CXL would be its only user.
> Instead, just check for an CXL RCEC and pass it down to the connected
> CXL device's error handler. With this approach the code can entirely
> be implemented in the PCIe AER driver and is independent of the CXL
> subsystem. The CXL driver only provides the handler.
> 
> [1] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors
> [2] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices
> 
> Co-developed-by: Terry Bowman 
> Signed-off-by: Terry Bowman 
> Signed-off-by: Robert Richter 
> Cc: "Oliver O'Halloran" 
> Cc: Bjorn Helgaas 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-...@vger.kernel.org
> ---
Reviewed-by: Jonathan Cameron 



Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock

2023-06-01 Thread Jason Gunthorpe
On Thu, Jun 01, 2023 at 12:18:43AM +0200, Jann Horn wrote:

> 3. We have to *serialize* with page table walks performed by the
> IOMMU. We're doing an RCU barrier to synchronize against page table
> walks from the MMU, but without an appropriate mmu_notifier call, we
> have nothing to ensure that we aren't yanking a page table out from
> under an IOMMU page table walker while it's in the middle of its walk.
> Sure, this isn't very likely in practice, the IOMMU page table walker
> is probably pretty fast, but still we need some kind of explicit
> synchronization to make this robust, I think.

There is another thread talking about this..

Broadly we are saying that we need to call mmu ops invalidate_range at
any time the normal CPU TLB would be invalidated.

invalidate_range will not return until the iommu HW is coherent with
the current state of the page table.

Jason
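A minimal sketch of the pattern being described, assuming the mmu_notifier
API of this kernel (the function name and call site are illustrative only):

	#include <linux/mm.h>
	#include <linux/mmu_notifier.h>

	/* Illustrative only: whenever the CPU TLB is invalidated for a
	 * range, the secondary (IOMMU) TLBs must be brought to the same
	 * state before freed pages or page tables can be reused. */
	static void example_zap_and_sync(struct vm_area_struct *vma,
					 unsigned long start, unsigned long end)
	{
		flush_tlb_range(vma, start, end);
		mmu_notifier_invalidate_range(vma->vm_mm, start, end);
	}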


Re: [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page

2023-06-01 Thread Gerald Schaefer
On Mon, 29 May 2023 07:36:40 -0700 (PDT)
Hugh Dickins  wrote:

> On Mon, 29 May 2023, Matthew Wilcox wrote:
> > On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:  
> > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > +{
> > > + struct page *page;
> > > +
> > > + page = virt_to_page(pgtable);
> > > + call_rcu(&page->rcu_head, pte_free_now);
> > > +}  
> > 
> > This can't be safe (on ppc).  IIRC you might have up to 16x4k page
> > tables sharing one 64kB page.  So if you have two page tables from the
> > same page being defer-freed simultaneously, you'll reuse the rcu_head
> > and I cannot imagine things go well from that point.  
> 
> Oh yes, of course, thanks for catching that so quickly.
> So my s390 and sparc implementations will be equally broken.
> 
> > 
> > I have no idea how to solve this problem.  
> 
> I do: I'll have to go back to the more complicated implementation we
> actually ran with on powerpc - I was thinking those complications just
> related to deposit/withdraw matters, forgetting the one-rcu_head issue.
> 
> It uses large (0x1) increments of the page refcount, avoiding
> call_rcu() when already active.
> 
> It's not a complication I had wanted to explain or test for now,
> but we shall have to.  Should apply equally well to sparc, but s390
> more of a problem, since s390 already has its own refcount cleverness.

Yes, we have 2 pagetables in one 4K page, which could result in same
rcu_head reuse. It might be possible to use the cleverness from our
page_table_free() function, e.g. to only do the call_rcu() once, for
the case where both 2K pagetable fragments become unused, similar to
how we decide when to actually call __free_page().

However, it might be much worse, and page->rcu_head from a pagetable
page cannot be used at all for s390, because we also use page->lru
to keep our list of free 2K pagetable fragments. I always get confused
by struct page unions, so not completely sure, but it seems to me that
page->rcu_head would overlay with page->lru, right?
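For reference, a trimmed view of the relevant union in struct page
(simplified from include/linux/mm_types.h; unrelated members omitted) shows
that lru and rcu_head do indeed overlay:

	struct page {
		unsigned long flags;
		union {
			struct {	/* Page cache and anonymous pages */
				struct list_head lru;
				struct address_space *mapping;
				pgoff_t index;
				unsigned long private;
			};
			/* ... other users of this space ... */
			struct rcu_head rcu_head;	/* deferred freeing */
		};
		/* remaining members omitted in this sketch */
	};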


Re: [PATCH 08/12] mm/pgtable: add pte_free_defer() for pgtable as page

2023-06-01 Thread Jann Horn
On Mon, May 29, 2023 at 8:23 AM Hugh Dickins  wrote:
> Add the generic pte_free_defer(), to call pte_free() via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This version
> suits all those architectures which use an unfragmented page for one page
> table (none of whose pte_free()s use the mm arg which was passed to it).

Pages that have been scheduled for deferred freeing can still be
locked, right? So struct page's members "ptl" and "rcu_head" can now
be in use at the same time? If that's intended, it would probably be a
good idea to add comments in the "/* Page table pages */" part of
struct page to point out that the first two members can be used by the
rcu_head while the page is still used as a page table in some
contexts, including use of the ptl.


Re: [PATCH 8/9] powerpc: Add HOTPLUG_SMT support

2023-06-01 Thread Laurent Dufour
On 24/05/2023 17:56:29, Michael Ellerman wrote:
> Add support for HOTPLUG_SMT, which enables the generic sysfs SMT support
> files in /sys/devices/system/cpu/smt, as well as the "nosmt" boot
> parameter.

Hi Michael,

It seems that there is now a conflict with the PPC 'smt-enabled'
boot option.

After booting the patched kernel with 'smt-enabled=4', later changes to the
SMT level (for instance to 6) made through /sys/devices/system/cpu/smt/control
are not applied. Nothing happens.
Based on my early debug, I think the reason is that cpu_smt_num_threads=8
when entering __store_smt_control(). But I need to dig further.

BTW, should the 'smt-enabled' PPC specific option remain?

Cheers,
Laurent.

> Implement the recently added hooks to allow partial SMT states, allow
> any number of threads per core.
> 
> Tie the config symbol to HOTPLUG_CPU, which enables it on the major
> platforms that support SMT. If there are other platforms that want the
> SMT support that can be tweaked in future.
> 
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/Kconfig|  1 +
>  arch/powerpc/include/asm/topology.h | 25 +
>  arch/powerpc/kernel/smp.c   |  3 +++
>  3 files changed, 29 insertions(+)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 539d1f03ff42..5cf87ca10a9c 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -273,6 +273,7 @@ config PPC
>   select HAVE_SYSCALL_TRACEPOINTS
>   select HAVE_VIRT_CPU_ACCOUNTING
>   select HAVE_VIRT_CPU_ACCOUNTING_GEN
> + select HOTPLUG_SMT  if HOTPLUG_CPU
>   select HUGETLB_PAGE_SIZE_VARIABLE   if PPC_BOOK3S_64 && HUGETLB_PAGE
>   select IOMMU_HELPER if PPC64
>   select IRQ_DOMAIN
> diff --git a/arch/powerpc/include/asm/topology.h 
> b/arch/powerpc/include/asm/topology.h
> index 8a4d4f4d9749..1e9117a22d14 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -143,5 +143,30 @@ static inline int cpu_to_coregroup_id(int cpu)
>  #endif
>  #endif
>  
> +#ifdef CONFIG_HOTPLUG_SMT
> +#include 
> +#include 
> +
> +static inline bool topology_smt_supported(void)
> +{
> + return threads_per_core > 1;
> +}
> +
> +static inline bool topology_smt_threads_supported(unsigned int num_threads)
> +{
> + return num_threads <= threads_per_core;
> +}
> +
> +static inline bool topology_is_primary_thread(unsigned int cpu)
> +{
> + return cpu == cpu_first_thread_sibling(cpu);
> +}
> +
> +static inline bool topology_smt_thread_allowed(unsigned int cpu)
> +{
> + return cpu_thread_in_core(cpu) < cpu_smt_num_threads;
> +}
> +#endif
> +
>  #endif /* __KERNEL__ */
>  #endif   /* _ASM_POWERPC_TOPOLOGY_H */
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 265801a3e94c..eed20b9253b7 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1154,6 +1154,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>  
>   if (smp_ops && smp_ops->probe)
>   smp_ops->probe();
> +
> + // Initialise the generic SMT topology support
> + cpu_smt_check_topology(threads_per_core);
>  }
>  
>  void smp_prepare_boot_cpu(void)



Re: [PATCH v3 03/34] s390: Use pt_frag_refcount for pagetables

2023-06-01 Thread Gerald Schaefer
 On Wed, 31 May 2023 14:30:01 -0700
"Vishal Moola (Oracle)"  wrote:

> s390 currently uses _refcount to identify fragmented page tables.
> The page table struct already has a member pt_frag_refcount used by
> powerpc, so have s390 use that instead of the _refcount field as well.
> This improves the safety for _refcount and the page table tracking.
> 
> This also allows us to simplify the tracking since we can once again use
> the lower byte of pt_frag_refcount instead of the upper byte of _refcount.

This would conflict with the s390 impact of the pte_free_defer() work from Hugh Dickins
https://lore.kernel.org/lkml/35e983f5-7ed3-b310-d949-9ae8b130c...@google.com/
https://lore.kernel.org/lkml/6dd63b39-e71f-2e8b-7e0-83e02f3bc...@google.com/

There he uses pt_frag_refcount, or rather pt_mm in the same union, to save
the mm_struct for deferred pte_free().

I still need to look closer into both of your patch series, but so far it
seems that you have no hard functional requirement to switch from _refcount
to pt_frag_refcount here, for s390.

If this is correct, and you do not e.g. need this to make some other use
of _refcount, I would suggest to drop this patch.


Re: [PATCH 10/10] watchdog/hardlockup: Rename HAVE_HARDLOCKUP_DETECTOR_NON_ARCH to ..._PERF_OR_BUDDY

2023-06-01 Thread Petr Mladek
On Fri 2023-05-26 18:41:40, Douglas Anderson wrote:
> HAVE_HARDLOCKUP_DETECTOR_NON_ARCH is a mouthful and
> confusing. HAVE_HARDLOCKUP_DETECTOR_PERF_OR_BUDDY is even more of a
> mouthful, but probably less confusing. Rename the Kconfig names.

It is better. But I have an idea that might be even better.

> Signed-off-by: Douglas Anderson 
> ---
> 
>  lib/Kconfig.debug | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index eb1edd5905bc..b9e162698a82 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1058,7 +1058,7 @@ config HARDLOCKUP_DETECTOR_BUDDY
>  # needs SMP). In either case, using the "non-arch" code conflicts with
>  # the NMI watchdog code (which is sometimes used directly and sometimes used
>  # by the arch-provided hardlockup detector).

The comment above still uses the term "non-arch" and tries to
explain the confusion around it.

> -config HAVE_HARDLOCKUP_DETECTOR_NON_ARCH
> +config HAVE_HARDLOCKUP_DETECTOR_PERF_OR_BUDDY
>   bool
>   depends on (HAVE_HARDLOCKUP_DETECTOR_PERF || SMP) && !HAVE_NMI_WATCHDOG
>   default y
> @@ -1077,10 +1077,10 @@ config HARDLOCKUP_DETECTOR_PREFER_BUDDY
> an arch-specific hardlockup detector or if resources needed
> for the hardlockup detector are better used for other things.
>  
> -# This will select the appropriate non-arch hardlockdup detector
> -config HARDLOCKUP_DETECTOR_NON_ARCH
> +# This will select the appropriate non-arch hardlockup detector
> +config HARDLOCKUP_DETECTOR_PERF_OR_BUDDY
>   bool
> - depends on HAVE_HARDLOCKUP_DETECTOR_NON_ARCH
> + depends on HAVE_HARDLOCKUP_DETECTOR_PERF_OR_BUDDY
>   select HARDLOCKUP_DETECTOR_BUDDY if !HAVE_HARDLOCKUP_DETECTOR_PERF || 
> HARDLOCKUP_DETECTOR_PREFER_BUDDY
>   select HARDLOCKUP_DETECTOR_PERF if HAVE_HARDLOCKUP_DETECTOR_PERF && 
> !HARDLOCKUP_DETECTOR_PREFER_BUDDY
>  
> @@ -1098,9 +1098,9 @@ config HARDLOCKUP_CHECK_TIMESTAMP
>  config HARDLOCKUP_DETECTOR
>   bool "Detect Hard Lockups"
>   depends on DEBUG_KERNEL && !S390
> - depends on HAVE_HARDLOCKUP_DETECTOR_NON_ARCH || 
> HAVE_HARDLOCKUP_DETECTOR_ARCH
> + depends on HAVE_HARDLOCKUP_DETECTOR_PERF_OR_BUDDY || 
> HAVE_HARDLOCKUP_DETECTOR_ARCH
>   select LOCKUP_DETECTOR
> - select HARDLOCKUP_DETECTOR_NON_ARCH if HAVE_HARDLOCKUP_DETECTOR_NON_ARCH
> + select HARDLOCKUP_DETECTOR_PERF_OR_BUDDY if 
> HAVE_HARDLOCKUP_DETECTOR_PERF_OR_BUDDY
>  
>   help
> Say Y here to enable the kernel to act as a watchdog to detect

I am sorry but I am still confused by the logic. For me, it is far
from clear what combinations are possible, impossible, and optional.

Especially, the effect of HAVE_NMI_WATCHDOG and
HAVE_HARDLOCKUP_DETECTOR_ARCH is quite tricky.

I was playing with it and came up with a more straightforward solution
and found more possibilities for how to simplify the logic. I am going
to prepare a patchset that would replace this patch.

Just to get the idea. I made the following changes:

 + define the values in logical order:
+ HAVE_*
+ HARDLOCKUP_DETECTOR y/n value
+ HARDLOCKUP_DETECTOR_PREFER_BUDDY y/n value
+ HARDLOCKUP_DETECTOR_PERF decision based on above
+ HARDLOCKUP_DETECTOR_BUDDY decision based on above

 + remove HAVE_HARDLOCKUP_DETECTOR_PERF_OR_BUDDY,
   instead, explicitly define the dependencies on all HAVE_*
   variables to make it clear what is possible
   and what is not possible

 + remove HARDLOCKUP_DETECTOR_PERF_OR_BUDDY,
   instead use "imply" in HARDLOCKUP_DETECTOR to trigger
   re-evaluation of HARDLOCKUP_DETECTOR_PERF and
   HARDLOCKUP_DETECTOR_BUDDY decisions


My current version has the following in lib/Kconfig.debug:

--- cut ---
config HAVE_HARDLOCKUP_DETECTOR_BUDDY
bool
depends on SMP
default y

#
# arch/ can define HAVE_NMI_WATCHDOG to provide their own hard
# lockup detector rather than the generic perf or buddy detector.
#
config HARDLOCKUP_DETECTOR
bool "Detect Hard Lockups"
depends on DEBUG_KERNEL && !S390
depends on HAVE_HARDLOCKUP_DETECTOR_PERF || 
HAVE_HARDLOCKUP_DETECTOR_BUDDY || HAVE_NMI_WATCHDOG
imply HARDLOCKUP_DETECTOR_PERF
imply HARDLOCKUP_DETECTOR_BUDDY
select LOCKUP_DETECTOR

help
  Say Y here to enable the kernel to act as a watchdog to detect
  hard lockups.

  Hardlockups are bugs that cause the CPU to loop in kernel mode
  for more than 10 seconds, without letting other interrupts have a
  chance to run.  The current stack trace is displayed upon detection
  and the system will stay locked up.

#
# The architecture-specific variant is always used when available,
# see HAVE_NMI_WATCHDOG
#
config HARDLOCKUP_DETECTOR_PREFER_BUDDY
bool "Prefer the buddy CPU hardlockup detector"
depends on HARDLOCKUP_DETECTOR
depends on 

Re: [PATCH 2/2] powerpc/xmon: use KSYM_NAME_LEN in array size

2023-06-01 Thread Michael Ellerman
Miguel Ojeda  writes:
> On Thu, Jun 1, 2023 at 4:02 AM Michael Ellerman  wrote:
>>
>> > Side-note: in `get_function_bounds()`, I see `kallsyms_lookup()` being
>> > used, but the name seems discarded? Can
>> > `kallsyms_lookup_size_offset()` be used instead, thus avoiding the
>> > usage of the buffer there to begin with?
>>
>> A few lines below it uses the modname, and AFAICS there's no (easy) way
>> to lookup the modname without also looking up the name.
>
> Hmm... I think you are looking at the `xmon_print_symbol()` one? I was
> referring to the `get_function_bounds()` one, where the `modname`
> parameter is `NULL` (and the `name` contents are not used, only
> whether it was found or not).

Yes you're right, apparently I can't read :}

cheers


RE: [PATCH 00/89] i2c: Convert to platform remove callback returning void

2023-06-01 Thread Biju Das
> Subject: Re: [PATCH 00/89] i2c: Convert to platform remove callback
> returning void
> 
> [Dropped Phil Edworthy from recipents as his email address has problems]

Phil no longer works with Renesas. Adding Fabrizio, who is taking care of
the RZ/V2M I2C driver.

Cheers,
Biju


Re: [PATCH 00/89] i2c: Convert to platform remove callback returning void

2023-06-01 Thread Uwe Kleine-König
[Dropped Phil Edworthy from recipents as his email address has problems]

Hello,

On Mon, May 08, 2023 at 10:51:37PM +0200, Uwe Kleine-König wrote:
> this series convers the drivers below drivers/i2c to the .remove_new()
> callback of struct platform_driver(). The motivation is to make the
> remove callback less prone for errors and wrong assumptions. See commit
> 5c5a7680e67b ("platform: Provide a remove callback that returns no
> value") for a more detailed rationale.
> 
> All but one driver already returned zero unconditionally in their
> .remove() callback, so converting them to .remove_new() is trivial.
> i2c-davinci has two patches in this series, first the error path is
> improved to not return an error code, then it's converted as the others
> drivers are.
> 
> The two davinci patches are also the only interdependency in this
> series. I was unsure if I should split the series in two, the busses and
> the mux changes; if convenient these can be applied independent of each
> other.

I wonder how this series will go in. My expectation was that Wolfram
picks up the whole series via his tree?!

Best regards
Uwe

-- 
Pengutronix e.K.   | Uwe Kleine-König|
Industrial Linux Solutions | https://www.pengutronix.de/ |




Re: [PATCH v8 0/7] Add pci_dev_for_each_resource() helper and update users

2023-06-01 Thread Jonas Gorski
On Wed, 31 May 2023 at 23:30, Bjorn Helgaas  wrote:
>
> On Wed, May 31, 2023 at 08:48:35PM +0200, Jonas Gorski wrote:
> > ...
>
> > Looking at the code I understand where coverity is coming from:
> >
> > #define __pci_dev_for_each_res0(dev, res, ...) \
> >for (unsigned int __b = 0;  \
> > res = pci_resource_n(dev, __b), __b < PCI_NUM_RESOURCES;   \
> > __b++)
> >
> >  res will be assigned before __b is checked for being less than
> > PCI_NUM_RESOURCES, making it point to behind the array at the end of
> > the last loop iteration.
> >
> > Rewriting the test expression as
> >
> > __b < PCI_NUM_RESOURCES && (res = pci_resource_n(dev, __b));
> >
> > should avoid the (coverity) warning by making use of lazy evaluation.
> >
> > It probably makes the code slightly less performant as res will now be
> > checked for being not NULL (which will always be true), but I doubt it
> > will be significant (or in any hot paths).
>
> Thanks a lot for looking into this!  I think you're right, and I think
> the rewritten expression is more logical as well.  Do you want to post
> a patch for it?

Not sure when I'll come around to, so I have no strong feeling here.
So feel free to just borrow my suggestion, especially since I won't be
able to test it (don't have a kernel tree ready I can build and boot).

Also looking more closely at the Coverity output, I think it might not
handle the comma operator well in the loop condition:

>  11. incr: Incrementing __b. The value of __b may now be up to 17.
>  12. alias: Assigning: r = &pdev->resource[__b]. r may now point to 
> as high as element 17 of pdev->resource (which consists of 17 64-byte 
> elements).
>  13. Condition __b < PCI_NUM_RESOURCES, taking true branch.
>  14. Condition (r = &pdev->resource[__b]) , (__b < 
> PCI_NUM_RESOURCES), taking true branch.

In 13., if __b is 17 (= PCI_NUM_RESOURCES) we wouldn't take the true
branch, but somehow Coverity thinks that we do. No idea if it is worth
reporting to Coverity.

The changed condition statement should hopefully silence the warning though.

Regards
Jonas
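Spelled out against the macro quoted above, the suggested rewrite would look
roughly like this (a sketch, not a posted patch):

	#define __pci_dev_for_each_res0(dev, res, ...)			\
		for (unsigned int __b = 0;				\
		     __b < PCI_NUM_RESOURCES &&				\
			(res = pci_resource_n(dev, __b));		\
		     __b++)

	/* pci_resource_n() returns the address of an array element, so the
	 * pointer test is always true; only the bounds check terminates the
	 * loop, and it is now evaluated before res is assigned. */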


Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX

2023-06-01 Thread Mike Rapoport
On Thu, Jun 01, 2023 at 12:30:50PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 01, 2023 at 01:12:56PM +0300, Mike Rapoport wrote:
> 
> > +static void __init_or_module do_text_poke(void *addr, const void *opcode, 
> > size_t len)
> > +{
> > +   if (system_state < SYSTEM_RUNNING) {
> > +   text_poke_early(addr, opcode, len);
> > +   } else {
> > +   mutex_lock(&text_mutex);
> > +   text_poke(addr, opcode, len);
> > +   mutex_unlock(&text_mutex);
> > +   }
> > +}
> 
> So I don't much like do_text_poke(); why?

I believe the idea was to keep memcpy for early boot, before the kernel
image is protected, without adding if (is_module_text_address()) checks
all over the place.

I think this can be used instead without updating all the call sites of
text_poke_early():

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 91057de8e6bc..f994e63e9903 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1458,7 +1458,7 @@ void __init_or_module text_poke_early(void *addr, const 
void *opcode,
 * code cannot be running and speculative code-fetches are
 * prevented. Just change the code.
 */
-   memcpy(addr, opcode, len);
+   text_poke_copy(addr, opcode, len);
} else {
local_irq_save(flags);
memcpy(addr, opcode, len);
 
> > diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> > index aa99536b824c..d50595f2c1a6 100644
> > --- a/arch/x86/kernel/ftrace.c
> > +++ b/arch/x86/kernel/ftrace.c
> > @@ -118,10 +118,13 @@ ftrace_modify_code_direct(unsigned long ip, const 
> > char *old_code,
> > return ret;
> >  
> > /* replace the text with the new text */
> > -   if (ftrace_poke_late)
> > +   if (ftrace_poke_late) {
> > text_poke_queue((void *)ip, new_code, MCOUNT_INSN_SIZE, NULL);
> > -   else
> > -   text_poke_early((void *)ip, new_code, MCOUNT_INSN_SIZE);
> > +   } else {
> > +   mutex_lock(&text_mutex);
> > +   text_poke((void *)ip, new_code, MCOUNT_INSN_SIZE);
> > +   mutex_unlock(&text_mutex);
> > +   }
> > return 0;
> >  }
> 
> And in the above case it's actively wrong for losing the _queue()
> thing.

-- 
Sincerely yours,
Mike.


WARN at kernel/sched/core.c:5358 (kthread_end_lazy_tlb_mm)

2023-06-01 Thread Sachin Sant
While compiling a kernel on an IBM Power system booted with
6.4.0-rc4-next-20230601, the following warning is observed:

[  276.351697] [ cut here ]
[  276.351709] WARNING: CPU: 27 PID: 9237 at kernel/sched/core.c:5358 
kthread_end_lazy_tlb_mm+0x90/0xa0
[  276.351719] Modules linked in: dm_mod nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 
nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct 
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bonding tls 
rfkill ip_set nf_tables nfnetlink sunrpc pseries_rng aes_gcm_p10_crypto xfs 
libcrc32c sd_mod sr_mod t10_pi crc64_rocksoft_generic cdrom crc64_rocksoft 
crc64 sg ibmvscsi scsi_transport_srp ibmveth vmx_crypto fuse
[  276.351752] CPU: 27 PID: 9237 Comm: cc1 Kdump: loaded Not tainted 
6.4.0-rc4-next-20230601 #1
[  276.351756] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 
of:IBM,FW1030.20 (NH1030_058) hv:phyp pSeries
[  276.351759] NIP:  c01b8c10 LR: c00a8d54 CTR: c046ec00
[  276.351763] REGS: c000dce337d0 TRAP: 0700   Not tainted  
(6.4.0-rc4-next-20230601)
[  276.351766] MSR:  80029033   CR: 24002228  
XER: 
[  276.351774] CFAR: c01b8ba0 IRQMASK: 0  [  276.351774] GPR00: 
c00a8d54 c000dce33a70 c14a1800 c7852a00  [  
276.351774] GPR04: 0001   
c7852f78  [  276.351774] GPR08:   
 24002428  [  276.351774] GPR12: c000a032b608 
c0135faa5b00    [  276.351774] GPR16: 
     [  
276.351774] GPR20:    
  [  276.351774] GPR24:   
 c7852a70  [  276.351774] GPR28:  
 001b c7852a00  [  276.351810] NIP 
[c01b8c10] kthread_end_lazy_tlb_mm+0x90/0xa0
[  276.351814] LR [c00a8d54] exit_lazy_flush_tlb+0xf4/0x110
[  276.351818] Call Trace:
[  276.351820] [c000dce33a70] [0001] 0x1 (unreliable)
[  276.351825] [c000dce33ab0] [c00a8fbc] 
flush_type_needed+0x24c/0x260
[  276.351829] [c000dce33af0] [c00a91a8] __flush_all_mm+0x48/0x2c0
[  276.351833] [c000dce33b40] [c04d6dcc] tlb_finish_mmu+0x16c/0x230
[  276.351839] [c000dce33b70] [c04d2a2c] exit_mmap+0x17c/0x4c0
[  276.351844] [c000dce33c90] [c0159120] __mmput+0x60/0x1e0
[  276.351849] [c000dce33cc0] [c01689cc] exit_mm+0xdc/0x170
[  276.351853] [c000dce33d00] [c0168cec] do_exit+0x28c/0x580
[  276.351857] [c000dce33db0] [c016922c] do_group_exit+0x4c/0xc0
[  276.351861] [c000dce33df0] [c01692c8] sys_exit_group+0x28/0x30
[  276.351866] [c000dce33e10] [c0034adc] 
system_call_exception+0x13c/0x340
[  276.351871] [c000dce33e50] [c000d05c] 
system_call_vectored_common+0x15c/0x2ec
[  276.351876] --- interrupt: 3000 at 0x7fffa330c2c4
[  276.351879] NIP:  7fffa330c2c4 LR:  CTR: 
[  276.351882] REGS: c000dce33e80 TRAP: 3000   Not tainted  
(6.4.0-rc4-next-20230601)
[  276.351885] MSR:  8280f033   CR: 
44002422  XER: 
[  276.351894] IRQMASK: 0  [  276.351894] GPR00: 00ea 
7fffed1a4ca0 7fffa3a77e00   [  276.351894] GPR04: 
     [  
276.351894] GPR08:    
  [  276.351894] GPR12:  7fffa3a7cac0 
   [  276.351894] GPR16:  
    [  276.351894] GPR20: 
  003a 0001  [  
276.351894] GPR24: 7fffa34423c0  7fffa3440a38 
  [  276.351894] GPR28: 0001 7fffa3a75e48 
f000   [  276.351928] NIP [7fffa330c2c4] 
0x7fffa330c2c4
[  276.351930] LR [] 0x0
[  276.351932] --- interrupt: 3000
[  276.351935] Code: 38210020 e8010010 7c0803a6 4e800020 0fe0 6000 
6000 6000 4e800020 6000 6000 6000 <0fe0> 6000 
6000 6000  [  276.351946] ---[ end trace  ]—

Git bisect points to the following code change:

commit 253808d464bf472c66d299faa3d8ffb65149f4da
 lazy tlb: consolidate lazy tlb mm switching

- Sachin



Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX

2023-06-01 Thread Peter Zijlstra
On Thu, Jun 01, 2023 at 01:12:56PM +0300, Mike Rapoport wrote:

> +static void __init_or_module do_text_poke(void *addr, const void *opcode, 
> size_t len)
> +{
> + if (system_state < SYSTEM_RUNNING) {
> + text_poke_early(addr, opcode, len);
> + } else {
> + mutex_lock(&text_mutex);
> + text_poke(addr, opcode, len);
> + mutex_unlock(&text_mutex);
> + }
> +}

So I don't much like do_text_poke(); why?

> diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> index aa99536b824c..d50595f2c1a6 100644
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -118,10 +118,13 @@ ftrace_modify_code_direct(unsigned long ip, const char 
> *old_code,
>   return ret;
>  
>   /* replace the text with the new text */
> - if (ftrace_poke_late)
> + if (ftrace_poke_late) {
>   text_poke_queue((void *)ip, new_code, MCOUNT_INSN_SIZE, NULL);
> - else
> - text_poke_early((void *)ip, new_code, MCOUNT_INSN_SIZE);
> + } else {
> + mutex_lock(&text_mutex);
> + text_poke((void *)ip, new_code, MCOUNT_INSN_SIZE);
> + mutex_unlock(&text_mutex);
> + }
>   return 0;
>  }

And in the above case it's actively wrong for losing the _queue()
thing.


Re: [PATCH 2/2] powerpc/xmon: use KSYM_NAME_LEN in array size

2023-06-01 Thread Miguel Ojeda
On Thu, Jun 1, 2023 at 4:02 AM Michael Ellerman  wrote:
>
> > Side-note: in `get_function_bounds()`, I see `kallsyms_lookup()` being
> > used, but the name seems discarded? Can
> > `kallsyms_lookup_size_offset()` be used instead, thus avoiding the
> > usage of the buffer there to begin with?
>
> A few lines below it uses the modname, and AFAICS there's no (easy) way
> to lookup the modname without also looking up the name.

Hmm... I think you are looking at the `xmon_print_symbol()` one? I was
referring to the `get_function_bounds()` one, where the `modname`
parameter is `NULL` (and the `name` contents are not used, only
whether it was found or not).

> > Side-note 2: in `scanhex()`, I see a loop `i<63` using `tmpstr` which
> > then is used to do a `kallsyms_lookup_name()`, so I guess symbols
> > larger than 64 couldn't be found. I have no idea about what are the
> > external constraints here, but perhaps it is possible to increase the
> > `line` buffer etc. to then allow for bigger symbols to be found.
>
> Yeah that looks wrong. I don't see any symbols that long in current
> kernels, but we should fix it.
>
> Thanks for looking.

My pleasure!

Cheers,
Miguel


[PATCH 13/13] x86/jitalloc: make memory allocated for code ROX

2023-06-01 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

When STRICT_KERNEL_RWX or STRICT_MODULE_RWX is enabled, force text
allocations to use PAGE_KERNEL_ROX.

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/Kconfig |  3 +++
 arch/x86/Kconfig |  1 +
 arch/x86/kernel/ftrace.c |  3 ---
 arch/x86/mm/init.c   |  6 ++
 include/linux/jitalloc.h |  2 ++
 mm/jitalloc.c| 21 +
 6 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 479a7b8be191..e7c4b01307d7 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1307,6 +1307,9 @@ config STRICT_MODULE_RWX
  and non-text memory will be made non-executable. This provides
  protection against certain security exploits (e.g. writing to text)
 
+config ARCH_HAS_TEXT_POKE
+   def_bool n
+
 # select if the architecture provides an asm/dma-direct.h header
 config ARCH_HAS_PHYS_TO_DMA
bool
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fac4add6ce16..e1a512f557de 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -96,6 +96,7 @@ config X86
select ARCH_HAS_SET_DIRECT_MAP
select ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_HAS_STRICT_MODULE_RWX
+   select ARCH_HAS_TEXT_POKE
select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
select ARCH_HAS_SYSCALL_WRAPPER
select ARCH_HAS_UBSAN_SANITIZE_ALL
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index d50595f2c1a6..bd4dd8974ee6 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -313,7 +313,6 @@ create_trampoline(struct ftrace_ops *ops, unsigned int 
*tramp_size)
unsigned long call_offset;
unsigned long jmp_offset;
unsigned long offset;
-   unsigned long npages;
unsigned long size;
unsigned long *ptr;
void *trampoline;
@@ -350,7 +349,6 @@ create_trampoline(struct ftrace_ops *ops, unsigned int 
*tramp_size)
return 0;
 
*tramp_size = size + RET_SIZE + sizeof(void *);
-   npages = DIV_ROUND_UP(*tramp_size, PAGE_SIZE);
 
/* Copy ftrace_caller onto the trampoline memory */
ret = text_poke_copy(trampoline, (void *)start_offset, size);
@@ -416,7 +414,6 @@ create_trampoline(struct ftrace_ops *ops, unsigned int 
*tramp_size)
/* ALLOC_TRAMP flags lets us know we created it */
ops->flags |= FTRACE_OPS_FL_ALLOC_TRAMP;
 
-   set_memory_rox((unsigned long)trampoline, npages);
return (unsigned long)trampoline;
 fail:
tramp_free(trampoline);
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index ffaf9a3840ce..c314738991fa 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -1127,6 +1127,12 @@ struct jit_alloc_params *jit_alloc_arch_params(void)
jit_alloc_params.text.start = MODULES_VADDR + get_jit_load_offset();
jit_alloc_params.text.end = MODULES_END;
 
+   if (IS_ENABLED(CONFIG_STRICT_KERNEL_RWX) ||
+   IS_ENABLED(CONFIG_STRICT_MODULE_RWX)) {
+   jit_alloc_params.text.pgprot = PAGE_KERNEL_ROX;
+   jit_alloc_params.flags |= JIT_ALLOC_USE_TEXT_POKE;
+   }
+
return &jit_alloc_params;
 }
 #endif /* CONFIG_JIT_ALLOC */
diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
index 0ba5ef785a85..0e29e87acefe 100644
--- a/include/linux/jitalloc.h
+++ b/include/linux/jitalloc.h
@@ -15,9 +15,11 @@
 /**
  * enum jit_alloc_flags - options for executable memory allocations
  * @JIT_ALLOC_KASAN_SHADOW:allocate kasan shadow
+ * @JIT_ALLOC_USE_TEXT_POKE:   use text poking APIs to update memory
  */
 enum jit_alloc_flags {
JIT_ALLOC_KASAN_SHADOW  = (1 << 0),
+   JIT_ALLOC_USE_TEXT_POKE = (1 << 1),
 };
 
 /**
diff --git a/mm/jitalloc.c b/mm/jitalloc.c
index a8ae64364d56..15d1067faf3f 100644
--- a/mm/jitalloc.c
+++ b/mm/jitalloc.c
@@ -7,6 +7,26 @@
 
 static struct jit_alloc_params jit_alloc_params;
 
+#ifdef CONFIG_ARCH_HAS_TEXT_POKE
+#include 
+
+static inline void jit_text_poke_copy(void *dst, const void *src, size_t len)
+{
+   if (jit_alloc_params.flags & JIT_ALLOC_USE_TEXT_POKE)
+   text_poke_copy(dst, src, len);
+   else
+   memcpy(dst, src, len);
+}
+
+static inline void jit_text_poke_set(void *addr, int c, size_t len)
+{
+   if (jit_alloc_params.flags & JIT_ALLOC_USE_TEXT_POKE)
+   text_poke_set(addr, c, len);
+   else
+   memset(addr, c, len);
+}
+
+#else
 static inline void jit_text_poke_copy(void *dst, const void *src, size_t len)
 {
memcpy(dst, src, len);
@@ -16,6 +36,7 @@ static inline void jit_text_poke_set(void *addr, int c, 
size_t len)
 {
memset(addr, c, len);
 }
+#endif
 
 static void *jit_alloc(size_t len, unsigned int alignment, pgprot_t pgprot,
   unsigned long start, unsigned long end,
-- 
2.35.1



[PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX

2023-06-01 Thread Mike Rapoport
From: Song Liu 

Replace direct memory writes to memory allocated for code with text poking
to allow allocation of executable memory as ROX.

The only exception is arch_prepare_bpf_trampoline(), which cannot JIT
directly into module memory yet, so it uses set_memory calls to
unprotect the memory before writing to it and to protect it again at
the end.

Signed-off-by: Song Liu 
Co-developed-by: Mike Rapoport (IBM) 
Signed-off-by: Mike Rapoport (IBM) 
---
 arch/x86/kernel/alternative.c | 43 +++
 arch/x86/kernel/ftrace.c  | 41 +
 arch/x86/kernel/module.c  | 24 +--
 arch/x86/kernel/static_call.c | 10 
 arch/x86/kernel/unwind_orc.c  | 13 +++
 arch/x86/net/bpf_jit_comp.c   | 22 +-
 6 files changed, 91 insertions(+), 62 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index f615e0cb6d93..91057de8e6bc 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -76,6 +77,19 @@ do { 
\
}   \
 } while (0)
 
+void text_poke_early(void *addr, const void *opcode, size_t len);
+
+static void __init_or_module do_text_poke(void *addr, const void *opcode, 
size_t len)
+{
+   if (system_state < SYSTEM_RUNNING) {
+   text_poke_early(addr, opcode, len);
+   } else {
+   mutex_lock(&text_mutex);
+   text_poke(addr, opcode, len);
+   mutex_unlock(&text_mutex);
+   }
+}
+
 static const unsigned char x86nops[] =
 {
BYTES_NOP1,
@@ -108,7 +122,7 @@ static void __init_or_module add_nops(void *insns, unsigned 
int len)
unsigned int noplen = len;
if (noplen > ASM_NOP_MAX)
noplen = ASM_NOP_MAX;
-   memcpy(insns, x86_nops[noplen], noplen);
+   do_text_poke(insns, x86_nops[noplen], noplen);
insns += noplen;
len -= noplen;
}
@@ -120,7 +134,6 @@ extern s32 __cfi_sites[], __cfi_sites_end[];
 extern s32 __ibt_endbr_seal[], __ibt_endbr_seal_end[];
 extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
 extern s32 __smp_locks[], __smp_locks_end[];
-void text_poke_early(void *addr, const void *opcode, size_t len);
 
 /*
  * Are we looking at a near JMP with a 1 or 4-byte displacement.
@@ -331,7 +344,7 @@ void __init_or_module noinline apply_alternatives(struct 
alt_instr *start,
 
DUMP_BYTES(insn_buff, insn_buff_sz, "%px: final_insn: ", instr);
 
-   text_poke_early(instr, insn_buff, insn_buff_sz);
+   do_text_poke(instr, insn_buff, insn_buff_sz);
 
 next:
optimize_nops(instr, a->instrlen);
@@ -564,7 +577,7 @@ void __init_or_module noinline apply_retpolines(s32 *start, 
s32 *end)
optimize_nops(bytes, len);
DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
-   text_poke_early(addr, bytes, len);
+   do_text_poke(addr, bytes, len);
}
}
 }
@@ -638,7 +651,7 @@ void __init_or_module noinline apply_returns(s32 *start, 
s32 *end)
if (len == insn.length) {
DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
-   text_poke_early(addr, bytes, len);
+   do_text_poke(addr, bytes, len);
}
}
 }
@@ -674,7 +687,7 @@ static void poison_endbr(void *addr, bool warn)
 */
DUMP_BYTES(((u8*)addr), 4, "%px: orig: ", addr);
DUMP_BYTES(((u8*)&poison), 4, "%px: repl: ", addr);
-   text_poke_early(addr, &poison, 4);
+   do_text_poke(addr, &poison, 4);
 }
 
 /*
@@ -869,7 +882,7 @@ static int cfi_disable_callers(s32 *start, s32 *end)
if (!hash) /* nocfi callers */
continue;
 
-   text_poke_early(addr, jmp, 2);
+   do_text_poke(addr, jmp, 2);
}
 
return 0;
@@ -892,7 +905,7 @@ static int cfi_enable_callers(s32 *start, s32 *end)
if (!hash) /* nocfi callers */
continue;
 
-   text_poke_early(addr, mov, 2);
+   do_text_poke(addr, mov, 2);
}
 
return 0;
@@ -913,7 +926,7 @@ static int cfi_rand_preamble(s32 *start, s32 *end)
return -EINVAL;
 
hash = cfi_rehash(hash);
-   text_poke_early(addr + 1, &hash, 4);
+   do_text_poke(addr + 1, &hash, 4);
}
 
return 0;
@@ -932,9 +945,9 @@ static int cfi_rewrite_preamble(s32 *start, s32 *end)
  

[PATCH 11/13] ftrace: Add swap_func to ftrace_process_locs()

2023-06-01 Thread Mike Rapoport
From: Song Liu 

ftrace_process_locs() sorts the module mcount locations, which reside in RO
memory. Add ftrace_swap_func() so that archs can use an RO-memory-poke
function to do the sorting.

Signed-off-by: Song Liu 
---
 include/linux/ftrace.h |  2 ++
 kernel/trace/ftrace.c  | 13 -
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index b23bdd414394..fe443b8ed32c 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -1166,4 +1166,6 @@ unsigned long arch_syscall_addr(int nr);
 
 #endif /* CONFIG_FTRACE_SYSCALLS */
 
+void ftrace_swap_func(void *a, void *b, int n);
+
 #endif /* _LINUX_FTRACE_H */
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 764668467155..f5ddc9d4cfb6 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -6430,6 +6430,17 @@ static void test_is_sorted(unsigned long *start, unsigned long count)
 }
 #endif
 
+void __weak ftrace_swap_func(void *a, void *b, int n)
+{
+   unsigned long t;
+
+   WARN_ON_ONCE(n != sizeof(t));
+
+   t = *((unsigned long *)a);
+   *(unsigned long *)a = *(unsigned long *)b;
+   *(unsigned long *)b = t;
+}
+
 static int ftrace_process_locs(struct module *mod,
   unsigned long *start,
   unsigned long *end)
@@ -6455,7 +6466,7 @@ static int ftrace_process_locs(struct module *mod,
 */
if (!IS_ENABLED(CONFIG_BUILDTIME_MCOUNT_SORT) || mod) {
sort(start, count, sizeof(*start),
-ftrace_cmp_ips, NULL);
+ftrace_cmp_ips, ftrace_swap_func);
} else {
test_is_sorted(start, count);
}
-- 
2.35.1



[PATCH 10/13] modules, jitalloc: prepare to allocate executable memory as ROX

2023-06-01 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Once executable memory is allocated as ROX, it will no longer be possible
to update it using memset() and memcpy().

Introduce jit_update_copy() and jit_update_set() APIs and use them in the
module loading code instead of memcpy() and memset().
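
For now these wrappers are plain memcpy()/memset(); once the text is
actually mapped ROX later in the series, an implementation could route
them through a text-poking primitive instead. A hypothetical sketch,
where jit_text_is_rox() is an invented helper and text_poke_copy() is
x86's existing primitive:

	void jit_update_copy(void *buf, void *new_buf, size_t len)
	{
		if (jit_text_is_rox())		/* hypothetical helper */
			text_poke_copy(buf, new_buf, len);
		else
			memcpy(buf, new_buf, len);
	}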

Signed-off-by: Mike Rapoport (IBM) 
---
 include/linux/jitalloc.h |  2 ++
 kernel/module/main.c | 19 ++-
 mm/jitalloc.c| 20 
 3 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
index 7f8cafb3cfe9..0ba5ef785a85 100644
--- a/include/linux/jitalloc.h
+++ b/include/linux/jitalloc.h
@@ -55,6 +55,8 @@ struct jit_alloc_params *jit_alloc_arch_params(void);
 void jit_free(void *buf);
 void *jit_text_alloc(size_t len);
 void *jit_data_alloc(size_t len);
+void jit_update_copy(void *buf, void *new_buf, size_t len);
+void jit_update_set(void *buf, int c, size_t len);
 
 #ifdef CONFIG_JIT_ALLOC
 void jit_alloc_init(void);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 91477aa5f671..9f0711c42aa2 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1197,9 +1197,19 @@ void __weak module_arch_freeing_init(struct module *mod)
 
 static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
 {
-   if (mod_mem_type_is_data(type))
-   return jit_data_alloc(size);
-   return jit_text_alloc(size);
+   void *p;
+
+   if (mod_mem_type_is_data(type)) {
+   p = jit_data_alloc(size);
+   if (p)
+   memset(p, 0, size);
+   } else {
+   p = jit_text_alloc(size);
+   if (p)
+   jit_update_set(p, 0, size);
+   }
+
+   return p;
 }
 
 static void module_memory_free(void *ptr, enum mod_mem_type type)
@@ -2223,7 +2233,6 @@ static int move_module(struct module *mod, struct load_info *info)
t = type;
goto out_enomem;
}
-   memset(ptr, 0, mod->mem[type].size);
mod->mem[type].base = ptr;
}
 
@@ -2251,7 +2260,7 @@ static int move_module(struct module *mod, struct load_info *info)
ret = -ENOEXEC;
goto out_enomem;
}
-   memcpy(dest, (void *)shdr->sh_addr, shdr->sh_size);
+   jit_update_copy(dest, (void *)shdr->sh_addr, shdr->sh_size);
}
/*
 * Update the userspace copy's ELF section address to point to
diff --git a/mm/jitalloc.c b/mm/jitalloc.c
index 16fd715d501a..a8ae64364d56 100644
--- a/mm/jitalloc.c
+++ b/mm/jitalloc.c
@@ -7,6 +7,16 @@
 
 static struct jit_alloc_params jit_alloc_params;
 
+static inline void jit_text_poke_copy(void *dst, const void *src, size_t len)
+{
+   memcpy(dst, src, len);
+}
+
+static inline void jit_text_poke_set(void *addr, int c, size_t len)
+{
+   memset(addr, c, len);
+}
+
 static void *jit_alloc(size_t len, unsigned int alignment, pgprot_t pgprot,
   unsigned long start, unsigned long end,
   unsigned long fallback_start, unsigned long fallback_end,
@@ -86,6 +96,16 @@ void *jit_data_alloc(size_t len)
 fallback_start, fallback_end, kasan);
 }
 
+void jit_update_copy(void *buf, void *new_buf, size_t len)
+{
+   jit_text_poke_copy(buf, new_buf, len);
+}
+
+void jit_update_set(void *addr, int c, size_t len)
+{
+   jit_text_poke_set(addr, c, len);
+}
+
 struct jit_alloc_params * __weak jit_alloc_arch_params(void)
 {
return NULL;
-- 
2.35.1



[PATCH 09/13] kprobes: remove dependency on CONFIG_MODULES

2023-06-01 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

kprobes depended on CONFIG_MODULES because it had to allocate memory for
code.

Since code allocations are now implemented with jitalloc, kprobes can be
enabled in non-modular kernels.

Add #ifdef CONFIG_MODULES guards around the code dealing with kprobes
inside modules, make CONFIG_KPROBES select CONFIG_JIT_ALLOC and drop the
dependency of CONFIG_KPROBES on CONFIG_MODULES.
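
With this, a configuration like the following becomes possible
(illustrative fragment, not part of the patch):

	# kprobes in a kernel without loadable module support
	CONFIG_MODULES=n
	CONFIG_HAVE_KPROBES=y
	CONFIG_KPROBES=y
	# JIT_ALLOC is selected by KPROBES
	CONFIG_JIT_ALLOC=y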

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/Kconfig|  2 +-
 kernel/kprobes.c| 43 +
 kernel/trace/trace_kprobe.c | 11 ++
 3 files changed, 37 insertions(+), 19 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 205fd23e0cad..479a7b8be191 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -39,9 +39,9 @@ config GENERIC_ENTRY
 
 config KPROBES
bool "Kprobes"
-   depends on MODULES
depends on HAVE_KPROBES
select KALLSYMS
+   select JIT_ALLOC
select TASKS_RCU if PREEMPTION
help
  Kprobes allows you to trap at almost any kernel address and
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 3caf3561c048..11c1cfbb11ae 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1568,6 +1568,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
 
+#ifdef CONFIG_MODULES
/* Check if 'p' is probing a module. */
*probed_mod = __module_text_address((unsigned long) p->addr);
if (*probed_mod) {
@@ -1591,6 +1592,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
ret = -ENOENT;
}
}
+#endif
+
 out:
preempt_enable();
jump_label_unlock();
@@ -2484,24 +2487,6 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end)
return 0;
 }
 
-/* Remove all symbols in given area from kprobe blacklist */
-static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
-{
-   struct kprobe_blacklist_entry *ent, *n;
-
-   list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
-   if (ent->start_addr < start || ent->start_addr >= end)
-   continue;
-   list_del(&ent->list);
-   kfree(ent);
-   }
-}
-
-static void kprobe_remove_ksym_blacklist(unsigned long entry)
-{
-   kprobe_remove_area_blacklist(entry, entry + 1);
-}
-
 int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value,
   char *type, char *sym)
 {
@@ -2566,6 +2551,25 @@ static int __init populate_kprobe_blacklist(unsigned long *start,
return ret ? : arch_populate_kprobe_blacklist();
 }
 
+#ifdef CONFIG_MODULES
+/* Remove all symbols in given area from kprobe blacklist */
+static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
+{
+   struct kprobe_blacklist_entry *ent, *n;
+
+   list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
+   if (ent->start_addr < start || ent->start_addr >= end)
+   continue;
+   list_del(&ent->list);
+   kfree(ent);
+   }
+}
+
+static void kprobe_remove_ksym_blacklist(unsigned long entry)
+{
+   kprobe_remove_area_blacklist(entry, entry + 1);
+}
+
 static void add_module_kprobe_blacklist(struct module *mod)
 {
unsigned long start, end;
@@ -2667,6 +2671,7 @@ static struct notifier_block kprobe_module_nb = {
.notifier_call = kprobes_module_callback,
.priority = 0
 };
+#endif
 
 void kprobe_free_init_mem(void)
 {
@@ -2726,8 +2731,10 @@ static int __init init_kprobes(void)
err = arch_init_kprobes();
if (!err)
	err = register_die_notifier(&kprobe_exceptions_nb);
+#ifdef CONFIG_MODULES
if (!err)
	err = register_module_notifier(&kprobe_module_nb);
+#endif
 
kprobes_initialized = (err == 0);
kprobe_sysctls_init();
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 59cda19a9033..cf804e372554 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -111,6 +111,7 @@ static nokprobe_inline bool trace_kprobe_within_module(struct trace_kprobe *tk,
return strncmp(module_name(mod), name, len) == 0 && name[len] == ':';
 }
 
+#ifdef CONFIG_MODULES
 static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
 {
char *p;
@@ -129,6 +130,12 @@ static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
 
return ret;
 }
+#else
+static inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
+{
+   return false;
+}
+#endif
 
 static bool trace_kprobe_is_busy(struct dyn_event *ev)
 {
@@ -670,6 +677,7 @@ static int register_trace_kprobe(struct trace_kprobe *tk)
return ret;
 }
 
+#ifdef CONFIG_MODULES
 /* Module notifier call back, checking event on the module */
 static int trace_kprobe_module_callback(struct notifier_block *nb,
   

[PATCH 08/13] arch: make jitalloc setup available regardless of CONFIG_MODULES

2023-06-01 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

jitalloc does not depend on modules; on the contrary, modules use
jitalloc.

To make jitalloc available when CONFIG_MODULES=n, for instance for
kprobes, split the jit_alloc_params initialization out of
arch/*/kernel/module.c and compile it when CONFIG_JIT_ALLOC=y.
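
The newly added per-arch files all follow one pattern; for instance, the
new arch/sparc/mm/jitalloc.c (whose content is truncated from this
digest) plausibly looks like the following sketch, with the actual sparc
values possibly differing:

	/* arch/sparc/mm/jitalloc.c -- sketch of the per-arch pattern */
	#include <linux/jitalloc.h>

	static struct jit_alloc_params jit_alloc_params = {
		.alignment	= 1,
		.text.pgprot	= PAGE_KERNEL,	/* assumed */
		.text.start	= MODULES_VADDR,
		.text.end	= MODULES_END,
	};

	struct jit_alloc_params *jit_alloc_arch_params(void)
	{
		return &jit_alloc_params;
	}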

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/arm/kernel/module.c   | 32 --
 arch/arm/mm/init.c | 35 
 arch/arm64/kernel/module.c | 40 
 arch/arm64/mm/init.c   | 42 ++
 arch/loongarch/kernel/module.c | 14 
 arch/loongarch/mm/init.c   | 16 +
 arch/mips/kernel/module.c  | 17 --
 arch/mips/mm/init.c| 19 +++
 arch/parisc/kernel/module.c| 17 --
 arch/parisc/mm/init.c  | 21 -
 arch/powerpc/kernel/module.c   | 39 ---
 arch/powerpc/mm/mem.c  | 41 +
 arch/riscv/kernel/module.c | 16 -
 arch/riscv/mm/init.c   | 18 +++
 arch/s390/kernel/module.c  | 32 --
 arch/s390/mm/init.c| 35 
 arch/sparc/kernel/module.c | 19 ---
 arch/sparc/mm/Makefile |  2 ++
 arch/sparc/mm/jitalloc.c   | 21 +
 19 files changed, 249 insertions(+), 227 deletions(-)
 create mode 100644 arch/sparc/mm/jitalloc.c

diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index 83ccbf98164f..054e799e7091 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -16,44 +16,12 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 #include 
 #include 
 #include 
 
-#ifdef CONFIG_XIP_KERNEL
-/*
- * The XIP kernel text is mapped in the module area for modules and
- * some other stuff to work without any indirect relocations.
- * MODULES_VADDR is redefined here and not in asm/memory.h to avoid
- * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off.
- */
-#undef MODULES_VADDR
-#define MODULES_VADDR  (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK)
-#endif
-
-#ifdef CONFIG_MMU
-static struct jit_alloc_params jit_alloc_params = {
-   .alignment  = 1,
-   .text.start = MODULES_VADDR,
-   .text.end   = MODULES_END,
-};
-
-struct jit_alloc_params *jit_alloc_arch_params(void)
-{
-   jit_alloc_params.text.pgprot = PAGE_KERNEL_EXEC;
-
-   if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
-   jit_alloc_params.text.fallback_start = VMALLOC_START;
-   jit_alloc_params.text.fallback_end = VMALLOC_END;
-   }
-
-   return &jit_alloc_params;
-}
-#endif
-
 bool module_init_section(const char *name)
 {
return strstarts(name, ".init") ||
diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index ce64bdb55a16..e492625b7f3d 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -486,3 +487,37 @@ void free_initrd_mem(unsigned long start, unsigned long end)
free_reserved_area((void *)start, (void *)end, -1, "initrd");
 }
 #endif
+
+#ifdef CONFIG_JIT_ALLOC
+#ifdef CONFIG_XIP_KERNEL
+/*
+ * The XIP kernel text is mapped in the module area for modules and
+ * some other stuff to work without any indirect relocations.
+ * MODULES_VADDR is redefined here and not in asm/memory.h to avoid
+ * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off.
+ */
+#undef MODULES_VADDR
+#define MODULES_VADDR  (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK)
+#endif
+
+#ifdef CONFIG_MMU
+static struct jit_alloc_params jit_alloc_params = {
+   .alignment  = 1,
+   .text.start = MODULES_VADDR,
+   .text.end   = MODULES_END,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
+{
+   jit_alloc_params.text.pgprot = PAGE_KERNEL_EXEC;
+
+   if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
+   jit_alloc_params.text.fallback_start = VMALLOC_START;
+   jit_alloc_params.text.fallback_end = VMALLOC_END;
+   }
+
+   return &jit_alloc_params;
+}
+#endif
+
+#endif /* CONFIG_JIT_ALLOC */
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index 91ffcff5a44c..6d09b29fe9db 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -17,51 +17,11 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
 #include 
 
-static struct jit_alloc_params jit_alloc_params = {
-   .alignment  = JIT_ALLOC_ALIGN,
-   .flags  = JIT_ALLOC_KASAN_SHADOW,
-};
-
-struct jit_alloc_params *jit_alloc_arch_params(void)
-{
-   u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
-
-   if (IS_ENABLED(CONFIG_KASAN_GENERIC) ||
-   IS_ENABLED(CONFIG_KASAN_SW_TAGS))
-   /* don't exceed the static module region - see 

[PATCH 07/13] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES

2023-06-01 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Dynamic ftrace must allocate memory for code, and this was impossible
without CONFIG_MODULES.

With jitalloc separated from the modules code, jit_text_alloc() is
available regardless of CONFIG_MODULES.

Move the jitalloc initialization to arch/x86/mm/init.c so that it won't
get compiled away when CONFIG_MODULES=n, and enable dynamic ftrace
unconditionally.
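
The net effect is that a configuration like this becomes valid
(illustrative fragment, not part of the patch):

	# x86-64: dynamic ftrace without loadable module support
	CONFIG_MODULES=n
	CONFIG_DYNAMIC_FTRACE=y
	# selected via "JIT_ALLOC if DYNAMIC_FTRACE"
	CONFIG_JIT_ALLOC=y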

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/x86/Kconfig |  1 +
 arch/x86/kernel/ftrace.c |  9 
 arch/x86/kernel/module.c | 44 --
 arch/x86/mm/init.c   | 46 
 4 files changed, 47 insertions(+), 53 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53bab123a8ee..fac4add6ce16 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -35,6 +35,7 @@ config X86_64
select SWIOTLB
select ARCH_HAS_ELFCORE_COMPAT
select ZONE_DMA32
+   select JIT_ALLOC if DYNAMIC_FTRACE
 
 config FORCE_DYNAMIC_FTRACE
def_bool y
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 157c8a799704..aa99536b824c 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -261,7 +261,6 @@ void arch_ftrace_update_code(int command)
 /* Currently only x86_64 supports dynamic trampolines */
 #ifdef CONFIG_X86_64
 
-#ifdef CONFIG_MODULES
 /* Module allocation simplifies allocating memory for code */
 static inline void *alloc_tramp(unsigned long size)
 {
@@ -271,14 +270,6 @@ static inline void tramp_free(void *tramp)
 {
jit_free(tramp);
 }
-#else
-/* Trampolines can only be created if modules are supported */
-static inline void *alloc_tramp(unsigned long size)
-{
-   return NULL;
-}
-static inline void tramp_free(void *tramp) { }
-#endif
 
 /* Defined as markers to the end of the ftrace default trampolines */
 extern void ftrace_regs_caller_end(void);
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index cacca613b8bd..94a00dc103cd 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -19,7 +19,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 #include 
@@ -37,49 +36,6 @@ do { \
 } while (0)
 #endif
 
-#ifdef CONFIG_RANDOMIZE_BASE
-static unsigned long module_load_offset;
-
-/* Mutex protects the module_load_offset. */
-static DEFINE_MUTEX(module_kaslr_mutex);
-
-static unsigned long int get_module_load_offset(void)
-{
-   if (kaslr_enabled()) {
-   mutex_lock(&module_kaslr_mutex);
-   /*
-* Calculate the module_load_offset the first time this
-* code is called. Once calculated it stays the same until
-* reboot.
-*/
-   if (module_load_offset == 0)
-   module_load_offset =
-   get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
-   mutex_unlock(&module_kaslr_mutex);
-   }
-   return module_load_offset;
-}
-#else
-static unsigned long int get_module_load_offset(void)
-{
-   return 0;
-}
-#endif
-
-static struct jit_alloc_params jit_alloc_params = {
-   .alignment  = JIT_ALLOC_ALIGN,
-   .flags  = JIT_ALLOC_KASAN_SHADOW,
-};
-
-struct jit_alloc_params *jit_alloc_arch_params(void)
-{
-   jit_alloc_params.text.pgprot = PAGE_KERNEL;
-   jit_alloc_params.text.start = MODULES_VADDR + get_module_load_offset();
-   jit_alloc_params.text.end = MODULES_END;
-
-   return _alloc_params;
-}
-
 #ifdef CONFIG_X86_32
 int apply_relocate(Elf32_Shdr *sechdrs,
   const char *strtab,
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 3cdac0f0055d..ffaf9a3840ce 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1084,3 +1085,48 @@ unsigned long arch_max_swapfile_size(void)
return pages;
 }
 #endif
+
+#ifdef CONFIG_JIT_ALLOC
+#ifdef CONFIG_RANDOMIZE_BASE
+static unsigned long jit_load_offset;
+
+/* Mutex protects the jit_load_offset. */
+static DEFINE_MUTEX(jit_kaslr_mutex);
+
+static unsigned long int get_jit_load_offset(void)
+{
+   if (kaslr_enabled()) {
+   mutex_lock(&jit_kaslr_mutex);
+   /*
+* Calculate the jit_load_offset the first time this
+* code is called. Once calculated it stays the same until
+* reboot.
+*/
+   if (jit_load_offset == 0)
+   jit_load_offset =
+   get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
+   mutex_unlock(&jit_kaslr_mutex);
+   }
+   return jit_load_offset;
+}
+#else
+static unsigned long int get_jit_load_offset(void)
+{
+   return 0;
+}
+#endif
+
+static struct jit_alloc_params jit_alloc_params = {
+   .alignment  = JIT_ALLOC_ALIGN,
+   .flags  = JIT_ALLOC_KASAN_SHADOW,
+};
+

[PATCH 06/13] mm/jitalloc: introduce jit_data_alloc()

2023-06-01 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Data related to code allocations, such as module data sections, needs to
comply with architecture constraints for its placement, yet until now it
was allocated with jit_text_alloc().

Create a dedicated API for allocating data related to code allocations
and allow architectures to define address ranges for data allocations.

Since currently this is only relevant for powerpc variants that use the
VMALLOC address space for module data allocations, automatically reuse
the address ranges defined for text unless an address range for data is
explicitly defined by an architecture.

With the separation of code and data allocations, data sections of the
modules are now mapped as PAGE_KERNEL rather than PAGE_KERNEL_EXEC, which
was the default on many architectures.
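
The "reuse the text ranges" behaviour amounts to roughly the following in
the core (a sketch consistent with the description above; the actual
helper, jit_alloc_init_missing(), is truncated at the end of this
message):

	static void jit_alloc_init_missing(struct jit_alloc_params *p)
	{
		if (!p->data.start) {
			/* no data range from the arch: reuse text, non-exec */
			p->data.pgprot		= PAGE_KERNEL;
			p->data.start		= p->text.start;
			p->data.end		= p->text.end;
			p->data.fallback_start	= p->text.fallback_start;
			p->data.fallback_end	= p->text.fallback_end;
		}
	}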

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/powerpc/kernel/module.c |  8 
 include/linux/jitalloc.h |  2 ++
 kernel/module/main.c | 15 +++
 mm/jitalloc.c| 36 
 4 files changed, 49 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index 83bdedc7eba0..b58af61e90c0 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -96,6 +96,10 @@ static struct jit_alloc_params jit_alloc_params = {
 
 struct jit_alloc_params *jit_alloc_arch_params(void)
 {
+   /*
+* BOOK3S_32 and 8xx define MODULES_VADDR for text allocations and
+* allow allocating data in the entire vmalloc space
+*/
 #ifdef MODULES_VADDR
	pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : PAGE_KERNEL_EXEC;
unsigned long limit = (unsigned long)_etext - SZ_32M;
@@ -112,6 +116,10 @@ struct jit_alloc_params *jit_alloc_arch_params(void)
jit_alloc_params.text.start = MODULES_VADDR;
jit_alloc_params.text.end = MODULES_END;
}
+
+   jit_alloc_params.data.pgprot= PAGE_KERNEL;
+   jit_alloc_params.data.start = VMALLOC_START;
+   jit_alloc_params.data.end   = VMALLOC_END;
 #else
jit_alloc_params.text.start = VMALLOC_START;
jit_alloc_params.text.end = VMALLOC_END;
diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
index 823b13706a90..7f8cafb3cfe9 100644
--- a/include/linux/jitalloc.h
+++ b/include/linux/jitalloc.h
@@ -45,6 +45,7 @@ struct jit_address_space {
  */
 struct jit_alloc_params {
struct jit_address_spacetext;
+   struct jit_address_spacedata;
enum jit_alloc_flagsflags;
unsigned intalignment;
 };
@@ -53,6 +54,7 @@ struct jit_alloc_params *jit_alloc_arch_params(void);
 
 void jit_free(void *buf);
 void *jit_text_alloc(size_t len);
+void *jit_data_alloc(size_t len);
 
 #ifdef CONFIG_JIT_ALLOC
 void jit_alloc_init(void);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index dfb7fa109f1a..91477aa5f671 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1195,25 +1195,16 @@ void __weak module_arch_freeing_init(struct module *mod)
 {
 }
 
-static bool mod_mem_use_vmalloc(enum mod_mem_type type)
-{
-   return IS_ENABLED(CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC) &&
-   mod_mem_type_is_core_data(type);
-}
-
 static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
 {
-   if (mod_mem_use_vmalloc(type))
-   return vzalloc(size);
+   if (mod_mem_type_is_data(type))
+   return jit_data_alloc(size);
return jit_text_alloc(size);
 }
 
 static void module_memory_free(void *ptr, enum mod_mem_type type)
 {
-   if (mod_mem_use_vmalloc(type))
-   vfree(ptr);
-   else
-   jit_free(ptr);
+   jit_free(ptr);
 }
 
 static void free_mod_mem(struct module *mod)
diff --git a/mm/jitalloc.c b/mm/jitalloc.c
index 221940e36b46..16fd715d501a 100644
--- a/mm/jitalloc.c
+++ b/mm/jitalloc.c
@@ -72,6 +72,20 @@ void *jit_text_alloc(size_t len)
 fallback_start, fallback_end, kasan);
 }
 
+void *jit_data_alloc(size_t len)
+{
+   unsigned int align = jit_alloc_params.alignment;
+   pgprot_t pgprot = jit_alloc_params.data.pgprot;
+   unsigned long start = jit_alloc_params.data.start;
+   unsigned long end = jit_alloc_params.data.end;
+   unsigned long fallback_start = jit_alloc_params.data.fallback_start;
+   unsigned long fallback_end = jit_alloc_params.data.fallback_end;
+   bool kasan = jit_alloc_params.flags & JIT_ALLOC_KASAN_SHADOW;
+
+   return jit_alloc(len, align, pgprot, start, end,
+fallback_start, fallback_end, kasan);
+}
+
 struct jit_alloc_params * __weak jit_alloc_arch_params(void)
 {
return NULL;
@@ -88,6 +102,23 @@ static bool jit_alloc_validate_params(struct jit_alloc_params *p)
return true;
 }
 
+static void jit_alloc_init_missing(struct jit_alloc_params *p)
+{
+   if 

[PATCH 05/13] module, jitalloc: drop module_alloc

2023-06-01 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Define default parameters for the address range for code allocations
using the current values in module_alloc() and make jit_text_alloc() use
these defaults when an architecture does not supply its specific
parameters.

With this, jit_text_alloc() implements memory allocation in a way
compatible with module_alloc() and can be used as a replacement for
module_alloc().

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/arm64/kernel/module.c   |  2 +-
 arch/s390/kernel/module.c|  2 +-
 arch/x86/kernel/module.c |  2 +-
 include/linux/jitalloc.h |  8 
 include/linux/moduleloader.h | 12 
 kernel/module/main.c |  7 ---
 mm/jitalloc.c| 31 +--
 7 files changed, 28 insertions(+), 36 deletions(-)

diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index ecf1f4030317..91ffcff5a44c 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -24,7 +24,7 @@
 #include 
 
 static struct jit_alloc_params jit_alloc_params = {
-   .alignment  = MODULE_ALIGN,
+   .alignment  = JIT_ALLOC_ALIGN,
.flags  = JIT_ALLOC_KASAN_SHADOW,
 };
 
diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
index 0986a1a1b261..3f85cf1e7c4e 100644
--- a/arch/s390/kernel/module.c
+++ b/arch/s390/kernel/module.c
@@ -56,7 +56,7 @@ static unsigned long get_module_load_offset(void)
 }
 
 static struct jit_alloc_params jit_alloc_params = {
-   .alignment  = MODULE_ALIGN,
+   .alignment  = JIT_ALLOC_ALIGN,
.flags  = JIT_ALLOC_KASAN_SHADOW,
.text.pgprot= PAGE_KERNEL,
 };
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index cce84b61a036..cacca613b8bd 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -67,7 +67,7 @@ static unsigned long int get_module_load_offset(void)
 #endif
 
 static struct jit_alloc_params jit_alloc_params = {
-   .alignment  = MODULE_ALIGN,
+   .alignment  = JIT_ALLOC_ALIGN,
.flags  = JIT_ALLOC_KASAN_SHADOW,
 };
 
diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
index 34ee57795a18..823b13706a90 100644
--- a/include/linux/jitalloc.h
+++ b/include/linux/jitalloc.h
@@ -4,6 +4,14 @@
 
 #include 
 
+#if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
+   !defined(CONFIG_KASAN_VMALLOC)
+#include 
+#define JIT_ALLOC_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
+#else
+#define JIT_ALLOC_ALIGN PAGE_SIZE
+#endif
+
 /**
  * enum jit_alloc_flags - options for executable memory allocations
  * @JIT_ALLOC_KASAN_SHADOW:allocate kasan shadow
diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index b3374342f7af..4321682fe849 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -25,10 +25,6 @@ int module_frob_arch_sections(Elf_Ehdr *hdr,
 /* Additional bytes needed by arch in front of individual sections */
unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
 
-/* Allocator used for allocating struct module, core sections and init
-   sections.  Returns NULL on failure. */
-void *module_alloc(unsigned long size);
-
 /* Determines if the section name is an init section (that is only used during
  * module loading).
  */
@@ -113,12 +109,4 @@ void module_arch_cleanup(struct module *mod);
 /* Any cleanup before freeing mod->module_init */
 void module_arch_freeing_init(struct module *mod);
 
-#if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
-   !defined(CONFIG_KASAN_VMALLOC)
-#include 
-#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
-#else
-#define MODULE_ALIGN PAGE_SIZE
-#endif
-
 #endif
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 51278c571bcb..dfb7fa109f1a 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1600,13 +1600,6 @@ static void free_modinfo(struct module *mod)
}
 }
 
-void * __weak module_alloc(unsigned long size)
-{
-   return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
-   GFP_KERNEL, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
-   NUMA_NO_NODE, __builtin_return_address(0));
-}
-
 bool __weak module_init_section(const char *name)
 {
return strstarts(name, ".init");
diff --git a/mm/jitalloc.c b/mm/jitalloc.c
index 4e10af7803f7..221940e36b46 100644
--- a/mm/jitalloc.c
+++ b/mm/jitalloc.c
@@ -60,20 +60,16 @@ void jit_free(void *buf)
 
 void *jit_text_alloc(size_t len)
 {
-   if (jit_alloc_params.text.start) {
-   unsigned int align = jit_alloc_params.alignment;
-   pgprot_t pgprot = jit_alloc_params.text.pgprot;
-   unsigned long start = jit_alloc_params.text.start;
-   unsigned long end = jit_alloc_params.text.end;
-   unsigned long fallback_start = jit_alloc_params.text.fallback_start;
-   

[PATCH 04/13] mm/jitalloc, arch: convert remaining overrides of module_alloc to jitalloc

2023-06-01 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Extend jitalloc parameters to accommodate more complex overrides of
module_alloc() by architectures.

This includes the specification of a fallback range required by arm,
arm64 and powerpc, and support for allocation of the KASAN shadow
required by arm64, s390 and x86.

The core implementation of jit_alloc() takes care of suppressing warnings
when the initial allocation fails but there is a fallback range defined.
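
The warning suppression mentioned above boils down to something like this
inside the core jit_alloc() (a sketch using the structure's parameter
names, not the literal patch text):

	/* silence the initial attempt if a fallback range exists */
	if (fallback_start)
		gfp_flags |= __GFP_NOWARN;

	p = __vmalloc_node_range(len, align, start, end, gfp_flags,
				 pgprot, vm_flags, NUMA_NO_NODE,
				 __builtin_return_address(0));
	if (!p && fallback_start)
		p = __vmalloc_node_range(len, align, fallback_start,
					 fallback_end, GFP_KERNEL, pgprot,
					 vm_flags, NUMA_NO_NODE,
					 __builtin_return_address(0));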

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/arm/kernel/module.c | 32 ++--
 arch/arm64/kernel/module.c   | 57 
 arch/powerpc/kernel/module.c | 46 +
 arch/s390/kernel/module.c| 31 
 arch/x86/kernel/module.c | 29 +++---
 include/linux/jitalloc.h | 14 +
 mm/jitalloc.c| 44 
 7 files changed, 138 insertions(+), 115 deletions(-)

diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index d59c36dc0494..83ccbf98164f 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -34,23 +35,22 @@
 #endif
 
 #ifdef CONFIG_MMU
-void *module_alloc(unsigned long size)
+static struct jit_alloc_params jit_alloc_params = {
+   .alignment  = 1,
+   .text.start = MODULES_VADDR,
+   .text.end   = MODULES_END,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
 {
-   gfp_t gfp_mask = GFP_KERNEL;
-   void *p;
-
-   /* Silence the initial allocation */
-   if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS))
-   gfp_mask |= __GFP_NOWARN;
-
-   p = __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-   gfp_mask, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
-   __builtin_return_address(0));
-   if (!IS_ENABLED(CONFIG_ARM_MODULE_PLTS) || p)
-   return p;
-   return __vmalloc_node_range(size, 1,  VMALLOC_START, VMALLOC_END,
-   GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
-   __builtin_return_address(0));
+   jit_alloc_params.text.pgprot = PAGE_KERNEL_EXEC;
+
+   if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
+   jit_alloc_params.text.fallback_start = VMALLOC_START;
+   jit_alloc_params.text.fallback_end = VMALLOC_END;
+   }
+
+   return &jit_alloc_params;
 }
 #endif
 
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index 5af4975caeb5..ecf1f4030317 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -17,56 +17,49 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 
-void *module_alloc(unsigned long size)
+static struct jit_alloc_params jit_alloc_params = {
+   .alignment  = MODULE_ALIGN,
+   .flags  = JIT_ALLOC_KASAN_SHADOW,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
 {
u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
-   gfp_t gfp_mask = GFP_KERNEL;
-   void *p;
-
-   /* Silence the initial allocation */
-   if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
-   gfp_mask |= __GFP_NOWARN;
 
if (IS_ENABLED(CONFIG_KASAN_GENERIC) ||
IS_ENABLED(CONFIG_KASAN_SW_TAGS))
/* don't exceed the static module region - see below */
module_alloc_end = MODULES_END;
 
-   p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
-   module_alloc_end, gfp_mask, PAGE_KERNEL, VM_DEFER_KMEMLEAK,
-   NUMA_NO_NODE, __builtin_return_address(0));
+   jit_alloc_params.text.pgprot = PAGE_KERNEL;
+   jit_alloc_params.text.start = module_alloc_base;
+   jit_alloc_params.text.end = module_alloc_end;
 
-   if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
+   /*
+* KASAN without KASAN_VMALLOC can only deal with module
+* allocations being served from the reserved module region,
+* since the remainder of the vmalloc region is already
+* backed by zero shadow pages, and punching holes into it
+* is non-trivial. Since the module region is not randomized
+* when KASAN is enabled without KASAN_VMALLOC, it is even
+* less likely that the module region gets exhausted, so we
+* can simply omit this fallback in that case.
+*/
+   if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
(IS_ENABLED(CONFIG_KASAN_VMALLOC) ||
 (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
- !IS_ENABLED(CONFIG_KASAN_SW_TAGS
-   /*
-* KASAN without KASAN_VMALLOC can only deal with module
-* allocations being served from the reserved module region,
-* since the remainder of the vmalloc region is already
-* 

[PATCH 03/13] mm/jitalloc, arch: convert simple overrides of module_alloc to jitalloc

2023-06-01 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Several architectures override module_alloc() only to define an address
range for code allocations different from the VMALLOC address space.

Provide a generic implementation in jitalloc that uses the parameters
for address space ranges, required alignment and page protections
provided by the architectures.

The architectures must fill the jit_alloc_params structure and implement
jit_alloc_arch_params() that returns a pointer to that structure. This
way the jitalloc initialization won't be called from every architecture,
but rather from a central place, namely the initialization of the core
memory management.
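
That central place is the core MM initialization; the two lines added to
mm/mm_init.c in the diffstat below presumably amount to (sketch):

	void __init mm_core_init(void)
	{
		/* ... existing initialization ... */
		jit_alloc_init();	/* query jit_alloc_arch_params() */
	}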

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/loongarch/kernel/module.c | 14 --
 arch/mips/kernel/module.c  | 16 ---
 arch/nios2/kernel/module.c | 15 ++
 arch/parisc/kernel/module.c| 18 
 arch/riscv/kernel/module.c | 16 +++
 arch/sparc/kernel/module.c | 39 +++---
 include/linux/jitalloc.h   | 31 +
 mm/jitalloc.c  | 51 ++
 mm/mm_init.c   |  2 ++
 9 files changed, 156 insertions(+), 46 deletions(-)

diff --git a/arch/loongarch/kernel/module.c b/arch/loongarch/kernel/module.c
index b8b86088b2dd..1d5e00874ae7 100644
--- a/arch/loongarch/kernel/module.c
+++ b/arch/loongarch/kernel/module.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -469,10 +470,17 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
return 0;
 }
 
-void *module_alloc(unsigned long size)
+static struct jit_alloc_params jit_alloc_params = {
+   .alignment  = 1,
+   .text.pgprot= PAGE_KERNEL,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
 {
-   return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-   GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, __builtin_return_address(0));
+   jit_alloc_params.text.start = MODULES_VADDR;
+   jit_alloc_params.text.end = MODULES_END;
+
+   return &jit_alloc_params;
 }
 
 static void module_init_ftrace_plt(const Elf_Ehdr *hdr,
diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
index 0c936cbf20c5..f762c697ab9c 100644
--- a/arch/mips/kernel/module.c
+++ b/arch/mips/kernel/module.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 extern void jump_label_apply_nops(struct module *mod);
 
@@ -33,11 +34,18 @@ static LIST_HEAD(dbe_list);
 static DEFINE_SPINLOCK(dbe_lock);
 
 #ifdef MODULE_START
-void *module_alloc(unsigned long size)
+
+static struct jit_alloc_params jit_alloc_params = {
+   .alignment  = 1,
+   .text.start = MODULE_START,
+   .text.end   = MODULE_END,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
 {
-   return __vmalloc_node_range(size, 1, MODULE_START, MODULE_END,
-   GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
-   __builtin_return_address(0));
+   jit_alloc_params.text.pgprot = PAGE_KERNEL;
+
+   return &jit_alloc_params;
 }
 #endif
 
diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
index 9c97b7513853..b41d52775ec2 100644
--- a/arch/nios2/kernel/module.c
+++ b/arch/nios2/kernel/module.c
@@ -18,15 +18,20 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
-void *module_alloc(unsigned long size)
+static struct jit_alloc_params jit_alloc_params = {
+   .alignment  = 1,
+   .text.pgprot= PAGE_KERNEL_EXEC,
+   .text.start = MODULES_VADDR,
+   .text.end   = MODULES_END,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
 {
-   return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-   GFP_KERNEL, PAGE_KERNEL_EXEC,
-   VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
-   __builtin_return_address(0));
+   return &jit_alloc_params;
 }
 
 int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,
diff --git a/arch/parisc/kernel/module.c b/arch/parisc/kernel/module.c
index f6e38c4d3904..49fdf741fd24 100644
--- a/arch/parisc/kernel/module.c
+++ b/arch/parisc/kernel/module.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -173,15 +174,20 @@ static inline int reassemble_22(int as22)
((as22 & 0x0003ff) << 3));
 }
 
-void *module_alloc(unsigned long size)
-{
+static struct jit_alloc_params jit_alloc_params = {
+   .alignment  = 1,
/* using RWX means less protection for modules, but it's
 * easier than trying to map the text, data, init_text and
 * init_data correctly */
-   return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
-   GFP_KERNEL,
-   PAGE_KERNEL_RWX, 0, NUMA_NO_NODE,
-   

[PATCH 02/13] mm: introduce jit_text_alloc() and use it instead of module_alloc()

2023-06-01 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

module_alloc() is used everywhere as a means to allocate memory for code.

Besides being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF, to modules
and puts the burden of code allocation on the modules code.

Several architectures override module_alloc() because of various
constraints on where the executable memory can be located, and this
causes additional obstacles for improvements of code allocation.

Start splitting code allocation from modules by introducing
jit_text_alloc() and jit_free() APIs.

Start with making jit_text_alloc() a wrapper for module_alloc() and
jit_free() a replacement of module_memfree() to allow updating all call
sites to use the new APIs.

The name jit_text_alloc() emphasizes that the allocated memory is for
executable code; allocations of the associated data, like the data
sections of a module, will use the jit_data_alloc() interface that will
be added later.
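
At this stage the implementation is deliberately thin; the new
mm/jitalloc.c contains little more than the following (a sketch of the
shape, not the verbatim file):

	void *jit_text_alloc(size_t len)
	{
		return module_alloc(len);
	}

	void jit_free(void *buf)
	{
		/*
		 * This memory may be RO, and freeing RO memory in an
		 * interrupt is not supported by vmalloc.
		 */
		WARN_ON(in_interrupt());
		vfree(buf);
	}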

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/powerpc/kernel/kprobes.c|  4 ++--
 arch/s390/kernel/ftrace.c|  4 ++--
 arch/s390/kernel/kprobes.c   |  4 ++--
 arch/s390/kernel/module.c|  5 +++--
 arch/sparc/net/bpf_jit_comp_32.c |  8 
 arch/x86/kernel/ftrace.c |  6 +++---
 arch/x86/kernel/kprobes/core.c   |  4 ++--
 include/linux/jitalloc.h | 10 ++
 include/linux/moduleloader.h |  3 ---
 kernel/bpf/core.c| 14 +++---
 kernel/kprobes.c |  8 
 kernel/module/Kconfig|  1 +
 kernel/module/main.c | 23 +++
 mm/Kconfig   |  3 +++
 mm/Makefile  |  1 +
 mm/jitalloc.c| 20 
 16 files changed, 71 insertions(+), 47 deletions(-)
 create mode 100644 include/linux/jitalloc.h
 create mode 100644 mm/jitalloc.c

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index b20ee72e873a..e5835b148ec4 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -19,8 +19,8 @@
 #include 
 #include 
 #include 
-#include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -130,7 +130,7 @@ void *alloc_insn_page(void)
 {
void *page;
 
-   page = module_alloc(PAGE_SIZE);
+   page = jit_text_alloc(PAGE_SIZE);
if (!page)
return NULL;
 
diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
index c46381ea04ec..6e50a88b9b5d 100644
--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ -7,13 +7,13 @@
  *   Author(s): Martin Schwidefsky 
  */
 
-#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -220,7 +220,7 @@ static int __init ftrace_plt_init(void)
 {
const char *start, *end;
 
-   ftrace_plt = module_alloc(PAGE_SIZE);
+   ftrace_plt = jit_text_alloc(PAGE_SIZE);
if (!ftrace_plt)
panic("cannot allocate ftrace plt\n");
 
diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c
index d4b863ed0aa7..3804945f212f 100644
--- a/arch/s390/kernel/kprobes.c
+++ b/arch/s390/kernel/kprobes.c
@@ -9,7 +9,6 @@
 
 #define pr_fmt(fmt) "kprobes: " fmt
 
-#include 
 #include 
 #include 
 #include 
@@ -21,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -38,7 +38,7 @@ void *alloc_insn_page(void)
 {
void *page;
 
-   page = module_alloc(PAGE_SIZE);
+   page = jit_text_alloc(PAGE_SIZE);
if (!page)
return NULL;
set_memory_rox((unsigned long)page, 1);
diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
index f1b35dcdf3eb..d4844cfe3d7e 100644
--- a/arch/s390/kernel/module.c
+++ b/arch/s390/kernel/module.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -76,7 +77,7 @@ void *module_alloc(unsigned long size)
 #ifdef CONFIG_FUNCTION_TRACER
 void module_arch_cleanup(struct module *mod)
 {
-   module_memfree(mod->arch.trampolines_start);
+   jit_free(mod->arch.trampolines_start);
 }
 #endif
 
@@ -509,7 +510,7 @@ static int module_alloc_ftrace_hotpatch_trampolines(struct module *me,
 
size = FTRACE_HOTPATCH_TRAMPOLINES_SIZE(s->sh_size);
numpages = DIV_ROUND_UP(size, PAGE_SIZE);
-   start = module_alloc(numpages * PAGE_SIZE);
+   start = jit_text_alloc(numpages * PAGE_SIZE);
if (!start)
return -ENOMEM;
set_memory_rox((unsigned long)start, numpages);
diff --git a/arch/sparc/net/bpf_jit_comp_32.c b/arch/sparc/net/bpf_jit_comp_32.c
index a74e5004c6c8..068be1097d1a 100644
--- a/arch/sparc/net/bpf_jit_comp_32.c
+++ b/arch/sparc/net/bpf_jit_comp_32.c
@@ -1,10 +1,10 @@
 // SPDX-License-Identifier: GPL-2.0
-#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -713,7 +713,7 @@ 

[PATCH 01/13] nios2: define virtual address space for modules

2023-06-01 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

nios2 uses kmalloc() to implement module_alloc() because CALL26/PCREL26
relocations cannot reach all of the vmalloc address space.

Define the module space as the 32MiB below the kernel base and switch
nios2 to use vmalloc for module allocations.
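
The resulting layout, with K = CONFIG_NIOS2_KERNEL_REGION_BASE:

	VMALLOC_START .. K - 32M - 1	vmalloc area
	K - 32M       .. K - 1		modules (within CALL26/PCREL26 reach)
	K             ..		kernel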

Suggested-by: Thomas Gleixner 
Signed-off-by: Mike Rapoport (IBM) 
---
 arch/nios2/include/asm/pgtable.h |  5 -
 arch/nios2/kernel/module.c   | 19 ---
 2 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index 0f5c2564e9f5..0073b289c6a4 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -25,7 +25,10 @@
 #include 
 
 #define VMALLOC_START  CONFIG_NIOS2_KERNEL_MMU_REGION_BASE
-#define VMALLOC_END(CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
+#define VMALLOC_END(CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M - 1)
+
+#define MODULES_VADDR  (CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M)
+#define MODULES_END(CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
 
 struct mm_struct;
 
diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
index 76e0a42d6e36..9c97b7513853 100644
--- a/arch/nios2/kernel/module.c
+++ b/arch/nios2/kernel/module.c
@@ -21,23 +21,12 @@
 
 #include 
 
-/*
- * Modules should NOT be allocated with kmalloc for (obvious) reasons.
- * But we do it for now to avoid relocation issues. CALL26/PCREL26 cannot reach
- * from 0x8000 (vmalloc area) to 0xc (kernel) (kmalloc returns
- * addresses in 0xc000)
- */
 void *module_alloc(unsigned long size)
 {
-   if (size == 0)
-   return NULL;
-   return kmalloc(size, GFP_KERNEL);
-}
-
-/* Free memory returned from module_alloc */
-void module_memfree(void *module_region)
-{
-   kfree(module_region);
+   return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
+   GFP_KERNEL, PAGE_KERNEL_EXEC,
+   VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
+   __builtin_return_address(0));
 }
 
 int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,
-- 
2.35.1



[PATCH 00/13] mm: jit/text allocator

2023-06-01 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Hi,

module_alloc() is used everywhere as a means to allocate memory for code.

Besides being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF, to modules
and puts the burden of code allocation on the modules code.

Several architectures override module_alloc() because of various
constraints on where the executable memory can be located, and this
causes additional obstacles for improvements of code allocation.

This set splits code allocation from modules by introducing
jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
sites of module_alloc() and module_memfree() with the new APIs and
implements core text and related allocation in a central place.

Instead of architecture-specific overrides for module_alloc(), the
architectures that require non-default behaviour for text allocation must
fill the jit_alloc_params structure and implement jit_alloc_arch_params()
that returns a pointer to that structure. If an architecture does not
implement jit_alloc_arch_params(), the defaults compatible with the
current modules::module_alloc() are used.
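
A minimal arch hook then looks like this (a sketch mirroring the simple
conversions in the series; the values are per-architecture):

	static struct jit_alloc_params jit_alloc_params = {
		.alignment	= 1,
		.text.pgprot	= PAGE_KERNEL_EXEC,
		.text.start	= MODULES_VADDR,
		.text.end	= MODULES_END,
	};

	struct jit_alloc_params *jit_alloc_arch_params(void)
	{
		return &jit_alloc_params;
	}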

The new jitalloc infrastructure allows decoupling of kprobes and ftrace
from modules, and most importantly it enables ROX allocations for
executable memory.

A centralized infrastructure for code allocation allows future
optimizations for allocations of executable memory, caching large pages for
better iTLB performance and providing sub-page allocations for users that
only need small jit code snippets.

patches 1-5: split out the code allocation from modules and arch
patch 6: add dedicated API for data allocations with constraints similar to
code allocations
patches 7-9: decouple dynamic ftrace and kprobes from CONFIG_MODULES
patches 10-13: enable ROX allocations for executable memory on x86

Mike Rapoport (IBM) (11):
  nios2: define virtual address space for modules
  mm: introduce jit_text_alloc() and use it instead of module_alloc()
  mm/jitalloc, arch: convert simple overrides of module_alloc to jitalloc
  mm/jitalloc, arch: convert remaining overrides of module_alloc to jitalloc
  module, jitalloc: drop module_alloc
  mm/jitalloc: introduce jit_data_alloc()
  x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
  arch: make jitalloc setup available regardless of CONFIG_MODULES
  kprobes: remove dependency on CONFIG_MODULES
  modules, jitalloc: prepare to allocate executable memory as ROX
  x86/jitalloc: make memory allocated for code ROX

Song Liu (2):
  ftrace: Add swap_func to ftrace_process_locs()
  x86/jitalloc: prepare to allocate executable memory as ROX

 arch/Kconfig |   5 +-
 arch/arm/kernel/module.c |  32 --
 arch/arm/mm/init.c   |  35 ++
 arch/arm64/kernel/module.c   |  47 
 arch/arm64/mm/init.c |  42 +++
 arch/loongarch/kernel/module.c   |   6 -
 arch/loongarch/mm/init.c |  16 +++
 arch/mips/kernel/module.c|   9 --
 arch/mips/mm/init.c  |  19 
 arch/nios2/include/asm/pgtable.h |   5 +-
 arch/nios2/kernel/module.c   |  24 ++--
 arch/parisc/kernel/module.c  |  11 --
 arch/parisc/mm/init.c|  21 +++-
 arch/powerpc/kernel/kprobes.c|   4 +-
 arch/powerpc/kernel/module.c |  37 ---
 arch/powerpc/mm/mem.c|  41 +++
 arch/riscv/kernel/module.c   |  10 --
 arch/riscv/mm/init.c |  18 +++
 arch/s390/kernel/ftrace.c|   4 +-
 arch/s390/kernel/kprobes.c   |   4 +-
 arch/s390/kernel/module.c|  46 +---
 arch/s390/mm/init.c  |  35 ++
 arch/sparc/kernel/module.c   |  34 +-
 arch/sparc/mm/Makefile   |   2 +
 arch/sparc/mm/jitalloc.c |  21 
 arch/sparc/net/bpf_jit_comp_32.c |   8 +-
 arch/x86/Kconfig |   2 +
 arch/x86/kernel/alternative.c|  43 ---
 arch/x86/kernel/ftrace.c |  59 +-
 arch/x86/kernel/kprobes/core.c   |   4 +-
 arch/x86/kernel/module.c |  75 +
 arch/x86/kernel/static_call.c|  10 +-
 arch/x86/kernel/unwind_orc.c |  13 ++-
 arch/x86/mm/init.c   |  52 +
 arch/x86/net/bpf_jit_comp.c  |  22 +++-
 include/linux/ftrace.h   |   2 +
 include/linux/jitalloc.h |  69 
 include/linux/moduleloader.h |  15 ---
 kernel/bpf/core.c|  14 +--
 kernel/kprobes.c |  51 +
 kernel/module/Kconfig|   1 +
 kernel/module/main.c |  56 --
 kernel/trace/ftrace.c|  13 ++-
 kernel/trace/trace_kprobe.c  |  11 ++
 mm/Kconfig   |   3 +
 mm/Makefile  |   1 +
 mm/jitalloc.c| 185 +++
 mm/mm_init.c |   2 +
 48 files changed, 777 insertions(+), 462 deletions(-)
 create mode 100644 arch/sparc/mm/jitalloc.c
 create mode 100644 

Re: [PATCH v2 0/3] Remove iommu_group_remove_device() from fsl

2023-06-01 Thread Joerg Roedel
On Wed, May 31, 2023 at 05:04:04PM +1000, Michael Ellerman wrote:
> Great, yep consider it:
> 
> Tested-by: Michael Ellerman 

Alright, applied them for 6.5.



Re: [PATCH v3 30/34] sh: Convert pte_free_tlb() to use ptdescs

2023-06-01 Thread Vishal Moola
On Thu, Jun 1, 2023 at 12:28 AM John Paul Adrian Glaubitz
 wrote:
>
> Hi Geert!
>
> On Thu, 2023-06-01 at 09:20 +0200, Geert Uytterhoeven wrote:
> > On Wed, May 31, 2023 at 11:33 PM Vishal Moola (Oracle)
> >  wrote:
> > > Part of the conversions to replace pgtable constructor/destructors with
> > > ptdesc equivalents. Also cleans up some spacing issues.
> > >
> > > Signed-off-by: Vishal Moola (Oracle) 
> >
> > LGTM, so
> > Reviewed-by: Geert Uytterhoeven 
>
> I assume this series is supposed to go through some mm tree?

Hi Adrian,
I was going to have Andrew take this through mm-unstable
once it gets enough review.


Re: [PATCH v3 25/34] m68k: Convert various functions to use ptdescs

2023-06-01 Thread Vishal Moola
On Thu, Jun 1, 2023 at 12:40 AM Geert Uytterhoeven  wrote:
>
> Hi Vishal,
>
> On Wed, May 31, 2023 at 11:32 PM Vishal Moola (Oracle)
>  wrote:
> > As part of the conversions to replace pgtable constructor/destructors with
> > ptdesc equivalents, convert various page table functions to use ptdescs.
> >
> > Some of the functions use the *get*page*() helper functions. Convert
> > these to use pagetable_alloc() and ptdesc_address() instead to help
> > standardize page tables further.
> >
> > Signed-off-by: Vishal Moola (Oracle) 
>
> Thanks for your patch!
>
> > --- a/arch/m68k/include/asm/mcf_pgalloc.h
> > +++ b/arch/m68k/include/asm/mcf_pgalloc.h
> > @@ -7,20 +7,19 @@
> >
> >  extern inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
> >  {
> > -   free_page((unsigned long) pte);
> > +   pagetable_free(virt_to_ptdesc(pte));
> >  }
> >
> >  extern const char bad_pmd_string[];
> >
> >  extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
> >  {
> > -   unsigned long page = __get_free_page(GFP_DMA);
> > +   struct ptdesc *ptdesc = pagetable_alloc(GFP_DMA | __GFP_ZERO, 0);
> >
> > -   if (!page)
> > +   if (!ptdesc)
> > return NULL;
> >
> > -   memset((void *)page, 0, PAGE_SIZE);
> > -   return (pte_t *) (page);
> > +   return (pte_t *) (ptdesc_address(ptdesc));
>
> No need to cast "void *" when returning a different pointer type.
>
> >  }
> >
> >  extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, unsigned long address)
> > @@ -35,36 +34,36 @@ extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, 
> > unsigned long address)
> >  static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t 
> > pgtable,
> >   unsigned long address)
> >  {
> > -   struct page *page = virt_to_page(pgtable);
> > +   struct ptdesc *ptdesc = virt_to_ptdesc(pgtable);
> >
> > -   pgtable_pte_page_dtor(page);
> > -   __free_page(page);
> > +   pagetable_pte_dtor(ptdesc);
> > +   pagetable_free(ptdesc);
> >  }
> >
> >  static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
> >  {
> > -   struct page *page = alloc_pages(GFP_DMA, 0);
> > +   struct ptdesc *ptdesc = pagetable_alloc(GFP_DMA, 0);
> > pte_t *pte;
> >
> > -   if (!page)
> > +   if (!ptdesc)
> > return NULL;
> > -   if (!pgtable_pte_page_ctor(page)) {
> > -   __free_page(page);
> > +   if (!pagetable_pte_ctor(ptdesc)) {
> > +   pagetable_free(ptdesc);
> > return NULL;
> > }
> >
> > -   pte = page_address(page);
> > -   clear_page(pte);
> > +   pte = ptdesc_address(ptdesc);
> > +   pagetable_clear(pte);
> >
> > return pte;
> >  }
> >
> >  static inline void pte_free(struct mm_struct *mm, pgtable_t pgtable)
> >  {
> > -   struct page *page = virt_to_page(pgtable);
> > +   struct ptdesc *ptdesc = virt_to_ptdesc(ptdesc);
>
> virt_to_ptdesc(pgtable)
>
> (You can build this using m5475evb_defconfig)
>
> >
> > -   pgtable_pte_page_dtor(page);
> > -   __free_page(page);
> > +   pagetable_pte_dtor(ptdesc);
> > +   pagetable_free(ptdesc);
> >  }
> >
> >  /*
> > @@ -75,16 +74,18 @@ static inline void pte_free(struct mm_struct *mm, 
> > pgtable_t pgtable)
> >
> >  static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
> >  {
> > -   free_page((unsigned long) pgd);
> > +   pagetable_free(virt_to_ptdesc(pgd));
> >  }
> >
> >  static inline pgd_t *pgd_alloc(struct mm_struct *mm)
> >  {
> > pgd_t *new_pgd;
> > +   struct ptdesc *ptdesc = pagetable_alloc(GFP_DMA | GFP_NOWARN, 0);
> >
> > -   new_pgd = (pgd_t *)__get_free_page(GFP_DMA | __GFP_NOWARN);
> > -   if (!new_pgd)
> > +   if (!ptdesc)
> > return NULL;
> > +   new_pgd = (pgd_t *) ptdesc_address(ptdesc);
>
> No need to cast "void *" when assigning to a different pointer type.
>
> > +
> > memcpy(new_pgd, swapper_pg_dir, PTRS_PER_PGD * sizeof(pgd_t));
> > memset(new_pgd, 0, PAGE_OFFSET >> PGDIR_SHIFT);
> > return new_pgd;
>
> The rest LGTM.

Thanks so much for the review! I'll make those changes in the next
version.


> Gr{oetje,eeting}s,
>
> Geert
>
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- 
> ge...@linux-m68k.org
>
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like 
> that.
> -- Linus Torvalds


Re: [PATCH v3 30/34] sh: Convert pte_free_tlb() to use ptdescs

2023-06-01 Thread John Paul Adrian Glaubitz
On Thu, 2023-06-01 at 09:42 +0200, Geert Uytterhoeven wrote:
> Hi Adrian,
> 
> On Thu, Jun 1, 2023 at 9:28 AM John Paul Adrian Glaubitz
>  wrote:
> > On Thu, 2023-06-01 at 09:20 +0200, Geert Uytterhoeven wrote:
> > > On Wed, May 31, 2023 at 11:33 PM Vishal Moola (Oracle)
> > >  wrote:
> > > > Part of the conversions to replace pgtable constructor/destructors with
> > > > ptdesc equivalents. Also cleans up some spacing issues.
> > > > 
> > > > Signed-off-by: Vishal Moola (Oracle) 
> > > 
> > > LGTM, so
> > > Reviewed-by: Geert Uytterhoeven 
> > 
> > I assume this series is supposed to go through some mm tree?
> 
> I think so, so your Acked-by would be appreciated...

OK, I will have a look. Btw, can you have a look at the second series by
Artur ("SuperH DMAC fixes")? I haven't had the time for these yet, but
I will have time in the weekend.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913


Re: [PATCH v3 30/34] sh: Convert pte_free_tlb() to use ptdescs

2023-06-01 Thread Geert Uytterhoeven
Hi Adrian,

On Thu, Jun 1, 2023 at 9:28 AM John Paul Adrian Glaubitz
 wrote:
> On Thu, 2023-06-01 at 09:20 +0200, Geert Uytterhoeven wrote:
> > On Wed, May 31, 2023 at 11:33 PM Vishal Moola (Oracle)
> >  wrote:
> > > Part of the conversions to replace pgtable constructor/destructors with
> > > ptdesc equivalents. Also cleans up some spacing issues.
> > >
> > > Signed-off-by: Vishal Moola (Oracle) 
> >
> > LGTM, so
> > Reviewed-by: Geert Uytterhoeven 
>
> I assume this series is supposed to go through some mm tree?

I think so, so your Acked-by would be appreciated...

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH v3 25/34] m68k: Convert various functions to use ptdescs

2023-06-01 Thread Geert Uytterhoeven
Hi Vishal,

On Wed, May 31, 2023 at 11:32 PM Vishal Moola (Oracle)
 wrote:
> As part of the conversions to replace pgtable constructor/destructors with
> ptdesc equivalents, convert various page table functions to use ptdescs.
>
> Some of the functions use the *get*page*() helper functions. Convert
> these to use pagetable_alloc() and ptdesc_address() instead to help
> standardize page tables further.
>
> Signed-off-by: Vishal Moola (Oracle) 

Thanks for your patch!

> --- a/arch/m68k/include/asm/mcf_pgalloc.h
> +++ b/arch/m68k/include/asm/mcf_pgalloc.h
> @@ -7,20 +7,19 @@
>
>  extern inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
>  {
> -   free_page((unsigned long) pte);
> +   pagetable_free(virt_to_ptdesc(pte));
>  }
>
>  extern const char bad_pmd_string[];
>
>  extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
>  {
> -   unsigned long page = __get_free_page(GFP_DMA);
> +   struct ptdesc *ptdesc = pagetable_alloc(GFP_DMA | __GFP_ZERO, 0);
>
> -   if (!page)
> +   if (!ptdesc)
> return NULL;
>
> -   memset((void *)page, 0, PAGE_SIZE);
> -   return (pte_t *) (page);
> +   return (pte_t *) (ptdesc_address(ptdesc));

No need to cast "void *" when returning a different pointer type.

>  }
>
>  extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, unsigned long address)
> @@ -35,36 +34,36 @@ extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, 
> unsigned long address)
>  static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pgtable,
>   unsigned long address)
>  {
> -   struct page *page = virt_to_page(pgtable);
> +   struct ptdesc *ptdesc = virt_to_ptdesc(pgtable);
>
> -   pgtable_pte_page_dtor(page);
> -   __free_page(page);
> +   pagetable_pte_dtor(ptdesc);
> +   pagetable_free(ptdesc);
>  }
>
>  static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
>  {
> -   struct page *page = alloc_pages(GFP_DMA, 0);
> +   struct ptdesc *ptdesc = pagetable_alloc(GFP_DMA, 0);
> pte_t *pte;
>
> -   if (!page)
> +   if (!ptdesc)
> return NULL;
> -   if (!pgtable_pte_page_ctor(page)) {
> -   __free_page(page);
> +   if (!pagetable_pte_ctor(ptdesc)) {
> +   pagetable_free(ptdesc);
> return NULL;
> }
>
> -   pte = page_address(page);
> -   clear_page(pte);
> +   pte = ptdesc_address(ptdesc);
> +   pagetable_clear(pte);
>
> return pte;
>  }
>
>  static inline void pte_free(struct mm_struct *mm, pgtable_t pgtable)
>  {
> -   struct page *page = virt_to_page(pgtable);
> +   struct ptdesc *ptdesc = virt_to_ptdesc(ptdesc);

virt_to_ptdesc(pgtable)

(You can build this using m5475evb_defconfig)
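
E.g., a sketch of the build commands (assuming an m68k cross-toolchain
prefixed m68k-linux-gnu-):

        make ARCH=m68k CROSS_COMPILE=m68k-linux-gnu- m5475evb_defconfig
        make ARCH=m68k CROSS_COMPILE=m68k-linux-gnu-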

>
> -   pgtable_pte_page_dtor(page);
> -   __free_page(page);
> +   pagetable_pte_dtor(ptdesc);
> +   pagetable_free(ptdesc);
>  }
>
>  /*
> @@ -75,16 +74,18 @@ static inline void pte_free(struct mm_struct *mm, 
> pgtable_t pgtable)
>
>  static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
>  {
> -   free_page((unsigned long) pgd);
> +   pagetable_free(virt_to_ptdesc(pgd));
>  }
>
>  static inline pgd_t *pgd_alloc(struct mm_struct *mm)
>  {
> pgd_t *new_pgd;
> +   struct ptdesc *ptdesc = pagetable_alloc(GFP_DMA | __GFP_NOWARN, 0);
>
> -   new_pgd = (pgd_t *)__get_free_page(GFP_DMA | __GFP_NOWARN);
> -   if (!new_pgd)
> +   if (!ptdesc)
> return NULL;
> +   new_pgd = (pgd_t *) ptdesc_address(ptdesc);

No need to cast "void *" when assigning to a different pointer type.
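
Likewise, a sketch of the uncast assignment:

        new_pgd = ptdesc_address(ptdesc);  /* void * assigns without a cast */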

> +
> memcpy(new_pgd, swapper_pg_dir, PTRS_PER_PGD * sizeof(pgd_t));
> memset(new_pgd, 0, PAGE_OFFSET >> PGDIR_SHIFT);
> return new_pgd;

The rest LGTM.

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH] MAINTAINERS: Exclude m68k-only drivers from powerpc entry

2023-06-01 Thread Michael Ellerman
Finn Thain  writes:
> On Wed, 31 May 2023, Geert Uytterhoeven wrote:
>> On Wed, May 31, 2023 at 2:50 PM Michael Ellerman  wrote:
>> > The powerpc section has a "F:" entry for drivers/macintosh, matching 
>> > all files in or below drivers/macintosh. That is correct for the most 
>> > part, but there are a couple of m68k-only drivers in the directory, so 
>> > exclude those.
>> >
>> > Signed-off-by: Michael Ellerman 
>> 
>> Thanks for your patch!
>> 
>> > --- a/MAINTAINERS
>> > +++ b/MAINTAINERS
>> > @@ -11916,6 +11916,8 @@ L:  linuxppc-dev@lists.ozlabs.org
>> >  S: Odd Fixes
>> >  F: arch/powerpc/platforms/powermac/
>> >  F: drivers/macintosh/
>> > +X: drivers/macintosh/adb-iop.c
>> > +X: drivers/macintosh/via-macii.c
>> >
>> >  LINUX FOR POWERPC (32-BIT AND 64-BIT)
>> >  M: Michael Ellerman 
>> 
>> LGTM, as there are already entries for these two files under
>> "M68K ON APPLE MACINTOSH".
>
> Right. I should have added those "X files" in commit 2ae92e8b9b7.

Or I should have suggested it when I acked that commit :}

cheers


Re: [PATCH] MAINTAINERS: Exclude m68k-only drivers from powerpc entry

2023-06-01 Thread Michael Ellerman
Geert Uytterhoeven  writes:
> Hi Michael,
>
> CC Finn
>
> On Wed, May 31, 2023 at 2:50 PM Michael Ellerman  wrote:
>> The powerpc section has a "F:" entry for drivers/macintosh, matching all
>> files in or below drivers/macintosh. That is correct for the most part,
>> but there are a couple of m68k-only drivers in the directory, so exclude
>> those.
>>
>> Signed-off-by: Michael Ellerman 
>
> Thanks for your patch!
>
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -11916,6 +11916,8 @@ L:  linuxppc-dev@lists.ozlabs.org
>>  S: Odd Fixes
>>  F: arch/powerpc/platforms/powermac/
>>  F: drivers/macintosh/
>> +X: drivers/macintosh/adb-iop.c
>> +X: drivers/macintosh/via-macii.c
>>
>>  LINUX FOR POWERPC (32-BIT AND 64-BIT)
>>  M: Michael Ellerman 
>
> LGTM, as there are already entries for these two files under
> "M68K ON APPLE MACINTOSH".
> Acked-by: Geert Uytterhoeven 

Thanks.

cheers


Re: [PATCH v3 30/34] sh: Convert pte_free_tlb() to use ptdescs

2023-06-01 Thread John Paul Adrian Glaubitz
Hi Geert!

On Thu, 2023-06-01 at 09:20 +0200, Geert Uytterhoeven wrote:
> On Wed, May 31, 2023 at 11:33 PM Vishal Moola (Oracle)
>  wrote:
> > Part of the conversions to replace pgtable constructor/destructors with
> > ptdesc equivalents. Also cleans up some spacing issues.
> > 
> > Signed-off-by: Vishal Moola (Oracle) 
> 
> LGTM, so
> Reviewed-by: Geert Uytterhoeven 

I assume this series is supposed to go through some mm tree?

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913


Re: [PATCH v3 30/34] sh: Convert pte_free_tlb() to use ptdescs

2023-06-01 Thread Geert Uytterhoeven
On Wed, May 31, 2023 at 11:33 PM Vishal Moola (Oracle)
 wrote:
> Part of the conversions to replace pgtable constructor/destructors with
> ptdesc equivalents. Also cleans up some spacing issues.
>
> Signed-off-by: Vishal Moola (Oracle) 

LGTM, so
Reviewed-by: Geert Uytterhoeven 

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH] macintosh: Switch i2c drivers back to use .probe()

2023-06-01 Thread Jean Delvare
On Tue, 23 May 2023 21:50:53 +0200, Uwe Kleine-König wrote:
> After commit b8a1a4cd5a98 ("i2c: Provide a temporary .probe_new()
> call-back type"), all drivers were converted to .probe_new(). Now that
> commit 03c835f498b5 ("i2c: Switch .probe() to not take an id parameter")
> is in, convert them back to (the new) .probe(), so that .probe_new() can
> eventually be dropped from struct i2c_driver.
> 
> Signed-off-by: Uwe Kleine-König 
> ---
> Hello,
> 
> this patch was generated using coccinelle, but I aligned the result to
> the per-file indentation.
> 
> I chose to convert all drivers below drivers/macintosh in a single
> patch, but if you prefer I can split by driver.
> 
> v6.4-rc1 was taken as a base; as there are no commits in next touching
> drivers/macintosh, I don't expect problems when applying this patch. If
> conflicts arise before this is applied, feel free to just drop the
> conflicting files from this patch; I'll take care of the fallout later.
> 
> Also note there is no coordination necessary with the i2c tree. Dropping
> .probe_new() will happen only when all (or most) drivers are converted,
> which will happen after v6.5-rc1 for sure.
> 
> Best regards
> Uwe
> 
>  drivers/macintosh/ams/ams-i2c.c | 2 +-
>  drivers/macintosh/therm_adt746x.c   | 2 +-
>  drivers/macintosh/therm_windtunnel.c| 2 +-
>  drivers/macintosh/windfarm_ad7417_sensor.c  | 2 +-
>  drivers/macintosh/windfarm_fcu_controls.c   | 2 +-
>  drivers/macintosh/windfarm_lm75_sensor.c| 2 +-
>  drivers/macintosh/windfarm_lm87_sensor.c| 2 +-
>  drivers/macintosh/windfarm_max6690_sensor.c | 2 +-
>  drivers/macintosh/windfarm_smu_sat.c| 2 +-
>  9 files changed, 9 insertions(+), 9 deletions(-)
> (...)
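
The change in each driver follows the same pattern, e.g. (a sketch with
a hypothetical "foo" driver; only the callback member is renamed, the
probe function itself keeps its prototype):

         static struct i2c_driver foo_driver = {
                .driver = {
                        .name   = "foo",
                },
        -       .probe_new      = foo_probe,
        +       .probe          = foo_probe,
         };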

Reviewed-by: Jean Delvare 

-- 
Jean Delvare
SUSE L3 Support