On 8/15/25 2:50 PM, Jan Beulich wrote:
On 15.08.2025 11:52, Oleksii Kurochko wrote:
On 8/5/25 6:04 PM, Jan Beulich wrote:
On 31.07.2025 17:58, Oleksii Kurochko wrote:
+static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
+{
+    write_pte(p, pte);
+    if ( clean_pte )
+        clean_dcache_va_range(p, sizeof(*p));
Not necessarily something for right away, but if multiple adjacent PTEs are
written without releasing the lock, the then-redundant cache flushing
can be a performance issue.
Can't it be resolved on the caller side? Something like:
    p2m_write_pte(p1, pte1, false);
    p2m_write_pte(p2, pte2, false);
    p2m_write_pte(p3, pte3, false);
    p2m_write_pte(p4, pte4, true);
where p1-p4 are adjacent.
No. You wouldn't know whether the last write flushes what the earlier
three have written. There may be a cacheline boundary in between.

Oh, correct. It would be hard to detect, so I agree that this would work
badly...

Plus
I didn't really think of back-to-back writes, but e.g. a loop doing
many of them, where a single wider flush may then be more efficient.

... So IIUC you mean something like:
  for (i = 0; i < nr_entries; i++)
      p2m_write_pte(&pt[i], entries[i], false);  // no flush yet

  clean_dcache_va_range(pt, nr_entries * sizeof(pte_t));
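
Or, wrapped as a helper (a hedged sketch: the p2m_write_ptes() name is made
up, and I'm assuming clean_dcache_va_range() accepts an arbitrary byte
count):

  /*
   * Hypothetical batched variant: write a contiguous run of PTEs and
   * clean the D-cache once over the whole range, so a cacheline
   * boundary between two entries can't be missed.
   */
  static void p2m_write_ptes(pte_t *p, const pte_t *src,
                             unsigned int nr, bool clean_pte)
  {
      unsigned int i;

      for ( i = 0; i < nr; i++ )
          write_pte(&p[i], src[i]);

      if ( clean_pte )
          clean_dcache_va_range(p, nr * sizeof(*p));
  }

A loop over many adjacent entries then pays for a single wider flush
instead of one flush per entry.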

+#define P2M_TABLE_MAP_NONE 0
+#define P2M_TABLE_MAP_NOMEM 1
+#define P2M_TABLE_SUPER_PAGE 2
+#define P2M_TABLE_NORMAL 3
+
+/*
+ * Take the currently mapped table, find the corresponding the entry
+ * corresponding to the GFN, and map the next table, if available.
Nit: Double "corresponding".

+ * The previous table will be unmapped if the next level was mapped
+ * (e.g. P2M_TABLE_NORMAL returned).
+ *
+ * `alloc_tbl` parameter indicates whether intermediate tables should
+ * be allocated when not present.
+ *
+ * Return values:
+ *  P2M_TABLE_MAP_NONE: a table allocation isn't permitted.
+ *  P2M_TABLE_MAP_NOMEM: allocating a new page failed.
+ *  P2M_TABLE_SUPER_PAGE: the next entry points to a superpage.
+ *  P2M_TABLE_NORMAL: the next level or leaf was mapped normally.
+ */
+static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
+                          unsigned int level, pte_t **table,
+                          unsigned int offset)
+{
+    panic("%s: hasn't been implemented yet\n", __func__);
+
+    return P2M_TABLE_MAP_NONE;
+}
+
+/* Free pte sub-tree behind an entry */
+static void p2m_free_subtree(struct p2m_domain *p2m,
+                             pte_t entry, unsigned int level)
+{
+    panic("%s: hasn't been implemented yet\n", __func__);
+}
+
+/*
+ * Insert an entry in the p2m. This should be called with a mapping
+ * equal to a page/superpage.
+ */
+static int p2m_set_entry(struct p2m_domain *p2m,
+                         gfn_t gfn,
+                         unsigned long page_order,
+                         mfn_t mfn,
+                         p2m_type_t t)
+{
+    unsigned int level;
+    unsigned int target = page_order / PAGETABLE_ORDER;
+    pte_t *entry, *table, orig_pte;
+    int rc;
+    /* A mapping is removed if the MFN is invalid. */
+    bool removing_mapping = mfn_eq(mfn, INVALID_MFN);
Comment and code don't fit together. Many MFNs are invalid (any for which
mfn_valid() returns false), yet you only check for INVALID_MFN here.
Probably, it makes sense to add an ASSERT() here for the case when
mfn_valid(mfn) is false but the MFN is not explicitly equal to INVALID_MFN.
This would indicate that someone attempted to perform a mapping with an
incorrect MFN, which, IMO, is entirely wrong.
No, and we've been there before. MMIO can live anywhere, and mappings for
such still will need to be permitted. It is correct to check only for
INVALID_MFN here imo; it's just the comment which also needs to reflect
that.

Got it now. The original comment looked clear to me, but considering what
you wrote, I will update it to:
  A mapping is removed only if the MFN is explicitly passed as INVALID_MFN.
Also, perhaps, it makes sense to add the following:
  Other MFNs that are not valid from mfn_valid()'s point of view (e.g.,
  MMIO) are still allowed.
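
To illustrate the intended semantics (a hedged example: the p2m type names
and the MMIO address are made up for illustration, not taken from this
patch):

  /* Removal: only an explicit INVALID_MFN requests unmapping. */
  p2m_set_entry(p2m, gfn, 0, INVALID_MFN, p2m_invalid);

  /*
   * MMIO: mfn_valid() is false for such an MFN, yet the mapping
   * must still be permitted; it is not treated as a removal.
   */
  p2m_set_entry(p2m, gfn, 0, maddr_to_mfn(0x10000000UL), p2m_mmio_direct);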

Does it make more sense now?


+    /*
+     * If we are here with level > target, we must be at a leaf node,
+     * and we need to break up the superpage.
+     */
+    if ( level > target )
+    {
+        panic("Shattering isn't implemented\n");
+    }
+
+    /*
+     * We should always be there with the correct level because all the
+     * intermediate tables have been installed if necessary.
+     */
+    ASSERT(level == target);
+
+    orig_pte = *entry;
+
+    if ( removing_mapping )
+        p2m_clean_pte(entry, p2m->clean_pte);
+    else
+    {
+        pte_t pte = p2m_pte_from_mfn(mfn, t);
+
+        p2m_write_pte(entry, pte, p2m->clean_pte);
+
+        p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
+                                      gfn_add(gfn, BIT(page_order, UL) - 1));
+        p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, gfn);
+    }
+
+    p2m->need_flush = true;
+
+    /*
+     * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
+     * is not ready for RISC-V support.
+     *
+     * When CONFIG_HAS_PASSTHROUGH=y, iommu_iotlb_flush() should be done
+     * here.
+     */
+#ifdef CONFIG_HAS_PASSTHROUGH
+#   error "add code to flush IOMMU TLB"
+#endif
+
+    rc = 0;
+
+    /*
+     * Free the entry only if the original pte was valid and the base
+     * is different (to avoid freeing when permission is changed).
+     */
+    if ( pte_is_valid(orig_pte) &&
+         !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
I'm puzzled by this 2nd check: A permission change would - I expect - only
occur to a leaf entry. If the new entry is a super-page one and the old
wasn't, don't you still need to free the sub-tree, no matter whether the
MFNs are the same?
I expect the MFNs to differ in this scenario, so the old sub-tree will be freed.
You expecting something isn't a good criterion. If it's possible, even if
unexpected (by you), it needs dealing with correctly.

Based on your example (new entry is super-page and old entry isn't):
For the old mapping (let's say, a 4 KiB leaf), p2m_set_entry() walks all
levels down to L0, so we will have the following MMU page table walk:
    L2 PTE -> L1 PTE (MFN of L0 page table) -> L0 PTE -> RAM

When the new mapping (let's say, a 2 MiB superpage) is requested,
p2m_set_entry() will stop at L1 (the superpage level):
   L2 PTE -> L1 PTE (at this moment, L1 PTE points to L0 page table, which
                     points to RAM)
Then the old L1 PTE will be saved in 'orig_pte', and 'entry' will be written
with the RAM MFN for the 2 MiB mapping. The walk becomes:
    L2 PTE -> L1 PTE -> RAM

Therefore, 'entry' now holds an MFN pointing to RAM (superpage leaf). 'orig_pte'
still holds an MFN pointing to the L0 table (the old sub-tree). Since these MFNs
differ, the code calls p2m_free_subtree(p2m, orig_pte, …) and frees the old L0
sub-tree.
A particular example doesn't help. All possible cases need handling correctly.

For sure, all possible cases need handling correctly, but I don't see any
case except the one you mention below where the MFNs will be the same.


Plus consider the special case of MFN 0: If you clear
an entry using MFN 0, you will find the old and new PTEs both having the
same MFN.
Doesn't this happen only when a mapping removal is explicitly requested?
In the case of a mapping removal, it seems to me it is enough just to
clear the PTE with all zeroes.
Correct. Which means original MFN (PPN) and new MFN (PPN) would match.

Oh, now I see what the issue is here. If MFN 0 was previously mapped and
is then removed, then, since MFN 0 (a zero PPN) is also what a cleared PTE
holds, we won't put the reference on the MFN 0 page (mapped earlier)
because p2m_free_subtree() won't be called.

In this case, the if-condition should be updated to:
  @@ -883,7 +890,8 @@ static int p2m_set_entry(struct p2m_domain *p2m,
        * is different (to avoid freeing when permission is changed).
        */
       if ( pte_is_valid(orig_pte) &&
  -         !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
  +         (!mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) ||
  +          (removing_mapping && mfn_eq(pte_get_mfn(*entry), _mfn(0)))) )

or call p2m_free_subtree() in the remove_mapping handling:
  @@ -850,7 +852,12 @@ static int p2m_set_entry(struct p2m_domain *p2m,
       orig_pte = *entry;

       if ( removing_mapping )
  +    {
  +        if ( mfn_eq(pte_get_mfn(*entry), _mfn(0)) )
  +            p2m_free_subtree(p2m, orig_pte, level);
  +
           p2m_clean_pte(entry, p2m->clean_pte);
  +    }
       else
~ Oleksii
