On 05.06.2025 16:10, Oleksii Kurochko wrote:
> 
> On 6/2/25 1:04 PM, Jan Beulich wrote:
>> On 23.05.2025 11:44, Oleksii Kurochko wrote:
>>> On 5/22/25 6:09 PM, Jan Beulich wrote:
>>>> On 22.05.2025 17:53, Oleksii Kurochko wrote:
>>>>> On 5/20/25 3:37 PM, Jan Beulich wrote:
>>>>>> On 09.05.2025 17:57, Oleksii Kurochko wrote:
>>>>>>> +static struct page_info *p2m_get_clean_page(struct domain *d)
>>>>>>> +{
>>>>>>> +    struct page_info *page;
>>>>>>> +
>>>>>>> +    /*
>>>>>>> +     * As mentioned in the Priviliged Architecture Spec (version 
>>>>>>> 20240411)
>>>>>>> +     * As explained in Section 18.5.1, for the paged virtual-memory 
>>>>>>> schemes
>>>>>>> +     * (Sv32x4, Sv39x4, Sv48x4, and Sv57x4), the root page table is 16 
>>>>>>> KiB
>>>>>>> +     * and must be aligned to a 16-KiB boundary.
>>>>>>> +     */
>>>>>>> +    page = alloc_domheap_pages(NULL, 2, 0);
>>>>>> Shouldn't this allocation come from the domain's P2M pool (which is yet
>>>>>> to be introduced)?
>>>>> First, I will drop p2m_get_clean_page() as it will be used only for p2m 
>>>>> root page
>>>>> table allocation.
>>>>>
>>>>> p2m_init() is called by domain_create() 
>>>>> [->arch_domain_create()->p2m_init()] from create_domUs():
>>>>> [https://gitlab.com/xen-project/xen/-/blob/staging/xen/common/device-tree/dom0less-build.c?ref_type=heads#L984].
>>>>>
>>>>> When p2m_init() is called, p2m pool isn't ready and domain isn't created 
>>>>> yet. Last one
>>>>> is also crucial for usage of p2m pool as p2m pool belongs to domain and 
>>>>> thereby it is
>>>>> using alloc_domheap_page(d, ...) (Not NULL as for allocation of p2m root 
>>>>> table above),
>>>>> so domain should be created first.
>>>> Yet that is part of my point: This allocation should be against the domain,
>>>> so it is properly accounted. What's the problem with allocating the root
>>>> table when the pools is being created / filled?
>>> I can't use pages from pool for root table as they aren't properly aligned.
>>>
>>> At the moment, creation of p2m pool looks like:
>>>    int p2m_set_allocation(struct domain *d, unsigned long pages, bool 
>>> *preempted)
>>>    {
>>>        struct page_info *pg;
>>>
>>>        ASSERT(spin_is_locked(&d->arch.paging.lock));
>>>
>>>        for ( ; ; )
>>>        {
>>>            if ( d->arch.paging.p2m_total_pages < pages )
>>>            {
>>>                /* Need to allocate more memory from domheap */
>>>                pg = alloc_domheap_page(d, MEMF_no_owner);
>>>                if ( pg == NULL )
>>>                {
>>>                    printk(XENLOG_ERR "Failed to allocate P2M pages.\n");
>>>                    return -ENOMEM;
>>>                }
>>>                ACCESS_ONCE(d->arch.paging.p2m_total_pages) =
>>>                    d->arch.paging.p2m_total_pages + 1;
>>>                page_list_add_tail(pg, &d->arch.paging.p2m_freelist);
>>>            }
>>>            ...
>>>        }
>>>
>>>        return 0;
>>>    }
>>> alloc_domheap_page(d, MEMF_no_owner) allocates page table with order 0, so 
>>> 4k-aligned page table.
>>> But if I needed 16k for root table and it should be 16k-aligned then I 
>>> still have to use
>>> alloc_domheap_pages(NULL, 2, 0);
>>>
>>> Or do you mean that I have to something like:
>>>    int p2m_set_allocation(struct domain *d, unsigned long pages, bool 
>>> *preempted)
>>>    {
>>>        struct page_info *pg;
>>>    
>>>        ASSERT(spin_is_locked(&d->arch.paging.lock));
>>>    
>>> +    if ( !d->arch.p2m.root )
>>> +    {
>>> +        unsigned int order = get_order_from_bytes(KB(16));
>>> +        unsigned int nr_pages = _AC(1,U) << order;
>>> +        /*
>>> +        * As mentioned in the Priviliged Architecture Spec (version 
>>> 20240411)
>>> +        * As explained in Section 18.5.1, for the paged virtual-memory 
>>> schemes
>>> +        * (Sv32x4, Sv39x4, Sv48x4, and Sv57x4), the root page table is 16 
>>> KiB
>>> +        * and must be aligned to a 16-KiB boundary.
>>> +        */
>>> +        d->arch.p2m.root = alloc_domheap_pages(d, order, MEMF_no_owner);
>>> +        if ( d->arch.p2m.root == NULL )
>>> +            panic("root page table hasn't been allocated\n");
>>> +
>>> +        clear_and_clean_page(d->arch.p2m.root);
>>> +
>>> +        /* TODO: do I need TLB flush here? */
>>> +
>>> +        ACCESS_ONCE(d->arch.paging.p2m_total_pages) =
>>> +            d->arch.paging.p2m_total_pages + nr_pages;
>>> +    }
>>> +
>>> ...
>>> }
>> Neither. I was thinking of you taking 4 pages off the pool in exchange for 
>> the
>> order-2 allocation. Primarily to get the memory accounting right (or at least
>> closer to reality).
> 
> Do you mean that I have to call 4 times page_list_remove_head(), something 
> like
> that:
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -215,6 +215,44 @@ int p2m_set_allocation(struct domain *d, unsigned long 
> pages, bool *preempted)
>           }
>       }
>   
> +    if ( !d->arch.p2m.root )
> +    {
> +        unsigned int order = get_order_from_bytes(KB(16));
> +        unsigned int nr_pages = _AC(1,U) << order;
> +
> +        if ( ACCESS_ONCE(d->arch.paging.p2m_total_pages) < nr_pages )
> +            panic("Specify more xen,domain-p2m-mem-mb\n");
> +
> +        /*
> +         * As mentioned in the Priviliged Architecture Spec (version 
> 20240411)
> +         * As explained in Section 18.5.1, for the paged virtual-memory 
> schemes
> +         * (Sv32x4, Sv39x4, Sv48x4, and Sv57x4), the root page table is 16 
> KiB
> +         * and must be aligned to a 16-KiB boundary.
> +         */
> +        d->arch.p2m.root = alloc_domheap_pages(NULL, order, 0);

Imo you'd better not use NULL here, but instead pass MEMF_no_owner. See
respective x86 code. I also think you want to do the freeing first, and
only then do this allocation, such that ...

> +        if (  d->arch.p2m.root == NULL )
> +            panic("failed to allocate p2m root page table\n");
> +
> +        for ( unsigned int i = 0; i < nr_pages; i++ )
> +        {
> +            clear_and_clean_page(d->arch.p2m.root + i);
> +
> +            /* Return memory to domheap */
> +            pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
> +            if( pg )
> +            {
> +                ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
> +                free_domheap_page(pg);
> +            }
> +            else
> +            {
> +                printk(XENLOG_ERR
> +                       "Failed to free P2M pages, P2M freelist is empty.\n");
> +                return -ENOMEM;

... this path will not have eaten more memory than was given back.

>>>>>>> +static int p2m_alloc_table(struct domain *d)
>>>>>>> +{
>>>>>>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>>>>>> +
>>>>>>> +    p2m->root = p2m_allocate_root(d);
>>>>>>> +    if ( !p2m->root )
>>>>>>> +        return -ENOMEM;
>>>>>>> +
>>>>>>> +    p2m->hgatp = hgatp_from_page_info(p2m->root);
>>>>>>> +
>>>>>>> +    /*
>>>>>>> +     * Make sure that all TLBs corresponding to the new VMID are 
>>>>>>> flushed
>>>>>>> +     * before using it.
>>>>>>> +     */
>>>>>>> +    p2m_write_lock(p2m);
>>>>>>> +    p2m_force_tlb_flush_sync(p2m);
>>>>>>> +    p2m_write_unlock(p2m);
>>>>>> While Andrew directed you towards a better model in general, it won't be
>>>>>> usable here then, as the guest didn't run on any pCPU(s) yet. Imo you
>>>>>> want to do a single global flush e.g. when VMIDs wrap around. That'll be
>>>>>> fewer global flushes than one per VM creation.
>>>>> I am not sure that I get a phrase 'VMIDs wrap around'.
>>>> You have to allocate them somehow. Typically you'll use the next one 
>>>> available.
>>>> At some point you will need to start over, searching from the beginning. 
>>>> Prior
>>>> to that now allocation of a new one will require any flush, as none of them
>>>> had be in use before (after boot or the last such flush).
>>> Thanks. Now I get your point.
>>>
>>> Won't be better to do TLB flushing during destroying of a domain so then we 
>>> will
>>> be sure that TLBs connected to freed VMID aren't present in TLB anymore?
>> That's an option, but will result in more flushes. Furthermore there may be
>> reasons to change the VMID for a domain while it's running.
>>
>>> IIUC, it will work only if VMID is used, right?
>> Well, anything VMID related is of course only relevant when VMIDs are in use.
>>
>>> In case if VMID isn't used, probably we can drop flushing here and do a 
>>> flush
>>> during booting, right?
>> That'll be too little flushing?
> 
> I meant that instead of having TLB flush in p2m_alloc_table() we could have a 
> one flush
> during booting. And of course, we still should have flush on each context 
> switch.

Yet as said - context switches are likely too frequent for having an
unconditional flush there (if it can be avoided).

>>> Won't be enough to flushing of guest TLB only during context switch?
>> "only" is interesting here. Context switches are a relatively frequent
>> operation, which in addition you want to be fast. If a flush is necessary
>> there for correctness (e.g. when VMIDs aren't in use), you have to do it
>> there. But if you can flush less frequently without violating correctness,
>> you'd almost always want to use such an opportunity.
> 
> Then it is better to introduce VMID now, it seems it's only one place where
> it should be set, when hgatp is initialized.
> 
> Does Xen have some framework to work with VMID?

That's all arch-specific, I think.

>>>>> I am going to implement, p2m_force_tlb_flush_sync() as:
>>>>>     static void p2m_force_tlb_flush_sync(struct p2m_domain *p2m)
>>>>>     {
>>>>>       ...
>>>>>         sbi_remote_hfence_gvma(d->dirty_cpumask, 0, 0);
>>>>>       ...
>>>>>     }
>>>>>
>>>>> With such implementation if the guest didn't run on any pCPU(s) yet
>>>>> then d->dirty_cpumask is empty, then sbi_remote_hfence_gvma() will do 
>>>>> nothing
>>>>> as hmask will be NULL 
>>>>> (https://gitlab.com/xen-project/people/olkur/xen/-/blob/staging/xen/arch/riscv/sbi.c?ref_type=heads#L238).
>>>>> I am not sure that it is a good idea as I can't find a guarantee in the 
>>>>> spec
>>>>> that TLB will be empty during boot time.
>>>> If in doubt, do one global flush while booting.
>>> By booting you mean somewhere in continue_new_vcpu()?
>> I don't particularly mean any specific place. However, continue_new_vcpu()
>> (by its name) isn't involved in bringing up Xen, is it?
>>
> No, it isn't. By booting here I meant a boot of a guest domain, not Xen 
> itself.

Please don't call this "booting", but "starting of a guest" (or "launching" or
some such). When you originally said "booting" I thought RISC-V wouldn't
guarantee clean TLBs when being booted, and hence suggested to cover for this
by doing a single flush during (Xen) boot. Looks like this may not be needed
then, simply because of the misunderstanding.

Jan

Reply via email to