Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable

2022-06-28 Thread Joao Martins



On 6/28/22 13:38, Igor Mammedov wrote:
> On Mon, 20 Jun 2022 19:13:46 +0100
> Joao Martins  wrote:
> 
>> On 6/20/22 17:36, Joao Martins wrote:
>>> On 6/20/22 15:27, Igor Mammedov wrote:  
 On Fri, 17 Jun 2022 14:33:02 +0100
 Joao Martins  wrote:  
> On 6/17/22 13:32, Igor Mammedov wrote:  
>> On Fri, 17 Jun 2022 13:18:38 +0100
>> Joao Martins  wrote:
>>> On 6/16/22 15:23, Igor Mammedov wrote:
 On Fri, 20 May 2022 11:45:31 +0100
 Joao Martins  wrote:
> +hwaddr above_4g_mem_start,
> +uint64_t pci_hole64_size)
> +{
> +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> +X86MachineState *x86ms = X86_MACHINE(pcms);
> +MachineState *machine = MACHINE(pcms);
> +ram_addr_t device_mem_size = 0;
> +hwaddr base;
> +
> +if (!x86ms->above_4g_mem_size) {
> +   /*
> +* 32-bit pci hole goes from
> +* end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> +*/
> +return IO_APIC_DEFAULT_ADDRESS - 1;  

 lack of above_4g_mem, doesn't mean absence of device_mem_size or 
 anything else
 that's located above it.
   
>>>
>>> True. But the intent is to fix 32-bit boundaries as one of the qtests 
>>> was failing
>>> otherwise. We won't hit the 1T hole, hence a nop.
>>
>> I don't get the reasoning, can you clarify it pls?
>> 
>
> I was trying to say that what led me here was a couple of qtests 
> failures (from v3->v4).
>
> I was doing this before based on pci_hole64. phys-bits=32 was for example 
> one
> of the test failures, and pci-hole64 sits above what 32-bit can 
> reference.  

 if user sets phys-bits=32, then nothing above 4Gb should work (be usable)
 (including above-4g-ram, hotplug region or pci64 hole or sgx or cxl)

 and this doesn't look to me as AMD specific issue

 perhaps do a phys-bits check as a separate patch
 that will error out if max_used_gpa is above phys-bits limit
 (maybe at machine_done time)
 (i.e. defining max_gpa and checking if compatible with configured cpu
 are 2 different things)

 (it might be possible that tests need to be fixed too to account for it)
  
>>>
>>> My old notes (from v3) tell me with such a check these tests were exiting 
>>> early thanks to
>>> that error:
>>>
>>>  1/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/qom-test   ERROR   0.07s   killed by signal 6 SIGABRT
>>>  4/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp   ERROR   0.07s   killed by signal 6 SIGABRT
>>>  7/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/boot-serial-test   ERROR   0.07s   killed by signal 6 SIGABRT
>>> 44/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-x86-cpuid-compat  ERROR   0.09s   killed by signal 6 SIGABRT
>>> 45/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/numa-test  ERROR   0.17s   killed by signal 6 SIGABRT
>>>
>>> But the real reason these fail is not at all related to CPU phys bits,
>>> but because we just don't handle the case where no pci_hole64 is supposed 
>>> to exist (which
>>> is what that other check is trying to do) e.g. A VM with -m 1G would
>>> observe the same thing i.e. the computations after that conditional are all 
>>> for the pci
>>> hole64, which accounts for SGX/CXL/hotplug etc., which consequently means 
>>> it's *erroneously*
>>> bigger than phys-bits=32 (by definition). So the error_report is just 
>>> telling me that
>>> pc_max_used_gpa() is just incorrect without the !x86ms->above_4g_mem_size 
>>> check.
>>>
>>> If you're not fond of:
>>>
>>> +if (!x86ms->above_4g_mem_size) {
>>> +   /*
>>> +* 32-bit pci hole goes from
>>> +* end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>>> + */
>>> +return IO_APIC_DEFAULT_ADDRESS - 1;
>>> +}
>>>
>>> Then what should I use instead of the above?
>>>
>>> 'IO_APIC_DEFAULT_ADDRESS - 1' is the size of the 32-bit PCI hole, which is
>>> also what is used for i440fx/q35 code. I could move it to a macro (e.g.
>>> PCI_HOST_HOLE32_SIZE) to make it a bit readable and less hardcoded. Or
>>> perhaps your problem is on !x86ms->above_4g_mem_size and maybe I should 
>>> check
>>> in addition for hotplug/CXL/etc existence?
>>>   
>>>  Unless we plan on using
>>> pc_max_used_gpa() for something else other than this.
>>
>> Even if '!above_4g_mem_size', we can still have hotpluggable memory region
>> present and that can hit 1Tb. The same goes for pci64_hole if it's
>> configured large enough on CLI.
>> 
> So hotpluggable memory seems to assume it sits above 4g mem.
>
> pci_hole64 likewise as it uses similar computations as hotplug.

Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable

2022-06-28 Thread Igor Mammedov
On Mon, 20 Jun 2022 19:13:46 +0100
Joao Martins  wrote:

> On 6/20/22 17:36, Joao Martins wrote:
> > On 6/20/22 15:27, Igor Mammedov wrote:  
> >> On Fri, 17 Jun 2022 14:33:02 +0100
> >> Joao Martins  wrote:  
> >>> On 6/17/22 13:32, Igor Mammedov wrote:  
>  On Fri, 17 Jun 2022 13:18:38 +0100
>  Joao Martins  wrote:
> > On 6/16/22 15:23, Igor Mammedov wrote:
> >> On Fri, 20 May 2022 11:45:31 +0100
> >> Joao Martins  wrote:
> >>> +hwaddr above_4g_mem_start,
> >>> +uint64_t pci_hole64_size)
> >>> +{
> >>> +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> >>> +X86MachineState *x86ms = X86_MACHINE(pcms);
> >>> +MachineState *machine = MACHINE(pcms);
> >>> +ram_addr_t device_mem_size = 0;
> >>> +hwaddr base;
> >>> +
> >>> +if (!x86ms->above_4g_mem_size) {
> >>> +   /*
> >>> +* 32-bit pci hole goes from
> >>> +* end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> >>> +*/
> >>> +return IO_APIC_DEFAULT_ADDRESS - 1;  
> >>
> >> lack of above_4g_mem, doesn't mean absence of device_mem_size or 
> >> anything else
> >> that's located above it.
> >>   
> >
> > True. But the intent is to fix 32-bit boundaries as one of the qtests 
> > was failing
> > otherwise. We won't hit the 1T hole, hence a nop.
> 
>  I don't get the reasoning, can you clarify it pls?
>  
> >>>
> >>> I was trying to say that what led me here was a couple of qtests 
> >>> failures (from v3->v4).
> >>>
> >>> I was doing this before based on pci_hole64. phys-bits=32 was for example 
> >>> one
> >>> of the test failures, and pci-hole64 sits above what 32-bit can 
> >>> reference.  
> >>
> >> if user sets phys-bits=32, then nothing above 4Gb should work (be usable)
> >> (including above-4g-ram, hotplug region or pci64 hole or sgx or cxl)
> >>
> >> and this doesn't look to me as AMD specific issue
> >>
> >> perhaps do a phys-bits check as a separate patch
> >> that will error out if max_used_gpa is above phys-bits limit
> >> (maybe at machine_done time)
> >> (i.e. defining max_gpa and checking if compatible with configured cpu
> >> are 2 different things)
> >>
> >> (it might be possible that tests need to be fixed too to account for it)
> >>  
> > 
> > My old notes (from v3) tell me with such a check these tests were exiting 
> > early thanks to
> > that error:
> > 
> >  1/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/qom-test   ERROR   0.07s   killed by signal 6 SIGABRT
> >  4/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp   ERROR   0.07s   killed by signal 6 SIGABRT
> >  7/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/boot-serial-test   ERROR   0.07s   killed by signal 6 SIGABRT
> > 44/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-x86-cpuid-compat  ERROR   0.09s   killed by signal 6 SIGABRT
> > 45/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/numa-test  ERROR   0.17s   killed by signal 6 SIGABRT
> > 
> > But the real reason these fail is not at all related to CPU phys bits,
> > but because we just don't handle the case where no pci_hole64 is supposed 
> > to exist (which
> > is what that other check is trying to do) e.g. A VM with -m 1G would
> > observe the same thing i.e. the computations after that conditional are all 
> > for the pci
> > hole64, which accounts for SGX/CXL/hotplug etc., which consequently means 
> > it's *erroneously*
> > bigger than phys-bits=32 (by definition). So the error_report is just 
> > telling me that
> > pc_max_used_gpa() is just incorrect without the !x86ms->above_4g_mem_size 
> > check.
> > 
> > If you're not fond of:
> > 
> > +if (!x86ms->above_4g_mem_size) {
> > +   /*
> > +* 32-bit pci hole goes from
> > +* end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> > + */
> > +return IO_APIC_DEFAULT_ADDRESS - 1;
> > +}
> > 
> > Then what should I use instead of the above?
> > 
> > 'IO_APIC_DEFAULT_ADDRESS - 1' is the size of the 32-bit PCI hole, which is
> > also what is used for i440fx/q35 code. I could move it to a macro (e.g.
> > PCI_HOST_HOLE32_SIZE) to make it a bit readable and less hardcoded. Or
> > perhaps your problem is on !x86ms->above_4g_mem_size and maybe I should 
> > check
> > in addition for hotplug/CXL/etc existence?
> >   
> >  Unless we plan on using
> > pc_max_used_gpa() for something else other than this.
> 
>  Even if '!above_4g_mem_size', we can still have hotpluggable memory region
>  present and that can hit 1Tb. The same goes for pci64_hole if it's
>  configured large enough on CLI.
>  
> >>> So hotpluggable memory seems to assume it sits above 4g mem.
> >>>
> >>> pci_hole64 likewise as it uses similar computations as hotplug.

Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable

2022-06-20 Thread Joao Martins
On 6/20/22 17:36, Joao Martins wrote:
> On 6/20/22 15:27, Igor Mammedov wrote:
>> On Fri, 17 Jun 2022 14:33:02 +0100
>> Joao Martins  wrote:
>>> On 6/17/22 13:32, Igor Mammedov wrote:
 On Fri, 17 Jun 2022 13:18:38 +0100
 Joao Martins  wrote:  
> On 6/16/22 15:23, Igor Mammedov wrote:  
>> On Fri, 20 May 2022 11:45:31 +0100
>> Joao Martins  wrote:  
>>> +hwaddr above_4g_mem_start,
>>> +uint64_t pci_hole64_size)
>>> +{
>>> +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>> +X86MachineState *x86ms = X86_MACHINE(pcms);
>>> +MachineState *machine = MACHINE(pcms);
>>> +ram_addr_t device_mem_size = 0;
>>> +hwaddr base;
>>> +
>>> +if (!x86ms->above_4g_mem_size) {
>>> +   /*
>>> +* 32-bit pci hole goes from
>>> +* end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>>> +*/
>>> +return IO_APIC_DEFAULT_ADDRESS - 1;
>>
>> lack of above_4g_mem, doesn't mean absence of device_mem_size or 
>> anything else
>> that's located above it.
>> 
>
> True. But the intent is to fix 32-bit boundaries as one of the qtests was 
> failing
> otherwise. We won't hit the 1T hole, hence a nop.  

 I don't get the reasoning, can you clarify it pls?
   
>>>
>>> I was trying to say that what led me here was a couple of qtests failures 
>>> (from v3->v4).
>>>
>>> I was doing this before based on pci_hole64. phys-bits=32 was for example 
>>> one
>>> of the test failures, and pci-hole64 sits above what 32-bit can reference.
>>
>> if user sets phys-bits=32, then nothing above 4Gb should work (be usable)
>> (including above-4g-ram, hotplug region or pci64 hole or sgx or cxl)
>>
>> and this doesn't look to me as AMD specific issue
>>
>> perhaps do a phys-bits check as a separate patch
>> that will error out if max_used_gpa is above phys-bits limit
>> (maybe at machine_done time)
>> (i.e. defining max_gpa and checking if compatible with configured cpu
>> are 2 different things)
>>
>> (it might be possible that tests need to be fixed too to account for it)
>>
> 
> My old notes (from v3) tell me with such a check these tests were exiting 
> early thanks to
> that error:
> 
>  1/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/qom-test   ERROR   0.07s   killed by signal 6 SIGABRT
>  4/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp   ERROR   0.07s   killed by signal 6 SIGABRT
>  7/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/boot-serial-test   ERROR   0.07s   killed by signal 6 SIGABRT
> 44/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-x86-cpuid-compat  ERROR   0.09s   killed by signal 6 SIGABRT
> 45/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/numa-test  ERROR   0.17s   killed by signal 6 SIGABRT
> 
> But the real reason these fail is not at all related to CPU phys bits,
> but because we just don't handle the case where no pci_hole64 is supposed to 
> exist (which
> is what that other check is trying to do) e.g. A VM with -m 1G would
> observe the same thing i.e. the computations after that conditional are all 
> for the pci
> hole64, which accounts for SGX/CXL/hotplug etc., which consequently means 
> it's *erroneously*
> bigger than phys-bits=32 (by definition). So the error_report is just telling 
> me that
> pc_max_used_gpa() is just incorrect without the !x86ms->above_4g_mem_size 
> check.
> 
> If you're not fond of:
> 
> +if (!x86ms->above_4g_mem_size) {
> +   /*
> +* 32-bit pci hole goes from
> +* end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> + */
> +return IO_APIC_DEFAULT_ADDRESS - 1;
> +}
> 
> Then what should I use instead of the above?
> 
> 'IO_APIC_DEFAULT_ADDRESS - 1' is the size of the 32-bit PCI hole, which is
> also what is used for i440fx/q35 code. I could move it to a macro (e.g.
> PCI_HOST_HOLE32_SIZE) to make it a bit readable and less hardcoded. Or
> perhaps your problem is on !x86ms->above_4g_mem_size and maybe I should check
> in addition for hotplug/CXL/etc existence?
> 
>  Unless we plan on using
> pc_max_used_gpa() for something else other than this.  

 Even if '!above_4g_mem_size', we can still have hotpluggable memory region
 present and that can hit 1Tb. The same goes for pci64_hole if it's
 configured large enough on CLI.
   
>>> So hotpluggable memory seems to assume it sits above 4g mem.
>>>
>>> pci_hole64 likewise as it uses similar computations as hotplug.
>>>
>>> Unless I am misunderstanding something here.
>>>
 Looks like guesstimate we could use is taking pci64_hole_end as max used 
 GPA
   
>>> I think this was what I had before (v3[0]) and did not work.
>>
>> that had been tied to host's phys-bits directly, all in one patch
>> and duplicating existing pc_pci_hole64_start().

Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable

2022-06-20 Thread Joao Martins
On 6/20/22 15:27, Igor Mammedov wrote:
> On Fri, 17 Jun 2022 14:33:02 +0100
> Joao Martins  wrote:
>> On 6/17/22 13:32, Igor Mammedov wrote:
>>> On Fri, 17 Jun 2022 13:18:38 +0100
>>> Joao Martins  wrote:  
 On 6/16/22 15:23, Igor Mammedov wrote:  
> On Fri, 20 May 2022 11:45:31 +0100
> Joao Martins  wrote:  
>> +hwaddr above_4g_mem_start,
>> +uint64_t pci_hole64_size)
>> +{
>> +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>> +X86MachineState *x86ms = X86_MACHINE(pcms);
>> +MachineState *machine = MACHINE(pcms);
>> +ram_addr_t device_mem_size = 0;
>> +hwaddr base;
>> +
>> +if (!x86ms->above_4g_mem_size) {
>> +   /*
>> +* 32-bit pci hole goes from
>> +* end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>> +*/
>> +return IO_APIC_DEFAULT_ADDRESS - 1;
>
> lack of above_4g_mem, doesn't mean absence of device_mem_size or anything 
> else
> that's located above it.
> 

 True. But the intent is to fix 32-bit boundaries as one of the qtests was 
 failing
 otherwise. We won't hit the 1T hole, hence a nop.  
>>>
>>> I don't get the reasoning, can you clarify it pls?
>>>   
>>
>> I was trying to say that what led me here was a couple of qtests failures 
>> (from v3->v4).
>>
>> I was doing this before based on pci_hole64. phys-bits=32 was for example one
>> of the test failures, and pci-hole64 sits above what 32-bit can reference.
> 
> if user sets phys-bits=32, then nothing above 4Gb should work (be usable)
> (including above-4g-ram, hotplug region or pci64 hole or sgx or cxl)
> 
> and this doesn't look to me as AMD specific issue
> 
> perhaps do a phys-bits check as a separate patch
> that will error out if max_used_gpa is above phys-bits limit
> (maybe at machine_done time)
> (i.e. defining max_gpa and checking if compatible with configured cpu
> are 2 different things)
> 
> (it might be possible that tests need to be fixed too to account for it)
> 

My old notes (from v3) tell me with such a check these tests were exiting early 
thanks to
that error:

 1/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/qom-test   ERROR   0.07s   killed by signal 6 SIGABRT
 4/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp   ERROR   0.07s   killed by signal 6 SIGABRT
 7/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/boot-serial-test   ERROR   0.07s   killed by signal 6 SIGABRT
44/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-x86-cpuid-compat  ERROR   0.09s   killed by signal 6 SIGABRT
45/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/numa-test  ERROR   0.17s   killed by signal 6 SIGABRT

But the real reason these fail is not at all related to CPU phys bits,
but because we just don't handle the case where no pci_hole64 is supposed to 
exist (which
is what that other check is trying to do) e.g. A VM with -m 1G would
observe the same thing i.e. the computations after that conditional are all for 
the pci
hole64, which accounts for SGX/CXL/hotplug etc., which consequently means it's 
*erroneously*
bigger than phys-bits=32 (by definition). So the error_report is just telling 
me that
pc_max_used_gpa() is just incorrect without the !x86ms->above_4g_mem_size check.

If you're not fond of:

+if (!x86ms->above_4g_mem_size) {
+   /*
+* 32-bit pci hole goes from
+* end-of-low-ram (@below_4g_mem_size) to IOAPIC.
+ */
+return IO_APIC_DEFAULT_ADDRESS - 1;
+}

Then what should I use instead of the above?

'IO_APIC_DEFAULT_ADDRESS - 1' is the size of the 32-bit PCI hole, which is
also what is used for i440fx/q35 code. I could move it to a macro (e.g.
PCI_HOST_HOLE32_SIZE) to make it a bit readable and less hardcoded. Or
perhaps your problem is on !x86ms->above_4g_mem_size and maybe I should check
in addition for hotplug/CXL/etc existence?

  Unless we plan on using
 pc_max_used_gpa() for something else other than this.  
>>>
>>> Even if '!above_4g_mem_size', we can still have hotpluggable memory region
>>> present and that can hit 1Tb. The same goes for pci64_hole if it's
>>> configured large enough on CLI.
>>>   
>> So hotpluggable memory seems to assume it sits above 4g mem.
>>
>> pci_hole64 likewise as it uses similar computations as hotplug.
>>
>> Unless I am misunderstanding something here.
>>
>>> Looks like guesstimate we could use is taking pci64_hole_end as max used GPA
>>>   
>> I think this was what I had before (v3[0]) and did not work.
> 
> that had been tied to host's phys-bits directly, all in one patch
> and duplicating existing pc_pci_hole64_start().
>  

Duplicating was sort of my bad attempt in this patch for pc_max_used_gpa()

I was sort of thinking of something like extracting calls to start + size
"tuple" into functions -- e.g. for hotplug it is pc_get_device_m

Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable

2022-06-20 Thread Igor Mammedov
On Fri, 17 Jun 2022 14:33:02 +0100
Joao Martins  wrote:

> On 6/17/22 13:32, Igor Mammedov wrote:
> > On Fri, 17 Jun 2022 13:18:38 +0100
> > Joao Martins  wrote:  
> >> On 6/16/22 15:23, Igor Mammedov wrote:  
> >>> On Fri, 20 May 2022 11:45:31 +0100
> >>> Joao Martins  wrote:  
>  +hwaddr above_4g_mem_start,
>  +uint64_t pci_hole64_size)
>  +{
>  +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>  +X86MachineState *x86ms = X86_MACHINE(pcms);
>  +MachineState *machine = MACHINE(pcms);
>  +ram_addr_t device_mem_size = 0;
>  +hwaddr base;
>  +
>  +if (!x86ms->above_4g_mem_size) {
>  +   /*
>  +* 32-bit pci hole goes from
>  +* end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>  +*/
>  +return IO_APIC_DEFAULT_ADDRESS - 1;
> >>>
> >>> lack of above_4g_mem, doesn't mean absence of device_mem_size or anything 
> >>> else
> >>> that's located above it.
> >>> 
> >>
> >> True. But the intent is to fix 32-bit boundaries as one of the qtests was 
> >> failing
> >> otherwise. We won't hit the 1T hole, hence a nop.  
> > 
> > I don't get the reasoning, can you clarify it pls?
> >   
> 
> I was trying to say that what led me here was a couple of qtests failures 
> (from v3->v4).
> 
> I was doing this before based on pci_hole64. phys-bits=32 was for example one
> of the test failures, and pci-hole64 sits above what 32-bit can reference.

if user sets phys-bits=32, then nothing above 4Gb should work (be usable)
(including above-4g-ram, hotplug region or pci64 hole or sgx or cxl)

and this doesn't look to me as AMD specific issue

perhaps do a phys-bits check as a separate patch
that will error out if max_used_gpa is above phys-bits limit
(maybe at machine_done time)
(i.e. defining max_gpa and checking if compatible with configured cpu
are 2 different things)

(it might be possible that tests need to be fixed too to account for it)

> >>  Unless we plan on using
> >> pc_max_used_gpa() for something else other than this.  
> > 
> > Even if '!above_4g_mem_size', we can still have hotpluggable memory region
> > present and that can hit 1Tb. The same goes for pci64_hole if it's
> > configured large enough on CLI.
> >   
> So hotpluggable memory seems to assume it sits above 4g mem.
> 
> pci_hole64 likewise as it uses similar computations as hotplug.
> 
> Unless I am misunderstanding something here.
> 
> > Looks like guesstimate we could use is taking pci64_hole_end as max used GPA
> >   
> I think this was what I had before (v3[0]) and did not work.

that had been tied to host's phys-bits directly, all in one patch
and duplicating existing pc_pci_hole64_start().
 
> Let me revisit this edge case again.
> 
> [0] 
> https://lore.kernel.org/all/20220223184455.9057-5-joao.m.mart...@oracle.com/
> 




Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable

2022-06-17 Thread Joao Martins
On 6/17/22 13:18, Joao Martins wrote:
> On 6/16/22 15:23, Igor Mammedov wrote:
>> On Fri, 20 May 2022 11:45:31 +0100
>> Joao Martins  wrote:
>>> +}
>>> +
>>> +if (pcmc->has_reserved_memory &&
>>> +   (machine->ram_size < machine->maxram_size)) {
>>> +device_mem_size = machine->maxram_size - machine->ram_size;
>>> +}
>>> +
>>> +base = ROUND_UP(above_4g_mem_start + x86ms->above_4g_mem_size +
>>> +pcms->sgx_epc.size, 1 * GiB);
>>> +
>>> +return base + device_mem_size + pci_hole64_size;
>>
>> it's not guaranteed that pci64 hole starts right after device_mem,
>> but you are not the 1st doing this assumption in code, maybe instead of
>> all above use existing
>>    pc_pci_hole64_start() + pci_hole64_size
>> to guesstimate max address
>>
> I've switched the block above to that instead.
> 

I had done this, albeit on a second look (and confirmed with testing) this
will crash, since @device_memory isn't yet initialized at that point. And even
without hotplug, CXL might have had issues.

The problem is largely that pc_pci_hole64_start(), which the above check relies
on, uses info we only populate later on in pc_memory_init(), and I don't think
I can move this down to a later point as I definitely don't want to
re-initialize MRs or anything.

So we might be left with manually calculating as I was doing in this patch
but maybe try to arrange some form of new helper that has somewhat shared
logic with pc_pci_hole64_start().

  uint64_t pc_pci_hole64_start(void)
  {
      PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
      MachineState *ms = MACHINE(pcms);
      X86MachineState *x86ms = X86_MACHINE(pcms);
      uint64_t hole64_start = 0;

      if (pcms->cxl_devices_state.host_mr.addr) {
          hole64_start = pcms->cxl_devices_state.host_mr.addr +
              memory_region_size(&pcms->cxl_devices_state.host_mr);
          if (pcms->cxl_devices_state.fixed_windows) {
              GList *it;
              for (it = pcms->cxl_devices_state.fixed_windows; it; it = it->next) {
                  CXLFixedWindow *fw = it->data;
                  hole64_start = fw->mr.addr + memory_region_size(&fw->mr);
              }
          }
*     } else if (pcmc->has_reserved_memory && ms->device_memory->base) {
          hole64_start = ms->device_memory->base;
          if (!pcmc->broken_reserved_end) {
              hole64_start += memory_region_size(&ms->device_memory->mr);
          }
      } else if (pcms->sgx_epc.size != 0) {
          hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
      } else {
          hole64_start = x86ms->above_4g_mem_start +
              x86ms->above_4g_mem_size;
      }




Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable

2022-06-17 Thread Joao Martins
On 6/17/22 13:32, Igor Mammedov wrote:
> On Fri, 17 Jun 2022 13:18:38 +0100
> Joao Martins  wrote:
>> On 6/16/22 15:23, Igor Mammedov wrote:
>>> On Fri, 20 May 2022 11:45:31 +0100
>>> Joao Martins  wrote:
 +hwaddr above_4g_mem_start,
 +uint64_t pci_hole64_size)
 +{
 +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
 +X86MachineState *x86ms = X86_MACHINE(pcms);
 +MachineState *machine = MACHINE(pcms);
 +ram_addr_t device_mem_size = 0;
 +hwaddr base;
 +
 +if (!x86ms->above_4g_mem_size) {
 +   /*
 +* 32-bit pci hole goes from
 +* end-of-low-ram (@below_4g_mem_size) to IOAPIC.
 +*/
 +return IO_APIC_DEFAULT_ADDRESS - 1;  
>>>
>>> lack of above_4g_mem, doesn't mean absence of device_mem_size or anything 
>>> else
>>> that's located above it.
>>>   
>>
>> True. But the intent is to fix 32-bit boundaries as one of the qtests was 
>> failing
>> otherwise. We won't hit the 1T hole, hence a nop.
> 
> I don't get the reasoning, can you clarify it pls?
> 

I was trying to say that what led me here was a couple of qtests failures 
(from v3->v4).

I was doing this before based on pci_hole64. phys-bits=32 was for example one
of the test failures, and pci-hole64 sits above what 32-bit can reference.

>>  Unless we plan on using
>> pc_max_used_gpa() for something else other than this.
> 
> Even if '!above_4g_mem_size', we can still have hotpluggable memory region
> present and that can hit 1Tb. The same goes for pci64_hole if it's configured
> large enough on CLI.
> 
So hotpluggable memory seems to assume it sits above 4g mem.

pci_hole64 likewise as it uses similar computations as hotplug.

Unless I am misunderstanding something here.

> Looks like guesstimate we could use is taking pci64_hole_end as max used GPA
> 
I think this was what I had before (v3[0]) and did not work.

Let me revisit this edge case again.

[0] https://lore.kernel.org/all/20220223184455.9057-5-joao.m.mart...@oracle.com/



Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable

2022-06-17 Thread Igor Mammedov
On Fri, 17 Jun 2022 13:18:38 +0100
Joao Martins  wrote:

> On 6/16/22 15:23, Igor Mammedov wrote:
> > On Fri, 20 May 2022 11:45:31 +0100
> > Joao Martins  wrote:
> >   
> >> It is assumed that the whole GPA space is available to be DMA
> >> addressable, within a given address space limit, expect for a  
> >^^^ typo?
> >   
> Yes, it should have been 'except'.
> 
> >> tiny region before the 4G. Since Linux v5.4, VFIO validates
> >> whether the selected GPA is indeed valid i.e. not reserved by
> >> IOMMU on behalf of some specific devices or platform-defined
> >> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
> >>  -EINVAL.
> >>
> >> AMD systems with an IOMMU are examples of such platforms and
> >> particularly may only have these ranges as allowed:
> >>
> >> 0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> >> 00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> >> 0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
> >>
> >> We already account for the 4G hole, albeit if the guest is big
> >> enough we will fail to allocate a guest with  >1010G due to the
> >> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
> >>
> >> [*] there is another reserved region unrelated to HT that exists
> >> in the 256T boundaru in Fam 17h according to Errata #1286,  
> >   ^ ditto
> >   
> Fixed.
> 
> >> documeted also in "Open-Source Register Reference for AMD Family
> >> 17h Processors (PUB)"
> >>
> >> When creating the region above 4G, take into account that on AMD
> >> platforms the HyperTransport range is reserved and hence it
> >> cannot be used either as GPAs. On those cases rather than
> >> establishing the start of ram-above-4g to be 4G, relocate instead
> >> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
> >> Topology", for more information on the underlying restriction of
> >> IOVAs.
> >>
> >> After accounting for the 1Tb hole on AMD hosts, mtree should
> >> look like:
> >>
> >> 0000000000000000-000000007fffffff (prio 0, i/o):
> >>   alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
> >> 0000010000000000-000001ff7fffffff (prio 0, i/o):
> >>   alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff
> >>
> >> If the relocation is done, we also add the reserved HT
> >> e820 range as reserved.
> >>
> >> Default phys-bits on Qemu is TCG_PHYS_ADDR_BITS (40) which is enough
> >> to address 1Tb (0xff ffff ffff). On AMD platforms, if a
> >> ram-above-4g relocation may be desired and the CPU wasn't configured
> >> with a big enough phys-bits, print an error message to the user
> >> and do not make the relocation of the above-4g-region if phys-bits
> >> is too low.
> >>
> >> Suggested-by: Igor Mammedov 
> >> Signed-off-by: Joao Martins 
> >> ---
> >>  hw/i386/pc.c | 111 +++
> >>  1 file changed, 111 insertions(+)
> >>
> >> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >> index af52d4ff89ef..652ae8ff9ccf 100644
> >> --- a/hw/i386/pc.c
> >> +++ b/hw/i386/pc.c
> >> @@ -796,6 +796,110 @@ void xen_load_linux(PCMachineState *pcms)
> >>  #define PC_ROM_ALIGN   0x800
> >>  #define PC_ROM_SIZE(PC_ROM_MAX - PC_ROM_MIN_VGA)
> >>  
> >> +/*
> >> + * AMD systems with an IOMMU have an additional hole close to the
> >> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> >> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> >> + * Starting Linux v5.4 we validate it, and can't create guests on AMD 
> >> machines
> >> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> >> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> >> + * The ranges reserved for Hyper-Transport are:
> >> + *
> >> + * FD_0000_0000h - FF_FFFF_FFFFh
> >> + *
> >> + * The ranges represent the following:
> >> + *
> >> + * Base Address   Top Address  Use
> >> + *
> >> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> >> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> >> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> >> + * FD_F910_0000h FD_F91F_FFFFh System Management
> >> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> >> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> >> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> >> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> >> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> >> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> >> + *
> >> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> >> + * Table 3: Special Address Controls (GPA) for more information.
> >> + */
> >> +#define AMD_HT_START 0xfd00000000UL
> >> +#define AMD_HT_END   0xffffffffffUL
> >> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> >> +#define AMD_HT_SIZE  (AMD_ABOVE_1TB_START - AMD_HT_START)
> >> +
> >> +static hwaddr x86_max_phys_addr(PCMachineState *pcms,  
> > 
> > s/x86_max_phys

Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable

2022-06-17 Thread Joao Martins



On 6/16/22 15:23, Igor Mammedov wrote:
> On Fri, 20 May 2022 11:45:31 +0100
> Joao Martins  wrote:
> 
>> It is assumed that the whole GPA space is available to be DMA
>> addressable, within a given address space limit, expect for a
>^^^ typo?
> 
Yes, it should have been 'except'.

>> tiny region before the 4G. Since Linux v5.4, VFIO validates
>> whether the selected GPA is indeed valid i.e. not reserved by
>> IOMMU on behalf of some specific devices or platform-defined
>> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
>>  -EINVAL.
>>
>> AMD systems with an IOMMU are examples of such platforms and
>> particularly may only have these ranges as allowed:
>>
>>   0000000000000000 - 00000000fedfffff (0      .. 3.982G)
>>   00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
>>   0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
>>
>> We already account for the 4G hole, albeit if the guest is big
>> enough we will fail to allocate a guest with  >1010G due to the
>> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
>>
>> [*] there is another reserved region unrelated to HT that exists
>> in the 256T boundaru in Fam 17h according to Errata #1286,
>   ^ ditto
> 
Fixed.

>> documeted also in "Open-Source Register Reference for AMD Family
>> 17h Processors (PUB)"
>>
>> When creating the region above 4G, take into account that on AMD
>> platforms the HyperTransport range is reserved and hence it
>> cannot be used either as GPAs. On those cases rather than
>> establishing the start of ram-above-4g to be 4G, relocate instead
>> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
>> Topology", for more information on the underlying restriction of
>> IOVAs.
>>
>> After accounting for the 1Tb hole on AMD hosts, mtree should
>> look like:
>>
>> 0000000000000000-000000007fffffff (prio 0, i/o):
>>   alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
>> 0000010000000000-000001ff7fffffff (prio 0, i/o):
>>  alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff
>>
>> If the relocation is done, we also add the reserved HT
>> e820 range as reserved.
>>
>> Default phys-bits on Qemu is TCG_PHYS_ADDR_BITS (40) which is enough
>> to address 1Tb (0xff ffff ffff). On AMD platforms, if a
>> ram-above-4g relocation may be desired and the CPU wasn't configured
>> with a big enough phys-bits, print an error message to the user
>> and do not make the relocation of the above-4g-region if phys-bits
>> is too low.
>>
>> Suggested-by: Igor Mammedov 
>> Signed-off-by: Joao Martins 
>> ---
>>  hw/i386/pc.c | 111 +++
>>  1 file changed, 111 insertions(+)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index af52d4ff89ef..652ae8ff9ccf 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -796,6 +796,110 @@ void xen_load_linux(PCMachineState *pcms)
>>  #define PC_ROM_ALIGN   0x800
>>  #define PC_ROM_SIZE(PC_ROM_MAX - PC_ROM_MIN_VGA)
>>  
>> +/*
>> + * AMD systems with an IOMMU have an additional hole close to the
>> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
>> + * on kernel version, VFIO may or may not let you DMA map those ranges.
>> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
>> + * with certain memory sizes. It's also wrong to use those IOVA ranges
>> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
>> + * The ranges reserved for Hyper-Transport are:
>> + *
>> + * FD_0000_0000h - FF_FFFF_FFFFh
>> + *
>> + * The ranges represent the following:
>> + *
>> + * Base Address   Top Address  Use
>> + *
>> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
>> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
>> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
>> + * FD_F910_0000h FD_F91F_FFFFh System Management
>> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
>> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
>> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
>> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
>> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
>> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
>> + *
>> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
>> + * Table 3: Special Address Controls (GPA) for more information.
>> + */
>> +#define AMD_HT_START 0xfd00000000UL
>> +#define AMD_HT_END   0xffffffffffUL
>> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
>> +#define AMD_HT_SIZE  (AMD_ABOVE_1TB_START - AMD_HT_START)
>> +
>> +static hwaddr x86_max_phys_addr(PCMachineState *pcms,
> 
> s/x86_max_phys_addr/pc_max_used_gpa/
> 
Fixed.

>> +hwaddr above_4g_mem_start,
>> +uint64_t pci_hole64_size)
>> +{
>> +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>> +X86MachineState *x86ms = X86_MACHINE(pcms);
>> +

Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable

2022-06-16 Thread Igor Mammedov
On Fri, 20 May 2022 11:45:31 +0100
Joao Martins  wrote:

> It is assumed that the whole GPA space is available to be DMA
> addressable, within a given address space limit, expect for a
   ^^^ typo?

> tiny region before the 4G. Since Linux v5.4, VFIO validates
> whether the selected GPA is indeed valid i.e. not reserved by
> IOMMU on behalf of some specific devices or platform-defined
> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
>  -EINVAL.
> 
> AMD systems with an IOMMU are examples of such platforms and
> particularly may only have these ranges as allowed:
> 
>    0000000000000000 - 00000000fedfffff (0  .. 3.982G)
>   00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
>   0000010000000000 - ffffffffffffffff (1Tb.. 16Pb[*])
> 
> We already account for the 4G hole, albeit if the guest is big
> enough we will fail to allocate a guest with  >1010G due to the
> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
> 
> [*] there is another reserved region unrelated to HT that exists
> in the 256T boundaru in Fam 17h according to Errata #1286,
  ^ ditto

> documeted also in "Open-Source Register Reference for AMD Family
> 17h Processors (PUB)"
> 
> When creating the region above 4G, take into account that on AMD
> platforms the HyperTransport range is reserved and hence it
> cannot be used either as GPAs. On those cases rather than
> establishing the start of ram-above-4g to be 4G, relocate instead
> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
> Topology", for more information on the underlying restriction of
> IOVAs.
> 
> After accounting for the 1Tb hole on AMD hosts, mtree should
> look like:
> 
> 0000000000000000-000000007fffffff (prio 0, i/o):
>    alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
> 0000010000000000-000001ff7fffffff (prio 0, i/o):
>   alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff
> 
> If the relocation is done, we also add the reserved HT
> e820 range as reserved.
> 
> Default phys-bits on Qemu is TCG_PHYS_ADDR_BITS (40) which is enough
> to address 1Tb (0xff ffff ffff). On AMD platforms, if a
> ram-above-4g relocation may be desired and the CPU wasn't configured
> with a big enough phys-bits, print an error message to the user
> and do not make the relocation of the above-4g-region if phys-bits
> is too low.
> 
> Suggested-by: Igor Mammedov 
> Signed-off-by: Joao Martins 
> ---
>  hw/i386/pc.c | 111 +++
>  1 file changed, 111 insertions(+)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index af52d4ff89ef..652ae8ff9ccf 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -796,6 +796,110 @@ void xen_load_linux(PCMachineState *pcms)
>  #define PC_ROM_ALIGN   0x800
>  #define PC_ROM_SIZE(PC_ROM_MAX - PC_ROM_MIN_VGA)
>  
> +/*
> + * AMD systems with an IOMMU have an additional hole close to the
> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> + * The ranges reserved for Hyper-Transport are:
> + *
> + * FD_0000_0000h - FF_FFFF_FFFFh
> + *
> + * The ranges represent the following:
> + *
> + * Base Address   Top Address  Use
> + *
> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> + * FD_F910_0000h FD_F91F_FFFFh System Management
> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> + *
> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> + * Table 3: Special Address Controls (GPA) for more information.
> + */
> +#define AMD_HT_START 0xfd00000000UL
> +#define AMD_HT_END   0xffffffffffUL
> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> +#define AMD_HT_SIZE  (AMD_ABOVE_1TB_START - AMD_HT_START)
> +
> +static hwaddr x86_max_phys_addr(PCMachineState *pcms,

s/x86_max_phys_addr/pc_max_used_gpa/

> +hwaddr above_4g_mem_start,
> +uint64_t pci_hole64_size)
> +{
> +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> +X86MachineState *x86ms = X86_MACHINE(pcms);
> +MachineState *machine = MACHINE(pcms);
> +ram_addr_t device_mem_size = 0;
> +hwaddr base;
> +
> +if (!x86ms->above_4g_mem_size) {
> +   /*
> +* 32-bit pci hole goes from
> +   

[PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable

2022-05-20 Thread Joao Martins
It is assumed that the whole GPA space is available to be DMA
addressable, within a given address space limit, expect for a
tiny region before the 4G. Since Linux v5.4, VFIO validates
whether the selected GPA is indeed valid i.e. not reserved by
IOMMU on behalf of some specific devices or platform-defined
restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
 -EINVAL.

AMD systems with an IOMMU are examples of such platforms and
particularly may only have these ranges as allowed:

 0000000000000000 - 00000000fedfffff (0  .. 3.982G)
00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
0000010000000000 - ffffffffffffffff (1Tb.. 16Pb[*])

We already account for the 4G hole, albeit if the guest is big
enough we will fail to allocate a guest with  >1010G due to the
~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).

[*] there is another reserved region unrelated to HT that exists
in the 256T boundaru in Fam 17h according to Errata #1286,
documeted also in "Open-Source Register Reference for AMD Family
17h Processors (PUB)"

When creating the region above 4G, take into account that on AMD
platforms the HyperTransport range is reserved and hence it
cannot be used either as GPAs. On those cases rather than
establishing the start of ram-above-4g to be 4G, relocate instead
to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
Topology", for more information on the underlying restriction of
IOVAs.

After accounting for the 1Tb hole on AMD hosts, mtree should
look like:

0000000000000000-000000007fffffff (prio 0, i/o):
 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
0000010000000000-000001ff7fffffff (prio 0, i/o):
alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff

If the relocation is done, we also add the reserved HT
e820 range as reserved.

Default phys-bits on Qemu is TCG_PHYS_ADDR_BITS (40) which is enough
to address 1Tb (0xff ffff ffff). On AMD platforms, if a
ram-above-4g relocation may be desired and the CPU wasn't configured
with a big enough phys-bits, print an error message to the user
and do not make the relocation of the above-4g-region if phys-bits
is too low.

Suggested-by: Igor Mammedov 
Signed-off-by: Joao Martins 
---
 hw/i386/pc.c | 111 +++
 1 file changed, 111 insertions(+)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index af52d4ff89ef..652ae8ff9ccf 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -796,6 +796,110 @@ void xen_load_linux(PCMachineState *pcms)
 #define PC_ROM_ALIGN   0x800
 #define PC_ROM_SIZE(PC_ROM_MAX - PC_ROM_MIN_VGA)
 
+/*
+ * AMD systems with an IOMMU have an additional hole close to the
+ * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
+ * on kernel version, VFIO may or may not let you DMA map those ranges.
+ * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
+ * with certain memory sizes. It's also wrong to use those IOVA ranges
+ * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
+ * The ranges reserved for Hyper-Transport are:
+ *
+ * FD_0000_0000h - FF_FFFF_FFFFh
+ *
+ * The ranges represent the following:
+ *
+ * Base Address   Top Address  Use
+ *
+ * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
+ * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
+ * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
+ * FD_F910_0000h FD_F91F_FFFFh System Management
+ * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
+ * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
+ * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
+ * FD_FE00_0000h FD_FFFF_FFFFh Configuration
+ * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
+ * FE_2000_0000h FF_FFFF_FFFFh Reserved
+ *
+ * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
+ * Table 3: Special Address Controls (GPA) for more information.
+ */
+#define AMD_HT_START 0xfd00000000UL
+#define AMD_HT_END   0xffffffffffUL
+#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
+#define AMD_HT_SIZE  (AMD_ABOVE_1TB_START - AMD_HT_START)
+
+static hwaddr x86_max_phys_addr(PCMachineState *pcms,
+hwaddr above_4g_mem_start,
+uint64_t pci_hole64_size)
+{
+PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
+X86MachineState *x86ms = X86_MACHINE(pcms);
+MachineState *machine = MACHINE(pcms);
+ram_addr_t device_mem_size = 0;
+hwaddr base;
+
+if (!x86ms->above_4g_mem_size) {
+   /*
+* 32-bit pci hole goes from
+* end-of-low-ram (@below_4g_mem_size) to IOAPIC.
+*/
+return IO_APIC_DEFAULT_ADDRESS - 1;
+}
+
+if (pcmc->has_reserved_memory &&
+   (machine->ram_size < machine->maxram_size)) {
+device_mem_size = machine->maxram_size - machine->ram_size;
+}
+
+base = ROUND_UP(above_4g_mem_start + x86ms->above_4g_mem_size +
+pcms->sgx_epc.size, 1