Re: [PATCH] kfifo: add memory barrier in kfifo to prevent data loss

2019-01-02 Thread xiaoguangrong(Xiao Guangrong)
On 12/12/18 8:50 AM, Kees Cook wrote:
> On Mon, Dec 10, 2018 at 7:41 PM  wrote:
>>
>> From: Yulei Zhang 
>>
>> Earlier this year we spotted two potential issues in the kernel
>> kfifo implementation.
>>
>> One is reported by Xiao Guangrong to linux kernel.
>> https://lkml.org/lkml/2018/5/11/58
>> In the current kfifo implementation there is a missing memory
>> barrier on the read side, so without a proper barrier between
>> reading kfifo->in and fetching the data there is a potential
>> ordering issue.
>>
>> Besides that, there is another potential issue in kfifo;
>> please consider the following case:
>> at the beginning
>> ring->size = 4
>> ring->out = 0
>> ring->in = 4
>>
>>  Consumer                            Producer
>>  ---------------------               ----------------------
>>  index = ring->out; /* index == 0 */
>>  ring->out++; /* ring->out == 1 */
>>  < Re-Order >
>>                                      out = ring->out;
>>                                      if (ring->in - out >= ring->mask)
>>                                              return -EFULL;
>>                                      /* see the ring is not full */
>>                                      index = ring->in & ring->mask;
>>                                      /* index == 0 */
>>                                      ring->data[index] = new_data;
>>                                      ring->in++;
>>
>>  data = ring->data[index];
>>  /* you will find the old data is overwritten by the new_data */
>>
>> In order to avoid the issue:
>> 1) for the consumer, we should read the ring->data[] out before
>> updating ring->out;
>> 2) for the producer, we should read ring->out before updating
>> ring->data[].
>>
>> So in this patch we introduce the following four functions, which
>> are wrapped with the proper memory barriers and used in pairs to
>> make sure the in and out indexes are fetched and updated in order,
>> avoiding data loss:
>>
>> kfifo_read_index_in()
>> kfifo_write_index_in()
>> kfifo_read_index_out()
>> kfifo_write_index_out()
>>
>> Signed-off-by: Yulei Zhang 
>> Signed-off-by: Guangrong Xiao 
> 
> I've added some more people to CC that might want to see this. Thanks
> for sending this!

Hi,

Ping... could anyone have a look? ;)

Thanks!
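For readers following along: below is a minimal sketch of what the paired
helpers listed in the patch description could look like. The helper names come
from the description above; the smp_load_acquire()/smp_store_release() pairing
is only an illustration of the intended ordering, not the actual patch.

static inline unsigned int kfifo_read_index_in(struct __kfifo *fifo)
{
        /* pairs with kfifo_write_index_in(): everything the producer
         * stored into fifo->data[] before publishing 'in' is visible
         * after this load */
        return smp_load_acquire(&fifo->in);
}

static inline void kfifo_write_index_in(struct __kfifo *fifo, unsigned int in)
{
        /* publish the new data before making it visible via 'in' */
        smp_store_release(&fifo->in, in);
}

static inline unsigned int kfifo_read_index_out(struct __kfifo *fifo)
{
        /* pairs with kfifo_write_index_out(): the producer only reuses
         * slots after it has seen the consumer finish reading them */
        return smp_load_acquire(&fifo->out);
}

static inline void kfifo_write_index_out(struct __kfifo *fifo, unsigned int out)
{
        /* finish reading fifo->data[] before freeing the slots via 'out' */
        smp_store_release(&fifo->out, out);
}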



Re: [PATCH] KVM: try __get_user_pages_fast even if not in atomic context

2018-08-06 Thread Xiao Guangrong




On 07/27/2018 11:46 PM, Paolo Bonzini wrote:

We are currently cutting hva_to_pfn_fast short if we do not want an
immediate exit, which is represented by !async && !atomic.  However,
this is unnecessary, and __get_user_pages_fast is *much* faster
because the regular get_user_pages takes pmd_lock/pte_lock.
In fact, when many CPUs take a nested vmexit at the same time
the contention on those locks is visible, and this patch removes
about 25% (compared to 4.18) from vmexit.flat on a 16 vCPU
nested guest.



Nice improvement.

So after this change we will unconditionally try hva_to_pfn_fast();
does it hurt the case where the mappings have not yet been populated
in the host's page tables?

Can we apply this technique to other places that use gup, or even
squash it into get_user_pages()?
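For illustration, the general pattern being discussed (try the lockless fast
path first, fall back to the slow path only when it fails) looks roughly like
the sketch below. This is not the KVM hva_to_pfn code itself; pin_one_page()
is a made-up helper and the gup signatures follow the ~4.18-era API.

#include <linux/mm.h>

/* Sketch: pin one user page, preferring the lockless fast path.
 * __get_user_pages_fast() walks the page tables without taking
 * pmd_lock/pte_lock; get_user_pages_unlocked() is the slow path
 * that may fault the mapping in first. */
static struct page *pin_one_page(unsigned long hva, bool write)
{
        struct page *page;

        if (__get_user_pages_fast(hva, 1, write, &page) == 1)
                return page;    /* fast path hit, no mm locks taken */

        if (get_user_pages_unlocked(hva, 1, &page,
                                    write ? FOLL_WRITE : 0) == 1)
                return page;    /* slow path populated the mapping */

        return NULL;
}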


Re: [PATCH] KVM/MMU: Combine flushing remote tlb in mmu_set_spte()

2018-07-25 Thread Xiao Guangrong




On 07/24/2018 10:35 PM, Paolo Bonzini wrote:

On 24/07/2018 10:17, Tianyu Lan wrote:

mmu_set_spte() flushes remote tlbs for drop_parent_pte/drop_spte()
and set_spte() separately. This may introduce redundant flush. This
patch is to combine these flushes and check flush request after
calling set_spte().

Signed-off-by: Lan Tianyu 


Looks good, but I'd like a second opinion.  Guangrong, Junaid, can you
review this?



It looks good to me.

Reviewed-by: Xiao Guangrong 

BTW, the @intel box is not accessible to me now. ;)


Is read barrier missed in kfifo?

2018-05-11 Thread Xiao Guangrong


Hi,

Currently, there is no read barrier between reading the index
(kfifo.in) and fetching the real data from the fifo.

I am afraid that can cause the kfifo to be observed as not empty
while the data is not actually ready to be read. Right?

Thanks!


Re: [PATCH] KVM: X86: Fix SMRAM accessing even if VM is shutdown

2018-02-10 Thread Xiao Guangrong



On 02/09/2018 08:42 PM, Paolo Bonzini wrote:

On 09/02/2018 04:22, Xiao Guangrong wrote:




That is a good question... :)

This case (where KVM_MEMSLOT_INVALID is set) can easily be constructed;
userspace should avoid it by itself (by not letting a vCPU access a
memslot which is being updated). If it happens, it is an operational
issue rather than an INTERNAL ERROR.

Maybe treating it as an MMIO access and returning to userspace with an
MMIO exit is a better solution...


Yeah, that's what emulation would do (except if it's an instruction
fetch, which will cause emulation to fail).  I think it's a bug in the
non-EPT #PF case that we return with -EFAULT.


Wanpeng, could you please do it? :)


Re: [PATCH] KVM: X86: Fix SMRAM accessing even if VM is shutdown

2018-02-08 Thread Xiao Guangrong



On 02/08/2018 06:31 PM, Paolo Bonzini wrote:

On 08/02/2018 09:57, Xiao Guangrong wrote:

Maybe it should return RET_PF_EMULATE, which would cause an emulation
failure and then an exit with KVM_EXIT_INTERNAL_ERROR.


So the root cause is a running vCPU accessing memory whose memslot
is being updated (i.e., KVM_MEMSLOT_INVALID is set on its memslot).

The normal #PF handler breaks KVM_RUN and returns -EFAULT to userspace;
we'd better make the ept-misconfig handler follow this style as well.


Why return -EFAULT and not attempt emulation (which will fail)?



That is a good question... :)

This case (where KVM_MEMSLOT_INVALID is set) can easily be constructed;
userspace should avoid it by itself (by not letting a vCPU access a
memslot which is being updated). If it happens, it is an operational
issue rather than an INTERNAL ERROR.

Maybe treating it as an MMIO access and returning to userspace with an
MMIO exit is a better solution...
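For reference, "treat it as an MMIO access and return to userspace" would
roughly mean filling the standard kvm_run MMIO exit fields, as in the sketch
below; gpa, len and is_write are placeholders, and whether this is the right
thing to do here is exactly the open question.

/* sketch only: report the access to userspace instead of failing */
vcpu->run->exit_reason    = KVM_EXIT_MMIO;
vcpu->run->mmio.phys_addr = gpa;       /* guest physical address */
vcpu->run->mmio.len       = len;       /* access size in bytes */
vcpu->run->mmio.is_write  = is_write;
return 0;                              /* 0 == exit to userspace */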


Re: [PATCH] KVM: X86: Fix SMRAM accessing even if VM is shutdown

2018-02-08 Thread Xiao Guangrong



On 02/07/2018 10:16 PM, Paolo Bonzini wrote:

On 07/02/2018 07:25, Wanpeng Li wrote:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 786cd00..445e702 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7458,6 +7458,11 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, 
struct kvm_run *kvm_run)
goto out;
}
  
+	if (unlikely(vcpu->run->exit_reason == KVM_EXIT_SHUTDOWN)) {

+   r = -EINVAL;
+   goto out;
+   }
+
if (vcpu->run->kvm_dirty_regs) {
r = sync_regs(vcpu);
if (r != 0)



This most likely breaks triple faults in the usual case where they
should result in resetting the system; the KVM API doesn't say that you
should clear vcpu->run->exit_reason before entering.

What exactly causes the EPT misconfig to reach the WARN?  That is, how
does kvm_mmu_page_fault end up returning a negative errno value?  If I
read the code correctly only tdp_page_fault can do so, so my guess would
be kvm_handle_bad_page:

 if (pfn == KVM_PFN_ERR_RO_FAULT)
 return RET_PF_EMULATE;

 if (pfn == KVM_PFN_ERR_HWPOISON) {
 kvm_send_hwpoison_signal(kvm_vcpu_gfn_to_hva(vcpu, gfn),
current);
 return RET_PF_RETRY;
 }

/* KVM_PFN_ERR_FAULT */
 return -EFAULT;

Maybe it should return RET_PF_EMULATE, which would cause an emulation
failure and then an exit with KVM_EXIT_INTERNAL_ERROR.


So the root cause is a running vCPU accessing memory whose memslot
is being updated (i.e., KVM_MEMSLOT_INVALID is set on its memslot).

The normal #PF handler breaks KVM_RUN and returns -EFAULT to userspace;
we'd better make the ept-misconfig handler follow this style as well.

Actually, the WARN_ON in the ept-misconfig handler is unnecessary, as
kvm_mmu_page_fault() will warn us if it is a real EPT misconfig, so we
can simply return kvm_mmu_page_fault().
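A minimal sketch of that simplification, based on the vmx.c code of that era
(the PFERR_RSVD_MASK error code passed down is an assumption, not taken from
this thread):

static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
{
        gpa_t gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);

        /*
         * No WARN_ON here: kvm_mmu_page_fault() already warns if this
         * really is an EPT misconfiguration, and it copes with the
         * MMIO and KVM_MEMSLOT_INVALID cases itself.
         */
        return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0);
}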



Re: [PATCH v5 0/2] KVM: MMU: fix kvm_is_mmio_pfn()

2017-11-08 Thread Xiao Guangrong



On 11/08/2017 03:56 PM, Haozhong Zhang wrote:

Some reserved pages, such as those from NVDIMM DAX devices, are not
for MMIO, and can be mapped with cached memory type for better
performance. However, the above check misconceives those pages as
MMIO.  Because KVM maps MMIO pages with UC memory type, the
performance of guest accesses to those pages would be harmed.
Therefore, we check the host memory type in addition and only treat
UC/UC- pages as MMIO.



Reviewed-by: Xiao Guangrong <xiaoguangr...@tencent.com>


Re: [PATCH v4 3/3] KVM: MMU: consider host cache mode in MMIO page check

2017-11-07 Thread Xiao Guangrong



On 11/03/2017 05:29 PM, Haozhong Zhang wrote:

On 11/03/17 17:24 +0800, Xiao Guangrong wrote:



On 11/03/2017 05:02 PM, Haozhong Zhang wrote:

On 11/03/17 16:51 +0800, Haozhong Zhang wrote:

On 11/03/17 14:54 +0800, Xiao Guangrong wrote:



On 11/03/2017 01:53 PM, Haozhong Zhang wrote:

Some reserved pages, such as those from NVDIMM DAX devices, are
not for MMIO, and can be mapped with cached memory type for better
performance. However, the above check misconceives those pages as
MMIO.  Because KVM maps MMIO pages with UC memory type, the
performance of guest accesses to those pages would be harmed.
Therefore, we check the host memory type by lookup_memtype() in
addition and only treat UC/UC- pages as MMIO.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
Reported-by: Cuevas Escareno, Ivan D <ivan.d.cuevas.escar...@intel.com>
Reported-by: Kumar, Karthik <karthik.ku...@intel.com>
---
arch/x86/kvm/mmu.c | 19 ++-
1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0b481cc9c725..e9ed0e666a83 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2708,7 +2708,24 @@ static bool mmu_need_write_protect(struct kvm_vcpu 
*vcpu, gfn_t gfn,
static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
{
if (pfn_valid(pfn))
-   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
+   /*
+* Some reserved pages, such as those from
+* NVDIMM DAX devices, are not for MMIO, and
+* can be mapped with cached memory type for
+* better performance. However, the above
+* check misconceives those pages as MMIO.
+* Because KVM maps MMIO pages with UC memory
+* type, the performance of guest accesses to
+* those pages would be harmed. Therefore, we
+* check the host memory type in addition and
+* only treat UC/UC- pages as MMIO.
+*
+* pat_pfn_is_uc() works only when PAT is enabled,
+* so check pat_enabled() as well.
+*/
+   (!pat_enabled() ||
+pat_pfn_is_uc(kvm_pfn_t_to_pfn_t(pfn)));


Can it be compiled if !CONFIG_PAT?


Yes.

What I check via pat_enabled() is not only whether PAT support is
compiled, but also whether PAT is enabled at runtime.



It would be better if we move pat_enabled out of kvm as well,


Surely I can combine them in one function like

bool pat_pfn_is_uc(pfn_t pfn)
{
enum page_cache_mode cm;

if (!pat_enabled())
return false;

cm = lookup_memtype(pfn_t_to_phys(pfn));

return cm == _PAGE_CACHE_MODE_UC || cm == _PAGE_CACHE_MODE_UC_MINUS;
}


In addition, I think it's better to split this function into
pat_pfn_is_uc() and pat_pfn_is_uc_minus() to avoid additional
confusion.


Why not use pat_pfn_is_uc_or_uc_minus(). :)


Just in case places other than KVM do not need both of them.



I think we need not care about the future case too much; it is not ABI
anyway, i.e., it can easily be adjusted if it is really needed
in the future.

But it is not a big deal...
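For concreteness, the split being debated would look roughly like this,
reusing the lookup_memtype()-based logic from Haozhong's combined version
quoted above:

bool pat_pfn_is_uc(pfn_t pfn)
{
        return pat_enabled() &&
               lookup_memtype(pfn_t_to_phys(pfn)) == _PAGE_CACHE_MODE_UC;
}

bool pat_pfn_is_uc_minus(pfn_t pfn)
{
        return pat_enabled() &&
               lookup_memtype(pfn_t_to_phys(pfn)) == _PAGE_CACHE_MODE_UC_MINUS;
}

KVM's check would then become
!pat_enabled() || pat_pfn_is_uc(...) || pat_pfn_is_uc_minus(...),
which is what makes the single combined helper attractive.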




Re: [PATCH v4 3/3] KVM: MMU: consider host cache mode in MMIO page check

2017-11-03 Thread Xiao Guangrong



On 11/03/2017 05:02 PM, Haozhong Zhang wrote:

On 11/03/17 16:51 +0800, Haozhong Zhang wrote:

On 11/03/17 14:54 +0800, Xiao Guangrong wrote:



On 11/03/2017 01:53 PM, Haozhong Zhang wrote:

Some reserved pages, such as those from NVDIMM DAX devices, are
not for MMIO, and can be mapped with cached memory type for better
performance. However, the above check misconceives those pages as
MMIO.  Because KVM maps MMIO pages with UC memory type, the
performance of guest accesses to those pages would be harmed.
Therefore, we check the host memory type by lookup_memtype() in
addition and only treat UC/UC- pages as MMIO.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
Reported-by: Cuevas Escareno, Ivan D <ivan.d.cuevas.escar...@intel.com>
Reported-by: Kumar, Karthik <karthik.ku...@intel.com>
---
   arch/x86/kvm/mmu.c | 19 ++-
   1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0b481cc9c725..e9ed0e666a83 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2708,7 +2708,24 @@ static bool mmu_need_write_protect(struct kvm_vcpu 
*vcpu, gfn_t gfn,
   static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
   {
if (pfn_valid(pfn))
-   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
+   /*
+* Some reserved pages, such as those from
+* NVDIMM DAX devices, are not for MMIO, and
+* can be mapped with cached memory type for
+* better performance. However, the above
+* check misconceives those pages as MMIO.
+* Because KVM maps MMIO pages with UC memory
+* type, the performance of guest accesses to
+* those pages would be harmed. Therefore, we
+* check the host memory type in addition and
+* only treat UC/UC- pages as MMIO.
+*
+* pat_pfn_is_uc() works only when PAT is enabled,
+* so check pat_enabled() as well.
+*/
+   (!pat_enabled() ||
+pat_pfn_is_uc(kvm_pfn_t_to_pfn_t(pfn)));


Can it be compiled if !CONFIG_PAT?


Yes.

What I check via pat_enabled() is not only whether PAT support is
compiled, but also whether PAT is enabled at runtime.



It would be better if we move pat_enabled out of kvm as well,


Surely I can combine them in one function like

bool pat_pfn_is_uc(pfn_t pfn)
{
enum page_cache_mode cm;

if (!pat_enabled())
return false;

cm = lookup_memtype(pfn_t_to_phys(pfn));

return cm == _PAGE_CACHE_MODE_UC || cm == _PAGE_CACHE_MODE_UC_MINUS;
}


In addition, I think it's better to split this function into
pat_pfn_is_uc() and pat_pfn_is_uc_minus() to avoid additional
confusion.


Why not use pat_pfn_is_uc_or_uc_minus(). :)


Re: [PATCH v4 3/3] KVM: MMU: consider host cache mode in MMIO page check

2017-11-03 Thread Xiao Guangrong



On 11/03/2017 04:51 PM, Haozhong Zhang wrote:

On 11/03/17 14:54 +0800, Xiao Guangrong wrote:



On 11/03/2017 01:53 PM, Haozhong Zhang wrote:

Some reserved pages, such as those from NVDIMM DAX devices, are
not for MMIO, and can be mapped with cached memory type for better
performance. However, the above check misconceives those pages as
MMIO.  Because KVM maps MMIO pages with UC memory type, the
performance of guest accesses to those pages would be harmed.
Therefore, we check the host memory type by lookup_memtype() in
addition and only treat UC/UC- pages as MMIO.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
Reported-by: Cuevas Escareno, Ivan D <ivan.d.cuevas.escar...@intel.com>
Reported-by: Kumar, Karthik <karthik.ku...@intel.com>
---
   arch/x86/kvm/mmu.c | 19 ++-
   1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0b481cc9c725..e9ed0e666a83 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2708,7 +2708,24 @@ static bool mmu_need_write_protect(struct kvm_vcpu 
*vcpu, gfn_t gfn,
   static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
   {
if (pfn_valid(pfn))
-   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
+   /*
+* Some reserved pages, such as those from
+* NVDIMM DAX devices, are not for MMIO, and
+* can be mapped with cached memory type for
+* better performance. However, the above
+* check misconceives those pages as MMIO.
+* Because KVM maps MMIO pages with UC memory
+* type, the performance of guest accesses to
+* those pages would be harmed. Therefore, we
+* check the host memory type in addition and
+* only treat UC/UC- pages as MMIO.
+*
+* pat_pfn_is_uc() works only when PAT is enabled,
+* so check pat_enabled() as well.
+*/
+   (!pat_enabled() ||
+pat_pfn_is_uc(kvm_pfn_t_to_pfn_t(pfn)));


Can it be compiled if !CONFIG_PAT?


Yes.

What I check via pat_enabled() is not only whether PAT support is
compiled, but also whether PAT is enabled at runtime.


The issue is that pat_pfn_is_uc() is implemented only if CONFIG_PAT is
enabled, but you use it here unconditionally.

I am not sure gcc is smart enough to omit the call to pat_pfn_is_uc()
completely in that case. If you have really tested building the kernel
and the KVM module with CONFIG_PAT disabled, it is fine.



Re: [PATCH v4 3/3] KVM: MMU: consider host cache mode in MMIO page check

2017-11-03 Thread Xiao Guangrong



On 11/03/2017 01:53 PM, Haozhong Zhang wrote:

Some reserved pages, such as those from NVDIMM DAX devices, are
not for MMIO, and can be mapped with cached memory type for better
performance. However, the above check misconceives those pages as
MMIO.  Because KVM maps MMIO pages with UC memory type, the
performance of guest accesses to those pages would be harmed.
Therefore, we check the host memory type by lookup_memtype() in
addition and only treat UC/UC- pages as MMIO.

Signed-off-by: Haozhong Zhang 
Reported-by: Cuevas Escareno, Ivan D 
Reported-by: Kumar, Karthik 
---
  arch/x86/kvm/mmu.c | 19 ++-
  1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0b481cc9c725..e9ed0e666a83 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2708,7 +2708,24 @@ static bool mmu_need_write_protect(struct kvm_vcpu 
*vcpu, gfn_t gfn,
  static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
  {
if (pfn_valid(pfn))
-   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
+   /*
+* Some reserved pages, such as those from
+* NVDIMM DAX devices, are not for MMIO, and
+* can be mapped with cached memory type for
+* better performance. However, the above
+* check misconceives those pages as MMIO.
+* Because KVM maps MMIO pages with UC memory
+* type, the performance of guest accesses to
+* those pages would be harmed. Therefore, we
+* check the host memory type in addition and
+* only treat UC/UC- pages as MMIO.
+*
+* pat_pfn_is_uc() works only when PAT is enabled,
+* so check pat_enabled() as well.
+*/
+   (!pat_enabled() ||
+pat_pfn_is_uc(kvm_pfn_t_to_pfn_t(pfn)));


Can it be compiled if !CONFIG_PAT?

It would be better if we moved the pat_enabled() check out of KVM as
well; please refer to pgprot_writecombine(), which is implemented in
pat.c and in include/asm-generic/pgtable.h:

#ifndef pgprot_writecombine
#define pgprot_writecombine pgprot_noncached
#endif
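Following that pgprot_writecombine() pattern, the !CONFIG_PAT case could be
covered by a generic fallback along these lines (a sketch only; the header
placement and stub semantics are assumptions):

#ifndef pat_pfn_is_uc
/* without PAT support, never classify the pfn as UC/UC- */
static inline bool pat_pfn_is_uc(pfn_t pfn)
{
        return false;
}
#endif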



Re: [PATCH v2 2/2] KVM: MMU: consider host cache mode in MMIO page check

2017-11-02 Thread Xiao Guangrong



On 10/31/2017 07:48 PM, Haozhong Zhang wrote:

Some reserved pages, such as those from NVDIMM DAX devices, are
not for MMIO, and can be mapped with cached memory type for better
performance. However, the above check misconceives those pages as
MMIO.  Because KVM maps MMIO pages with UC memory type, the
performance of guest accesses to those pages would be harmed.
Therefore, we check the host memory type by lookup_memtype() in
addition and only treat UC/UC- pages as MMIO.

Signed-off-by: Haozhong Zhang 
Reported-by: Cuevas Escareno, Ivan D 
Reported-by: Kumar, Karthik 
---
  arch/x86/kvm/mmu.c | 30 +++---
  1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0b481cc9c725..206828d18857 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2707,10 +2707,34 @@ static bool mmu_need_write_protect(struct kvm_vcpu 
*vcpu, gfn_t gfn,
  
  static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)

  {
-   if (pfn_valid(pfn))
-   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+   bool is_mmio = true;
  
-	return true;

+   if (pfn_valid(pfn)) {
+   is_mmio = !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+
+   /*
+* Some reserved pages, such as those from NVDIMM DAX
+* devices, are not for MMIO, and can be mapped with
+* cached memory type for better performance. However,
+* the above check misconceives those pages as MMIO.
+* Because KVM maps MMIO pages with UC memory type,
+* the performance of guest accesses to those pages
+* would be harmed. Therefore, we check the host
+* memory type by lookup_memtype() in addition and
+* only treat UC/UC- pages as MMIO.
+*
+* lookup_memtype() works only when PAT is enabled, so
+* add pat_enabled() check here.
+*/
+   if (is_mmio && pat_enabled()) {
+   enum page_cache_mode cm = lookup_memtype(PFN_PHYS(pfn));
+
+   is_mmio = (cm == _PAGE_CACHE_MODE_UC ||
+  cm == _PAGE_CACHE_MODE_UC_MINUS);
+   }
+   }


You can move all of this detailed stuff to pat.c and abstract it by
introducing a function, maybe named pat_pfn_is_uc(). I think this is
what Ingo wants.


Re: [PATCH 0/3] KVM: MMU: fix kvm_is_mmio_pfn()

2017-10-31 Thread Xiao Guangrong



On 10/27/2017 10:25 AM, Haozhong Zhang wrote:

[I just copy the commit message from patch 3]

By default, KVM treats a reserved page as for MMIO purpose, and maps
it to guest with UC memory type.  However, some reserved pages are not
for MMIO, such as pages of DAX device (e.g., /dev/daxX.Y). Mapping
them with UC memory type will harm the performance.  In order to
exclude those cases, we check the host cache mode in addition and only
treat UC/UC- pages as MMIO.



I am afraid this is not only a performance issue but also an
architectural bug - it could trigger an MCE, since there is an aliased
memory mapping (a page mapped as both WB and UC).

It may hurt mdev devices as well, since the device memory may be mapped
both in the host and in the VM.


Re: [PATCH v2 0/7] KVM: MMU: fast write protect

2017-07-03 Thread Xiao Guangrong



On 07/03/2017 11:47 PM, Paolo Bonzini wrote:



On 03/07/2017 16:39, Xiao Guangrong wrote:



On 06/20/2017 05:15 PM, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong <xiaoguangr...@tencent.com>

Changelog in v2:
thanks to Paolo's review, this version disables write-protect-all if
PML is supported


Hi Paolo,

Do you have time to have a look at this new version? ;)
Or should I wait until the dirty ring-buffer patchset is merged?


I will look at it soon, but I still plan to merge dirty ring buffer first.

Thanks for your understanding,


Sure, I fully understand; thank you for bearing with my pushing. :)


Re: [PATCH v2 0/7] KVM: MMU: fast write protect

2017-07-03 Thread Xiao Guangrong



On 06/20/2017 05:15 PM, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong <xiaoguangr...@tencent.com>

Changelog in v2:
thanks to Paolo's review, this version disables write-protect-all if
PML is supported


Hi Paolo,

Do you have time to have a look at this new version? ;)
Or should I wait until the dirty ring-buffer patchset is merged?

Thanks!


Re: [PATCH 0/7] KVM: MMU: fast write protect

2017-06-08 Thread Xiao Guangrong



On 05/30/2017 12:48 AM, Paolo Bonzini wrote:



On 23/05/2017 04:23, Xiao Guangrong wrote:


Ping...

Sorry to disturb; just making sure this patchset does not get missed. :)


It won't. :)  I'm going to look at it and the dirty page ring buffer
this week.


Ping.. :)


Re: [Qemu-devel] [PATCH 0/7] KVM: MMU: fast write protect

2017-06-05 Thread Xiao Guangrong



On 06/05/2017 03:36 PM, Jay Zhou wrote:


  /* enable ucontrol for s390 */
  struct kvm_s390_ucas_mapping {
diff --git a/memory.c b/memory.c
index 4c95aaf..b836675 100644
--- a/memory.c
+++ b/memory.c
@@ -809,6 +809,13 @@ static void address_space_update_ioeventfds(AddressSpace 
*as)
  flatview_unref(view);
  }

+static write_protect_all_fn write_func;


I think there should be a declaration in memory.h,

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 7fc3f48..31f3098 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1152,6 +1152,9 @@ void memory_global_dirty_log_start(void);
   */
  void memory_global_dirty_log_stop(void);

+typedef void (*write_protect_all_fn)(bool write);
+void memory_register_write_protect_all(write_protect_all_fn func);
+
  void mtree_info(fprintf_function mon_printf, void *f);



Thanks for your suggestion, Jay!

This code just demonstrates how to enable this feature in QEMU; I will
carefully consider it and merge your suggestion when the formal patch
is posted.
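For completeness, the registration side hinted at by the memory.c hunk above
would presumably be little more than storing the callback; a rough sketch of
the unposted patch (memory_write_protect_all() is a hypothetical caller added
here only to show how the hook would be driven):

/* memory.c (sketch): remember the hook registered by the accelerator */
static write_protect_all_fn write_func;

void memory_register_write_protect_all(write_protect_all_fn func)
{
    write_func = func;
}

/* hypothetical helper: invoked by the dirty-log code to write-protect
 * all guest memory in one go */
static void memory_write_protect_all(bool write)
{
    if (write_func) {
        write_func(write);
    }
}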



Re: [PATCH 0/7] KVM: MMU: fast write protect

2017-05-22 Thread Xiao Guangrong


Ping...

Sorry to disturb; just making sure this patchset does not get missed. :)

On 05/04/2017 03:06 PM, Paolo Bonzini wrote:



On 04/05/2017 05:36, Xiao Guangrong wrote:

Great.

As there is no conflict between these two patchsets, except that dirty
ring pages benefits from write-protect-all, I think they can be
developed and iterated independently, right?


I can certainly start reviewing this one.

Paolo


Or do you prefer to merge dirty ring pages first and then review the
new version of this patchset later?


Re: [PATCH 2/2] KVM: nVMX: fix nEPT handling of guest page table accesses

2017-05-12 Thread Xiao Guangrong


CC Kevin, as I am not sure whether Intel is aware of this issue; it
breaks other hypervisors, e.g., Xen, as well.

On 05/11/2017 07:23 PM, Paolo Bonzini wrote:

The new ept_access_test_paddr_read_only_ad_disabled testcase
caused an infinite stream of EPT violations because KVM did not
find anything bad in the page tables and kept re-executing the
faulting instruction.

This is because the exit qualification said we were reading from
the page tables, but actually writing the cause of the EPT violation
was writing the A/D bits.  This happened even with eptad=0, quite
surprisingly.

Thus, always treat guest page table accesses as read+write operations,
even if the exit qualification says otherwise.  This fixes the
testcase.

Signed-off-by: Paolo Bonzini 
---
  arch/x86/kvm/vmx.c | 36 +++-
  1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c6f4ad44aa95..c868cbdad29a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6209,17 +6209,19 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
u32 error_code;
  
  	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);

+   gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
+   trace_kvm_page_fault(gpa, exit_qualification);
  
-	if (is_guest_mode(vcpu)

-   && !(exit_qualification & EPT_VIOLATION_GVA_TRANSLATED)) {
-   /*
-* Fix up exit_qualification according to whether guest
-* page table accesses are reads or writes.
-*/
-   u64 eptp = nested_ept_get_cr3(vcpu);
-   if (!(eptp & VMX_EPT_AD_ENABLE_BIT))
-   exit_qualification &= ~EPT_VIOLATION_ACC_WRITE;
-   }
+   /*
+* All guest page table accesses are potential writes to A/D bits.
+* but EPT microcode only reports them as such when EPT A/D is
+* enabled.  Tracing ept_access_test_paddr_read_only_ad_disabled (from
+* kvm-unit-tests) with eptad=0 and eptad=1 shows that the processor
+* does not change its behavior when EPTP enables A/D bits; the only
+* difference is in the exit qualification.  So fix this up here.
+*/
+   if (!(exit_qualification & EPT_VIOLATION_GVA_TRANSLATED))
+   exit_qualification |= EPT_VIOLATION_ACC_WRITE;
  
  	/*

 * EPT violation happened while executing iret from NMI,
@@ -6231,9 +6233,6 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
(exit_qualification & INTR_INFO_UNBLOCK_NMI))
vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO, 
GUEST_INTR_STATE_NMI);
  
-	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);

-   trace_kvm_page_fault(gpa, exit_qualification);
-
/* Is it a read fault? */
error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
 ? PFERR_USER_MASK : 0;
@@ -6250,6 +6249,17 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
  ? PFERR_PRESENT_MASK : 0;
  
  	vcpu->arch.gpa_available = true;

+
+   if (is_guest_mode(vcpu)
+   && !(exit_qualification & EPT_VIOLATION_GVA_TRANSLATED)) {
+   /*
+* Now fix up exit_qualification according to what the
+* L1 hypervisor expects to see.
+*/
+   u64 eptp = nested_ept_get_cr3(vcpu);
+   if (!(eptp & VMX_EPT_AD_ENABLE_BIT))
+   exit_qualification &= ~EPT_VIOLATION_ACC_WRITE;
+   }


I am not sure this is really needed; it (PFEC.W = 0 even though the A/D
bits need to be set on the paging structures) is not what we expect.

Maybe always reporting the right behavior is better? Especially since
Intel may fix its microcode, as this hurts the newest CPUs as well.

Thanks!


Re: [PATCH 1/2] KVM: nVMX: fix EPT permissions as reported in exit qualification

2017-05-11 Thread Xiao Guangrong



On 05/12/2017 11:59 AM, Xiao Guangrong wrote:
error:

@@ -452,7 +459,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker 
*walker,
   */
  if (!(errcode & PFERR_RSVD_MASK)) {
  vcpu->arch.exit_qualification &= 0x187;
-vcpu->arch.exit_qualification |= ((pt_access & pte) & 0x7) << 3;

   ^ here, the original code is buggy, as pt_access and pte have
   different bit order; fortunately, this patch fixes it too. :)



That is exactly what this patch aims at; I do not know what I was
thinking when I wrote that down. :(


Re: [PATCH 1/2] KVM: nVMX: fix EPT permissions as reported in exit qualification

2017-05-11 Thread Xiao Guangrong
ic int FNAME(walk_addr_generic)(struct guest_walker 
*walker,
goto retry_walk;
}
  
-	walker->pt_access = pt_access;

-   walker->pte_access = pte_access;
pgprintk("%s: pte %llx pte_access %x pt_access %x\n",
-__func__, (u64)pte, pte_access, pt_access);
+__func__, (u64)pte, walker->pte_access, walker->pt_access);
return 1;
  
  error:

@@ -452,7 +459,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker 
*walker,
 */
if (!(errcode & PFERR_RSVD_MASK)) {
vcpu->arch.exit_qualification &= 0x187;
-   vcpu->arch.exit_qualification |= ((pt_access & pte) & 0x7) << 3;

   ^ here, the original code is buggy, as pt_access and pte have
   different bit order; fortunately, this patch fixes it too. :)

Otherwise it looks good to me, thanks for your fix.

Reviewed-by: Xiao Guangrong <xiaoguangr...@tencent.com>


Re: [PATCH 0/7] KVM: MMU: fast write protect

2017-05-03 Thread Xiao Guangrong



On 05/03/2017 10:57 PM, Paolo Bonzini wrote:



On 03/05/2017 16:50, Xiao Guangrong wrote:

Furthermore, userspace has no knowledge about whether PML is enabled (it
can be queried from sysfs, but that is not a good way in QEMU), so it is
difficult for userspace to know when to use write-protect-all.
Maybe we can make KVM_CAP_X86_WRITE_PROTECT_ALL_MEM return false if
PML is enabled?


Yes, that's a good idea.  Though it's a pity that, with PML, setting the
dirty bit will still do the massive walk of the rmap.  At least with
reset_dirty_pages it's done a little bit at a time.


Also, I wonder how the alternative write protection mechanism would
affect performance of the dirty page ring buffer patches.  You would do
the write protection of all memory at the end of
kvm_vm_ioctl_reset_dirty_pages.  You wouldn't even need a separate
ioctl, which is nice.  On the other hand, checkpoints would be more
frequent and most pages would be write-protected, so it would be more
expensive to rebuild the shadow page tables...


Yup, write-protect-all can indeed improve reset_dirty_pages; I will
apply your idea after reset_dirty_pages is merged.

However, we still prefer to have a separate ioctl for write-protect-all
which cooperates with KVM_GET_DIRTY_LOG to improve live migration, which
should not always depend on checkpointing.


Ok, I plan to merge the dirty ring pages early in 4.13 development.


Great.

As there is no conflict between these two patchsets, except that dirty
ring pages benefits from write-protect-all, I think they
can be developed and iterated independently, right?

Or do you prefer to merge dirty ring pages first and then review the
new version of this patchset later?

Thanks!



Re: [PATCH 0/7] KVM: MMU: fast write protect

2017-05-03 Thread Xiao Guangrong



On 05/03/2017 08:28 PM, Paolo Bonzini wrote:

So if I understand correctly this relies on userspace doing:

1) KVM_GET_DIRTY_LOG without write protect
2) KVM_WRITE_PROTECT_ALL_MEM


Writes may happen between 1 and 2; they are not represented in the live
dirty bitmap but it's okay because they are in the snapshot and will
only be used after 2.  This is similar to what the dirty page ring
buffer patches do; in fact, the KVM_WRITE_PROTECT_ALL_MEM ioctl is very
similar to KVM_RESET_DIRTY_PAGES in those patches.



You are right. After 1) and 2), any page which has been modified is either
in the bitmap returned to userspace or in the bitmap of the memslot, i.e.,
no dirty page is lost.
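
For illustration, the userspace side of that two-step flow would look roughly
like the sketch below; KVM_WRITE_PROTECT_ALL_MEM and
KVM_DIRTY_LOG_WITHOUT_WRITE_PROTECT are the names proposed by this series, and
vm_fd, slot_id, bitmap and send_dirty_pages() are just placeholders for the
usual migration plumbing:

	struct kvm_dirty_log d = { .slot = slot_id };

	d.dirty_bitmap = bitmap;
	/* 1) fetch the dirty bitmap without write protecting anything */
	d.flags = KVM_DIRTY_LOG_WITHOUT_WRITE_PROTECT;
	ioctl(vm_fd, KVM_GET_DIRTY_LOG, &d);
	/* 2) write protect all memory in one go, out of the per-spte rmap walk */
	ioctl(vm_fd, KVM_WRITE_PROTECT_ALL_MEM, 0);
	/* pages dirtied between 1) and 2) simply show up in the next round */
	send_dirty_pages(bitmap);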


On 03/05/2017 12:52, guangrong.x...@gmail.com wrote:

Comparing with the ordinary algorithm, which
write protects last-level sptes based on the rmap one by one,
it simply updates the generation number to ask all vCPUs
to reload their root page tables; in particular, this can be done out
of mmu-lock, so it does not hurt the vMMU's parallelism.


This is clever.

For processors that have PML, write protecting is only done on large
pages and only for splitting purposes; not for dirty page tracking
process at 4k granularity.  In this case, I think that you should do
nothing in the new write-protect-all ioctl?


Good point, thanks for pointing it out.
Doing nothing in write-protect-all() is not acceptable as it breaks
its semantics. :(

Furthermore, userspace has no knowledge about whether PML is enabled (it
can be queried from sysfs, but that is not a good way in QEMU), so it is
difficult for userspace to know when to use write-protect-all.
Maybe we can make KVM_CAP_X86_WRITE_PROTECT_ALL_MEM return false if
PML is enabled?



Also, I wonder how the alternative write protection mechanism would
affect performance of the dirty page ring buffer patches.  You would do
the write protection of all memory at the end of
kvm_vm_ioctl_reset_dirty_pages.  You wouldn't even need a separate
ioctl, which is nice.  On the other hand, checkpoints would be more
frequent and most pages would be write-protected, so it would be more
expensive to rebuild the shadow page tables...


Yup, write-protect-all can indeed improve reset_dirty_pages; I will
apply your idea after reset_dirty_pages is merged.

However, we still prefer to have a separate ioctl for write-protect-all
which cooperates with KVM_GET_DIRTY_LOG to improve live migration, which
should not always depend on checkpointing.



Thanks,

Paolo


@@ -490,6 +511,7 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener 
*kml,
  memset(d.dirty_bitmap, 0, allocated_size);
  
  d.slot = mem->slot | (kml->as_id << 16);

+d.flags = kvm_write_protect_all ? KVM_DIRTY_LOG_WITHOUT_WRITE_PROTECT : 0;
  if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) {
  DPRINTF("ioctl failed %d\n", errno);
  ret = -1;


How would this work when kvm_physical_sync_dirty_bitmap is called from
memory_region_sync_dirty_bitmap rather than
memory_region_global_dirty_log_sync?


You are right, we did not consider all the cases carefully; we will fix it
when we push it to QEMU formally.

Thank you, Paolo!



Re: [PATCH] x86, kvm: Handle PFNs outside of kernel reach when touching GPTEs

2017-04-17 Thread Xiao Guangrong



On 04/12/2017 09:16 PM, Sironi, Filippo wrote:

Thanks for taking the time and sorry for the delay.


On 6. Apr 2017, at 16:22, Radim Krčmář  wrote:

2017-04-05 15:07+0200, Filippo Sironi:

cmpxchg_gpte() calls get_user_pages_fast() to retrieve the number of
pages and the respective struct pages for mapping in the kernel virtual
address space.
This doesn't work if get_user_pages_fast() is invoked with a userspace
virtual address that's backed by PFNs outside of kernel reach (e.g.,
when limiting the kernel memory with mem= in the command line and using
/dev/mem to map memory).

If get_user_pages_fast() fails, look up the VMA that backs the userspace
virtual address, compute the PFN and the physical address, and map it in
the kernel virtual address space with memremap().


What is the reason for a configuration that voluntarily restricts access
to memory that it needs?


By using /dev/mem to provide VM memory, one can avoid the overhead of 
allocating struct page(s) for the whole memory, which is wasteful when using a 
server entirely for hosting VMs.



Sounds reasonable; however, it is incomplete so far, as there are some
code paths that still do not support non-page-backed memory, e.g.,
emulator_cmpxchg_emulated().

I would suggest unifying the code introduced in this patch with the existing
hva_to_pfn(); we can also introduce a common API, maybe named
kvm_map_hva(), to improve the caller sides.


BTW, I do not know why we used kmap_atomic() rather than kmap(); the
path of cmpxchg_gpte() is sleepable anyway.
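
Very roughly, such a helper could take the shape sketched below (untested; it
just combines the gup path with the vma/pfn fallback described in this patch,
the name kvm_map_hva is only a suggestion, the caller adds offset_in_page(hva),
and the unmap side -- kunmap() vs memunmap() depending on whether *page was
set -- is left out):

	static void *kvm_map_hva(unsigned long hva, struct page **page)
	{
		struct vm_area_struct *vma;
		unsigned long pfn;

		*page = NULL;
		/* fast path: normal, struct-page backed memory */
		if (get_user_pages_fast(hva, 1, 1, page) == 1)
			return kmap(*page);

		/* slow path: PFNs out of kernel reach, e.g. a /dev/mem mapping */
		down_read(&current->mm->mmap_sem);
		vma = find_vma(current->mm, hva);
		if (!vma || !(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
			up_read(&current->mm->mmap_sem);
			return NULL;
		}
		pfn = vma->vm_pgoff + ((hva - vma->vm_start) >> PAGE_SHIFT);
		up_read(&current->mm->mmap_sem);

		return memremap((resource_size_t)pfn << PAGE_SHIFT, PAGE_SIZE,
				MEMREMAP_WB);
	}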

Thanks!


Re: [PATCH 11/16] fpga: intel: fme: add partial reconfiguration sub feature support

2017-03-30 Thread Xiao Guangrong



On 31/03/2017 4:30 AM, Alan Tull wrote:

On Thu, Mar 30, 2017 at 7:08 AM, Wu Hao  wrote:

From: Kang Luwei 

Partial Reconfiguration (PR) is the most important function for FME. It
allows reconfiguration for given Port/Accelerated Function Unit (AFU).

This patch adds support for PR sub feature. In this patch, it registers
a fpga_mgr and implements fpga_manager_ops, and invoke fpga_mgr_buf_load
for PR operation once PR request received via ioctl. Below user space
interfaces are exposed by this sub feature.

Sysfs interface:
* /sys/class/fpga///interface_id
  Read-only. Indicate the hardware interface information. Userspace
  applications need to check this interface to select correct green
  bitstream format before PR.

Ioctl interface:
* FPGA_FME_PORT_PR
  Do partial reconfiguration per information from userspace, including
  target port(AFU), buffer size and address info. It returns the PR status
  (PR error code if failed) to userspace.

Signed-off-by: Tim Whisonant 
Signed-off-by: Enno Luebbers 
Signed-off-by: Shiva Rao 
Signed-off-by: Christopher Rauer 
Signed-off-by: Alan Tull 


Hi Wu Hao,

Thanks for submitting your patches.

I think there's been a misunderstanding of the meaning of
'Signed-off-by' [1].  I have not signed off on this code or had a hand
in its development.  But I'm happy to get to review it now.  It will
take a bit of time; I expect to be replying next week.


Hi Alan,

Sorry for the confusion; I think it's because you helped Chris a lot to
implement this interface and we'd like to include your credit this
way. If you dislike it, it will be dropped. :)

Thanks for your review in advance.




Re: [PATCH v2] KVM: x86: cleanup the page tracking SRCU instance

2017-03-28 Thread Xiao Guangrong



On 28/03/2017 5:40 PM, Paolo Bonzini wrote:


 void kvm_page_track_init(struct kvm *kvm)
 {
struct kvm_page_track_notifier_head *head;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 64697fe475c3..f2379673912a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8070,6 +8070,7 @@ void kvm_arch_sync_events(struct kvm *kvm)
	cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
kvm_free_all_assigned_devices(kvm);
kvm_free_pit(kvm);
+   kvm_page_track_cleanup(kvm);
 }


Moving it to kvm_arch_destroy_vm() is better as the init
function is called in kvm_arch_init_vm().
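
i.e. something like the following (just a sketch of the suggested placement,
not a tested diff):

@@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_mmu_uninit_vm(kvm);
+	kvm_page_track_cleanup(kvm);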

Otherwise it looks great to me:
Reviewed-by: Xiao Guangrong <xiaoguangrong.e...@gmail.com>

Thanks for the fix.


Re: [PATCH v2] mm, proc: Fix region lost in /proc/self/smaps

2016-09-19 Thread Xiao Guangrong



On 09/14/2016 11:38 PM, Oleg Nesterov wrote:

On 09/13, Dave Hansen wrote:


On 09/13/2016 07:59 AM, Oleg Nesterov wrote:

I agree. I don't even understand why this was considered as a bug.
Obviously, m_stop() which drops mmap_sep should not be called, or
all the threads should be stopped, if you want to trust the result.


There was a mapping at a given address.  That mapping did not change, it
was not split, its attributes did not change.  But, it didn't show up
when reading smaps.  Folks _actually_ noticed this in a test suite
looking for that address range in smaps.


I understand, and I won't argue with any change which makes the things
better. Just I do not think this is a real problem. And this patch can't
fix other oddities and it seems it adds another one (at least) although
I can easily misread this patch and/or the code.

So we change m_cache_vma(),

-m->version = m_next_vma(m->private, vma) ? vma->vm_start : 
-1UL;
+m->version = m_next_vma(m->private, vma) ? vma->vm_end : -1UL;

OK, and another change in m_start()

-if (vma && (vma = m_next_vma(priv, vma)))
+if (vma)

means that it can return the same vma if it grows in between.

show_map_vma() has another change

+   start = max(vma->vm_start, start);

so it will be reported as _another_ vma, and this doesn't look exactly
right.


We noticed it in the discussion of v1; however, it is not bad since, as Dave said,
it is about the 'address range' rather than the vma.



And after that *ppos will be falsely incremented... but probably this
doesn't matter because the "if (pos < mm->map_count)" logic in m_start()
looks broken anyway.


The 'broken' case can happen only if it is not the first read and m->version is
zero (*ppos != 0 && m->version == 0). If I understand the code correctly,
only an m->buffer overflow can trigger this; for smaps, each vma only
uses ~1k of the buffer, which means this could not happen. Right?






Re: [PATCH v2] mm, proc: Fix region lost in /proc/self/smaps

2016-09-12 Thread Xiao Guangrong



On 09/13/2016 03:10 AM, Michal Hocko wrote:

On Mon 12-09-16 08:01:06, Dave Hansen wrote:

On 09/12/2016 05:54 AM, Michal Hocko wrote:

In order to fix this bug, we make 'file->version' indicate the end address
of current VMA

Doesn't this open doors to another weird cases. Say B would be partially
unmapped (tail of the VMA would get unmapped and reused for a new VMA.


In the end, this interface isn't about VMAs.  It's about addresses, and
we need to make sure that the _addresses_ coming out of it are sane.  In
the case that a VMA was partially unmapped, it doesn't make sense to
show the "new" VMA because we already had some output covering the
address of the "new" VMA from the old one.


OK, that is a fair point and it speaks for caching the vm_end rather
than vm_start+skip.


I am not sure we provide any guarantee when there are more read
syscalls. Hmm, even with a single read() we can get inconsistent results
from different threads without any user space synchronization.


Yeah, very true.  But, I think we _can_ at least provide the following
guarantees (among others):
1. addresses don't go backwards
2. If there is something at a given vaddr during the entirety of the
   life of the smaps walk, we will produce some output for it.


I guess we also want
  3. no overlaps with previously printed values (assuming two subsequent
 reads without seek).

the patch tries to achieve the last part as well AFAICS but I guess this
is incomplete because at least /proc//smaps will report counters
for the full vma range while the header (aka show_map_vma) will report
shorter (non-overlapping) range. I haven't checked other files which use
m_{start,next}


You are right. Will fix both /proc/PID/smaps and /proc/PID/maps in
the next version.



Considering how this all can be tricky and how partial reads can be
confusing and even misleading I am really wondering whether we
should simply document that only full reads will provide a sensible
results.


Makes sense. Will document the guarantee in
Documentation/filesystems/proc.txt.

Thank you, Dave and Michal, for figuring out the right direction. :)




Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-12 Thread Xiao Guangrong



On 09/12/2016 11:44 AM, Rudoff, Andy wrote:

Whether msync/fsync can make data persistent depends on ADR feature on
memory controller, if it exists everything works well, otherwise, we need
to have another interface that is why 'Flush hint table' in ACPI comes
in. 'Flush hint table' is particularly useful for nvdimm virtualization if
we use normal memory to emulate nvdimm with data persistent characteristic
(the data will be flushed to a persistent storage, e.g, disk).

Does current PMEM programming model fully supports 'Flush hint table'? Is
userspace allowed to use these addresses?


The Flush hint table is NOT a replacement for ADR.  To support pmem on
the x86 architecture, the platform is required to ensure that a pmem
store flushed from the CPU caches is in the persistent domain so that the
application need not take any additional steps to make it persistent.
The most common way to do this is the ADR feature.

If the above is not true, then your x86 platform does not support pmem.


Understood.

However, virtualization is a special case, as we can use normal memory
to emulate an NVDIMM for the vm so that the vm can bypass the local file
cache, reduce memory usage, shorten the io path, etc. Currently, this usage
is useful for lightweight virtualization, such as clean containers.

In this case, ADR is available on the physical platform but it can
not help us make data persistent for the vm. So virtualizing the
'flush hint table' is a good way to handle it, based on the ACPI spec:
| software can write to any one of these Flush Hint Addresses to
| cause any preceding writes to the NVDIMM region to be flushed
| out of the intervening platform buffers 1 to the targeted NVDIMM
| (to achieve durability)
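
In other words, on the (virtual) driver side the flush itself is just a store to
one of the published addresses, conceptually something like the sketch below
(flush_hint_addr stands for an ioremapped address taken from the NFIT flush hint
table; this is only an illustration, not the actual pmem driver code):

	wmb();				/* order the data stores before the hint write */
	writeq(1, flush_hint_addr);	/* any value; the write itself triggers the flush */
	wmb();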



Flush hints are for use by the BIOS and drivers and are not intended to
be used in user space.  Flush hints provide two things:

First, if a driver needs to write to command registers or movable windows
on a DIMM, the Flush hint (if provided in the NFIT) is required to flush
the command to the DIMM or ensure stores done through the movable window
are complete before moving it somewhere else.

Second, for the rare case where the kernel wants to flush stores to the
smallest possible failure domain (i.e. to the DIMM even though ADR will
handle flushing it from a larger domain), the flush hints provide a way
to do this.  This might be useful for things like file system journals to
help ensure the file system is consistent even in the face of ADR failure.


We are assuming ADR can fail; however, do we have a way to know whether
ADR works correctly? Maybe MCE can be used for that?

This is necessary to support making data persistent without fsync/msync
in userspace. Or do we need to unconditionally use the 'flush hint address'
if it is available, as the current nvdimm driver does?

Thanks!


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-12 Thread Xiao Guangrong



On 09/09/2016 11:40 PM, Dan Williams wrote:

On Fri, Sep 9, 2016 at 1:55 AM, Xiao Guangrong
<guangrong.x...@linux.intel.com> wrote:
[..]


Whether a persistent memory mapping requires an msync/fsync is a
filesystem specific question.  This mincore proposal is separate from
that.  Consider device-DAX for volatile memory or mincore() called on
an anonymous memory range.  In those cases persistence and filesystem
metadata are not in the picture, but it would still be useful for
userspace to know "is there page cache backing this mapping?" or "what
is the TLB geometry of this mapping?".



I got a question about msync/fsync which is beyond the topic of this thread
:)

Whether msync/fsync can make data persistent depends on ADR feature on
memory
controller, if it exists everything works well, otherwise, we need to have
another
interface that is why 'Flush hint table' in ACPI comes in. 'Flush hint
table' is
particularly useful for nvdimm virtualization if we use normal memory to
emulate
nvdimm with data persistent characteristic (the data will be flushed to a
persistent storage, e.g, disk).

Does current PMEM programming model fully supports 'Flush hint table'? Is
userspace allowed to use these addresses?


If you publish flush hint addresses in the virtual NFIT the guest VM
will write to them whenever a REQ_FLUSH or REQ_FUA request is sent to
the virtual /dev/pmemX device.  Yes, seems straightforward to take a
VM exit on those events and flush simulated pmem to persistent
storage.



Thank you, Dan!

However, REQ_FLUSH or REQ_FUA is handled in kernel space. Okay, after following
the discussion in this thread, I understood that currently filesystems do
not support the case where userspace itself makes data persistent without
the kernel's involvement. So that works.

Hmm, does device-DAX support this case (making data persistent without
msync/fsync)? I guess not, but I just want to confirm it. :)




[PATCH v2] mm, proc: Fix region lost in /proc/self/smaps

2016-09-11 Thread Xiao Guangrong
hreads return the same value */
for (int i = 1; i < NTHREAD; i++) {
if (ret[0] != ret[i]) {
printf("Error i %d ret[0] = %d ret[i] = %d.\n", i,
ret[0], ret[i]);
}
}

printf("%d", ret[0]);
return 0;
}

# dd if=/dev/zero of=~/out bs=2M count=1
# ./testcase ~/out

It fails as some threads cannot find the memory region, allocated in the
main process, in "/proc/self/smaps".

It is caused by proc fs, which uses 'file->version' to indicate the last VMA
that has already been handled by the read() system call. When the
next read() is issued, it uses 'version' to find that VMA, and then the next
VMA is what we want to handle. The related code is as follows:

if (last_addr) {
vma = find_vma(mm, last_addr);
if (vma && (vma = m_next_vma(priv, vma)))
return vma;
}

However, VMA will be lost if the last VMA is gone, e.g:

The process VMA list is A->B->C->D

CPU 0  CPU 1
read() system call
   handle VMA B
   version = B
return to userspace

   unmap VMA B

issue read() again to continue to get
the region info
   find_vma(version) will get VMA C
   m_next_vma(C) will get VMA D
   handle D
   !!! VMA C is lost !!!

In order to fix this bug, we make 'file->version' indicate the end address
of current VMA

Changelog:
Thanks to Dave Hansen's comments, this version fixes the issue in v1 that the
same virtual address range may be output twice, e.g.:

Take two example VMAs:

vma-A: (0x1000 -> 0x2000)
vma-B: (0x2000 -> 0x3000)

read() #1: prints vma-A, sets m->version=0x2000

Now, merge A/B to make C:

vma-C: (0x1000 -> 0x3000)

read() #2: find_vma(m->version=0x2000), returns vma-C, prints vma-C

The user will see two VMAs in their output:

A: 0x1000->0x2000
C: 0x1000->0x3000

Acked-by: Dave Hansen <dave.han...@intel.com>
Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
---
 fs/proc/task_mmu.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 187d84e..10ca648 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -147,7 +147,7 @@ m_next_vma(struct proc_maps_private *priv, struct 
vm_area_struct *vma)
 static void m_cache_vma(struct seq_file *m, struct vm_area_struct *vma)
 {
if (m->count < m->size) /* vma is copied successfully */
-   m->version = m_next_vma(m->private, vma) ? vma->vm_start : -1UL;
+   m->version = m_next_vma(m->private, vma) ? vma->vm_end : -1UL;
 }
 
 static void *m_start(struct seq_file *m, loff_t *ppos)
@@ -176,14 +176,14 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
 
if (last_addr) {
vma = find_vma(mm, last_addr);
-   if (vma && (vma = m_next_vma(priv, vma)))
+   if (vma)
return vma;
}
 
m->version = 0;
if (pos < mm->map_count) {
for (vma = mm->mmap; pos; pos--) {
-   m->version = vma->vm_start;
+   m->version = vma->vm_end;
vma = vma->vm_next;
}
return vma;
@@ -293,7 +293,7 @@ show_map_vma(struct seq_file *m, struct vm_area_struct 
*vma, int is_pid)
vm_flags_t flags = vma->vm_flags;
unsigned long ino = 0;
unsigned long long pgoff = 0;
-   unsigned long start, end;
+   unsigned long end, start = m->version;
dev_t dev = 0;
const char *name = NULL;
 
@@ -304,8 +304,13 @@ show_map_vma(struct seq_file *m, struct vm_area_struct 
*vma, int is_pid)
pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
}
 
+   /*
+* the region [0, m->version) has already been handled, do not
+* handle it doubly.
+*/
+   start = max(vma->vm_start, start);
+
/* We don't show the stack guard page in /proc/maps */
-   start = vma->vm_start;
if (stack_guard_page_start(vma, start))
start += PAGE_SIZE;
end = vma->vm_end;
-- 
1.8.3.1



Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-09 Thread Xiao Guangrong



On 09/09/2016 07:04 AM, Dan Williams wrote:

On Thu, Sep 8, 2016 at 3:56 PM, Ross Zwisler
<ross.zwis...@linux.intel.com> wrote:

On Wed, Sep 07, 2016 at 09:32:36PM -0700, Dan Williams wrote:

[ adding linux-fsdevel and linux-nvdimm ]

On Wed, Sep 7, 2016 at 8:36 PM, Xiao Guangrong
<guangrong.x...@linux.intel.com> wrote:
[..]

However, it is not easy to handle the case that the new VMA overlays with
the old VMA
already got by userspace. I think we have some choices:
1: One way is completely skipping the new VMA region as current kernel code
does but i
   do not think this is good as the later VMAs will be dropped.

2: show the un-overlayed portion of new VMA. In your case, we just show the
region
   (0x2000 -> 0x3000), however, it can not work well if the VMA is a new
created
   region with different attributions.

3: completely show the new VMA as this patch does.

Which one do you prefer?



I don't have a preference, but perhaps this breakage and uncertainty
is a good opportunity to propose a more reliable interface for NVML to
get the information it needs?

My understanding is that it is looking for the VM_MIXEDMAP flag which
is already ambiguous for determining if DAX is enabled even if this
dynamic listing issue is fixed.  XFS has arranged for DAX to be a
per-inode capability and has an XFS-specific inode flag.  We can make
that a common inode flag, but it seems we should have a way to
interrogate the mapping itself in the case where the inode is unknown
or unavailable.  I'm thinking extensions to mincore to have flags for
DAX and possibly whether the page is part of a pte, pmd, or pud
mapping.  Just floating that idea before starting to look into the
implementation, comments or other ideas welcome...


I think this goes back to our previous discussion about support for the PMEM
programming model.  Really I think what NVML needs isn't a way to tell if it
is getting a DAX mapping, but whether it is getting a DAX mapping on a
filesystem that fully supports the PMEM programming model.  This of course is
defined to be a filesystem where it can do all of its flushes from userspace
safely and never call fsync/msync, and that allocations that happen in page
faults will be synchronized to media before the page fault completes.

IIUC this is what NVML needs - a way to decide "do I use fsync/msync for
everything or can I rely fully on flushes from userspace?"

For all existing implementations, I think the answer is "you need to use
fsync/msync" because we don't yet have proper support for the PMEM programming
model.

My best idea of how to support this was a per-inode flag similar to the one
supported by XFS that says "you have a PMEM capable DAX mapping", which NVML
would then interpret to mean "you can do flushes from userspace and be fully
safe".  I think we really want this interface to be common over XFS and ext4.

If we can figure out a better way of doing this interface, say via mincore,
that's fine, but I don't think we can detangle this from the PMEM API
discussion.


Whether a persistent memory mapping requires an msync/fsync is a
filesystem specific question.  This mincore proposal is separate from
that.  Consider device-DAX for volatile memory or mincore() called on
an anonymous memory range.  In those cases persistence and filesystem
metadata are not in the picture, but it would still be useful for
userspace to know "is there page cache backing this mapping?" or "what
is the TLB geometry of this mapping?".


I got a question about msync/fsync which is beyond the topic of this thread :)

Whether msync/fsync can make data persistent depends on the ADR feature of the
memory controller; if it exists, everything works well, otherwise we need to have
another interface, which is why the 'Flush hint table' in ACPI comes in. The
'Flush hint table' is particularly useful for nvdimm virtualization if we use
normal memory to emulate an nvdimm with the data-persistence characteristic
(the data will be flushed to a persistent storage, e.g., a disk).

Does the current PMEM programming model fully support the 'Flush hint table'? Is
userspace allowed to use these addresses?

Thanks!





Re: [PATCH] Fix region lost in /proc/self/smaps

2016-09-09 Thread Xiao Guangrong



On 09/08/2016 10:05 PM, Dave Hansen wrote:

On 09/07/2016 08:36 PM, Xiao Guangrong wrote:>> The user will see two
VMAs in their output:


A: 0x1000->0x2000
C: 0x1000->0x3000

Will it confuse them to see the same virtual address range twice?  Or is
there something preventing that happening that I'm missing?



You are right. Nothing can prevent it.

However, it is not easy to handle the case that the new VMA overlays
with the old VMA
already got by userspace. I think we have some choices:
1: One way is completely skipping the new VMA region as current kernel
code does but i
   do not think this is good as the later VMAs will be dropped.

2: show the un-overlayed portion of new VMA. In your case, we just show
the region
   (0x2000 -> 0x3000), however, it can not work well if the VMA is a new
created
   region with different attributions.

3: completely show the new VMA as this patch does.

Which one do you prefer?


I'd be willing to bet that #3 will break *somebody's* tooling.
Addresses going backwards is certainly screwy.  Imagine somebody using
smaps to search for address holes and doing hole_size=0x1000-0x2000.

#1 can lie about there being no mapping in a place where there may
have _always_ been a mapping and is very similar to the bug you were
originally fixing.  I think that throws it out.

#2 is our best bet, I think.  It's unfortunately also the most code.
It's also a bit of a fib because it'll show a mapping that never
actually existed, but I think this is OK.  I'm not sure what the
downside is that you're referring to, though.  Can you explain?


Yes. I was talking about the case as follows:
   1: read() #1: prints vma-A(0x1000 -> 0x2000)
   2: unmap vma-A(0x1000 -> 0x2000)
   3: create vma-B(0x80 -> 0x3000) on another file with different permissions
  (w, r, x)
   4: read() #2: prints vma-B(0x2000 -> 0x3000)

Then userspace will get just a portion of vma-B. Well, maybe it is not too bad. :)

How about these changes:

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 187d84e..10ca648 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -147,7 +147,7 @@ m_next_vma(struct proc_maps_private *priv, struct 
vm_area_struct *vma)
 static void m_cache_vma(struct seq_file *m, struct vm_area_struct *vma)
 {
if (m->count < m->size) /* vma is copied successfully */
-   m->version = m_next_vma(m->private, vma) ? vma->vm_start : -1UL;
+   m->version = m_next_vma(m->private, vma) ? vma->vm_end : -1UL;
 }

 static void *m_start(struct seq_file *m, loff_t *ppos)
@@ -176,14 +176,14 @@ static void *m_start(struct seq_file *m, loff_t *ppos)

if (last_addr) {
vma = find_vma(mm, last_addr);
-   if (vma && (vma = m_next_vma(priv, vma)))
+   if (vma)
return vma;
}

m->version = 0;
if (pos < mm->map_count) {
for (vma = mm->mmap; pos; pos--) {
-   m->version = vma->vm_start;
+   m->version = vma->vm_end;
vma = vma->vm_next;
}
return vma;
@@ -293,7 +293,7 @@ show_map_vma(struct seq_file *m, struct vm_area_struct 
*vma, int is_pid)
vm_flags_t flags = vma->vm_flags;
unsigned long ino = 0;
unsigned long long pgoff = 0;
-   unsigned long start, end;
+   unsigned long end, start = m->version;
dev_t dev = 0;
const char *name = NULL;

@@ -304,8 +304,13 @@ show_map_vma(struct seq_file *m, struct vm_area_struct 
*vma, int is_pid)
pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
}

+   /*
+* the region [0, m->version) has already been handled, do not
+* handle it doubly.
+*/
+   start = max(vma->vm_start, start);
+
/* We don't show the stack guard page in /proc/maps */
-   start = vma->vm_start;
if (stack_guard_page_start(vma, start))
start += PAGE_SIZE;
end = vma->vm_end;
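
With this change the merge example quoted above would no longer produce
overlapping output; roughly (assuming vma-A and vma-B merge into vma-C between
the two reads):

   read() #1: prints A: 0x1000 -> 0x2000, m->version = 0x2000
   read() #2: find_vma(0x2000) returns C (0x1000 -> 0x3000),
  start = max(vma->vm_start, m->version) = 0x2000,
  so it prints C: 0x2000 -> 0x3000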


Re: [PATCH] Fix region lost in /proc/self/smaps

2016-09-07 Thread Xiao Guangrong



On 09/08/2016 12:34 AM, Dave Hansen wrote:

On 09/06/2016 11:51 PM, Xiao Guangrong wrote:

In order to fix this bug, we make 'file->version' indicate the next VMA
we want to handle


This new approach makes it more likely that we'll skip a new VMA that
gets inserted in between the read()s.  But, I guess that's OK.  We don't
exactly claim to be giving super up-to-date data at the time of read().


Yes, I completely agree with you. :)



With the old code, was there also a case that we could print out the
same virtual address range more than once?  It seems like that could
happen if we had a VMA split between two reads.


Yes.



I think this introduces one oddity: if you have a VMA merge between two
reads(), you might get the same virtual address range twice in your
output.  This didn't happen before because we would have just skipped
over the area that got merged.

Take two example VMAs:

vma-A: (0x1000 -> 0x2000)
vma-B: (0x2000 -> 0x3000)

read() #1: prints vma-A, sets m->version=0x2000

Now, merge A/B to make C:

vma-C: (0x1000 -> 0x3000)

read() #2: find_vma(m->version=0x2000), returns vma-C, prints vma-C

The user will see two VMAs in their output:

A: 0x1000->0x2000
C: 0x1000->0x3000

Will it confuse them to see the same virtual address range twice?  Or is
there something preventing that happening that I'm missing?



You are right. Nothing can prevent it.

However, it is not easy to handle the case where the new VMA overlaps with the
old VMA already seen by userspace. I think we have some choices:
1: One way is to completely skip the new VMA region, as the current kernel code
   does, but I do not think this is good as the later VMAs will be dropped.

2: show the un-overlayed portion of new VMA. In your case, we just show the 
region
   (0x2000 -> 0x3000); however, it cannot work well if the VMA is a newly created
   region with different attributes.

3: completely show the new VMA as this patch does.

Which one do you prefer?

Thanks!



Re: [PATCH] Fix region lost in /proc/self/smaps

2016-09-07 Thread Xiao Guangrong


Sorry, the title should be [PATCH] mm, proc: Fix region lost in /proc/self/smaps

On 09/07/2016 02:51 PM, Xiao Guangrong wrote:

Recently, Red Hat reported that the nvml test suite failed on QEMU/KVM;
for more detailed info please refer to:
   https://bugzilla.redhat.com/show_bug.cgi?id=1365721

Actually, this bug is not only for NVDIMM/DAX but also for any other file
system. This simple test case, abstracted from nvml, can easily reproduce
the bug in a common environment:

-- testcase.c -
#define PROCMAXLEN 4096

int
is_pmem_proc(const void *addr, size_t len)
{
const char *caddr = addr;

FILE *fp;
if ((fp = fopen("/proc/self/smaps", "r")) == NULL) {
printf("!/proc/self/smaps");
return 0;
}

int retval = 0; /* assume false until proven otherwise */
char line[PROCMAXLEN];  /* for fgets() */
char *lo = NULL;/* beginning of current range in smaps file */
char *hi = NULL;/* end of current range in smaps file */
int needmm = 0; /* looking for mm flag for current range */
while (fgets(line, PROCMAXLEN, fp) != NULL) {
static const char vmflags[] = "VmFlags:";
static const char mm[] = " wr";

/* check for range line */
if (sscanf(line, "%p-%p", &lo, &hi) == 2) {
if (needmm) {
/* last range matched, but no mm flag found */
printf("never found mm flag.\n");
break;
} else if (caddr < lo) {
/* never found the range for caddr */
printf("###no match for addr %p.\n", caddr);
break;
} else if (caddr < hi) {
/* start address is in this range */
size_t rangelen = (size_t)(hi - caddr);

/* remember that matching has started */
needmm = 1;

/* calculate remaining range to search for */
if (len > rangelen) {
len -= rangelen;
caddr += rangelen;
printf("matched %zu bytes in range "
"%p-%p, %zu left over.\n",
rangelen, lo, hi, len);
} else {
len = 0;
printf("matched all bytes in range "
"%p-%p.\n", lo, hi);
}
}
} else if (needmm && strncmp(line, vmflags,
sizeof(vmflags) - 1) == 0) {
if (strstr(&line[sizeof(vmflags) - 1], mm) != NULL) {
printf("mm flag found.\n");
if (len == 0) {
/* entire range matched */
retval = 1;
break;
}
needmm = 0; /* saw what was needed */
} else {
/* mm flag not set for some or all of range */
printf("range has no mm flag.\n");
break;
}
}
}

fclose(fp);

printf("returning %d.\n", retval);
return retval;
}

#define NTHREAD 16

void *Addr;
size_t Size;

/*
 * worker -- the work each thread performs
 */
static void *
worker(void *arg)
{
int *ret = (int *)arg;
*ret =  is_pmem_proc(Addr, Size);
return NULL;
}

int main(int argc, char *argv[])
{
if (argc <  2 || argc > 3) {
printf("usage: %s file [env].\n", argv[0]);
return -1;
}

int fd = open(argv[1], O_RDWR);

struct stat stbuf;
fstat(fd, &stbuf);

Size = stbuf.st_size;
Addr = mmap(0, stbuf.st_size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);

close(fd);

pthread_t threads[NTHREAD];
int ret[NTHREAD];

/* kick off NTHREAD threads */
for (int i = 0; i < NTHREAD; i++)
pthread_create(&threads[i], NULL, worker, &ret[i]);

/* wait for all the threads to complete */
   

[PATCH] Fix region lost in /proc/self/smaps

2016-09-07 Thread Xiao Guangrong
hreads return the same value */
for (int i = 1; i < NTHREAD; i++) {
if (ret[0] != ret[i]) {
printf("Error i %d ret[0] = %d ret[i] = %d.\n", i,
ret[0], ret[i]);
}
}

printf("%d", ret[0]);
return 0;
}

# dd if=/dev/zero of=~/out bs=2M count=1
# ./testcase ~/out

It failed as some threads cannot find the memory region in
"/proc/self/smaps" which is allocated in the main process.

It is caused by procfs, which uses 'file->version' to record the last VMA
that has already been handled by the read() system call. When the next
read() is issued, it uses 'version' to find that VMA, and the VMA after it
is the one we want to handle; the related code is as follows:

if (last_addr) {
vma = find_vma(mm, last_addr);
if (vma && (vma = m_next_vma(priv, vma)))
return vma;
}

However, a VMA will be lost if the last-handled VMA is gone, e.g.:

The process VMA list is A->B->C->D

CPU 0  CPU 1
read() system call
   handle VMA B
   version = B
return to userspace

   unmap VMA B

issue read() again to continue to get
the region info
   find_vma(version) will get VMA C
   m_next_vma(C) will get VMA D
   handle D
   !!! VMA C is lost !!!

In order to fix this bug, we make 'file->version' indicate the next VMA
we want to handle

Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
---
 fs/proc/task_mmu.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 187d84e..ace4a69 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -146,8 +146,12 @@ m_next_vma(struct proc_maps_private *priv, struct 
vm_area_struct *vma)
 
 static void m_cache_vma(struct seq_file *m, struct vm_area_struct *vma)
 {
-   if (m->count < m->size) /* vma is copied successfully */
-   m->version = m_next_vma(m->private, vma) ? vma->vm_start : -1UL;
+   /* vma is copied successfully */
+   if (m->count < m->size) {
+   struct vm_area_struct *vma_next =  m_next_vma(m->private, vma);
+
+   m->version = vma_next ? vma_next->vm_start : -1UL;
+   }
 }
 
 static void *m_start(struct seq_file *m, loff_t *ppos)
@@ -176,15 +180,15 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
 
if (last_addr) {
vma = find_vma(mm, last_addr);
-   if (vma && (vma = m_next_vma(priv, vma)))
+   if (vma)
return vma;
}
 
m->version = 0;
if (pos < mm->map_count) {
for (vma = mm->mmap; pos; pos--) {
-   m->version = vma->vm_start;
vma = vma->vm_next;
+   m->version = vma->vm_start;
}
return vma;
}
-- 
1.8.3.1



Re: DAX can not work on virtual nvdimm device

2016-08-31 Thread Xiao Guangrong



On 08/31/2016 01:09 AM, Dan Williams wrote:



Can you post your exact reproduction steps?  This test is not failing for me.



Sure.

1. make the guest kernel based on your tree, the top commit is
   10d7902fa0e82b (dax: unmap/truncate on device shutdown) and
   the config file can be found in this thread.

2. add guest kernel command line: memmap=6G!10G

3: start the guest:
   x86_64-softmmu/qemu-system-x86_64 -machine pc,nvdimm --enable-kvm \
   -smp 16 -m 32G,maxmem=100G,slots=100 /other/VMs/centos6.img -monitor stdio

4: in guest:
   mkfs.ext4 /dev/pmem0
   mount -o dax /dev/pmem0  /mnt/pmem/
   echo > /mnt/pmem/xxx
   ./mmap /mnt/pmem/xxx
   ./read /mnt/pmem/xxx

  The source code of mmap and read is attached to this mail.

  Hopefully, you can detect the error triggered by the read test.

Thanks!
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
	char *filename;

	if (argc < 2) {
		printf("arg: filename.\n");
		return -1;
	}

	filename = argv[1];
	printf("test on %s.\n", filename);

	int fd = open(filename, O_RDWR);

	if (fd < 0) {
		perror("open");
		return -1;
	}

	int count = 0;

	while (1) {
		ssize_t ret;
		char buf;

		ret = read(fd, &buf, sizeof(buf));
		if (ret < 0) {
			perror("READ");
			return -1;
		}

		if (ret == 0)
			break;
		if (ret != sizeof(buf)) {
			printf("Count %x Ret %lx sizeof(buf) %lx.\n",
count, ret, sizeof(buf));
			return -1;
		}

		count++;
		printf("%c", buf);
	}

	printf("\n Good Read.\n");
	return 0;
}
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

static void write_test(char *buf, int size)
{
	int i;

	printf("Try to write %p for %x size.\n", buf, size);
	for (i = 0; i < size; i++)
		buf[i] = (char)(i & 0xFF);
	printf("Write Done.\n");
}

static void read_test(char *buf, int size)
{
	int i;

	printf("Try to read %p for %x size.\n", buf, size);
	for (i = 0; i < size; i++)
		if (buf[i] != (char)(i & 0xFF)) {
			printf("Read Check failed.\n");
			printf("Index %x, %c expect %x.\n", i, buf[i], i & 0xFF);
			exit(-1);
		}
	printf("Read Done.\n");
}

static void fread_test(int fd, int size)
{
	char buf;
	int i;

	i = lseek(fd, 0, SEEK_END);
	printf("End: %x.\n", i);

	i = lseek(fd, 0, SEEK_SET);
	if (i != 0) {
		perror("seek");
		exit(-1);
	}

	printf("Try to fread fd=%d size %x sizeof(buf) %ld.\n", fd, size,
		sizeof(buf));

	for (i = 0; i < size; i++) {
		int ret;

		ret = read(fd, &buf, sizeof(buf));
		if (ret != sizeof(buf)) {
			printf("Read failed, offset %x, ret %d ftell %ld.\n",
i, ret, lseek(fd, 0, SEEK_CUR));
			perror("read");
			exit(-1);
		}
		if (buf != (char)(i & 0xFF)) {
			printf("Read Check failed.\n");
			printf("Index %x, %c expect %x.\n", i, buf, i & 0xFF);
			exit (-1);
		}
	}

	printf("Fread Done.\n");
}

int main(int argc, char *argv[])
{
	char *filename;
	char *buf;
	int fd;

	if (argc < 2) {
		printf("arg: filename.\n");
		return -1;
	}

	filename = argv[1];
	printf("mmap test on %s.\n", filename);

	fd = open(filename, O_RDWR);
	if (fd < 0) {
		perror("open");
		return -1;
	}

	if (lseek(fd, 0x1000, SEEK_SET) != 0x1000) {
		perror("seek");
		return -1;
	}

	if (ftruncate(fd, 0x1000) != 0) {
		perror("ftruncate");
		return -1;
	}

//	return 0;

	buf = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return -1;
	}

	write_test(buf, 0x1000);
	read_test(buf, 0x1000);
	msync(buf, 0x1000, MS_SYNC);
	fread_test(fd, 0x1000);

	close(fd);

	return 0;
}


Re: DAX can not work on virtual nvdimm device

2016-08-30 Thread Xiao Guangrong



On 08/30/2016 03:30 AM, Ross Zwisler wrote:



Can you please verify that you are using "usable" memory for your memmap?  All
the details are here:

https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system


Sure.

This is the BIOS E820 info in the guest:

e820: BIOS-provided physical RAM map:
BIOS-e820: [mem 0x-0x0009fbff] usable
BIOS-e820: [mem 0x0009fc00-0x0009] reserved
BIOS-e820: [mem 0x000f-0x000f] reserved
BIOS-e820: [mem 0x0010-0xbffdefff] usable
BIOS-e820: [mem 0xbffdf000-0xbfff] reserved
BIOS-e820: [mem 0xfeffc000-0xfeff] reserved
BIOS-e820: [mem 0xfffc-0x] reserved
BIOS-e820: [mem 0x0001-0x00083fff] usable
NX (Execute Disable) protection: active
e820: user-defined physical RAM map:
user: [mem 0x-0x0009fbff] usable
user: [mem 0x0009fc00-0x0009] reserved
user: [mem 0x000f-0x000f] reserved
user: [mem 0x0010-0xbffdefff] usable
user: [mem 0xbffdf000-0xbfff] reserved
user: [mem 0xfeffc000-0xfeff] reserved
user: [mem 0xfffc-0x] reserved
user: [mem 0x0001-0x00027fff] usable
user: [mem 0x00028000-0x0003] persistent (type 12)
user: [mem 0x0004-0x00083fff] usable

So the memory we used to emulate PMEM splits the 'usable' region.



My guess is that Boaz was correct, and that your memmap is off using addresses
that don't actually map to memory.


I do not think so. :(

I did an mmap-write and mmap-read test; the data written by mmap-write can be
correctly read out, so the backend memory really exists.

Thanks!


Re: DAX can not work on virtual nvdimm device

2016-08-29 Thread Xiao Guangrong


Hi Ross,

Sorry for the delay, I just returned from KVM Forum.

On 08/20/2016 02:30 AM, Ross Zwisler wrote:

On Fri, Aug 19, 2016 at 07:59:29AM -0700, Dan Williams wrote:

On Fri, Aug 19, 2016 at 4:19 AM, Xiao Guangrong
<guangrong.x...@linux.intel.com> wrote:


Hi Dan,

Recently, Redhat reported that nvml test suite failed on QEMU/KVM,
more detailed info please refer to:
   https://bugzilla.redhat.com/show_bug.cgi?id=1365721

The reason for this bug is that the memory region created by mmap()
on the dax-based file was gone so that the region can not be found
in /proc/self/smaps during the runtime.

This is a simple way to trigger this issue:
   mount -o dax /dev/pmem0 /mnt/pmem/
   vim /mnt/pmem/xxx
then 'vim' is crashed due to segment fault.

This bug can be reproduced on your tree, the top commit is
10d7902fa0e82b (dax: unmap/truncate on device shutdown), the kernel
configure file is attached.

Your thought or comment is highly appreciated.


I'm going to be offline until Tuesday, but I will investigate when I'm
back.  In the meantime if Ross or Vishal had an opportunity to take a
look I wouldn't say "no" :).


I haven't been able to reproduce this vim segfault.  I'm using QEMU v2.6.0,
and the kernel commit you mentioned, and your kernel config.

Here's my QEMU command line:

sudo ~/qemu/bin/qemu-system-x86_64 /var/lib/libvirt/images/alara.qcow2 \
-machine pc,nvdimm -m 8G,maxmem=100G,slots=100  -object \
memory-backend-file,id=mem1,share,mem-path=/dev/pmem0,size=8G -device \
nvdimm,memdev=mem1,id=nv1 -smp 6 -machine pc,accel=kvm

With this I'm able to mkfs the guest's /dev/pmem0, mount it with -o dax, and
write a file with vim.


Thanks for your test. That's strange...



Can you reproduce your results with a pmem device created via a memmap kernel
command line parameter in the guest?  You'll need to update your kernel
config to enable CONFIG_X86_PMEM_LEGACY and CONFIG_X86_PMEM_LEGACY_DEVICE.



Okay, I tested it with memmap=6G!10G, and it failed too. So it looks like it's a
filesystem or DAX issue.

More precisely, I figured out the root cause: read() returns a wrong value
when it reaches the end of the file; the following test case can trigger it:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
char *filename;

if (argc < 2) {
printf("arg: filename.\n");
return -1;
}

filename = argv[1];
printf("test on %s.\n", filename);

int fd = open(filename, O_RDWR);

if (fd < 0) {
perror("open");
return -1;
}

int count = 0;

while (1) {
ssize_t ret;
char buf;

ret = read(fd, &buf, sizeof(buf));
if (ret < 0) {
perror("READ");
return -1;
}

if (ret == 0)
break;
if (ret != sizeof(buf)) {
printf("Count %x Ret %lx sizeof(buf) %lx.\n",
count, ret, sizeof(buf));
return -1;
}

count++;
printf("%c", buf);
}

printf("\n Good Read.\n");
return 0;
}



It will fail at "ret != sizeof(buf)", for example, the error output on my
test env is:
  Count 1000 Ret 22f84200 sizeof(buf) 1.
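
For reference, here is a minimal sketch (added for illustration, not part of the
original report, and assuming a non-empty file) of the POSIX behaviour the loop
above relies on: near EOF, read() must return the number of bytes actually
remaining (possibly 0), never a value larger than the requested size.

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	char buf[2];
	ssize_t ret;
	off_t end;
	int fd;

	if (argc < 2) {
		printf("arg: filename.\n");
		return -1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return -1;
	}

	end = lseek(fd, 0, SEEK_END);
	lseek(fd, end - 1, SEEK_SET);	/* position on the last byte */

	ret = read(fd, buf, sizeof(buf));
	assert(ret == 1);		/* only one byte is left */

	ret = read(fd, buf, sizeof(buf));
	assert(ret == 0);		/* at EOF: 0, never a bogus byte count */

	close(fd);
	printf("read() EOF semantics look sane.\n");
	return 0;
}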


Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-06 Thread Xiao Guangrong



On 07/06/2016 07:48 PM, Paolo Bonzini wrote:



On 06/07/2016 06:02, Xiao Guangrong wrote:




May I ask you what the exact issue you have with this interface for
Intel to support
your own GPU virtualization?


Intel's vGPU can work with this framework. We really appreciate your
/ nvidia's
contribution.


Then, I don't think we should embargo Paolo's patch.


This patchset is specific to the framework design, i.e., mapping memory when
a fault happens rather than at mmap() time, and this design is exactly what we
have been discussing for nearly two days.


I disagree, this patch fixes a bug because what Neo is doing is legal.
It may not be the design that will be committed, but the bug they found
in KVM is real.



I just worry whether we really need fault-on-demand for device memory, i.e.,
whether device memory overcommit is safe enough.

It lacks a graceful way to recover the workload if the resource is really
overloaded. Unlike with normal memory, the host kernel and guest kernel can not
do anything except kill the VM in this case. So the VM gets crashed
due to device emulation; that is not safe, as the device can be accessed from
userspace even by an unprivileged user, which is vulnerable in a data center.



Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-05 Thread Xiao Guangrong



On 07/06/2016 10:57 AM, Neo Jia wrote:

On Wed, Jul 06, 2016 at 10:35:18AM +0800, Xiao Guangrong wrote:



On 07/06/2016 10:18 AM, Neo Jia wrote:

On Wed, Jul 06, 2016 at 10:00:46AM +0800, Xiao Guangrong wrote:



On 07/05/2016 08:18 PM, Paolo Bonzini wrote:



On 05/07/2016 07:41, Neo Jia wrote:

On Thu, Jun 30, 2016 at 03:01:49PM +0200, Paolo Bonzini wrote:

The vGPU folks would like to trap the first access to a BAR by setting
vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault handler
then can use remap_pfn_range to place some non-reserved pages in the VMA.

KVM lacks support for this kind of non-linear VM_PFNMAP mapping, and these
patches should fix this.


Hi Paolo,

I have tested your patches with the mediated passthru patchset that is being
reviewed in KVM and QEMU mailing list.

The fault handler gets called successfully and the previously mapped memory gets
unmmaped correctly via unmap_mapping_range.


Great, then I'll include them in 4.8.


Code is okay, but i still suspect if this implementation, fetch mmio pages in 
fault
handler, is needed. We'd better include these patches after the design of vfio
framework is decided.


Hi Guangrong,

I disagree. The design of VFIO framework has been actively discussed in the KVM
and QEMU mailing for a while and the fault handler is agreed upon to provide the
flexibility for different driver vendors' implementation. With that said, I am
still open to discuss with you and anybody else about this framework as the goal
is to allow multiple vendor to plugin into this framework to support their
mediated device virtualization scheme, such as Intel, IBM and us.


The discussion is still going on. And current vfio patchset we reviewed is still
problematic.


My point is the fault handler part has been discussed already, with that said I
am always open to any constructive suggestions to make things better and
maintainable. (Appreciate your code review on the VFIO thread, I think we still
own you another response, will do that.)



It can always be changed, especially as the vfio patchset is not in good shape.





May I ask you what the exact issue you have with this interface for Intel to 
support
your own GPU virtualization?


Intel's vGPU can work with this framework. We really appreciate your / nvidia's
contribution.


Then, I don't think we should embargo Paolo's patch.


This patchset is specific to the framework design, i.e., mapping memory when
a fault happens rather than at mmap() time, and this design is exactly what we
have been discussing for nearly two days.



Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-05 Thread Xiao Guangrong



On 07/06/2016 10:18 AM, Neo Jia wrote:

On Wed, Jul 06, 2016 at 10:00:46AM +0800, Xiao Guangrong wrote:



On 07/05/2016 08:18 PM, Paolo Bonzini wrote:



On 05/07/2016 07:41, Neo Jia wrote:

On Thu, Jun 30, 2016 at 03:01:49PM +0200, Paolo Bonzini wrote:

The vGPU folks would like to trap the first access to a BAR by setting
vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault handler
then can use remap_pfn_range to place some non-reserved pages in the VMA.

KVM lacks support for this kind of non-linear VM_PFNMAP mapping, and these
patches should fix this.


Hi Paolo,

I have tested your patches with the mediated passthru patchset that is being
reviewed in KVM and QEMU mailing list.

The fault handler gets called successfully and the previously mapped memory gets
unmmaped correctly via unmap_mapping_range.


Great, then I'll include them in 4.8.


Code is okay, but i still suspect if this implementation, fetch mmio pages in 
fault
handler, is needed. We'd better include these patches after the design of vfio
framework is decided.


Hi Guangrong,

I disagree. The design of VFIO framework has been actively discussed in the KVM
and QEMU mailing for a while and the fault handler is agreed upon to provide the
flexibility for different driver vendors' implementation. With that said, I am
still open to discuss with you and anybody else about this framework as the goal
is to allow multiple vendor to plugin into this framework to support their
mediated device virtualization scheme, such as Intel, IBM and us.


The discussion is still going on. And current vfio patchset we reviewed is still
problematic.



May I ask you what the exact issue you have with this interface for Intel to 
support
your own GPU virtualization?


Intel's vGPU can work with this framework. We really appreciate your / nvidia's
contribution.

I didn't mean to offend you; I just want to make sure this complexity is really
needed, check that this framework is safe enough, and think over whether we have
a better implementation.





Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-05 Thread Xiao Guangrong



On 07/05/2016 11:07 PM, Neo Jia wrote:

On Tue, Jul 05, 2016 at 05:02:46PM +0800, Xiao Guangrong wrote:




It is physically contiguous but it is done during the runtime, physically 
contiguous doesn't mean
static partition at boot time. And only during runtime, the proper HW resource 
will be requested therefore
the right portion of MMIO region will be granted by the mediated device driver 
on the host.


Okay. This is your implementation design rather than a hardware limitation,
right?


I don't think it matters here. We are talking about framework so it should
provide the flexibility for different driver vendor.


It really matters. It is the reason why we designed the framework like this, and
we need to figure out whether we have a better design that fills the requirements.





For example, if the instance require 512M memory (the size can be specified by 
QEMU
command line), it can tell its requirement to the mediated device driver via 
create()
interface, then the driver can allocate then memory for this instance before it 
is running.


BAR != your device memory

We don't set the BAR size via QEMU command line, BAR size is extracted by QEMU
from config space provided by vendor driver.



Anyway, the BAR size have a way to configure, e.g, specify the size as a 
parameter when
you create a mdev via sysfs.



Theoretically, the hardware is able to do memory management as this style, but 
for some
reasons you choose allocating memory in the runtime. right? If my understanding 
is right,
could you please tell us what benefit you want to get from this 
running-allocation style?


Your understanding is incorrect.


Then WHY?







Then the req_size and pgoff will both come from the mediated device driver 
based on his internal book
keeping of the hw resource allocation, which is only available during runtime. 
And such book keeping
can be built part of para-virtualization scheme between guest and host device 
driver.



I am talking about the parameters you passed to validate_map_request(). req_size is
calculated like this:

+   offset   = virtaddr - vma->vm_start;
+   phyaddr  = (vma->vm_pgoff << PAGE_SHIFT) + offset;
+   pgoff= phyaddr >> PAGE_SHIFT;

All this info is from the vma, which is available in mmap().

pgoff is got from:
+   pg_prot  = vma->vm_page_prot;
that is also available in mmap().


This is kept there in case the validate_map_request() is not provided by vendor
driver then by default assume 1:1 mapping. So if validate_map_request() is not
provided, fault handler should not fail.


THESE are the parameters you passed to validate_map_request(), and this info is
available in mmap(); it really does not matter if you move validate_map_request()
to mmap(). That's what I want to say.






None of such information is available at VFIO mmap() time. For example, several 
VMs
are sharing the same physical device to provide mediated access. All VMs will
call the VFIO mmap() on their virtual BAR as part of QEMU vfio/pci 
initialization
process, at that moment, we definitely can't mmap the entire physical MMIO
into both VM blindly for obvious reason.



mmap() carries @length information, so you only need to allocate the specified 
size
(corresponding to @length) of memory for them.


Again, you still look at this as a static partition at QEMU configuration time
where the guest mmio will be mapped as a whole at some offset of the physical
mmio region. (You still can do that like I said above by not providing
validate_map_request() in your vendor driver.)



Then you can move validate_map_request() to here to achieve custom 
allocation-policy.


But this is not the framework we are defining here.

The framework we have here is to provide the driver vendor flexibility to decide
the guest mmio and physical mmio mapping on page basis, and such information is
available during runtime.

How such information gets communicated between guest and host driver is up to
driver vendor.


The problem is the ordering between "provide the driver vendor the
flexibility to decide the guest mmio and physical mmio mapping on a page basis"
and mmap().

We should provide such allocation info first, then do mmap(). Your current design,
do mmap() -> communication telling such info -> use such info when a fault happens,
is really BAD, because you can not control when the memory fault will happen.
The guest may access this memory before the communication you mentioned above,
and another reason is that the KVM MMU can prefetch memory at any time.
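
For context, here is a minimal, hypothetical sketch of the fault-on-demand
mapping being debated (it is not the vendor driver; it assumes the ~4.8-era
fault handler signature, and my_dev / my_dev_lookup_pfn() are made-up names):
the BAR VMA gets no PTEs at mmap() time, and the first access faults one page
in via remap_pfn_range().

#include <linux/mm.h>

struct my_dev;						/* hypothetical device state */
unsigned long my_dev_lookup_pfn(struct my_dev *mdev,
				unsigned long pgoff);	/* hypothetical pfn lookup */

static int my_bar_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct my_dev *mdev = vma->vm_private_data;
	unsigned long vaddr = (unsigned long)vmf->virtual_address;
	unsigned long offset = vaddr - vma->vm_start;
	unsigned long pfn;

	/* ask the (hypothetical) mediated driver which host pfn backs this
	 * BAR offset; this is the role validate_map_request() plays above */
	pfn = my_dev_lookup_pfn(mdev, vma->vm_pgoff + (offset >> PAGE_SHIFT));

	if (remap_pfn_range(vma, vaddr & PAGE_MASK, pfn, PAGE_SIZE,
			    vma->vm_page_prot))
		return VM_FAULT_SIGBUS;

	return VM_FAULT_NOPAGE;
}

static const struct vm_operations_struct my_bar_vm_ops = {
	.fault = my_bar_fault,
};

The concern raised above is exactly that nothing forces the pfn lookup
information to exist before the first fault, since the guest (or a KVM MMU
prefetch) can touch the VMA at any time after mmap().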



Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-05 Thread Xiao Guangrong



On 07/05/2016 08:18 PM, Paolo Bonzini wrote:



On 05/07/2016 07:41, Neo Jia wrote:

On Thu, Jun 30, 2016 at 03:01:49PM +0200, Paolo Bonzini wrote:

The vGPU folks would like to trap the first access to a BAR by setting
vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault handler
then can use remap_pfn_range to place some non-reserved pages in the VMA.

KVM lacks support for this kind of non-linear VM_PFNMAP mapping, and these
patches should fix this.


Hi Paolo,

I have tested your patches with the mediated passthru patchset that is being
reviewed in KVM and QEMU mailing list.

The fault handler gets called successfully and the previously mapped memory gets
unmmaped correctly via unmap_mapping_range.


Great, then I'll include them in 4.8.


The code is okay, but i still doubt whether this implementation, fetching mmio pages
in the fault handler, is needed. We'd better include these patches after the design
of the vfio framework is decided.
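
(For readers following along, a minimal sketch of the scheme under discussion,
using the current .fault prototype; the 1:1 pfn convention and the idea of a
vendor override are assumptions for illustration, not the actual patch:)

#include <linux/mm.h>

/*
 * The VFIO mmap() leaves the vma empty; pages are only inserted on the
 * first access.  A vendor callback could override pgoff/req_size/pg_prot
 * here based on its runtime bookkeeping; the fallback below assumes
 * vma->vm_pgoff encodes the physical pfn (1:1 mapping).
 */
static vm_fault_t mdev_bar_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long pgoff = vma->vm_pgoff +
			      ((vmf->address - vma->vm_start) >> PAGE_SHIFT);

	if (remap_pfn_range(vma, vmf->address & PAGE_MASK, pgoff,
			    PAGE_SIZE, vma->vm_page_prot))
		return VM_FAULT_SIGBUS;

	return VM_FAULT_NOPAGE;	/* PTE already installed above */
}

static const struct vm_operations_struct mdev_bar_vm_ops = {
	.fault = mdev_bar_fault,
};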



Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-05 Thread Xiao Guangrong



On 07/05/2016 03:30 PM, Neo Jia wrote:




(Just for completeness, if you really want to use a device in the above example as
VFIO passthru, the second step is not completely handled in userspace; it is actually
the guest driver that will allocate and set up the proper hw resource, which will
later be ready for you to access via some mmio pages.)


Hmm... i always treat the VM as userspace.


It is OK to treat the VM as userspace, but I think it is better to spell out the
details so we are always on the same page.



Okay. I should pay more attention to it when i discuss with the driver people. :)







This is how QEMU/VFIO currently works; could you please tell me the special points
of your solution compared with current QEMU/VFIO, and why the current model can not
fit your requirement, so that we can better understand your scenario?


The scenario I am describing here is the mediated passthru case, but what you are
describing here (more or less) is the VFIO direct assigned case. It is different in
several areas, but the major difference related to this topic here is:

1) In the VFIO direct assigned case, the device (and its resource) is completely
owned by the VM, therefore its mmio region can be mapped directly into the VM during
the VFIO mmap() call, as there is no resource sharing among VMs and there is no
mediated device driver on the host to manage such resource, so it is completely
owned by the guest.


I understand this difference. However, as you told me, the MMIO region allocated for
the VM is contiguous, so i assume that portion of the physical MMIO region is
completely owned by the guest. The only difference i can see is that the mediated
device driver needs to allocate that region.


It is physically contiguous but it is done during runtime; physically contiguous
doesn't mean a static partition at boot time. Only during runtime will the proper HW
resource be requested, and therefore the right portion of the MMIO region will be
granted by the mediated device driver on the host.


Okay. This is your implementation design rather than a hardware limitation, right?

For example, if the instance requires 512M memory (the size can be specified on the
QEMU command line), it can tell its requirement to the mediated device driver via the
create() interface, and then the driver can allocate the memory for this instance
before it is running.

Theoretically, the hardware is able to do memory management in this style, but for
some reason you chose to allocate memory at runtime, right? If my understanding is
right, could you please tell us what benefit you want to get from this
runtime-allocation style?



Also, physically contiguous doesn't mean the guest and host mmio are always 1:1.
You can have an 8GB host physical mmio while the guest will only have 256MB.


Thanks for your patience, it is clearer to me and at least i am able to try to 
guess the
whole picture. :)







2) In the mediated passthru case, multiple VMs are sharing the same physical device,
so how the HW resource gets allocated is completely decided by the guest and host
device drivers of the virtualized DMA device, here the GPU, the same as the MMIO
pages used to access those HW resources.


I can not see what the guest's involvement is here; look at your code, you cooked
the fault handler like this:


You shouldn't, as that depends on how different devices are getting para-virtualized
by their own implementations.



PV method. It is interesting. More comments below.



+   ret = parent->ops->validate_map_request(mdev, virtaddr,
+                                           &pgoff, &req_size,
+                                           &pg_prot);

Please tell me what information is got from the guest here? All this info can be
found at the time of mmap().


The virtaddr is the guest mmio address that triggers this fault, which will be used
by the mediated device driver to locate the resource that it has previously
allocated.


The virtaddr is not the guest mmio address; it is the virtual address of QEMU. vfio
is not able to figure out the guest mmio address, as the mapping is handled in
userspace, as we discussed above.

And we can get the virtaddr from [vma->start, vma->end) when we do mmap().



Then the req_size and pgoff will both come from the mediated device driver based on
its internal bookkeeping of the hw resource allocation, which is only available
during runtime. And such bookkeeping can be built as part of a para-virtualization
scheme between the guest and host device drivers.



I am talking about the parameters you passed to validate_map_request(). req_size is
calculated like this:

+   offset   = virtaddr - vma->vm_start;
+   phyaddr  = (vma->vm_pgoff << PAGE_SHIFT) + offset;
+   pgoff    = phyaddr >> PAGE_SHIFT;

All this info is from the vma, which is available in mmap().

pg_prot is got from:
+   pg_prot  = vma->vm_page_prot;
that is also available in mmap().


None of such information is available at VFIO mmap() time. For example, several 
VMs

Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-05 Thread Xiao Guangrong



On 07/05/2016 01:16 PM, Neo Jia wrote:

On Tue, Jul 05, 2016 at 12:02:42PM +0800, Xiao Guangrong wrote:



On 07/05/2016 09:35 AM, Neo Jia wrote:

On Tue, Jul 05, 2016 at 09:19:40AM +0800, Xiao Guangrong wrote:



On 07/04/2016 11:33 PM, Neo Jia wrote:



Sorry, I think I misread the "allocation" as "mapping". We only delay the
cpu mapping, not the allocation.


So how to understand your statement:
"at that moment nobody has any knowledge about how the physical mmio gets 
virtualized"

The resource, physical MMIO region, has been allocated, why we do not know the 
physical
address mapped to the VM?



From a device driver point of view, the physical mmio region never gets 
allocated until
the corresponding resource is requested by clients and granted by the mediated 
device driver.


Hmm... but you told me that you did not delay the allocation. :(


Hi Guangrong,

The allocation here is the allocation of device resource, and the only way to
access that kind of device resource is via a mmio region of some pages there.

For example, if VM needs resource A, and the only way to access resource A is
via some kind of device memory at mmio address X.

So, we never defer the allocation request during runtime, we just setup the
CPU mapping later when it actually gets accessed.



So it returns to my original question: why not allocate the physical mmio 
region in mmap()?



Without running anything inside the VM, how do you know how the hw resource gets
allocated? Therefore there is no knowledge of the use of the mmio region.


The allocation and mapping can be two independent processes:
- the first process is just allocation. The MMIO region is allocated from physical
  hardware and this region is mapped into _QEMU's_ arbitrary virtual address by
  mmap(). At this time, the VM can not actually use this resource.

- the second process is mapping. When the VM enables this region, e.g. it enables
  the PCI BAR, QEMU maps the virtual address returned by mmap() into the VM's
  physical memory. After that, the VM can access this region.

The second process is completely handled in userspace, which means the mediated
device driver needn't care how the resource is mapped into the VM.
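
(A userspace-side sketch of that second step, reduced to a bare KVM memory
slot update for illustration; QEMU's real path goes through its MemoryRegion
machinery, and the slot number and addresses below are placeholders:)

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * Expose an already mmap()ed host virtual address (e.g. the QEMU-side BAR
 * mapping) to the guest at the given guest-physical address once the BAR
 * is enabled.
 */
static int map_bar_into_guest(int vm_fd, uint32_t slot, uint64_t gpa,
			      uint64_t size, void *hva)
{
	struct kvm_userspace_memory_region region = {
		.slot            = slot,
		.flags           = 0,
		.guest_phys_addr = gpa,
		.memory_size     = size,
		.userspace_addr  = (uintptr_t)hva,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}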


In your example, you are still picturing it as VFIO direct assign, but the solution
we are talking about here is mediated passthru via the VFIO framework to virtualize
DMA devices without SR-IOV.



Please see my comments below.


(Just for completeness, if you really want to use a device in the above example as
VFIO passthru, the second step is not completely handled in userspace; it is actually
the guest driver that will allocate and set up the proper hw resource, which will
later be ready for you to access via some mmio pages.)


Hmm... i always treat the VM as userspace.





This is how QEMU/VFIO currently works; could you please tell me the special points
of your solution compared with current QEMU/VFIO, and why the current model can not
fit your requirement, so that we can better understand your scenario?


The scenario I am describing here is the mediated passthru case, but what you are
describing here (more or less) is the VFIO direct assigned case. It is different in
several areas, but the major difference related to this topic here is:

1) In the VFIO direct assigned case, the device (and its resource) is completely
owned by the VM, therefore its mmio region can be mapped directly into the VM during
the VFIO mmap() call, as there is no resource sharing among VMs and there is no
mediated device driver on the host to manage such resource, so it is completely
owned by the guest.


I understand this difference. However, as you told me, the MMIO region allocated for
the VM is contiguous, so i assume that portion of the physical MMIO region is
completely owned by the guest. The only difference i can see is that the mediated
device driver needs to allocate that region.



2) In the mediated passthru case, multiple VMs are sharing the same physical device,
so how the HW resource gets allocated is completely decided by the guest and host
device drivers of the virtualized DMA device, here the GPU, the same as the MMIO
pages used to access those HW resources.


I can not see what the guest's involvement is here; look at your code, you cooked
the fault handler like this:

+   ret = parent->ops->validate_map_request(mdev, virtaddr,
+                                           &pgoff, &req_size,
+                                           &pg_prot);

Please tell me what information is got from the guest here? All this info can be
found at the time of mmap().



Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-04 Thread Xiao Guangrong



On 07/05/2016 09:35 AM, Neo Jia wrote:

On Tue, Jul 05, 2016 at 09:19:40AM +0800, Xiao Guangrong wrote:



On 07/04/2016 11:33 PM, Neo Jia wrote:



Sorry, I think I misread the "allocation" as "mapping". We only delay the
cpu mapping, not the allocation.


So how to understand your statement:
"at that moment nobody has any knowledge about how the physical mmio gets 
virtualized"

The resource, physical MMIO region, has been allocated, why we do not know the 
physical
address mapped to the VM?



From a device driver point of view, the physical mmio region never gets 
allocated until
the corresponding resource is requested by clients and granted by the mediated 
device driver.


Hmm... but you told me that you did not delay the allocation. :(


Hi Guangrong,

The allocation here is the allocation of device resource, and the only way to
access that kind of device resource is via a mmio region of some pages there.

For example, if VM needs resource A, and the only way to access resource A is
via some kind of device memory at mmio address X.

So, we never defer the allocation request during runtime, we just setup the
CPU mapping later when it actually gets accessed.



So it returns to my original question: why not allocate the physical mmio 
region in mmap()?



Without running anything inside the VM, how do you know how the hw resource gets
allocated? Therefore there is no knowledge of the use of the mmio region.


The allocation and mapping can be two independent processes:
- the first process is just allocation. The MMIO region is allocated from physical
  hardware and this region is mapped into _QEMU's_ arbitrary virtual address by
  mmap(). At this time, the VM can not actually use this resource.

- the second process is mapping. When the VM enables this region, e.g. it enables
  the PCI BAR, QEMU maps the virtual address returned by mmap() into the VM's
  physical memory. After that, the VM can access this region.

The second process is completely handled in userspace, which means the mediated
device driver needn't care how the resource is mapped into the VM.

This is how QEMU/VFIO currently works; could you please tell me the special points
of your solution compared with current QEMU/VFIO, and why the current model can not
fit your requirement, so that we can better understand your scenario?
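
(And for the first step, QEMU's mmap() of the virtual BAR is essentially the
following against the plain VFIO region API; device_fd and the region index
are placeholders:)

#include <linux/vfio.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/*
 * mmap() one region (e.g. a BAR) of a VFIO device into the VMM's address
 * space.  Nothing is necessarily mapped into the guest at this point.
 */
static void *map_vfio_region(int device_fd, unsigned int index, size_t *size)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = index,
	};

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
		return MAP_FAILED;

	*size = info.size;
	return mmap(NULL, info.size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    device_fd, info.offset);
}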


Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-04 Thread Xiao Guangrong



On 07/04/2016 11:33 PM, Neo Jia wrote:



Sorry, I think I misread the "allocation" as "mapping". We only delay the
cpu mapping, not the allocation.


So how to understand your statement:
"at that moment nobody has any knowledge about how the physical mmio gets 
virtualized"

The resource, physical MMIO region, has been allocated, why we do not know the 
physical
address mapped to the VM?




From a device driver point of view, the physical mmio region never gets allocated
until the corresponding resource is requested by clients and granted by the mediated
device driver.


Hmm... but you told me that you did not delay the allocation. :(

So it returns to my original question: why not allocate the physical mmio 
region in mmap()?






Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-04 Thread Xiao Guangrong



On 07/04/2016 05:16 PM, Neo Jia wrote:

On Mon, Jul 04, 2016 at 04:45:05PM +0800, Xiao Guangrong wrote:



On 07/04/2016 04:41 PM, Neo Jia wrote:

On Mon, Jul 04, 2016 at 04:19:20PM +0800, Xiao Guangrong wrote:



On 07/04/2016 03:53 PM, Neo Jia wrote:

On Mon, Jul 04, 2016 at 03:37:35PM +0800, Xiao Guangrong wrote:



On 07/04/2016 03:03 PM, Neo Jia wrote:

On Mon, Jul 04, 2016 at 02:39:22PM +0800, Xiao Guangrong wrote:



On 06/30/2016 09:01 PM, Paolo Bonzini wrote:

The vGPU folks would like to trap the first access to a BAR by setting
vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault handler
then can use remap_pfn_range to place some non-reserved pages in the VMA.


Why does it require fetching the pfn when the fault is triggered rather
than when mmap() is called?


Hi Guangrong,

as such mapping information between virtual mmio and physical mmio is only available
at runtime.


Sorry, i do not know what the difference is between mmap() and the time the VM
actually accesses the memory in your case. Could you please give more detail?


Hi Guangrong,

Sure. The mmap() gets called by qemu or any VFIO API userspace consumer when
setting up the virtual mmio; at that moment nobody has any knowledge about how
the physical mmio gets virtualized.

When the vm (or application, if we don't want to limit ourselves to the vmm term)
starts, the virtual and physical mmio get mapped by the mpci kernel module with
help from the vendor-supplied mediated host driver, according to the hw resource
assigned to this vm / application.


Thanks for your explanation.

It sounds like a resource-allocation strategy: you delay the allocation until the VM
really accesses it, right?


Yes, that is where the fault handler inside the mpci code comes into the picture.



I am not sure this strategy is good. The instance is created successfully, and it is
started successfully, but then the VM crashes because the resources for that instance
are not enough. That sounds unreasonable.



Sorry, I think I misread the "allocation" as "mapping". We only delay the
cpu mapping, not the allocation.


So how to understand your statement:
"at that moment nobody has any knowledge about how the physical mmio gets 
virtualized"

The resource, physical MMIO region, has been allocated, why we do not know the 
physical
address mapped to the VM?




Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-04 Thread Xiao Guangrong



On 07/04/2016 04:45 PM, Xiao Guangrong wrote:



On 07/04/2016 04:41 PM, Neo Jia wrote:

On Mon, Jul 04, 2016 at 04:19:20PM +0800, Xiao Guangrong wrote:



On 07/04/2016 03:53 PM, Neo Jia wrote:

On Mon, Jul 04, 2016 at 03:37:35PM +0800, Xiao Guangrong wrote:



On 07/04/2016 03:03 PM, Neo Jia wrote:

On Mon, Jul 04, 2016 at 02:39:22PM +0800, Xiao Guangrong wrote:



On 06/30/2016 09:01 PM, Paolo Bonzini wrote:

The vGPU folks would like to trap the first access to a BAR by setting
vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault handler
then can use remap_pfn_range to place some non-reserved pages in the VMA.


Why does it require fetching the pfn when the fault is triggered rather
than when mmap() is called?


Hi Guangrong,

as such mapping information between virtual mmio and physical mmio is only available
at runtime.


Sorry, i do not know what the difference is between mmap() and the time the VM
actually accesses the memory in your case. Could you please give more detail?


Hi Guangrong,

Sure. The mmap() gets called by qemu or any VFIO API userspace consumer when
setting up the virtual mmio; at that moment nobody has any knowledge about how
the physical mmio gets virtualized.

When the vm (or application, if we don't want to limit ourselves to the vmm term)
starts, the virtual and physical mmio get mapped by the mpci kernel module with
help from the vendor-supplied mediated host driver, according to the hw resource
assigned to this vm / application.


Thanks for your explanation.

It sounds like a resource-allocation strategy: you delay the allocation until the VM
really accesses it, right?


Yes, that is where the fault handler inside the mpci code comes into the picture.



I am not sure this strategy is good. The instance is created successfully, and it is
started successfully, but then the VM crashes because the resources for that instance
are not enough. That sounds unreasonable.



Especially, you can not squeeze this kind of memory to balance the usage 
between all VMs. Does
this strategy still make sense?



