Re: [Xen-devel] question: xen/qemu - mmio mapping issues for device pass-through

2017-03-21 Thread Venu Busireddy


> -----Original Message-----
> From: Konrad Rzeszutek Wilk
> Sent: Tuesday, March 21, 2017 08:19 AM
> To: Xuquan (Quan Xu); Venu Busireddy
> Cc: Jan Beulich; anthony.per...@citrix.com; george.dun...@eu.citrix.com;
> ian.jack...@eu.citrix.com; Fanhenglong; Kevin Tian; Stefano Stabellini;
> xen-devel@lists.xen.org
> Subject: Re: question: xen/qemu - mmio mapping issues for device pass-
> through
> 
> .. snip..
> > support for passing through large BAR (pci-e BAR > 4G) devices..
> 
> Yes it does work.
> >
> > > I was assuming large BAR handling to work so far
> > >(Konrad had done some adjustments there quite a while ago, from all I
> recall).
> > >
> >
> >
> > _iirc_ what Konrad mentioned was using qemu-trad..
> 
> Yes but we also did tests on qemu-xen and it worked. CCing Venu.
> 
> Venu, does passing in large BARs work with qemu-xen (aka 'xl')?

Sorry, I do not know the answer!

Venu


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] question: xen/qemu - mmio mapping issues for device pass-through

2017-03-21 Thread Konrad Rzeszutek Wilk
.. snip..
> support for passing through large BAR (pci-e BAR > 4G) devices..

Yes it does work. 
> 
> > I was assuming large BAR handling to work so far
> >(Konrad had done some adjustments there quite a while ago, from all I 
> >recall).
> >
> 
> 
> _iirc_ what Konrad mentioned was using qemu-trad..

Yes but we also did tests on qemu-xen and it worked. CCing Venu.

Venu, does passing in large BARs work with qemu-xen (aka 'xl')?

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] question: xen/qemu - mmio mapping issues for device pass-through

2017-03-21 Thread Jan Beulich
>>> On 21.03.17 at 02:53,  wrote:
> On March 20, 2017 3:35 PM, Jan Beulich wrote:
> On 20.03.17 at 02:58,  wrote:
>>> On March 16, 2017 11:32 PM, Jan Beulich wrote:
>>> On 16.03.17 at 15:21,  wrote:
> On March 16, 2017 10:06 PM, Jan Beulich wrote:
> On 16.03.17 at 14:55,  wrote:
>>> I try to pass through a device with an 8G large BAR, such as an nvidia
>>> M60 (note1, pci-e info as below). It takes about '__15 seconds__' to
>>> update the 8G large BAR in QEMU::xen_pt_region_update()..
>>> Specifically, it is xc_domain_memory_mapping() in
xen_pt_region_update().
>>>
>>> Digging into xc_domain_memory_mapping(), I find it mainly calls
>>> "do_domctl
>>> (…case XEN_DOMCTL_memory_mapping…)"
>>> to map the mmio region.. Of course, I also found, from the code comment
>>> below ' case XEN_DOMCTL_memory_mapping ', that this mapping could take a while.
>>>
>>> my questions:
>>> 1. could we make this mmio region mapping quicker?
>>
>
> Thanks for your quick reply.
>
>>Yes, e.g. by using large (2M or 1G) pages. This has been on my todo
>>list for quite a while...
>>
>>> 2. if not, does it limit by hardware performance?
>>
>>I'm afraid I don't understand the question. If you mean "Is it
>>limited by hw performance", then no, see above. If you mean "Does it
>>limit hw performance", then again no, I don't think so (other than
>>the effect of having more IOMMU translation levels than really
>>necessary for such a large region).
>>
>
> Sorry, my question is "Is it limited by hw performance"...
>
> I am rather confused: why does this mmio mapping take so long?
> I guess it takes a lot of time to set up the p2m / iommu entries. That's
> why I asked "Is it limited by hw performance".

Well, just count the number of page table entries and that of the
resulting hypercall continuations. It's the sheer amount of work
that's causing the slowness, together with the need for us to use
continuations to be on the safe side. There may well be redundant TLB
invalidations as well. Since we can do better (by using large
pages) I wouldn't call this "limited by hw performance", but of course
one may.

>>>
>>> I agree.
>>> So far as I know, xen upstream doesn't support passing through
>>> large BAR (pci-e BAR > 4G) devices, such as the nvidia M60. However, cloud
>>> providers may want to leverage this feature for machine learning, etc.
>>> Is it on your TODO list?
>>
>>Is what on my todo list?
> 
> support for passing through large BAR (pci-e BAR > 4G) devices..
> 
>> I was assuming large BAR handling to work so far
>>(Konrad had done some adjustments there quite a while ago, from all I 
> recall).
>>
> 
> 
> _iirc_ what Konrad mentioned was using qemu-trad..

Quite possible (albeit my memory says hvmloader), but the qemu
side (trad or upstream) isn't my realm anyway.

Jan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] question: xen/qemu - mmio mapping issues for device pass-through

2017-03-20 Thread Xuquan (Quan Xu)
On March 20, 2017 3:35 PM, Jan Beulich wrote:
 On 20.03.17 at 02:58,  wrote:
>> On March 16, 2017 11:32 PM, Jan Beulich wrote:
>> On 16.03.17 at 15:21,  wrote:
 On March 16, 2017 10:06 PM, Jan Beulich wrote:
 On 16.03.17 at 14:55,  wrote:
>> I try to pass through a device with an 8G large BAR, such as an nvidia
>> M60 (note1, pci-e info as below). It takes about '__15 seconds__' to
>> update the 8G large BAR in QEMU::xen_pt_region_update()..
>> Specifically, it is xc_domain_memory_mapping() in
>>>xen_pt_region_update().
>>
>> Digging into xc_domain_memory_mapping(), I find it mainly calls
>> "do_domctl
>> (…case XEN_DOMCTL_memory_mapping…)"
>> to map the mmio region.. Of course, I also found, from the code comment
>> below ' case XEN_DOMCTL_memory_mapping ', that this mapping could take a while.
>>
>> my questions:
>> 1. could we make this mmio region mapping quicker?
>

 Thanks for your quick reply.

>Yes, e.g. by using large (2M or 1G) pages. This has been on my todo
>list for quite a while...
>
>> 2. if not, does it limit by hardware performance?
>
>I'm afraid I don't understand the question. If you mean "Is it
>limited by hw performance", then no, see above. If you mean "Does it
>limit hw performance", then again no, I don't think so (other than
>the effect of having more IOMMU translation levels than really
>necessary for such a large region).
>

 Sorry, my question is "Is it limited by hw performance"...

 I am rather confused: why does this mmio mapping take so long?
 I guess it takes a lot of time to set up the p2m / iommu entries. That's
 why I asked "Is it limited by hw performance".
>>>
>>>Well, just count the number of page table entries and that of the
>>>resulting hypercall continuations. It's the sheer amount of work
>>>that's causing the slowness, together with the need for us to use
>>>continuations to be on the safe side. There may well be redundant TLB
>>>invalidations as well. Since we can do better (by using large
>>>pages) I wouldn't call this "limited by hw performance", but of course
>>>one may.
>>>
>>
>> I agree.
>> So far as I know, xen upstream doesn't support passing through
>> large BAR (pci-e BAR > 4G) devices, such as the nvidia M60. However, cloud
>> providers may want to leverage this feature for machine learning, etc.
>> Is it on your TODO list?
>
>Is what on my todo list?

support for passing through large BAR (pci-e BAR > 4G) devices..

> I was assuming large BAR handling to work so far
>(Konrad had done some adjustments there quite a while ago, from all I recall).
>


_iirc_ what Konrad mentioned was using qemu-trad..


Quan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] question: xen/qemu - mmio mapping issues for device pass-through

2017-03-20 Thread Jan Beulich
>>> On 20.03.17 at 02:58,  wrote:
> On March 16, 2017 11:32 PM, Jan Beulich wrote:
> On 16.03.17 at 15:21,  wrote:
>>> On March 16, 2017 10:06 PM, Jan Beulich wrote:
>>> On 16.03.17 at 14:55,  wrote:
> I try to pass through a device with an 8G large BAR, such as an nvidia
> M60 (note1, pci-e info as below). It takes about '__15 seconds__' to
> update the 8G large BAR in QEMU::xen_pt_region_update()..
> Specifically, it is xc_domain_memory_mapping() in
>>xen_pt_region_update().
>
> Digging into xc_domain_memory_mapping(), I find it mainly calls
> "do_domctl
> (…case XEN_DOMCTL_memory_mapping…)"
> to map the mmio region.. Of course, I also found, from the code comment
> below ' case XEN_DOMCTL_memory_mapping ', that this mapping could take a while.
>
> my questions:
> 1. could we make this mmio region mapping quicker?

>>>
>>> Thanks for your quick reply.
>>>
Yes, e.g. by using large (2M or 1G) pages. This has been on my todo
list for quite a while...

> 2. if not, does it limit by hardware performance?

I'm afraid I don't understand the question. If you mean "Is it limited
by hw performance", then no, see above. If you mean "Does it limit hw
performance", then again no, I don't think so (other than the effect
of having more IOMMU translation levels than really necessary for such a large region).

>>>
>>> Sorry, my question is "Is it limited by hw performance"...
>>>
>>> I am rather confused: why does this mmio mapping take so long?
>>> I guess it takes a lot of time to set up the p2m / iommu entries. That's
>>> why I asked "Is it limited by hw performance".
>>
>>Well, just count the number of page table entries and that of the resulting
>>hypercall continuations. It's the sheer amount of work that's causing the
>>slowness, together with the need for us to use continuations to be on the safe
>>side. There may well be redundant TLB invalidations as well. Since we can do
>>better (by using large
>>pages) I wouldn't call this "limited by hw performance", but of course one
>>may.
>>
> 
> I agree.
> So far as I know, xen upstream doesn't support passing through large BAR
> (pci-e BAR > 4G) devices, such as the nvidia M60.
> However, cloud providers may want to leverage this feature for machine
> learning, etc.
> Is it on your TODO list?

Is what on my todo list? I was assuming large BAR handling to work
so far (Konrad had done some adjustments there quite a while ago,
from all I recall).

Jan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] question: xen/qemu - mmio mapping issues for device pass-through

2017-03-19 Thread Xuquan (Quan Xu)
On March 16, 2017 11:32 PM, Jan Beulich wrote:
 On 16.03.17 at 15:21,  wrote:
>> On March 16, 2017 10:06 PM, Jan Beulich wrote:
>> On 16.03.17 at 14:55,  wrote:
 I try to pass through a device with an 8G large BAR, such as an nvidia
 M60 (note1, pci-e info as below). It takes about '__15 seconds__' to
 update the 8G large BAR in QEMU::xen_pt_region_update()..
 Specifically, it is xc_domain_memory_mapping() in
>xen_pt_region_update().

 Digging into xc_domain_memory_mapping(), I find it mainly calls
 "do_domctl
 (…case XEN_DOMCTL_memory_mapping…)"
 to map the mmio region.. Of course, I also found, from the code comment
 below ' case XEN_DOMCTL_memory_mapping ', that this mapping could take a while.

 my questions:
 1. could we make this mmio region mapping quicker?
>>>
>>
>> Thanks for your quick reply.
>>
>>>Yes, e.g. by using large (2M or 1G) pages. This has been on my todo
>>>list for quite a while...
>>>
 2. if not, does it limit by hardware performance?
>>>
>>>I'm afraid I don't understand the question. If you mean "Is it limited
>>>by hw performance", then no, see above. If you mean "Does it limit hw
>>>performance", then again no, I don't think so (other than the effect
>>>of having more IOMMU translation levels than really necessary for such a
>>>large region).
>>>
>>
>> Sorry, my question is "Is it limited by hw performance"...
>>
>> I am rather confused: why does this mmio mapping take so long?
>> I guess it takes a lot of time to set up the p2m / iommu entries. That's
>> why I asked "Is it limited by hw performance".
>
>Well, just count the number of page table entries and that of the resulting
>hypercall continuations. It's the sheer amount of work that's causing the
>slowness, together with the need for us to use continuations to be on the safe
>side. There may well be redundant TLB invalidations as well. Since we can do
>better (by using large
>pages) I wouldn't call this "limited by hw performance", but of course one
>may.
>

I agree.
So far as I know, xen upstream doesn't support passing through large BAR
(pci-e BAR > 4G) devices, such as the nvidia M60.
However, cloud providers may want to leverage this feature for machine learning,
etc.
Is it on your TODO list?

Quan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] question: xen/qemu - mmio mapping issues for device pass-through

2017-03-16 Thread Jan Beulich
>>> On 16.03.17 at 15:21,  wrote:
> On March 16, 2017 10:06 PM, Jan Beulich wrote:
> On 16.03.17 at 14:55,  wrote:
>>> I try to pass through a device with an 8G large BAR, such as an nvidia
>>> M60 (note1, pci-e info as below). It takes about '__15 seconds__' to
>>> update the 8G large BAR in QEMU::xen_pt_region_update()..
>>> Specifically, it is xc_domain_memory_mapping() in xen_pt_region_update().
>>>
>>> Digging into xc_domain_memory_mapping(), I find it mainly calls
>>> "do_domctl
>>> (…case XEN_DOMCTL_memory_mapping…)"
>>> to map the mmio region.. Of course, I also found, from the code comment
>>> below ' case XEN_DOMCTL_memory_mapping ', that this mapping could take a while.
>>>
>>> my questions:
>>> 1. could we make this mmio region mapping quicker?
>>
> 
> Thanks for your quick reply.
> 
>>Yes, e.g. by using large (2M or 1G) pages. This has been on my todo list for
>>quite a while...
>>
>>> 2. if not, does it limit by hardware performance?
>>
>>I'm afraid I don't understand the question. If you mean "Is it limited by hw
>>performance", then no, see above. If you mean "Does it limit hw performance",
>>then again no, I don't think so (other than the effect of having more IOMMU
>>translation levels than really necessary for such a large region).
>>
> 
> Sorry, my question is "Is it limited by hw performance"...
> 
> I am rather confused: why does this mmio mapping take so long?
> I guess it takes a lot of time to set up the p2m / iommu entries. That's why I
> asked "Is it limited by hw performance".

Well, just count the number of page table entries and that of the
resulting hypercall continuations. It's the sheer amount of work
that's causing the slowness, together with the need for us to use
continuations to be on the safe side. There may well be redundant
TLB invalidations as well. Since we can do better (by using large
pages) I wouldn't call this "limited by hw performance", but of
course one may.
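
To put rough numbers on the "sheer amount of work": a quick back-of-the-envelope
sketch (a standalone illustration, not Xen code) of how many p2m/IOMMU entries an
8G BAR needs at each page size:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t bar = 8ULL << 30;                      /* 8 GiB BAR, as reported */
    const uint64_t page[] = { 4ULL << 10, 2ULL << 20, 1ULL << 30 };
    const char *name[]   = { "4 KiB", "2 MiB", "1 GiB" };

    /* Entries needed to map the whole BAR at each page size. */
    for (int i = 0; i < 3; i++)
        printf("%s pages: %llu entries\n", name[i],
               (unsigned long long)(bar / page[i]));
    return 0;
}

That is 2,097,152 entries with 4 KiB pages, versus 4,096 with 2 MiB pages or just
8 with 1 GiB pages, which is why large-page mappings would cut both the entry count
and the number of hypercall continuations so dramatically.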

Jan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] question: xen/qemu - mmio mapping issues for device pass-through

2017-03-16 Thread Xuquan (Quan Xu)
On March 16, 2017 10:06 PM, Jan Beulich wrote:
 On 16.03.17 at 14:55,  wrote:
>> I try to pass through a device with an 8G large BAR, such as an nvidia
>> M60 (note1, pci-e info as below). It takes about '__15 seconds__' to
>> update the 8G large BAR in QEMU::xen_pt_region_update()..
>> Specifically, it is xc_domain_memory_mapping() in xen_pt_region_update().
>>
>> Digging into xc_domain_memory_mapping(), I find it mainly calls
>> "do_domctl
>> (…case XEN_DOMCTL_memory_mapping…)"
>> to map the mmio region.. Of course, I also found, from the code comment
>> below ' case XEN_DOMCTL_memory_mapping ', that this mapping could take a while.
>>
>> my questions:
>> 1. could we make this mmio region mapping quicker?
>

Thanks for your quick reply.

>Yes, e.g. by using large (2M or 1G) pages. This has been on my todo list for
>quite a while...
>
>> 2. if not, does it limit by hardware performance?
>
>I'm afraid I don't understand the question. If you mean "Is it limited by hw
>performance", then no, see above. If you mean "Does it limit hw performance",
>then again no, I don't think so (other than the effect of having more IOMMU
>translation levels than really necessary for such a large region).
>

Sorry, my question is "Is it limited by hw performance"...

I am rather confused: why does this mmio mapping take so long?
I guess it takes a lot of time to set up the p2m / iommu entries. That's why I asked
"Is it limited by hw performance".

Quan
___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] question: xen/qemu - mmio mapping issues for device pass-through

2017-03-16 Thread Jan Beulich
>>> On 16.03.17 at 14:55,  wrote:
> I try to pass through a device with an 8G large BAR, such as an nvidia M60 (note1,
> pci-e info as below). It takes about '__15 seconds__' to update the 8G large BAR
> in QEMU::xen_pt_region_update()..
> Specifically, it is xc_domain_memory_mapping() in xen_pt_region_update().
> 
> Digging into xc_domain_memory_mapping(), I find it mainly calls "do_domctl
> (…case XEN_DOMCTL_memory_mapping…)"
> to map the mmio region.. Of course, I also found, from the code comment below
> ' case XEN_DOMCTL_memory_mapping ', that this mapping could take a while.
> 
> my questions:
> 1. could we make this mmio region mapping quicker?

Yes, e.g. by using large (2M or 1G) pages. This has been on my todo
list for quite a while...

> 2. if not, does it limit by hardware performance?

I'm afraid I don't understand the question. If you mean "Is it
limited by hw performance", then no, see above. If you mean
"Does it limit hw performance", then again no, I don't think so
(other than the effect of having more IOMMU translation levels
than really necessary for such a large region).
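
For reference, the path the report traces (QEMU's xen_pt_region_update() ->
xc_domain_memory_mapping() -> do_domctl(...XEN_DOMCTL_memory_mapping...)) boils
down to one libxc call per BAR region. A minimal sketch of timing that call from a
standalone tool, assuming the libxc interface as found in tools/libxc and using
placeholder domid/gfn/mfn values, might look like:

/* Sketch only: link against libxenctrl; domid/gfn/mfn below are placeholders. */
#include <stdio.h>
#include <time.h>
#include <xenctrl.h>

int main(void)
{
    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
    if (!xch)
        return 1;

    uint32_t domid = 1;                     /* placeholder guest domain */
    unsigned long gfn = 0x100000;           /* placeholder guest frame of the BAR */
    unsigned long mfn = 0x800000;           /* placeholder machine frame of the BAR */
    unsigned long nr = (8ULL << 30) >> 12;  /* 8 GiB BAR = 2,097,152 4 KiB frames */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    int rc = xc_domain_memory_mapping(xch, domid, gfn, mfn, nr,
                                      1 /* add mapping */);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("rc=%d, took %.2f s\n", rc,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

    xc_interface_close(xch);
    return 0;
}

Since this is the same call xen_pt_region_update() ends up making, the multi-second
stall reported above should be reproducible and measurable outside QEMU this way.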

Jan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel