[iscsiadm] iscsiadm creates multiple identical sessions when run with the --login option in parallel.

2017-09-28 Thread Tangchen (UVP)
Hi guys,

If we run the iscsiadm -m node --login command against the same IP address 4 
times sequentially, only one session is created.
But if we run the same commands in parallel, 4 identical sessions can be created.
( Here, xxx.xxx.xxx.xxx is the IP address of the IP SAN. I'm using the same IP 
in all 4 commands. )

# iscsiadm -m node -p xxx.xxx.xxx.xxx  --login &
# iscsiadm -m node -p xxx.xxx.xxx.xxx  --login &
# iscsiadm -m node -p xxx.xxx.xxx.xxx  --login &
# iscsiadm -m node -p xxx.xxx.xxx.xxx  --login &
Logging in to [iface: default, target: iqn. xxx.xxx.xxx.xxx, portal: 
xxx.xxx.xxx.xxx] (multiple)
Logging in to [iface: default, target: iqn. xxx.xxx.xxx.xxx, portal: 
xxx.xxx.xxx.xxx] (multiple)
Logging in to [iface: default, target: iqn. xxx.xxx.xxx.xxx, portal: 
xxx.xxx.xxx.xxx] (multiple)
Logging in to [iface: default, target: iqn. xxx.xxx.xxx.xxx, portal: 
xxx.xxx.xxx.xxx] (multiple)
Login to [iface: default, target: xxx.xxx.xxx.xxx, portal: xxx.xxx.xxx.xxx] 
successful.
Login to [iface: default, target: xxx.xxx.xxx.xxx, portal: xxx.xxx.xxx.xxx] 
successful.
Login to [iface: default, target: xxx.xxx.xxx.xxx, portal: xxx.xxx.xxx.xxx] 
successful.
Login to [iface: default, target: xxx.xxx.xxx.xxx, portal: xxx.xxx.xxx.xxx] 
successful.

# iscsiadm -m session
tcp: [1] xxx.xxx.xxx.xxx (non-flash)
tcp: [2] xxx.xxx.xxx.xxx (non-flash)
tcp: [3] xxx.xxx.xxx.xxx (non-flash)
tcp: [4] xxx.xxx.xxx.xxx (non-flash)

If we check the connections in /proc/net/nf_conntrack, there are 4 TCP 
connections with different source ports.
And if we run the logout command only once, all 4 sessions are destroyed.

Unfortunately, services like multipathd cannot tell the difference between them. 
If we have 4 identical sessions, 4 paths are created to the dm device, but they 
are actually the same path.

Looking at the code, the iscsiadm command does try to prevent creating a 
duplicate session by checking the /sys/class/iscsi_session/ directory.
But there is no protection there against concurrent processes, so the check can 
be raced.

Any idea how to solve this problem?

Thanks.


RE: Re: [iscsi] Deadlock occurred when network is in error

2017-08-15 Thread Tangchen (UVP)
> On Tue, 2017-08-15 at 02:16 +0000, Tangchen (UVP) wrote:
> > But I'm not using mq, and I run into these two problems in a non-mq system.
> > The patch you pointed out is fix for mq, so I don't think it can resolve 
> > this
> problem.
> >
> > IIUC, mq is for SSD ?  I'm not using ssd, so mq is disabled.
> 
> Hello Tangchen,
> 
> Please post replies below the original e-mail instead of above - that is the 
> reply
> style used on all Linux-related mailing lists I know of. From
> https://en.wikipedia.org/wiki/Posting_style:
> 
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?
> A: Top-posting.
> Q: What is the most annoying thing in e-mail?

Hi Bart,

Thanks for the reply. I will post replies below the original e-mail from now on. :)

> 
> Regarding your question: sorry but I quoted the wrong commit in my previous
> e-mail. The commit I should have referred to is 255ee9320e5d ("scsi: Make
> __scsi_remove_device go straight from BLOCKED to DEL"). That patch not only
> affects scsi-mq but also the single-queue code in the SCSI core.

OK, I'll try this one. Thx.

> 
> blk-mq/scsi-mq was introduced for SSDs but is not only intended for SSDs.
> The plan is to remove the blk-sq/scsi-sq code once the blk-mq/scsi-mq code
> works at least as fast as the single queue code for all supported devices.
> That includes hard disks.

OK, thanks for telling me this.

> 
> Bart.


Re: [iscsi] Deadlock occurred when network is in error

2017-08-14 Thread Tangchen (UVP)
Hi, Bart,

Thank you very much for the quick response. 

But I'm not using mq, and I ran into these two problems on a non-mq system.
The patch you pointed out is a fix for mq, so I don't think it can resolve this 
problem.

IIUC, mq is for SSDs? I'm not using SSDs, so mq is disabled.


On Mon, 2017-08-14 at 11:23 +, Tangchen (UVP) wrote:
> Problem 2:
> 
> ***
> [What it looks like]
> ***
> When removing a scsi device while a network error happens, __blk_drain_queue() 
> could hang forever.
> 
> # cat /proc/19160/stack
> [] msleep+0x1d/0x30
> [] __blk_drain_queue+0xe4/0x160
> [] blk_cleanup_queue+0x106/0x2e0
> [] __scsi_remove_device+0x52/0xc0 [scsi_mod]
> [] scsi_remove_device+0x2b/0x40 [scsi_mod]
> [] sdev_store_delete_callback+0x10/0x20 [scsi_mod]
> [] sysfs_schedule_callback_work+0x15/0x80
> [] process_one_work+0x169/0x340
> [] worker_thread+0x183/0x490
> [] kthread+0x96/0xa0
> [] kernel_thread_helper+0x4/0x10
> [] 0x
> 
> The request queue of this device was stopped. So the following check will be 
> true forever:
> __blk_run_queue()
> {
> if (unlikely(blk_queue_stopped(q)))
> return;
> 
> __blk_run_queue_uncond(q);
> }
> 
> So __blk_run_queue_uncond() will never be called, and the process hangs.
> 
> [ ... ]
>
> 
> [How to reproduce]
> 
> Unfortunately I cannot reproduce it in the latest kernel. 
> The script below will help to reproduce, but not very often.
> 
> # create network error
> tc qdisc add dev eth1 root netem loss 60%
> 
> # restart iscsid and rescan scsi bus again and again
> while [ 1 ]
> do
> systemctl restart iscsid
> rescan-scsi-bus
> done
> (rescan-scsi-bus: http://manpages.ubuntu.com/manpages/trusty/man8/rescan-scsi-bus.8.html)

This should have been fixed by commit 36e3cf273977 ("scsi: Avoid that SCSI 
queues get stuck"). The first mainline kernel that includes this commit is 
kernel v4.11.

> void __blk_run_queue(struct request_queue *q) {
> -   if (unlikely(blk_queue_stopped(q)))
> +   if (unlikely(blk_queue_stopped(q)) && 
> + unlikely(!blk_queue_dying(q)))
> return;
> 
> __blk_run_queue_uncond(q);

Are you aware that the single queue block layer is on its way out and will be 
removed sooner or later? Please focus your testing on scsi-mq. 

Regarding the above patch: it is wrong because it will cause lockups during 
path removal for other block drivers. Please drop this patch.

Bart.


[iscsi] Deadlock occurred when network is in error

2017-08-14 Thread Tangchen (UVP)
Hi,

I found two hang problems between the iscsid service and the iscsi kernel 
module, and I can always reproduce one of them on the latest kernel. So I think 
the problems really exist.

It took me a long time to find out why, due to my lack of knowledge of iscsi, 
and I cannot find a good way to solve them both.

Please help take a look at them. Thanks.

=
Problem 1:

***
[What it looks like]
***
First, we connect to 10 remote LUNs with the iscsid service, using at least two 
different sessions. 
When a network error occurs, the sessions can go into the error state. If we 
then do login and logout, the iscsid service can go into D state.

My colleague posted an email to report this problem before, including a long 
call trace, but it barely got any feedback.
(https://lkml.org/lkml/2017/6/19/330)


**
[Why it happens]
**
In the latest kernel, the asynchronous part of sd_probe() is executed
in scsi_sd_probe_domain, and sd_remove() waits until all the
works in scsi_sd_probe_domain have finished. When we use iscsi-based
remote storage and the network is broken, the following deadlock
can happen.

1. An iscsi session login is in progress, and calls sd_probe() to
   probe a remote lun. The synchronous part has finished, and the
   asynchronous part is scheduled in scsi_sd_probe_domain, and will
   submit io to execute scsi cmd to obtain device info. When the
   network is broken, the session will go into ISCSI_SESSION_FAILED
   state, and the io will retry until the session becomes
   ISCSI_SESSION_FREE. As a result, the work in scsi_sd_probe_domain
   hangs.

2. On the other hand, iscsi kernel module detects network ping
   timeout, and triggers ISCSI_KEVENT_CONN_ERROR event. iscsid in
   user space will handle this event by triggering
   ISCSI_UEVENT_DESTROY_SESSION event. Destroy session process is
   synchronous, and when it calls sd_remove() to remove the lun,
   it waits until all the works in scsi_sd_probe_domain finish. As
   a result, it hangs, and iscsid in user space goes into D state
   which is not killable, and not able to handle all the other
   events.



[How to reproduce]

With the script below, I can always reproduce it in the latest kernel.

# create network errors
tc qdisc add dev eth1 root netem loss 60%

while true
do
iscsiadm -m node -T xx --login
sleep 5
iscsiadm -m node -T xx --logout &
iscsiadm -m node -T yy --login &
done

xx and yy are two different target names.

Connecting to about 10 remote LUNs and running the script for about half an 
hour will reproduce the problem.


***
[How I avoid it for now]
***
To avoid this problem, I simply removed scsi_sd_probe_domain and called 
sd_probe_async() synchronously in sd_probe(), so sd_remove() doesn't need to 
wait for the domain anymore.

@@ -2986,7 +2986,40 @@ static int sd_probe(struct device *dev)
get_device(&sdkp->dev); /* prevent release before async_schedule */
-   async_schedule_domain(sd_probe_async, sdkp, &scsi_sd_probe_domain);
+   sd_probe_async((void *)sdkp, 0);

I know this is not a good way, so would you please give some advice about it?



=
Problem 2:

***
[What it looks like]
***
When removing a scsi device while a network error happens, __blk_drain_queue() 
could hang forever.

# cat /proc/19160/stack 
[] msleep+0x1d/0x30
[] __blk_drain_queue+0xe4/0x160
[] blk_cleanup_queue+0x106/0x2e0
[] __scsi_remove_device+0x52/0xc0 [scsi_mod]
[] scsi_remove_device+0x2b/0x40 [scsi_mod]
[] sdev_store_delete_callback+0x10/0x20 [scsi_mod]
[] sysfs_schedule_callback_work+0x15/0x80
[] process_one_work+0x169/0x340
[] worker_thread+0x183/0x490
[] kthread+0x96/0xa0
[] kernel_thread_helper+0x4/0x10
[] 0x

The request queue of this device was stopped. So the following check will be 
true forever:
__blk_run_queue()
{
if (unlikely(blk_queue_stopped(q)))
return;

__blk_run_queue_uncond(q);
}

So __blk_run_queue_uncond() will never be called, and the process hangs.


**
[Why it happens]
**
When the network error happens, the iscsi kernel module detects the ping 
timeout and tries to recover the session. Here, the queue is stopped; you can 
also say the session is blocked.

iscsi_start_session_recovery(session, conn, flag);
|-> iscsi_block_session(session->cls_session);
   |-> blk_stop_queue(q)

The session should be unblocked when the session is recovered or the recovery 
times out.
But it was not unblocked properly, because scsi_remove_device() deleted the 
device first and then called __blk_drain_queue(). 

__scsi_remove_device()
|-> device_del(dev)
|-> blk_cleanup_queue()
  |-> scsi_request_fn()
|-> __blk_drain_queue()

At this time, the device was not on the children list of the parent device. So 
when 
__iscsi_unblock_session() tried to unblock the parent device and its children, 

Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().

2014-09-11 Thread tangchen

Hi Paolo,

On 09/11/2014 10:24 PM, Paolo Bonzini wrote:

On 11/09/2014 16:21, Gleb Natapov wrote:

As far as I can tell the if that is needed there is:

if (!is_guest_mode() || !(vmcs12->secondary_vm_exec_control & 
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
 write(APIC_ACCESS_ADDR)

In other words if L2 shares L1 apic access page then reload, otherwise do 
nothing.

What if the page being swapped out is L1's APIC access page?  We don't
run prepare_vmcs12 in that case because it's an L2->L0->L2 entry, so we
need to "do something".


Are you talking about the case where L1 and L2 have different apic pages?
I think I didn't deal with this situation in this patch set.

Sorry I didn't say it clearly. Here, I assume L1 and L2 share the same 
apic page.
If we are in L2 and the page is migrated, we update L2's vmcs by making 
a vcpu request. And of course, we should also update L1's vmcs. This is 
done by patch 5.

We make the vcpu request again in nested_vmx_exit().

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().

2014-09-11 Thread tangchen

Hi Gleb, Paolo,

On 09/11/2014 10:47 PM, Gleb Natapov wrote:

On Thu, Sep 11, 2014 at 04:37:39PM +0200, Paolo Bonzini wrote:

On 11/09/2014 16:31, Gleb Natapov wrote:

What if the page being swapped out is L1's APIC access page?  We don't
run prepare_vmcs12 in that case because it's an L2->L0->L2 entry, so we
need to "do something".

We will do something on L2->L1 exit. We will call kvm_reload_apic_access_page().
That is what patch 5 of this series is doing.

Sorry, I meant "the APIC access page prepared by L1" for L2's execution.

You wrote:


if (!is_guest_mode() || !(vmcs12->secondary_vm_exec_control & 
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
 write(APIC_ACCESS_ADDR)

In other words if L2 shares L1 apic access page then reload, otherwise do 
nothing.

but in that case you have to redo nested_get_page, so "do nothing"
doesn't work.


Ah, 7/7 is new in this submission. Before that this page was still
pinned.  Looking at 7/7 now I do not see how it can work since it has no
code for mmu notifier to detect that it deals with such page and call
kvm_reload_apic_access_page().


Since L1 and L2 share one apic page, if the page is unmapped, mmu_notifier 
will be called, and:

 - if vcpu is in L1, an L1->L0 exit is raised. The apic page's pa will be 
   updated in the next L0->L1 entry by making a vcpu request.

 - if vcpu is in L2 (is_guest_mode, right?), an L2->L0 exit is raised. 
   nested_vmx_vmexit() will not be called, since it is called in the L2->L1 
   exit. It returns from vmx_vcpu_run() directly, right? So we should update 
   the apic page in the L0->L2 entry. This is also done by making a vcpu 
   request, right?

   prepare_vmcs02() is called in the L1->L2 entry, and nested_vmx_vmexit() 
   is called in the L2->L1 exit. So we also need to update L1's vmcs in 
   nested_vmx_vmexit() in patch 5/7.

IIUC, I think patches 1~6 have done such things.

And yes, the is_guest_mode() check is not needed.


I said to Tang previously that nested
kvm has a bunch of pinned page that are hard to deal with and suggested
to iron out non nested case first :(


Yes, and maybe adding patch 7 is not a good idea for now.

Thanks.


Re: [PATCH v5 3/7] kvm: Make init_rmode_identity_map() return 0 on success.

2014-09-11 Thread tangchen


On 09/11/2014 05:17 PM, Paolo Bonzini wrote:

..
@@ -7645,7 +7642,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, 
unsigned int id)
kvm->arch.ept_identity_map_addr =
VMX_EPT_IDENTITY_PAGETABLE_ADDR;
err = -ENOMEM;
-   if (!init_rmode_identity_map(kvm))
+   if (init_rmode_identity_map(kvm))
Please add "< 0" here.  I would also consider setting err to the return
value of init_rmode_identity_map, and initializing it to -ENOMEM only
after the "if".


I'd like to move err = -ENOMEM to the following place:

vmx_create_vcpu()
{
..
err = kvm_vcpu_init(&vmx->vcpu, kvm, id);
if (err)
goto free_vcpu;

err = -ENOMEM;  -- move it here

vmx->guest_msrs = kmalloc(PAGE_SIZE, GFP_KERNEL);

vmx->loaded_vmcs->vmcs = alloc_vmcs();

}

So that it can be used to handle the next two memory allocation errors.

Thanks.


Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().

2014-09-11 Thread tangchen


On 09/11/2014 05:21 PM, Paolo Bonzini wrote:

On 11/09/2014 07:38, Tang Chen wrote:

apic access page is pinned in memory. As a result, it cannot be 
migrated/hot-removed.
Actually, it is not necessary to be pinned.

The hpa of apic access page is stored in VMCS APIC_ACCESS_ADDR pointer. When
the page is migrated, kvm_mmu_notifier_invalidate_page() will invalidate the
corresponding ept entry. This patch introduces a new vcpu request named
KVM_REQ_APIC_PAGE_RELOAD, and makes this request to all the vcpus at this time,
and force all the vcpus exit guest, and re-enter guest till they updates the 
VMCS
APIC_ACCESS_ADDR pointer to the new apic access page address, and updates
kvm->arch.apic_access_page to the new page.

Signed-off-by: Tang Chen 
---
  arch/x86/include/asm/kvm_host.h |  1 +
  arch/x86/kvm/svm.c  |  6 ++
  arch/x86/kvm/vmx.c  |  6 ++
  arch/x86/kvm/x86.c  | 15 +++
  include/linux/kvm_host.h|  2 ++
  virt/kvm/kvm_main.c | 12 
  6 files changed, 42 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 35171c7..514183e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -739,6 +739,7 @@ struct kvm_x86_ops {
void (*hwapic_isr_update)(struct kvm *kvm, int isr);
void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
+   void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa);
void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1d941ad..f2eacc4 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3619,6 +3619,11 @@ static void svm_set_virtual_x2apic_mode(struct kvm_vcpu 
*vcpu, bool set)
return;
  }
  
+static void svm_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)

+{
+   return;
+}
+
  static int svm_vm_has_apicv(struct kvm *kvm)
  {
return 0;
@@ -4373,6 +4378,7 @@ static struct kvm_x86_ops svm_x86_ops = {
.enable_irq_window = enable_irq_window,
.update_cr8_intercept = update_cr8_intercept,
.set_virtual_x2apic_mode = svm_set_virtual_x2apic_mode,
+   .set_apic_access_page_addr = svm_set_apic_access_page_addr,
.vm_has_apicv = svm_vm_has_apicv,
.load_eoi_exitmap = svm_load_eoi_exitmap,
.hwapic_isr_update = svm_hwapic_isr_update,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 63c4c3e..da6d55d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7093,6 +7093,11 @@ static void vmx_set_virtual_x2apic_mode(struct kvm_vcpu 
*vcpu, bool set)
vmx_set_msr_bitmap(vcpu);
  }
  
+static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)

+{
+   vmcs_write64(APIC_ACCESS_ADDR, hpa);

This has to be guarded by "if (!is_guest_mode(vcpu))".


Since we cannot get the vcpu through kvm, I'd like to move this check to 
vcpu_reload_apic_access_page(), where 
kvm_x86_ops->set_apic_access_page_addr() is called.

Thanks.



+}
+
  static void vmx_hwapic_isr_update(struct kvm *kvm, int isr)
  {
u16 status;
@@ -8910,6 +8915,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
.enable_irq_window = enable_irq_window,
.update_cr8_intercept = update_cr8_intercept,
.set_virtual_x2apic_mode = vmx_set_virtual_x2apic_mode,
+   .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
.vm_has_apicv = vmx_vm_has_apicv,
.load_eoi_exitmap = vmx_load_eoi_exitmap,
.hwapic_irr_update = vmx_hwapic_irr_update,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e05bd58..96f4188 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5989,6 +5989,19 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
kvm_apic_update_tmr(vcpu, tmr);
  }
  
+static void vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)

+{
+   /*
+* apic access page could be migrated. When the page is being migrated,
+* GUP will wait till the migrate entry is replaced with the new pte
+* entry pointing to the new page.
+*/
+   vcpu->kvm->arch.apic_access_page = gfn_to_page(vcpu->kvm,
+   APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
+   kvm_x86_ops->set_apic_access_page_addr(vcpu->kvm,
+   page_to_phys(vcpu->kvm->arch.apic_access_page));
+}
+
  /*
   * Returns 1 to let __vcpu_run() continue the guest execution loop without
   * exiting to the userspace.  Otherwise, the value will be returned to the
@@ -6049,6 +6062,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_deliver_pmi(vcpu);
if (kvm_check_request(KVM_REQ_SCAN_IOAPIC, vcpu))
   

Re: [PATCH v5 7/7] kvm, mem-hotplug: Unpin and remove nested_vmx->apic_access_page.

2014-09-11 Thread tangchen


On 09/11/2014 05:33 PM, Paolo Bonzini wrote:

This patch is not against the latest KVM tree.  The call to
nested_get_page is now in nested_get_vmcs12_pages, and you have to
handle virtual_apic_page in a similar manner.

Hi Paolo,

Thanks for the review.

This patch set is against Linux v3.17-rc4.
I will rebase it against the latest KVM tree and resend the patch set 
following your comments.


Thanks.


Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().

2014-09-11 Thread tangchen

Hi Gleb, Paolo,

On 09/11/2014 10:47 PM, Gleb Natapov wrote:

On Thu, Sep 11, 2014 at 04:37:39PM +0200, Paolo Bonzini wrote:

Il 11/09/2014 16:31, Gleb Natapov ha scritto:

What if the page being swapped out is L1's APIC access page?  We don't
run prepare_vmcs12 in that case because it's an L2->L0->L2 entry, so we
need to do something.

We will do something on L2->L1 exit. We will call kvm_reload_apic_access_page().
That is what patch 5 of this series is doing.

Sorry, I meant the APIC access page prepared by L1 for L2's execution.

You wrote:


if (!is_guest_mode() || !(vmcs12->secondary_vm_exec_control &
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
 write(APIC_ACCESS_ADDR)

In other words if L2 shares L1 apic access page then reload, otherwise do 
nothing.

but in that case you have to redo nested_get_page, so "do nothing"
doesn't work.


Ah, 7/7 is new in this submission. Before that this page was still
pinned.  Looking at 7/7 now I do not see how it can work since it has no
code for mmu notifier to detect that it deals with such page and call
kvm_reload_apic_access_page().


Since L1 and L2 share one apic page, if the page is unmapped,
mmu_notifier will be called, and:

 - if vcpu is in L1, an L1->L0 exit is raised. The apic page's pa will be
   updated in the next L0->L1 entry by making a vcpu request.

 - if vcpu is in L2 (is_guest_mode, right?), an L2->L0 exit is raised.
   nested_vmx_vmexit() will not be called since it is only called in the
   L2->L1 exit. It returns from vmx_vcpu_run() directly, right? So we
   should update the apic page in the L0->L2 entry. This is also done
   by making a vcpu request, right?

   prepare_vmcs02() is called in the L1->L2 entry, and nested_vmx_vmexit()
   is called in the L2->L1 exit. So we also need to update L1's vmcs in
   nested_vmx_vmexit() in patch 5/7.


IIUC, I think patches 1~6 have done such things.

And yes, the is_guest_mode() check is not needed.


I said to Tang previously that nested
kvm has a bunch of pinned pages that are hard to deal with and suggested
to iron out the non-nested case first :(


Yes, and maybe adding patch 7 is not a good idea for now.

Thanks.


Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().

2014-09-11 Thread tangchen

Hi Paolo,

On 09/11/2014 10:24 PM, Paolo Bonzini wrote:

Il 11/09/2014 16:21, Gleb Natapov ha scritto:

As far as I can tell the if that is needed there is:

if (!is_guest_mode() || !(vmcs12->secondary_vm_exec_control &
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
 write(APIC_ACCESS_ADDR)

In other words if L2 shares L1 apic access page then reload, otherwise do 
nothing.

What if the page being swapped out is L1's APIC access page?  We don't
run prepare_vmcs12 in that case because it's an L2->L0->L2 entry, so we
need to do something.


Are you talking about the case where L1 and L2 have different apic pages?
I think I didn't deal with this situation in this patch set.

Sorry I didn't say it clearly. Here, I assume L1 and L2 share the same
apic page. If we are in L2, and the page is migrated, we update L2's
vmcs by making a vcpu request. And of course, we should also update
L1's vmcs. This is done by patch 5.

We make the vcpu request again in nested_vmx_vmexit().

Thanks.




Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().

2014-09-11 Thread tangchen


On 09/11/2014 05:21 PM, Paolo Bonzini wrote:

Il 11/09/2014 07:38, Tang Chen ha scritto:

apic access page is pinned in memory. As a result, it cannot be
migrated/hot-removed. Actually, it does not need to be pinned.

The hpa of apic access page is stored in VMCS APIC_ACCESS_ADDR pointer. When
the page is migrated, kvm_mmu_notifier_invalidate_page() will invalidate the
corresponding ept entry. This patch introduces a new vcpu request named
KVM_REQ_APIC_PAGE_RELOAD, and makes this request to all the vcpus at this time,
forcing all the vcpus to exit the guest and re-enter until they have updated
the VMCS APIC_ACCESS_ADDR pointer to the new apic access page address and
updated kvm->arch.apic_access_page to the new page.

Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
---
  arch/x86/include/asm/kvm_host.h |  1 +
  arch/x86/kvm/svm.c  |  6 ++
  arch/x86/kvm/vmx.c  |  6 ++
  arch/x86/kvm/x86.c  | 15 +++
  include/linux/kvm_host.h|  2 ++
  virt/kvm/kvm_main.c | 12 
  6 files changed, 42 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 35171c7..514183e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -739,6 +739,7 @@ struct kvm_x86_ops {
void (*hwapic_isr_update)(struct kvm *kvm, int isr);
void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
+   void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa);
void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1d941ad..f2eacc4 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3619,6 +3619,11 @@ static void svm_set_virtual_x2apic_mode(struct kvm_vcpu 
*vcpu, bool set)
return;
  }
  
+static void svm_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
+{
+   return;
+}
+
  static int svm_vm_has_apicv(struct kvm *kvm)
  {
return 0;
@@ -4373,6 +4378,7 @@ static struct kvm_x86_ops svm_x86_ops = {
.enable_irq_window = enable_irq_window,
.update_cr8_intercept = update_cr8_intercept,
.set_virtual_x2apic_mode = svm_set_virtual_x2apic_mode,
+   .set_apic_access_page_addr = svm_set_apic_access_page_addr,
.vm_has_apicv = svm_vm_has_apicv,
.load_eoi_exitmap = svm_load_eoi_exitmap,
.hwapic_isr_update = svm_hwapic_isr_update,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 63c4c3e..da6d55d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7093,6 +7093,11 @@ static void vmx_set_virtual_x2apic_mode(struct kvm_vcpu 
*vcpu, bool set)
vmx_set_msr_bitmap(vcpu);
  }
  
+static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
+{
+   vmcs_write64(APIC_ACCESS_ADDR, hpa);

This has to be guarded by if (!is_guest_mode(vcpu)).


Since we cannot get vcpu through kvm, I'd like to move this check to
vcpu_reload_apic_access_page(), where kvm_x86_ops->set_apic_access_page_addr()
is called.

Thanks.



+}
+
  static void vmx_hwapic_isr_update(struct kvm *kvm, int isr)
  {
u16 status;
@@ -8910,6 +8915,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
.enable_irq_window = enable_irq_window,
.update_cr8_intercept = update_cr8_intercept,
.set_virtual_x2apic_mode = vmx_set_virtual_x2apic_mode,
+   .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
.vm_has_apicv = vmx_vm_has_apicv,
.load_eoi_exitmap = vmx_load_eoi_exitmap,
.hwapic_irr_update = vmx_hwapic_irr_update,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e05bd58..96f4188 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5989,6 +5989,19 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
kvm_apic_update_tmr(vcpu, tmr);
  }
  
+static void vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
+{
+   /*
+* apic access page could be migrated. When the page is being migrated,
+* GUP will wait till the migrate entry is replaced with the new pte
+* entry pointing to the new page.
+*/
+   vcpu->kvm->arch.apic_access_page = gfn_to_page(vcpu->kvm,
+   APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
+   kvm_x86_ops->set_apic_access_page_addr(vcpu->kvm,
+   page_to_phys(vcpu->kvm->arch.apic_access_page));
+}
+
  /*
   * Returns 1 to let __vcpu_run() continue the guest execution loop without
   * exiting to the userspace.  Otherwise, the value will be returned to the
@@ -6049,6 +6062,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_deliver_pmi(vcpu);
if (kvm_check_request(KVM_REQ_SCAN_IOAPIC, vcpu))
 

Re: [PATCH v5 3/7] kvm: Make init_rmode_identity_map() return 0 on success.

2014-09-11 Thread tangchen


On 09/11/2014 05:17 PM, Paolo Bonzini wrote:

..
@@ -7645,7 +7642,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, 
unsigned int id)
kvm->arch.ept_identity_map_addr =
VMX_EPT_IDENTITY_PAGETABLE_ADDR;
err = -ENOMEM;
-   if (!init_rmode_identity_map(kvm))
+   if (init_rmode_identity_map(kvm))
Please add < 0 here.  I would also consider setting err to the return
value of init_rmode_identity_map, and initializing it to -ENOMEM only
after the if.


I'd like to move err = -ENOMEM to the following place:

vmx_create_vcpu()
{
..
err = kvm_vcpu_init(&vmx->vcpu, kvm, id);
if (err)
goto free_vcpu;

err = -ENOMEM;  <-- move it here

vmx->guest_msrs = kmalloc(PAGE_SIZE, GFP_KERNEL);

vmx->loaded_vmcs->vmcs = alloc_vmcs();

}

So that it can be used to handle the next two memory allocation errors.

Thanks.


Re: [PATCH v4 4/6] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().

2014-09-09 Thread tangchen

Hi Gleb,

On 09/03/2014 11:04 PM, Gleb Natapov wrote:

On Wed, Sep 03, 2014 at 09:42:30AM +0800, tangchen wrote:

Hi Gleb,

On 09/03/2014 12:00 AM, Gleb Natapov wrote:

..
+static void vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
+{
+   /*
+* apic access page could be migrated. When the page is being migrated,
+* GUP will wait till the migrate entry is replaced with the new pte
+* entry pointing to the new page.
+*/
+   vcpu->kvm->arch.apic_access_page = gfn_to_page(vcpu->kvm,
+   APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
+   kvm_x86_ops->set_apic_access_page_addr(vcpu->kvm,
+   page_to_phys(vcpu->kvm->arch.apic_access_page));
I am a little bit worried that here all vcpus write to 
vcpu->kvm->arch.apic_access_page
without any locking. It is probably benign since pointer write is atomic on 
x86. Paolo?

Do we even need apic_access_page? Why not call
  gfn_to_page(APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT)
  put_page()
on rare occasions we need to know its address?

Isn't it a necessary item defined in the hardware spec?


vcpu->kvm->arch.apic_access_page? No. This is an internal kvm data structure.


I didn't read the intel spec deeply, but according to the code, the page's
address is written into the vmcs. And it made me think that we cannot
remove it.


We cannot remove writing of the apic page address into the vmcs, but this is
not done by assigning to vcpu->kvm->arch.apic_access_page, but by the vmwrite
in set_apic_access_page_addr().


OK, I'll try to remove kvm->arch.apic_access_page and send a patch for 
it soon.


BTW, if you don't have objections to the first two patches, would you
please help to commit them first?

Thanks.



--
Gleb.
.







Re: [PATCH v4 5/6] kvm, mem-hotplug: Reload L1's apic access page on migration when L2 is running.

2014-09-02 Thread tangchen

Hi Gleb,

By the way, when testing nested vm, I started L1 and L2 vm with
-cpu XXX, -x2apic

But with or without this patch 5/6, when migrating the apic access page,
the nested vm didn't corrupt.

We cannot migrate the L2 vm because it pinned some other pages in memory.
Without this patch, if we migrate the apic access page, I thought the L2 vm
would corrupt. But it didn't.

Did I make any mistake you can obviously find out?

Thanks.

On 08/27/2014 06:17 PM, Tang Chen wrote:

This patch only handles the "L1 and L2 vm share one apic access page" situation.

When L1 vm is running, if the shared apic access page is migrated,
mmu_notifier will request all vcpus to exit to L0, and reload the apic
access page physical address for all the vcpus' vmcs (which is done by
patch 5/6). And when it enters L2 vm, L2's vmcs will be updated in
prepare_vmcs02() called by nested_vm_run(). So we need to do nothing.

When L2 vm is running, if the shared apic access page is migrated,
mmu_notifier will request all vcpus to exit to L0, and reload the apic
access page physical address for all L2 vmcs. And this patch requests an
apic access page reload in the L2->L1 vmexit.

Signed-off-by: Tang Chen 
---
  arch/x86/include/asm/kvm_host.h |  1 +
  arch/x86/kvm/svm.c  |  6 ++
  arch/x86/kvm/vmx.c  | 32 
  arch/x86/kvm/x86.c  |  3 +++
  virt/kvm/kvm_main.c |  1 +
  5 files changed, 43 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 514183e..13fbb62 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -740,6 +740,7 @@ struct kvm_x86_ops {
void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa);
+   void (*set_nested_apic_page_migrated)(struct kvm_vcpu *vcpu, bool set);
void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index f2eacc4..da88646 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3624,6 +3624,11 @@ static void svm_set_apic_access_page_addr(struct kvm 
*kvm, hpa_t hpa)
return;
  }
  
+static void svm_set_nested_apic_page_migrated(struct kvm_vcpu *vcpu, bool set)
+{
+   return;
+}
+
  static int svm_vm_has_apicv(struct kvm *kvm)
  {
return 0;
@@ -4379,6 +4384,7 @@ static struct kvm_x86_ops svm_x86_ops = {
.update_cr8_intercept = update_cr8_intercept,
.set_virtual_x2apic_mode = svm_set_virtual_x2apic_mode,
.set_apic_access_page_addr = svm_set_apic_access_page_addr,
+   .set_nested_apic_page_migrated = svm_set_nested_apic_page_migrated,
.vm_has_apicv = svm_vm_has_apicv,
.load_eoi_exitmap = svm_load_eoi_exitmap,
.hwapic_isr_update = svm_hwapic_isr_update,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index da6d55d..9035fd1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -379,6 +379,16 @@ struct nested_vmx {
 * we must keep them pinned while L2 runs.
 */
struct page *apic_access_page;
+   /*
+* L1's apic access page can be migrated. When L1 and L2 are sharing
+* the apic access page, after the page is migrated when L2 is running,
+* we have to reload it to L1 vmcs before we enter L1.
+*
+* When the shared apic access page is migrated in L1 mode, we don't
+* need to do anything else because we reload apic access page each
+* time when entering L2 in prepare_vmcs02().
+*/
+   bool apic_access_page_migrated;
u64 msr_ia32_feature_control;
  
  	struct hrtimer preemption_timer;

@@ -7098,6 +7108,12 @@ static void vmx_set_apic_access_page_addr(struct kvm 
*kvm, hpa_t hpa)
vmcs_write64(APIC_ACCESS_ADDR, hpa);
  }
  
+static void vmx_set_nested_apic_page_migrated(struct kvm_vcpu *vcpu, bool set)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   vmx->nested.apic_access_page_migrated = set;
+}
+
  static void vmx_hwapic_isr_update(struct kvm *kvm, int isr)
  {
u16 status;
@@ -8796,6 +8812,21 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 
exit_reason,
}
  
  	/*

+* When shared (L1 & L2) apic access page is migrated during L2 is
+* running, mmu_notifier will force to reload the page's hpa for L2
+* vmcs. Need to reload it for L1 before entering L1.
+*/
+   if (vmx->nested.apic_access_page_migrated) {
+   /*
+* Do not call kvm_reload_apic_access_page() because we are now
+* in L2. We should not call make_all_cpus_request() to exit to
+* L0, otherwise we will reload for L2 vmcs again.
+   

Re: [PATCH v4 4/6] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().

2014-09-02 Thread tangchen

Hi Gleb,

On 09/03/2014 12:00 AM, Gleb Natapov wrote:

..
+static void vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
+{
+   /*
+* apic access page could be migrated. When the page is being migrated,
+* GUP will wait till the migrate entry is replaced with the new pte
+* entry pointing to the new page.
+*/
+   vcpu->kvm->arch.apic_access_page = gfn_to_page(vcpu->kvm,
+   APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
+   kvm_x86_ops->set_apic_access_page_addr(vcpu->kvm,
+   page_to_phys(vcpu->kvm->arch.apic_access_page));
I am a little bit worried that here all vcpus write to 
vcpu->kvm->arch.apic_access_page
without any locking. It is probably benign since pointer write is atomic on 
x86. Paolo?

Do we even need apic_access_page? Why not call
  gfn_to_page(APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT)
  put_page()
on rare occasions we need to know its address?


Isn't it a necessary item defined in the hardware spec?

I didn't read the intel spec deeply, but according to the code, the page's
address is written into the vmcs. And it made me think that we cannot
remove it.

Thanks.




+}
+
  /*
   * Returns 1 to let __vcpu_run() continue the guest execution loop without
   * exiting to the userspace.  Otherwise, the value will be returned to the
@@ -6049,6 +6062,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_deliver_pmi(vcpu);
if (kvm_check_request(KVM_REQ_SCAN_IOAPIC, vcpu))
vcpu_scan_ioapic(vcpu);
+   if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu))
+   vcpu_reload_apic_access_page(vcpu);
}
  
  	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a4c33b3..8be076a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -136,6 +136,7 @@ static inline bool is_error_page(struct page *page)
  #define KVM_REQ_GLOBAL_CLOCK_UPDATE 22
  #define KVM_REQ_ENABLE_IBS23
  #define KVM_REQ_DISABLE_IBS   24
+#define KVM_REQ_APIC_PAGE_RELOAD  25
  
  #define KVM_USERSPACE_IRQ_SOURCE_ID		0

  #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID  1
@@ -579,6 +580,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
  void kvm_reload_remote_mmus(struct kvm *kvm);
  void kvm_make_mclock_inprogress_request(struct kvm *kvm);
  void kvm_make_scan_ioapic_request(struct kvm *kvm);
+void kvm_reload_apic_access_page(struct kvm *kvm);
  
  long kvm_arch_dev_ioctl(struct file *filp,

unsigned int ioctl, unsigned long arg);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 33712fb..d8280de 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -210,6 +210,11 @@ void kvm_make_scan_ioapic_request(struct kvm *kvm)
make_all_cpus_request(kvm, KVM_REQ_SCAN_IOAPIC);
  }
  
+void kvm_reload_apic_access_page(struct kvm *kvm)
+{
+   make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
+}
+
  int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
  {
struct page *page;
@@ -294,6 +299,13 @@ static void kvm_mmu_notifier_invalidate_page(struct 
mmu_notifier *mn,
if (need_tlb_flush)
kvm_flush_remote_tlbs(kvm);
  
+	/*

+* The physical address of apic access page is stored in VMCS.
+* So we need to update it when it becomes invalid.
+*/
+   if (address == gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT))
+   kvm_reload_apic_access_page(kvm);
+
spin_unlock(&kvm->mmu_lock);
srcu_read_unlock(&kvm->srcu, idx);
  }
--
1.8.3.1


--
Gleb.
.








Re: [PATCH v4 0/6] kvm, mem-hotplug: Do not pin ept identity pagetable and apic access page.

2014-08-31 Thread tangchen

Hi Gleb,

Would you please help to review these patches?

Thanks.

On 08/27/2014 06:17 PM, Tang Chen wrote:

ept identity pagetable and apic access page in kvm are pinned in memory.
As a result, they cannot be migrated/hot-removed.

But actually they don't need to be pinned in memory.

[For ept identity page]
Just do not pin it. When it is migrated, guest will be able to find the
new page in the next ept violation.

[For apic access page]
The hpa of apic access page is stored in VMCS APIC_ACCESS_ADDR pointer.
When apic access page is migrated, we update VMCS APIC_ACCESS_ADDR pointer
for each vcpu in addition.

NOTE: Tested with -cpu xxx,-x2apic option.
   But since nested vm pins some other pages in memory, if the user uses
   nested vm, memory hot-remove will not work.

Change log v3 -> v4:
1. The original patch 6 is now patch 5. ( by Jan Kiszka jan.kis...@web.de )
2. The original patch 1 is now patch 6 since we should unpin apic access page
at the very last moment.


Tang Chen (6):
   kvm: Use APIC_DEFAULT_PHYS_BASE macro as the apic access page address.
   kvm: Remove ept_identity_pagetable from struct kvm_arch.
   kvm: Make init_rmode_identity_map() return 0 on success.
   kvm, mem-hotplug: Reload L1's apic access page on migration in
 vcpu_enter_guest().
   kvm, mem-hotplug: Reload L1's apic access page on migration when L2 is
 running.
   kvm, mem-hotplug: Do not pin apic access page in memory.

  arch/x86/include/asm/kvm_host.h |   3 +-
  arch/x86/kvm/svm.c  |  15 +-
  arch/x86/kvm/vmx.c  | 103 +++-
  arch/x86/kvm/x86.c  |  22 +++--
  include/linux/kvm_host.h|   3 ++
  virt/kvm/kvm_main.c |  30 +++-
  6 files changed, 135 insertions(+), 41 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] mem-hotplug: introduce movablenodes boot option for memory hotplug debugging

2014-08-19 Thread tangchen


On 08/19/2014 06:02 PM, Xishi Qiu wrote:

This patch introduces a new boot option, "movablenodes". This parameter
depends on movable_node and is used for debugging memory hotplug: instead
of the SRAT, it specifies which memory is hotpluggable.

e.g. movable_node movablenodes=1,2,4

It means nodes 1, 2 and 4 will be set as movable nodes, and the other nodes
will be unmovable. Usually, movable nodes are parsed from the SRAT table,
which is offered by the BIOS.


This may not work on some machines. As far as I know, there are machines
whose node ids change after a reboot. So nodes 1, 2 and 4 may not be the
same nodes as before on the next boot.

Thanks.



Signed-off-by: Xishi Qiu qiuxi...@huawei.com
---
  Documentation/kernel-parameters.txt |5 
  arch/x86/mm/srat.c  |   36 +++
  2 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 5ae8608..e072ccf 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1949,6 +1949,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
movable_node[KNL,X86] Boot-time switch to enable the effects
of CONFIG_MOVABLE_NODE=y. See mm/Kconfig for details.
  
+	movablenodes=	[KNL,X86] This parameter depends on movable_node. It
+			is used for debugging memory hotplug: instead of the
+			SRAT, it specifies which memory is hotpluggable.
+			e.g. movablenodes=1,2,4
+
MTD_Partition=  [MTD]
		Format: <name>,<region-number>,<size>,<offset>
  
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c

index 66338a6..523e58b 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -157,6 +157,37 @@ static inline int save_add_info(void) {return 1;}
  static inline int save_add_info(void) {return 0;}
  #endif
  
+static nodemask_t movablenodes_mask;
+
+static void __init parse_movablenodes_one(char *p)
+{
+   int node;
+
+   get_option(&p, &node);
+   node_set(node, movablenodes_mask);
+}
+
+static int __init parse_movablenodes_opt(char *str)
+{
+   nodes_clear(movablenodes_mask);
+
+#ifdef CONFIG_MOVABLE_NODE
+   while (str) {
+   char *k = strchr(str, ',');
+
+   if (k)
+   *k++ = 0;
+   parse_movablenodes_one(str);
+   str = k;
+   }
+#else
+   pr_warn("movable_node option not supported\n");
+#endif
+
+   return 0;
+}
+early_param("movablenodes", parse_movablenodes_opt);
+
  /* Callback for parsing of the Proximity Domain <-> Memory Area mappings */
  int __init
  acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
@@ -202,6 +233,11 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
	pr_warn("SRAT: Failed to mark hotplug range [mem %#010Lx-%#010Lx] in memblock\n",
		(unsigned long long)start, (unsigned long long)end - 1);
  
+	if (node_isset(node, movablenodes_mask) &&
+	    memblock_mark_hotplug(start, ma->length))
+		pr_warn("SRAT debug: Failed to mark hotplug range [mem %#010Lx-%#010Lx] in memblock\n",
+			(unsigned long long)start, (unsigned long long)end - 1);
+
return 0;
  out_err_bad_srat:
bad_srat();


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] mem-hotplug: let memblock skip the hotpluggable memory regions in __next_mem_range()

2014-08-17 Thread tangchen

Hi tj,

On 08/17/2014 07:08 PM, Tejun Heo wrote:

Hello,

On Sat, Aug 16, 2014 at 10:36:41PM +0800, Xishi Qiu wrote:

numa_clear_node_hotplug()? There is only numa_clear_kernel_node_hotplug().

Yeah, that one.


If we don't clear the hotpluggable flag in free_low_memory_core_early(), the
memory marked with the hotpluggable flag will not be freed to the buddy
allocator, because __next_mem_range() will skip it.

free_low_memory_core_early
for_each_free_mem_range
for_each_mem_range
__next_mem_range

Ah, okay, so the patch fixes __next_mem_range() and thus makes
free_low_memory_core_early() skip hotpluggable regions unlike
before.  Please explain things like that in the changelog.  Also,
what's its relationship with numa_clear_kernel_node_hotplug()?  Do we
still need them?  If so, what are the different roles that these two
separate places serve?


numa_clear_kernel_node_hotplug() only clears hotplug flags for the nodes
the kernel resides in, not for hotpluggable nodes. The reason why we did
this is to enable the kernel to allocate memory in case all the nodes are
hotpluggable.

And we clear the hotplug flags for all the nodes in free_low_memory_core_early()
because, if we do not, no hotpluggable memory will be freed to the buddy
allocator after Qiu's patch.

Thanks.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH 1/1] memblock, memhotplug: Fix wrong type in memblock_find_in_range_node().

2014-08-12 Thread tangchen


On 08/13/2014 06:03 AM, Andrew Morton wrote:

On Sun, 10 Aug 2014 14:12:03 +0800 Tang Chen tangc...@cn.fujitsu.com wrote:


In memblock_find_in_range_node(), we defined ret as int. But it should
be phys_addr_t, because it is used to store the return value from
__memblock_find_range_bottom_up().

The bug has not been triggered because, when allocating low memory near
the kernel end, the "int ret" won't turn out to be negative. When we start
to allocate memory on other nodes, the "int ret" could be negative, and
then the kernel will panic.

A simple way to reproduce this: comment out the following code in numa_init(),

 memblock_set_bottom_up(false);

and the kernel won't boot.

Which kernel versions need this fix?


This bug has been in the kernel since v3.13-rc1.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH 1/1] memblock, memhotplug: Fix wrong type in memblock_find_in_range_node().

2014-08-10 Thread tangchen

Sorry, add Xishi Qiu qiuxi...@huawei.com

On 08/10/2014 02:12 PM, Tang Chen wrote:

In memblock_find_in_range_node(), we defined ret as int. But it should
be phys_addr_t, because it is used to store the return value from
__memblock_find_range_bottom_up().

The bug has not been triggered because, when allocating low memory near
the kernel end, the "int ret" won't turn out to be negative. When we start
to allocate memory on other nodes, the "int ret" could be negative, and
then the kernel will panic.

A simple way to reproduce this: comment out the following code in numa_init(),

 memblock_set_bottom_up(false);

and the kernel won't boot.

Reported-by: Xishi Qiu qiuxi...@huawei.com
Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
---
  mm/memblock.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 6d2f219..70fad0c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -192,8 +192,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
phys_addr_t align, phys_addr_t start,
phys_addr_t end, int nid)
  {
-   int ret;
-   phys_addr_t kernel_end;
+   phys_addr_t kernel_end, ret;
  
  	/* pump up @end */
	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH v3 6/6] kvm, mem-hotplug: Reload L1's apic access page if it is migrated when L2 is running.

2014-07-29 Thread tangchen


On 07/26/2014 04:44 AM, Jan Kiszka wrote:

On 2014-07-23 21:42, Tang Chen wrote:

This patch only handles the "L1 and L2 vm share one apic access page" situation.

When the L1 vm is running, if the shared apic access page is migrated, mmu_notifier
will request all vcpus to exit to L0 and reload the apic access page physical
address for all the vcpus' vmcs (which is done by patch 5/6). And when it enters
the L2 vm, L2's vmcs will be updated in prepare_vmcs02(), called by nested_vm_run().
So we need to do nothing.

When the L2 vm is running, if the shared apic access page is migrated, mmu_notifier
will request all vcpus to exit to L0 and reload the apic access page physical
address for all L2 vmcs. And this patch requests an apic access page reload in the
L2->L1 vmexit.

Shouldn't this patch come before we allow apic access page migration?

Yes, it should come before patch 5.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

