Re: [PATCH] KVM: x86: reinstate vendor-agnostic check on SPEC_CTRL cpuid bits

2020-12-08 Thread Denis V. Lunev
On 12/3/20 6:15 PM, Paolo Bonzini wrote:
> Until commit e7c587da1252 ("x86/speculation: Use synthetic bits for 
> IBRS/IBPB/STIBP",
> 2018-05-17), KVM was testing both Intel and AMD CPUID bits before allowing the
> guest to write MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD.  Testing only Intel 
> bits
> on VMX processors, or only AMD bits on SVM processors, fails if the guests are
> created with the "opposite" vendor as the host.
>
> While at it, also tweak the host CPU check to use the vendor-agnostic feature 
> bit
> X86_FEATURE_IBPB, since we only care about the availability of the MSR on the 
> host
> here and not about specific CPUID bits.
>
> Fixes: e7c587da1252 ("x86/speculation: Use synthetic bits for 
> IBRS/IBPB/STIBP")
> Cc: sta...@vger.kernel.org
> Reported-by: Denis V. Lunev 
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/kvm/svm/svm.c |  3 ++-
>  arch/x86/kvm/vmx/vmx.c | 10 +++---
>  2 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 62390fbc9233..0b4aa60b2754 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2686,12 +2686,13 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct 
> msr_data *msr)
>   break;
>   case MSR_IA32_PRED_CMD:
>   if (!msr->host_initiated &&
> + !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL) &&
>   !guest_cpuid_has(vcpu, X86_FEATURE_AMD_IBPB))
>   return 1;
>  
>   if (data & ~PRED_CMD_IBPB)
>   return 1;
> - if (!boot_cpu_has(X86_FEATURE_AMD_IBPB))
> + if (!boot_cpu_has(X86_FEATURE_IBPB))
>   return 1;
>   if (!data)
>   break;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c3441e7e5a87..b74d2105ced7 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2028,7 +2028,10 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
>   break;
>   case MSR_IA32_SPEC_CTRL:
>   if (!msr_info->host_initiated &&
> - !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
> +!guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL) &&
> +!guest_cpuid_has(vcpu, X86_FEATURE_AMD_STIBP) &&
> +!guest_cpuid_has(vcpu, X86_FEATURE_AMD_IBRS) &&
> +!guest_cpuid_has(vcpu, X86_FEATURE_AMD_SSBD))
>   return 1;
>  
>   if (kvm_spec_ctrl_test_value(data))
> @@ -2063,12 +2066,13 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
>   goto find_uret_msr;
>   case MSR_IA32_PRED_CMD:
>   if (!msr_info->host_initiated &&
> - !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
> + !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL) &&
> + !guest_cpuid_has(vcpu, X86_FEATURE_AMD_IBPB))
>       return 1;
>  
>   if (data & ~PRED_CMD_IBPB)
>   return 1;
> - if (!boot_cpu_has(X86_FEATURE_SPEC_CTRL))
> + if (!boot_cpu_has(X86_FEATURE_IBPB))
>   return 1;
>   if (!data)
>   break;
Reviewed-by: Denis V. Lunev 


[PATCH 1/1] i40iw: remove bogus call to netdev_master_upper_dev_get

2020-04-28 Thread Denis V. Lunev
Local variable netdev is not used in these calls.

It should be noted, that this change is required to work in bonded mode.
In the other case we would get the following assert:
 "RTNL: assertion failed at net/core/dev.c (5665)"
with the calltrace as follows:
dump_stack+0x19/0x1b
netdev_master_upper_dev_get+0x61/0x70
i40iw_addr_resolve_neigh+0x1e8/0x220
i40iw_make_cm_node+0x296/0x700
? i40iw_find_listener.isra.10+0xcc/0x110
i40iw_receive_ilq+0x3d4/0x810
i40iw_puda_poll_completion+0x341/0x420
i40iw_process_ceq+0xa5/0x280
i40iw_ceq_dpc+0x1e/0x40
tasklet_action+0x83/0x140
__do_softirq+0x125/0x2bb
call_softirq+0x1c/0x30
do_softirq+0x65/0xa0
irq_exit+0x105/0x110
do_IRQ+0x56/0xf0
common_interrupt+0x16a/0x16a
? cpuidle_enter_state+0x57/0xd0
cpuidle_idle_call+0xde/0x230
arch_cpu_idle+0xe/0xc0
cpu_startup_entry+0x14a/0x1e0
start_secondary+0x1f7/0x270
start_cpu+0x5/0x14

Signed-off-by: Denis V. Lunev 
CC: Konstantin Khorenko 
CC: Faisal Latif 
CC: Shiraz Saleem 
CC: Doug Ledford 
CC: Jason Gunthorpe 
CC: linux-r...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
---
 drivers/infiniband/hw/i40iw/i40iw_cm.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/drivers/infiniband/hw/i40iw/i40iw_cm.c 
b/drivers/infiniband/hw/i40iw/i40iw_cm.c
index bb78d3280acc..fa7a5ff498c7 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_cm.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_cm.c
@@ -1987,7 +1987,6 @@ static int i40iw_addr_resolve_neigh(struct i40iw_device 
*iwdev,
struct rtable *rt;
struct neighbour *neigh;
int rc = arpindex;
-   struct net_device *netdev = iwdev->netdev;
__be32 dst_ipaddr = htonl(dst_ip);
__be32 src_ipaddr = htonl(src_ip);
 
@@ -1997,9 +1996,6 @@ static int i40iw_addr_resolve_neigh(struct i40iw_device 
*iwdev,
return rc;
}
 
-   if (netif_is_bond_slave(netdev))
-   netdev = netdev_master_upper_dev_get(netdev);
-
neigh = dst_neigh_lookup(>dst, _ipaddr);
 
rcu_read_lock();
@@ -2065,7 +2061,6 @@ static int i40iw_addr_resolve_neigh_ipv6(struct 
i40iw_device *iwdev,
 {
struct neighbour *neigh;
int rc = arpindex;
-   struct net_device *netdev = iwdev->netdev;
struct dst_entry *dst;
struct sockaddr_in6 dst_addr;
struct sockaddr_in6 src_addr;
@@ -2086,9 +2081,6 @@ static int i40iw_addr_resolve_neigh_ipv6(struct 
i40iw_device *iwdev,
return rc;
}
 
-   if (netif_is_bond_slave(netdev))
-   netdev = netdev_master_upper_dev_get(netdev);
-
neigh = dst_neigh_lookup(dst, dst_addr.sin6_addr.in6_u.u6_addr32);
 
rcu_read_lock();
-- 
2.17.1



Re: [PATCH RFC 0/2] ignore LBR-related MSRs

2017-12-06 Thread Denis V. Lunev
On 12/06/2017 07:39 PM, Jan Dakinevich wrote:
> On Wed, 6 Dec 2017 10:06:48 -0500
> Konrad Rzeszutek Wilk  wrote:
>
>> On Wed, Dec 06, 2017 at 02:43:01PM +0300, Jan Dakinevich wrote:
>>> w2k16 essentials fails to boot if underlying hypervisor lacks of
>>> support for LBR MSRs. To workaround the issue, it suggessted to
>>> ignore these MSRs (but not all).
>> This is without any hyperv enablement? Meaning normal stock guest?
>>
> Yes, it is normal guest. No hyperv enlightenments were enabled, and
> "-cpu host" was specified in QEMU command line.

The problem also exists with HyperV enabled.

Den


Re: [PATCH RFC 0/2] ignore LBR-related MSRs

2017-12-06 Thread Denis V. Lunev
On 12/06/2017 07:39 PM, Jan Dakinevich wrote:
> On Wed, 6 Dec 2017 10:06:48 -0500
> Konrad Rzeszutek Wilk  wrote:
>
>> On Wed, Dec 06, 2017 at 02:43:01PM +0300, Jan Dakinevich wrote:
>>> w2k16 essentials fails to boot if underlying hypervisor lacks of
>>> support for LBR MSRs. To workaround the issue, it suggessted to
>>> ignore these MSRs (but not all).
>> This is without any hyperv enablement? Meaning normal stock guest?
>>
> Yes, it is normal guest. No hyperv enlightenments were enabled, and
> "-cpu host" was specified in QEMU command line.

The problem also exists with HyperV enabled.

Den


Re: [PATCH] virtio_balloon: prevent uninitialized variable use

2017-03-23 Thread Denis V. Lunev
On 03/23/2017 06:17 PM, Arnd Bergmann wrote:
> The latest gcc-7.0.1 snapshot reports a new warning:
>
> virtio/virtio_balloon.c: In function 'update_balloon_stats':
> virtio/virtio_balloon.c:258:26: error: 'events[2]' is used uninitialized in 
> this function [-Werror=uninitialized]
> virtio/virtio_balloon.c:260:26: error: 'events[3]' is used uninitialized in 
> this function [-Werror=uninitialized]
> virtio/virtio_balloon.c:261:56: error: 'events[18]' is used uninitialized in 
> this function [-Werror=uninitialized]
> virtio/virtio_balloon.c:262:56: error: 'events[17]' is used uninitialized in 
> this function [-Werror=uninitialized]
>
> This seems absolutely right, so we should add an extra check to
> prevent copying uninitialized stack data into the statistics.
> From all I can tell, this has been broken since the statistics code
> was originally added in 2.6.34.
>
> Fixes: 9564e138b1f6 ("virtio: Add memory statistics reporting to the balloon 
> driver (V4)")
> Signed-off-by: Arnd Bergmann <a...@arndb.de>
Reviewed-by: Denis V. Lunev <d...@openvz.org>



> ---
>  drivers/virtio/virtio_balloon.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 4e1191508228..cd5c54e2003d 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -254,12 +254,14 @@ static void update_balloon_stats(struct virtio_balloon 
> *vb)
>  
>   available = si_mem_available();
>  
> +#ifdef CONFIG_VM_EVENT_COUNTERS
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN,
>   pages_to_bytes(events[PSWPIN]));
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
>   pages_to_bytes(events[PSWPOUT]));
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
> +#endif
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE,
>   pages_to_bytes(i.freeram));
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMTOT,



Re: [PATCH] virtio_balloon: prevent uninitialized variable use

2017-03-23 Thread Denis V. Lunev
On 03/23/2017 06:17 PM, Arnd Bergmann wrote:
> The latest gcc-7.0.1 snapshot reports a new warning:
>
> virtio/virtio_balloon.c: In function 'update_balloon_stats':
> virtio/virtio_balloon.c:258:26: error: 'events[2]' is used uninitialized in 
> this function [-Werror=uninitialized]
> virtio/virtio_balloon.c:260:26: error: 'events[3]' is used uninitialized in 
> this function [-Werror=uninitialized]
> virtio/virtio_balloon.c:261:56: error: 'events[18]' is used uninitialized in 
> this function [-Werror=uninitialized]
> virtio/virtio_balloon.c:262:56: error: 'events[17]' is used uninitialized in 
> this function [-Werror=uninitialized]
>
> This seems absolutely right, so we should add an extra check to
> prevent copying uninitialized stack data into the statistics.
> From all I can tell, this has been broken since the statistics code
> was originally added in 2.6.34.
>
> Fixes: 9564e138b1f6 ("virtio: Add memory statistics reporting to the balloon 
> driver (V4)")
> Signed-off-by: Arnd Bergmann 
Reviewed-by: Denis V. Lunev 



> ---
>  drivers/virtio/virtio_balloon.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 4e1191508228..cd5c54e2003d 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -254,12 +254,14 @@ static void update_balloon_stats(struct virtio_balloon 
> *vb)
>  
>   available = si_mem_available();
>  
> +#ifdef CONFIG_VM_EVENT_COUNTERS
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN,
>   pages_to_bytes(events[PSWPIN]));
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
>   pages_to_bytes(events[PSWPOUT]));
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
> +#endif
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE,
>   pages_to_bytes(i.freeram));
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMTOT,



[PATCH v2 1/1] virtio: update balloon size in balloon "probe"

2016-09-29 Thread Denis V. Lunev
From: Konstantin Neumoin <kneum...@virtuozzo.com>

The following commit 'fad7b7b27b6a (virtio_balloon: Use a workqueue
instead of "vballoon" kthread)' has added a regression. Original code with
kthread starts the thread inside probe and checks the necessity to update
balloon inside the thread immediately.

Nowadays the code behaves differently. Work is queued only on the first
command from the host after the negotiation. Thus there is a window
especially at the guest startup or the module reloading when the balloon
size is not updated until the notification from the host.

This patch adds balloon size check at the end of the probe to match
original behaviour.

Signed-off-by: Konstantin Neumoin <kneum...@virtuozzo.com>
Signed-off-by: Denis V. Lunev <d...@openvz.org>
CC: Michael S. Tsirkin <m...@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 2 ++
 1 file changed, 2 insertions(+)

Changes from v1:
- fixed description
- removed update_balloon_size() call

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 4e7003d..181793f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -577,6 +577,8 @@ static int virtballoon_probe(struct virtio_device *vdev)
 
virtio_device_ready(vdev);
 
+   if (towards_target(vb))
+   virtballoon_changed(vdev);
return 0;
 
 out_del_vqs:
-- 
2.7.4



[PATCH v2 1/1] virtio: update balloon size in balloon "probe"

2016-09-29 Thread Denis V. Lunev
From: Konstantin Neumoin 

The following commit 'fad7b7b27b6a (virtio_balloon: Use a workqueue
instead of "vballoon" kthread)' has added a regression. Original code with
kthread starts the thread inside probe and checks the necessity to update
balloon inside the thread immediately.

Nowadays the code behaves differently. Work is queued only on the first
command from the host after the negotiation. Thus there is a window
especially at the guest startup or the module reloading when the balloon
size is not updated until the notification from the host.

This patch adds balloon size check at the end of the probe to match
original behaviour.

Signed-off-by: Konstantin Neumoin 
Signed-off-by: Denis V. Lunev 
CC: Michael S. Tsirkin 
---
 drivers/virtio/virtio_balloon.c | 2 ++
 1 file changed, 2 insertions(+)

Changes from v1:
- fixed description
- removed update_balloon_size() call

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 4e7003d..181793f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -577,6 +577,8 @@ static int virtballoon_probe(struct virtio_device *vdev)
 
virtio_device_ready(vdev);
 
+   if (towards_target(vb))
+   virtballoon_changed(vdev);
return 0;
 
 out_del_vqs:
-- 
2.7.4



[PATCH 1/1] update balloon size in balloon "probe"

2016-09-23 Thread Denis V. Lunev
From: Konstantin Neumoin <kneum...@virtuozzo.com>

Patch
Commit 3d2a3774c1b046f548ebea0391a602fd5685a307
Author: Michael S. Tsirkin <m...@redhat.com>
Date:   Tue Mar 10 11:55:08 2015 +1030
virtio-balloon: do not call blocking ops when !TASK_RUNNING
has added a regression. Original code with wait_event_interruptible
checked the condition before start waiting and started balloon operations
if necessary.

Right now balloon is not inflated if ballon target is set before the
driver is loaded.

Signed-off-by: Konstantin Neumoin <kneum...@virtuozzo.com>
Signed-off-by: Denis V. Lunev <d...@openvz.org>
CC: "Michael S. Tsirkin" <m...@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 4e7003d..0a6c10f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -577,6 +577,10 @@ static int virtballoon_probe(struct virtio_device *vdev)
 
virtio_device_ready(vdev);
 
+   if (towards_target(vb))
+   virtballoon_changed(vdev);
+   update_balloon_size(vb);
+
return 0;
 
 out_del_vqs:
-- 
2.7.4



[PATCH 1/1] update balloon size in balloon "probe"

2016-09-23 Thread Denis V. Lunev
From: Konstantin Neumoin 

Patch
Commit 3d2a3774c1b046f548ebea0391a602fd5685a307
Author: Michael S. Tsirkin 
Date:   Tue Mar 10 11:55:08 2015 +1030
virtio-balloon: do not call blocking ops when !TASK_RUNNING
has added a regression. Original code with wait_event_interruptible
checked the condition before start waiting and started balloon operations
if necessary.

Right now balloon is not inflated if ballon target is set before the
driver is loaded.

Signed-off-by: Konstantin Neumoin 
Signed-off-by: Denis V. Lunev 
CC: "Michael S. Tsirkin" 
---
 drivers/virtio/virtio_balloon.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 4e7003d..0a6c10f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -577,6 +577,10 @@ static int virtballoon_probe(struct virtio_device *vdev)
 
virtio_device_ready(vdev);
 
+   if (towards_target(vb))
+   virtballoon_changed(vdev);
+   update_balloon_size(vb);
+
return 0;
 
 out_del_vqs:
-- 
2.7.4



[PATCH 1/1] balloon: stop inflate balloon after oom notification

2016-09-09 Thread Denis V. Lunev
From: Konstantin Neumoin <kneum...@virtuozzo.com>

At this moment oom notification in balloon does not work as expected.
After virtballoon_oom_notify there is an infinitive loop:
 - virtballoon_oom_notify was called and balloon was deflated
 - balloon get notification that config was changed, compare target and
   actual and try to reach target again.

This patch adds global variable fail_counter which indicates that oom has
been happened. We check that fail_counter was changed between calls
update_balloon_size_func. In this case we should not try to inflate balloon
even if actual != target.

Signed-off-by: Konstantin Neumoin <kneum...@virtuozzo.com>
Signed-off-by: Denis V. Lunev <d...@openvz.org>
CC: Michael S. Tsirkin <m...@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 4e7003d..253bf05 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -50,6 +50,8 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 static struct vfsmount *balloon_mnt;
 #endif
 
+static unsigned long fail_count;
+
 struct virtio_balloon {
struct virtio_device *vdev;
struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -361,6 +363,8 @@ static int virtballoon_oom_notify(struct notifier_block 
*self,
unsigned long *freed;
unsigned num_freed_pages;
 
+   fail_count++;
+
vb = container_of(self, struct virtio_balloon, nb);
if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
return NOTIFY_OK;
@@ -386,11 +390,22 @@ static void update_balloon_size_func(struct work_struct 
*work)
 {
struct virtio_balloon *vb;
s64 diff;
+   static unsigned long fc;
+
+   if (fc == 0)
+   fc = fail_count;
 
vb = container_of(work, struct virtio_balloon,
  update_balloon_size_work);
diff = towards_target(vb);
 
+   if (fc != fail_count) {
+   fc = fail_count;
+   /* Don't inflate balloon after oom notification */
+   if (diff > 0)
+   return;
+   }
+
if (diff > 0)
diff -= fill_balloon(vb, diff);
else if (diff < 0)
-- 
2.7.4



[PATCH 1/1] balloon: stop inflate balloon after oom notification

2016-09-09 Thread Denis V. Lunev
From: Konstantin Neumoin 

At this moment oom notification in balloon does not work as expected.
After virtballoon_oom_notify there is an infinitive loop:
 - virtballoon_oom_notify was called and balloon was deflated
 - balloon get notification that config was changed, compare target and
   actual and try to reach target again.

This patch adds global variable fail_counter which indicates that oom has
been happened. We check that fail_counter was changed between calls
update_balloon_size_func. In this case we should not try to inflate balloon
even if actual != target.

Signed-off-by: Konstantin Neumoin 
Signed-off-by: Denis V. Lunev 
CC: Michael S. Tsirkin 
---
 drivers/virtio/virtio_balloon.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 4e7003d..253bf05 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -50,6 +50,8 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 static struct vfsmount *balloon_mnt;
 #endif
 
+static unsigned long fail_count;
+
 struct virtio_balloon {
struct virtio_device *vdev;
struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -361,6 +363,8 @@ static int virtballoon_oom_notify(struct notifier_block 
*self,
unsigned long *freed;
unsigned num_freed_pages;
 
+   fail_count++;
+
vb = container_of(self, struct virtio_balloon, nb);
if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
return NOTIFY_OK;
@@ -386,11 +390,22 @@ static void update_balloon_size_func(struct work_struct 
*work)
 {
struct virtio_balloon *vb;
s64 diff;
+   static unsigned long fc;
+
+   if (fc == 0)
+   fc = fail_count;
 
vb = container_of(work, struct virtio_balloon,
  update_balloon_size_work);
diff = towards_target(vb);
 
+   if (fc != fail_count) {
+   fc = fail_count;
+   /* Don't inflate balloon after oom notification */
+   if (diff > 0)
+   return;
+   }
+
if (diff > 0)
diff -= fill_balloon(vb, diff);
else if (diff < 0)
-- 
2.7.4



[PATCH 1/1] balloon: check the number of available pages in leak balloon

2016-07-11 Thread Denis V. Lunev
From: Konstantin Neumoin <kneum...@virtuozzo.com>

The balloon has a special mechanism that is subscribed to the oom
notification which leads to deflation for a fixed number of pages.
The number is always fixed even when the balloon is fully deflated.
But leak_balloon did not expect that the pages to deflate will be more
than taken, and raise a "BUG" in balloon_page_dequeue when page list
will be empty.

So, the simplest solution would be to check that the number of releases
pages is less or equal to the number taken pages.

Signed-off-by: Konstantin Neumoin <kneum...@virtuozzo.com>
Signed-off-by: Denis V. Lunev <d...@openvz.org>
CC: Michael S. Tsirkin <m...@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 476c0e3..f6ea8f4 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -202,6 +202,8 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
num = min(num, ARRAY_SIZE(vb->pfns));
 
mutex_lock(>balloon_lock);
+   /* We can't release more pages than taken */
+   num = min(num, (size_t)vb->num_pages);
for (vb->num_pfns = 0; vb->num_pfns < num;
 vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
page = balloon_page_dequeue(vb_dev_info);
-- 
2.1.4



[PATCH 1/1] balloon: check the number of available pages in leak balloon

2016-07-11 Thread Denis V. Lunev
From: Konstantin Neumoin 

The balloon has a special mechanism that is subscribed to the oom
notification which leads to deflation for a fixed number of pages.
The number is always fixed even when the balloon is fully deflated.
But leak_balloon did not expect that the pages to deflate will be more
than taken, and raise a "BUG" in balloon_page_dequeue when page list
will be empty.

So, the simplest solution would be to check that the number of releases
pages is less or equal to the number taken pages.

Signed-off-by: Konstantin Neumoin 
Signed-off-by: Denis V. Lunev 
CC: Michael S. Tsirkin 
---
 drivers/virtio/virtio_balloon.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 476c0e3..f6ea8f4 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -202,6 +202,8 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
num = min(num, ARRAY_SIZE(vb->pfns));
 
mutex_lock(>balloon_lock);
+   /* We can't release more pages than taken */
+   num = min(num, (size_t)vb->num_pages);
for (vb->num_pfns = 0; vb->num_pfns < num;
 vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
page = balloon_page_dequeue(vb_dev_info);
-- 
2.1.4



Re: [PATCH 2/2] virtio_balloon: export 'available' memory to balloon statistics

2016-02-23 Thread Denis V. Lunev

On 02/23/2016 06:53 PM, Michael S. Tsirkin wrote:

On Tue, Feb 23, 2016 at 06:26:47PM +0300, Denis V. Lunev wrote:

On 02/23/2016 06:10 PM, Michael S. Tsirkin wrote:

On Tue, Feb 16, 2016 at 06:50:52PM +0300, Denis V. Lunev wrote:

From: Igor Redko <red...@virtuozzo.com>

Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
statistics protocol, corresponding to 'Available' in /proc/meminfo.

It indicates to the hypervisor how big the balloon can be inflated
without pushing the guest system to swap.

Signed-off-by: Igor Redko <red...@virtuozzo.com>
Reviewed-by: Roman Kagan <rka...@virtuozzo.com>
Signed-off-by: Denis V. Lunev <d...@openvz.org>
CC: Michael S. Tsirkin <m...@redhat.com>
CC: Andrew Morton <a...@linux-foundation.org>

Oops - I missed the fact that this affects host/guest ABI.

Can you please submit ABI update proposal to virtio tc?
Spec patch would be even better.

This is important to ensure there are no conflicts
with other features being developed in parallel.

hmmm

 From my point of view ABI remains untouched.

Anything exposed by guest to host is ABI.
Once we add stuff there, we never can remove it
as some host might rely on it.


The guest can send any amount of ;
pairs and unknown tags are properly ignored
by the host.

That is why I think that this change is safe.

What happens if someone uses the tag you
used for VIRTIO_BALLOON_S_AVAIL, for some
other purpose?
Any tools using VIRTIO_BALLOON_S_AVAIL will be confused.

actually this constant resides in QEMU only,
values are reported above using JSON and
string tags.


Really, it's not hard to get a tag number from virtio TC,
so please just do this.


ok. So do you propose to negotiate maximum allowed
tag to send at the driver start time?

we will have to guard this exchange with proper flag
in feature space then. This could be done but from my
point of view this looks like serious over-complication.
Do we have somebody who can judge?

Den


Re: [PATCH 2/2] virtio_balloon: export 'available' memory to balloon statistics

2016-02-23 Thread Denis V. Lunev

On 02/23/2016 06:53 PM, Michael S. Tsirkin wrote:

On Tue, Feb 23, 2016 at 06:26:47PM +0300, Denis V. Lunev wrote:

On 02/23/2016 06:10 PM, Michael S. Tsirkin wrote:

On Tue, Feb 16, 2016 at 06:50:52PM +0300, Denis V. Lunev wrote:

From: Igor Redko 

Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
statistics protocol, corresponding to 'Available' in /proc/meminfo.

It indicates to the hypervisor how big the balloon can be inflated
without pushing the guest system to swap.

Signed-off-by: Igor Redko 
Reviewed-by: Roman Kagan 
Signed-off-by: Denis V. Lunev 
CC: Michael S. Tsirkin 
CC: Andrew Morton 

Oops - I missed the fact that this affects host/guest ABI.

Can you please submit ABI update proposal to virtio tc?
Spec patch would be even better.

This is important to ensure there are no conflicts
with other features being developed in parallel.

hmmm

 From my point of view ABI remains untouched.

Anything exposed by guest to host is ABI.
Once we add stuff there, we never can remove it
as some host might rely on it.


The guest can send any amount of ;
pairs and unknown tags are properly ignored
by the host.

That is why I think that this change is safe.

What happens if someone uses the tag you
used for VIRTIO_BALLOON_S_AVAIL, for some
other purpose?
Any tools using VIRTIO_BALLOON_S_AVAIL will be confused.

actually this constant resides in QEMU only,
values are reported above using JSON and
string tags.


Really, it's not hard to get a tag number from virtio TC,
so please just do this.


ok. So do you propose to negotiate maximum allowed
tag to send at the driver start time?

we will have to guard this exchange with proper flag
in feature space then. This could be done but from my
point of view this looks like serious over-complication.
Do we have somebody who can judge?

Den


Re: [PATCH 2/2] virtio_balloon: export 'available' memory to balloon statistics

2016-02-23 Thread Denis V. Lunev

On 02/23/2016 06:10 PM, Michael S. Tsirkin wrote:

On Tue, Feb 16, 2016 at 06:50:52PM +0300, Denis V. Lunev wrote:

From: Igor Redko <red...@virtuozzo.com>

Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
statistics protocol, corresponding to 'Available' in /proc/meminfo.

It indicates to the hypervisor how big the balloon can be inflated
without pushing the guest system to swap.

Signed-off-by: Igor Redko <red...@virtuozzo.com>
Reviewed-by: Roman Kagan <rka...@virtuozzo.com>
Signed-off-by: Denis V. Lunev <d...@openvz.org>
CC: Michael S. Tsirkin <m...@redhat.com>
CC: Andrew Morton <a...@linux-foundation.org>

Oops - I missed the fact that this affects host/guest ABI.

Can you please submit ABI update proposal to virtio tc?
Spec patch would be even better.

This is important to ensure there are no conflicts
with other features being developed in parallel.


hmmm

From my point of view ABI remains untouched.
The guest can send any amount of ;
pairs and unknown tags are properly ignored
by the host.

That is why I think that this change is safe.


Re: [PATCH 2/2] virtio_balloon: export 'available' memory to balloon statistics

2016-02-23 Thread Denis V. Lunev

On 02/23/2016 06:10 PM, Michael S. Tsirkin wrote:

On Tue, Feb 16, 2016 at 06:50:52PM +0300, Denis V. Lunev wrote:

From: Igor Redko 

Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
statistics protocol, corresponding to 'Available' in /proc/meminfo.

It indicates to the hypervisor how big the balloon can be inflated
without pushing the guest system to swap.

Signed-off-by: Igor Redko 
Reviewed-by: Roman Kagan 
Signed-off-by: Denis V. Lunev 
CC: Michael S. Tsirkin 
CC: Andrew Morton 

Oops - I missed the fact that this affects host/guest ABI.

Can you please submit ABI update proposal to virtio tc?
Spec patch would be even better.

This is important to ensure there are no conflicts
with other features being developed in parallel.


hmmm

From my point of view ABI remains untouched.
The guest can send any amount of ;
pairs and unknown tags are properly ignored
by the host.

That is why I think that this change is safe.


[PATCH 1/2] calculate 'available' memory in the separate function

2016-02-16 Thread Denis V. Lunev
From: Igor Redko <red...@virtuozzo.com>

Factor out calculation of the available memory counter into a separate
exportable function, in order to be able to use it in other parts of
the kernel.

In particular, it appears a relevant metric to report to the
hypervisor via virtio-balloon statistics interface (in a followup
patch).

Signed-off-by: Igor Redko <red...@virtuozzo.com>
Reviewed-by: Roman Kagan <rka...@virtuozzo.com>
Signed-off-by: Denis V. Lunev <d...@openvz.org>
CC: Michael S. Tsirkin <m...@redhat.com>
CC: Andrew Morton <a...@linux-foundation.org>
---
 fs/proc/meminfo.c  | 31 +--
 include/linux/mm.h |  1 +
 mm/page_alloc.c| 43 +++
 3 files changed, 45 insertions(+), 30 deletions(-)

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index df4661a..8372046 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -29,10 +29,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
unsigned long committed;
long cached;
long available;
-   unsigned long pagecache;
-   unsigned long wmark_low = 0;
unsigned long pages[NR_LRU_LISTS];
-   struct zone *zone;
int lru;
 
 /*
@@ -51,33 +48,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
pages[lru] = global_page_state(NR_LRU_BASE + lru);
 
-   for_each_zone(zone)
-   wmark_low += zone->watermark[WMARK_LOW];
-
-   /*
-* Estimate the amount of memory available for userspace allocations,
-* without causing swapping.
-*/
-   available = i.freeram - totalreserve_pages;
-
-   /*
-* Not all the page cache can be freed, otherwise the system will
-* start swapping. Assume at least half of the page cache, or the
-* low watermark worth of cache, needs to stay.
-*/
-   pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
-   pagecache -= min(pagecache / 2, wmark_low);
-   available += pagecache;
-
-   /*
-* Part of the reclaimable slab consists of items that are in use,
-* and cannot be freed. Cap this estimate at the low watermark.
-*/
-   available += global_page_state(NR_SLAB_RECLAIMABLE) -
-min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
-
-   if (available < 0)
-   available = 0;
+   available = si_mem_available();
 
/*
 * Tagged format, for easy grepping and expansion.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 516e149..a8c4144 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1862,6 +1862,7 @@ extern int __meminit init_per_zone_wmark_min(void);
 extern void mem_init(void);
 extern void __init mmap_init(void);
 extern void show_mem(unsigned int flags);
+extern long si_mem_available(void);
 extern void si_meminfo(struct sysinfo * val);
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 838ca8bb..dae813c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3603,6 +3603,49 @@ static inline void show_node(struct zone *zone)
printk("Node %d ", zone_to_nid(zone));
 }
 
+long si_mem_available(void)
+{
+   long available;
+   unsigned long pagecache;
+   unsigned long wmark_low = 0;
+   unsigned long pages[NR_LRU_LISTS];
+   struct zone *zone;
+   int lru;
+
+   for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
+   pages[lru] = global_page_state(NR_LRU_BASE + lru);
+
+   for_each_zone(zone)
+   wmark_low += zone->watermark[WMARK_LOW];
+
+   /*
+* Estimate the amount of memory available for userspace allocations,
+* without causing swapping.
+*/
+   available = global_page_state(NR_FREE_PAGES) - totalreserve_pages;
+
+   /*
+* Not all the page cache can be freed, otherwise the system will
+* start swapping. Assume at least half of the page cache, or the
+* low watermark worth of cache, needs to stay.
+*/
+   pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
+   pagecache -= min(pagecache / 2, wmark_low);
+   available += pagecache;
+
+   /*
+* Part of the reclaimable slab consists of items that are in use,
+* and cannot be freed. Cap this estimate at the low watermark.
+*/
+   available += global_page_state(NR_SLAB_RECLAIMABLE) -
+min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
+
+   if (available < 0)
+   available = 0;
+   return available;
+}
+EXPORT_SYMBOL_GPL(si_mem_available);
+
 void si_meminfo(struct sysinfo *val)
 {
val->totalram = totalram_pages;
-- 
2.5.0



[PATCH 1/2] calculate 'available' memory in the separate function

2016-02-16 Thread Denis V. Lunev
From: Igor Redko 

Factor out calculation of the available memory counter into a separate
exportable function, in order to be able to use it in other parts of
the kernel.

In particular, it appears a relevant metric to report to the
hypervisor via virtio-balloon statistics interface (in a followup
patch).

Signed-off-by: Igor Redko 
Reviewed-by: Roman Kagan 
Signed-off-by: Denis V. Lunev 
CC: Michael S. Tsirkin 
CC: Andrew Morton 
---
 fs/proc/meminfo.c  | 31 +--
 include/linux/mm.h |  1 +
 mm/page_alloc.c| 43 +++
 3 files changed, 45 insertions(+), 30 deletions(-)

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index df4661a..8372046 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -29,10 +29,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
unsigned long committed;
long cached;
long available;
-   unsigned long pagecache;
-   unsigned long wmark_low = 0;
unsigned long pages[NR_LRU_LISTS];
-   struct zone *zone;
int lru;
 
 /*
@@ -51,33 +48,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
pages[lru] = global_page_state(NR_LRU_BASE + lru);
 
-   for_each_zone(zone)
-   wmark_low += zone->watermark[WMARK_LOW];
-
-   /*
-* Estimate the amount of memory available for userspace allocations,
-* without causing swapping.
-*/
-   available = i.freeram - totalreserve_pages;
-
-   /*
-* Not all the page cache can be freed, otherwise the system will
-* start swapping. Assume at least half of the page cache, or the
-* low watermark worth of cache, needs to stay.
-*/
-   pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
-   pagecache -= min(pagecache / 2, wmark_low);
-   available += pagecache;
-
-   /*
-* Part of the reclaimable slab consists of items that are in use,
-* and cannot be freed. Cap this estimate at the low watermark.
-*/
-   available += global_page_state(NR_SLAB_RECLAIMABLE) -
-min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
-
-   if (available < 0)
-   available = 0;
+   available = si_mem_available();
 
/*
 * Tagged format, for easy grepping and expansion.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 516e149..a8c4144 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1862,6 +1862,7 @@ extern int __meminit init_per_zone_wmark_min(void);
 extern void mem_init(void);
 extern void __init mmap_init(void);
 extern void show_mem(unsigned int flags);
+extern long si_mem_available(void);
 extern void si_meminfo(struct sysinfo * val);
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 838ca8bb..dae813c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3603,6 +3603,49 @@ static inline void show_node(struct zone *zone)
printk("Node %d ", zone_to_nid(zone));
 }
 
+long si_mem_available(void)
+{
+   long available;
+   unsigned long pagecache;
+   unsigned long wmark_low = 0;
+   unsigned long pages[NR_LRU_LISTS];
+   struct zone *zone;
+   int lru;
+
+   for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
+   pages[lru] = global_page_state(NR_LRU_BASE + lru);
+
+   for_each_zone(zone)
+   wmark_low += zone->watermark[WMARK_LOW];
+
+   /*
+* Estimate the amount of memory available for userspace allocations,
+* without causing swapping.
+*/
+   available = global_page_state(NR_FREE_PAGES) - totalreserve_pages;
+
+   /*
+* Not all the page cache can be freed, otherwise the system will
+* start swapping. Assume at least half of the page cache, or the
+* low watermark worth of cache, needs to stay.
+*/
+   pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
+   pagecache -= min(pagecache / 2, wmark_low);
+   available += pagecache;
+
+   /*
+* Part of the reclaimable slab consists of items that are in use,
+* and cannot be freed. Cap this estimate at the low watermark.
+*/
+   available += global_page_state(NR_SLAB_RECLAIMABLE) -
+min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
+
+   if (available < 0)
+   available = 0;
+   return available;
+}
+EXPORT_SYMBOL_GPL(si_mem_available);
+
 void si_meminfo(struct sysinfo *val)
 {
val->totalram = totalram_pages;
-- 
2.5.0



[PATCH 0/2] export 'available' memory to virtio balloon statistics

2016-02-16 Thread Denis V. Lunev
Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
statistics protocol, corresponding to 'Available' in /proc/meminfo.

It indicates to the hypervisor how big the balloon can be inflated
without pushing the guest system to swap. This metric would be very
useful in VM orchestration software to improve memory management
of different VMs under overcommit.

Signed-off-by: Igor Redko <red...@virtuozzo.com>
Reviewed-by: Roman Kagan <rka...@virtuozzo.com>
Signed-off-by: Denis V. Lunev <d...@openvz.org>
CC: Michael S. Tsirkin <m...@redhat.com>
CC: Andrew Morton <a...@linux-foundation.org>

Igor Redko (2):
  calculate 'available' memory in the separate function
  virtio_balloon: export 'available' memory to balloon statistics

 drivers/virtio/virtio_balloon.c |  6 ++
 fs/proc/meminfo.c   | 31 +-
 include/linux/mm.h  |  1 +
 include/uapi/linux/virtio_balloon.h |  3 ++-
 mm/page_alloc.c | 43 +
 5 files changed, 53 insertions(+), 31 deletions(-)

-- 
2.5.0



[PATCH 0/2] export 'available' memory to virtio balloon statistics

2016-02-16 Thread Denis V. Lunev
Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
statistics protocol, corresponding to 'Available' in /proc/meminfo.

It indicates to the hypervisor how big the balloon can be inflated
without pushing the guest system to swap. This metric would be very
useful in VM orchestration software to improve memory management
of different VMs under overcommit.

Signed-off-by: Igor Redko 
Reviewed-by: Roman Kagan 
Signed-off-by: Denis V. Lunev 
CC: Michael S. Tsirkin 
CC: Andrew Morton 

Igor Redko (2):
  calculate 'available' memory in the separate function
  virtio_balloon: export 'available' memory to balloon statistics

 drivers/virtio/virtio_balloon.c |  6 ++
 fs/proc/meminfo.c   | 31 +-
 include/linux/mm.h  |  1 +
 include/uapi/linux/virtio_balloon.h |  3 ++-
 mm/page_alloc.c | 43 +
 5 files changed, 53 insertions(+), 31 deletions(-)

-- 
2.5.0



[PATCH 2/2] virtio_balloon: export 'available' memory to balloon statistics

2016-02-16 Thread Denis V. Lunev
From: Igor Redko <red...@virtuozzo.com>

Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
statistics protocol, corresponding to 'Available' in /proc/meminfo.

It indicates to the hypervisor how big the balloon can be inflated
without pushing the guest system to swap.

Signed-off-by: Igor Redko <red...@virtuozzo.com>
Reviewed-by: Roman Kagan <rka...@virtuozzo.com>
Signed-off-by: Denis V. Lunev <d...@openvz.org>
CC: Michael S. Tsirkin <m...@redhat.com>
CC: Andrew Morton <a...@linux-foundation.org>
---
 drivers/virtio/virtio_balloon.c | 6 ++
 include/uapi/linux/virtio_balloon.h | 3 ++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 0c3691f..f2b77de 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -229,10 +230,13 @@ static void update_balloon_stats(struct virtio_balloon 
*vb)
unsigned long events[NR_VM_EVENT_ITEMS];
struct sysinfo i;
int idx = 0;
+   long available;
 
all_vm_events(events);
si_meminfo();
 
+   available = si_mem_available();
+
update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN,
pages_to_bytes(events[PSWPIN]));
update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
@@ -243,6 +247,8 @@ static void update_balloon_stats(struct virtio_balloon *vb)
pages_to_bytes(i.freeram));
update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMTOT,
pages_to_bytes(i.totalram));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_AVAIL,
+   pages_to_bytes(available));
 }
 
 /*
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index d7f1cbc..343d7dd 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -51,7 +51,8 @@ struct virtio_balloon_config {
 #define VIRTIO_BALLOON_S_MINFLT   3   /* Number of minor faults */
 #define VIRTIO_BALLOON_S_MEMFREE  4   /* Total amount of free memory */
 #define VIRTIO_BALLOON_S_MEMTOT   5   /* Total amount of memory */
-#define VIRTIO_BALLOON_S_NR   6
+#define VIRTIO_BALLOON_S_AVAIL6   /* Available memory as in /proc */
+#define VIRTIO_BALLOON_S_NR   7
 
 /*
  * Memory statistics structure.
-- 
2.5.0



[PATCH 2/2] virtio_balloon: export 'available' memory to balloon statistics

2016-02-16 Thread Denis V. Lunev
From: Igor Redko 

Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
statistics protocol, corresponding to 'Available' in /proc/meminfo.

It indicates to the hypervisor how big the balloon can be inflated
without pushing the guest system to swap.

Signed-off-by: Igor Redko 
Reviewed-by: Roman Kagan 
Signed-off-by: Denis V. Lunev 
CC: Michael S. Tsirkin 
CC: Andrew Morton 
---
 drivers/virtio/virtio_balloon.c | 6 ++
 include/uapi/linux/virtio_balloon.h | 3 ++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 0c3691f..f2b77de 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -229,10 +230,13 @@ static void update_balloon_stats(struct virtio_balloon 
*vb)
unsigned long events[NR_VM_EVENT_ITEMS];
struct sysinfo i;
int idx = 0;
+   long available;
 
all_vm_events(events);
si_meminfo();
 
+   available = si_mem_available();
+
update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN,
pages_to_bytes(events[PSWPIN]));
update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
@@ -243,6 +247,8 @@ static void update_balloon_stats(struct virtio_balloon *vb)
pages_to_bytes(i.freeram));
update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMTOT,
pages_to_bytes(i.totalram));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_AVAIL,
+   pages_to_bytes(available));
 }
 
 /*
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index d7f1cbc..343d7dd 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -51,7 +51,8 @@ struct virtio_balloon_config {
 #define VIRTIO_BALLOON_S_MINFLT   3   /* Number of minor faults */
 #define VIRTIO_BALLOON_S_MEMFREE  4   /* Total amount of free memory */
 #define VIRTIO_BALLOON_S_MEMTOT   5   /* Total amount of memory */
-#define VIRTIO_BALLOON_S_NR   6
+#define VIRTIO_BALLOON_S_AVAIL6   /* Available memory as in /proc */
+#define VIRTIO_BALLOON_S_NR   7
 
 /*
  * Memory statistics structure.
-- 
2.5.0



Re: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value

2015-11-12 Thread Denis V. Lunev

On 11/12/2015 08:35 PM, KY Srinivasan wrote:



-Original Message-
From: Denis V. Lunev [mailto:d...@openvz.org]
Sent: Wednesday, November 11, 2015 11:16 PM
To: KY Srinivasan 
Cc: rka...@virtuozzo.com; de...@linuxdriverproject.org; linux-
ker...@vger.kernel.org; Andrey Smetanin ;
Haiyang Zhang ; Vitaly Kuznetsov

Subject: Re: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value

On 11/02/2015 10:42 PM, KY Srinivasan wrote:

-Original Message-
From: Denis V. Lunev [mailto:d...@openvz.org]
Sent: Monday, November 2, 2015 3:34 AM
Cc: rka...@virtuozzo.com; de...@linuxdriverproject.org; linux-
ker...@vger.kernel.org; Andrey Smetanin ;

KY

Srinivasan ; Haiyang Zhang
; Vitaly Kuznetsov ;
Denis V. Lunev 
Subject: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value

From: Andrey Smetanin 

Hypervisor Top Level Functional Specification v3/4 says
that TSC page sequence value = -1(0x) is used to
indicate that TSC page no longer reliable source of reference
timer. Unfortunately, we found that Windows Hyper-V guest
side implementation uses sequence value = 0 to indicate
that Tsc page no longer valid. This is clearly visible
inside Windows 2012R2 ntoskrnl.exe HvlGetReferenceTime()
function dissassembly:

HvlGetReferenceTime proc near
   xchgax, ax
loc_1401C3132:
   mov rax, cs:HvlpReferenceTscPage
   mov r9d, [rax]
   testr9d, r9d
   jz  short loc_1401C3176
   rdtsc
   mov rcx, cs:HvlpReferenceTscPage
   shl rdx, 20h
   or  rdx, rax
   mov rax, [rcx+8]
   mov rcx, cs:HvlpReferenceTscPage
   mov r8, [rcx+10h]
   mul rdx
   mov rax, cs:HvlpReferenceTscPage
   add rdx, r8
   mov ecx, [rax]
   cmp ecx, r9d
   jnz short loc_1401C3132
   jmp short loc_1401C3184
loc_1401C3176:
   mov ecx, 4020h
   rdmsr
   shl rdx, 20h
   or  rdx, rax
loc_1401C3184:
   mov rax, rdx
   retn
HvlGetReferenceTime endp

This patch aligns Tsc page invalid sequence value with
Windows Hyper-V guest implementation which is more
compatible with both Hyper-V hypervisor and KVM hypervisor.

Signed-off-by: Andrey Smetanin 
CC: "K. Y. Srinivasan" 

Thanks Andrey; the Hyper-V team will be updating the Hyper-V

documentation.

Acked-by: K. Y. Srinivasan 

Regards,

K. Y

K.Y.,

can you pls clarify the state of this patch? It is a bit unclear
to me whether it is applied or not.

I will be submitting this shortly.


By the way, I also do not see the following patch
"drivers/hv: cleanup synic msrs if vmbus connect failed"
as applied in the Linux-next. You have promised to resend
it will correct author.

It was resent on October 29. It is in Greg's queue.

Regards,

K. Y

Thank you in advance,
  Den

thank you very much for this update :)

Den
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value

2015-11-12 Thread Denis V. Lunev

On 11/12/2015 08:35 PM, KY Srinivasan wrote:



-Original Message-
From: Denis V. Lunev [mailto:d...@openvz.org]
Sent: Wednesday, November 11, 2015 11:16 PM
To: KY Srinivasan <k...@microsoft.com>
Cc: rka...@virtuozzo.com; de...@linuxdriverproject.org; linux-
ker...@vger.kernel.org; Andrey Smetanin <asmeta...@virtuozzo.com>;
Haiyang Zhang <haiya...@microsoft.com>; Vitaly Kuznetsov
<vkuzn...@redhat.com>
Subject: Re: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value

On 11/02/2015 10:42 PM, KY Srinivasan wrote:

-Original Message-----
From: Denis V. Lunev [mailto:d...@openvz.org]
Sent: Monday, November 2, 2015 3:34 AM
Cc: rka...@virtuozzo.com; de...@linuxdriverproject.org; linux-
ker...@vger.kernel.org; Andrey Smetanin <asmeta...@virtuozzo.com>;

KY

Srinivasan <k...@microsoft.com>; Haiyang Zhang
<haiya...@microsoft.com>; Vitaly Kuznetsov <vkuzn...@redhat.com>;
Denis V. Lunev <d...@openvz.org>
Subject: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value

From: Andrey Smetanin <asmeta...@virtuozzo.com>

Hypervisor Top Level Functional Specification v3/4 says
that TSC page sequence value = -1(0x) is used to
indicate that TSC page no longer reliable source of reference
timer. Unfortunately, we found that Windows Hyper-V guest
side implementation uses sequence value = 0 to indicate
that Tsc page no longer valid. This is clearly visible
inside Windows 2012R2 ntoskrnl.exe HvlGetReferenceTime()
function dissassembly:

HvlGetReferenceTime proc near
   xchgax, ax
loc_1401C3132:
   mov rax, cs:HvlpReferenceTscPage
   mov r9d, [rax]
   testr9d, r9d
   jz  short loc_1401C3176
   rdtsc
   mov rcx, cs:HvlpReferenceTscPage
   shl rdx, 20h
   or  rdx, rax
   mov rax, [rcx+8]
   mov rcx, cs:HvlpReferenceTscPage
   mov r8, [rcx+10h]
   mul rdx
   mov rax, cs:HvlpReferenceTscPage
   add rdx, r8
   mov ecx, [rax]
   cmp ecx, r9d
   jnz short loc_1401C3132
   jmp short loc_1401C3184
loc_1401C3176:
   mov ecx, 4020h
   rdmsr
   shl rdx, 20h
   or  rdx, rax
loc_1401C3184:
   mov rax, rdx
   retn
HvlGetReferenceTime endp

This patch aligns Tsc page invalid sequence value with
Windows Hyper-V guest implementation which is more
compatible with both Hyper-V hypervisor and KVM hypervisor.

Signed-off-by: Andrey Smetanin <asmeta...@virtuozzo.com>
CC: "K. Y. Srinivasan" <k...@microsoft.com>

Thanks Andrey; the Hyper-V team will be updating the Hyper-V

documentation.

Acked-by: K. Y. Srinivasan <k...@microsoft.com>

Regards,

K. Y

K.Y.,

can you pls clarify the state of this patch? It is a bit unclear
to me whether it is applied or not.

I will be submitting this shortly.


By the way, I also do not see the following patch
"drivers/hv: cleanup synic msrs if vmbus connect failed"
as applied in the Linux-next. You have promised to resend
it will correct author.

It was resent on October 29. It is in Greg's queue.

Regards,

K. Y

Thank you in advance,
  Den

thank you very much for this update :)

Den
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value

2015-11-11 Thread Denis V. Lunev

On 11/02/2015 10:42 PM, KY Srinivasan wrote:



-Original Message-
From: Denis V. Lunev [mailto:d...@openvz.org]
Sent: Monday, November 2, 2015 3:34 AM
Cc: rka...@virtuozzo.com; de...@linuxdriverproject.org; linux-
ker...@vger.kernel.org; Andrey Smetanin ; KY
Srinivasan ; Haiyang Zhang
; Vitaly Kuznetsov ;
Denis V. Lunev 
Subject: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value

From: Andrey Smetanin 

Hypervisor Top Level Functional Specification v3/4 says
that TSC page sequence value = -1(0x) is used to
indicate that TSC page no longer reliable source of reference
timer. Unfortunately, we found that Windows Hyper-V guest
side implementation uses sequence value = 0 to indicate
that Tsc page no longer valid. This is clearly visible
inside Windows 2012R2 ntoskrnl.exe HvlGetReferenceTime()
function dissassembly:

HvlGetReferenceTime proc near
  xchgax, ax
loc_1401C3132:
  mov rax, cs:HvlpReferenceTscPage
  mov r9d, [rax]
  testr9d, r9d
  jz  short loc_1401C3176
  rdtsc
  mov rcx, cs:HvlpReferenceTscPage
  shl rdx, 20h
  or  rdx, rax
  mov rax, [rcx+8]
  mov rcx, cs:HvlpReferenceTscPage
  mov r8, [rcx+10h]
  mul rdx
  mov rax, cs:HvlpReferenceTscPage
  add rdx, r8
  mov ecx, [rax]
  cmp ecx, r9d
  jnz short loc_1401C3132
  jmp short loc_1401C3184
loc_1401C3176:
  mov ecx, 4020h
  rdmsr
  shl rdx, 20h
  or  rdx, rax
loc_1401C3184:
  mov rax, rdx
  retn
HvlGetReferenceTime endp

This patch aligns Tsc page invalid sequence value with
Windows Hyper-V guest implementation which is more
compatible with both Hyper-V hypervisor and KVM hypervisor.

Signed-off-by: Andrey Smetanin 
CC: "K. Y. Srinivasan" 

Thanks Andrey; the Hyper-V team will be updating the Hyper-V documentation.

Acked-by: K. Y. Srinivasan 

Regards,

K. Y

K.Y.,

can you pls clarify the state of this patch? It is a bit unclear
to me whether it is applied or not.

By the way, I also do not see the following patch
"drivers/hv: cleanup synic msrs if vmbus connect failed"
as applied in the Linux-next. You have promised to resend
it will correct author.

Thank you in advance,
Den
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value

2015-11-11 Thread Denis V. Lunev

On 11/02/2015 10:42 PM, KY Srinivasan wrote:



-Original Message-
From: Denis V. Lunev [mailto:d...@openvz.org]
Sent: Monday, November 2, 2015 3:34 AM
Cc: rka...@virtuozzo.com; de...@linuxdriverproject.org; linux-
ker...@vger.kernel.org; Andrey Smetanin <asmeta...@virtuozzo.com>; KY
Srinivasan <k...@microsoft.com>; Haiyang Zhang
<haiya...@microsoft.com>; Vitaly Kuznetsov <vkuzn...@redhat.com>;
Denis V. Lunev <d...@openvz.org>
Subject: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value

From: Andrey Smetanin <asmeta...@virtuozzo.com>

Hypervisor Top Level Functional Specification v3/4 says
that TSC page sequence value = -1(0x) is used to
indicate that TSC page no longer reliable source of reference
timer. Unfortunately, we found that Windows Hyper-V guest
side implementation uses sequence value = 0 to indicate
that Tsc page no longer valid. This is clearly visible
inside Windows 2012R2 ntoskrnl.exe HvlGetReferenceTime()
function dissassembly:

HvlGetReferenceTime proc near
  xchgax, ax
loc_1401C3132:
  mov rax, cs:HvlpReferenceTscPage
  mov r9d, [rax]
  testr9d, r9d
  jz  short loc_1401C3176
  rdtsc
  mov rcx, cs:HvlpReferenceTscPage
  shl rdx, 20h
  or  rdx, rax
  mov rax, [rcx+8]
  mov rcx, cs:HvlpReferenceTscPage
  mov r8, [rcx+10h]
  mul rdx
  mov rax, cs:HvlpReferenceTscPage
  add rdx, r8
  mov ecx, [rax]
  cmp ecx, r9d
  jnz short loc_1401C3132
  jmp short loc_1401C3184
loc_1401C3176:
  mov ecx, 4020h
  rdmsr
  shl rdx, 20h
  or  rdx, rax
loc_1401C3184:
  mov rax, rdx
  retn
HvlGetReferenceTime endp

This patch aligns Tsc page invalid sequence value with
Windows Hyper-V guest implementation which is more
compatible with both Hyper-V hypervisor and KVM hypervisor.

Signed-off-by: Andrey Smetanin <asmeta...@virtuozzo.com>
CC: "K. Y. Srinivasan" <k...@microsoft.com>

Thanks Andrey; the Hyper-V team will be updating the Hyper-V documentation.

Acked-by: K. Y. Srinivasan <k...@microsoft.com>

Regards,

K. Y

K.Y.,

can you pls clarify the state of this patch? It is a bit unclear
to me whether it is applied or not.

By the way, I also do not see the following patch
"drivers/hv: cleanup synic msrs if vmbus connect failed"
as applied in the Linux-next. You have promised to resend
it will correct author.

Thank you in advance,
Den
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/1] drivers/hv: correct tsc page sequence invalid value

2015-11-02 Thread Denis V. Lunev
From: Andrey Smetanin 

Hypervisor Top Level Functional Specification v3/4 says
that TSC page sequence value = -1(0x) is used to
indicate that TSC page no longer reliable source of reference
timer. Unfortunately, we found that Windows Hyper-V guest
side implementation uses sequence value = 0 to indicate
that Tsc page no longer valid. This is clearly visible
inside Windows 2012R2 ntoskrnl.exe HvlGetReferenceTime()
function dissassembly:

HvlGetReferenceTime proc near
 xchgax, ax
loc_1401C3132:
 mov rax, cs:HvlpReferenceTscPage
 mov r9d, [rax]
 testr9d, r9d
 jz  short loc_1401C3176
 rdtsc
 mov rcx, cs:HvlpReferenceTscPage
 shl rdx, 20h
 or  rdx, rax
 mov rax, [rcx+8]
 mov rcx, cs:HvlpReferenceTscPage
 mov r8, [rcx+10h]
 mul rdx
 mov rax, cs:HvlpReferenceTscPage
 add rdx, r8
 mov ecx, [rax]
 cmp ecx, r9d
 jnz short loc_1401C3132
 jmp short loc_1401C3184
loc_1401C3176:
 mov ecx, 4020h
 rdmsr
 shl rdx, 20h
 or  rdx, rax
loc_1401C3184:
 mov rax, rdx
 retn
HvlGetReferenceTime endp

This patch aligns Tsc page invalid sequence value with
Windows Hyper-V guest implementation which is more
compatible with both Hyper-V hypervisor and KVM hypervisor.

Signed-off-by: Andrey Smetanin 
CC: "K. Y. Srinivasan" 
CC: Haiyang Zhang 
CC: Vitaly Kuznetsov 

TARGET: Microsoft Hyper-V Linux guest drivers
Signed-off-by: Denis V. Lunev 
---
 drivers/hv/hv.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index 6341be8..68d0701 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -139,7 +139,7 @@ static cycle_t read_hv_clock_tsc(struct clocksource *arg)
cycle_t current_tick;
struct ms_hyperv_tsc_page *tsc_pg = hv_context.tsc_page;
 
-   if (tsc_pg->tsc_sequence != -1) {
+   if (tsc_pg->tsc_sequence != 0) {
/*
 * Use the tsc page to compute the value.
 */
@@ -161,7 +161,7 @@ static cycle_t read_hv_clock_tsc(struct clocksource *arg)
if (tsc_pg->tsc_sequence == sequence)
return current_tick;
 
-   if (tsc_pg->tsc_sequence != -1)
+   if (tsc_pg->tsc_sequence != 0)
continue;
/*
 * Fallback using MSR method.
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/1] drivers/hv: correct tsc page sequence invalid value

2015-11-02 Thread Denis V. Lunev
From: Andrey Smetanin <asmeta...@virtuozzo.com>

Hypervisor Top Level Functional Specification v3/4 says
that TSC page sequence value = -1(0x) is used to
indicate that TSC page no longer reliable source of reference
timer. Unfortunately, we found that Windows Hyper-V guest
side implementation uses sequence value = 0 to indicate
that Tsc page no longer valid. This is clearly visible
inside Windows 2012R2 ntoskrnl.exe HvlGetReferenceTime()
function dissassembly:

HvlGetReferenceTime proc near
 xchgax, ax
loc_1401C3132:
 mov rax, cs:HvlpReferenceTscPage
 mov r9d, [rax]
 testr9d, r9d
 jz  short loc_1401C3176
 rdtsc
 mov rcx, cs:HvlpReferenceTscPage
 shl rdx, 20h
 or  rdx, rax
 mov rax, [rcx+8]
 mov rcx, cs:HvlpReferenceTscPage
 mov r8, [rcx+10h]
 mul rdx
 mov rax, cs:HvlpReferenceTscPage
 add rdx, r8
 mov ecx, [rax]
 cmp ecx, r9d
 jnz short loc_1401C3132
 jmp short loc_1401C3184
loc_1401C3176:
 mov ecx, 4020h
 rdmsr
 shl rdx, 20h
 or  rdx, rax
loc_1401C3184:
 mov rax, rdx
 retn
HvlGetReferenceTime endp

This patch aligns Tsc page invalid sequence value with
Windows Hyper-V guest implementation which is more
compatible with both Hyper-V hypervisor and KVM hypervisor.

Signed-off-by: Andrey Smetanin <asmeta...@virtuozzo.com>
CC: "K. Y. Srinivasan" <k...@microsoft.com>
CC: Haiyang Zhang <haiya...@microsoft.com>
CC: Vitaly Kuznetsov <vkuzn...@redhat.com>

TARGET: Microsoft Hyper-V Linux guest drivers
Signed-off-by: Denis V. Lunev <d...@openvz.org>
---
 drivers/hv/hv.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index 6341be8..68d0701 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -139,7 +139,7 @@ static cycle_t read_hv_clock_tsc(struct clocksource *arg)
cycle_t current_tick;
struct ms_hyperv_tsc_page *tsc_pg = hv_context.tsc_page;
 
-   if (tsc_pg->tsc_sequence != -1) {
+   if (tsc_pg->tsc_sequence != 0) {
/*
 * Use the tsc page to compute the value.
 */
@@ -161,7 +161,7 @@ static cycle_t read_hv_clock_tsc(struct clocksource *arg)
if (tsc_pg->tsc_sequence == sequence)
return current_tick;
 
-   if (tsc_pg->tsc_sequence != -1)
+   if (tsc_pg->tsc_sequence != 0)
continue;
/*
 * Fallback using MSR method.
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 07/10] drivers/hv: cleanup synic msrs if vmbus connect failed

2015-10-08 Thread Denis V. Lunev

On 10/08/2015 08:28 PM, KY Srinivasan wrote:



-Original Message-
From: Denis V. Lunev [mailto:d...@openvz.org]
Sent: Thursday, October 8, 2015 10:20 AM
To: KY Srinivasan ; gre...@linuxfoundation.org; linux-
ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
a...@canonical.com; vkuzn...@redhat.com; jasow...@redhat.com
Cc: Andrey Smetanin ; Haiyang Zhang

Subject: Re: [PATCH 07/10] drivers/hv: cleanup synic msrs if vmbus connect
failed

On 10/08/2015 05:01 AM, K. Y. Srinivasan wrote:

From: Denis V. Lunev 

K.Y.,

there is one subtle thing in this submission. You have changed "From:"
field in comparison with the original letter. I have submitted
the patch with "From: Andrey Smetanin "
In this case Author: in the resulted git mainstream commit will
be Andrey. With your submission the resulted Author will be I.

This was already happened once with

commit cc2dd4027a43bb36c846f195a764edabc0828602
Author: Denis V. Lunev 
Date:   Sat Aug 1 16:08:20 2015 -0700

  mshyperv: fix recognition of Hyper-V guest crash MSR's

The situation looks a bit unfair.

Can we do something with that now/next time?

I am going to be resubmitting this series. I will fix it up.

thank you very much :)

Den
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 07/10] drivers/hv: cleanup synic msrs if vmbus connect failed

2015-10-08 Thread Denis V. Lunev

On 10/08/2015 05:01 AM, K. Y. Srinivasan wrote:

From: Denis V. Lunev 


K.Y.,

there is one subtle thing in this submission. You have changed "From:"
field in comparison with the original letter. I have submitted
the patch with "From: Andrey Smetanin "
In this case Author: in the resulted git mainstream commit will
be Andrey. With your submission the resulted Author will be I.

This was already happened once with

commit cc2dd4027a43bb36c846f195a764edabc0828602
Author: Denis V. Lunev 
Date:   Sat Aug 1 16:08:20 2015 -0700

mshyperv: fix recognition of Hyper-V guest crash MSR's

The situation looks a bit unfair.

Can we do something with that now/next time?

Den


Before vmbus_connect() synic is setup per vcpu - this means
hypervisor receives writes at synic msr's and probably allocate
hypervisor resources per synic setup.

If vmbus_connect() failed for some reason it's neccessary to cleanup
synic setup by call hv_synic_cleanup() at each vcpu to get a chance
to free allocated resources by hypervisor per synic.

This patch does appropriate cleanup in case of vmbus_connect() failure.

Signed-off-by: Andrey Smetanin 
Signed-off-by: Denis V. Lunev 
Reviewed-by: Vitaly Kuznetsov 
CC: "K. Y. Srinivasan" 
CC: Haiyang Zhang 
CC: Vitaly Kuznetsov 
Signed-off-by: K. Y. Srinivasan 
---
  drivers/hv/vmbus_drv.c |4 +++-
  1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index f19b6f7..3297731 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -867,7 +867,7 @@ static int vmbus_bus_init(int irq)
on_each_cpu(hv_synic_init, NULL, 1);
ret = vmbus_connect();
if (ret)
-   goto err_alloc;
+   goto err_connect;
  
  	if (vmbus_proto_version > VERSION_WIN7)

cpu_hotplug_disable();
@@ -885,6 +885,8 @@ static int vmbus_bus_init(int irq)
  
  	return 0;
  
+err_connect:

+   on_each_cpu(hv_synic_cleanup, NULL, 1);
  err_alloc:
hv_synic_free();
hv_remove_vmbus_irq();


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 07/10] drivers/hv: cleanup synic msrs if vmbus connect failed

2015-10-08 Thread Denis V. Lunev

On 10/08/2015 05:01 AM, K. Y. Srinivasan wrote:

From: Denis V. Lunev <d...@openvz.org>


K.Y.,

there is one subtle thing in this submission. You have changed "From:"
field in comparison with the original letter. I have submitted
the patch with "From: Andrey Smetanin <asmeta...@virtuozzo.com>"
In this case Author: in the resulted git mainstream commit will
be Andrey. With your submission the resulted Author will be I.

This was already happened once with

commit cc2dd4027a43bb36c846f195a764edabc0828602
Author: Denis V. Lunev <d...@openvz.org>
Date:   Sat Aug 1 16:08:20 2015 -0700

mshyperv: fix recognition of Hyper-V guest crash MSR's

The situation looks a bit unfair.

Can we do something with that now/next time?

Den


Before vmbus_connect() synic is setup per vcpu - this means
hypervisor receives writes at synic msr's and probably allocate
hypervisor resources per synic setup.

If vmbus_connect() failed for some reason it's neccessary to cleanup
synic setup by call hv_synic_cleanup() at each vcpu to get a chance
to free allocated resources by hypervisor per synic.

This patch does appropriate cleanup in case of vmbus_connect() failure.

Signed-off-by: Andrey Smetanin <asmeta...@virtuozzo.com>
Signed-off-by: Denis V. Lunev <d...@openvz.org>
Reviewed-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
CC: "K. Y. Srinivasan" <k...@microsoft.com>
CC: Haiyang Zhang <haiya...@microsoft.com>
CC: Vitaly Kuznetsov <vkuzn...@redhat.com>
Signed-off-by: K. Y. Srinivasan <k...@microsoft.com>
---
  drivers/hv/vmbus_drv.c |4 +++-
  1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index f19b6f7..3297731 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -867,7 +867,7 @@ static int vmbus_bus_init(int irq)
on_each_cpu(hv_synic_init, NULL, 1);
ret = vmbus_connect();
if (ret)
-   goto err_alloc;
+   goto err_connect;
  
  	if (vmbus_proto_version > VERSION_WIN7)

cpu_hotplug_disable();
@@ -885,6 +885,8 @@ static int vmbus_bus_init(int irq)
  
  	return 0;
  
+err_connect:

+   on_each_cpu(hv_synic_cleanup, NULL, 1);
  err_alloc:
hv_synic_free();
hv_remove_vmbus_irq();


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 07/10] drivers/hv: cleanup synic msrs if vmbus connect failed

2015-10-08 Thread Denis V. Lunev

On 10/08/2015 08:28 PM, KY Srinivasan wrote:



-Original Message-
From: Denis V. Lunev [mailto:d...@openvz.org]
Sent: Thursday, October 8, 2015 10:20 AM
To: KY Srinivasan <k...@microsoft.com>; gre...@linuxfoundation.org; linux-
ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
a...@canonical.com; vkuzn...@redhat.com; jasow...@redhat.com
Cc: Andrey Smetanin <asmeta...@virtuozzo.com>; Haiyang Zhang
<haiya...@microsoft.com>
Subject: Re: [PATCH 07/10] drivers/hv: cleanup synic msrs if vmbus connect
failed

On 10/08/2015 05:01 AM, K. Y. Srinivasan wrote:

From: Denis V. Lunev <d...@openvz.org>

K.Y.,

there is one subtle thing in this submission. You have changed "From:"
field in comparison with the original letter. I have submitted
the patch with "From: Andrey Smetanin <asmeta...@virtuozzo.com>"
In this case Author: in the resulted git mainstream commit will
be Andrey. With your submission the resulted Author will be I.

This was already happened once with

commit cc2dd4027a43bb36c846f195a764edabc0828602
Author: Denis V. Lunev <d...@openvz.org>
Date:   Sat Aug 1 16:08:20 2015 -0700

  mshyperv: fix recognition of Hyper-V guest crash MSR's

The situation looks a bit unfair.

Can we do something with that now/next time?

I am going to be resubmitting this series. I will fix it up.

thank you very much :)

Den
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/1] drivers/hv: cleanup synic msrs if vmbus connect failed

2015-10-07 Thread Denis V. Lunev
From: Andrey Smetanin 

Before vmbus_connect() synic is setup per vcpu - this means
hypervisor receives writes at synic msr's and probably allocate
hypervisor resources per synic setup.

If vmbus_connect() failed for some reason it's neccessary to cleanup
synic setup by call hv_synic_cleanup() at each vcpu to get a chance
to free allocated resources by hypervisor per synic.

This patch does appropriate cleanup in case of vmbus_connect() failure.

Signed-off-by: Andrey Smetanin 
Signed-off-by: Denis V. Lunev 
CC: "K. Y. Srinivasan" 
CC: Haiyang Zhang 
CC: Vitaly Kuznetsov 
---
 drivers/hv/vmbus_drv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index f19b6f7..3297731 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -867,7 +867,7 @@ static int vmbus_bus_init(int irq)
on_each_cpu(hv_synic_init, NULL, 1);
ret = vmbus_connect();
if (ret)
-   goto err_alloc;
+   goto err_connect;
 
if (vmbus_proto_version > VERSION_WIN7)
cpu_hotplug_disable();
@@ -885,6 +885,8 @@ static int vmbus_bus_init(int irq)
 
return 0;
 
+err_connect:
+   on_each_cpu(hv_synic_cleanup, NULL, 1);
 err_alloc:
hv_synic_free();
hv_remove_vmbus_irq();
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/1] drivers/hv: cleanup synic msrs if vmbus connect failed

2015-10-07 Thread Denis V. Lunev
From: Andrey Smetanin <asmeta...@virtuozzo.com>

Before vmbus_connect() synic is setup per vcpu - this means
hypervisor receives writes at synic msr's and probably allocate
hypervisor resources per synic setup.

If vmbus_connect() failed for some reason it's neccessary to cleanup
synic setup by call hv_synic_cleanup() at each vcpu to get a chance
to free allocated resources by hypervisor per synic.

This patch does appropriate cleanup in case of vmbus_connect() failure.

Signed-off-by: Andrey Smetanin <asmeta...@virtuozzo.com>
Signed-off-by: Denis V. Lunev <d...@openvz.org>
CC: "K. Y. Srinivasan" <k...@microsoft.com>
CC: Haiyang Zhang <haiya...@microsoft.com>
CC: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 drivers/hv/vmbus_drv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index f19b6f7..3297731 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -867,7 +867,7 @@ static int vmbus_bus_init(int irq)
on_each_cpu(hv_synic_init, NULL, 1);
ret = vmbus_connect();
if (ret)
-   goto err_alloc;
+   goto err_connect;
 
if (vmbus_proto_version > VERSION_WIN7)
cpu_hotplug_disable();
@@ -885,6 +885,8 @@ static int vmbus_bus_init(int irq)
 
return 0;
 
+err_connect:
+   on_each_cpu(hv_synic_cleanup, NULL, 1);
 err_alloc:
hv_synic_free();
hv_remove_vmbus_irq();
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH 1/1] mshyperv: fix recognition of Hyper-V guest crash MSR's

2015-07-02 Thread Denis V. Lunev
From: Andrey Smetanin 

Hypervisor Top Level Functional Specification v3.1/4.0 notes that cpuid
(0x4003) EDX's 10th bit should be used to check that Hyper-V guest
crash MSR's functionality available.

This patch should fix this recognition. Currently the code checks EAX
register instead of EDX.

Signed-off-by: Andrey Smetanin 
Signed-off-by: Denis V. Lunev 
CC: Nick Meier 
CC: K. Y. Srinivasan 
CC: Haiyang Zhang 
---
 arch/x86/include/asm/mshyperv.h | 1 +
 arch/x86/kernel/cpu/mshyperv.c  | 1 +
 drivers/hv/vmbus_drv.c  | 4 ++--
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index c163215..eebe433 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -7,6 +7,7 @@
 
 struct ms_hyperv_info {
u32 features;
+   u32 misc_features;
u32 hints;
 };
 
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index aad4bd8..6d172a2 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -114,6 +114,7 @@ static void __init ms_hyperv_init_platform(void)
 * Extract the features and hints
 */
ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES);
+   ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES);
ms_hyperv.hints= cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO);
 
printk(KERN_INFO "HyperV: features 0x%x, hints 0x%x\n",
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index cf20400..67af13a 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -841,7 +841,7 @@ static int vmbus_bus_init(int irq)
/*
 * Only register if the crash MSRs are available
 */
-   if (ms_hyperv.features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) {
+   if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) {
atomic_notifier_chain_register(_notifier_list,
   _panic_block);
}
@@ -1110,7 +1110,7 @@ static void __exit vmbus_exit(void)
hv_remove_vmbus_irq();
tasklet_kill(_dpc);
vmbus_free_channels();
-   if (ms_hyperv.features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) {
+   if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) {
atomic_notifier_chain_unregister(_notifier_list,
 _panic_block);
}
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH 1/1] mshyperv: fix recognition of Hyper-V guest crash MSR's

2015-07-02 Thread Denis V. Lunev
From: Andrey Smetanin asmeta...@virtuozzo.com

Hypervisor Top Level Functional Specification v3.1/4.0 notes that cpuid
(0x4003) EDX's 10th bit should be used to check that Hyper-V guest
crash MSR's functionality available.

This patch should fix this recognition. Currently the code checks EAX
register instead of EDX.

Signed-off-by: Andrey Smetanin asmeta...@virtuozzo.com
Signed-off-by: Denis V. Lunev d...@openvz.org
CC: Nick Meier nme...@microsoft.com
CC: K. Y. Srinivasan k...@microsoft.com
CC: Haiyang Zhang haiya...@microsoft.com
---
 arch/x86/include/asm/mshyperv.h | 1 +
 arch/x86/kernel/cpu/mshyperv.c  | 1 +
 drivers/hv/vmbus_drv.c  | 4 ++--
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index c163215..eebe433 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -7,6 +7,7 @@
 
 struct ms_hyperv_info {
u32 features;
+   u32 misc_features;
u32 hints;
 };
 
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index aad4bd8..6d172a2 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -114,6 +114,7 @@ static void __init ms_hyperv_init_platform(void)
 * Extract the features and hints
 */
ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES);
+   ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES);
ms_hyperv.hints= cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO);
 
printk(KERN_INFO HyperV: features 0x%x, hints 0x%x\n,
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index cf20400..67af13a 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -841,7 +841,7 @@ static int vmbus_bus_init(int irq)
/*
 * Only register if the crash MSRs are available
 */
-   if (ms_hyperv.features  HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) {
+   if (ms_hyperv.misc_features  HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) {
atomic_notifier_chain_register(panic_notifier_list,
   hyperv_panic_block);
}
@@ -1110,7 +1110,7 @@ static void __exit vmbus_exit(void)
hv_remove_vmbus_irq();
tasklet_kill(msg_dpc);
vmbus_free_channels();
-   if (ms_hyperv.features  HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) {
+   if (ms_hyperv.misc_features  HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) {
atomic_notifier_chain_unregister(panic_notifier_list,
 hyperv_panic_block);
}
-- 
2.1.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 0/2] shrink virtio baloon on OOM in guest

2014-10-15 Thread Denis V. Lunev
Excessive virtio_balloon inflation can cause invocation of OOM-killer, when
Linux is under severe memory pressure. Various mechanisms are responsible for
correct virtio_balloon memory management. Nevertheless it is often the case
that these control tools does not have enough time to react on fast changing
memory load. As a result OS runs out of memory and invokes OOM-killer.
The balancing of memory by use of the virtio balloon should not cause the
termination of processes while there are pages in the balloon. Now there is
no way for virtio balloon driver to free memory at the last moment before
some process get killed by OOM-killer.

This does not provide a security breach as baloon itself is running
inside guest OS and is working in the cooperation with the host. Thus
some improvements from guest side should be considered as normal.

To solve the problem, introduce a virtio_balloon callback which is expected
to be called from the oom notifier call chain in out_of_memory() function.
If virtio balloon could release some memory, it will make the system to
return and retry the allocation that forced the out of memory killer to run.

Patch 1 of this series adds support for implementation of virtio_balloon
callback, so now leak_balloon() function returns number of freed pages.
Patch 2 implements virtio_balloon callback itself.

Changes from v2:
- added feature bit to control OOM baloon behavior from host
Changes from v1:
- minor cosmetic tweaks suggested by rusty@

Signed-off-by: Raushaniya Maksudova 
Signed-off-by: Denis V. Lunev 
CC: Rusty Russell 
CC: Michael S. Tsirkin 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2] virtio_balloon: return the amount of freed memory from leak_balloon()

2014-10-15 Thread Denis V. Lunev
From: Raushaniya Maksudova 

This value would be useful in the next patch to provide the amount of
the freed memory for OOM killer.

Signed-off-by: Raushaniya Maksudova 
Signed-off-by: Denis V. Lunev 
CC: Rusty Russell 
CC: Michael S. Tsirkin 
---
 drivers/virtio/virtio_balloon.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f893148..66cac10 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -168,8 +168,9 @@ static void release_pages_by_pfn(const u32 pfns[], unsigned 
int num)
}
 }
 
-static void leak_balloon(struct virtio_balloon *vb, size_t num)
+static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 {
+   unsigned num_freed_pages;
struct page *page;
struct balloon_dev_info *vb_dev_info = >vb_dev_info;
 
@@ -186,6 +187,7 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
 
+   num_freed_pages = vb->num_pfns;
/*
 * Note that if
 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
@@ -195,6 +197,7 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
tell_host(vb, vb->deflate_vq);
mutex_unlock(>balloon_lock);
release_pages_by_pfn(vb->pfns, vb->num_pfns);
+   return num_freed_pages;
 }
 
 static inline void update_stat(struct virtio_balloon *vb, int idx,
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/2] virtio_balloon: free some memory from balloon on OOM

2014-10-15 Thread Denis V. Lunev
From: Raushaniya Maksudova 

Excessive virtio_balloon inflation can cause invocation of OOM-killer,
when Linux is under severe memory pressure. Various mechanisms are
responsible for correct virtio_balloon memory management. Nevertheless
it is often the case that these control tools does not have enough time
to react on fast changing memory load. As a result OS runs out of memory
and invokes OOM-killer. The balancing of memory by use of the virtio
balloon should not cause the termination of processes while there are
pages in the balloon. Now there is no way for virtio balloon driver to
free some memory at the last moment before some process will be get
killed by OOM-killer.

This does not provide a security breach as balloon itself is running
inside guest OS and is working in the cooperation with the host. Thus
some improvements from guest side should be considered as normal.

To solve the problem, introduce a virtio_balloon callback which is
expected to be called from the oom notifier call chain in out_of_memory()
function. If virtio balloon could release some memory, it will make
the system to return and retry the allocation that forced the out of
memory killer to run.

Allocate virtio  feature bit for this: it is not set by default,
the the guest will not deflate virtio balloon on OOM without explicit
permission from host.

Signed-off-by: Raushaniya Maksudova 
Signed-off-by: Denis V. Lunev 
CC: Rusty Russell 
CC: Michael S. Tsirkin 
---
 drivers/virtio/virtio_balloon.c | 52 +
 include/uapi/linux/virtio_balloon.h |  1 +
 2 files changed, 53 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 66cac10..88d73a0 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -36,6 +37,12 @@
  */
 #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> 
VIRTIO_BALLOON_PFN_SHIFT)
 #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
+#define OOM_VBALLOON_DEFAULT_PAGES 256
+#define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
+
+static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
+module_param(oom_pages, int, S_IRUSR | S_IWUSR);
+MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 
 struct virtio_balloon
 {
@@ -71,6 +78,9 @@ struct virtio_balloon
/* Memory statistics */
int need_stats_update;
struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
+
+   /* To register callback in oom notifier call chain */
+   struct notifier_block nb;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -290,6 +300,38 @@ static void update_balloon_size(struct virtio_balloon *vb)
  );
 }
 
+/*
+ * virtballoon_oom_notify - release pages when system is under severe
+ * memory pressure (called from out_of_memory())
+ * @self : notifier block struct
+ * @dummy: not used
+ * @parm : returned - number of freed pages
+ *
+ * The balancing of memory by use of the virtio balloon should not cause
+ * the termination of processes while there are pages in the balloon.
+ * If virtio balloon manages to release some memory, it will make the
+ * system return and retry the allocation that forced the OOM killer
+ * to run.
+ */
+static int virtballoon_oom_notify(struct notifier_block *self,
+ unsigned long dummy, void *parm)
+{
+   struct virtio_balloon *vb;
+   unsigned long *freed;
+   unsigned num_freed_pages;
+
+   vb = container_of(self, struct virtio_balloon, nb);
+   if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
+   return NOTIFY_OK;
+
+   freed = parm;
+   num_freed_pages = leak_balloon(vb, oom_pages);
+   update_balloon_size(vb);
+   *freed += num_freed_pages;
+
+   return NOTIFY_OK;
+}
+
 static int balloon(void *_vballoon)
 {
struct virtio_balloon *vb = _vballoon;
@@ -446,6 +488,12 @@ static int virtballoon_probe(struct virtio_device *vdev)
if (err)
goto out_free_vb;
 
+   vb->nb.notifier_call = virtballoon_oom_notify;
+   vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
+   err = register_oom_notifier(>nb);
+   if (err < 0)
+   goto out_oom_notify;
+
vb->thread = kthread_run(balloon, vb, "vballoon");
if (IS_ERR(vb->thread)) {
err = PTR_ERR(vb->thread);
@@ -455,6 +503,8 @@ static int virtballoon_probe(struct virtio_device *vdev)
return 0;
 
 out_del_vqs:
+   unregister_oom_notifier(>nb);
+out_oom_notify:
vdev->config->del_vqs(vdev);
 out_free_vb:
kfree(vb);
@@ -479,6 +529,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 {
struct virtio_balloon *vb = vdev->priv;
 
+   unregister_oom_notifier(>nb);
k

[PATCH 2/2] virtio_balloon: free some memory from balloon on OOM

2014-10-15 Thread Denis V. Lunev
From: Raushaniya Maksudova rmaksud...@parallels.com

Excessive virtio_balloon inflation can cause invocation of OOM-killer,
when Linux is under severe memory pressure. Various mechanisms are
responsible for correct virtio_balloon memory management. Nevertheless
it is often the case that these control tools does not have enough time
to react on fast changing memory load. As a result OS runs out of memory
and invokes OOM-killer. The balancing of memory by use of the virtio
balloon should not cause the termination of processes while there are
pages in the balloon. Now there is no way for virtio balloon driver to
free some memory at the last moment before some process will be get
killed by OOM-killer.

This does not provide a security breach as balloon itself is running
inside guest OS and is working in the cooperation with the host. Thus
some improvements from guest side should be considered as normal.

To solve the problem, introduce a virtio_balloon callback which is
expected to be called from the oom notifier call chain in out_of_memory()
function. If virtio balloon could release some memory, it will make
the system to return and retry the allocation that forced the out of
memory killer to run.

Allocate virtio  feature bit for this: it is not set by default,
the the guest will not deflate virtio balloon on OOM without explicit
permission from host.

Signed-off-by: Raushaniya Maksudova rmaksud...@parallels.com
Signed-off-by: Denis V. Lunev d...@openvz.org
CC: Rusty Russell ru...@rustcorp.com.au
CC: Michael S. Tsirkin m...@redhat.com
---
 drivers/virtio/virtio_balloon.c | 52 +
 include/uapi/linux/virtio_balloon.h |  1 +
 2 files changed, 53 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 66cac10..88d73a0 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -28,6 +28,7 @@
 #include linux/slab.h
 #include linux/module.h
 #include linux/balloon_compaction.h
+#include linux/oom.h
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -36,6 +37,12 @@
  */
 #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE  
VIRTIO_BALLOON_PFN_SHIFT)
 #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
+#define OOM_VBALLOON_DEFAULT_PAGES 256
+#define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
+
+static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
+module_param(oom_pages, int, S_IRUSR | S_IWUSR);
+MODULE_PARM_DESC(oom_pages, pages to free on OOM);
 
 struct virtio_balloon
 {
@@ -71,6 +78,9 @@ struct virtio_balloon
/* Memory statistics */
int need_stats_update;
struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
+
+   /* To register callback in oom notifier call chain */
+   struct notifier_block nb;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -290,6 +300,38 @@ static void update_balloon_size(struct virtio_balloon *vb)
  actual);
 }
 
+/*
+ * virtballoon_oom_notify - release pages when system is under severe
+ * memory pressure (called from out_of_memory())
+ * @self : notifier block struct
+ * @dummy: not used
+ * @parm : returned - number of freed pages
+ *
+ * The balancing of memory by use of the virtio balloon should not cause
+ * the termination of processes while there are pages in the balloon.
+ * If virtio balloon manages to release some memory, it will make the
+ * system return and retry the allocation that forced the OOM killer
+ * to run.
+ */
+static int virtballoon_oom_notify(struct notifier_block *self,
+ unsigned long dummy, void *parm)
+{
+   struct virtio_balloon *vb;
+   unsigned long *freed;
+   unsigned num_freed_pages;
+
+   vb = container_of(self, struct virtio_balloon, nb);
+   if (!virtio_has_feature(vb-vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
+   return NOTIFY_OK;
+
+   freed = parm;
+   num_freed_pages = leak_balloon(vb, oom_pages);
+   update_balloon_size(vb);
+   *freed += num_freed_pages;
+
+   return NOTIFY_OK;
+}
+
 static int balloon(void *_vballoon)
 {
struct virtio_balloon *vb = _vballoon;
@@ -446,6 +488,12 @@ static int virtballoon_probe(struct virtio_device *vdev)
if (err)
goto out_free_vb;
 
+   vb-nb.notifier_call = virtballoon_oom_notify;
+   vb-nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
+   err = register_oom_notifier(vb-nb);
+   if (err  0)
+   goto out_oom_notify;
+
vb-thread = kthread_run(balloon, vb, vballoon);
if (IS_ERR(vb-thread)) {
err = PTR_ERR(vb-thread);
@@ -455,6 +503,8 @@ static int virtballoon_probe(struct virtio_device *vdev)
return 0;
 
 out_del_vqs:
+   unregister_oom_notifier(vb-nb);
+out_oom_notify:
vdev-config-del_vqs(vdev);
 out_free_vb:
kfree(vb);
@@ -479,6 +529,7 @@ static void virtballoon_remove(struct virtio_device *vdev

[PATCH 1/2] virtio_balloon: return the amount of freed memory from leak_balloon()

2014-10-15 Thread Denis V. Lunev
From: Raushaniya Maksudova rmaksud...@parallels.com

This value would be useful in the next patch to provide the amount of
the freed memory for OOM killer.

Signed-off-by: Raushaniya Maksudova rmaksud...@parallels.com
Signed-off-by: Denis V. Lunev d...@openvz.org
CC: Rusty Russell ru...@rustcorp.com.au
CC: Michael S. Tsirkin m...@redhat.com
---
 drivers/virtio/virtio_balloon.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f893148..66cac10 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -168,8 +168,9 @@ static void release_pages_by_pfn(const u32 pfns[], unsigned 
int num)
}
 }
 
-static void leak_balloon(struct virtio_balloon *vb, size_t num)
+static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 {
+   unsigned num_freed_pages;
struct page *page;
struct balloon_dev_info *vb_dev_info = vb-vb_dev_info;
 
@@ -186,6 +187,7 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
vb-num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
 
+   num_freed_pages = vb-num_pfns;
/*
 * Note that if
 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
@@ -195,6 +197,7 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
tell_host(vb, vb-deflate_vq);
mutex_unlock(vb-balloon_lock);
release_pages_by_pfn(vb-pfns, vb-num_pfns);
+   return num_freed_pages;
 }
 
 static inline void update_stat(struct virtio_balloon *vb, int idx,
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 0/2] shrink virtio baloon on OOM in guest

2014-10-15 Thread Denis V. Lunev
Excessive virtio_balloon inflation can cause invocation of OOM-killer, when
Linux is under severe memory pressure. Various mechanisms are responsible for
correct virtio_balloon memory management. Nevertheless it is often the case
that these control tools does not have enough time to react on fast changing
memory load. As a result OS runs out of memory and invokes OOM-killer.
The balancing of memory by use of the virtio balloon should not cause the
termination of processes while there are pages in the balloon. Now there is
no way for virtio balloon driver to free memory at the last moment before
some process get killed by OOM-killer.

This does not provide a security breach as baloon itself is running
inside guest OS and is working in the cooperation with the host. Thus
some improvements from guest side should be considered as normal.

To solve the problem, introduce a virtio_balloon callback which is expected
to be called from the oom notifier call chain in out_of_memory() function.
If virtio balloon could release some memory, it will make the system to
return and retry the allocation that forced the out of memory killer to run.

Patch 1 of this series adds support for implementation of virtio_balloon
callback, so now leak_balloon() function returns number of freed pages.
Patch 2 implements virtio_balloon callback itself.

Changes from v2:
- added feature bit to control OOM baloon behavior from host
Changes from v1:
- minor cosmetic tweaks suggested by rusty@

Signed-off-by: Raushaniya Maksudova rmaksud...@parallels.com
Signed-off-by: Denis V. Lunev d...@openvz.org
CC: Rusty Russell ru...@rustcorp.com.au
CC: Michael S. Tsirkin m...@redhat.com

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2] virtio_balloon: return the amount of freed memory from leak_balloon()

2014-10-14 Thread Denis V. Lunev
From: Raushaniya Maksudova 

This value would be useful in the next patch to provide the amount of
the freed memory for OOM killer.

Signed-off-by: Raushaniya Maksudova 
Signed-off-by: Denis V. Lunev 
CC: Rusty Russell 
CC: Michael S. Tsirkin 
---
 drivers/virtio/virtio_balloon.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f893148..66cac10 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -168,8 +168,9 @@ static void release_pages_by_pfn(const u32 pfns[], unsigned 
int num)
}
 }
 
-static void leak_balloon(struct virtio_balloon *vb, size_t num)
+static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 {
+   unsigned num_freed_pages;
struct page *page;
struct balloon_dev_info *vb_dev_info = >vb_dev_info;
 
@@ -186,6 +187,7 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
 
+   num_freed_pages = vb->num_pfns;
/*
 * Note that if
 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
@@ -195,6 +197,7 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
tell_host(vb, vb->deflate_vq);
mutex_unlock(>balloon_lock);
release_pages_by_pfn(vb->pfns, vb->num_pfns);
+   return num_freed_pages;
 }
 
 static inline void update_stat(struct virtio_balloon *vb, int idx,
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2 0/2] shrink virtio baloon on OOM in guest

2014-10-14 Thread Denis V. Lunev
Excessive virtio_balloon inflation can cause invocation of OOM-killer, when
Linux is under severe memory pressure. Various mechanisms are responsible for
correct virtio_balloon memory management. Nevertheless it is often the case
that these control tools does not have enough time to react on fast changing
memory load. As a result OS runs out of memory and invokes OOM-killer.
The balancing of memory by use of the virtio balloon should not cause the
termination of processes while there are pages in the balloon. Now there is
no way for virtio balloon driver to free memory at the last moment before
some process get killed by OOM-killer.

This does not provide a security breach as baloon itself is running
inside guest OS and is working in the cooperation with the host. Thus
some improvements from guest side should be considered as normal.

To solve the problem, introduce a virtio_balloon callback which is expected
to be called from the oom notifier call chain in out_of_memory() function.
If virtio balloon could release some memory, it will make the system to
return and retry the allocation that forced the out of memory killer to run.

Patch 1 of this series adds support for implementation of virtio_balloon
callback, so now leak_balloon() function returns number of freed pages.
Patch 2 implements virtio_balloon callback itself.

Changes from v1:
- minor cosmetic tweaks suggested by rusty@

Signed-off-by: Raushaniya Maksudova 
Signed-off-by: Denis V. Lunev 
CC: Rusty Russell 
CC: Michael S. Tsirkin 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/2] virtio_balloon: free some memory from balloon on OOM

2014-10-14 Thread Denis V. Lunev
From: Raushaniya Maksudova 

Excessive virtio_balloon inflation can cause invocation of OOM-killer,
when Linux is under severe memory pressure. Various mechanisms are
responsible for correct virtio_balloon memory management. Nevertheless
it is often the case that these control tools does not have enough time
to react on fast changing memory load. As a result OS runs out of memory
and invokes OOM-killer. The balancing of memory by use of the virtio
balloon should not cause the termination of processes while there are
pages in the balloon. Now there is no way for virtio balloon driver to
free some memory at the last moment before some process will be get
killed by OOM-killer.

This does not provide a security breach as balloon itself is running
inside guest OS and is working in the cooperation with the host. Thus
some improvements from guest side should be considered as normal.

To solve the problem, introduce a virtio_balloon callback which is
expected to be called from the oom notifier call chain in out_of_memory()
function. If virtio balloon could release some memory, it will make
the system to return and retry the allocation that forced the out of
memory killer to run.

Signed-off-by: Raushaniya Maksudova 
Signed-off-by: Denis V. Lunev 
CC: Rusty Russell 
CC: Michael S. Tsirkin 
---
 drivers/virtio/virtio_balloon.c | 48 +
 1 file changed, 48 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 66cac10..ab1fa69 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -36,6 +37,12 @@
  */
 #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> 
VIRTIO_BALLOON_PFN_SHIFT)
 #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
+#define OOM_VBALLOON_DEFAULT_PAGES 256
+#define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
+
+static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
+module_param(oom_pages, int, S_IRUSR | S_IWUSR);
+MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 
 struct virtio_balloon
 {
@@ -71,6 +78,9 @@ struct virtio_balloon
/* Memory statistics */
int need_stats_update;
struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
+
+   /* To register callback in oom notifier call chain */
+   struct notifier_block nb;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -290,6 +300,35 @@ static void update_balloon_size(struct virtio_balloon *vb)
  );
 }
 
+/*
+ * virtballoon_oom_notify - release pages when system is under severe
+ *  memory pressure (called from out_of_memory())
+ * @self : notifier block struct
+ * @dummy: not used
+ * @parm : returned - number of freed pages
+ *
+ * The balancing of memory by use of the virtio balloon should not cause
+ * the termination of processes while there are pages in the balloon.
+ * If virtio balloon manages to release some memory, it will make the
+ * system return and retry the allocation that forced the OOM killer
+ * to run.
+ */
+static int virtballoon_oom_notify(struct notifier_block *self,
+ unsigned long dummy, void *parm)
+{
+   unsigned num_freed_pages;
+   unsigned long *freed;
+   struct virtio_balloon *vb;
+
+   freed = parm;
+   vb = container_of(self, struct virtio_balloon, nb);
+   num_freed_pages = leak_balloon(vb, oom_pages);
+   update_balloon_size(vb);
+   *freed += num_freed_pages;
+
+   return NOTIFY_OK;
+}
+
 static int balloon(void *_vballoon)
 {
struct virtio_balloon *vb = _vballoon;
@@ -446,6 +485,12 @@ static int virtballoon_probe(struct virtio_device *vdev)
if (err)
goto out_free_vb;
 
+   vb->nb.notifier_call = virtballoon_oom_notify;
+   vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
+   err = register_oom_notifier(>nb);
+   if (err < 0)
+   goto out_oom_notify;
+
vb->thread = kthread_run(balloon, vb, "vballoon");
if (IS_ERR(vb->thread)) {
err = PTR_ERR(vb->thread);
@@ -455,6 +500,8 @@ static int virtballoon_probe(struct virtio_device *vdev)
return 0;
 
 out_del_vqs:
+   unregister_oom_notifier(>nb);
+out_oom_notify:
vdev->config->del_vqs(vdev);
 out_free_vb:
kfree(vb);
@@ -479,6 +526,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 {
struct virtio_balloon *vb = vdev->priv;
 
+   unregister_oom_notifier(>nb);
kthread_stop(vb->thread);
remove_common(vb);
kfree(vb);
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/2] virtio_balloon: free some memory from balloon on OOM

2014-10-14 Thread Denis V. Lunev
From: Raushaniya Maksudova rmaksud...@parallels.com

Excessive virtio_balloon inflation can cause invocation of OOM-killer,
when Linux is under severe memory pressure. Various mechanisms are
responsible for correct virtio_balloon memory management. Nevertheless
it is often the case that these control tools does not have enough time
to react on fast changing memory load. As a result OS runs out of memory
and invokes OOM-killer. The balancing of memory by use of the virtio
balloon should not cause the termination of processes while there are
pages in the balloon. Now there is no way for virtio balloon driver to
free some memory at the last moment before some process will be get
killed by OOM-killer.

This does not provide a security breach as balloon itself is running
inside guest OS and is working in the cooperation with the host. Thus
some improvements from guest side should be considered as normal.

To solve the problem, introduce a virtio_balloon callback which is
expected to be called from the oom notifier call chain in out_of_memory()
function. If virtio balloon could release some memory, it will make
the system to return and retry the allocation that forced the out of
memory killer to run.

Signed-off-by: Raushaniya Maksudova rmaksud...@parallels.com
Signed-off-by: Denis V. Lunev d...@openvz.org
CC: Rusty Russell ru...@rustcorp.com.au
CC: Michael S. Tsirkin m...@redhat.com
---
 drivers/virtio/virtio_balloon.c | 48 +
 1 file changed, 48 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 66cac10..ab1fa69 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -28,6 +28,7 @@
 #include linux/slab.h
 #include linux/module.h
 #include linux/balloon_compaction.h
+#include linux/oom.h
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -36,6 +37,12 @@
  */
 #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE  
VIRTIO_BALLOON_PFN_SHIFT)
 #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
+#define OOM_VBALLOON_DEFAULT_PAGES 256
+#define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
+
+static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
+module_param(oom_pages, int, S_IRUSR | S_IWUSR);
+MODULE_PARM_DESC(oom_pages, pages to free on OOM);
 
 struct virtio_balloon
 {
@@ -71,6 +78,9 @@ struct virtio_balloon
/* Memory statistics */
int need_stats_update;
struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
+
+   /* To register callback in oom notifier call chain */
+   struct notifier_block nb;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -290,6 +300,35 @@ static void update_balloon_size(struct virtio_balloon *vb)
  actual);
 }
 
+/*
+ * virtballoon_oom_notify - release pages when system is under severe
+ *  memory pressure (called from out_of_memory())
+ * @self : notifier block struct
+ * @dummy: not used
+ * @parm : returned - number of freed pages
+ *
+ * The balancing of memory by use of the virtio balloon should not cause
+ * the termination of processes while there are pages in the balloon.
+ * If virtio balloon manages to release some memory, it will make the
+ * system return and retry the allocation that forced the OOM killer
+ * to run.
+ */
+static int virtballoon_oom_notify(struct notifier_block *self,
+ unsigned long dummy, void *parm)
+{
+   unsigned num_freed_pages;
+   unsigned long *freed;
+   struct virtio_balloon *vb;
+
+   freed = parm;
+   vb = container_of(self, struct virtio_balloon, nb);
+   num_freed_pages = leak_balloon(vb, oom_pages);
+   update_balloon_size(vb);
+   *freed += num_freed_pages;
+
+   return NOTIFY_OK;
+}
+
 static int balloon(void *_vballoon)
 {
struct virtio_balloon *vb = _vballoon;
@@ -446,6 +485,12 @@ static int virtballoon_probe(struct virtio_device *vdev)
if (err)
goto out_free_vb;
 
+   vb-nb.notifier_call = virtballoon_oom_notify;
+   vb-nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
+   err = register_oom_notifier(vb-nb);
+   if (err  0)
+   goto out_oom_notify;
+
vb-thread = kthread_run(balloon, vb, vballoon);
if (IS_ERR(vb-thread)) {
err = PTR_ERR(vb-thread);
@@ -455,6 +500,8 @@ static int virtballoon_probe(struct virtio_device *vdev)
return 0;
 
 out_del_vqs:
+   unregister_oom_notifier(vb-nb);
+out_oom_notify:
vdev-config-del_vqs(vdev);
 out_free_vb:
kfree(vb);
@@ -479,6 +526,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 {
struct virtio_balloon *vb = vdev-priv;
 
+   unregister_oom_notifier(vb-nb);
kthread_stop(vb-thread);
remove_common(vb);
kfree(vb);
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo

[PATCH 1/2] virtio_balloon: return the amount of freed memory from leak_balloon()

2014-10-14 Thread Denis V. Lunev
From: Raushaniya Maksudova rmaksud...@parallels.com

This value would be useful in the next patch to provide the amount of
the freed memory for OOM killer.

Signed-off-by: Raushaniya Maksudova rmaksud...@parallels.com
Signed-off-by: Denis V. Lunev d...@openvz.org
CC: Rusty Russell ru...@rustcorp.com.au
CC: Michael S. Tsirkin m...@redhat.com
---
 drivers/virtio/virtio_balloon.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f893148..66cac10 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -168,8 +168,9 @@ static void release_pages_by_pfn(const u32 pfns[], unsigned 
int num)
}
 }
 
-static void leak_balloon(struct virtio_balloon *vb, size_t num)
+static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 {
+   unsigned num_freed_pages;
struct page *page;
struct balloon_dev_info *vb_dev_info = vb-vb_dev_info;
 
@@ -186,6 +187,7 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
vb-num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
 
+   num_freed_pages = vb-num_pfns;
/*
 * Note that if
 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
@@ -195,6 +197,7 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
tell_host(vb, vb-deflate_vq);
mutex_unlock(vb-balloon_lock);
release_pages_by_pfn(vb-pfns, vb-num_pfns);
+   return num_freed_pages;
 }
 
 static inline void update_stat(struct virtio_balloon *vb, int idx,
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2 0/2] shrink virtio baloon on OOM in guest

2014-10-14 Thread Denis V. Lunev
Excessive virtio_balloon inflation can cause invocation of OOM-killer, when
Linux is under severe memory pressure. Various mechanisms are responsible for
correct virtio_balloon memory management. Nevertheless it is often the case
that these control tools does not have enough time to react on fast changing
memory load. As a result OS runs out of memory and invokes OOM-killer.
The balancing of memory by use of the virtio balloon should not cause the
termination of processes while there are pages in the balloon. Now there is
no way for virtio balloon driver to free memory at the last moment before
some process get killed by OOM-killer.

This does not provide a security breach as baloon itself is running
inside guest OS and is working in the cooperation with the host. Thus
some improvements from guest side should be considered as normal.

To solve the problem, introduce a virtio_balloon callback which is expected
to be called from the oom notifier call chain in out_of_memory() function.
If virtio balloon could release some memory, it will make the system to
return and retry the allocation that forced the out of memory killer to run.

Patch 1 of this series adds support for implementation of virtio_balloon
callback, so now leak_balloon() function returns number of freed pages.
Patch 2 implements virtio_balloon callback itself.

Changes from v1:
- minor cosmetic tweaks suggested by rusty@

Signed-off-by: Raushaniya Maksudova rmaksud...@parallels.com
Signed-off-by: Denis V. Lunev d...@openvz.org
CC: Rusty Russell ru...@rustcorp.com.au
CC: Michael S. Tsirkin m...@redhat.com


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: lock_task_group_list() can be called from the atomic context

2008-02-12 Thread Denis V. Lunev
Acked-by: Denis V. Lunev <[EMAIL PROTECTED]>

On Mon, 2008-02-11 at 14:33 +0100, Peter Zijlstra wrote:
> On Mon, 2008-02-11 at 15:09 +0300, Denis V. Lunev wrote:
> 
> Curious, I hadn't yet seen it... Does the below fix it?
> 
> > BUG: sleeping function called from invalid context
> > at /home/den/src/linux-netns26/kernel/mutex.c:209
> > in_atomic():1, irqs_disabled():0
> > no locks held by swapper/0.
> > Pid: 0, comm: swapper Not tainted 2.6.24 #304
> > 
> > Call Trace:
> >[] ? __debug_show_held_locks+0x15/0x27
> >  [] __might_sleep+0xc0/0xdf
> >  [] mutex_lock_nested+0x28/0x2a9
> >  [] sched_destroy_group+0x18/0xea
> >  [] sched_destroy_user+0xd/0xf
> >  [] free_uid+0x8a/0xab
> >  [] __put_task_struct+0x3f/0xd3
> >  [] delayed_put_task_struct+0x23/0x25
> >  [] __rcu_process_callbacks+0x8d/0x215
> >  [] rcu_process_callbacks+0x23/0x44
> >  [] __do_softirq+0x79/0xf8
> >  [] ? profile_pc+0x2a/0x67
> >  [] call_softirq+0x1c/0x30
> >  [] do_softirq+0x61/0x9c
> >  [] irq_exit+0x51/0x53
> >  [] smp_apic_timer_interrupt+0x77/0xad
> >  [] apic_timer_interrupt+0x6b/0x70
> >[] ? default_idle+0x43/0x76
> >  [] ? default_idle+0x41/0x76
> >  [] ? default_idle+0x0/0x76
> >  [] ? cpu_idle+0x76/0x98
> 
> separate the tg->shares protection from the task_group lock.
> 
> Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
> ---
>  kernel/sched.c |   37 +
>  1 file changed, 17 insertions(+), 20 deletions(-)
> 
> Index: linux-2.6/kernel/sched.c
> ===
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -232,10 +232,10 @@ static struct cfs_rq *init_cfs_rq_p[NR_C
>  static struct sched_rt_entity *init_sched_rt_entity_p[NR_CPUS];
>  static struct rt_rq *init_rt_rq_p[NR_CPUS];
>  
> -/* task_group_mutex serializes add/remove of task groups and also changes to
> +/* task_group_lock serializes add/remove of task groups and also changes to
>   * a task group's cpu shares.
>   */
> -static DEFINE_MUTEX(task_group_mutex);
> +static DEFINE_SPINLOCK(task_group_lock);
>  
>  /* doms_cur_mutex serializes access to doms_cur[] array */
>  static DEFINE_MUTEX(doms_cur_mutex);
> @@ -295,16 +295,6 @@ static inline void set_task_rq(struct ta
>   p->rt.parent = task_group(p)->rt_se[cpu];
>  }
>  
> -static inline void lock_task_group_list(void)
> -{
> - mutex_lock(_group_mutex);
> -}
> -
> -static inline void unlock_task_group_list(void)
> -{
> - mutex_unlock(_group_mutex);
> -}
> -
>  static inline void lock_doms_cur(void)
>  {
>   mutex_lock(_cur_mutex);
> @@ -318,8 +308,6 @@ static inline void unlock_doms_cur(void)
>  #else
>  
>  static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
> -static inline void lock_task_group_list(void) { }
> -static inline void unlock_task_group_list(void) { }
>  static inline void lock_doms_cur(void) { }
>  static inline void unlock_doms_cur(void) { }
>  
> @@ -7571,6 +7559,7 @@ struct task_group *sched_create_group(vo
>   struct rt_rq *rt_rq;
>   struct sched_rt_entity *rt_se;
>   struct rq *rq;
> + unsigned long flags;
>   int i;
>  
>   tg = kzalloc(sizeof(*tg), GFP_KERNEL);
> @@ -7620,7 +7609,7 @@ struct task_group *sched_create_group(vo
>   init_tg_rt_entry(rq, tg, rt_rq, rt_se, i, 0);
>   }
>  
> - lock_task_group_list();
> + spin_lock_irqsave(_group_lock, flags);
>   for_each_possible_cpu(i) {
>   rq = cpu_rq(i);
>   cfs_rq = tg->cfs_rq[i];
> @@ -7629,7 +7618,7 @@ struct task_group *sched_create_group(vo
>   list_add_rcu(_rq->leaf_rt_rq_list, >leaf_rt_rq_list);
>   }
>   list_add_rcu(>list, _groups);
> - unlock_task_group_list();
> + spin_unlock_irqrestore(_group_lock, flags);
>  
>   return tg;
>  
> @@ -7650,9 +7639,10 @@ void sched_destroy_group(struct task_gro
>  {
>   struct cfs_rq *cfs_rq = NULL;
>   struct rt_rq *rt_rq = NULL;
> + unsigned long flags;
>   int i;
>  
> - lock_task_group_list();
> + spin_lock_irqsave(_group_lock, flags);
>   for_each_possible_cpu(i) {
>   cfs_rq = tg->cfs_rq[i];
>   list_del_rcu(_rq->leaf_cfs_rq_list);
> @@ -7660,7 +7650,7 @@ void sched_destroy_group(struct task_gro
>   list_del_rcu(_rq->leaf_rt_rq_list);
>   }
>   list_del_rcu(>list);
> - unlock_task_group_list();
> + spin_unlock_irqrestore(_group_l

Re: lock_task_group_list() can be called from the atomic context

2008-02-12 Thread Denis V. Lunev
Acked-by: Denis V. Lunev [EMAIL PROTECTED]

On Mon, 2008-02-11 at 14:33 +0100, Peter Zijlstra wrote:
 On Mon, 2008-02-11 at 15:09 +0300, Denis V. Lunev wrote:
 
 Curious, I hadn't yet seen it... Does the below fix it?
 
  BUG: sleeping function called from invalid context
  at /home/den/src/linux-netns26/kernel/mutex.c:209
  in_atomic():1, irqs_disabled():0
  no locks held by swapper/0.
  Pid: 0, comm: swapper Not tainted 2.6.24 #304
  
  Call Trace:
   IRQ  [80252d1e] ? __debug_show_held_locks+0x15/0x27
   [8022c2a8] __might_sleep+0xc0/0xdf
   [8049f1df] mutex_lock_nested+0x28/0x2a9
   [80231294] sched_destroy_group+0x18/0xea
   [8023e835] sched_destroy_user+0xd/0xf
   [8023e8c1] free_uid+0x8a/0xab
   [80233e24] __put_task_struct+0x3f/0xd3
   [80236708] delayed_put_task_struct+0x23/0x25
   [8026fda7] __rcu_process_callbacks+0x8d/0x215
   [8026ff52] rcu_process_callbacks+0x23/0x44
   [8023a2ae] __do_softirq+0x79/0xf8
   [8020f8c3] ? profile_pc+0x2a/0x67
   [8020d38c] call_softirq+0x1c/0x30
   [8020f689] do_softirq+0x61/0x9c
   [8023a233] irq_exit+0x51/0x53
   [8021bd1a] smp_apic_timer_interrupt+0x77/0xad
   [8020ce3b] apic_timer_interrupt+0x6b/0x70
   EOI  [8020b0dd] ? default_idle+0x43/0x76
   [8020b0db] ? default_idle+0x41/0x76
   [8020b09a] ? default_idle+0x0/0x76
   [8020b186] ? cpu_idle+0x76/0x98
 
 separate the tg-shares protection from the task_group lock.
 
 Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
 ---
  kernel/sched.c |   37 +
  1 file changed, 17 insertions(+), 20 deletions(-)
 
 Index: linux-2.6/kernel/sched.c
 ===
 --- linux-2.6.orig/kernel/sched.c
 +++ linux-2.6/kernel/sched.c
 @@ -232,10 +232,10 @@ static struct cfs_rq *init_cfs_rq_p[NR_C
  static struct sched_rt_entity *init_sched_rt_entity_p[NR_CPUS];
  static struct rt_rq *init_rt_rq_p[NR_CPUS];
  
 -/* task_group_mutex serializes add/remove of task groups and also changes to
 +/* task_group_lock serializes add/remove of task groups and also changes to
   * a task group's cpu shares.
   */
 -static DEFINE_MUTEX(task_group_mutex);
 +static DEFINE_SPINLOCK(task_group_lock);
  
  /* doms_cur_mutex serializes access to doms_cur[] array */
  static DEFINE_MUTEX(doms_cur_mutex);
 @@ -295,16 +295,6 @@ static inline void set_task_rq(struct ta
   p-rt.parent = task_group(p)-rt_se[cpu];
  }
  
 -static inline void lock_task_group_list(void)
 -{
 - mutex_lock(task_group_mutex);
 -}
 -
 -static inline void unlock_task_group_list(void)
 -{
 - mutex_unlock(task_group_mutex);
 -}
 -
  static inline void lock_doms_cur(void)
  {
   mutex_lock(doms_cur_mutex);
 @@ -318,8 +308,6 @@ static inline void unlock_doms_cur(void)
  #else
  
  static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
 -static inline void lock_task_group_list(void) { }
 -static inline void unlock_task_group_list(void) { }
  static inline void lock_doms_cur(void) { }
  static inline void unlock_doms_cur(void) { }
  
 @@ -7571,6 +7559,7 @@ struct task_group *sched_create_group(vo
   struct rt_rq *rt_rq;
   struct sched_rt_entity *rt_se;
   struct rq *rq;
 + unsigned long flags;
   int i;
  
   tg = kzalloc(sizeof(*tg), GFP_KERNEL);
 @@ -7620,7 +7609,7 @@ struct task_group *sched_create_group(vo
   init_tg_rt_entry(rq, tg, rt_rq, rt_se, i, 0);
   }
  
 - lock_task_group_list();
 + spin_lock_irqsave(task_group_lock, flags);
   for_each_possible_cpu(i) {
   rq = cpu_rq(i);
   cfs_rq = tg-cfs_rq[i];
 @@ -7629,7 +7618,7 @@ struct task_group *sched_create_group(vo
   list_add_rcu(rt_rq-leaf_rt_rq_list, rq-leaf_rt_rq_list);
   }
   list_add_rcu(tg-list, task_groups);
 - unlock_task_group_list();
 + spin_unlock_irqrestore(task_group_lock, flags);
  
   return tg;
  
 @@ -7650,9 +7639,10 @@ void sched_destroy_group(struct task_gro
  {
   struct cfs_rq *cfs_rq = NULL;
   struct rt_rq *rt_rq = NULL;
 + unsigned long flags;
   int i;
  
 - lock_task_group_list();
 + spin_lock_irqsave(task_group_lock, flags);
   for_each_possible_cpu(i) {
   cfs_rq = tg-cfs_rq[i];
   list_del_rcu(cfs_rq-leaf_cfs_rq_list);
 @@ -7660,7 +7650,7 @@ void sched_destroy_group(struct task_gro
   list_del_rcu(rt_rq-leaf_rt_rq_list);
   }
   list_del_rcu(tg-list);
 - unlock_task_group_list();
 + spin_unlock_irqrestore(task_group_lock, flags);
  
   BUG_ON(!cfs_rq);
  
 @@ -7728,13 +7718,16 @@ static void set_se_shares(struct sched_e
   }
  }
  
 +static DEFINE_MUTEX(shares_mutex);
 +
  int sched_group_set_shares(struct task_group *tg, unsigned long shares)
  {
   int i;
   struct cfs_rq *cfs_rq;
   struct rq *rq

lock_task_group_list() can be called from the atomic context

2008-02-11 Thread Denis V. Lunev
Hello, Ingo!

I am seeing the following calltrace every day I am connecting to my test
host by the gnome-panel:
   ssh -Y test.host gnome-terminal

BUG: sleeping function called from invalid context
at /home/den/src/linux-netns26/kernel/mutex.c:209
in_atomic():1, irqs_disabled():0
no locks held by swapper/0.
Pid: 0, comm: swapper Not tainted 2.6.24 #304

Call Trace:
   [] ? __debug_show_held_locks+0x15/0x27
 [] __might_sleep+0xc0/0xdf
 [] mutex_lock_nested+0x28/0x2a9
 [] sched_destroy_group+0x18/0xea
 [] sched_destroy_user+0xd/0xf
 [] free_uid+0x8a/0xab
 [] __put_task_struct+0x3f/0xd3
 [] delayed_put_task_struct+0x23/0x25
 [] __rcu_process_callbacks+0x8d/0x215
 [] rcu_process_callbacks+0x23/0x44
 [] __do_softirq+0x79/0xf8
 [] ? profile_pc+0x2a/0x67
 [] call_softirq+0x1c/0x30
 [] do_softirq+0x61/0x9c
 [] irq_exit+0x51/0x53
 [] smp_apic_timer_interrupt+0x77/0xad
 [] apic_timer_interrupt+0x6b/0x70
   [] ? default_idle+0x43/0x76
 [] ? default_idle+0x41/0x76
 [] ? default_idle+0x0/0x76
 [] ? cpu_idle+0x76/0x98

Config is attached.

The kernel is today pulled Dave Miller's tree.

Regards,
Den
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.24
# Mon Feb 11 11:14:57 2008
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
# CONFIG_QUICKLIST is not set
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_AOUT=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_TRAMPOLINE=y
# CONFIG_KTIME_SCALAR is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
# CONFIG_TASK_XACCT is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_TREE=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=17
# CONFIG_CGROUPS is not set
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_FAIR_USER_SCHED=y
# CONFIG_FAIR_CGROUP_SCHED is not set
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_EMBEDDED=y
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_COMPAT_BRK=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
# CONFIG_PROFILING is not set
# CONFIG_MARKERS is not set
CONFIG_HAVE_OPROFILE=y
# CONFIG_KPROBES is not set
CONFIG_HAVE_KPROBES=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
# CONFIG_BLK_DEV_BSG is not set
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
# CONFIG_IOSCHED_AS is not set
# CONFIG_IOSCHED_DEADLINE is not set
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_CLASSIC_RCU=y
# CONFIG_PREEMPT_RCU is not set

#
# Processor type and features
#
# CONFIG_TICK_ONESHOT is not set
# CONFIG_NO_HZ is not set
# CONFIG_HIGH_RES_TIMERS is not set
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not 

lock_task_group_list() can be called from the atomic context

2008-02-11 Thread Denis V. Lunev
Hello, Ingo!

I am seeing the following calltrace every day I am connecting to my test
host by the gnome-panel:
   ssh -Y test.host gnome-terminal

BUG: sleeping function called from invalid context
at /home/den/src/linux-netns26/kernel/mutex.c:209
in_atomic():1, irqs_disabled():0
no locks held by swapper/0.
Pid: 0, comm: swapper Not tainted 2.6.24 #304

Call Trace:
 IRQ  [80252d1e] ? __debug_show_held_locks+0x15/0x27
 [8022c2a8] __might_sleep+0xc0/0xdf
 [8049f1df] mutex_lock_nested+0x28/0x2a9
 [80231294] sched_destroy_group+0x18/0xea
 [8023e835] sched_destroy_user+0xd/0xf
 [8023e8c1] free_uid+0x8a/0xab
 [80233e24] __put_task_struct+0x3f/0xd3
 [80236708] delayed_put_task_struct+0x23/0x25
 [8026fda7] __rcu_process_callbacks+0x8d/0x215
 [8026ff52] rcu_process_callbacks+0x23/0x44
 [8023a2ae] __do_softirq+0x79/0xf8
 [8020f8c3] ? profile_pc+0x2a/0x67
 [8020d38c] call_softirq+0x1c/0x30
 [8020f689] do_softirq+0x61/0x9c
 [8023a233] irq_exit+0x51/0x53
 [8021bd1a] smp_apic_timer_interrupt+0x77/0xad
 [8020ce3b] apic_timer_interrupt+0x6b/0x70
 EOI  [8020b0dd] ? default_idle+0x43/0x76
 [8020b0db] ? default_idle+0x41/0x76
 [8020b09a] ? default_idle+0x0/0x76
 [8020b186] ? cpu_idle+0x76/0x98

Config is attached.

The kernel is today pulled Dave Miller's tree.

Regards,
Den
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.24
# Mon Feb 11 11:14:57 2008
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
# CONFIG_QUICKLIST is not set
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_AOUT=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_TRAMPOLINE=y
# CONFIG_KTIME_SCALAR is not set
CONFIG_DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
# CONFIG_TASK_XACCT is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_TREE=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=17
# CONFIG_CGROUPS is not set
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_FAIR_USER_SCHED=y
# CONFIG_FAIR_CGROUP_SCHED is not set
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_EMBEDDED=y
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_COMPAT_BRK=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
# CONFIG_PROFILING is not set
# CONFIG_MARKERS is not set
CONFIG_HAVE_OPROFILE=y
# CONFIG_KPROBES is not set
CONFIG_HAVE_KPROBES=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
# CONFIG_BLK_DEV_BSG is not set
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
# CONFIG_IOSCHED_AS is not set
# CONFIG_IOSCHED_DEADLINE is not set
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED=cfq
CONFIG_CLASSIC_RCU=y
# 

[PATCH 6/6] [NETNS]: Lookup in FIB semantic hashes taking into account the namespace.

2008-01-31 Thread Denis V. Lunev
The namespace is not available in the fib_sync_down_addr, add it
as a parameter.

Looking up a device by the pointer to it is OK. Looking up using a result
from fib_trie/fib_hash table lookup is also safe. No need to fix that at all.
So, just fix lookup by address and insertion to the hash table path.

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
---
 include/net/ip_fib.h |2 +-
 net/ipv4/fib_frontend.c  |2 +-
 net/ipv4/fib_semantics.c |6 +-
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index cb0df37..90d1175 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -220,7 +220,7 @@ extern void fib_select_default(struct net *net, const 
struct flowi *flp,
 /* Exported by fib_semantics.c */
 extern int ip_fib_check_default(__be32 gw, struct net_device *dev);
 extern int fib_sync_down_dev(struct net_device *dev, int force);
-extern int fib_sync_down_addr(__be32 local);
+extern int fib_sync_down_addr(struct net *net, __be32 local);
 extern int fib_sync_up(struct net_device *dev);
 extern __be32  __fib_res_prefsrc(struct fib_result *res);
 extern void fib_select_multipath(const struct flowi *flp, struct fib_result 
*res);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d69ffa2..86ff271 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -808,7 +808,7 @@ static void fib_del_ifaddr(struct in_ifaddr *ifa)
   First of all, we scan fib_info list searching
   for stray nexthop entries, then ignite fib_flush.
*/
-   if (fib_sync_down_addr(ifa->ifa_local))
+   if (fib_sync_down_addr(dev->nd_net, ifa->ifa_local))
fib_flush(dev->nd_net);
}
}
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 97cc494..a13c847 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -229,6 +229,8 @@ static struct fib_info *fib_find_info(const struct fib_info 
*nfi)
head = _info_hash[hash];
 
hlist_for_each_entry(fi, node, head, fib_hash) {
+   if (fi->fib_net != nfi->fib_net)
+   continue;
if (fi->fib_nhs != nfi->fib_nhs)
continue;
if (nfi->fib_protocol == fi->fib_protocol &&
@@ -1031,7 +1033,7 @@ nla_put_failure:
  referring to it.
- device went down -> we must shutdown all nexthops going via it.
  */
-int fib_sync_down_addr(__be32 local)
+int fib_sync_down_addr(struct net *net, __be32 local)
 {
int ret = 0;
unsigned int hash = fib_laddr_hashfn(local);
@@ -1043,6 +1045,8 @@ int fib_sync_down_addr(__be32 local)
return 0;
 
hlist_for_each_entry(fi, node, head, fib_lhash) {
+   if (fi->fib_net != net)
+   continue;
if (fi->fib_prefsrc == local) {
fi->fib_flags |= RTNH_F_DEAD;
ret++;
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 4/6] [IPV4]: fib_sync_down rework.

2008-01-31 Thread Denis V. Lunev
fib_sync_down can be called with an address and with a device. In reality
it is called either with address OR with a device. The codepath inside is
completely different, so lets separate it into two calls for these two
cases.

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
---
 include/net/ip_fib.h |3 +-
 net/ipv4/fib_frontend.c  |4 +-
 net/ipv4/fib_semantics.c |  104 +++--
 3 files changed, 57 insertions(+), 54 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 9daa60b..1b2f008 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -218,7 +218,8 @@ extern void fib_select_default(struct net *net, const 
struct flowi *flp,
 
 /* Exported by fib_semantics.c */
 extern int ip_fib_check_default(__be32 gw, struct net_device *dev);
-extern int fib_sync_down(__be32 local, struct net_device *dev, int force);
+extern int fib_sync_down_dev(struct net_device *dev, int force);
+extern int fib_sync_down_addr(__be32 local);
 extern int fib_sync_up(struct net_device *dev);
 extern __be32  __fib_res_prefsrc(struct fib_result *res);
 extern void fib_select_multipath(const struct flowi *flp, struct fib_result 
*res);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d0507f4..d69ffa2 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -808,7 +808,7 @@ static void fib_del_ifaddr(struct in_ifaddr *ifa)
   First of all, we scan fib_info list searching
   for stray nexthop entries, then ignite fib_flush.
*/
-   if (fib_sync_down(ifa->ifa_local, NULL, 0))
+   if (fib_sync_down_addr(ifa->ifa_local))
fib_flush(dev->nd_net);
}
}
@@ -898,7 +898,7 @@ static void nl_fib_lookup_exit(struct net *net)
 
 static void fib_disable_ip(struct net_device *dev, int force)
 {
-   if (fib_sync_down(0, dev, force))
+   if (fib_sync_down_dev(dev, force))
fib_flush(dev->nd_net);
rt_cache_flush(0);
arp_ifdown(dev);
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c791286..5beff2e 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1031,70 +1031,72 @@ nla_put_failure:
  referring to it.
- device went down -> we must shutdown all nexthops going via it.
  */
-
-int fib_sync_down(__be32 local, struct net_device *dev, int force)
+int fib_sync_down_addr(__be32 local)
 {
int ret = 0;
-   int scope = RT_SCOPE_NOWHERE;
-
-   if (force)
-   scope = -1;
+   unsigned int hash = fib_laddr_hashfn(local);
+   struct hlist_head *head = _info_laddrhash[hash];
+   struct hlist_node *node;
+   struct fib_info *fi;
 
-   if (local && fib_info_laddrhash) {
-   unsigned int hash = fib_laddr_hashfn(local);
-   struct hlist_head *head = _info_laddrhash[hash];
-   struct hlist_node *node;
-   struct fib_info *fi;
+   if (fib_info_laddrhash == NULL || local == 0)
+   return 0;
 
-   hlist_for_each_entry(fi, node, head, fib_lhash) {
-   if (fi->fib_prefsrc == local) {
-   fi->fib_flags |= RTNH_F_DEAD;
-   ret++;
-   }
+   hlist_for_each_entry(fi, node, head, fib_lhash) {
+   if (fi->fib_prefsrc == local) {
+   fi->fib_flags |= RTNH_F_DEAD;
+   ret++;
}
}
+   return ret;
+}
 
-   if (dev) {
-   struct fib_info *prev_fi = NULL;
-   unsigned int hash = fib_devindex_hashfn(dev->ifindex);
-   struct hlist_head *head = _info_devhash[hash];
-   struct hlist_node *node;
-   struct fib_nh *nh;
+int fib_sync_down_dev(struct net_device *dev, int force)
+{
+   int ret = 0;
+   int scope = RT_SCOPE_NOWHERE;
+   struct fib_info *prev_fi = NULL;
+   unsigned int hash = fib_devindex_hashfn(dev->ifindex);
+   struct hlist_head *head = _info_devhash[hash];
+   struct hlist_node *node;
+   struct fib_nh *nh;
 
-   hlist_for_each_entry(nh, node, head, nh_hash) {
-   struct fib_info *fi = nh->nh_parent;
-   int dead;
+   if (force)
+   scope = -1;
 
-   BUG_ON(!fi->fib_nhs);
-   if (nh->nh_dev != dev || fi == prev_fi)
-   continue;
-   prev_fi = fi;
-   dead = 0;
-   change_nexthops(fi) {
-   if (nh->nh_flags_F_DEAD)
-   dead++;
-   else if (nh->nh_dev == dev &am

[PATCH 3/6] [NETNS]: Process interface address manipulation routines in the namespace.

2008-01-31 Thread Denis V. Lunev
The namespace is available when required except rtm_to_ifaddr. Add
namespace argument to it.

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
---
 net/ipv4/devinet.c |   14 --
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index e55c85e..6a6e92e 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -485,7 +485,7 @@ errout:
return err;
 }
 
-static struct in_ifaddr *rtm_to_ifaddr(struct nlmsghdr *nlh)
+static struct in_ifaddr *rtm_to_ifaddr(struct net *net, struct nlmsghdr *nlh)
 {
struct nlattr *tb[IFA_MAX+1];
struct in_ifaddr *ifa;
@@ -503,7 +503,7 @@ static struct in_ifaddr *rtm_to_ifaddr(struct nlmsghdr *nlh)
if (ifm->ifa_prefixlen > 32 || tb[IFA_LOCAL] == NULL)
goto errout;
 
-   dev = __dev_get_by_index(_net, ifm->ifa_index);
+   dev = __dev_get_by_index(net, ifm->ifa_index);
err = -ENODEV;
if (dev == NULL)
goto errout;
@@ -571,7 +571,7 @@ static int inet_rtm_newaddr(struct sk_buff *skb, struct 
nlmsghdr *nlh, void *arg
if (net != _net)
return -EINVAL;
 
-   ifa = rtm_to_ifaddr(nlh);
+   ifa = rtm_to_ifaddr(net, nlh);
if (IS_ERR(ifa))
return PTR_ERR(ifa);
 
@@ -1189,7 +1189,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct 
netlink_callback *cb)
 
s_ip_idx = ip_idx = cb->args[1];
idx = 0;
-   for_each_netdev(_net, dev) {
+   for_each_netdev(net, dev) {
if (idx < s_idx)
goto cont;
if (idx > s_idx)
@@ -1223,7 +1223,9 @@ static void rtmsg_ifa(int event, struct in_ifaddr* ifa, 
struct nlmsghdr *nlh,
struct sk_buff *skb;
u32 seq = nlh ? nlh->nlmsg_seq : 0;
int err = -ENOBUFS;
+   struct net *net;
 
+   net = ifa->ifa_dev->dev->nd_net;
skb = nlmsg_new(inet_nlmsg_size(), GFP_KERNEL);
if (skb == NULL)
goto errout;
@@ -1235,10 +1237,10 @@ static void rtmsg_ifa(int event, struct in_ifaddr* ifa, 
struct nlmsghdr *nlh,
kfree_skb(skb);
goto errout;
}
-   err = rtnl_notify(skb, _net, pid, RTNLGRP_IPV4_IFADDR, nlh, 
GFP_KERNEL);
+   err = rtnl_notify(skb, net, pid, RTNLGRP_IPV4_IFADDR, nlh, GFP_KERNEL);
 errout:
if (err < 0)
-   rtnl_set_sk_err(_net, RTNLGRP_IPV4_IFADDR, err);
+   rtnl_set_sk_err(net, RTNLGRP_IPV4_IFADDR, err);
 }
 
 #ifdef CONFIG_SYSCTL
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 5/6] [NETNS]: Add a namespace mark to fib_info.

2008-01-31 Thread Denis V. Lunev
This is required to make fib_info lookups namespace aware. In the other case
initial namespace devices are marked as dead in the local routing table
during other namespace stop.

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
---
 include/net/ip_fib.h |1 +
 net/ipv4/fib_semantics.c |8 
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 1b2f008..cb0df37 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -69,6 +69,7 @@ struct fib_nh {
 struct fib_info {
struct hlist_node   fib_hash;
struct hlist_node   fib_lhash;
+   struct net  *fib_net;
int fib_treeref;
atomic_tfib_clntref;
int fib_dead;
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 5beff2e..97cc494 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -687,6 +687,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
struct fib_info *fi = NULL;
struct fib_info *ofi;
int nhs = 1;
+   struct net *net = cfg->fc_nlinfo.nl_net;
 
/* Fast check to catch the most weird cases */
if (fib_props[cfg->fc_type].scope > cfg->fc_scope)
@@ -727,6 +728,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
goto failure;
fib_info_cnt++;
 
+   fi->fib_net = net;
fi->fib_protocol = cfg->fc_protocol;
fi->fib_flags = cfg->fc_flags;
fi->fib_priority = cfg->fc_priority;
@@ -798,8 +800,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
if (nhs != 1 || nh->nh_gw)
goto err_inval;
nh->nh_scope = RT_SCOPE_NOWHERE;
-   nh->nh_dev = dev_get_by_index(cfg->fc_nlinfo.nl_net,
- fi->fib_nh->nh_oif);
+   nh->nh_dev = dev_get_by_index(net, fi->fib_nh->nh_oif);
err = -ENODEV;
if (nh->nh_dev == NULL)
goto failure;
@@ -813,8 +814,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
if (fi->fib_prefsrc) {
if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
fi->fib_prefsrc != cfg->fc_dst)
-   if (inet_addr_type(cfg->fc_nlinfo.nl_net,
-  fi->fib_prefsrc) != RTN_LOCAL)
+   if (inet_addr_type(net, fi->fib_prefsrc) != RTN_LOCAL)
goto err_inval;
}
 
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/6] [IPV4]: Fix memory leak on error path during FIB initialization.

2008-01-31 Thread Denis V. Lunev
net->ipv4.fib_table_hash is not freed when fib4_rules_init failed. The problem
has been introduced by the following commit.
commit c8050bf6d84785a7edd2e81591e8f833231477e8
Author: Denis V. Lunev <[EMAIL PROTECTED]>
Date:   Thu Jan 10 03:28:24 2008 -0800

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
---
 net/ipv4/fib_frontend.c |   10 +-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d282618..d0507f4 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -975,6 +975,7 @@ static struct notifier_block fib_netdev_notifier = {
 
 static int __net_init ip_fib_net_init(struct net *net)
 {
+   int err;
unsigned int i;
 
net->ipv4.fib_table_hash = kzalloc(
@@ -985,7 +986,14 @@ static int __net_init ip_fib_net_init(struct net *net)
for (i = 0; i < FIB_TABLE_HASHSZ; i++)
INIT_HLIST_HEAD(>ipv4.fib_table_hash[i]);
 
-   return fib4_rules_init(net);
+   err = fib4_rules_init(net);
+   if (err < 0)
+   goto fail;
+   return 0;
+
+fail:
+   kfree(net->ipv4.fib_table_hash);
+   return err;
 }
 
 static void __net_exit ip_fib_net_exit(struct net *net)
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/6] [IPV4]: Small style cleanup of the error path in rtm_to_ifaddr.

2008-01-31 Thread Denis V. Lunev
Remove error code assignment inside brackets on failure. The code looks better
if the error is assigned before condition check. Also, the compiler treats this
better.

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
---
 net/ipv4/devinet.c |   21 -
 1 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 21f71bf..9da4c68 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -492,39 +492,34 @@ static struct in_ifaddr *rtm_to_ifaddr(struct nlmsghdr 
*nlh)
struct ifaddrmsg *ifm;
struct net_device *dev;
struct in_device *in_dev;
-   int err = -EINVAL;
+   int err;
 
err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFA_MAX, ifa_ipv4_policy);
if (err < 0)
goto errout;
 
ifm = nlmsg_data(nlh);
-   if (ifm->ifa_prefixlen > 32 || tb[IFA_LOCAL] == NULL) {
-   err = -EINVAL;
+   err = -EINVAL;
+   if (ifm->ifa_prefixlen > 32 || tb[IFA_LOCAL] == NULL)
goto errout;
-   }
 
dev = __dev_get_by_index(_net, ifm->ifa_index);
-   if (dev == NULL) {
-   err = -ENODEV;
+   err = -ENODEV;
+   if (dev == NULL)
goto errout;
-   }
 
in_dev = __in_dev_get_rtnl(dev);
-   if (in_dev == NULL) {
-   err = -ENOBUFS;
+   err = -ENOBUFS;
+   if (in_dev == NULL)
goto errout;
-   }
 
ifa = inet_alloc_ifa();
-   if (ifa == NULL) {
+   if (ifa == NULL)
/*
 * A potential indev allocation can be left alive, it stays
 * assigned to its device and is destroy with it.
 */
-   err = -ENOBUFS;
goto errout;
-   }
 
ipv4_devconf_setall(in_dev);
in_dev_hold(in_dev);
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 3/3] [RAW]: Wrong content of the /proc/net/raw6.

2008-01-31 Thread Denis V. Lunev
The address of IPv6 raw sockets was shown in the wrong format, from IPv4 ones.
The problem has been introduced by the
commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5
Author: Pavel Emelyanov <[EMAIL PROTECTED]>
Date:   Mon Nov 19 22:38:33 2007 -0800

Thanks to Adrian Bunk who originally noticed the problem.

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
---
 include/net/raw.h |3 ++-
 net/ipv4/raw.c|8 
 net/ipv6/raw.c|2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/net/raw.h b/include/net/raw.h
index c7ea7a2..1828f81 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -48,7 +48,8 @@ struct raw_iter_state {
 void *raw_seq_start(struct seq_file *seq, loff_t *pos);
 void *raw_seq_next(struct seq_file *seq, void *v, loff_t *pos);
 void raw_seq_stop(struct seq_file *seq, void *v);
-int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h);
+int raw_seq_open(struct inode *ino, struct file *file,
+struct raw_hashinfo *h, const struct seq_operations *ops);
 
 #endif
 
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 830f19e..a3002fe 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -962,13 +962,13 @@ static const struct seq_operations raw_seq_ops = {
.show  = raw_seq_show,
 };
 
-int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h)
+int raw_seq_open(struct inode *ino, struct file *file,
+struct raw_hashinfo *h, const struct seq_operations *ops)
 {
int err;
struct raw_iter_state *i;
 
-   err = seq_open_net(ino, file, _seq_ops,
-   sizeof(struct raw_iter_state));
+   err = seq_open_net(ino, file, ops, sizeof(struct raw_iter_state));
if (err < 0)
return err;
 
@@ -980,7 +980,7 @@ EXPORT_SYMBOL_GPL(raw_seq_open);
 
 static int raw_v4_seq_open(struct inode *inode, struct file *file)
 {
-   return raw_seq_open(inode, file, _v4_hashinfo);
+   return raw_seq_open(inode, file, _v4_hashinfo, _seq_ops);
 }
 
 static const struct file_operations raw_seq_fops = {
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index a2cf499..8897ccf 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -1262,7 +1262,7 @@ static const struct seq_operations raw6_seq_ops = {
 
 static int raw6_seq_open(struct inode *inode, struct file *file)
 {
-   return raw_seq_open(inode, file, _v6_hashinfo);
+   return raw_seq_open(inode, file, _v6_hashinfo, _seq_ops);
 }
 
 static const struct file_operations raw6_seq_fops = {
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/3] [RAW]: Family check in the /proc/net/raw[6] is extra.

2008-01-31 Thread Denis V. Lunev
Different hashtables are used for IPv6 and IPv4 raw sockets, so no need to
check the socket family in the iterator over hashtables. Clean this out.

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
---
 include/net/raw.h |4 +---
 net/ipv4/raw.c|   12 
 net/ipv6/raw.c|2 +-
 3 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/include/net/raw.h b/include/net/raw.h
index cca81d8..c7ea7a2 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -41,7 +41,6 @@ extern void raw_proc_exit(void);
 struct raw_iter_state {
struct seq_net_private p;
int bucket;
-   unsigned short family;
struct raw_hashinfo *h;
 };
 
@@ -49,8 +48,7 @@ struct raw_iter_state {
 void *raw_seq_start(struct seq_file *seq, loff_t *pos);
 void *raw_seq_next(struct seq_file *seq, void *v, loff_t *pos);
 void raw_seq_stop(struct seq_file *seq, void *v);
-int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h,
-   unsigned short family);
+int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h);
 
 #endif
 
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index f863c3d..507cbfe 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -862,8 +862,7 @@ static struct sock *raw_get_first(struct seq_file *seq)
struct hlist_node *node;
 
sk_for_each(sk, node, >h->ht[state->bucket])
-   if (sk->sk_net == state->p.net &&
-   sk->sk_family == state->family)
+   if (sk->sk_net == state->p.net)
goto found;
}
sk = NULL;
@@ -879,8 +878,7 @@ static struct sock *raw_get_next(struct seq_file *seq, 
struct sock *sk)
sk = sk_next(sk);
 try_again:
;
-   } while (sk && sk->sk_net != state->p.net &&
-   sk->sk_family != state->family);
+   } while (sk && sk->sk_net != state->p.net);
 
if (!sk && ++state->bucket < RAW_HTABLE_SIZE) {
sk = sk_head(>h->ht[state->bucket]);
@@ -974,8 +972,7 @@ static const struct seq_operations raw_seq_ops = {
.show  = raw_seq_show,
 };
 
-int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h,
-   unsigned short family)
+int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h)
 {
int err;
struct raw_iter_state *i;
@@ -987,14 +984,13 @@ int raw_seq_open(struct inode *ino, struct file *file, 
struct raw_hashinfo *h,
 
i = raw_seq_private((struct seq_file *)file->private_data);
i->h = h;
-   i->family = family;
return 0;
 }
 EXPORT_SYMBOL_GPL(raw_seq_open);
 
 static int raw_v4_seq_open(struct inode *inode, struct file *file)
 {
-   return raw_seq_open(inode, file, _v4_hashinfo, PF_INET);
+   return raw_seq_open(inode, file, _v4_hashinfo);
 }
 
 static const struct file_operations raw_seq_fops = {
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index d61c63d..a2cf499 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -1262,7 +1262,7 @@ static const struct seq_operations raw6_seq_ops = {
 
 static int raw6_seq_open(struct inode *inode, struct file *file)
 {
-   return raw_seq_open(inode, file, _v6_hashinfo, PF_INET6);
+   return raw_seq_open(inode, file, _v6_hashinfo);
 }
 
 static const struct file_operations raw6_seq_fops = {
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/3] [RAW]: Cleanup IPv4 raw_seq_show.

2008-01-31 Thread Denis V. Lunev
There is no need to use 128 bytes on the stack at all. Clean the code in
the IPv6 style.

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
---
 net/ipv4/raw.c |   24 +++-
 1 files changed, 7 insertions(+), 17 deletions(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 507cbfe..830f19e 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -927,7 +927,7 @@ void raw_seq_stop(struct seq_file *seq, void *v)
 }
 EXPORT_SYMBOL_GPL(raw_seq_stop);
 
-static __inline__ char *get_raw_sock(struct sock *sp, char *tmpbuf, int i)
+static void raw_sock_seq_show(struct seq_file *seq, struct sock *sp, int i)
 {
struct inet_sock *inet = inet_sk(sp);
__be32 dest = inet->daddr,
@@ -935,33 +935,23 @@ static __inline__ char *get_raw_sock(struct sock *sp, 
char *tmpbuf, int i)
__u16 destp = 0,
  srcp  = inet->num;
 
-   sprintf(tmpbuf, "%4d: %08X:%04X %08X:%04X"
+   seq_printf(seq, "%4d: %08X:%04X %08X:%04X"
" %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %p %d",
i, src, srcp, dest, destp, sp->sk_state,
atomic_read(>sk_wmem_alloc),
atomic_read(>sk_rmem_alloc),
0, 0L, 0, sock_i_uid(sp), 0, sock_i_ino(sp),
atomic_read(>sk_refcnt), sp, atomic_read(>sk_drops));
-   return tmpbuf;
 }
 
-#define TMPSZ 128
-
 static int raw_seq_show(struct seq_file *seq, void *v)
 {
-   char tmpbuf[TMPSZ+1];
-
if (v == SEQ_START_TOKEN)
-   seq_printf(seq, "%-*s\n", TMPSZ-1,
-  "  sl  local_address rem_address   st tx_queue "
-  "rx_queue tr tm->when retrnsmt   uid  timeout "
-  "inode  drops");
-   else {
-   struct raw_iter_state *state = raw_seq_private(seq);
-
-   seq_printf(seq, "%-*s\n", TMPSZ-1,
-  get_raw_sock(v, tmpbuf, state->bucket));
-   }
+   seq_printf(seq, "  sl  local_address rem_address   st tx_queue "
+   "rx_queue tr tm->when retrnsmt   uid  timeout "
+   "inode  drops\n");
+   else
+   raw_sock_seq_show(seq, v, raw_seq_private(seq)->bucket);
return 0;
 }
 
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 0/3] [RAW]: proc output cleanups.

2008-01-31 Thread Denis V. Lunev
yesterday Adrian Bunk noticed, that the commit

commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5
Author: Pavel Emelyanov <[EMAIL PROTECTED]>
Date:   Mon Nov 19 22:38:33 2007 -0800

is somewhat strange. Basically, the commit is obviously wrong as the
content of the /proc/net/raw6 is incorrect due to it.

This series of patches fixes original problem and slightly cleanups the
code around.

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 0/3] [RAW]: proc output cleanups.

2008-01-31 Thread Denis V. Lunev
yesterday Adrian Bunk noticed, that the commit

commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5
Author: Pavel Emelyanov [EMAIL PROTECTED]
Date:   Mon Nov 19 22:38:33 2007 -0800

is somewhat strange. Basically, the commit is obviously wrong as the
content of the /proc/net/raw6 is incorrect due to it.

This series of patches fixes original problem and slightly cleanups the
code around.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/3] [RAW]: Cleanup IPv4 raw_seq_show.

2008-01-31 Thread Denis V. Lunev
There is no need to use 128 bytes on the stack at all. Clean the code in
the IPv6 style.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 net/ipv4/raw.c |   24 +++-
 1 files changed, 7 insertions(+), 17 deletions(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 507cbfe..830f19e 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -927,7 +927,7 @@ void raw_seq_stop(struct seq_file *seq, void *v)
 }
 EXPORT_SYMBOL_GPL(raw_seq_stop);
 
-static __inline__ char *get_raw_sock(struct sock *sp, char *tmpbuf, int i)
+static void raw_sock_seq_show(struct seq_file *seq, struct sock *sp, int i)
 {
struct inet_sock *inet = inet_sk(sp);
__be32 dest = inet-daddr,
@@ -935,33 +935,23 @@ static __inline__ char *get_raw_sock(struct sock *sp, 
char *tmpbuf, int i)
__u16 destp = 0,
  srcp  = inet-num;
 
-   sprintf(tmpbuf, %4d: %08X:%04X %08X:%04X
+   seq_printf(seq, %4d: %08X:%04X %08X:%04X
 %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %p %d,
i, src, srcp, dest, destp, sp-sk_state,
atomic_read(sp-sk_wmem_alloc),
atomic_read(sp-sk_rmem_alloc),
0, 0L, 0, sock_i_uid(sp), 0, sock_i_ino(sp),
atomic_read(sp-sk_refcnt), sp, atomic_read(sp-sk_drops));
-   return tmpbuf;
 }
 
-#define TMPSZ 128
-
 static int raw_seq_show(struct seq_file *seq, void *v)
 {
-   char tmpbuf[TMPSZ+1];
-
if (v == SEQ_START_TOKEN)
-   seq_printf(seq, %-*s\n, TMPSZ-1,
-sl  local_address rem_address   st tx_queue 
-  rx_queue tr tm-when retrnsmt   uid  timeout 
-  inode  drops);
-   else {
-   struct raw_iter_state *state = raw_seq_private(seq);
-
-   seq_printf(seq, %-*s\n, TMPSZ-1,
-  get_raw_sock(v, tmpbuf, state-bucket));
-   }
+   seq_printf(seq,   sl  local_address rem_address   st tx_queue 
+   rx_queue tr tm-when retrnsmt   uid  timeout 
+   inode  drops\n);
+   else
+   raw_sock_seq_show(seq, v, raw_seq_private(seq)-bucket);
return 0;
 }
 
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 3/3] [RAW]: Wrong content of the /proc/net/raw6.

2008-01-31 Thread Denis V. Lunev
The address of IPv6 raw sockets was shown in the wrong format, from IPv4 ones.
The problem has been introduced by the
commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5
Author: Pavel Emelyanov [EMAIL PROTECTED]
Date:   Mon Nov 19 22:38:33 2007 -0800

Thanks to Adrian Bunk who originally noticed the problem.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 include/net/raw.h |3 ++-
 net/ipv4/raw.c|8 
 net/ipv6/raw.c|2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/net/raw.h b/include/net/raw.h
index c7ea7a2..1828f81 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -48,7 +48,8 @@ struct raw_iter_state {
 void *raw_seq_start(struct seq_file *seq, loff_t *pos);
 void *raw_seq_next(struct seq_file *seq, void *v, loff_t *pos);
 void raw_seq_stop(struct seq_file *seq, void *v);
-int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h);
+int raw_seq_open(struct inode *ino, struct file *file,
+struct raw_hashinfo *h, const struct seq_operations *ops);
 
 #endif
 
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 830f19e..a3002fe 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -962,13 +962,13 @@ static const struct seq_operations raw_seq_ops = {
.show  = raw_seq_show,
 };
 
-int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h)
+int raw_seq_open(struct inode *ino, struct file *file,
+struct raw_hashinfo *h, const struct seq_operations *ops)
 {
int err;
struct raw_iter_state *i;
 
-   err = seq_open_net(ino, file, raw_seq_ops,
-   sizeof(struct raw_iter_state));
+   err = seq_open_net(ino, file, ops, sizeof(struct raw_iter_state));
if (err  0)
return err;
 
@@ -980,7 +980,7 @@ EXPORT_SYMBOL_GPL(raw_seq_open);
 
 static int raw_v4_seq_open(struct inode *inode, struct file *file)
 {
-   return raw_seq_open(inode, file, raw_v4_hashinfo);
+   return raw_seq_open(inode, file, raw_v4_hashinfo, raw_seq_ops);
 }
 
 static const struct file_operations raw_seq_fops = {
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index a2cf499..8897ccf 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -1262,7 +1262,7 @@ static const struct seq_operations raw6_seq_ops = {
 
 static int raw6_seq_open(struct inode *inode, struct file *file)
 {
-   return raw_seq_open(inode, file, raw_v6_hashinfo);
+   return raw_seq_open(inode, file, raw_v6_hashinfo, raw6_seq_ops);
 }
 
 static const struct file_operations raw6_seq_fops = {
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/3] [RAW]: Family check in the /proc/net/raw[6] is extra.

2008-01-31 Thread Denis V. Lunev
Different hashtables are used for IPv6 and IPv4 raw sockets, so no need to
check the socket family in the iterator over hashtables. Clean this out.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 include/net/raw.h |4 +---
 net/ipv4/raw.c|   12 
 net/ipv6/raw.c|2 +-
 3 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/include/net/raw.h b/include/net/raw.h
index cca81d8..c7ea7a2 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -41,7 +41,6 @@ extern void raw_proc_exit(void);
 struct raw_iter_state {
struct seq_net_private p;
int bucket;
-   unsigned short family;
struct raw_hashinfo *h;
 };
 
@@ -49,8 +48,7 @@ struct raw_iter_state {
 void *raw_seq_start(struct seq_file *seq, loff_t *pos);
 void *raw_seq_next(struct seq_file *seq, void *v, loff_t *pos);
 void raw_seq_stop(struct seq_file *seq, void *v);
-int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h,
-   unsigned short family);
+int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h);
 
 #endif
 
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index f863c3d..507cbfe 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -862,8 +862,7 @@ static struct sock *raw_get_first(struct seq_file *seq)
struct hlist_node *node;
 
sk_for_each(sk, node, state-h-ht[state-bucket])
-   if (sk-sk_net == state-p.net 
-   sk-sk_family == state-family)
+   if (sk-sk_net == state-p.net)
goto found;
}
sk = NULL;
@@ -879,8 +878,7 @@ static struct sock *raw_get_next(struct seq_file *seq, 
struct sock *sk)
sk = sk_next(sk);
 try_again:
;
-   } while (sk  sk-sk_net != state-p.net 
-   sk-sk_family != state-family);
+   } while (sk  sk-sk_net != state-p.net);
 
if (!sk  ++state-bucket  RAW_HTABLE_SIZE) {
sk = sk_head(state-h-ht[state-bucket]);
@@ -974,8 +972,7 @@ static const struct seq_operations raw_seq_ops = {
.show  = raw_seq_show,
 };
 
-int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h,
-   unsigned short family)
+int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h)
 {
int err;
struct raw_iter_state *i;
@@ -987,14 +984,13 @@ int raw_seq_open(struct inode *ino, struct file *file, 
struct raw_hashinfo *h,
 
i = raw_seq_private((struct seq_file *)file-private_data);
i-h = h;
-   i-family = family;
return 0;
 }
 EXPORT_SYMBOL_GPL(raw_seq_open);
 
 static int raw_v4_seq_open(struct inode *inode, struct file *file)
 {
-   return raw_seq_open(inode, file, raw_v4_hashinfo, PF_INET);
+   return raw_seq_open(inode, file, raw_v4_hashinfo);
 }
 
 static const struct file_operations raw_seq_fops = {
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index d61c63d..a2cf499 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -1262,7 +1262,7 @@ static const struct seq_operations raw6_seq_ops = {
 
 static int raw6_seq_open(struct inode *inode, struct file *file)
 {
-   return raw_seq_open(inode, file, raw_v6_hashinfo, PF_INET6);
+   return raw_seq_open(inode, file, raw_v6_hashinfo);
 }
 
 static const struct file_operations raw6_seq_fops = {
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 4/6] [IPV4]: fib_sync_down rework.

2008-01-31 Thread Denis V. Lunev
fib_sync_down can be called with an address and with a device. In reality
it is called either with address OR with a device. The codepath inside is
completely different, so lets separate it into two calls for these two
cases.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 include/net/ip_fib.h |3 +-
 net/ipv4/fib_frontend.c  |4 +-
 net/ipv4/fib_semantics.c |  104 +++--
 3 files changed, 57 insertions(+), 54 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 9daa60b..1b2f008 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -218,7 +218,8 @@ extern void fib_select_default(struct net *net, const 
struct flowi *flp,
 
 /* Exported by fib_semantics.c */
 extern int ip_fib_check_default(__be32 gw, struct net_device *dev);
-extern int fib_sync_down(__be32 local, struct net_device *dev, int force);
+extern int fib_sync_down_dev(struct net_device *dev, int force);
+extern int fib_sync_down_addr(__be32 local);
 extern int fib_sync_up(struct net_device *dev);
 extern __be32  __fib_res_prefsrc(struct fib_result *res);
 extern void fib_select_multipath(const struct flowi *flp, struct fib_result 
*res);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d0507f4..d69ffa2 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -808,7 +808,7 @@ static void fib_del_ifaddr(struct in_ifaddr *ifa)
   First of all, we scan fib_info list searching
   for stray nexthop entries, then ignite fib_flush.
*/
-   if (fib_sync_down(ifa-ifa_local, NULL, 0))
+   if (fib_sync_down_addr(ifa-ifa_local))
fib_flush(dev-nd_net);
}
}
@@ -898,7 +898,7 @@ static void nl_fib_lookup_exit(struct net *net)
 
 static void fib_disable_ip(struct net_device *dev, int force)
 {
-   if (fib_sync_down(0, dev, force))
+   if (fib_sync_down_dev(dev, force))
fib_flush(dev-nd_net);
rt_cache_flush(0);
arp_ifdown(dev);
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c791286..5beff2e 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1031,70 +1031,72 @@ nla_put_failure:
  referring to it.
- device went down - we must shutdown all nexthops going via it.
  */
-
-int fib_sync_down(__be32 local, struct net_device *dev, int force)
+int fib_sync_down_addr(__be32 local)
 {
int ret = 0;
-   int scope = RT_SCOPE_NOWHERE;
-
-   if (force)
-   scope = -1;
+   unsigned int hash = fib_laddr_hashfn(local);
+   struct hlist_head *head = fib_info_laddrhash[hash];
+   struct hlist_node *node;
+   struct fib_info *fi;
 
-   if (local  fib_info_laddrhash) {
-   unsigned int hash = fib_laddr_hashfn(local);
-   struct hlist_head *head = fib_info_laddrhash[hash];
-   struct hlist_node *node;
-   struct fib_info *fi;
+   if (fib_info_laddrhash == NULL || local == 0)
+   return 0;
 
-   hlist_for_each_entry(fi, node, head, fib_lhash) {
-   if (fi-fib_prefsrc == local) {
-   fi-fib_flags |= RTNH_F_DEAD;
-   ret++;
-   }
+   hlist_for_each_entry(fi, node, head, fib_lhash) {
+   if (fi-fib_prefsrc == local) {
+   fi-fib_flags |= RTNH_F_DEAD;
+   ret++;
}
}
+   return ret;
+}
 
-   if (dev) {
-   struct fib_info *prev_fi = NULL;
-   unsigned int hash = fib_devindex_hashfn(dev-ifindex);
-   struct hlist_head *head = fib_info_devhash[hash];
-   struct hlist_node *node;
-   struct fib_nh *nh;
+int fib_sync_down_dev(struct net_device *dev, int force)
+{
+   int ret = 0;
+   int scope = RT_SCOPE_NOWHERE;
+   struct fib_info *prev_fi = NULL;
+   unsigned int hash = fib_devindex_hashfn(dev-ifindex);
+   struct hlist_head *head = fib_info_devhash[hash];
+   struct hlist_node *node;
+   struct fib_nh *nh;
 
-   hlist_for_each_entry(nh, node, head, nh_hash) {
-   struct fib_info *fi = nh-nh_parent;
-   int dead;
+   if (force)
+   scope = -1;
 
-   BUG_ON(!fi-fib_nhs);
-   if (nh-nh_dev != dev || fi == prev_fi)
-   continue;
-   prev_fi = fi;
-   dead = 0;
-   change_nexthops(fi) {
-   if (nh-nh_flagsRTNH_F_DEAD)
-   dead++;
-   else if (nh-nh_dev == dev 
-nh-nh_scope != scope

[PATCH 3/6] [NETNS]: Process interface address manipulation routines in the namespace.

2008-01-31 Thread Denis V. Lunev
The namespace is available when required except rtm_to_ifaddr. Add
namespace argument to it.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 net/ipv4/devinet.c |   14 --
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index e55c85e..6a6e92e 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -485,7 +485,7 @@ errout:
return err;
 }
 
-static struct in_ifaddr *rtm_to_ifaddr(struct nlmsghdr *nlh)
+static struct in_ifaddr *rtm_to_ifaddr(struct net *net, struct nlmsghdr *nlh)
 {
struct nlattr *tb[IFA_MAX+1];
struct in_ifaddr *ifa;
@@ -503,7 +503,7 @@ static struct in_ifaddr *rtm_to_ifaddr(struct nlmsghdr *nlh)
if (ifm-ifa_prefixlen  32 || tb[IFA_LOCAL] == NULL)
goto errout;
 
-   dev = __dev_get_by_index(init_net, ifm-ifa_index);
+   dev = __dev_get_by_index(net, ifm-ifa_index);
err = -ENODEV;
if (dev == NULL)
goto errout;
@@ -571,7 +571,7 @@ static int inet_rtm_newaddr(struct sk_buff *skb, struct 
nlmsghdr *nlh, void *arg
if (net != init_net)
return -EINVAL;
 
-   ifa = rtm_to_ifaddr(nlh);
+   ifa = rtm_to_ifaddr(net, nlh);
if (IS_ERR(ifa))
return PTR_ERR(ifa);
 
@@ -1189,7 +1189,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct 
netlink_callback *cb)
 
s_ip_idx = ip_idx = cb-args[1];
idx = 0;
-   for_each_netdev(init_net, dev) {
+   for_each_netdev(net, dev) {
if (idx  s_idx)
goto cont;
if (idx  s_idx)
@@ -1223,7 +1223,9 @@ static void rtmsg_ifa(int event, struct in_ifaddr* ifa, 
struct nlmsghdr *nlh,
struct sk_buff *skb;
u32 seq = nlh ? nlh-nlmsg_seq : 0;
int err = -ENOBUFS;
+   struct net *net;
 
+   net = ifa-ifa_dev-dev-nd_net;
skb = nlmsg_new(inet_nlmsg_size(), GFP_KERNEL);
if (skb == NULL)
goto errout;
@@ -1235,10 +1237,10 @@ static void rtmsg_ifa(int event, struct in_ifaddr* ifa, 
struct nlmsghdr *nlh,
kfree_skb(skb);
goto errout;
}
-   err = rtnl_notify(skb, init_net, pid, RTNLGRP_IPV4_IFADDR, nlh, 
GFP_KERNEL);
+   err = rtnl_notify(skb, net, pid, RTNLGRP_IPV4_IFADDR, nlh, GFP_KERNEL);
 errout:
if (err  0)
-   rtnl_set_sk_err(init_net, RTNLGRP_IPV4_IFADDR, err);
+   rtnl_set_sk_err(net, RTNLGRP_IPV4_IFADDR, err);
 }
 
 #ifdef CONFIG_SYSCTL
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 5/6] [NETNS]: Add a namespace mark to fib_info.

2008-01-31 Thread Denis V. Lunev
This is required to make fib_info lookups namespace aware. In the other case
initial namespace devices are marked as dead in the local routing table
during other namespace stop.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 include/net/ip_fib.h |1 +
 net/ipv4/fib_semantics.c |8 
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 1b2f008..cb0df37 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -69,6 +69,7 @@ struct fib_nh {
 struct fib_info {
struct hlist_node   fib_hash;
struct hlist_node   fib_lhash;
+   struct net  *fib_net;
int fib_treeref;
atomic_tfib_clntref;
int fib_dead;
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 5beff2e..97cc494 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -687,6 +687,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
struct fib_info *fi = NULL;
struct fib_info *ofi;
int nhs = 1;
+   struct net *net = cfg-fc_nlinfo.nl_net;
 
/* Fast check to catch the most weird cases */
if (fib_props[cfg-fc_type].scope  cfg-fc_scope)
@@ -727,6 +728,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
goto failure;
fib_info_cnt++;
 
+   fi-fib_net = net;
fi-fib_protocol = cfg-fc_protocol;
fi-fib_flags = cfg-fc_flags;
fi-fib_priority = cfg-fc_priority;
@@ -798,8 +800,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
if (nhs != 1 || nh-nh_gw)
goto err_inval;
nh-nh_scope = RT_SCOPE_NOWHERE;
-   nh-nh_dev = dev_get_by_index(cfg-fc_nlinfo.nl_net,
- fi-fib_nh-nh_oif);
+   nh-nh_dev = dev_get_by_index(net, fi-fib_nh-nh_oif);
err = -ENODEV;
if (nh-nh_dev == NULL)
goto failure;
@@ -813,8 +814,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
if (fi-fib_prefsrc) {
if (cfg-fc_type != RTN_LOCAL || !cfg-fc_dst ||
fi-fib_prefsrc != cfg-fc_dst)
-   if (inet_addr_type(cfg-fc_nlinfo.nl_net,
-  fi-fib_prefsrc) != RTN_LOCAL)
+   if (inet_addr_type(net, fi-fib_prefsrc) != RTN_LOCAL)
goto err_inval;
}
 
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/6] [IPV4]: Fix memory leak on error path during FIB initialization.

2008-01-31 Thread Denis V. Lunev
net-ipv4.fib_table_hash is not freed when fib4_rules_init failed. The problem
has been introduced by the following commit.
commit c8050bf6d84785a7edd2e81591e8f833231477e8
Author: Denis V. Lunev [EMAIL PROTECTED]
Date:   Thu Jan 10 03:28:24 2008 -0800

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 net/ipv4/fib_frontend.c |   10 +-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d282618..d0507f4 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -975,6 +975,7 @@ static struct notifier_block fib_netdev_notifier = {
 
 static int __net_init ip_fib_net_init(struct net *net)
 {
+   int err;
unsigned int i;
 
net-ipv4.fib_table_hash = kzalloc(
@@ -985,7 +986,14 @@ static int __net_init ip_fib_net_init(struct net *net)
for (i = 0; i  FIB_TABLE_HASHSZ; i++)
INIT_HLIST_HEAD(net-ipv4.fib_table_hash[i]);
 
-   return fib4_rules_init(net);
+   err = fib4_rules_init(net);
+   if (err  0)
+   goto fail;
+   return 0;
+
+fail:
+   kfree(net-ipv4.fib_table_hash);
+   return err;
 }
 
 static void __net_exit ip_fib_net_exit(struct net *net)
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/6] [IPV4]: Small style cleanup of the error path in rtm_to_ifaddr.

2008-01-31 Thread Denis V. Lunev
Remove error code assignment inside brackets on failure. The code looks better
if the error is assigned before condition check. Also, the compiler treats this
better.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 net/ipv4/devinet.c |   21 -
 1 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 21f71bf..9da4c68 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -492,39 +492,34 @@ static struct in_ifaddr *rtm_to_ifaddr(struct nlmsghdr 
*nlh)
struct ifaddrmsg *ifm;
struct net_device *dev;
struct in_device *in_dev;
-   int err = -EINVAL;
+   int err;
 
err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFA_MAX, ifa_ipv4_policy);
if (err  0)
goto errout;
 
ifm = nlmsg_data(nlh);
-   if (ifm-ifa_prefixlen  32 || tb[IFA_LOCAL] == NULL) {
-   err = -EINVAL;
+   err = -EINVAL;
+   if (ifm-ifa_prefixlen  32 || tb[IFA_LOCAL] == NULL)
goto errout;
-   }
 
dev = __dev_get_by_index(init_net, ifm-ifa_index);
-   if (dev == NULL) {
-   err = -ENODEV;
+   err = -ENODEV;
+   if (dev == NULL)
goto errout;
-   }
 
in_dev = __in_dev_get_rtnl(dev);
-   if (in_dev == NULL) {
-   err = -ENOBUFS;
+   err = -ENOBUFS;
+   if (in_dev == NULL)
goto errout;
-   }
 
ifa = inet_alloc_ifa();
-   if (ifa == NULL) {
+   if (ifa == NULL)
/*
 * A potential indev allocation can be left alive, it stays
 * assigned to its device and is destroy with it.
 */
-   err = -ENOBUFS;
goto errout;
-   }
 
ipv4_devconf_setall(in_dev);
in_dev_hold(in_dev);
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 6/6] [NETNS]: Lookup in FIB semantic hashes taking into account the namespace.

2008-01-31 Thread Denis V. Lunev
The namespace is not available in the fib_sync_down_addr, add it
as a parameter.

Looking up a device by the pointer to it is OK. Looking up using a result
from fib_trie/fib_hash table lookup is also safe. No need to fix that at all.
So, just fix lookup by address and insertion to the hash table path.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 include/net/ip_fib.h |2 +-
 net/ipv4/fib_frontend.c  |2 +-
 net/ipv4/fib_semantics.c |6 +-
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index cb0df37..90d1175 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -220,7 +220,7 @@ extern void fib_select_default(struct net *net, const 
struct flowi *flp,
 /* Exported by fib_semantics.c */
 extern int ip_fib_check_default(__be32 gw, struct net_device *dev);
 extern int fib_sync_down_dev(struct net_device *dev, int force);
-extern int fib_sync_down_addr(__be32 local);
+extern int fib_sync_down_addr(struct net *net, __be32 local);
 extern int fib_sync_up(struct net_device *dev);
 extern __be32  __fib_res_prefsrc(struct fib_result *res);
 extern void fib_select_multipath(const struct flowi *flp, struct fib_result 
*res);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d69ffa2..86ff271 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -808,7 +808,7 @@ static void fib_del_ifaddr(struct in_ifaddr *ifa)
   First of all, we scan fib_info list searching
   for stray nexthop entries, then ignite fib_flush.
*/
-   if (fib_sync_down_addr(ifa-ifa_local))
+   if (fib_sync_down_addr(dev-nd_net, ifa-ifa_local))
fib_flush(dev-nd_net);
}
}
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 97cc494..a13c847 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -229,6 +229,8 @@ static struct fib_info *fib_find_info(const struct fib_info 
*nfi)
head = fib_info_hash[hash];
 
hlist_for_each_entry(fi, node, head, fib_hash) {
+   if (fi-fib_net != nfi-fib_net)
+   continue;
if (fi-fib_nhs != nfi-fib_nhs)
continue;
if (nfi-fib_protocol == fi-fib_protocol 
@@ -1031,7 +1033,7 @@ nla_put_failure:
  referring to it.
- device went down - we must shutdown all nexthops going via it.
  */
-int fib_sync_down_addr(__be32 local)
+int fib_sync_down_addr(struct net *net, __be32 local)
 {
int ret = 0;
unsigned int hash = fib_laddr_hashfn(local);
@@ -1043,6 +1045,8 @@ int fib_sync_down_addr(__be32 local)
return 0;
 
hlist_for_each_entry(fi, node, head, fib_lhash) {
+   if (fi-fib_net != net)
+   continue;
if (fi-fib_prefsrc == local) {
fi-fib_flags |= RTNH_F_DEAD;
ret++;
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [NFS]: Lock daemon start/stop rework.

2008-01-30 Thread Denis V. Lunev
Christoph Hellwig wrote:
> On Wed, Jan 30, 2008 at 02:41:34PM +0300, Denis V. Lunev wrote:
>> The pid of the locking daemon can be substituted with a task struct
>> without a problem. Namely, the value if filled in the context of the lockd
>> thread and used in lockd_up/lockd_down.
>>
>> It is possible to save task struct instead and use it to kill the process.
>> The safety of this operation is guaranteed by the RCU, i.e. task can't
>> disappear without passing a quiscent state.
> 
> We have a patch series pending on the nfs list that does this plus a lot
> more in the area.
> 
> 
where can I have to look them? :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] [NFS]: Lock daemon start/stop rework.

2008-01-30 Thread Denis V. Lunev
The pid of the locking daemon can be substituted with a task struct
without a problem. Namely, the value if filled in the context of the lockd
thread and used in lockd_up/lockd_down.

It is possible to save task struct instead and use it to kill the process.
The safety of this operation is guaranteed by the RCU, i.e. task can't
disappear without passing a quiscent state.

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
---
 fs/lockd/svc.c |   38 +-
 1 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 82e2192..4979e70 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -48,7 +48,7 @@ EXPORT_SYMBOL(nlmsvc_ops);
 
 static DEFINE_MUTEX(nlmsvc_mutex);
 static unsigned intnlmsvc_users;
-static pid_t   nlmsvc_pid;
+static struct task_struct  *nlmsvc_task;
 static struct svc_serv *nlmsvc_serv;
 intnlmsvc_grace_period;
 unsigned long  nlmsvc_timeout;
@@ -128,7 +128,8 @@ lockd(struct svc_rqst *rqstp)
/*
 * Let our maker know we're running.
 */
-   nlmsvc_pid = current->pid;
+   rcu_assign_pointer(nlmsvc_task, current);
+
nlmsvc_serv = rqstp->rq_server;
complete(_start_done);
 
@@ -151,7 +152,7 @@ lockd(struct svc_rqst *rqstp)
 * NFS mount or NFS daemon has gone away, and we've been sent a
 * signal, or else another process has taken over our job.
 */
-   while ((nlmsvc_users || !signalled()) && nlmsvc_pid == current->pid) {
+   while ((nlmsvc_users || !signalled()) && nlmsvc_task == current) {
long timeout = MAX_SCHEDULE_TIMEOUT;
char buf[RPC_MAX_ADDRBUFLEN];
 
@@ -200,12 +201,12 @@ lockd(struct svc_rqst *rqstp)
 * Check whether there's a new lockd process before
 * shutting down the hosts and clearing the slot.
 */
-   if (!nlmsvc_pid || current->pid == nlmsvc_pid) {
+   if (nlmsvc_task == NULL || current == nlmsvc_task) {
if (nlmsvc_ops)
nlmsvc_invalidate_all();
nlm_shutdown_hosts();
-   nlmsvc_pid = 0;
nlmsvc_serv = NULL;
+   rcu_assign_pointer(nlmsvc_task, NULL);
} else
printk(KERN_DEBUG
"lockd: new process, skipping host shutdown\n");
@@ -273,7 +274,7 @@ lockd_up(int proto) /* Maybe add a 'family' option when 
IPv6 is supported ?? */
/*
 * Check whether we're already up and running.
 */
-   if (nlmsvc_pid) {
+   if (nlmsvc_task != NULL) {
if (proto)
error = make_socks(nlmsvc_serv, proto);
goto out;
@@ -329,38 +330,49 @@ void
 lockd_down(void)
 {
static int warned;
+   struct task_struct *tsk;
 
mutex_lock(_mutex);
+   rcu_read_lock();
+   tsk = rcu_dereference(nlmsvc_task);
if (nlmsvc_users) {
if (--nlmsvc_users)
-   goto out;
+   goto out_rcu_unlock;
} else
-   printk(KERN_WARNING "lockd_down: no users! pid=%d\n", 
nlmsvc_pid);
+   printk(KERN_WARNING "lockd_down: no users! pid=%d\n",
+  task_pid_nr(tsk));
 
-   if (!nlmsvc_pid) {
+   if (tsk == NULL) {
if (warned++ == 0)
printk(KERN_WARNING "lockd_down: no lockd running.\n"); 
-   goto out;
+   goto out_rcu_unlock;
}
warned = 0;
 
-   kill_proc(nlmsvc_pid, SIGKILL, 1);
+   send_sig(SIGKILL, tsk, 1);
+   rcu_read_unlock();
+
/*
 * Wait for the lockd process to exit, but since we're holding
 * the lockd semaphore, we can't wait around forever ...
 */
clear_thread_flag(TIF_SIGPENDING);
interruptible_sleep_on_timeout(_exit, HZ);
-   if (nlmsvc_pid) {
+   if (nlmsvc_task != NULL) {
printk(KERN_WARNING 
"lockd_down: lockd failed to exit, clearing pid\n");
-   nlmsvc_pid = 0;
+   rcu_assign_pointer(nlmsvc_task, NULL);
}
spin_lock_irq(>sighand->siglock);
recalc_sigpending();
spin_unlock_irq(>sighand->siglock);
 out:
mutex_unlock(_mutex);
+   return;
+
+out_rcu_unlock:
+   rcu_read_unlock();
+   goto out;
 }
 EXPORT_SYMBOL(lockd_down);
 
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] [NFS]: Lock daemon start/stop rework.

2008-01-30 Thread Denis V. Lunev
The pid of the locking daemon can be substituted with a task struct
without a problem. Namely, the value if filled in the context of the lockd
thread and used in lockd_up/lockd_down.

It is possible to save task struct instead and use it to kill the process.
The safety of this operation is guaranteed by the RCU, i.e. task can't
disappear without passing a quiscent state.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 fs/lockd/svc.c |   38 +-
 1 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 82e2192..4979e70 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -48,7 +48,7 @@ EXPORT_SYMBOL(nlmsvc_ops);
 
 static DEFINE_MUTEX(nlmsvc_mutex);
 static unsigned intnlmsvc_users;
-static pid_t   nlmsvc_pid;
+static struct task_struct  *nlmsvc_task;
 static struct svc_serv *nlmsvc_serv;
 intnlmsvc_grace_period;
 unsigned long  nlmsvc_timeout;
@@ -128,7 +128,8 @@ lockd(struct svc_rqst *rqstp)
/*
 * Let our maker know we're running.
 */
-   nlmsvc_pid = current-pid;
+   rcu_assign_pointer(nlmsvc_task, current);
+
nlmsvc_serv = rqstp-rq_server;
complete(lockd_start_done);
 
@@ -151,7 +152,7 @@ lockd(struct svc_rqst *rqstp)
 * NFS mount or NFS daemon has gone away, and we've been sent a
 * signal, or else another process has taken over our job.
 */
-   while ((nlmsvc_users || !signalled())  nlmsvc_pid == current-pid) {
+   while ((nlmsvc_users || !signalled())  nlmsvc_task == current) {
long timeout = MAX_SCHEDULE_TIMEOUT;
char buf[RPC_MAX_ADDRBUFLEN];
 
@@ -200,12 +201,12 @@ lockd(struct svc_rqst *rqstp)
 * Check whether there's a new lockd process before
 * shutting down the hosts and clearing the slot.
 */
-   if (!nlmsvc_pid || current-pid == nlmsvc_pid) {
+   if (nlmsvc_task == NULL || current == nlmsvc_task) {
if (nlmsvc_ops)
nlmsvc_invalidate_all();
nlm_shutdown_hosts();
-   nlmsvc_pid = 0;
nlmsvc_serv = NULL;
+   rcu_assign_pointer(nlmsvc_task, NULL);
} else
printk(KERN_DEBUG
lockd: new process, skipping host shutdown\n);
@@ -273,7 +274,7 @@ lockd_up(int proto) /* Maybe add a 'family' option when 
IPv6 is supported ?? */
/*
 * Check whether we're already up and running.
 */
-   if (nlmsvc_pid) {
+   if (nlmsvc_task != NULL) {
if (proto)
error = make_socks(nlmsvc_serv, proto);
goto out;
@@ -329,38 +330,49 @@ void
 lockd_down(void)
 {
static int warned;
+   struct task_struct *tsk;
 
mutex_lock(nlmsvc_mutex);
+   rcu_read_lock();
+   tsk = rcu_dereference(nlmsvc_task);
if (nlmsvc_users) {
if (--nlmsvc_users)
-   goto out;
+   goto out_rcu_unlock;
} else
-   printk(KERN_WARNING lockd_down: no users! pid=%d\n, 
nlmsvc_pid);
+   printk(KERN_WARNING lockd_down: no users! pid=%d\n,
+  task_pid_nr(tsk));
 
-   if (!nlmsvc_pid) {
+   if (tsk == NULL) {
if (warned++ == 0)
printk(KERN_WARNING lockd_down: no lockd running.\n); 
-   goto out;
+   goto out_rcu_unlock;
}
warned = 0;
 
-   kill_proc(nlmsvc_pid, SIGKILL, 1);
+   send_sig(SIGKILL, tsk, 1);
+   rcu_read_unlock();
+
/*
 * Wait for the lockd process to exit, but since we're holding
 * the lockd semaphore, we can't wait around forever ...
 */
clear_thread_flag(TIF_SIGPENDING);
interruptible_sleep_on_timeout(lockd_exit, HZ);
-   if (nlmsvc_pid) {
+   if (nlmsvc_task != NULL) {
printk(KERN_WARNING 
lockd_down: lockd failed to exit, clearing pid\n);
-   nlmsvc_pid = 0;
+   rcu_assign_pointer(nlmsvc_task, NULL);
}
spin_lock_irq(current-sighand-siglock);
recalc_sigpending();
spin_unlock_irq(current-sighand-siglock);
 out:
mutex_unlock(nlmsvc_mutex);
+   return;
+
+out_rcu_unlock:
+   rcu_read_unlock();
+   goto out;
 }
 EXPORT_SYMBOL(lockd_down);
 
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Debugfs compile fix.

2008-01-29 Thread Denis V. Lunev
Debugfs is not compiled without CONFIG_SYSFS in net-2.6 tree. Move
kobject_create_and_add under appropriate ifdef. The fix looks correct
from a first glance, but may be the dependency should be added into
the Kconfig.

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
---
 fs/debugfs/inode.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index d26e282..61cc937 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -432,9 +432,11 @@ static int __init debugfs_init(void)
 {
int retval;
 
+#ifdef CONFIG_SYSFS
debug_kobj = kobject_create_and_add("debug", kernel_kobj);
if (!debug_kobj)
return -EINVAL;
+#endif
 
retval = register_filesystem(_fs_type);
if (retval)
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Debugfs compile fix.

2008-01-29 Thread Denis V. Lunev
Debugfs is not compiled without CONFIG_SYSFS in net-2.6 tree. Move
kobject_create_and_add under appropriate ifdef. The fix looks correct
from a first glance, but may be the dependency should be added into
the Kconfig.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 fs/debugfs/inode.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index d26e282..61cc937 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -432,9 +432,11 @@ static int __init debugfs_init(void)
 {
int retval;
 
+#ifdef CONFIG_SYSFS
debug_kobj = kobject_create_and_add(debug, kernel_kobj);
if (!debug_kobj)
return -EINVAL;
+#endif
 
retval = register_filesystem(debug_fs_type);
if (retval)
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.24-rc4] proc: Remove/Fix proc generic d_revalidate

2007-12-10 Thread Denis V. Lunev
could you, plz, check patch sent by Eric above in this thread.

I have tried it on my test node and it works for module you have
provided. The problem exists without it.

Regards,
Den

Petr Vandrovec wrote:
> Eric W. Biederman wrote:
>> Ultimately to implement /proc perfectly we need an implementation
>> of d_revalidate because files and directories can be removed behind
>> the back of the VFS, and d_revalidate is the only way we can let
>> the VFS know that this has happened.
>>
>> So until we get a proper test for keeping dentries in the dcache
>> fix the current d_revalidate method by completely removing it.  This
>> returns us to the current status quo.
> 
> Hello,
>I know that I'm late to the party, but mount points is not only
> problem with d_revalidate.  With your patch in place module below gets
> refcount incremented by two every time I do 'ls -la /proc/fs/vmblock'.
> 
> 
> #include 
> #include 
> #include 
> 
> static int vmblockinit(void) {
>struct proc_dir_entry *controlProcDirEntry;
> 
>/* Create /proc/fs/vmblock */
>controlProcDirEntry = proc_mkdir("vmblock", proc_root_fs);
>if (!controlProcDirEntry) {
>   printk(KERN_DEBUG "Bad...\n");
>   return -EINVAL;
>}
>controlProcDirEntry->owner = THIS_MODULE;
>return 0;
> }
> 
> static void vmblockexit(void) {
>remove_proc_entry("vmblock", proc_root_fs);
> }
> 
> module_init(vmblockinit);
> module_exit(vmblockexit);
> 
> 
> (code comes from VMware's vmblock module,
> http://sourceforge.net/project/showfiles.php?group_id=204462)
> Thanks,
> Petr
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.24-rc4] proc: Remove/Fix proc generic d_revalidate

2007-12-10 Thread Denis V. Lunev
could you, plz, check patch sent by Eric above in this thread.

I have tried it on my test node and it works for module you have
provided. The problem exists without it.

Regards,
Den

Petr Vandrovec wrote:
 Eric W. Biederman wrote:
 Ultimately to implement /proc perfectly we need an implementation
 of d_revalidate because files and directories can be removed behind
 the back of the VFS, and d_revalidate is the only way we can let
 the VFS know that this has happened.

 So until we get a proper test for keeping dentries in the dcache
 fix the current d_revalidate method by completely removing it.  This
 returns us to the current status quo.
 
 Hello,
I know that I'm late to the party, but mount points is not only
 problem with d_revalidate.  With your patch in place module below gets
 refcount incremented by two every time I do 'ls -la /proc/fs/vmblock'.
 
 
 #include linux/kernel.h
 #include linux/module.h
 #include linux/proc_fs.h
 
 static int vmblockinit(void) {
struct proc_dir_entry *controlProcDirEntry;
 
/* Create /proc/fs/vmblock */
controlProcDirEntry = proc_mkdir(vmblock, proc_root_fs);
if (!controlProcDirEntry) {
   printk(KERN_DEBUG Bad...\n);
   return -EINVAL;
}
controlProcDirEntry-owner = THIS_MODULE;
return 0;
 }
 
 static void vmblockexit(void) {
remove_proc_entry(vmblock, proc_root_fs);
 }
 
 module_init(vmblockinit);
 module_exit(vmblockexit);
 
 
 (code comes from VMware's vmblock module,
 http://sourceforge.net/project/showfiles.php?group_id=204462)
 Thanks,
 Petr
 
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.24-rc3] Fix /proc/net breakage

2007-12-07 Thread Denis V. Lunev
Andrew Morton wrote:
> On Fri, 07 Dec 2007 04:51:37 + David Woodhouse <[EMAIL PROTECTED]> wrote:
> 
>> On Mon, 2007-11-26 at 15:17 -0700, Eric W. Biederman wrote:
>>> Well I clearly goofed when I added the initial network namespace support
>>> for /proc/net.  Currently things work but there are odd details visible
>>> to user space, even when we have a single network namespace.
>>>
>>> Since we do not cache proc_dir_entry dentries at the moment we can
>>> just modify ->lookup to return a different directory inode depending
>>> on the network namespace of the process looking at /proc/net, replacing
>>> the current technique of using a magic and fragile follow_link method.
>>>
>>> To accomplish that this patch:
>>> - introduces a shadow_proc method to allow different dentries to
>>>   be returned from proc_lookup.
>>> - Removes the old /proc/net follow_link magic
>>> - Fixes a weakness in our not caching of proc generic dentries.
>>>
>>> As shadow_proc uses a task struct to decided which dentry to return we
>>> can go back later and fix the proc generic caching without modifying any 
>>> code that
>>> uses the shadow_proc method.
>>>
>>> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
>>> ---
>>>  fs/proc/generic.c   |   12 ++-
>>>  fs/proc/proc_net.c  |   86 
>>> +++
>>>  include/linux/proc_fs.h |3 ++
>>>  3 files changed, 19 insertions(+), 82 deletions(-)
>> (commit 2b1e300a9dfc3196ccddf6f1d74b91b7af55e416)
>>
>> This seems to have broken the use of /proc/bus/usb as a mountpoint. It
>> always appears empty now, whatever's supposed to be mounted there.
>>
> 
> Yes.  Denis and Eric are tossing around competing patches but afaik nobody
> is happy with any of them.  Guys, could we get this sorted soonish please?
> 

Andrew, I become too relaxed after receiving
"Tested-by: Giacomo Catenazzi <[EMAIL PROTECTED]>"

Eric, I believe that reverting an original behavior is better than your
new one as
- you introduce search into the depth by calling have_submounts(dentry)
during revalidation for all(!) /proc dentries
- your shadowing behavior will be broken if you'll mount something in
the depth of shadowed tree (this can be done as a DoS attempt)

As a last minute call, may be it will be better to pin network namespace
like a pid namespace during mount to avoid this crap at all?

Regards,
Den
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.24-rc3] Fix /proc/net breakage

2007-12-07 Thread Denis V. Lunev
Andrew Morton wrote:
 On Fri, 07 Dec 2007 04:51:37 + David Woodhouse [EMAIL PROTECTED] wrote:
 
 On Mon, 2007-11-26 at 15:17 -0700, Eric W. Biederman wrote:
 Well I clearly goofed when I added the initial network namespace support
 for /proc/net.  Currently things work but there are odd details visible
 to user space, even when we have a single network namespace.

 Since we do not cache proc_dir_entry dentries at the moment we can
 just modify -lookup to return a different directory inode depending
 on the network namespace of the process looking at /proc/net, replacing
 the current technique of using a magic and fragile follow_link method.

 To accomplish that this patch:
 - introduces a shadow_proc method to allow different dentries to
   be returned from proc_lookup.
 - Removes the old /proc/net follow_link magic
 - Fixes a weakness in our not caching of proc generic dentries.

 As shadow_proc uses a task struct to decided which dentry to return we
 can go back later and fix the proc generic caching without modifying any 
 code that
 uses the shadow_proc method.

 Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
 ---
  fs/proc/generic.c   |   12 ++-
  fs/proc/proc_net.c  |   86 
 +++
  include/linux/proc_fs.h |3 ++
  3 files changed, 19 insertions(+), 82 deletions(-)
 (commit 2b1e300a9dfc3196ccddf6f1d74b91b7af55e416)

 This seems to have broken the use of /proc/bus/usb as a mountpoint. It
 always appears empty now, whatever's supposed to be mounted there.

 
 Yes.  Denis and Eric are tossing around competing patches but afaik nobody
 is happy with any of them.  Guys, could we get this sorted soonish please?
 

Andrew, I become too relaxed after receiving
Tested-by: Giacomo Catenazzi [EMAIL PROTECTED]

Eric, I believe that reverting an original behavior is better than your
new one as
- you introduce search into the depth by calling have_submounts(dentry)
during revalidation for all(!) /proc dentries
- your shadowing behavior will be broken if you'll mount something in
the depth of shadowed tree (this can be done as a DoS attempt)

As a last minute call, may be it will be better to pin network namespace
like a pid namespace during mount to avoid this crap at all?

Regards,
Den
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] proc: Do not invalidate dentries with submounts

2007-12-06 Thread Denis V. Lunev
you have changed the behavior of revalidation by shadows. I think it
will be better to restore it and keep new one for shadows (and below)
only, which has been done by my yesterday patch.

Regards,
Den

Eric W. Biederman wrote:
> If the dcache path to a mount point is ever broken it becomes
> impossible to unmount it, and we leak a vfsmount.  Therefore it is not
> valid to invalidate dentries with mount points at or below them.
> 
> This patch uses the have_submounts test as the other network
> filesystem revalidate routines do.
> 
> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
> ---
>  fs/proc/base.c|9 +
>  fs/proc/generic.c |5 +
>  2 files changed, 14 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 0e71707..552d752 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1216,6 +1216,9 @@ static int pid_revalidate(struct dentry *dentry, struct 
> nameidata *nd)
>   put_task_struct(task);
>   return 1;
>   }
> + /* Force validity if something is mounted under us */
> + if (inode && S_ISDIR(inode->i_mode) && have_submounts(dentry))
> + return 1;
>   d_drop(dentry);
>   return 0;
>  }
> @@ -1393,6 +1396,9 @@ static int tid_fd_revalidate(struct dentry *dentry, 
> struct nameidata *nd)
>   }
>   put_task_struct(task);
>   }
> + /* Force validity if something is mounted under us */
> + if (inode && S_ISDIR(inode->i_mode) && have_submounts(dentry))
> + return 1;
>   d_drop(dentry);
>   return 0;
>  }
> @@ -2056,6 +2062,9 @@ static int proc_base_revalidate(struct dentry *dentry, 
> struct nameidata *nd)
>   put_task_struct(task);
>   return 1;
>   }
> + /* Force validity if something is mounted under us */
> + if (inode && S_ISDIR(inode->i_mode) && have_submounts(dentry))
> + return 1;
>   d_drop(dentry);
>   return 0;
>  }
> diff --git a/fs/proc/generic.c b/fs/proc/generic.c
> index 4abd568..233dcdc 100644
> --- a/fs/proc/generic.c
> +++ b/fs/proc/generic.c
> @@ -370,6 +370,11 @@ static int proc_delete_dentry(struct dentry * dentry)
>  
>  static int proc_revalidate_dentry(struct dentry *dentry, struct nameidata 
> *nd)
>  {
> + struct inode *inode = dentry->d_inode;
> +
> + /* Force validity if something is mounted under us */
> + if (inode && S_ISDIR(inode->i_mode) && have_submounts(dentry))
> + return 1;
>   d_drop(dentry);
>   return 0;
>  }

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] proc: Do not invalidate dentries with submounts

2007-12-06 Thread Denis V. Lunev
you have changed the behavior of revalidation by shadows. I think it
will be better to restore it and keep new one for shadows (and below)
only, which has been done by my yesterday patch.

Regards,
Den

Eric W. Biederman wrote:
 If the dcache path to a mount point is ever broken it becomes
 impossible to unmount it, and we leak a vfsmount.  Therefore it is not
 valid to invalidate dentries with mount points at or below them.
 
 This patch uses the have_submounts test as the other network
 filesystem revalidate routines do.
 
 Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
 ---
  fs/proc/base.c|9 +
  fs/proc/generic.c |5 +
  2 files changed, 14 insertions(+), 0 deletions(-)
 
 diff --git a/fs/proc/base.c b/fs/proc/base.c
 index 0e71707..552d752 100644
 --- a/fs/proc/base.c
 +++ b/fs/proc/base.c
 @@ -1216,6 +1216,9 @@ static int pid_revalidate(struct dentry *dentry, struct 
 nameidata *nd)
   put_task_struct(task);
   return 1;
   }
 + /* Force validity if something is mounted under us */
 + if (inode  S_ISDIR(inode-i_mode)  have_submounts(dentry))
 + return 1;
   d_drop(dentry);
   return 0;
  }
 @@ -1393,6 +1396,9 @@ static int tid_fd_revalidate(struct dentry *dentry, 
 struct nameidata *nd)
   }
   put_task_struct(task);
   }
 + /* Force validity if something is mounted under us */
 + if (inode  S_ISDIR(inode-i_mode)  have_submounts(dentry))
 + return 1;
   d_drop(dentry);
   return 0;
  }
 @@ -2056,6 +2062,9 @@ static int proc_base_revalidate(struct dentry *dentry, 
 struct nameidata *nd)
   put_task_struct(task);
   return 1;
   }
 + /* Force validity if something is mounted under us */
 + if (inode  S_ISDIR(inode-i_mode)  have_submounts(dentry))
 + return 1;
   d_drop(dentry);
   return 0;
  }
 diff --git a/fs/proc/generic.c b/fs/proc/generic.c
 index 4abd568..233dcdc 100644
 --- a/fs/proc/generic.c
 +++ b/fs/proc/generic.c
 @@ -370,6 +370,11 @@ static int proc_delete_dentry(struct dentry * dentry)
  
  static int proc_revalidate_dentry(struct dentry *dentry, struct nameidata 
 *nd)
  {
 + struct inode *inode = dentry-d_inode;
 +
 + /* Force validity if something is mounted under us */
 + if (inode  S_ISDIR(inode-i_mode)  have_submounts(dentry))
 + return 1;
   d_drop(dentry);
   return 0;
  }

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] lost content of /proc/sys/fs/binfmt_misc

2007-12-05 Thread Denis V. Lunev
/proc/sys/fs/binfmt_misc dentry disappeared during d_revalidate.
d_revalidate only dentries from shadowed one and below.
http://bugzilla.kernel.org/show_bug.cgi?id=9504

CC: Eric W. Biederman <[EMAIL PROTECTED]>
CC: Marcus Better <[EMAIL PROTECTED]>
Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 5fccfe2..1497ac4 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -380,12 +380,17 @@ static int proc_revalidate_dentry(struct dentry *dentry, 
struct nameidata *nd)
return 0;
 }
 
-static struct dentry_operations proc_dentry_operations =
+static struct dentry_operations proc_dentry_shadow_operations =
 {
.d_delete   = proc_delete_dentry,
.d_revalidate   = proc_revalidate_dentry,
 };
 
+static struct dentry_operations proc_dentry_operations =
+{
+   .d_delete   = proc_delete_dentry,
+};
+
 /*
  * Don't create negative dentries here, return -ENOENT by hand
  * instead.
@@ -394,6 +399,7 @@ struct dentry *proc_lookup(struct inode * dir, struct 
dentry *dentry, struct nam
 {
struct inode *inode = NULL;
struct proc_dir_entry * de;
+   int use_shadow = 0;
int error = -ENOENT;
 
lock_kernel();
@@ -406,8 +412,10 @@ struct dentry *proc_lookup(struct inode * dir, struct 
dentry *dentry, struct nam
if (!memcmp(dentry->d_name.name, de->name, 
de->namelen)) {
unsigned int ino;
 
-   if (de->shadow_proc)
+   if (de->shadow_proc) {
de = de->shadow_proc(current, de);
+   use_shadow = 1;
+   }
ino = de->low_ino;
de_get(de);
spin_unlock(_subdir_lock);
@@ -423,6 +431,8 @@ struct dentry *proc_lookup(struct inode * dir, struct 
dentry *dentry, struct nam
 
if (inode) {
dentry->d_op = _dentry_operations;
+   dentry->d_op = use_shadow ?
+   _dentry_shadow_operations : dentry->d_parent->d_op;
d_add(dentry, inode);
return NULL;
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] lost content of /proc/sys/fs/binfmt_misc

2007-12-05 Thread Denis V. Lunev
/proc/sys/fs/binfmt_misc dentry disappeared during d_revalidate.
d_revalidate only dentries from shadowed one and below.
http://bugzilla.kernel.org/show_bug.cgi?id=9504

CC: Eric W. Biederman [EMAIL PROTECTED]
CC: Marcus Better [EMAIL PROTECTED]
Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 5fccfe2..1497ac4 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -380,12 +380,17 @@ static int proc_revalidate_dentry(struct dentry *dentry, 
struct nameidata *nd)
return 0;
 }
 
-static struct dentry_operations proc_dentry_operations =
+static struct dentry_operations proc_dentry_shadow_operations =
 {
.d_delete   = proc_delete_dentry,
.d_revalidate   = proc_revalidate_dentry,
 };
 
+static struct dentry_operations proc_dentry_operations =
+{
+   .d_delete   = proc_delete_dentry,
+};
+
 /*
  * Don't create negative dentries here, return -ENOENT by hand
  * instead.
@@ -394,6 +399,7 @@ struct dentry *proc_lookup(struct inode * dir, struct 
dentry *dentry, struct nam
 {
struct inode *inode = NULL;
struct proc_dir_entry * de;
+   int use_shadow = 0;
int error = -ENOENT;
 
lock_kernel();
@@ -406,8 +412,10 @@ struct dentry *proc_lookup(struct inode * dir, struct 
dentry *dentry, struct nam
if (!memcmp(dentry-d_name.name, de-name, 
de-namelen)) {
unsigned int ino;
 
-   if (de-shadow_proc)
+   if (de-shadow_proc) {
de = de-shadow_proc(current, de);
+   use_shadow = 1;
+   }
ino = de-low_ino;
de_get(de);
spin_unlock(proc_subdir_lock);
@@ -423,6 +431,8 @@ struct dentry *proc_lookup(struct inode * dir, struct 
dentry *dentry, struct nam
 
if (inode) {
dentry-d_op = proc_dentry_operations;
+   dentry-d_op = use_shadow ?
+   proc_dentry_shadow_operations : dentry-d_parent-d_op;
d_add(dentry, inode);
return NULL;
}
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] AB-BA deadlock in drop_caches sysctl (resend, the one sent was for 2.6.18)

2007-12-04 Thread Denis V. Lunev
Andrew Morton wrote:
> On Mon, 3 Dec 2007 16:52:47 +0300
> "Denis V. Lunev" <[EMAIL PROTECTED]> wrote:
> 
>> There is a AB-BA deadlock regarding drop_caches sysctl. Here are the code
>> paths:
>>
>> drop_pagecache
>>   spin_lock(_lock);
>>   invalidate_mapping_pages
>> try_to_release_page
>>   ext3_releasepage
>> journal_try_to_free_buffers
>>   __journal_try_to_free_buffer
>>  spin_lock(>j_list_lock);
>>
>> __journal_temp_unlink_buffer (called under journal->j_list_lock by comments)
>>   mark_buffer_dirty
>> __set_page_dirty
>>   __mark_inode_dirty
>> spin_lock(_lock);
>>
>> The patch tries to address the issue - it drops inode_lock before digging 
>> into
>> invalidate_inode_pages. This seems sane as inode hold should not gone from 
>> the
>> list and should not change its place.
>>
>> Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
>> --
>> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
>> index 59375ef..4ac80d8 100644
>> --- a/fs/drop_caches.c
>> +++ b/fs/drop_caches.c
>> @@ -14,15 +14,27 @@ int sysctl_drop_caches;
>>  
>>  static void drop_pagecache_sb(struct super_block *sb)
>>  {
>> -struct inode *inode;
>> +struct inode *inode, *old;
>>  
>> +old = NULL;
>>  spin_lock(_lock);
>>  list_for_each_entry(inode, >s_inodes, i_sb_list) {
>>  if (inode->i_state & (I_FREEING|I_WILL_FREE))
>>  continue;
>> -__invalidate_mapping_pages(inode->i_mapping, 0, -1, true);
>> +__iget(inode);
>> +spin_unlock(_lock);
>> +
>> +if (old != NULL)
>> +iput(old);
>> +invalidate_mapping_pages(inode->i_mapping, 0, -1);
>> +old = inode;
>> +
>> +spin_lock(_lock);
>>  }
>>  spin_unlock(_lock);
>> +
>> +if (old != NULL)
>> +iput(old);
>>  }
> 
> We need to hold onto inode_lock while walking sb->s_inodes.  Otherwise the
> inode which we're currently looking at could get removed from i_sb_list and
> bad things will happen (drop_pagecache_sb will go infinite, or will oops, I
> guess).

as far as I understand, there are the following place removing inode
from i_sb_list:
- generic_delete_inode (via iput_final)
- generic_forget_inode (via iput_final)
- hugetlbfs_forget_inode
- dispose_list after the check under inode_lock for i_count

So, the patch is sane from disappearing point of view:
- I hold inode under inode_lock
- and iput it after new inode to clean has been found and hold

Nevertheless we'll think a bit about ext3 fix. Though other staff like
gfs2 etc can also be affected.

Regards,
Den
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] AB-BA deadlock in drop_caches sysctl (resend, the one sent was for 2.6.18)

2007-12-04 Thread Denis V. Lunev
Andrew Morton wrote:
 On Mon, 3 Dec 2007 16:52:47 +0300
 Denis V. Lunev [EMAIL PROTECTED] wrote:
 
 There is a AB-BA deadlock regarding drop_caches sysctl. Here are the code
 paths:

 drop_pagecache
   spin_lock(inode_lock);
   invalidate_mapping_pages
 try_to_release_page
   ext3_releasepage
 journal_try_to_free_buffers
   __journal_try_to_free_buffer
  spin_lock(journal-j_list_lock);

 __journal_temp_unlink_buffer (called under journal-j_list_lock by comments)
   mark_buffer_dirty
 __set_page_dirty
   __mark_inode_dirty
 spin_lock(inode_lock);

 The patch tries to address the issue - it drops inode_lock before digging 
 into
 invalidate_inode_pages. This seems sane as inode hold should not gone from 
 the
 list and should not change its place.

 Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
 --
 diff --git a/fs/drop_caches.c b/fs/drop_caches.c
 index 59375ef..4ac80d8 100644
 --- a/fs/drop_caches.c
 +++ b/fs/drop_caches.c
 @@ -14,15 +14,27 @@ int sysctl_drop_caches;
  
  static void drop_pagecache_sb(struct super_block *sb)
  {
 -struct inode *inode;
 +struct inode *inode, *old;
  
 +old = NULL;
  spin_lock(inode_lock);
  list_for_each_entry(inode, sb-s_inodes, i_sb_list) {
  if (inode-i_state  (I_FREEING|I_WILL_FREE))
  continue;
 -__invalidate_mapping_pages(inode-i_mapping, 0, -1, true);
 +__iget(inode);
 +spin_unlock(inode_lock);
 +
 +if (old != NULL)
 +iput(old);
 +invalidate_mapping_pages(inode-i_mapping, 0, -1);
 +old = inode;
 +
 +spin_lock(inode_lock);
  }
  spin_unlock(inode_lock);
 +
 +if (old != NULL)
 +iput(old);
  }
 
 We need to hold onto inode_lock while walking sb-s_inodes.  Otherwise the
 inode which we're currently looking at could get removed from i_sb_list and
 bad things will happen (drop_pagecache_sb will go infinite, or will oops, I
 guess).

as far as I understand, there are the following place removing inode
from i_sb_list:
- generic_delete_inode (via iput_final)
- generic_forget_inode (via iput_final)
- hugetlbfs_forget_inode
- dispose_list after the check under inode_lock for i_count

So, the patch is sane from disappearing point of view:
- I hold inode under inode_lock
- and iput it after new inode to clean has been found and hold

Nevertheless we'll think a bit about ext3 fix. Though other staff like
gfs2 etc can also be affected.

Regards,
Den
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] AB-BA deadlock in drop_caches sysctl (resend, the one sent was for 2.6.18)

2007-12-03 Thread Denis V. Lunev
There is a AB-BA deadlock regarding drop_caches sysctl. Here are the code
paths:

drop_pagecache
  spin_lock(_lock);
  invalidate_mapping_pages
try_to_release_page
  ext3_releasepage
journal_try_to_free_buffers
  __journal_try_to_free_buffer
spin_lock(>j_list_lock);

__journal_temp_unlink_buffer (called under journal->j_list_lock by comments)
  mark_buffer_dirty
__set_page_dirty
  __mark_inode_dirty
spin_lock(_lock);

The patch tries to address the issue - it drops inode_lock before digging into
invalidate_inode_pages. This seems sane as inode hold should not gone from the
list and should not change its place.

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
--
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 59375ef..4ac80d8 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -14,15 +14,27 @@ int sysctl_drop_caches;
 
 static void drop_pagecache_sb(struct super_block *sb)
 {
-   struct inode *inode;
+   struct inode *inode, *old;
 
+   old = NULL;
spin_lock(_lock);
list_for_each_entry(inode, >s_inodes, i_sb_list) {
if (inode->i_state & (I_FREEING|I_WILL_FREE))
continue;
-   __invalidate_mapping_pages(inode->i_mapping, 0, -1, true);
+   __iget(inode);
+   spin_unlock(_lock);
+
+   if (old != NULL)
+   iput(old);
+   invalidate_mapping_pages(inode->i_mapping, 0, -1);
+   old = inode;
+
+   spin_lock(_lock);
}
spin_unlock(_lock);
+
+   if (old != NULL)
+   iput(old);
 }
 
 void drop_pagecache(void)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3ec4a4..b424dff 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1649,9 +1649,6 @@ extern int __invalidate_device(struct block_device *);
 extern int invalidate_partition(struct gendisk *, int);
 #endif
 extern int invalidate_inodes(struct super_block *);
-unsigned long __invalidate_mapping_pages(struct address_space *mapping,
-   pgoff_t start, pgoff_t end,
-   bool be_atomic);
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
pgoff_t start, pgoff_t end);
 
diff --git a/mm/truncate.c b/mm/truncate.c
index cadc156..5cbe1e7 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -266,8 +266,21 @@ void truncate_inode_pages(struct address_space *mapping, 
loff_t lstart)
 }
 EXPORT_SYMBOL(truncate_inode_pages);
 
-unsigned long __invalidate_mapping_pages(struct address_space *mapping,
-   pgoff_t start, pgoff_t end, bool be_atomic)
+/**
+ * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
+ * @mapping: the address_space which holds the pages to invalidate
+ * @start: the offset 'from' which to invalidate
+ * @end: the offset 'to' which to invalidate (inclusive)
+ *
+ * This function only removes the unlocked pages, if you want to
+ * remove all the pages of one inode, you must call truncate_inode_pages.
+ *
+ * invalidate_mapping_pages() will not block on IO activity. It will not
+ * invalidate pages which are dirty, locked, under writeback or mapped into
+ * pagetables.
+ */
+unsigned long invalidate_mapping_pages(struct address_space *mapping,
+   pgoff_t start, pgoff_t end)
 {
struct pagevec pvec;
pgoff_t next = start;
@@ -308,30 +321,10 @@ unlock:
break;
}
pagevec_release();
-   if (likely(!be_atomic))
-   cond_resched();
+   cond_resched();
}
return ret;
 }
-
-/**
- * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
- * @mapping: the address_space which holds the pages to invalidate
- * @start: the offset 'from' which to invalidate
- * @end: the offset 'to' which to invalidate (inclusive)
- *
- * This function only removes the unlocked pages, if you want to
- * remove all the pages of one inode, you must call truncate_inode_pages.
- *
- * invalidate_mapping_pages() will not block on IO activity. It will not
- * invalidate pages which are dirty, locked, under writeback or mapped into
- * pagetables.
- */
-unsigned long invalidate_mapping_pages(struct address_space *mapping,
-   pgoff_t start, pgoff_t end)
-{
-   return __invalidate_mapping_pages(mapping, start, end, false);
-}
 EXPORT_SYMBOL(invalidate_mapping_pages);
 
 /*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] AB-BA deadlock in drop_caches sysctl

2007-12-03 Thread Denis V. Lunev
There is a AB-BA deadlock regarding drop_caches sysctl. Here are the code
paths:

drop_pagecache
  spin_lock(_lock);
  invalidate_mapping_pages
try_to_release_page
  ext3_releasepage
journal_try_to_free_buffers
  __journal_try_to_free_buffer
spin_lock(>j_list_lock);

journal_commit_transaction
  spin_lock(>j_list_lock);
  __journal_remove_checkpoint
__journal_refile_buffer
  __journal_unfile_buffer
__journal_temp_unlink_buffer
  __set_page_dirty_nobuffers
__mark_inode_dirt
spin_lock(_lock);

The patch tries to address the issue - it drops inode_lock before digging into
invalidate_inode_pages. This seems sane as inode hold should not gone from the
list and should not change its place.

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>

--- ./fs/drop_caches.c.marker   2006-09-20 07:42:06.0 +0400
+++ ./fs/drop_caches.c  2007-12-03 15:43:44.0 +0300
@@ -14,15 +14,27 @@ int sysctl_drop_caches;
 
 static void drop_pagecache_sb(struct super_block *sb)
 {
-   struct inode *inode;
+   struct inode *inode, *old;
 
+   old = NULL;
spin_lock(_lock);
list_for_each_entry(inode, >s_inodes, i_sb_list) {
if (inode->i_state & (I_FREEING|I_WILL_FREE))
continue;
+   __iget(inode);
+   spin_unlock(_lock);
+
invalidate_inode_pages(inode->i_mapping);
+   if (old != NULL)
+   iput(old);
+   old = inode;
+
+   spin_lock(_lock);
}
spin_unlock(_lock);
+
+   if (old != NULL)
+   iput(old);
 }
 
 void drop_pagecache(void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: namespace support requires network modules to say "GPL"

2007-12-03 Thread Denis V. Lunev
Patrick McHardy wrote:
> Adrian Bunk wrote:
>> On Sun, Dec 02, 2007 at 09:03:56PM +0100, Patrick McHardy wrote:
>>
>>> For all I care binary modules can break, but frankly I don't see
>>> how encapsulating a couple of structures and pointers in a new
>>> structure and adding a new argument to existing functions shifts
>>> the decision about how a function should be usable to the namespace
>>> guys. IMO all functions should continue to be usable as before,
>>> as decided by whoever actually wrote them.
>>> ...
>>
>> Even ignoring the fact that it's unclear whether distributing modules
>> with not GPLv2 compatible licences is legal at all or might bring you
>> in jail,
> 
> Agreed, lets ignore that :)
> 
>> your statement has an interesting implication:
>>
>> Stuff like e.g. the EXPORT_SYMBOL(sk_alloc) predates the
>> EXPORT_SYMBOL_GPL stuff.
>>
>> Who is considered the author of this code?
>>
>> And when should he state whether he prefers to use EXPORT_SYMBOL_GPL
>> but wasn't able to use it at that when he wrote it since his code
>> predates it and is glad to be able to decide this now?
> 
> 
> He can state it when he feels like it, I don't see the point.
> Authors generally get to decide whether they use EXPORT_SYMBOL
> or EXPORT_SYMBOL_GPL unless in cases where its really clear-cut
> that EXPORT_SYMBOL is inapproriate. But thats a different matter.
> 
> If a symbol was OK to be used previously and something using it
> would not automatically be considered a derived work, how does
> passing _net to the function just to make the compiler
> happy, avoid BUG_ONs and generally keep things working as before
> make it more of a derived work?

We, namely, Pavel Emelyanov and me, if we have some rights as a
committers to this staff :), do not mind against change
EXPORT_SYMBOL_GPL to EXPORT_SYMBOL.

Regards,
Den
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: namespace support requires network modules to say GPL

2007-12-03 Thread Denis V. Lunev
Patrick McHardy wrote:
 Adrian Bunk wrote:
 On Sun, Dec 02, 2007 at 09:03:56PM +0100, Patrick McHardy wrote:

 For all I care binary modules can break, but frankly I don't see
 how encapsulating a couple of structures and pointers in a new
 structure and adding a new argument to existing functions shifts
 the decision about how a function should be usable to the namespace
 guys. IMO all functions should continue to be usable as before,
 as decided by whoever actually wrote them.
 ...

 Even ignoring the fact that it's unclear whether distributing modules
 with not GPLv2 compatible licences is legal at all or might bring you
 in jail,
 
 Agreed, lets ignore that :)
 
 your statement has an interesting implication:

 Stuff like e.g. the EXPORT_SYMBOL(sk_alloc) predates the
 EXPORT_SYMBOL_GPL stuff.

 Who is considered the author of this code?

 And when should he state whether he prefers to use EXPORT_SYMBOL_GPL
 but wasn't able to use it at that when he wrote it since his code
 predates it and is glad to be able to decide this now?
 
 
 He can state it when he feels like it, I don't see the point.
 Authors generally get to decide whether they use EXPORT_SYMBOL
 or EXPORT_SYMBOL_GPL unless in cases where its really clear-cut
 that EXPORT_SYMBOL is inapproriate. But thats a different matter.
 
 If a symbol was OK to be used previously and something using it
 would not automatically be considered a derived work, how does
 passing init_net to the function just to make the compiler
 happy, avoid BUG_ONs and generally keep things working as before
 make it more of a derived work?

We, namely, Pavel Emelyanov and me, if we have some rights as a
committers to this staff :), do not mind against change
EXPORT_SYMBOL_GPL to EXPORT_SYMBOL.

Regards,
Den
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] AB-BA deadlock in drop_caches sysctl (resend, the one sent was for 2.6.18)

2007-12-03 Thread Denis V. Lunev
There is a AB-BA deadlock regarding drop_caches sysctl. Here are the code
paths:

drop_pagecache
  spin_lock(inode_lock);
  invalidate_mapping_pages
try_to_release_page
  ext3_releasepage
journal_try_to_free_buffers
  __journal_try_to_free_buffer
spin_lock(journal-j_list_lock);

__journal_temp_unlink_buffer (called under journal-j_list_lock by comments)
  mark_buffer_dirty
__set_page_dirty
  __mark_inode_dirty
spin_lock(inode_lock);

The patch tries to address the issue - it drops inode_lock before digging into
invalidate_inode_pages. This seems sane as inode hold should not gone from the
list and should not change its place.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
--
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 59375ef..4ac80d8 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -14,15 +14,27 @@ int sysctl_drop_caches;
 
 static void drop_pagecache_sb(struct super_block *sb)
 {
-   struct inode *inode;
+   struct inode *inode, *old;
 
+   old = NULL;
spin_lock(inode_lock);
list_for_each_entry(inode, sb-s_inodes, i_sb_list) {
if (inode-i_state  (I_FREEING|I_WILL_FREE))
continue;
-   __invalidate_mapping_pages(inode-i_mapping, 0, -1, true);
+   __iget(inode);
+   spin_unlock(inode_lock);
+
+   if (old != NULL)
+   iput(old);
+   invalidate_mapping_pages(inode-i_mapping, 0, -1);
+   old = inode;
+
+   spin_lock(inode_lock);
}
spin_unlock(inode_lock);
+
+   if (old != NULL)
+   iput(old);
 }
 
 void drop_pagecache(void)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3ec4a4..b424dff 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1649,9 +1649,6 @@ extern int __invalidate_device(struct block_device *);
 extern int invalidate_partition(struct gendisk *, int);
 #endif
 extern int invalidate_inodes(struct super_block *);
-unsigned long __invalidate_mapping_pages(struct address_space *mapping,
-   pgoff_t start, pgoff_t end,
-   bool be_atomic);
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
pgoff_t start, pgoff_t end);
 
diff --git a/mm/truncate.c b/mm/truncate.c
index cadc156..5cbe1e7 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -266,8 +266,21 @@ void truncate_inode_pages(struct address_space *mapping, 
loff_t lstart)
 }
 EXPORT_SYMBOL(truncate_inode_pages);
 
-unsigned long __invalidate_mapping_pages(struct address_space *mapping,
-   pgoff_t start, pgoff_t end, bool be_atomic)
+/**
+ * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
+ * @mapping: the address_space which holds the pages to invalidate
+ * @start: the offset 'from' which to invalidate
+ * @end: the offset 'to' which to invalidate (inclusive)
+ *
+ * This function only removes the unlocked pages, if you want to
+ * remove all the pages of one inode, you must call truncate_inode_pages.
+ *
+ * invalidate_mapping_pages() will not block on IO activity. It will not
+ * invalidate pages which are dirty, locked, under writeback or mapped into
+ * pagetables.
+ */
+unsigned long invalidate_mapping_pages(struct address_space *mapping,
+   pgoff_t start, pgoff_t end)
 {
struct pagevec pvec;
pgoff_t next = start;
@@ -308,30 +321,10 @@ unlock:
break;
}
pagevec_release(pvec);
-   if (likely(!be_atomic))
-   cond_resched();
+   cond_resched();
}
return ret;
 }
-
-/**
- * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
- * @mapping: the address_space which holds the pages to invalidate
- * @start: the offset 'from' which to invalidate
- * @end: the offset 'to' which to invalidate (inclusive)
- *
- * This function only removes the unlocked pages, if you want to
- * remove all the pages of one inode, you must call truncate_inode_pages.
- *
- * invalidate_mapping_pages() will not block on IO activity. It will not
- * invalidate pages which are dirty, locked, under writeback or mapped into
- * pagetables.
- */
-unsigned long invalidate_mapping_pages(struct address_space *mapping,
-   pgoff_t start, pgoff_t end)
-{
-   return __invalidate_mapping_pages(mapping, start, end, false);
-}
 EXPORT_SYMBOL(invalidate_mapping_pages);
 
 /*
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] AB-BA deadlock in drop_caches sysctl

2007-12-03 Thread Denis V. Lunev
There is a AB-BA deadlock regarding drop_caches sysctl. Here are the code
paths:

drop_pagecache
  spin_lock(inode_lock);
  invalidate_mapping_pages
try_to_release_page
  ext3_releasepage
journal_try_to_free_buffers
  __journal_try_to_free_buffer
spin_lock(journal-j_list_lock);

journal_commit_transaction
  spin_lock(journal-j_list_lock);
  __journal_remove_checkpoint
__journal_refile_buffer
  __journal_unfile_buffer
__journal_temp_unlink_buffer
  __set_page_dirty_nobuffers
__mark_inode_dirt
spin_lock(inode_lock);

The patch tries to address the issue - it drops inode_lock before digging into
invalidate_inode_pages. This seems sane as inode hold should not gone from the
list and should not change its place.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]

--- ./fs/drop_caches.c.marker   2006-09-20 07:42:06.0 +0400
+++ ./fs/drop_caches.c  2007-12-03 15:43:44.0 +0300
@@ -14,15 +14,27 @@ int sysctl_drop_caches;
 
 static void drop_pagecache_sb(struct super_block *sb)
 {
-   struct inode *inode;
+   struct inode *inode, *old;
 
+   old = NULL;
spin_lock(inode_lock);
list_for_each_entry(inode, sb-s_inodes, i_sb_list) {
if (inode-i_state  (I_FREEING|I_WILL_FREE))
continue;
+   __iget(inode);
+   spin_unlock(inode_lock);
+
invalidate_inode_pages(inode-i_mapping);
+   if (old != NULL)
+   iput(old);
+   old = inode;
+
+   spin_lock(inode_lock);
}
spin_unlock(inode_lock);
+
+   if (old != NULL)
+   iput(old);
 }
 
 void drop_pagecache(void)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] .gitignore update for x86 arch

2007-10-16 Thread Denis V. Lunev
This patch:
- makes .gitignore files visible to git
- makes arch/x86/kernel/vsyscall_32.lds and arch/i386/boot invisible

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>

diff --git a/.gitignore b/.gitignore
index 27c3e83..22fb8fa 100644
--- a/.gitignore
+++ b/.gitignore
@@ -26,6 +26,7 @@ vmlinux*
 !vmlinux.lds.S
 System.map
 Module.symvers
+!.gitignore
 
 #
 # Generated include files
diff --git a/arch/i386/.gitignore b/arch/i386/.gitignore
new file mode 100644
index 000..36ef4c3
--- /dev/null
+++ b/arch/i386/.gitignore
@@ -0,0 +1 @@
+boot
diff --git a/arch/x86/kernel/.gitignore b/arch/x86/kernel/.gitignore
index 40836ad..4ea38a3 100644
--- a/arch/x86/kernel/.gitignore
+++ b/arch/x86/kernel/.gitignore
@@ -1 +1,2 @@
 vsyscall.lds
+vsyscall_32.lds
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] .gitignore update for x86 arch

2007-10-16 Thread Denis V. Lunev
This patch:
- makes .gitignore files visible to git
- makes arch/x86/kernel/vsyscall_32.lds and arch/i386/boot invisible

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]

diff --git a/.gitignore b/.gitignore
index 27c3e83..22fb8fa 100644
--- a/.gitignore
+++ b/.gitignore
@@ -26,6 +26,7 @@ vmlinux*
 !vmlinux.lds.S
 System.map
 Module.symvers
+!.gitignore
 
 #
 # Generated include files
diff --git a/arch/i386/.gitignore b/arch/i386/.gitignore
new file mode 100644
index 000..36ef4c3
--- /dev/null
+++ b/arch/i386/.gitignore
@@ -0,0 +1 @@
+boot
diff --git a/arch/x86/kernel/.gitignore b/arch/x86/kernel/.gitignore
index 40836ad..4ea38a3 100644
--- a/arch/x86/kernel/.gitignore
+++ b/arch/x86/kernel/.gitignore
@@ -1 +1,2 @@
 vsyscall.lds
+vsyscall_32.lds
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [NET] IPv6 oops bisected

2007-10-08 Thread Denis V. Lunev
David Miller wrote:
> From: "Denis V. Lunev" <[EMAIL PROTECTED]>
> Date: Mon, 08 Oct 2007 10:34:23 +0400
> 
>> OK. I am installing Fedora 7 right now...
> 
> You don't need to install Fedora, just read the code! :-)
> 
> The bug is obvious and it's been explained thoroughly in this
> thread.
> 
> When 'dev' is NULL in ip6_route_add() we need to figure out what
> namespace and/or loopback device you want to use.
> 
> There is no reason to do an entire dist install to work on fixing this
> bug, yikes!

I do understand the conditions when the bug happens. Its completely
clear :) Though I do not understand how to trigger it from command line
to test that the problem is resolved. Jeff was not kind enough to give
exact command line :(

The unfortunate thing with this place is that original Eric's code is
also broken here. I was too optimistic working on the original patchset.
In other way, but broken :(

So, I must stop and think...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [NET] IPv6 oops bisected

2007-10-08 Thread Denis V. Lunev
David Miller wrote:
> From: Herbert Xu <[EMAIL PROTECTED]>
> Date: Mon, 8 Oct 2007 14:19:42 +0800
> 
>> On Sun, Oct 07, 2007 at 11:16:08PM -0700, David Miller wrote:
>>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>>> index a7db84c..7109ad6 100644
>>> --- a/net/ipv6/route.c
>>> +++ b/net/ipv6/route.c
>>> @@ -1188,7 +1188,7 @@ int ip6_route_add(struct fib6_config *cfg)
>>> if ((cfg->fc_flags & RTF_REJECT) ||
>>> (dev && (dev->flags_LOOPBACK) && 
>>> !(addr_type_ADDR_LOOPBACK))) {
>>> /* hold loopback dev/idev if we haven't done so. */
>>> -   if (dev != dev->nd_net->loopback_dev) {
>>> +   if (!dev || (dev != dev->nd_net->loopback_dev)) {
>>> if (dev) {
>>> dev_put(dev);
>>> in6_dev_put(idev);
>> Unfortunately this'll still oops a few lines down when it tries
>> to assign dev->nd_net->loopabck_dev to dev.  The issue here is
>> which namespace are we in if dev is NULL.
> 
> Good catch.
> 
> I'm just going to revert my bogus fix and the original change
> for now.  Denis can resubmit the original patch once this
> is resolved.

OK. I am installing Fedora 7 right now...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [NET] IPv6 oops bisected

2007-10-08 Thread Denis V. Lunev
David Miller wrote:
 From: Herbert Xu [EMAIL PROTECTED]
 Date: Mon, 8 Oct 2007 14:19:42 +0800
 
 On Sun, Oct 07, 2007 at 11:16:08PM -0700, David Miller wrote:
 diff --git a/net/ipv6/route.c b/net/ipv6/route.c
 index a7db84c..7109ad6 100644
 --- a/net/ipv6/route.c
 +++ b/net/ipv6/route.c
 @@ -1188,7 +1188,7 @@ int ip6_route_add(struct fib6_config *cfg)
 if ((cfg-fc_flags  RTF_REJECT) ||
 (dev  (dev-flagsIFF_LOOPBACK)  
 !(addr_typeIPV6_ADDR_LOOPBACK))) {
 /* hold loopback dev/idev if we haven't done so. */
 -   if (dev != dev-nd_net-loopback_dev) {
 +   if (!dev || (dev != dev-nd_net-loopback_dev)) {
 if (dev) {
 dev_put(dev);
 in6_dev_put(idev);
 Unfortunately this'll still oops a few lines down when it tries
 to assign dev-nd_net-loopabck_dev to dev.  The issue here is
 which namespace are we in if dev is NULL.
 
 Good catch.
 
 I'm just going to revert my bogus fix and the original change
 for now.  Denis can resubmit the original patch once this
 is resolved.

OK. I am installing Fedora 7 right now...
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   >