Re: [PATCH] KVM: x86: reinstate vendor-agnostic check on SPEC_CTRL cpuid bits
On 12/3/20 6:15 PM, Paolo Bonzini wrote: > Until commit e7c587da1252 ("x86/speculation: Use synthetic bits for > IBRS/IBPB/STIBP", > 2018-05-17), KVM was testing both Intel and AMD CPUID bits before allowing the > guest to write MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD. Testing only Intel > bits > on VMX processors, or only AMD bits on SVM processors, fails if the guests are > created with the "opposite" vendor as the host. > > While at it, also tweak the host CPU check to use the vendor-agnostic feature > bit > X86_FEATURE_IBPB, since we only care about the availability of the MSR on the > host > here and not about specific CPUID bits. > > Fixes: e7c587da1252 ("x86/speculation: Use synthetic bits for > IBRS/IBPB/STIBP") > Cc: sta...@vger.kernel.org > Reported-by: Denis V. Lunev > Signed-off-by: Paolo Bonzini > --- > arch/x86/kvm/svm/svm.c | 3 ++- > arch/x86/kvm/vmx/vmx.c | 10 +++--- > 2 files changed, 9 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c > index 62390fbc9233..0b4aa60b2754 100644 > --- a/arch/x86/kvm/svm/svm.c > +++ b/arch/x86/kvm/svm/svm.c > @@ -2686,12 +2686,13 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct > msr_data *msr) > break; > case MSR_IA32_PRED_CMD: > if (!msr->host_initiated && > + !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL) && > !guest_cpuid_has(vcpu, X86_FEATURE_AMD_IBPB)) > return 1; > > if (data & ~PRED_CMD_IBPB) > return 1; > - if (!boot_cpu_has(X86_FEATURE_AMD_IBPB)) > + if (!boot_cpu_has(X86_FEATURE_IBPB)) > return 1; > if (!data) > break; > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index c3441e7e5a87..b74d2105ced7 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -2028,7 +2028,10 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct > msr_data *msr_info) > break; > case MSR_IA32_SPEC_CTRL: > if (!msr_info->host_initiated && > - !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL)) > +!guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL) && > +!guest_cpuid_has(vcpu, X86_FEATURE_AMD_STIBP) && > +!guest_cpuid_has(vcpu, X86_FEATURE_AMD_IBRS) && > +!guest_cpuid_has(vcpu, X86_FEATURE_AMD_SSBD)) > return 1; > > if (kvm_spec_ctrl_test_value(data)) > @@ -2063,12 +2066,13 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct > msr_data *msr_info) > goto find_uret_msr; > case MSR_IA32_PRED_CMD: > if (!msr_info->host_initiated && > - !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL)) > + !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL) && > + !guest_cpuid_has(vcpu, X86_FEATURE_AMD_IBPB)) > return 1; > > if (data & ~PRED_CMD_IBPB) > return 1; > - if (!boot_cpu_has(X86_FEATURE_SPEC_CTRL)) > + if (!boot_cpu_has(X86_FEATURE_IBPB)) > return 1; > if (!data) > break; Reviewed-by: Denis V. Lunev
[PATCH 1/1] i40iw: remove bogus call to netdev_master_upper_dev_get
Local variable netdev is not used in these calls. It should be noted, that this change is required to work in bonded mode. In the other case we would get the following assert: "RTNL: assertion failed at net/core/dev.c (5665)" with the calltrace as follows: dump_stack+0x19/0x1b netdev_master_upper_dev_get+0x61/0x70 i40iw_addr_resolve_neigh+0x1e8/0x220 i40iw_make_cm_node+0x296/0x700 ? i40iw_find_listener.isra.10+0xcc/0x110 i40iw_receive_ilq+0x3d4/0x810 i40iw_puda_poll_completion+0x341/0x420 i40iw_process_ceq+0xa5/0x280 i40iw_ceq_dpc+0x1e/0x40 tasklet_action+0x83/0x140 __do_softirq+0x125/0x2bb call_softirq+0x1c/0x30 do_softirq+0x65/0xa0 irq_exit+0x105/0x110 do_IRQ+0x56/0xf0 common_interrupt+0x16a/0x16a ? cpuidle_enter_state+0x57/0xd0 cpuidle_idle_call+0xde/0x230 arch_cpu_idle+0xe/0xc0 cpu_startup_entry+0x14a/0x1e0 start_secondary+0x1f7/0x270 start_cpu+0x5/0x14 Signed-off-by: Denis V. Lunev CC: Konstantin Khorenko CC: Faisal Latif CC: Shiraz Saleem CC: Doug Ledford CC: Jason Gunthorpe CC: linux-r...@vger.kernel.org CC: linux-kernel@vger.kernel.org --- drivers/infiniband/hw/i40iw/i40iw_cm.c | 8 1 file changed, 8 deletions(-) diff --git a/drivers/infiniband/hw/i40iw/i40iw_cm.c b/drivers/infiniband/hw/i40iw/i40iw_cm.c index bb78d3280acc..fa7a5ff498c7 100644 --- a/drivers/infiniband/hw/i40iw/i40iw_cm.c +++ b/drivers/infiniband/hw/i40iw/i40iw_cm.c @@ -1987,7 +1987,6 @@ static int i40iw_addr_resolve_neigh(struct i40iw_device *iwdev, struct rtable *rt; struct neighbour *neigh; int rc = arpindex; - struct net_device *netdev = iwdev->netdev; __be32 dst_ipaddr = htonl(dst_ip); __be32 src_ipaddr = htonl(src_ip); @@ -1997,9 +1996,6 @@ static int i40iw_addr_resolve_neigh(struct i40iw_device *iwdev, return rc; } - if (netif_is_bond_slave(netdev)) - netdev = netdev_master_upper_dev_get(netdev); - neigh = dst_neigh_lookup(>dst, _ipaddr); rcu_read_lock(); @@ -2065,7 +2061,6 @@ static int i40iw_addr_resolve_neigh_ipv6(struct i40iw_device *iwdev, { struct neighbour *neigh; int rc = arpindex; - struct net_device *netdev = iwdev->netdev; struct dst_entry *dst; struct sockaddr_in6 dst_addr; struct sockaddr_in6 src_addr; @@ -2086,9 +2081,6 @@ static int i40iw_addr_resolve_neigh_ipv6(struct i40iw_device *iwdev, return rc; } - if (netif_is_bond_slave(netdev)) - netdev = netdev_master_upper_dev_get(netdev); - neigh = dst_neigh_lookup(dst, dst_addr.sin6_addr.in6_u.u6_addr32); rcu_read_lock(); -- 2.17.1
Re: [PATCH RFC 0/2] ignore LBR-related MSRs
On 12/06/2017 07:39 PM, Jan Dakinevich wrote: > On Wed, 6 Dec 2017 10:06:48 -0500 > Konrad Rzeszutek Wilkwrote: > >> On Wed, Dec 06, 2017 at 02:43:01PM +0300, Jan Dakinevich wrote: >>> w2k16 essentials fails to boot if underlying hypervisor lacks of >>> support for LBR MSRs. To workaround the issue, it suggessted to >>> ignore these MSRs (but not all). >> This is without any hyperv enablement? Meaning normal stock guest? >> > Yes, it is normal guest. No hyperv enlightenments were enabled, and > "-cpu host" was specified in QEMU command line. The problem also exists with HyperV enabled. Den
Re: [PATCH RFC 0/2] ignore LBR-related MSRs
On 12/06/2017 07:39 PM, Jan Dakinevich wrote: > On Wed, 6 Dec 2017 10:06:48 -0500 > Konrad Rzeszutek Wilk wrote: > >> On Wed, Dec 06, 2017 at 02:43:01PM +0300, Jan Dakinevich wrote: >>> w2k16 essentials fails to boot if underlying hypervisor lacks of >>> support for LBR MSRs. To workaround the issue, it suggessted to >>> ignore these MSRs (but not all). >> This is without any hyperv enablement? Meaning normal stock guest? >> > Yes, it is normal guest. No hyperv enlightenments were enabled, and > "-cpu host" was specified in QEMU command line. The problem also exists with HyperV enabled. Den
Re: [PATCH] virtio_balloon: prevent uninitialized variable use
On 03/23/2017 06:17 PM, Arnd Bergmann wrote: > The latest gcc-7.0.1 snapshot reports a new warning: > > virtio/virtio_balloon.c: In function 'update_balloon_stats': > virtio/virtio_balloon.c:258:26: error: 'events[2]' is used uninitialized in > this function [-Werror=uninitialized] > virtio/virtio_balloon.c:260:26: error: 'events[3]' is used uninitialized in > this function [-Werror=uninitialized] > virtio/virtio_balloon.c:261:56: error: 'events[18]' is used uninitialized in > this function [-Werror=uninitialized] > virtio/virtio_balloon.c:262:56: error: 'events[17]' is used uninitialized in > this function [-Werror=uninitialized] > > This seems absolutely right, so we should add an extra check to > prevent copying uninitialized stack data into the statistics. > From all I can tell, this has been broken since the statistics code > was originally added in 2.6.34. > > Fixes: 9564e138b1f6 ("virtio: Add memory statistics reporting to the balloon > driver (V4)") > Signed-off-by: Arnd Bergmann <a...@arndb.de> Reviewed-by: Denis V. Lunev <d...@openvz.org> > --- > drivers/virtio/virtio_balloon.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index 4e1191508228..cd5c54e2003d 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -254,12 +254,14 @@ static void update_balloon_stats(struct virtio_balloon > *vb) > > available = si_mem_available(); > > +#ifdef CONFIG_VM_EVENT_COUNTERS > update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN, > pages_to_bytes(events[PSWPIN])); > update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT, > pages_to_bytes(events[PSWPOUT])); > update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]); > update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]); > +#endif > update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE, > pages_to_bytes(i.freeram)); > update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMTOT,
Re: [PATCH] virtio_balloon: prevent uninitialized variable use
On 03/23/2017 06:17 PM, Arnd Bergmann wrote: > The latest gcc-7.0.1 snapshot reports a new warning: > > virtio/virtio_balloon.c: In function 'update_balloon_stats': > virtio/virtio_balloon.c:258:26: error: 'events[2]' is used uninitialized in > this function [-Werror=uninitialized] > virtio/virtio_balloon.c:260:26: error: 'events[3]' is used uninitialized in > this function [-Werror=uninitialized] > virtio/virtio_balloon.c:261:56: error: 'events[18]' is used uninitialized in > this function [-Werror=uninitialized] > virtio/virtio_balloon.c:262:56: error: 'events[17]' is used uninitialized in > this function [-Werror=uninitialized] > > This seems absolutely right, so we should add an extra check to > prevent copying uninitialized stack data into the statistics. > From all I can tell, this has been broken since the statistics code > was originally added in 2.6.34. > > Fixes: 9564e138b1f6 ("virtio: Add memory statistics reporting to the balloon > driver (V4)") > Signed-off-by: Arnd Bergmann Reviewed-by: Denis V. Lunev > --- > drivers/virtio/virtio_balloon.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index 4e1191508228..cd5c54e2003d 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -254,12 +254,14 @@ static void update_balloon_stats(struct virtio_balloon > *vb) > > available = si_mem_available(); > > +#ifdef CONFIG_VM_EVENT_COUNTERS > update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN, > pages_to_bytes(events[PSWPIN])); > update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT, > pages_to_bytes(events[PSWPOUT])); > update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]); > update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]); > +#endif > update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE, > pages_to_bytes(i.freeram)); > update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMTOT,
[PATCH v2 1/1] virtio: update balloon size in balloon "probe"
From: Konstantin Neumoin <kneum...@virtuozzo.com> The following commit 'fad7b7b27b6a (virtio_balloon: Use a workqueue instead of "vballoon" kthread)' has added a regression. Original code with kthread starts the thread inside probe and checks the necessity to update balloon inside the thread immediately. Nowadays the code behaves differently. Work is queued only on the first command from the host after the negotiation. Thus there is a window especially at the guest startup or the module reloading when the balloon size is not updated until the notification from the host. This patch adds balloon size check at the end of the probe to match original behaviour. Signed-off-by: Konstantin Neumoin <kneum...@virtuozzo.com> Signed-off-by: Denis V. Lunev <d...@openvz.org> CC: Michael S. Tsirkin <m...@redhat.com> --- drivers/virtio/virtio_balloon.c | 2 ++ 1 file changed, 2 insertions(+) Changes from v1: - fixed description - removed update_balloon_size() call diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 4e7003d..181793f 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -577,6 +577,8 @@ static int virtballoon_probe(struct virtio_device *vdev) virtio_device_ready(vdev); + if (towards_target(vb)) + virtballoon_changed(vdev); return 0; out_del_vqs: -- 2.7.4
[PATCH v2 1/1] virtio: update balloon size in balloon "probe"
From: Konstantin Neumoin The following commit 'fad7b7b27b6a (virtio_balloon: Use a workqueue instead of "vballoon" kthread)' has added a regression. Original code with kthread starts the thread inside probe and checks the necessity to update balloon inside the thread immediately. Nowadays the code behaves differently. Work is queued only on the first command from the host after the negotiation. Thus there is a window especially at the guest startup or the module reloading when the balloon size is not updated until the notification from the host. This patch adds balloon size check at the end of the probe to match original behaviour. Signed-off-by: Konstantin Neumoin Signed-off-by: Denis V. Lunev CC: Michael S. Tsirkin --- drivers/virtio/virtio_balloon.c | 2 ++ 1 file changed, 2 insertions(+) Changes from v1: - fixed description - removed update_balloon_size() call diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 4e7003d..181793f 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -577,6 +577,8 @@ static int virtballoon_probe(struct virtio_device *vdev) virtio_device_ready(vdev); + if (towards_target(vb)) + virtballoon_changed(vdev); return 0; out_del_vqs: -- 2.7.4
[PATCH 1/1] update balloon size in balloon "probe"
From: Konstantin Neumoin <kneum...@virtuozzo.com> Patch Commit 3d2a3774c1b046f548ebea0391a602fd5685a307 Author: Michael S. Tsirkin <m...@redhat.com> Date: Tue Mar 10 11:55:08 2015 +1030 virtio-balloon: do not call blocking ops when !TASK_RUNNING has added a regression. Original code with wait_event_interruptible checked the condition before start waiting and started balloon operations if necessary. Right now balloon is not inflated if ballon target is set before the driver is loaded. Signed-off-by: Konstantin Neumoin <kneum...@virtuozzo.com> Signed-off-by: Denis V. Lunev <d...@openvz.org> CC: "Michael S. Tsirkin" <m...@redhat.com> --- drivers/virtio/virtio_balloon.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 4e7003d..0a6c10f 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -577,6 +577,10 @@ static int virtballoon_probe(struct virtio_device *vdev) virtio_device_ready(vdev); + if (towards_target(vb)) + virtballoon_changed(vdev); + update_balloon_size(vb); + return 0; out_del_vqs: -- 2.7.4
[PATCH 1/1] update balloon size in balloon "probe"
From: Konstantin Neumoin Patch Commit 3d2a3774c1b046f548ebea0391a602fd5685a307 Author: Michael S. Tsirkin Date: Tue Mar 10 11:55:08 2015 +1030 virtio-balloon: do not call blocking ops when !TASK_RUNNING has added a regression. Original code with wait_event_interruptible checked the condition before start waiting and started balloon operations if necessary. Right now balloon is not inflated if ballon target is set before the driver is loaded. Signed-off-by: Konstantin Neumoin Signed-off-by: Denis V. Lunev CC: "Michael S. Tsirkin" --- drivers/virtio/virtio_balloon.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 4e7003d..0a6c10f 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -577,6 +577,10 @@ static int virtballoon_probe(struct virtio_device *vdev) virtio_device_ready(vdev); + if (towards_target(vb)) + virtballoon_changed(vdev); + update_balloon_size(vb); + return 0; out_del_vqs: -- 2.7.4
[PATCH 1/1] balloon: stop inflate balloon after oom notification
From: Konstantin Neumoin <kneum...@virtuozzo.com> At this moment oom notification in balloon does not work as expected. After virtballoon_oom_notify there is an infinitive loop: - virtballoon_oom_notify was called and balloon was deflated - balloon get notification that config was changed, compare target and actual and try to reach target again. This patch adds global variable fail_counter which indicates that oom has been happened. We check that fail_counter was changed between calls update_balloon_size_func. In this case we should not try to inflate balloon even if actual != target. Signed-off-by: Konstantin Neumoin <kneum...@virtuozzo.com> Signed-off-by: Denis V. Lunev <d...@openvz.org> CC: Michael S. Tsirkin <m...@redhat.com> --- drivers/virtio/virtio_balloon.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 4e7003d..253bf05 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -50,6 +50,8 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM"); static struct vfsmount *balloon_mnt; #endif +static unsigned long fail_count; + struct virtio_balloon { struct virtio_device *vdev; struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; @@ -361,6 +363,8 @@ static int virtballoon_oom_notify(struct notifier_block *self, unsigned long *freed; unsigned num_freed_pages; + fail_count++; + vb = container_of(self, struct virtio_balloon, nb); if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) return NOTIFY_OK; @@ -386,11 +390,22 @@ static void update_balloon_size_func(struct work_struct *work) { struct virtio_balloon *vb; s64 diff; + static unsigned long fc; + + if (fc == 0) + fc = fail_count; vb = container_of(work, struct virtio_balloon, update_balloon_size_work); diff = towards_target(vb); + if (fc != fail_count) { + fc = fail_count; + /* Don't inflate balloon after oom notification */ + if (diff > 0) + return; + } + if (diff > 0) diff -= fill_balloon(vb, diff); else if (diff < 0) -- 2.7.4
[PATCH 1/1] balloon: stop inflate balloon after oom notification
From: Konstantin Neumoin At this moment oom notification in balloon does not work as expected. After virtballoon_oom_notify there is an infinitive loop: - virtballoon_oom_notify was called and balloon was deflated - balloon get notification that config was changed, compare target and actual and try to reach target again. This patch adds global variable fail_counter which indicates that oom has been happened. We check that fail_counter was changed between calls update_balloon_size_func. In this case we should not try to inflate balloon even if actual != target. Signed-off-by: Konstantin Neumoin Signed-off-by: Denis V. Lunev CC: Michael S. Tsirkin --- drivers/virtio/virtio_balloon.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 4e7003d..253bf05 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -50,6 +50,8 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM"); static struct vfsmount *balloon_mnt; #endif +static unsigned long fail_count; + struct virtio_balloon { struct virtio_device *vdev; struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; @@ -361,6 +363,8 @@ static int virtballoon_oom_notify(struct notifier_block *self, unsigned long *freed; unsigned num_freed_pages; + fail_count++; + vb = container_of(self, struct virtio_balloon, nb); if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) return NOTIFY_OK; @@ -386,11 +390,22 @@ static void update_balloon_size_func(struct work_struct *work) { struct virtio_balloon *vb; s64 diff; + static unsigned long fc; + + if (fc == 0) + fc = fail_count; vb = container_of(work, struct virtio_balloon, update_balloon_size_work); diff = towards_target(vb); + if (fc != fail_count) { + fc = fail_count; + /* Don't inflate balloon after oom notification */ + if (diff > 0) + return; + } + if (diff > 0) diff -= fill_balloon(vb, diff); else if (diff < 0) -- 2.7.4
[PATCH 1/1] balloon: check the number of available pages in leak balloon
From: Konstantin Neumoin <kneum...@virtuozzo.com> The balloon has a special mechanism that is subscribed to the oom notification which leads to deflation for a fixed number of pages. The number is always fixed even when the balloon is fully deflated. But leak_balloon did not expect that the pages to deflate will be more than taken, and raise a "BUG" in balloon_page_dequeue when page list will be empty. So, the simplest solution would be to check that the number of releases pages is less or equal to the number taken pages. Signed-off-by: Konstantin Neumoin <kneum...@virtuozzo.com> Signed-off-by: Denis V. Lunev <d...@openvz.org> CC: Michael S. Tsirkin <m...@redhat.com> --- drivers/virtio/virtio_balloon.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 476c0e3..f6ea8f4 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -202,6 +202,8 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) num = min(num, ARRAY_SIZE(vb->pfns)); mutex_lock(>balloon_lock); + /* We can't release more pages than taken */ + num = min(num, (size_t)vb->num_pages); for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { page = balloon_page_dequeue(vb_dev_info); -- 2.1.4
[PATCH 1/1] balloon: check the number of available pages in leak balloon
From: Konstantin Neumoin The balloon has a special mechanism that is subscribed to the oom notification which leads to deflation for a fixed number of pages. The number is always fixed even when the balloon is fully deflated. But leak_balloon did not expect that the pages to deflate will be more than taken, and raise a "BUG" in balloon_page_dequeue when page list will be empty. So, the simplest solution would be to check that the number of releases pages is less or equal to the number taken pages. Signed-off-by: Konstantin Neumoin Signed-off-by: Denis V. Lunev CC: Michael S. Tsirkin --- drivers/virtio/virtio_balloon.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 476c0e3..f6ea8f4 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -202,6 +202,8 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) num = min(num, ARRAY_SIZE(vb->pfns)); mutex_lock(>balloon_lock); + /* We can't release more pages than taken */ + num = min(num, (size_t)vb->num_pages); for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { page = balloon_page_dequeue(vb_dev_info); -- 2.1.4
Re: [PATCH 2/2] virtio_balloon: export 'available' memory to balloon statistics
On 02/23/2016 06:53 PM, Michael S. Tsirkin wrote: On Tue, Feb 23, 2016 at 06:26:47PM +0300, Denis V. Lunev wrote: On 02/23/2016 06:10 PM, Michael S. Tsirkin wrote: On Tue, Feb 16, 2016 at 06:50:52PM +0300, Denis V. Lunev wrote: From: Igor Redko <red...@virtuozzo.com> Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory statistics protocol, corresponding to 'Available' in /proc/meminfo. It indicates to the hypervisor how big the balloon can be inflated without pushing the guest system to swap. Signed-off-by: Igor Redko <red...@virtuozzo.com> Reviewed-by: Roman Kagan <rka...@virtuozzo.com> Signed-off-by: Denis V. Lunev <d...@openvz.org> CC: Michael S. Tsirkin <m...@redhat.com> CC: Andrew Morton <a...@linux-foundation.org> Oops - I missed the fact that this affects host/guest ABI. Can you please submit ABI update proposal to virtio tc? Spec patch would be even better. This is important to ensure there are no conflicts with other features being developed in parallel. hmmm From my point of view ABI remains untouched. Anything exposed by guest to host is ABI. Once we add stuff there, we never can remove it as some host might rely on it. The guest can send any amount of ; pairs and unknown tags are properly ignored by the host. That is why I think that this change is safe. What happens if someone uses the tag you used for VIRTIO_BALLOON_S_AVAIL, for some other purpose? Any tools using VIRTIO_BALLOON_S_AVAIL will be confused. actually this constant resides in QEMU only, values are reported above using JSON and string tags. Really, it's not hard to get a tag number from virtio TC, so please just do this. ok. So do you propose to negotiate maximum allowed tag to send at the driver start time? we will have to guard this exchange with proper flag in feature space then. This could be done but from my point of view this looks like serious over-complication. Do we have somebody who can judge? Den
Re: [PATCH 2/2] virtio_balloon: export 'available' memory to balloon statistics
On 02/23/2016 06:53 PM, Michael S. Tsirkin wrote: On Tue, Feb 23, 2016 at 06:26:47PM +0300, Denis V. Lunev wrote: On 02/23/2016 06:10 PM, Michael S. Tsirkin wrote: On Tue, Feb 16, 2016 at 06:50:52PM +0300, Denis V. Lunev wrote: From: Igor Redko Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory statistics protocol, corresponding to 'Available' in /proc/meminfo. It indicates to the hypervisor how big the balloon can be inflated without pushing the guest system to swap. Signed-off-by: Igor Redko Reviewed-by: Roman Kagan Signed-off-by: Denis V. Lunev CC: Michael S. Tsirkin CC: Andrew Morton Oops - I missed the fact that this affects host/guest ABI. Can you please submit ABI update proposal to virtio tc? Spec patch would be even better. This is important to ensure there are no conflicts with other features being developed in parallel. hmmm From my point of view ABI remains untouched. Anything exposed by guest to host is ABI. Once we add stuff there, we never can remove it as some host might rely on it. The guest can send any amount of ; pairs and unknown tags are properly ignored by the host. That is why I think that this change is safe. What happens if someone uses the tag you used for VIRTIO_BALLOON_S_AVAIL, for some other purpose? Any tools using VIRTIO_BALLOON_S_AVAIL will be confused. actually this constant resides in QEMU only, values are reported above using JSON and string tags. Really, it's not hard to get a tag number from virtio TC, so please just do this. ok. So do you propose to negotiate maximum allowed tag to send at the driver start time? we will have to guard this exchange with proper flag in feature space then. This could be done but from my point of view this looks like serious over-complication. Do we have somebody who can judge? Den
Re: [PATCH 2/2] virtio_balloon: export 'available' memory to balloon statistics
On 02/23/2016 06:10 PM, Michael S. Tsirkin wrote: On Tue, Feb 16, 2016 at 06:50:52PM +0300, Denis V. Lunev wrote: From: Igor Redko <red...@virtuozzo.com> Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory statistics protocol, corresponding to 'Available' in /proc/meminfo. It indicates to the hypervisor how big the balloon can be inflated without pushing the guest system to swap. Signed-off-by: Igor Redko <red...@virtuozzo.com> Reviewed-by: Roman Kagan <rka...@virtuozzo.com> Signed-off-by: Denis V. Lunev <d...@openvz.org> CC: Michael S. Tsirkin <m...@redhat.com> CC: Andrew Morton <a...@linux-foundation.org> Oops - I missed the fact that this affects host/guest ABI. Can you please submit ABI update proposal to virtio tc? Spec patch would be even better. This is important to ensure there are no conflicts with other features being developed in parallel. hmmm From my point of view ABI remains untouched. The guest can send any amount of ; pairs and unknown tags are properly ignored by the host. That is why I think that this change is safe.
Re: [PATCH 2/2] virtio_balloon: export 'available' memory to balloon statistics
On 02/23/2016 06:10 PM, Michael S. Tsirkin wrote: On Tue, Feb 16, 2016 at 06:50:52PM +0300, Denis V. Lunev wrote: From: Igor Redko Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory statistics protocol, corresponding to 'Available' in /proc/meminfo. It indicates to the hypervisor how big the balloon can be inflated without pushing the guest system to swap. Signed-off-by: Igor Redko Reviewed-by: Roman Kagan Signed-off-by: Denis V. Lunev CC: Michael S. Tsirkin CC: Andrew Morton Oops - I missed the fact that this affects host/guest ABI. Can you please submit ABI update proposal to virtio tc? Spec patch would be even better. This is important to ensure there are no conflicts with other features being developed in parallel. hmmm From my point of view ABI remains untouched. The guest can send any amount of ; pairs and unknown tags are properly ignored by the host. That is why I think that this change is safe.
[PATCH 1/2] calculate 'available' memory in the separate function
From: Igor Redko <red...@virtuozzo.com> Factor out calculation of the available memory counter into a separate exportable function, in order to be able to use it in other parts of the kernel. In particular, it appears a relevant metric to report to the hypervisor via virtio-balloon statistics interface (in a followup patch). Signed-off-by: Igor Redko <red...@virtuozzo.com> Reviewed-by: Roman Kagan <rka...@virtuozzo.com> Signed-off-by: Denis V. Lunev <d...@openvz.org> CC: Michael S. Tsirkin <m...@redhat.com> CC: Andrew Morton <a...@linux-foundation.org> --- fs/proc/meminfo.c | 31 +-- include/linux/mm.h | 1 + mm/page_alloc.c| 43 +++ 3 files changed, 45 insertions(+), 30 deletions(-) diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index df4661a..8372046 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -29,10 +29,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v) unsigned long committed; long cached; long available; - unsigned long pagecache; - unsigned long wmark_low = 0; unsigned long pages[NR_LRU_LISTS]; - struct zone *zone; int lru; /* @@ -51,33 +48,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v) for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++) pages[lru] = global_page_state(NR_LRU_BASE + lru); - for_each_zone(zone) - wmark_low += zone->watermark[WMARK_LOW]; - - /* -* Estimate the amount of memory available for userspace allocations, -* without causing swapping. -*/ - available = i.freeram - totalreserve_pages; - - /* -* Not all the page cache can be freed, otherwise the system will -* start swapping. Assume at least half of the page cache, or the -* low watermark worth of cache, needs to stay. -*/ - pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE]; - pagecache -= min(pagecache / 2, wmark_low); - available += pagecache; - - /* -* Part of the reclaimable slab consists of items that are in use, -* and cannot be freed. Cap this estimate at the low watermark. -*/ - available += global_page_state(NR_SLAB_RECLAIMABLE) - -min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low); - - if (available < 0) - available = 0; + available = si_mem_available(); /* * Tagged format, for easy grepping and expansion. diff --git a/include/linux/mm.h b/include/linux/mm.h index 516e149..a8c4144 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1862,6 +1862,7 @@ extern int __meminit init_per_zone_wmark_min(void); extern void mem_init(void); extern void __init mmap_init(void); extern void show_mem(unsigned int flags); +extern long si_mem_available(void); extern void si_meminfo(struct sysinfo * val); extern void si_meminfo_node(struct sysinfo *val, int nid); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 838ca8bb..dae813c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3603,6 +3603,49 @@ static inline void show_node(struct zone *zone) printk("Node %d ", zone_to_nid(zone)); } +long si_mem_available(void) +{ + long available; + unsigned long pagecache; + unsigned long wmark_low = 0; + unsigned long pages[NR_LRU_LISTS]; + struct zone *zone; + int lru; + + for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++) + pages[lru] = global_page_state(NR_LRU_BASE + lru); + + for_each_zone(zone) + wmark_low += zone->watermark[WMARK_LOW]; + + /* +* Estimate the amount of memory available for userspace allocations, +* without causing swapping. +*/ + available = global_page_state(NR_FREE_PAGES) - totalreserve_pages; + + /* +* Not all the page cache can be freed, otherwise the system will +* start swapping. Assume at least half of the page cache, or the +* low watermark worth of cache, needs to stay. +*/ + pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE]; + pagecache -= min(pagecache / 2, wmark_low); + available += pagecache; + + /* +* Part of the reclaimable slab consists of items that are in use, +* and cannot be freed. Cap this estimate at the low watermark. +*/ + available += global_page_state(NR_SLAB_RECLAIMABLE) - +min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low); + + if (available < 0) + available = 0; + return available; +} +EXPORT_SYMBOL_GPL(si_mem_available); + void si_meminfo(struct sysinfo *val) { val->totalram = totalram_pages; -- 2.5.0
[PATCH 1/2] calculate 'available' memory in the separate function
From: Igor Redko Factor out calculation of the available memory counter into a separate exportable function, in order to be able to use it in other parts of the kernel. In particular, it appears a relevant metric to report to the hypervisor via virtio-balloon statistics interface (in a followup patch). Signed-off-by: Igor Redko Reviewed-by: Roman Kagan Signed-off-by: Denis V. Lunev CC: Michael S. Tsirkin CC: Andrew Morton --- fs/proc/meminfo.c | 31 +-- include/linux/mm.h | 1 + mm/page_alloc.c| 43 +++ 3 files changed, 45 insertions(+), 30 deletions(-) diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index df4661a..8372046 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -29,10 +29,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v) unsigned long committed; long cached; long available; - unsigned long pagecache; - unsigned long wmark_low = 0; unsigned long pages[NR_LRU_LISTS]; - struct zone *zone; int lru; /* @@ -51,33 +48,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v) for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++) pages[lru] = global_page_state(NR_LRU_BASE + lru); - for_each_zone(zone) - wmark_low += zone->watermark[WMARK_LOW]; - - /* -* Estimate the amount of memory available for userspace allocations, -* without causing swapping. -*/ - available = i.freeram - totalreserve_pages; - - /* -* Not all the page cache can be freed, otherwise the system will -* start swapping. Assume at least half of the page cache, or the -* low watermark worth of cache, needs to stay. -*/ - pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE]; - pagecache -= min(pagecache / 2, wmark_low); - available += pagecache; - - /* -* Part of the reclaimable slab consists of items that are in use, -* and cannot be freed. Cap this estimate at the low watermark. -*/ - available += global_page_state(NR_SLAB_RECLAIMABLE) - -min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low); - - if (available < 0) - available = 0; + available = si_mem_available(); /* * Tagged format, for easy grepping and expansion. diff --git a/include/linux/mm.h b/include/linux/mm.h index 516e149..a8c4144 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1862,6 +1862,7 @@ extern int __meminit init_per_zone_wmark_min(void); extern void mem_init(void); extern void __init mmap_init(void); extern void show_mem(unsigned int flags); +extern long si_mem_available(void); extern void si_meminfo(struct sysinfo * val); extern void si_meminfo_node(struct sysinfo *val, int nid); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 838ca8bb..dae813c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3603,6 +3603,49 @@ static inline void show_node(struct zone *zone) printk("Node %d ", zone_to_nid(zone)); } +long si_mem_available(void) +{ + long available; + unsigned long pagecache; + unsigned long wmark_low = 0; + unsigned long pages[NR_LRU_LISTS]; + struct zone *zone; + int lru; + + for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++) + pages[lru] = global_page_state(NR_LRU_BASE + lru); + + for_each_zone(zone) + wmark_low += zone->watermark[WMARK_LOW]; + + /* +* Estimate the amount of memory available for userspace allocations, +* without causing swapping. +*/ + available = global_page_state(NR_FREE_PAGES) - totalreserve_pages; + + /* +* Not all the page cache can be freed, otherwise the system will +* start swapping. Assume at least half of the page cache, or the +* low watermark worth of cache, needs to stay. +*/ + pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE]; + pagecache -= min(pagecache / 2, wmark_low); + available += pagecache; + + /* +* Part of the reclaimable slab consists of items that are in use, +* and cannot be freed. Cap this estimate at the low watermark. +*/ + available += global_page_state(NR_SLAB_RECLAIMABLE) - +min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low); + + if (available < 0) + available = 0; + return available; +} +EXPORT_SYMBOL_GPL(si_mem_available); + void si_meminfo(struct sysinfo *val) { val->totalram = totalram_pages; -- 2.5.0
[PATCH 0/2] export 'available' memory to virtio balloon statistics
Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory statistics protocol, corresponding to 'Available' in /proc/meminfo. It indicates to the hypervisor how big the balloon can be inflated without pushing the guest system to swap. This metric would be very useful in VM orchestration software to improve memory management of different VMs under overcommit. Signed-off-by: Igor Redko <red...@virtuozzo.com> Reviewed-by: Roman Kagan <rka...@virtuozzo.com> Signed-off-by: Denis V. Lunev <d...@openvz.org> CC: Michael S. Tsirkin <m...@redhat.com> CC: Andrew Morton <a...@linux-foundation.org> Igor Redko (2): calculate 'available' memory in the separate function virtio_balloon: export 'available' memory to balloon statistics drivers/virtio/virtio_balloon.c | 6 ++ fs/proc/meminfo.c | 31 +- include/linux/mm.h | 1 + include/uapi/linux/virtio_balloon.h | 3 ++- mm/page_alloc.c | 43 + 5 files changed, 53 insertions(+), 31 deletions(-) -- 2.5.0
[PATCH 0/2] export 'available' memory to virtio balloon statistics
Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory statistics protocol, corresponding to 'Available' in /proc/meminfo. It indicates to the hypervisor how big the balloon can be inflated without pushing the guest system to swap. This metric would be very useful in VM orchestration software to improve memory management of different VMs under overcommit. Signed-off-by: Igor Redko Reviewed-by: Roman Kagan Signed-off-by: Denis V. Lunev CC: Michael S. Tsirkin CC: Andrew Morton Igor Redko (2): calculate 'available' memory in the separate function virtio_balloon: export 'available' memory to balloon statistics drivers/virtio/virtio_balloon.c | 6 ++ fs/proc/meminfo.c | 31 +- include/linux/mm.h | 1 + include/uapi/linux/virtio_balloon.h | 3 ++- mm/page_alloc.c | 43 + 5 files changed, 53 insertions(+), 31 deletions(-) -- 2.5.0
[PATCH 2/2] virtio_balloon: export 'available' memory to balloon statistics
From: Igor Redko <red...@virtuozzo.com> Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory statistics protocol, corresponding to 'Available' in /proc/meminfo. It indicates to the hypervisor how big the balloon can be inflated without pushing the guest system to swap. Signed-off-by: Igor Redko <red...@virtuozzo.com> Reviewed-by: Roman Kagan <rka...@virtuozzo.com> Signed-off-by: Denis V. Lunev <d...@openvz.org> CC: Michael S. Tsirkin <m...@redhat.com> CC: Andrew Morton <a...@linux-foundation.org> --- drivers/virtio/virtio_balloon.c | 6 ++ include/uapi/linux/virtio_balloon.h | 3 ++- 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 0c3691f..f2b77de 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -30,6 +30,7 @@ #include #include #include +#include /* * Balloon device works in 4K page units. So each page is pointed to by @@ -229,10 +230,13 @@ static void update_balloon_stats(struct virtio_balloon *vb) unsigned long events[NR_VM_EVENT_ITEMS]; struct sysinfo i; int idx = 0; + long available; all_vm_events(events); si_meminfo(); + available = si_mem_available(); + update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN, pages_to_bytes(events[PSWPIN])); update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT, @@ -243,6 +247,8 @@ static void update_balloon_stats(struct virtio_balloon *vb) pages_to_bytes(i.freeram)); update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMTOT, pages_to_bytes(i.totalram)); + update_stat(vb, idx++, VIRTIO_BALLOON_S_AVAIL, + pages_to_bytes(available)); } /* diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index d7f1cbc..343d7dd 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -51,7 +51,8 @@ struct virtio_balloon_config { #define VIRTIO_BALLOON_S_MINFLT 3 /* Number of minor faults */ #define VIRTIO_BALLOON_S_MEMFREE 4 /* Total amount of free memory */ #define VIRTIO_BALLOON_S_MEMTOT 5 /* Total amount of memory */ -#define VIRTIO_BALLOON_S_NR 6 +#define VIRTIO_BALLOON_S_AVAIL6 /* Available memory as in /proc */ +#define VIRTIO_BALLOON_S_NR 7 /* * Memory statistics structure. -- 2.5.0
[PATCH 2/2] virtio_balloon: export 'available' memory to balloon statistics
From: Igor Redko Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory statistics protocol, corresponding to 'Available' in /proc/meminfo. It indicates to the hypervisor how big the balloon can be inflated without pushing the guest system to swap. Signed-off-by: Igor Redko Reviewed-by: Roman Kagan Signed-off-by: Denis V. Lunev CC: Michael S. Tsirkin CC: Andrew Morton --- drivers/virtio/virtio_balloon.c | 6 ++ include/uapi/linux/virtio_balloon.h | 3 ++- 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 0c3691f..f2b77de 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -30,6 +30,7 @@ #include #include #include +#include /* * Balloon device works in 4K page units. So each page is pointed to by @@ -229,10 +230,13 @@ static void update_balloon_stats(struct virtio_balloon *vb) unsigned long events[NR_VM_EVENT_ITEMS]; struct sysinfo i; int idx = 0; + long available; all_vm_events(events); si_meminfo(); + available = si_mem_available(); + update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN, pages_to_bytes(events[PSWPIN])); update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT, @@ -243,6 +247,8 @@ static void update_balloon_stats(struct virtio_balloon *vb) pages_to_bytes(i.freeram)); update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMTOT, pages_to_bytes(i.totalram)); + update_stat(vb, idx++, VIRTIO_BALLOON_S_AVAIL, + pages_to_bytes(available)); } /* diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index d7f1cbc..343d7dd 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -51,7 +51,8 @@ struct virtio_balloon_config { #define VIRTIO_BALLOON_S_MINFLT 3 /* Number of minor faults */ #define VIRTIO_BALLOON_S_MEMFREE 4 /* Total amount of free memory */ #define VIRTIO_BALLOON_S_MEMTOT 5 /* Total amount of memory */ -#define VIRTIO_BALLOON_S_NR 6 +#define VIRTIO_BALLOON_S_AVAIL6 /* Available memory as in /proc */ +#define VIRTIO_BALLOON_S_NR 7 /* * Memory statistics structure. -- 2.5.0
Re: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value
On 11/12/2015 08:35 PM, KY Srinivasan wrote: -Original Message- From: Denis V. Lunev [mailto:d...@openvz.org] Sent: Wednesday, November 11, 2015 11:16 PM To: KY Srinivasan Cc: rka...@virtuozzo.com; de...@linuxdriverproject.org; linux- ker...@vger.kernel.org; Andrey Smetanin ; Haiyang Zhang ; Vitaly Kuznetsov Subject: Re: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value On 11/02/2015 10:42 PM, KY Srinivasan wrote: -Original Message- From: Denis V. Lunev [mailto:d...@openvz.org] Sent: Monday, November 2, 2015 3:34 AM Cc: rka...@virtuozzo.com; de...@linuxdriverproject.org; linux- ker...@vger.kernel.org; Andrey Smetanin ; KY Srinivasan ; Haiyang Zhang ; Vitaly Kuznetsov ; Denis V. Lunev Subject: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value From: Andrey Smetanin Hypervisor Top Level Functional Specification v3/4 says that TSC page sequence value = -1(0x) is used to indicate that TSC page no longer reliable source of reference timer. Unfortunately, we found that Windows Hyper-V guest side implementation uses sequence value = 0 to indicate that Tsc page no longer valid. This is clearly visible inside Windows 2012R2 ntoskrnl.exe HvlGetReferenceTime() function dissassembly: HvlGetReferenceTime proc near xchgax, ax loc_1401C3132: mov rax, cs:HvlpReferenceTscPage mov r9d, [rax] testr9d, r9d jz short loc_1401C3176 rdtsc mov rcx, cs:HvlpReferenceTscPage shl rdx, 20h or rdx, rax mov rax, [rcx+8] mov rcx, cs:HvlpReferenceTscPage mov r8, [rcx+10h] mul rdx mov rax, cs:HvlpReferenceTscPage add rdx, r8 mov ecx, [rax] cmp ecx, r9d jnz short loc_1401C3132 jmp short loc_1401C3184 loc_1401C3176: mov ecx, 4020h rdmsr shl rdx, 20h or rdx, rax loc_1401C3184: mov rax, rdx retn HvlGetReferenceTime endp This patch aligns Tsc page invalid sequence value with Windows Hyper-V guest implementation which is more compatible with both Hyper-V hypervisor and KVM hypervisor. Signed-off-by: Andrey Smetanin CC: "K. Y. Srinivasan" Thanks Andrey; the Hyper-V team will be updating the Hyper-V documentation. Acked-by: K. Y. Srinivasan Regards, K. Y K.Y., can you pls clarify the state of this patch? It is a bit unclear to me whether it is applied or not. I will be submitting this shortly. By the way, I also do not see the following patch "drivers/hv: cleanup synic msrs if vmbus connect failed" as applied in the Linux-next. You have promised to resend it will correct author. It was resent on October 29. It is in Greg's queue. Regards, K. Y Thank you in advance, Den thank you very much for this update :) Den -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value
On 11/12/2015 08:35 PM, KY Srinivasan wrote: -Original Message- From: Denis V. Lunev [mailto:d...@openvz.org] Sent: Wednesday, November 11, 2015 11:16 PM To: KY Srinivasan <k...@microsoft.com> Cc: rka...@virtuozzo.com; de...@linuxdriverproject.org; linux- ker...@vger.kernel.org; Andrey Smetanin <asmeta...@virtuozzo.com>; Haiyang Zhang <haiya...@microsoft.com>; Vitaly Kuznetsov <vkuzn...@redhat.com> Subject: Re: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value On 11/02/2015 10:42 PM, KY Srinivasan wrote: -Original Message----- From: Denis V. Lunev [mailto:d...@openvz.org] Sent: Monday, November 2, 2015 3:34 AM Cc: rka...@virtuozzo.com; de...@linuxdriverproject.org; linux- ker...@vger.kernel.org; Andrey Smetanin <asmeta...@virtuozzo.com>; KY Srinivasan <k...@microsoft.com>; Haiyang Zhang <haiya...@microsoft.com>; Vitaly Kuznetsov <vkuzn...@redhat.com>; Denis V. Lunev <d...@openvz.org> Subject: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value From: Andrey Smetanin <asmeta...@virtuozzo.com> Hypervisor Top Level Functional Specification v3/4 says that TSC page sequence value = -1(0x) is used to indicate that TSC page no longer reliable source of reference timer. Unfortunately, we found that Windows Hyper-V guest side implementation uses sequence value = 0 to indicate that Tsc page no longer valid. This is clearly visible inside Windows 2012R2 ntoskrnl.exe HvlGetReferenceTime() function dissassembly: HvlGetReferenceTime proc near xchgax, ax loc_1401C3132: mov rax, cs:HvlpReferenceTscPage mov r9d, [rax] testr9d, r9d jz short loc_1401C3176 rdtsc mov rcx, cs:HvlpReferenceTscPage shl rdx, 20h or rdx, rax mov rax, [rcx+8] mov rcx, cs:HvlpReferenceTscPage mov r8, [rcx+10h] mul rdx mov rax, cs:HvlpReferenceTscPage add rdx, r8 mov ecx, [rax] cmp ecx, r9d jnz short loc_1401C3132 jmp short loc_1401C3184 loc_1401C3176: mov ecx, 4020h rdmsr shl rdx, 20h or rdx, rax loc_1401C3184: mov rax, rdx retn HvlGetReferenceTime endp This patch aligns Tsc page invalid sequence value with Windows Hyper-V guest implementation which is more compatible with both Hyper-V hypervisor and KVM hypervisor. Signed-off-by: Andrey Smetanin <asmeta...@virtuozzo.com> CC: "K. Y. Srinivasan" <k...@microsoft.com> Thanks Andrey; the Hyper-V team will be updating the Hyper-V documentation. Acked-by: K. Y. Srinivasan <k...@microsoft.com> Regards, K. Y K.Y., can you pls clarify the state of this patch? It is a bit unclear to me whether it is applied or not. I will be submitting this shortly. By the way, I also do not see the following patch "drivers/hv: cleanup synic msrs if vmbus connect failed" as applied in the Linux-next. You have promised to resend it will correct author. It was resent on October 29. It is in Greg's queue. Regards, K. Y Thank you in advance, Den thank you very much for this update :) Den -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value
On 11/02/2015 10:42 PM, KY Srinivasan wrote: -Original Message- From: Denis V. Lunev [mailto:d...@openvz.org] Sent: Monday, November 2, 2015 3:34 AM Cc: rka...@virtuozzo.com; de...@linuxdriverproject.org; linux- ker...@vger.kernel.org; Andrey Smetanin ; KY Srinivasan ; Haiyang Zhang ; Vitaly Kuznetsov ; Denis V. Lunev Subject: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value From: Andrey Smetanin Hypervisor Top Level Functional Specification v3/4 says that TSC page sequence value = -1(0x) is used to indicate that TSC page no longer reliable source of reference timer. Unfortunately, we found that Windows Hyper-V guest side implementation uses sequence value = 0 to indicate that Tsc page no longer valid. This is clearly visible inside Windows 2012R2 ntoskrnl.exe HvlGetReferenceTime() function dissassembly: HvlGetReferenceTime proc near xchgax, ax loc_1401C3132: mov rax, cs:HvlpReferenceTscPage mov r9d, [rax] testr9d, r9d jz short loc_1401C3176 rdtsc mov rcx, cs:HvlpReferenceTscPage shl rdx, 20h or rdx, rax mov rax, [rcx+8] mov rcx, cs:HvlpReferenceTscPage mov r8, [rcx+10h] mul rdx mov rax, cs:HvlpReferenceTscPage add rdx, r8 mov ecx, [rax] cmp ecx, r9d jnz short loc_1401C3132 jmp short loc_1401C3184 loc_1401C3176: mov ecx, 4020h rdmsr shl rdx, 20h or rdx, rax loc_1401C3184: mov rax, rdx retn HvlGetReferenceTime endp This patch aligns Tsc page invalid sequence value with Windows Hyper-V guest implementation which is more compatible with both Hyper-V hypervisor and KVM hypervisor. Signed-off-by: Andrey Smetanin CC: "K. Y. Srinivasan" Thanks Andrey; the Hyper-V team will be updating the Hyper-V documentation. Acked-by: K. Y. Srinivasan Regards, K. Y K.Y., can you pls clarify the state of this patch? It is a bit unclear to me whether it is applied or not. By the way, I also do not see the following patch "drivers/hv: cleanup synic msrs if vmbus connect failed" as applied in the Linux-next. You have promised to resend it will correct author. Thank you in advance, Den -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value
On 11/02/2015 10:42 PM, KY Srinivasan wrote: -Original Message- From: Denis V. Lunev [mailto:d...@openvz.org] Sent: Monday, November 2, 2015 3:34 AM Cc: rka...@virtuozzo.com; de...@linuxdriverproject.org; linux- ker...@vger.kernel.org; Andrey Smetanin <asmeta...@virtuozzo.com>; KY Srinivasan <k...@microsoft.com>; Haiyang Zhang <haiya...@microsoft.com>; Vitaly Kuznetsov <vkuzn...@redhat.com>; Denis V. Lunev <d...@openvz.org> Subject: [PATCH 1/1] drivers/hv: correct tsc page sequence invalid value From: Andrey Smetanin <asmeta...@virtuozzo.com> Hypervisor Top Level Functional Specification v3/4 says that TSC page sequence value = -1(0x) is used to indicate that TSC page no longer reliable source of reference timer. Unfortunately, we found that Windows Hyper-V guest side implementation uses sequence value = 0 to indicate that Tsc page no longer valid. This is clearly visible inside Windows 2012R2 ntoskrnl.exe HvlGetReferenceTime() function dissassembly: HvlGetReferenceTime proc near xchgax, ax loc_1401C3132: mov rax, cs:HvlpReferenceTscPage mov r9d, [rax] testr9d, r9d jz short loc_1401C3176 rdtsc mov rcx, cs:HvlpReferenceTscPage shl rdx, 20h or rdx, rax mov rax, [rcx+8] mov rcx, cs:HvlpReferenceTscPage mov r8, [rcx+10h] mul rdx mov rax, cs:HvlpReferenceTscPage add rdx, r8 mov ecx, [rax] cmp ecx, r9d jnz short loc_1401C3132 jmp short loc_1401C3184 loc_1401C3176: mov ecx, 4020h rdmsr shl rdx, 20h or rdx, rax loc_1401C3184: mov rax, rdx retn HvlGetReferenceTime endp This patch aligns Tsc page invalid sequence value with Windows Hyper-V guest implementation which is more compatible with both Hyper-V hypervisor and KVM hypervisor. Signed-off-by: Andrey Smetanin <asmeta...@virtuozzo.com> CC: "K. Y. Srinivasan" <k...@microsoft.com> Thanks Andrey; the Hyper-V team will be updating the Hyper-V documentation. Acked-by: K. Y. Srinivasan <k...@microsoft.com> Regards, K. Y K.Y., can you pls clarify the state of this patch? It is a bit unclear to me whether it is applied or not. By the way, I also do not see the following patch "drivers/hv: cleanup synic msrs if vmbus connect failed" as applied in the Linux-next. You have promised to resend it will correct author. Thank you in advance, Den -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/1] drivers/hv: correct tsc page sequence invalid value
From: Andrey Smetanin Hypervisor Top Level Functional Specification v3/4 says that TSC page sequence value = -1(0x) is used to indicate that TSC page no longer reliable source of reference timer. Unfortunately, we found that Windows Hyper-V guest side implementation uses sequence value = 0 to indicate that Tsc page no longer valid. This is clearly visible inside Windows 2012R2 ntoskrnl.exe HvlGetReferenceTime() function dissassembly: HvlGetReferenceTime proc near xchgax, ax loc_1401C3132: mov rax, cs:HvlpReferenceTscPage mov r9d, [rax] testr9d, r9d jz short loc_1401C3176 rdtsc mov rcx, cs:HvlpReferenceTscPage shl rdx, 20h or rdx, rax mov rax, [rcx+8] mov rcx, cs:HvlpReferenceTscPage mov r8, [rcx+10h] mul rdx mov rax, cs:HvlpReferenceTscPage add rdx, r8 mov ecx, [rax] cmp ecx, r9d jnz short loc_1401C3132 jmp short loc_1401C3184 loc_1401C3176: mov ecx, 4020h rdmsr shl rdx, 20h or rdx, rax loc_1401C3184: mov rax, rdx retn HvlGetReferenceTime endp This patch aligns Tsc page invalid sequence value with Windows Hyper-V guest implementation which is more compatible with both Hyper-V hypervisor and KVM hypervisor. Signed-off-by: Andrey Smetanin CC: "K. Y. Srinivasan" CC: Haiyang Zhang CC: Vitaly Kuznetsov TARGET: Microsoft Hyper-V Linux guest drivers Signed-off-by: Denis V. Lunev --- drivers/hv/hv.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c index 6341be8..68d0701 100644 --- a/drivers/hv/hv.c +++ b/drivers/hv/hv.c @@ -139,7 +139,7 @@ static cycle_t read_hv_clock_tsc(struct clocksource *arg) cycle_t current_tick; struct ms_hyperv_tsc_page *tsc_pg = hv_context.tsc_page; - if (tsc_pg->tsc_sequence != -1) { + if (tsc_pg->tsc_sequence != 0) { /* * Use the tsc page to compute the value. */ @@ -161,7 +161,7 @@ static cycle_t read_hv_clock_tsc(struct clocksource *arg) if (tsc_pg->tsc_sequence == sequence) return current_tick; - if (tsc_pg->tsc_sequence != -1) + if (tsc_pg->tsc_sequence != 0) continue; /* * Fallback using MSR method. -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/1] drivers/hv: correct tsc page sequence invalid value
From: Andrey Smetanin <asmeta...@virtuozzo.com> Hypervisor Top Level Functional Specification v3/4 says that TSC page sequence value = -1(0x) is used to indicate that TSC page no longer reliable source of reference timer. Unfortunately, we found that Windows Hyper-V guest side implementation uses sequence value = 0 to indicate that Tsc page no longer valid. This is clearly visible inside Windows 2012R2 ntoskrnl.exe HvlGetReferenceTime() function dissassembly: HvlGetReferenceTime proc near xchgax, ax loc_1401C3132: mov rax, cs:HvlpReferenceTscPage mov r9d, [rax] testr9d, r9d jz short loc_1401C3176 rdtsc mov rcx, cs:HvlpReferenceTscPage shl rdx, 20h or rdx, rax mov rax, [rcx+8] mov rcx, cs:HvlpReferenceTscPage mov r8, [rcx+10h] mul rdx mov rax, cs:HvlpReferenceTscPage add rdx, r8 mov ecx, [rax] cmp ecx, r9d jnz short loc_1401C3132 jmp short loc_1401C3184 loc_1401C3176: mov ecx, 4020h rdmsr shl rdx, 20h or rdx, rax loc_1401C3184: mov rax, rdx retn HvlGetReferenceTime endp This patch aligns Tsc page invalid sequence value with Windows Hyper-V guest implementation which is more compatible with both Hyper-V hypervisor and KVM hypervisor. Signed-off-by: Andrey Smetanin <asmeta...@virtuozzo.com> CC: "K. Y. Srinivasan" <k...@microsoft.com> CC: Haiyang Zhang <haiya...@microsoft.com> CC: Vitaly Kuznetsov <vkuzn...@redhat.com> TARGET: Microsoft Hyper-V Linux guest drivers Signed-off-by: Denis V. Lunev <d...@openvz.org> --- drivers/hv/hv.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c index 6341be8..68d0701 100644 --- a/drivers/hv/hv.c +++ b/drivers/hv/hv.c @@ -139,7 +139,7 @@ static cycle_t read_hv_clock_tsc(struct clocksource *arg) cycle_t current_tick; struct ms_hyperv_tsc_page *tsc_pg = hv_context.tsc_page; - if (tsc_pg->tsc_sequence != -1) { + if (tsc_pg->tsc_sequence != 0) { /* * Use the tsc page to compute the value. */ @@ -161,7 +161,7 @@ static cycle_t read_hv_clock_tsc(struct clocksource *arg) if (tsc_pg->tsc_sequence == sequence) return current_tick; - if (tsc_pg->tsc_sequence != -1) + if (tsc_pg->tsc_sequence != 0) continue; /* * Fallback using MSR method. -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 07/10] drivers/hv: cleanup synic msrs if vmbus connect failed
On 10/08/2015 08:28 PM, KY Srinivasan wrote: -Original Message- From: Denis V. Lunev [mailto:d...@openvz.org] Sent: Thursday, October 8, 2015 10:20 AM To: KY Srinivasan ; gre...@linuxfoundation.org; linux- ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; vkuzn...@redhat.com; jasow...@redhat.com Cc: Andrey Smetanin ; Haiyang Zhang Subject: Re: [PATCH 07/10] drivers/hv: cleanup synic msrs if vmbus connect failed On 10/08/2015 05:01 AM, K. Y. Srinivasan wrote: From: Denis V. Lunev K.Y., there is one subtle thing in this submission. You have changed "From:" field in comparison with the original letter. I have submitted the patch with "From: Andrey Smetanin " In this case Author: in the resulted git mainstream commit will be Andrey. With your submission the resulted Author will be I. This was already happened once with commit cc2dd4027a43bb36c846f195a764edabc0828602 Author: Denis V. Lunev Date: Sat Aug 1 16:08:20 2015 -0700 mshyperv: fix recognition of Hyper-V guest crash MSR's The situation looks a bit unfair. Can we do something with that now/next time? I am going to be resubmitting this series. I will fix it up. thank you very much :) Den -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 07/10] drivers/hv: cleanup synic msrs if vmbus connect failed
On 10/08/2015 05:01 AM, K. Y. Srinivasan wrote: From: Denis V. Lunev K.Y., there is one subtle thing in this submission. You have changed "From:" field in comparison with the original letter. I have submitted the patch with "From: Andrey Smetanin " In this case Author: in the resulted git mainstream commit will be Andrey. With your submission the resulted Author will be I. This was already happened once with commit cc2dd4027a43bb36c846f195a764edabc0828602 Author: Denis V. Lunev Date: Sat Aug 1 16:08:20 2015 -0700 mshyperv: fix recognition of Hyper-V guest crash MSR's The situation looks a bit unfair. Can we do something with that now/next time? Den Before vmbus_connect() synic is setup per vcpu - this means hypervisor receives writes at synic msr's and probably allocate hypervisor resources per synic setup. If vmbus_connect() failed for some reason it's neccessary to cleanup synic setup by call hv_synic_cleanup() at each vcpu to get a chance to free allocated resources by hypervisor per synic. This patch does appropriate cleanup in case of vmbus_connect() failure. Signed-off-by: Andrey Smetanin Signed-off-by: Denis V. Lunev Reviewed-by: Vitaly Kuznetsov CC: "K. Y. Srinivasan" CC: Haiyang Zhang CC: Vitaly Kuznetsov Signed-off-by: K. Y. Srinivasan --- drivers/hv/vmbus_drv.c |4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index f19b6f7..3297731 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -867,7 +867,7 @@ static int vmbus_bus_init(int irq) on_each_cpu(hv_synic_init, NULL, 1); ret = vmbus_connect(); if (ret) - goto err_alloc; + goto err_connect; if (vmbus_proto_version > VERSION_WIN7) cpu_hotplug_disable(); @@ -885,6 +885,8 @@ static int vmbus_bus_init(int irq) return 0; +err_connect: + on_each_cpu(hv_synic_cleanup, NULL, 1); err_alloc: hv_synic_free(); hv_remove_vmbus_irq(); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 07/10] drivers/hv: cleanup synic msrs if vmbus connect failed
On 10/08/2015 05:01 AM, K. Y. Srinivasan wrote: From: Denis V. Lunev <d...@openvz.org> K.Y., there is one subtle thing in this submission. You have changed "From:" field in comparison with the original letter. I have submitted the patch with "From: Andrey Smetanin <asmeta...@virtuozzo.com>" In this case Author: in the resulted git mainstream commit will be Andrey. With your submission the resulted Author will be I. This was already happened once with commit cc2dd4027a43bb36c846f195a764edabc0828602 Author: Denis V. Lunev <d...@openvz.org> Date: Sat Aug 1 16:08:20 2015 -0700 mshyperv: fix recognition of Hyper-V guest crash MSR's The situation looks a bit unfair. Can we do something with that now/next time? Den Before vmbus_connect() synic is setup per vcpu - this means hypervisor receives writes at synic msr's and probably allocate hypervisor resources per synic setup. If vmbus_connect() failed for some reason it's neccessary to cleanup synic setup by call hv_synic_cleanup() at each vcpu to get a chance to free allocated resources by hypervisor per synic. This patch does appropriate cleanup in case of vmbus_connect() failure. Signed-off-by: Andrey Smetanin <asmeta...@virtuozzo.com> Signed-off-by: Denis V. Lunev <d...@openvz.org> Reviewed-by: Vitaly Kuznetsov <vkuzn...@redhat.com> CC: "K. Y. Srinivasan" <k...@microsoft.com> CC: Haiyang Zhang <haiya...@microsoft.com> CC: Vitaly Kuznetsov <vkuzn...@redhat.com> Signed-off-by: K. Y. Srinivasan <k...@microsoft.com> --- drivers/hv/vmbus_drv.c |4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index f19b6f7..3297731 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -867,7 +867,7 @@ static int vmbus_bus_init(int irq) on_each_cpu(hv_synic_init, NULL, 1); ret = vmbus_connect(); if (ret) - goto err_alloc; + goto err_connect; if (vmbus_proto_version > VERSION_WIN7) cpu_hotplug_disable(); @@ -885,6 +885,8 @@ static int vmbus_bus_init(int irq) return 0; +err_connect: + on_each_cpu(hv_synic_cleanup, NULL, 1); err_alloc: hv_synic_free(); hv_remove_vmbus_irq(); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 07/10] drivers/hv: cleanup synic msrs if vmbus connect failed
On 10/08/2015 08:28 PM, KY Srinivasan wrote: -Original Message- From: Denis V. Lunev [mailto:d...@openvz.org] Sent: Thursday, October 8, 2015 10:20 AM To: KY Srinivasan <k...@microsoft.com>; gre...@linuxfoundation.org; linux- ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; vkuzn...@redhat.com; jasow...@redhat.com Cc: Andrey Smetanin <asmeta...@virtuozzo.com>; Haiyang Zhang <haiya...@microsoft.com> Subject: Re: [PATCH 07/10] drivers/hv: cleanup synic msrs if vmbus connect failed On 10/08/2015 05:01 AM, K. Y. Srinivasan wrote: From: Denis V. Lunev <d...@openvz.org> K.Y., there is one subtle thing in this submission. You have changed "From:" field in comparison with the original letter. I have submitted the patch with "From: Andrey Smetanin <asmeta...@virtuozzo.com>" In this case Author: in the resulted git mainstream commit will be Andrey. With your submission the resulted Author will be I. This was already happened once with commit cc2dd4027a43bb36c846f195a764edabc0828602 Author: Denis V. Lunev <d...@openvz.org> Date: Sat Aug 1 16:08:20 2015 -0700 mshyperv: fix recognition of Hyper-V guest crash MSR's The situation looks a bit unfair. Can we do something with that now/next time? I am going to be resubmitting this series. I will fix it up. thank you very much :) Den -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/1] drivers/hv: cleanup synic msrs if vmbus connect failed
From: Andrey Smetanin Before vmbus_connect() synic is setup per vcpu - this means hypervisor receives writes at synic msr's and probably allocate hypervisor resources per synic setup. If vmbus_connect() failed for some reason it's neccessary to cleanup synic setup by call hv_synic_cleanup() at each vcpu to get a chance to free allocated resources by hypervisor per synic. This patch does appropriate cleanup in case of vmbus_connect() failure. Signed-off-by: Andrey Smetanin Signed-off-by: Denis V. Lunev CC: "K. Y. Srinivasan" CC: Haiyang Zhang CC: Vitaly Kuznetsov --- drivers/hv/vmbus_drv.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index f19b6f7..3297731 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -867,7 +867,7 @@ static int vmbus_bus_init(int irq) on_each_cpu(hv_synic_init, NULL, 1); ret = vmbus_connect(); if (ret) - goto err_alloc; + goto err_connect; if (vmbus_proto_version > VERSION_WIN7) cpu_hotplug_disable(); @@ -885,6 +885,8 @@ static int vmbus_bus_init(int irq) return 0; +err_connect: + on_each_cpu(hv_synic_cleanup, NULL, 1); err_alloc: hv_synic_free(); hv_remove_vmbus_irq(); -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/1] drivers/hv: cleanup synic msrs if vmbus connect failed
From: Andrey Smetanin <asmeta...@virtuozzo.com> Before vmbus_connect() synic is setup per vcpu - this means hypervisor receives writes at synic msr's and probably allocate hypervisor resources per synic setup. If vmbus_connect() failed for some reason it's neccessary to cleanup synic setup by call hv_synic_cleanup() at each vcpu to get a chance to free allocated resources by hypervisor per synic. This patch does appropriate cleanup in case of vmbus_connect() failure. Signed-off-by: Andrey Smetanin <asmeta...@virtuozzo.com> Signed-off-by: Denis V. Lunev <d...@openvz.org> CC: "K. Y. Srinivasan" <k...@microsoft.com> CC: Haiyang Zhang <haiya...@microsoft.com> CC: Vitaly Kuznetsov <vkuzn...@redhat.com> --- drivers/hv/vmbus_drv.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index f19b6f7..3297731 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -867,7 +867,7 @@ static int vmbus_bus_init(int irq) on_each_cpu(hv_synic_init, NULL, 1); ret = vmbus_connect(); if (ret) - goto err_alloc; + goto err_connect; if (vmbus_proto_version > VERSION_WIN7) cpu_hotplug_disable(); @@ -885,6 +885,8 @@ static int vmbus_bus_init(int irq) return 0; +err_connect: + on_each_cpu(hv_synic_cleanup, NULL, 1); err_alloc: hv_synic_free(); hv_remove_vmbus_irq(); -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH 1/1] mshyperv: fix recognition of Hyper-V guest crash MSR's
From: Andrey Smetanin Hypervisor Top Level Functional Specification v3.1/4.0 notes that cpuid (0x4003) EDX's 10th bit should be used to check that Hyper-V guest crash MSR's functionality available. This patch should fix this recognition. Currently the code checks EAX register instead of EDX. Signed-off-by: Andrey Smetanin Signed-off-by: Denis V. Lunev CC: Nick Meier CC: K. Y. Srinivasan CC: Haiyang Zhang --- arch/x86/include/asm/mshyperv.h | 1 + arch/x86/kernel/cpu/mshyperv.c | 1 + drivers/hv/vmbus_drv.c | 4 ++-- 3 files changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h index c163215..eebe433 100644 --- a/arch/x86/include/asm/mshyperv.h +++ b/arch/x86/include/asm/mshyperv.h @@ -7,6 +7,7 @@ struct ms_hyperv_info { u32 features; + u32 misc_features; u32 hints; }; diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c index aad4bd8..6d172a2 100644 --- a/arch/x86/kernel/cpu/mshyperv.c +++ b/arch/x86/kernel/cpu/mshyperv.c @@ -114,6 +114,7 @@ static void __init ms_hyperv_init_platform(void) * Extract the features and hints */ ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES); + ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES); ms_hyperv.hints= cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO); printk(KERN_INFO "HyperV: features 0x%x, hints 0x%x\n", diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index cf20400..67af13a 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -841,7 +841,7 @@ static int vmbus_bus_init(int irq) /* * Only register if the crash MSRs are available */ - if (ms_hyperv.features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) { + if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) { atomic_notifier_chain_register(_notifier_list, _panic_block); } @@ -1110,7 +1110,7 @@ static void __exit vmbus_exit(void) hv_remove_vmbus_irq(); tasklet_kill(_dpc); vmbus_free_channels(); - if (ms_hyperv.features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) { + if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) { atomic_notifier_chain_unregister(_notifier_list, _panic_block); } -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH 1/1] mshyperv: fix recognition of Hyper-V guest crash MSR's
From: Andrey Smetanin asmeta...@virtuozzo.com Hypervisor Top Level Functional Specification v3.1/4.0 notes that cpuid (0x4003) EDX's 10th bit should be used to check that Hyper-V guest crash MSR's functionality available. This patch should fix this recognition. Currently the code checks EAX register instead of EDX. Signed-off-by: Andrey Smetanin asmeta...@virtuozzo.com Signed-off-by: Denis V. Lunev d...@openvz.org CC: Nick Meier nme...@microsoft.com CC: K. Y. Srinivasan k...@microsoft.com CC: Haiyang Zhang haiya...@microsoft.com --- arch/x86/include/asm/mshyperv.h | 1 + arch/x86/kernel/cpu/mshyperv.c | 1 + drivers/hv/vmbus_drv.c | 4 ++-- 3 files changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h index c163215..eebe433 100644 --- a/arch/x86/include/asm/mshyperv.h +++ b/arch/x86/include/asm/mshyperv.h @@ -7,6 +7,7 @@ struct ms_hyperv_info { u32 features; + u32 misc_features; u32 hints; }; diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c index aad4bd8..6d172a2 100644 --- a/arch/x86/kernel/cpu/mshyperv.c +++ b/arch/x86/kernel/cpu/mshyperv.c @@ -114,6 +114,7 @@ static void __init ms_hyperv_init_platform(void) * Extract the features and hints */ ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES); + ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES); ms_hyperv.hints= cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO); printk(KERN_INFO HyperV: features 0x%x, hints 0x%x\n, diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index cf20400..67af13a 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -841,7 +841,7 @@ static int vmbus_bus_init(int irq) /* * Only register if the crash MSRs are available */ - if (ms_hyperv.features HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) { + if (ms_hyperv.misc_features HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) { atomic_notifier_chain_register(panic_notifier_list, hyperv_panic_block); } @@ -1110,7 +1110,7 @@ static void __exit vmbus_exit(void) hv_remove_vmbus_irq(); tasklet_kill(msg_dpc); vmbus_free_channels(); - if (ms_hyperv.features HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) { + if (ms_hyperv.misc_features HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) { atomic_notifier_chain_unregister(panic_notifier_list, hyperv_panic_block); } -- 2.1.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 0/2] shrink virtio baloon on OOM in guest
Excessive virtio_balloon inflation can cause invocation of OOM-killer, when Linux is under severe memory pressure. Various mechanisms are responsible for correct virtio_balloon memory management. Nevertheless it is often the case that these control tools does not have enough time to react on fast changing memory load. As a result OS runs out of memory and invokes OOM-killer. The balancing of memory by use of the virtio balloon should not cause the termination of processes while there are pages in the balloon. Now there is no way for virtio balloon driver to free memory at the last moment before some process get killed by OOM-killer. This does not provide a security breach as baloon itself is running inside guest OS and is working in the cooperation with the host. Thus some improvements from guest side should be considered as normal. To solve the problem, introduce a virtio_balloon callback which is expected to be called from the oom notifier call chain in out_of_memory() function. If virtio balloon could release some memory, it will make the system to return and retry the allocation that forced the out of memory killer to run. Patch 1 of this series adds support for implementation of virtio_balloon callback, so now leak_balloon() function returns number of freed pages. Patch 2 implements virtio_balloon callback itself. Changes from v2: - added feature bit to control OOM baloon behavior from host Changes from v1: - minor cosmetic tweaks suggested by rusty@ Signed-off-by: Raushaniya Maksudova Signed-off-by: Denis V. Lunev CC: Rusty Russell CC: Michael S. Tsirkin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] virtio_balloon: return the amount of freed memory from leak_balloon()
From: Raushaniya Maksudova This value would be useful in the next patch to provide the amount of the freed memory for OOM killer. Signed-off-by: Raushaniya Maksudova Signed-off-by: Denis V. Lunev CC: Rusty Russell CC: Michael S. Tsirkin --- drivers/virtio/virtio_balloon.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index f893148..66cac10 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -168,8 +168,9 @@ static void release_pages_by_pfn(const u32 pfns[], unsigned int num) } } -static void leak_balloon(struct virtio_balloon *vb, size_t num) +static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) { + unsigned num_freed_pages; struct page *page; struct balloon_dev_info *vb_dev_info = >vb_dev_info; @@ -186,6 +187,7 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE; } + num_freed_pages = vb->num_pfns; /* * Note that if * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); @@ -195,6 +197,7 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) tell_host(vb, vb->deflate_vq); mutex_unlock(>balloon_lock); release_pages_by_pfn(vb->pfns, vb->num_pfns); + return num_freed_pages; } static inline void update_stat(struct virtio_balloon *vb, int idx, -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] virtio_balloon: free some memory from balloon on OOM
From: Raushaniya Maksudova Excessive virtio_balloon inflation can cause invocation of OOM-killer, when Linux is under severe memory pressure. Various mechanisms are responsible for correct virtio_balloon memory management. Nevertheless it is often the case that these control tools does not have enough time to react on fast changing memory load. As a result OS runs out of memory and invokes OOM-killer. The balancing of memory by use of the virtio balloon should not cause the termination of processes while there are pages in the balloon. Now there is no way for virtio balloon driver to free some memory at the last moment before some process will be get killed by OOM-killer. This does not provide a security breach as balloon itself is running inside guest OS and is working in the cooperation with the host. Thus some improvements from guest side should be considered as normal. To solve the problem, introduce a virtio_balloon callback which is expected to be called from the oom notifier call chain in out_of_memory() function. If virtio balloon could release some memory, it will make the system to return and retry the allocation that forced the out of memory killer to run. Allocate virtio feature bit for this: it is not set by default, the the guest will not deflate virtio balloon on OOM without explicit permission from host. Signed-off-by: Raushaniya Maksudova Signed-off-by: Denis V. Lunev CC: Rusty Russell CC: Michael S. Tsirkin --- drivers/virtio/virtio_balloon.c | 52 + include/uapi/linux/virtio_balloon.h | 1 + 2 files changed, 53 insertions(+) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 66cac10..88d73a0 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -28,6 +28,7 @@ #include #include #include +#include /* * Balloon device works in 4K page units. So each page is pointed to by @@ -36,6 +37,12 @@ */ #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT) #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256 +#define OOM_VBALLOON_DEFAULT_PAGES 256 +#define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80 + +static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES; +module_param(oom_pages, int, S_IRUSR | S_IWUSR); +MODULE_PARM_DESC(oom_pages, "pages to free on OOM"); struct virtio_balloon { @@ -71,6 +78,9 @@ struct virtio_balloon /* Memory statistics */ int need_stats_update; struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR]; + + /* To register callback in oom notifier call chain */ + struct notifier_block nb; }; static struct virtio_device_id id_table[] = { @@ -290,6 +300,38 @@ static void update_balloon_size(struct virtio_balloon *vb) ); } +/* + * virtballoon_oom_notify - release pages when system is under severe + * memory pressure (called from out_of_memory()) + * @self : notifier block struct + * @dummy: not used + * @parm : returned - number of freed pages + * + * The balancing of memory by use of the virtio balloon should not cause + * the termination of processes while there are pages in the balloon. + * If virtio balloon manages to release some memory, it will make the + * system return and retry the allocation that forced the OOM killer + * to run. + */ +static int virtballoon_oom_notify(struct notifier_block *self, + unsigned long dummy, void *parm) +{ + struct virtio_balloon *vb; + unsigned long *freed; + unsigned num_freed_pages; + + vb = container_of(self, struct virtio_balloon, nb); + if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) + return NOTIFY_OK; + + freed = parm; + num_freed_pages = leak_balloon(vb, oom_pages); + update_balloon_size(vb); + *freed += num_freed_pages; + + return NOTIFY_OK; +} + static int balloon(void *_vballoon) { struct virtio_balloon *vb = _vballoon; @@ -446,6 +488,12 @@ static int virtballoon_probe(struct virtio_device *vdev) if (err) goto out_free_vb; + vb->nb.notifier_call = virtballoon_oom_notify; + vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY; + err = register_oom_notifier(>nb); + if (err < 0) + goto out_oom_notify; + vb->thread = kthread_run(balloon, vb, "vballoon"); if (IS_ERR(vb->thread)) { err = PTR_ERR(vb->thread); @@ -455,6 +503,8 @@ static int virtballoon_probe(struct virtio_device *vdev) return 0; out_del_vqs: + unregister_oom_notifier(>nb); +out_oom_notify: vdev->config->del_vqs(vdev); out_free_vb: kfree(vb); @@ -479,6 +529,7 @@ static void virtballoon_remove(struct virtio_device *vdev) { struct virtio_balloon *vb = vdev->priv; + unregister_oom_notifier(>nb); k
[PATCH 2/2] virtio_balloon: free some memory from balloon on OOM
From: Raushaniya Maksudova rmaksud...@parallels.com Excessive virtio_balloon inflation can cause invocation of OOM-killer, when Linux is under severe memory pressure. Various mechanisms are responsible for correct virtio_balloon memory management. Nevertheless it is often the case that these control tools does not have enough time to react on fast changing memory load. As a result OS runs out of memory and invokes OOM-killer. The balancing of memory by use of the virtio balloon should not cause the termination of processes while there are pages in the balloon. Now there is no way for virtio balloon driver to free some memory at the last moment before some process will be get killed by OOM-killer. This does not provide a security breach as balloon itself is running inside guest OS and is working in the cooperation with the host. Thus some improvements from guest side should be considered as normal. To solve the problem, introduce a virtio_balloon callback which is expected to be called from the oom notifier call chain in out_of_memory() function. If virtio balloon could release some memory, it will make the system to return and retry the allocation that forced the out of memory killer to run. Allocate virtio feature bit for this: it is not set by default, the the guest will not deflate virtio balloon on OOM without explicit permission from host. Signed-off-by: Raushaniya Maksudova rmaksud...@parallels.com Signed-off-by: Denis V. Lunev d...@openvz.org CC: Rusty Russell ru...@rustcorp.com.au CC: Michael S. Tsirkin m...@redhat.com --- drivers/virtio/virtio_balloon.c | 52 + include/uapi/linux/virtio_balloon.h | 1 + 2 files changed, 53 insertions(+) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 66cac10..88d73a0 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -28,6 +28,7 @@ #include linux/slab.h #include linux/module.h #include linux/balloon_compaction.h +#include linux/oom.h /* * Balloon device works in 4K page units. So each page is pointed to by @@ -36,6 +37,12 @@ */ #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE VIRTIO_BALLOON_PFN_SHIFT) #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256 +#define OOM_VBALLOON_DEFAULT_PAGES 256 +#define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80 + +static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES; +module_param(oom_pages, int, S_IRUSR | S_IWUSR); +MODULE_PARM_DESC(oom_pages, pages to free on OOM); struct virtio_balloon { @@ -71,6 +78,9 @@ struct virtio_balloon /* Memory statistics */ int need_stats_update; struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR]; + + /* To register callback in oom notifier call chain */ + struct notifier_block nb; }; static struct virtio_device_id id_table[] = { @@ -290,6 +300,38 @@ static void update_balloon_size(struct virtio_balloon *vb) actual); } +/* + * virtballoon_oom_notify - release pages when system is under severe + * memory pressure (called from out_of_memory()) + * @self : notifier block struct + * @dummy: not used + * @parm : returned - number of freed pages + * + * The balancing of memory by use of the virtio balloon should not cause + * the termination of processes while there are pages in the balloon. + * If virtio balloon manages to release some memory, it will make the + * system return and retry the allocation that forced the OOM killer + * to run. + */ +static int virtballoon_oom_notify(struct notifier_block *self, + unsigned long dummy, void *parm) +{ + struct virtio_balloon *vb; + unsigned long *freed; + unsigned num_freed_pages; + + vb = container_of(self, struct virtio_balloon, nb); + if (!virtio_has_feature(vb-vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) + return NOTIFY_OK; + + freed = parm; + num_freed_pages = leak_balloon(vb, oom_pages); + update_balloon_size(vb); + *freed += num_freed_pages; + + return NOTIFY_OK; +} + static int balloon(void *_vballoon) { struct virtio_balloon *vb = _vballoon; @@ -446,6 +488,12 @@ static int virtballoon_probe(struct virtio_device *vdev) if (err) goto out_free_vb; + vb-nb.notifier_call = virtballoon_oom_notify; + vb-nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY; + err = register_oom_notifier(vb-nb); + if (err 0) + goto out_oom_notify; + vb-thread = kthread_run(balloon, vb, vballoon); if (IS_ERR(vb-thread)) { err = PTR_ERR(vb-thread); @@ -455,6 +503,8 @@ static int virtballoon_probe(struct virtio_device *vdev) return 0; out_del_vqs: + unregister_oom_notifier(vb-nb); +out_oom_notify: vdev-config-del_vqs(vdev); out_free_vb: kfree(vb); @@ -479,6 +529,7 @@ static void virtballoon_remove(struct virtio_device *vdev
[PATCH 1/2] virtio_balloon: return the amount of freed memory from leak_balloon()
From: Raushaniya Maksudova rmaksud...@parallels.com This value would be useful in the next patch to provide the amount of the freed memory for OOM killer. Signed-off-by: Raushaniya Maksudova rmaksud...@parallels.com Signed-off-by: Denis V. Lunev d...@openvz.org CC: Rusty Russell ru...@rustcorp.com.au CC: Michael S. Tsirkin m...@redhat.com --- drivers/virtio/virtio_balloon.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index f893148..66cac10 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -168,8 +168,9 @@ static void release_pages_by_pfn(const u32 pfns[], unsigned int num) } } -static void leak_balloon(struct virtio_balloon *vb, size_t num) +static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) { + unsigned num_freed_pages; struct page *page; struct balloon_dev_info *vb_dev_info = vb-vb_dev_info; @@ -186,6 +187,7 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) vb-num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE; } + num_freed_pages = vb-num_pfns; /* * Note that if * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); @@ -195,6 +197,7 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) tell_host(vb, vb-deflate_vq); mutex_unlock(vb-balloon_lock); release_pages_by_pfn(vb-pfns, vb-num_pfns); + return num_freed_pages; } static inline void update_stat(struct virtio_balloon *vb, int idx, -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 0/2] shrink virtio baloon on OOM in guest
Excessive virtio_balloon inflation can cause invocation of OOM-killer, when Linux is under severe memory pressure. Various mechanisms are responsible for correct virtio_balloon memory management. Nevertheless it is often the case that these control tools does not have enough time to react on fast changing memory load. As a result OS runs out of memory and invokes OOM-killer. The balancing of memory by use of the virtio balloon should not cause the termination of processes while there are pages in the balloon. Now there is no way for virtio balloon driver to free memory at the last moment before some process get killed by OOM-killer. This does not provide a security breach as baloon itself is running inside guest OS and is working in the cooperation with the host. Thus some improvements from guest side should be considered as normal. To solve the problem, introduce a virtio_balloon callback which is expected to be called from the oom notifier call chain in out_of_memory() function. If virtio balloon could release some memory, it will make the system to return and retry the allocation that forced the out of memory killer to run. Patch 1 of this series adds support for implementation of virtio_balloon callback, so now leak_balloon() function returns number of freed pages. Patch 2 implements virtio_balloon callback itself. Changes from v2: - added feature bit to control OOM baloon behavior from host Changes from v1: - minor cosmetic tweaks suggested by rusty@ Signed-off-by: Raushaniya Maksudova rmaksud...@parallels.com Signed-off-by: Denis V. Lunev d...@openvz.org CC: Rusty Russell ru...@rustcorp.com.au CC: Michael S. Tsirkin m...@redhat.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] virtio_balloon: return the amount of freed memory from leak_balloon()
From: Raushaniya Maksudova This value would be useful in the next patch to provide the amount of the freed memory for OOM killer. Signed-off-by: Raushaniya Maksudova Signed-off-by: Denis V. Lunev CC: Rusty Russell CC: Michael S. Tsirkin --- drivers/virtio/virtio_balloon.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index f893148..66cac10 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -168,8 +168,9 @@ static void release_pages_by_pfn(const u32 pfns[], unsigned int num) } } -static void leak_balloon(struct virtio_balloon *vb, size_t num) +static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) { + unsigned num_freed_pages; struct page *page; struct balloon_dev_info *vb_dev_info = >vb_dev_info; @@ -186,6 +187,7 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE; } + num_freed_pages = vb->num_pfns; /* * Note that if * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); @@ -195,6 +197,7 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) tell_host(vb, vb->deflate_vq); mutex_unlock(>balloon_lock); release_pages_by_pfn(vb->pfns, vb->num_pfns); + return num_freed_pages; } static inline void update_stat(struct virtio_balloon *vb, int idx, -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 0/2] shrink virtio baloon on OOM in guest
Excessive virtio_balloon inflation can cause invocation of OOM-killer, when Linux is under severe memory pressure. Various mechanisms are responsible for correct virtio_balloon memory management. Nevertheless it is often the case that these control tools does not have enough time to react on fast changing memory load. As a result OS runs out of memory and invokes OOM-killer. The balancing of memory by use of the virtio balloon should not cause the termination of processes while there are pages in the balloon. Now there is no way for virtio balloon driver to free memory at the last moment before some process get killed by OOM-killer. This does not provide a security breach as baloon itself is running inside guest OS and is working in the cooperation with the host. Thus some improvements from guest side should be considered as normal. To solve the problem, introduce a virtio_balloon callback which is expected to be called from the oom notifier call chain in out_of_memory() function. If virtio balloon could release some memory, it will make the system to return and retry the allocation that forced the out of memory killer to run. Patch 1 of this series adds support for implementation of virtio_balloon callback, so now leak_balloon() function returns number of freed pages. Patch 2 implements virtio_balloon callback itself. Changes from v1: - minor cosmetic tweaks suggested by rusty@ Signed-off-by: Raushaniya Maksudova Signed-off-by: Denis V. Lunev CC: Rusty Russell CC: Michael S. Tsirkin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] virtio_balloon: free some memory from balloon on OOM
From: Raushaniya Maksudova Excessive virtio_balloon inflation can cause invocation of OOM-killer, when Linux is under severe memory pressure. Various mechanisms are responsible for correct virtio_balloon memory management. Nevertheless it is often the case that these control tools does not have enough time to react on fast changing memory load. As a result OS runs out of memory and invokes OOM-killer. The balancing of memory by use of the virtio balloon should not cause the termination of processes while there are pages in the balloon. Now there is no way for virtio balloon driver to free some memory at the last moment before some process will be get killed by OOM-killer. This does not provide a security breach as balloon itself is running inside guest OS and is working in the cooperation with the host. Thus some improvements from guest side should be considered as normal. To solve the problem, introduce a virtio_balloon callback which is expected to be called from the oom notifier call chain in out_of_memory() function. If virtio balloon could release some memory, it will make the system to return and retry the allocation that forced the out of memory killer to run. Signed-off-by: Raushaniya Maksudova Signed-off-by: Denis V. Lunev CC: Rusty Russell CC: Michael S. Tsirkin --- drivers/virtio/virtio_balloon.c | 48 + 1 file changed, 48 insertions(+) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 66cac10..ab1fa69 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -28,6 +28,7 @@ #include #include #include +#include /* * Balloon device works in 4K page units. So each page is pointed to by @@ -36,6 +37,12 @@ */ #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT) #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256 +#define OOM_VBALLOON_DEFAULT_PAGES 256 +#define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80 + +static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES; +module_param(oom_pages, int, S_IRUSR | S_IWUSR); +MODULE_PARM_DESC(oom_pages, "pages to free on OOM"); struct virtio_balloon { @@ -71,6 +78,9 @@ struct virtio_balloon /* Memory statistics */ int need_stats_update; struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR]; + + /* To register callback in oom notifier call chain */ + struct notifier_block nb; }; static struct virtio_device_id id_table[] = { @@ -290,6 +300,35 @@ static void update_balloon_size(struct virtio_balloon *vb) ); } +/* + * virtballoon_oom_notify - release pages when system is under severe + * memory pressure (called from out_of_memory()) + * @self : notifier block struct + * @dummy: not used + * @parm : returned - number of freed pages + * + * The balancing of memory by use of the virtio balloon should not cause + * the termination of processes while there are pages in the balloon. + * If virtio balloon manages to release some memory, it will make the + * system return and retry the allocation that forced the OOM killer + * to run. + */ +static int virtballoon_oom_notify(struct notifier_block *self, + unsigned long dummy, void *parm) +{ + unsigned num_freed_pages; + unsigned long *freed; + struct virtio_balloon *vb; + + freed = parm; + vb = container_of(self, struct virtio_balloon, nb); + num_freed_pages = leak_balloon(vb, oom_pages); + update_balloon_size(vb); + *freed += num_freed_pages; + + return NOTIFY_OK; +} + static int balloon(void *_vballoon) { struct virtio_balloon *vb = _vballoon; @@ -446,6 +485,12 @@ static int virtballoon_probe(struct virtio_device *vdev) if (err) goto out_free_vb; + vb->nb.notifier_call = virtballoon_oom_notify; + vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY; + err = register_oom_notifier(>nb); + if (err < 0) + goto out_oom_notify; + vb->thread = kthread_run(balloon, vb, "vballoon"); if (IS_ERR(vb->thread)) { err = PTR_ERR(vb->thread); @@ -455,6 +500,8 @@ static int virtballoon_probe(struct virtio_device *vdev) return 0; out_del_vqs: + unregister_oom_notifier(>nb); +out_oom_notify: vdev->config->del_vqs(vdev); out_free_vb: kfree(vb); @@ -479,6 +526,7 @@ static void virtballoon_remove(struct virtio_device *vdev) { struct virtio_balloon *vb = vdev->priv; + unregister_oom_notifier(>nb); kthread_stop(vb->thread); remove_common(vb); kfree(vb); -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] virtio_balloon: free some memory from balloon on OOM
From: Raushaniya Maksudova rmaksud...@parallels.com Excessive virtio_balloon inflation can cause invocation of OOM-killer, when Linux is under severe memory pressure. Various mechanisms are responsible for correct virtio_balloon memory management. Nevertheless it is often the case that these control tools does not have enough time to react on fast changing memory load. As a result OS runs out of memory and invokes OOM-killer. The balancing of memory by use of the virtio balloon should not cause the termination of processes while there are pages in the balloon. Now there is no way for virtio balloon driver to free some memory at the last moment before some process will be get killed by OOM-killer. This does not provide a security breach as balloon itself is running inside guest OS and is working in the cooperation with the host. Thus some improvements from guest side should be considered as normal. To solve the problem, introduce a virtio_balloon callback which is expected to be called from the oom notifier call chain in out_of_memory() function. If virtio balloon could release some memory, it will make the system to return and retry the allocation that forced the out of memory killer to run. Signed-off-by: Raushaniya Maksudova rmaksud...@parallels.com Signed-off-by: Denis V. Lunev d...@openvz.org CC: Rusty Russell ru...@rustcorp.com.au CC: Michael S. Tsirkin m...@redhat.com --- drivers/virtio/virtio_balloon.c | 48 + 1 file changed, 48 insertions(+) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 66cac10..ab1fa69 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -28,6 +28,7 @@ #include linux/slab.h #include linux/module.h #include linux/balloon_compaction.h +#include linux/oom.h /* * Balloon device works in 4K page units. So each page is pointed to by @@ -36,6 +37,12 @@ */ #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE VIRTIO_BALLOON_PFN_SHIFT) #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256 +#define OOM_VBALLOON_DEFAULT_PAGES 256 +#define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80 + +static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES; +module_param(oom_pages, int, S_IRUSR | S_IWUSR); +MODULE_PARM_DESC(oom_pages, pages to free on OOM); struct virtio_balloon { @@ -71,6 +78,9 @@ struct virtio_balloon /* Memory statistics */ int need_stats_update; struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR]; + + /* To register callback in oom notifier call chain */ + struct notifier_block nb; }; static struct virtio_device_id id_table[] = { @@ -290,6 +300,35 @@ static void update_balloon_size(struct virtio_balloon *vb) actual); } +/* + * virtballoon_oom_notify - release pages when system is under severe + * memory pressure (called from out_of_memory()) + * @self : notifier block struct + * @dummy: not used + * @parm : returned - number of freed pages + * + * The balancing of memory by use of the virtio balloon should not cause + * the termination of processes while there are pages in the balloon. + * If virtio balloon manages to release some memory, it will make the + * system return and retry the allocation that forced the OOM killer + * to run. + */ +static int virtballoon_oom_notify(struct notifier_block *self, + unsigned long dummy, void *parm) +{ + unsigned num_freed_pages; + unsigned long *freed; + struct virtio_balloon *vb; + + freed = parm; + vb = container_of(self, struct virtio_balloon, nb); + num_freed_pages = leak_balloon(vb, oom_pages); + update_balloon_size(vb); + *freed += num_freed_pages; + + return NOTIFY_OK; +} + static int balloon(void *_vballoon) { struct virtio_balloon *vb = _vballoon; @@ -446,6 +485,12 @@ static int virtballoon_probe(struct virtio_device *vdev) if (err) goto out_free_vb; + vb-nb.notifier_call = virtballoon_oom_notify; + vb-nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY; + err = register_oom_notifier(vb-nb); + if (err 0) + goto out_oom_notify; + vb-thread = kthread_run(balloon, vb, vballoon); if (IS_ERR(vb-thread)) { err = PTR_ERR(vb-thread); @@ -455,6 +500,8 @@ static int virtballoon_probe(struct virtio_device *vdev) return 0; out_del_vqs: + unregister_oom_notifier(vb-nb); +out_oom_notify: vdev-config-del_vqs(vdev); out_free_vb: kfree(vb); @@ -479,6 +526,7 @@ static void virtballoon_remove(struct virtio_device *vdev) { struct virtio_balloon *vb = vdev-priv; + unregister_oom_notifier(vb-nb); kthread_stop(vb-thread); remove_common(vb); kfree(vb); -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo
[PATCH 1/2] virtio_balloon: return the amount of freed memory from leak_balloon()
From: Raushaniya Maksudova rmaksud...@parallels.com This value would be useful in the next patch to provide the amount of the freed memory for OOM killer. Signed-off-by: Raushaniya Maksudova rmaksud...@parallels.com Signed-off-by: Denis V. Lunev d...@openvz.org CC: Rusty Russell ru...@rustcorp.com.au CC: Michael S. Tsirkin m...@redhat.com --- drivers/virtio/virtio_balloon.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index f893148..66cac10 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -168,8 +168,9 @@ static void release_pages_by_pfn(const u32 pfns[], unsigned int num) } } -static void leak_balloon(struct virtio_balloon *vb, size_t num) +static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) { + unsigned num_freed_pages; struct page *page; struct balloon_dev_info *vb_dev_info = vb-vb_dev_info; @@ -186,6 +187,7 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) vb-num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE; } + num_freed_pages = vb-num_pfns; /* * Note that if * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); @@ -195,6 +197,7 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) tell_host(vb, vb-deflate_vq); mutex_unlock(vb-balloon_lock); release_pages_by_pfn(vb-pfns, vb-num_pfns); + return num_freed_pages; } static inline void update_stat(struct virtio_balloon *vb, int idx, -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 0/2] shrink virtio baloon on OOM in guest
Excessive virtio_balloon inflation can cause invocation of OOM-killer, when Linux is under severe memory pressure. Various mechanisms are responsible for correct virtio_balloon memory management. Nevertheless it is often the case that these control tools does not have enough time to react on fast changing memory load. As a result OS runs out of memory and invokes OOM-killer. The balancing of memory by use of the virtio balloon should not cause the termination of processes while there are pages in the balloon. Now there is no way for virtio balloon driver to free memory at the last moment before some process get killed by OOM-killer. This does not provide a security breach as baloon itself is running inside guest OS and is working in the cooperation with the host. Thus some improvements from guest side should be considered as normal. To solve the problem, introduce a virtio_balloon callback which is expected to be called from the oom notifier call chain in out_of_memory() function. If virtio balloon could release some memory, it will make the system to return and retry the allocation that forced the out of memory killer to run. Patch 1 of this series adds support for implementation of virtio_balloon callback, so now leak_balloon() function returns number of freed pages. Patch 2 implements virtio_balloon callback itself. Changes from v1: - minor cosmetic tweaks suggested by rusty@ Signed-off-by: Raushaniya Maksudova rmaksud...@parallels.com Signed-off-by: Denis V. Lunev d...@openvz.org CC: Rusty Russell ru...@rustcorp.com.au CC: Michael S. Tsirkin m...@redhat.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: lock_task_group_list() can be called from the atomic context
Acked-by: Denis V. Lunev <[EMAIL PROTECTED]> On Mon, 2008-02-11 at 14:33 +0100, Peter Zijlstra wrote: > On Mon, 2008-02-11 at 15:09 +0300, Denis V. Lunev wrote: > > Curious, I hadn't yet seen it... Does the below fix it? > > > BUG: sleeping function called from invalid context > > at /home/den/src/linux-netns26/kernel/mutex.c:209 > > in_atomic():1, irqs_disabled():0 > > no locks held by swapper/0. > > Pid: 0, comm: swapper Not tainted 2.6.24 #304 > > > > Call Trace: > >[] ? __debug_show_held_locks+0x15/0x27 > > [] __might_sleep+0xc0/0xdf > > [] mutex_lock_nested+0x28/0x2a9 > > [] sched_destroy_group+0x18/0xea > > [] sched_destroy_user+0xd/0xf > > [] free_uid+0x8a/0xab > > [] __put_task_struct+0x3f/0xd3 > > [] delayed_put_task_struct+0x23/0x25 > > [] __rcu_process_callbacks+0x8d/0x215 > > [] rcu_process_callbacks+0x23/0x44 > > [] __do_softirq+0x79/0xf8 > > [] ? profile_pc+0x2a/0x67 > > [] call_softirq+0x1c/0x30 > > [] do_softirq+0x61/0x9c > > [] irq_exit+0x51/0x53 > > [] smp_apic_timer_interrupt+0x77/0xad > > [] apic_timer_interrupt+0x6b/0x70 > >[] ? default_idle+0x43/0x76 > > [] ? default_idle+0x41/0x76 > > [] ? default_idle+0x0/0x76 > > [] ? cpu_idle+0x76/0x98 > > separate the tg->shares protection from the task_group lock. > > Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]> > --- > kernel/sched.c | 37 + > 1 file changed, 17 insertions(+), 20 deletions(-) > > Index: linux-2.6/kernel/sched.c > === > --- linux-2.6.orig/kernel/sched.c > +++ linux-2.6/kernel/sched.c > @@ -232,10 +232,10 @@ static struct cfs_rq *init_cfs_rq_p[NR_C > static struct sched_rt_entity *init_sched_rt_entity_p[NR_CPUS]; > static struct rt_rq *init_rt_rq_p[NR_CPUS]; > > -/* task_group_mutex serializes add/remove of task groups and also changes to > +/* task_group_lock serializes add/remove of task groups and also changes to > * a task group's cpu shares. > */ > -static DEFINE_MUTEX(task_group_mutex); > +static DEFINE_SPINLOCK(task_group_lock); > > /* doms_cur_mutex serializes access to doms_cur[] array */ > static DEFINE_MUTEX(doms_cur_mutex); > @@ -295,16 +295,6 @@ static inline void set_task_rq(struct ta > p->rt.parent = task_group(p)->rt_se[cpu]; > } > > -static inline void lock_task_group_list(void) > -{ > - mutex_lock(_group_mutex); > -} > - > -static inline void unlock_task_group_list(void) > -{ > - mutex_unlock(_group_mutex); > -} > - > static inline void lock_doms_cur(void) > { > mutex_lock(_cur_mutex); > @@ -318,8 +308,6 @@ static inline void unlock_doms_cur(void) > #else > > static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { } > -static inline void lock_task_group_list(void) { } > -static inline void unlock_task_group_list(void) { } > static inline void lock_doms_cur(void) { } > static inline void unlock_doms_cur(void) { } > > @@ -7571,6 +7559,7 @@ struct task_group *sched_create_group(vo > struct rt_rq *rt_rq; > struct sched_rt_entity *rt_se; > struct rq *rq; > + unsigned long flags; > int i; > > tg = kzalloc(sizeof(*tg), GFP_KERNEL); > @@ -7620,7 +7609,7 @@ struct task_group *sched_create_group(vo > init_tg_rt_entry(rq, tg, rt_rq, rt_se, i, 0); > } > > - lock_task_group_list(); > + spin_lock_irqsave(_group_lock, flags); > for_each_possible_cpu(i) { > rq = cpu_rq(i); > cfs_rq = tg->cfs_rq[i]; > @@ -7629,7 +7618,7 @@ struct task_group *sched_create_group(vo > list_add_rcu(_rq->leaf_rt_rq_list, >leaf_rt_rq_list); > } > list_add_rcu(>list, _groups); > - unlock_task_group_list(); > + spin_unlock_irqrestore(_group_lock, flags); > > return tg; > > @@ -7650,9 +7639,10 @@ void sched_destroy_group(struct task_gro > { > struct cfs_rq *cfs_rq = NULL; > struct rt_rq *rt_rq = NULL; > + unsigned long flags; > int i; > > - lock_task_group_list(); > + spin_lock_irqsave(_group_lock, flags); > for_each_possible_cpu(i) { > cfs_rq = tg->cfs_rq[i]; > list_del_rcu(_rq->leaf_cfs_rq_list); > @@ -7660,7 +7650,7 @@ void sched_destroy_group(struct task_gro > list_del_rcu(_rq->leaf_rt_rq_list); > } > list_del_rcu(>list); > - unlock_task_group_list(); > + spin_unlock_irqrestore(_group_l
Re: lock_task_group_list() can be called from the atomic context
Acked-by: Denis V. Lunev [EMAIL PROTECTED] On Mon, 2008-02-11 at 14:33 +0100, Peter Zijlstra wrote: On Mon, 2008-02-11 at 15:09 +0300, Denis V. Lunev wrote: Curious, I hadn't yet seen it... Does the below fix it? BUG: sleeping function called from invalid context at /home/den/src/linux-netns26/kernel/mutex.c:209 in_atomic():1, irqs_disabled():0 no locks held by swapper/0. Pid: 0, comm: swapper Not tainted 2.6.24 #304 Call Trace: IRQ [80252d1e] ? __debug_show_held_locks+0x15/0x27 [8022c2a8] __might_sleep+0xc0/0xdf [8049f1df] mutex_lock_nested+0x28/0x2a9 [80231294] sched_destroy_group+0x18/0xea [8023e835] sched_destroy_user+0xd/0xf [8023e8c1] free_uid+0x8a/0xab [80233e24] __put_task_struct+0x3f/0xd3 [80236708] delayed_put_task_struct+0x23/0x25 [8026fda7] __rcu_process_callbacks+0x8d/0x215 [8026ff52] rcu_process_callbacks+0x23/0x44 [8023a2ae] __do_softirq+0x79/0xf8 [8020f8c3] ? profile_pc+0x2a/0x67 [8020d38c] call_softirq+0x1c/0x30 [8020f689] do_softirq+0x61/0x9c [8023a233] irq_exit+0x51/0x53 [8021bd1a] smp_apic_timer_interrupt+0x77/0xad [8020ce3b] apic_timer_interrupt+0x6b/0x70 EOI [8020b0dd] ? default_idle+0x43/0x76 [8020b0db] ? default_idle+0x41/0x76 [8020b09a] ? default_idle+0x0/0x76 [8020b186] ? cpu_idle+0x76/0x98 separate the tg-shares protection from the task_group lock. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- kernel/sched.c | 37 + 1 file changed, 17 insertions(+), 20 deletions(-) Index: linux-2.6/kernel/sched.c === --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -232,10 +232,10 @@ static struct cfs_rq *init_cfs_rq_p[NR_C static struct sched_rt_entity *init_sched_rt_entity_p[NR_CPUS]; static struct rt_rq *init_rt_rq_p[NR_CPUS]; -/* task_group_mutex serializes add/remove of task groups and also changes to +/* task_group_lock serializes add/remove of task groups and also changes to * a task group's cpu shares. */ -static DEFINE_MUTEX(task_group_mutex); +static DEFINE_SPINLOCK(task_group_lock); /* doms_cur_mutex serializes access to doms_cur[] array */ static DEFINE_MUTEX(doms_cur_mutex); @@ -295,16 +295,6 @@ static inline void set_task_rq(struct ta p-rt.parent = task_group(p)-rt_se[cpu]; } -static inline void lock_task_group_list(void) -{ - mutex_lock(task_group_mutex); -} - -static inline void unlock_task_group_list(void) -{ - mutex_unlock(task_group_mutex); -} - static inline void lock_doms_cur(void) { mutex_lock(doms_cur_mutex); @@ -318,8 +308,6 @@ static inline void unlock_doms_cur(void) #else static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { } -static inline void lock_task_group_list(void) { } -static inline void unlock_task_group_list(void) { } static inline void lock_doms_cur(void) { } static inline void unlock_doms_cur(void) { } @@ -7571,6 +7559,7 @@ struct task_group *sched_create_group(vo struct rt_rq *rt_rq; struct sched_rt_entity *rt_se; struct rq *rq; + unsigned long flags; int i; tg = kzalloc(sizeof(*tg), GFP_KERNEL); @@ -7620,7 +7609,7 @@ struct task_group *sched_create_group(vo init_tg_rt_entry(rq, tg, rt_rq, rt_se, i, 0); } - lock_task_group_list(); + spin_lock_irqsave(task_group_lock, flags); for_each_possible_cpu(i) { rq = cpu_rq(i); cfs_rq = tg-cfs_rq[i]; @@ -7629,7 +7618,7 @@ struct task_group *sched_create_group(vo list_add_rcu(rt_rq-leaf_rt_rq_list, rq-leaf_rt_rq_list); } list_add_rcu(tg-list, task_groups); - unlock_task_group_list(); + spin_unlock_irqrestore(task_group_lock, flags); return tg; @@ -7650,9 +7639,10 @@ void sched_destroy_group(struct task_gro { struct cfs_rq *cfs_rq = NULL; struct rt_rq *rt_rq = NULL; + unsigned long flags; int i; - lock_task_group_list(); + spin_lock_irqsave(task_group_lock, flags); for_each_possible_cpu(i) { cfs_rq = tg-cfs_rq[i]; list_del_rcu(cfs_rq-leaf_cfs_rq_list); @@ -7660,7 +7650,7 @@ void sched_destroy_group(struct task_gro list_del_rcu(rt_rq-leaf_rt_rq_list); } list_del_rcu(tg-list); - unlock_task_group_list(); + spin_unlock_irqrestore(task_group_lock, flags); BUG_ON(!cfs_rq); @@ -7728,13 +7718,16 @@ static void set_se_shares(struct sched_e } } +static DEFINE_MUTEX(shares_mutex); + int sched_group_set_shares(struct task_group *tg, unsigned long shares) { int i; struct cfs_rq *cfs_rq; struct rq *rq
lock_task_group_list() can be called from the atomic context
Hello, Ingo! I am seeing the following calltrace every day I am connecting to my test host by the gnome-panel: ssh -Y test.host gnome-terminal BUG: sleeping function called from invalid context at /home/den/src/linux-netns26/kernel/mutex.c:209 in_atomic():1, irqs_disabled():0 no locks held by swapper/0. Pid: 0, comm: swapper Not tainted 2.6.24 #304 Call Trace: [] ? __debug_show_held_locks+0x15/0x27 [] __might_sleep+0xc0/0xdf [] mutex_lock_nested+0x28/0x2a9 [] sched_destroy_group+0x18/0xea [] sched_destroy_user+0xd/0xf [] free_uid+0x8a/0xab [] __put_task_struct+0x3f/0xd3 [] delayed_put_task_struct+0x23/0x25 [] __rcu_process_callbacks+0x8d/0x215 [] rcu_process_callbacks+0x23/0x44 [] __do_softirq+0x79/0xf8 [] ? profile_pc+0x2a/0x67 [] call_softirq+0x1c/0x30 [] do_softirq+0x61/0x9c [] irq_exit+0x51/0x53 [] smp_apic_timer_interrupt+0x77/0xad [] apic_timer_interrupt+0x6b/0x70 [] ? default_idle+0x43/0x76 [] ? default_idle+0x41/0x76 [] ? default_idle+0x0/0x76 [] ? cpu_idle+0x76/0x98 Config is attached. The kernel is today pulled Dave Miller's tree. Regards, Den # # Automatically generated make config: don't edit # Linux kernel version: 2.6.24 # Mon Feb 11 11:14:57 2008 # CONFIG_64BIT=y # CONFIG_X86_32 is not set CONFIG_X86_64=y CONFIG_X86=y # CONFIG_GENERIC_LOCKBREAK is not set CONFIG_GENERIC_TIME=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_HAVE_LATENCYTOP_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_FAST_CMPXCHG_LOCAL=y CONFIG_MMU=y CONFIG_ZONE_DMA=y # CONFIG_QUICKLIST is not set CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_HWEIGHT=y # CONFIG_GENERIC_GPIO is not set CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_DMI=y CONFIG_RWSEM_GENERIC_SPINLOCK=y # CONFIG_RWSEM_XCHGADD_ALGORITHM is not set # CONFIG_ARCH_HAS_ILOG2_U32 is not set # CONFIG_ARCH_HAS_ILOG2_U64 is not set CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_ARCH_HAS_CPU_RELAX=y CONFIG_HAVE_SETUP_PER_CPU_AREA=y CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y CONFIG_ZONE_DMA32=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_AUDIT_ARCH=y CONFIG_ARCH_SUPPORTS_AOUT=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_X86_SMP=y CONFIG_X86_64_SMP=y CONFIG_X86_HT=y CONFIG_X86_TRAMPOLINE=y # CONFIG_KTIME_SCALAR is not set CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # General setup # CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_LOCALVERSION="" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_BSD_PROCESS_ACCT=y # CONFIG_BSD_PROCESS_ACCT_V3 is not set CONFIG_TASKSTATS=y CONFIG_TASK_DELAY_ACCT=y # CONFIG_TASK_XACCT is not set CONFIG_AUDIT=y CONFIG_AUDITSYSCALL=y CONFIG_AUDIT_TREE=y CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y CONFIG_LOG_BUF_SHIFT=17 # CONFIG_CGROUPS is not set CONFIG_FAIR_GROUP_SCHED=y CONFIG_FAIR_USER_SCHED=y # CONFIG_FAIR_CGROUP_SCHED is not set CONFIG_RELAY=y CONFIG_NAMESPACES=y CONFIG_UTS_NS=y CONFIG_IPC_NS=y CONFIG_USER_NS=y CONFIG_PID_NS=y CONFIG_BLK_DEV_INITRD=y CONFIG_INITRAMFS_SOURCE="" # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set CONFIG_SYSCTL=y CONFIG_EMBEDDED=y CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y CONFIG_KALLSYMS_ALL=y CONFIG_KALLSYMS_EXTRA_PASS=y CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_COMPAT_BRK=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_ANON_INODES=y CONFIG_EPOLL=y CONFIG_SIGNALFD=y CONFIG_TIMERFD=y CONFIG_EVENTFD=y CONFIG_SHMEM=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_SLAB=y # CONFIG_SLUB is not set # CONFIG_SLOB is not set # CONFIG_PROFILING is not set # CONFIG_MARKERS is not set CONFIG_HAVE_OPROFILE=y # CONFIG_KPROBES is not set CONFIG_HAVE_KPROBES=y CONFIG_PROC_PAGE_MONITOR=y CONFIG_SLABINFO=y CONFIG_RT_MUTEXES=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y # CONFIG_MODULE_FORCE_UNLOAD is not set # CONFIG_MODVERSIONS is not set # CONFIG_MODULE_SRCVERSION_ALL is not set CONFIG_KMOD=y CONFIG_STOP_MACHINE=y CONFIG_BLOCK=y # CONFIG_BLK_DEV_BSG is not set CONFIG_BLOCK_COMPAT=y # # IO Schedulers # CONFIG_IOSCHED_NOOP=y # CONFIG_IOSCHED_AS is not set # CONFIG_IOSCHED_DEADLINE is not set CONFIG_IOSCHED_CFQ=y # CONFIG_DEFAULT_AS is not set # CONFIG_DEFAULT_DEADLINE is not set CONFIG_DEFAULT_CFQ=y # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED="cfq" CONFIG_CLASSIC_RCU=y # CONFIG_PREEMPT_RCU is not set # # Processor type and features # # CONFIG_TICK_ONESHOT is not set # CONFIG_NO_HZ is not set # CONFIG_HIGH_RES_TIMERS is not set CONFIG_GENERIC_CLOCKEVENTS_BUILD=y CONFIG_SMP=y CONFIG_X86_PC=y # CONFIG_X86_ELAN is not set # CONFIG_X86_VOYAGER is not set # CONFIG_X86_NUMAQ is not set # CONFIG_X86_SUMMIT is not
lock_task_group_list() can be called from the atomic context
Hello, Ingo! I am seeing the following calltrace every day I am connecting to my test host by the gnome-panel: ssh -Y test.host gnome-terminal BUG: sleeping function called from invalid context at /home/den/src/linux-netns26/kernel/mutex.c:209 in_atomic():1, irqs_disabled():0 no locks held by swapper/0. Pid: 0, comm: swapper Not tainted 2.6.24 #304 Call Trace: IRQ [80252d1e] ? __debug_show_held_locks+0x15/0x27 [8022c2a8] __might_sleep+0xc0/0xdf [8049f1df] mutex_lock_nested+0x28/0x2a9 [80231294] sched_destroy_group+0x18/0xea [8023e835] sched_destroy_user+0xd/0xf [8023e8c1] free_uid+0x8a/0xab [80233e24] __put_task_struct+0x3f/0xd3 [80236708] delayed_put_task_struct+0x23/0x25 [8026fda7] __rcu_process_callbacks+0x8d/0x215 [8026ff52] rcu_process_callbacks+0x23/0x44 [8023a2ae] __do_softirq+0x79/0xf8 [8020f8c3] ? profile_pc+0x2a/0x67 [8020d38c] call_softirq+0x1c/0x30 [8020f689] do_softirq+0x61/0x9c [8023a233] irq_exit+0x51/0x53 [8021bd1a] smp_apic_timer_interrupt+0x77/0xad [8020ce3b] apic_timer_interrupt+0x6b/0x70 EOI [8020b0dd] ? default_idle+0x43/0x76 [8020b0db] ? default_idle+0x41/0x76 [8020b09a] ? default_idle+0x0/0x76 [8020b186] ? cpu_idle+0x76/0x98 Config is attached. The kernel is today pulled Dave Miller's tree. Regards, Den # # Automatically generated make config: don't edit # Linux kernel version: 2.6.24 # Mon Feb 11 11:14:57 2008 # CONFIG_64BIT=y # CONFIG_X86_32 is not set CONFIG_X86_64=y CONFIG_X86=y # CONFIG_GENERIC_LOCKBREAK is not set CONFIG_GENERIC_TIME=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_HAVE_LATENCYTOP_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_FAST_CMPXCHG_LOCAL=y CONFIG_MMU=y CONFIG_ZONE_DMA=y # CONFIG_QUICKLIST is not set CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_HWEIGHT=y # CONFIG_GENERIC_GPIO is not set CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_DMI=y CONFIG_RWSEM_GENERIC_SPINLOCK=y # CONFIG_RWSEM_XCHGADD_ALGORITHM is not set # CONFIG_ARCH_HAS_ILOG2_U32 is not set # CONFIG_ARCH_HAS_ILOG2_U64 is not set CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_ARCH_HAS_CPU_RELAX=y CONFIG_HAVE_SETUP_PER_CPU_AREA=y CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y CONFIG_ZONE_DMA32=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_AUDIT_ARCH=y CONFIG_ARCH_SUPPORTS_AOUT=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_X86_SMP=y CONFIG_X86_64_SMP=y CONFIG_X86_HT=y CONFIG_X86_TRAMPOLINE=y # CONFIG_KTIME_SCALAR is not set CONFIG_DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config # # General setup # CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_LOCALVERSION= # CONFIG_LOCALVERSION_AUTO is not set CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_BSD_PROCESS_ACCT=y # CONFIG_BSD_PROCESS_ACCT_V3 is not set CONFIG_TASKSTATS=y CONFIG_TASK_DELAY_ACCT=y # CONFIG_TASK_XACCT is not set CONFIG_AUDIT=y CONFIG_AUDITSYSCALL=y CONFIG_AUDIT_TREE=y CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y CONFIG_LOG_BUF_SHIFT=17 # CONFIG_CGROUPS is not set CONFIG_FAIR_GROUP_SCHED=y CONFIG_FAIR_USER_SCHED=y # CONFIG_FAIR_CGROUP_SCHED is not set CONFIG_RELAY=y CONFIG_NAMESPACES=y CONFIG_UTS_NS=y CONFIG_IPC_NS=y CONFIG_USER_NS=y CONFIG_PID_NS=y CONFIG_BLK_DEV_INITRD=y CONFIG_INITRAMFS_SOURCE= # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set CONFIG_SYSCTL=y CONFIG_EMBEDDED=y CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y CONFIG_KALLSYMS_ALL=y CONFIG_KALLSYMS_EXTRA_PASS=y CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_COMPAT_BRK=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_ANON_INODES=y CONFIG_EPOLL=y CONFIG_SIGNALFD=y CONFIG_TIMERFD=y CONFIG_EVENTFD=y CONFIG_SHMEM=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_SLAB=y # CONFIG_SLUB is not set # CONFIG_SLOB is not set # CONFIG_PROFILING is not set # CONFIG_MARKERS is not set CONFIG_HAVE_OPROFILE=y # CONFIG_KPROBES is not set CONFIG_HAVE_KPROBES=y CONFIG_PROC_PAGE_MONITOR=y CONFIG_SLABINFO=y CONFIG_RT_MUTEXES=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y # CONFIG_MODULE_FORCE_UNLOAD is not set # CONFIG_MODVERSIONS is not set # CONFIG_MODULE_SRCVERSION_ALL is not set CONFIG_KMOD=y CONFIG_STOP_MACHINE=y CONFIG_BLOCK=y # CONFIG_BLK_DEV_BSG is not set CONFIG_BLOCK_COMPAT=y # # IO Schedulers # CONFIG_IOSCHED_NOOP=y # CONFIG_IOSCHED_AS is not set # CONFIG_IOSCHED_DEADLINE is not set CONFIG_IOSCHED_CFQ=y # CONFIG_DEFAULT_AS is not set # CONFIG_DEFAULT_DEADLINE is not set CONFIG_DEFAULT_CFQ=y # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED=cfq CONFIG_CLASSIC_RCU=y #
[PATCH 6/6] [NETNS]: Lookup in FIB semantic hashes taking into account the namespace.
The namespace is not available in the fib_sync_down_addr, add it as a parameter. Looking up a device by the pointer to it is OK. Looking up using a result from fib_trie/fib_hash table lookup is also safe. No need to fix that at all. So, just fix lookup by address and insertion to the hash table path. Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]> --- include/net/ip_fib.h |2 +- net/ipv4/fib_frontend.c |2 +- net/ipv4/fib_semantics.c |6 +- 3 files changed, 7 insertions(+), 3 deletions(-) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index cb0df37..90d1175 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -220,7 +220,7 @@ extern void fib_select_default(struct net *net, const struct flowi *flp, /* Exported by fib_semantics.c */ extern int ip_fib_check_default(__be32 gw, struct net_device *dev); extern int fib_sync_down_dev(struct net_device *dev, int force); -extern int fib_sync_down_addr(__be32 local); +extern int fib_sync_down_addr(struct net *net, __be32 local); extern int fib_sync_up(struct net_device *dev); extern __be32 __fib_res_prefsrc(struct fib_result *res); extern void fib_select_multipath(const struct flowi *flp, struct fib_result *res); diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index d69ffa2..86ff271 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -808,7 +808,7 @@ static void fib_del_ifaddr(struct in_ifaddr *ifa) First of all, we scan fib_info list searching for stray nexthop entries, then ignite fib_flush. */ - if (fib_sync_down_addr(ifa->ifa_local)) + if (fib_sync_down_addr(dev->nd_net, ifa->ifa_local)) fib_flush(dev->nd_net); } } diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 97cc494..a13c847 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -229,6 +229,8 @@ static struct fib_info *fib_find_info(const struct fib_info *nfi) head = _info_hash[hash]; hlist_for_each_entry(fi, node, head, fib_hash) { + if (fi->fib_net != nfi->fib_net) + continue; if (fi->fib_nhs != nfi->fib_nhs) continue; if (nfi->fib_protocol == fi->fib_protocol && @@ -1031,7 +1033,7 @@ nla_put_failure: referring to it. - device went down -> we must shutdown all nexthops going via it. */ -int fib_sync_down_addr(__be32 local) +int fib_sync_down_addr(struct net *net, __be32 local) { int ret = 0; unsigned int hash = fib_laddr_hashfn(local); @@ -1043,6 +1045,8 @@ int fib_sync_down_addr(__be32 local) return 0; hlist_for_each_entry(fi, node, head, fib_lhash) { + if (fi->fib_net != net) + continue; if (fi->fib_prefsrc == local) { fi->fib_flags |= RTNH_F_DEAD; ret++; -- 1.5.3.rc5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/6] [IPV4]: fib_sync_down rework.
fib_sync_down can be called with an address and with a device. In reality it is called either with address OR with a device. The codepath inside is completely different, so lets separate it into two calls for these two cases. Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]> --- include/net/ip_fib.h |3 +- net/ipv4/fib_frontend.c |4 +- net/ipv4/fib_semantics.c | 104 +++-- 3 files changed, 57 insertions(+), 54 deletions(-) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index 9daa60b..1b2f008 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -218,7 +218,8 @@ extern void fib_select_default(struct net *net, const struct flowi *flp, /* Exported by fib_semantics.c */ extern int ip_fib_check_default(__be32 gw, struct net_device *dev); -extern int fib_sync_down(__be32 local, struct net_device *dev, int force); +extern int fib_sync_down_dev(struct net_device *dev, int force); +extern int fib_sync_down_addr(__be32 local); extern int fib_sync_up(struct net_device *dev); extern __be32 __fib_res_prefsrc(struct fib_result *res); extern void fib_select_multipath(const struct flowi *flp, struct fib_result *res); diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index d0507f4..d69ffa2 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -808,7 +808,7 @@ static void fib_del_ifaddr(struct in_ifaddr *ifa) First of all, we scan fib_info list searching for stray nexthop entries, then ignite fib_flush. */ - if (fib_sync_down(ifa->ifa_local, NULL, 0)) + if (fib_sync_down_addr(ifa->ifa_local)) fib_flush(dev->nd_net); } } @@ -898,7 +898,7 @@ static void nl_fib_lookup_exit(struct net *net) static void fib_disable_ip(struct net_device *dev, int force) { - if (fib_sync_down(0, dev, force)) + if (fib_sync_down_dev(dev, force)) fib_flush(dev->nd_net); rt_cache_flush(0); arp_ifdown(dev); diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index c791286..5beff2e 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -1031,70 +1031,72 @@ nla_put_failure: referring to it. - device went down -> we must shutdown all nexthops going via it. */ - -int fib_sync_down(__be32 local, struct net_device *dev, int force) +int fib_sync_down_addr(__be32 local) { int ret = 0; - int scope = RT_SCOPE_NOWHERE; - - if (force) - scope = -1; + unsigned int hash = fib_laddr_hashfn(local); + struct hlist_head *head = _info_laddrhash[hash]; + struct hlist_node *node; + struct fib_info *fi; - if (local && fib_info_laddrhash) { - unsigned int hash = fib_laddr_hashfn(local); - struct hlist_head *head = _info_laddrhash[hash]; - struct hlist_node *node; - struct fib_info *fi; + if (fib_info_laddrhash == NULL || local == 0) + return 0; - hlist_for_each_entry(fi, node, head, fib_lhash) { - if (fi->fib_prefsrc == local) { - fi->fib_flags |= RTNH_F_DEAD; - ret++; - } + hlist_for_each_entry(fi, node, head, fib_lhash) { + if (fi->fib_prefsrc == local) { + fi->fib_flags |= RTNH_F_DEAD; + ret++; } } + return ret; +} - if (dev) { - struct fib_info *prev_fi = NULL; - unsigned int hash = fib_devindex_hashfn(dev->ifindex); - struct hlist_head *head = _info_devhash[hash]; - struct hlist_node *node; - struct fib_nh *nh; +int fib_sync_down_dev(struct net_device *dev, int force) +{ + int ret = 0; + int scope = RT_SCOPE_NOWHERE; + struct fib_info *prev_fi = NULL; + unsigned int hash = fib_devindex_hashfn(dev->ifindex); + struct hlist_head *head = _info_devhash[hash]; + struct hlist_node *node; + struct fib_nh *nh; - hlist_for_each_entry(nh, node, head, nh_hash) { - struct fib_info *fi = nh->nh_parent; - int dead; + if (force) + scope = -1; - BUG_ON(!fi->fib_nhs); - if (nh->nh_dev != dev || fi == prev_fi) - continue; - prev_fi = fi; - dead = 0; - change_nexthops(fi) { - if (nh->nh_flags_F_DEAD) - dead++; - else if (nh->nh_dev == dev &am
[PATCH 3/6] [NETNS]: Process interface address manipulation routines in the namespace.
The namespace is available when required except rtm_to_ifaddr. Add namespace argument to it. Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]> --- net/ipv4/devinet.c | 14 -- 1 files changed, 8 insertions(+), 6 deletions(-) diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c index e55c85e..6a6e92e 100644 --- a/net/ipv4/devinet.c +++ b/net/ipv4/devinet.c @@ -485,7 +485,7 @@ errout: return err; } -static struct in_ifaddr *rtm_to_ifaddr(struct nlmsghdr *nlh) +static struct in_ifaddr *rtm_to_ifaddr(struct net *net, struct nlmsghdr *nlh) { struct nlattr *tb[IFA_MAX+1]; struct in_ifaddr *ifa; @@ -503,7 +503,7 @@ static struct in_ifaddr *rtm_to_ifaddr(struct nlmsghdr *nlh) if (ifm->ifa_prefixlen > 32 || tb[IFA_LOCAL] == NULL) goto errout; - dev = __dev_get_by_index(_net, ifm->ifa_index); + dev = __dev_get_by_index(net, ifm->ifa_index); err = -ENODEV; if (dev == NULL) goto errout; @@ -571,7 +571,7 @@ static int inet_rtm_newaddr(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg if (net != _net) return -EINVAL; - ifa = rtm_to_ifaddr(nlh); + ifa = rtm_to_ifaddr(net, nlh); if (IS_ERR(ifa)) return PTR_ERR(ifa); @@ -1189,7 +1189,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb) s_ip_idx = ip_idx = cb->args[1]; idx = 0; - for_each_netdev(_net, dev) { + for_each_netdev(net, dev) { if (idx < s_idx) goto cont; if (idx > s_idx) @@ -1223,7 +1223,9 @@ static void rtmsg_ifa(int event, struct in_ifaddr* ifa, struct nlmsghdr *nlh, struct sk_buff *skb; u32 seq = nlh ? nlh->nlmsg_seq : 0; int err = -ENOBUFS; + struct net *net; + net = ifa->ifa_dev->dev->nd_net; skb = nlmsg_new(inet_nlmsg_size(), GFP_KERNEL); if (skb == NULL) goto errout; @@ -1235,10 +1237,10 @@ static void rtmsg_ifa(int event, struct in_ifaddr* ifa, struct nlmsghdr *nlh, kfree_skb(skb); goto errout; } - err = rtnl_notify(skb, _net, pid, RTNLGRP_IPV4_IFADDR, nlh, GFP_KERNEL); + err = rtnl_notify(skb, net, pid, RTNLGRP_IPV4_IFADDR, nlh, GFP_KERNEL); errout: if (err < 0) - rtnl_set_sk_err(_net, RTNLGRP_IPV4_IFADDR, err); + rtnl_set_sk_err(net, RTNLGRP_IPV4_IFADDR, err); } #ifdef CONFIG_SYSCTL -- 1.5.3.rc5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 5/6] [NETNS]: Add a namespace mark to fib_info.
This is required to make fib_info lookups namespace aware. In the other case initial namespace devices are marked as dead in the local routing table during other namespace stop. Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]> --- include/net/ip_fib.h |1 + net/ipv4/fib_semantics.c |8 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index 1b2f008..cb0df37 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -69,6 +69,7 @@ struct fib_nh { struct fib_info { struct hlist_node fib_hash; struct hlist_node fib_lhash; + struct net *fib_net; int fib_treeref; atomic_tfib_clntref; int fib_dead; diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 5beff2e..97cc494 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -687,6 +687,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg) struct fib_info *fi = NULL; struct fib_info *ofi; int nhs = 1; + struct net *net = cfg->fc_nlinfo.nl_net; /* Fast check to catch the most weird cases */ if (fib_props[cfg->fc_type].scope > cfg->fc_scope) @@ -727,6 +728,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg) goto failure; fib_info_cnt++; + fi->fib_net = net; fi->fib_protocol = cfg->fc_protocol; fi->fib_flags = cfg->fc_flags; fi->fib_priority = cfg->fc_priority; @@ -798,8 +800,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg) if (nhs != 1 || nh->nh_gw) goto err_inval; nh->nh_scope = RT_SCOPE_NOWHERE; - nh->nh_dev = dev_get_by_index(cfg->fc_nlinfo.nl_net, - fi->fib_nh->nh_oif); + nh->nh_dev = dev_get_by_index(net, fi->fib_nh->nh_oif); err = -ENODEV; if (nh->nh_dev == NULL) goto failure; @@ -813,8 +814,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg) if (fi->fib_prefsrc) { if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst || fi->fib_prefsrc != cfg->fc_dst) - if (inet_addr_type(cfg->fc_nlinfo.nl_net, - fi->fib_prefsrc) != RTN_LOCAL) + if (inet_addr_type(net, fi->fib_prefsrc) != RTN_LOCAL) goto err_inval; } -- 1.5.3.rc5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/6] [IPV4]: Fix memory leak on error path during FIB initialization.
net->ipv4.fib_table_hash is not freed when fib4_rules_init failed. The problem has been introduced by the following commit. commit c8050bf6d84785a7edd2e81591e8f833231477e8 Author: Denis V. Lunev <[EMAIL PROTECTED]> Date: Thu Jan 10 03:28:24 2008 -0800 Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]> --- net/ipv4/fib_frontend.c | 10 +- 1 files changed, 9 insertions(+), 1 deletions(-) diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index d282618..d0507f4 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -975,6 +975,7 @@ static struct notifier_block fib_netdev_notifier = { static int __net_init ip_fib_net_init(struct net *net) { + int err; unsigned int i; net->ipv4.fib_table_hash = kzalloc( @@ -985,7 +986,14 @@ static int __net_init ip_fib_net_init(struct net *net) for (i = 0; i < FIB_TABLE_HASHSZ; i++) INIT_HLIST_HEAD(>ipv4.fib_table_hash[i]); - return fib4_rules_init(net); + err = fib4_rules_init(net); + if (err < 0) + goto fail; + return 0; + +fail: + kfree(net->ipv4.fib_table_hash); + return err; } static void __net_exit ip_fib_net_exit(struct net *net) -- 1.5.3.rc5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/6] [IPV4]: Small style cleanup of the error path in rtm_to_ifaddr.
Remove error code assignment inside brackets on failure. The code looks better if the error is assigned before condition check. Also, the compiler treats this better. Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]> --- net/ipv4/devinet.c | 21 - 1 files changed, 8 insertions(+), 13 deletions(-) diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c index 21f71bf..9da4c68 100644 --- a/net/ipv4/devinet.c +++ b/net/ipv4/devinet.c @@ -492,39 +492,34 @@ static struct in_ifaddr *rtm_to_ifaddr(struct nlmsghdr *nlh) struct ifaddrmsg *ifm; struct net_device *dev; struct in_device *in_dev; - int err = -EINVAL; + int err; err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFA_MAX, ifa_ipv4_policy); if (err < 0) goto errout; ifm = nlmsg_data(nlh); - if (ifm->ifa_prefixlen > 32 || tb[IFA_LOCAL] == NULL) { - err = -EINVAL; + err = -EINVAL; + if (ifm->ifa_prefixlen > 32 || tb[IFA_LOCAL] == NULL) goto errout; - } dev = __dev_get_by_index(_net, ifm->ifa_index); - if (dev == NULL) { - err = -ENODEV; + err = -ENODEV; + if (dev == NULL) goto errout; - } in_dev = __in_dev_get_rtnl(dev); - if (in_dev == NULL) { - err = -ENOBUFS; + err = -ENOBUFS; + if (in_dev == NULL) goto errout; - } ifa = inet_alloc_ifa(); - if (ifa == NULL) { + if (ifa == NULL) /* * A potential indev allocation can be left alive, it stays * assigned to its device and is destroy with it. */ - err = -ENOBUFS; goto errout; - } ipv4_devconf_setall(in_dev); in_dev_hold(in_dev); -- 1.5.3.rc5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/3] [RAW]: Wrong content of the /proc/net/raw6.
The address of IPv6 raw sockets was shown in the wrong format, from IPv4 ones. The problem has been introduced by the commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5 Author: Pavel Emelyanov <[EMAIL PROTECTED]> Date: Mon Nov 19 22:38:33 2007 -0800 Thanks to Adrian Bunk who originally noticed the problem. Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]> --- include/net/raw.h |3 ++- net/ipv4/raw.c|8 net/ipv6/raw.c|2 +- 3 files changed, 7 insertions(+), 6 deletions(-) diff --git a/include/net/raw.h b/include/net/raw.h index c7ea7a2..1828f81 100644 --- a/include/net/raw.h +++ b/include/net/raw.h @@ -48,7 +48,8 @@ struct raw_iter_state { void *raw_seq_start(struct seq_file *seq, loff_t *pos); void *raw_seq_next(struct seq_file *seq, void *v, loff_t *pos); void raw_seq_stop(struct seq_file *seq, void *v); -int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h); +int raw_seq_open(struct inode *ino, struct file *file, +struct raw_hashinfo *h, const struct seq_operations *ops); #endif diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index 830f19e..a3002fe 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -962,13 +962,13 @@ static const struct seq_operations raw_seq_ops = { .show = raw_seq_show, }; -int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h) +int raw_seq_open(struct inode *ino, struct file *file, +struct raw_hashinfo *h, const struct seq_operations *ops) { int err; struct raw_iter_state *i; - err = seq_open_net(ino, file, _seq_ops, - sizeof(struct raw_iter_state)); + err = seq_open_net(ino, file, ops, sizeof(struct raw_iter_state)); if (err < 0) return err; @@ -980,7 +980,7 @@ EXPORT_SYMBOL_GPL(raw_seq_open); static int raw_v4_seq_open(struct inode *inode, struct file *file) { - return raw_seq_open(inode, file, _v4_hashinfo); + return raw_seq_open(inode, file, _v4_hashinfo, _seq_ops); } static const struct file_operations raw_seq_fops = { diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c index a2cf499..8897ccf 100644 --- a/net/ipv6/raw.c +++ b/net/ipv6/raw.c @@ -1262,7 +1262,7 @@ static const struct seq_operations raw6_seq_ops = { static int raw6_seq_open(struct inode *inode, struct file *file) { - return raw_seq_open(inode, file, _v6_hashinfo); + return raw_seq_open(inode, file, _v6_hashinfo, _seq_ops); } static const struct file_operations raw6_seq_fops = { -- 1.5.3.rc5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/3] [RAW]: Family check in the /proc/net/raw[6] is extra.
Different hashtables are used for IPv6 and IPv4 raw sockets, so no need to check the socket family in the iterator over hashtables. Clean this out. Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]> --- include/net/raw.h |4 +--- net/ipv4/raw.c| 12 net/ipv6/raw.c|2 +- 3 files changed, 6 insertions(+), 12 deletions(-) diff --git a/include/net/raw.h b/include/net/raw.h index cca81d8..c7ea7a2 100644 --- a/include/net/raw.h +++ b/include/net/raw.h @@ -41,7 +41,6 @@ extern void raw_proc_exit(void); struct raw_iter_state { struct seq_net_private p; int bucket; - unsigned short family; struct raw_hashinfo *h; }; @@ -49,8 +48,7 @@ struct raw_iter_state { void *raw_seq_start(struct seq_file *seq, loff_t *pos); void *raw_seq_next(struct seq_file *seq, void *v, loff_t *pos); void raw_seq_stop(struct seq_file *seq, void *v); -int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h, - unsigned short family); +int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h); #endif diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index f863c3d..507cbfe 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -862,8 +862,7 @@ static struct sock *raw_get_first(struct seq_file *seq) struct hlist_node *node; sk_for_each(sk, node, >h->ht[state->bucket]) - if (sk->sk_net == state->p.net && - sk->sk_family == state->family) + if (sk->sk_net == state->p.net) goto found; } sk = NULL; @@ -879,8 +878,7 @@ static struct sock *raw_get_next(struct seq_file *seq, struct sock *sk) sk = sk_next(sk); try_again: ; - } while (sk && sk->sk_net != state->p.net && - sk->sk_family != state->family); + } while (sk && sk->sk_net != state->p.net); if (!sk && ++state->bucket < RAW_HTABLE_SIZE) { sk = sk_head(>h->ht[state->bucket]); @@ -974,8 +972,7 @@ static const struct seq_operations raw_seq_ops = { .show = raw_seq_show, }; -int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h, - unsigned short family) +int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h) { int err; struct raw_iter_state *i; @@ -987,14 +984,13 @@ int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h, i = raw_seq_private((struct seq_file *)file->private_data); i->h = h; - i->family = family; return 0; } EXPORT_SYMBOL_GPL(raw_seq_open); static int raw_v4_seq_open(struct inode *inode, struct file *file) { - return raw_seq_open(inode, file, _v4_hashinfo, PF_INET); + return raw_seq_open(inode, file, _v4_hashinfo); } static const struct file_operations raw_seq_fops = { diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c index d61c63d..a2cf499 100644 --- a/net/ipv6/raw.c +++ b/net/ipv6/raw.c @@ -1262,7 +1262,7 @@ static const struct seq_operations raw6_seq_ops = { static int raw6_seq_open(struct inode *inode, struct file *file) { - return raw_seq_open(inode, file, _v6_hashinfo, PF_INET6); + return raw_seq_open(inode, file, _v6_hashinfo); } static const struct file_operations raw6_seq_fops = { -- 1.5.3.rc5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/3] [RAW]: Cleanup IPv4 raw_seq_show.
There is no need to use 128 bytes on the stack at all. Clean the code in the IPv6 style. Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]> --- net/ipv4/raw.c | 24 +++- 1 files changed, 7 insertions(+), 17 deletions(-) diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index 507cbfe..830f19e 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -927,7 +927,7 @@ void raw_seq_stop(struct seq_file *seq, void *v) } EXPORT_SYMBOL_GPL(raw_seq_stop); -static __inline__ char *get_raw_sock(struct sock *sp, char *tmpbuf, int i) +static void raw_sock_seq_show(struct seq_file *seq, struct sock *sp, int i) { struct inet_sock *inet = inet_sk(sp); __be32 dest = inet->daddr, @@ -935,33 +935,23 @@ static __inline__ char *get_raw_sock(struct sock *sp, char *tmpbuf, int i) __u16 destp = 0, srcp = inet->num; - sprintf(tmpbuf, "%4d: %08X:%04X %08X:%04X" + seq_printf(seq, "%4d: %08X:%04X %08X:%04X" " %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %p %d", i, src, srcp, dest, destp, sp->sk_state, atomic_read(>sk_wmem_alloc), atomic_read(>sk_rmem_alloc), 0, 0L, 0, sock_i_uid(sp), 0, sock_i_ino(sp), atomic_read(>sk_refcnt), sp, atomic_read(>sk_drops)); - return tmpbuf; } -#define TMPSZ 128 - static int raw_seq_show(struct seq_file *seq, void *v) { - char tmpbuf[TMPSZ+1]; - if (v == SEQ_START_TOKEN) - seq_printf(seq, "%-*s\n", TMPSZ-1, - " sl local_address rem_address st tx_queue " - "rx_queue tr tm->when retrnsmt uid timeout " - "inode drops"); - else { - struct raw_iter_state *state = raw_seq_private(seq); - - seq_printf(seq, "%-*s\n", TMPSZ-1, - get_raw_sock(v, tmpbuf, state->bucket)); - } + seq_printf(seq, " sl local_address rem_address st tx_queue " + "rx_queue tr tm->when retrnsmt uid timeout " + "inode drops\n"); + else + raw_sock_seq_show(seq, v, raw_seq_private(seq)->bucket); return 0; } -- 1.5.3.rc5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/3] [RAW]: proc output cleanups.
yesterday Adrian Bunk noticed, that the commit commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5 Author: Pavel Emelyanov <[EMAIL PROTECTED]> Date: Mon Nov 19 22:38:33 2007 -0800 is somewhat strange. Basically, the commit is obviously wrong as the content of the /proc/net/raw6 is incorrect due to it. This series of patches fixes original problem and slightly cleanups the code around. Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]> -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/3] [RAW]: proc output cleanups.
yesterday Adrian Bunk noticed, that the commit commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5 Author: Pavel Emelyanov [EMAIL PROTECTED] Date: Mon Nov 19 22:38:33 2007 -0800 is somewhat strange. Basically, the commit is obviously wrong as the content of the /proc/net/raw6 is incorrect due to it. This series of patches fixes original problem and slightly cleanups the code around. Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/3] [RAW]: Cleanup IPv4 raw_seq_show.
There is no need to use 128 bytes on the stack at all. Clean the code in the IPv6 style. Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] --- net/ipv4/raw.c | 24 +++- 1 files changed, 7 insertions(+), 17 deletions(-) diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index 507cbfe..830f19e 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -927,7 +927,7 @@ void raw_seq_stop(struct seq_file *seq, void *v) } EXPORT_SYMBOL_GPL(raw_seq_stop); -static __inline__ char *get_raw_sock(struct sock *sp, char *tmpbuf, int i) +static void raw_sock_seq_show(struct seq_file *seq, struct sock *sp, int i) { struct inet_sock *inet = inet_sk(sp); __be32 dest = inet-daddr, @@ -935,33 +935,23 @@ static __inline__ char *get_raw_sock(struct sock *sp, char *tmpbuf, int i) __u16 destp = 0, srcp = inet-num; - sprintf(tmpbuf, %4d: %08X:%04X %08X:%04X + seq_printf(seq, %4d: %08X:%04X %08X:%04X %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %p %d, i, src, srcp, dest, destp, sp-sk_state, atomic_read(sp-sk_wmem_alloc), atomic_read(sp-sk_rmem_alloc), 0, 0L, 0, sock_i_uid(sp), 0, sock_i_ino(sp), atomic_read(sp-sk_refcnt), sp, atomic_read(sp-sk_drops)); - return tmpbuf; } -#define TMPSZ 128 - static int raw_seq_show(struct seq_file *seq, void *v) { - char tmpbuf[TMPSZ+1]; - if (v == SEQ_START_TOKEN) - seq_printf(seq, %-*s\n, TMPSZ-1, -sl local_address rem_address st tx_queue - rx_queue tr tm-when retrnsmt uid timeout - inode drops); - else { - struct raw_iter_state *state = raw_seq_private(seq); - - seq_printf(seq, %-*s\n, TMPSZ-1, - get_raw_sock(v, tmpbuf, state-bucket)); - } + seq_printf(seq, sl local_address rem_address st tx_queue + rx_queue tr tm-when retrnsmt uid timeout + inode drops\n); + else + raw_sock_seq_show(seq, v, raw_seq_private(seq)-bucket); return 0; } -- 1.5.3.rc5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/3] [RAW]: Wrong content of the /proc/net/raw6.
The address of IPv6 raw sockets was shown in the wrong format, from IPv4 ones. The problem has been introduced by the commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5 Author: Pavel Emelyanov [EMAIL PROTECTED] Date: Mon Nov 19 22:38:33 2007 -0800 Thanks to Adrian Bunk who originally noticed the problem. Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] --- include/net/raw.h |3 ++- net/ipv4/raw.c|8 net/ipv6/raw.c|2 +- 3 files changed, 7 insertions(+), 6 deletions(-) diff --git a/include/net/raw.h b/include/net/raw.h index c7ea7a2..1828f81 100644 --- a/include/net/raw.h +++ b/include/net/raw.h @@ -48,7 +48,8 @@ struct raw_iter_state { void *raw_seq_start(struct seq_file *seq, loff_t *pos); void *raw_seq_next(struct seq_file *seq, void *v, loff_t *pos); void raw_seq_stop(struct seq_file *seq, void *v); -int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h); +int raw_seq_open(struct inode *ino, struct file *file, +struct raw_hashinfo *h, const struct seq_operations *ops); #endif diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index 830f19e..a3002fe 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -962,13 +962,13 @@ static const struct seq_operations raw_seq_ops = { .show = raw_seq_show, }; -int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h) +int raw_seq_open(struct inode *ino, struct file *file, +struct raw_hashinfo *h, const struct seq_operations *ops) { int err; struct raw_iter_state *i; - err = seq_open_net(ino, file, raw_seq_ops, - sizeof(struct raw_iter_state)); + err = seq_open_net(ino, file, ops, sizeof(struct raw_iter_state)); if (err 0) return err; @@ -980,7 +980,7 @@ EXPORT_SYMBOL_GPL(raw_seq_open); static int raw_v4_seq_open(struct inode *inode, struct file *file) { - return raw_seq_open(inode, file, raw_v4_hashinfo); + return raw_seq_open(inode, file, raw_v4_hashinfo, raw_seq_ops); } static const struct file_operations raw_seq_fops = { diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c index a2cf499..8897ccf 100644 --- a/net/ipv6/raw.c +++ b/net/ipv6/raw.c @@ -1262,7 +1262,7 @@ static const struct seq_operations raw6_seq_ops = { static int raw6_seq_open(struct inode *inode, struct file *file) { - return raw_seq_open(inode, file, raw_v6_hashinfo); + return raw_seq_open(inode, file, raw_v6_hashinfo, raw6_seq_ops); } static const struct file_operations raw6_seq_fops = { -- 1.5.3.rc5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/3] [RAW]: Family check in the /proc/net/raw[6] is extra.
Different hashtables are used for IPv6 and IPv4 raw sockets, so no need to check the socket family in the iterator over hashtables. Clean this out. Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] --- include/net/raw.h |4 +--- net/ipv4/raw.c| 12 net/ipv6/raw.c|2 +- 3 files changed, 6 insertions(+), 12 deletions(-) diff --git a/include/net/raw.h b/include/net/raw.h index cca81d8..c7ea7a2 100644 --- a/include/net/raw.h +++ b/include/net/raw.h @@ -41,7 +41,6 @@ extern void raw_proc_exit(void); struct raw_iter_state { struct seq_net_private p; int bucket; - unsigned short family; struct raw_hashinfo *h; }; @@ -49,8 +48,7 @@ struct raw_iter_state { void *raw_seq_start(struct seq_file *seq, loff_t *pos); void *raw_seq_next(struct seq_file *seq, void *v, loff_t *pos); void raw_seq_stop(struct seq_file *seq, void *v); -int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h, - unsigned short family); +int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h); #endif diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index f863c3d..507cbfe 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -862,8 +862,7 @@ static struct sock *raw_get_first(struct seq_file *seq) struct hlist_node *node; sk_for_each(sk, node, state-h-ht[state-bucket]) - if (sk-sk_net == state-p.net - sk-sk_family == state-family) + if (sk-sk_net == state-p.net) goto found; } sk = NULL; @@ -879,8 +878,7 @@ static struct sock *raw_get_next(struct seq_file *seq, struct sock *sk) sk = sk_next(sk); try_again: ; - } while (sk sk-sk_net != state-p.net - sk-sk_family != state-family); + } while (sk sk-sk_net != state-p.net); if (!sk ++state-bucket RAW_HTABLE_SIZE) { sk = sk_head(state-h-ht[state-bucket]); @@ -974,8 +972,7 @@ static const struct seq_operations raw_seq_ops = { .show = raw_seq_show, }; -int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h, - unsigned short family) +int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h) { int err; struct raw_iter_state *i; @@ -987,14 +984,13 @@ int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h, i = raw_seq_private((struct seq_file *)file-private_data); i-h = h; - i-family = family; return 0; } EXPORT_SYMBOL_GPL(raw_seq_open); static int raw_v4_seq_open(struct inode *inode, struct file *file) { - return raw_seq_open(inode, file, raw_v4_hashinfo, PF_INET); + return raw_seq_open(inode, file, raw_v4_hashinfo); } static const struct file_operations raw_seq_fops = { diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c index d61c63d..a2cf499 100644 --- a/net/ipv6/raw.c +++ b/net/ipv6/raw.c @@ -1262,7 +1262,7 @@ static const struct seq_operations raw6_seq_ops = { static int raw6_seq_open(struct inode *inode, struct file *file) { - return raw_seq_open(inode, file, raw_v6_hashinfo, PF_INET6); + return raw_seq_open(inode, file, raw_v6_hashinfo); } static const struct file_operations raw6_seq_fops = { -- 1.5.3.rc5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/6] [IPV4]: fib_sync_down rework.
fib_sync_down can be called with an address and with a device. In reality it is called either with address OR with a device. The codepath inside is completely different, so lets separate it into two calls for these two cases. Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] --- include/net/ip_fib.h |3 +- net/ipv4/fib_frontend.c |4 +- net/ipv4/fib_semantics.c | 104 +++-- 3 files changed, 57 insertions(+), 54 deletions(-) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index 9daa60b..1b2f008 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -218,7 +218,8 @@ extern void fib_select_default(struct net *net, const struct flowi *flp, /* Exported by fib_semantics.c */ extern int ip_fib_check_default(__be32 gw, struct net_device *dev); -extern int fib_sync_down(__be32 local, struct net_device *dev, int force); +extern int fib_sync_down_dev(struct net_device *dev, int force); +extern int fib_sync_down_addr(__be32 local); extern int fib_sync_up(struct net_device *dev); extern __be32 __fib_res_prefsrc(struct fib_result *res); extern void fib_select_multipath(const struct flowi *flp, struct fib_result *res); diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index d0507f4..d69ffa2 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -808,7 +808,7 @@ static void fib_del_ifaddr(struct in_ifaddr *ifa) First of all, we scan fib_info list searching for stray nexthop entries, then ignite fib_flush. */ - if (fib_sync_down(ifa-ifa_local, NULL, 0)) + if (fib_sync_down_addr(ifa-ifa_local)) fib_flush(dev-nd_net); } } @@ -898,7 +898,7 @@ static void nl_fib_lookup_exit(struct net *net) static void fib_disable_ip(struct net_device *dev, int force) { - if (fib_sync_down(0, dev, force)) + if (fib_sync_down_dev(dev, force)) fib_flush(dev-nd_net); rt_cache_flush(0); arp_ifdown(dev); diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index c791286..5beff2e 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -1031,70 +1031,72 @@ nla_put_failure: referring to it. - device went down - we must shutdown all nexthops going via it. */ - -int fib_sync_down(__be32 local, struct net_device *dev, int force) +int fib_sync_down_addr(__be32 local) { int ret = 0; - int scope = RT_SCOPE_NOWHERE; - - if (force) - scope = -1; + unsigned int hash = fib_laddr_hashfn(local); + struct hlist_head *head = fib_info_laddrhash[hash]; + struct hlist_node *node; + struct fib_info *fi; - if (local fib_info_laddrhash) { - unsigned int hash = fib_laddr_hashfn(local); - struct hlist_head *head = fib_info_laddrhash[hash]; - struct hlist_node *node; - struct fib_info *fi; + if (fib_info_laddrhash == NULL || local == 0) + return 0; - hlist_for_each_entry(fi, node, head, fib_lhash) { - if (fi-fib_prefsrc == local) { - fi-fib_flags |= RTNH_F_DEAD; - ret++; - } + hlist_for_each_entry(fi, node, head, fib_lhash) { + if (fi-fib_prefsrc == local) { + fi-fib_flags |= RTNH_F_DEAD; + ret++; } } + return ret; +} - if (dev) { - struct fib_info *prev_fi = NULL; - unsigned int hash = fib_devindex_hashfn(dev-ifindex); - struct hlist_head *head = fib_info_devhash[hash]; - struct hlist_node *node; - struct fib_nh *nh; +int fib_sync_down_dev(struct net_device *dev, int force) +{ + int ret = 0; + int scope = RT_SCOPE_NOWHERE; + struct fib_info *prev_fi = NULL; + unsigned int hash = fib_devindex_hashfn(dev-ifindex); + struct hlist_head *head = fib_info_devhash[hash]; + struct hlist_node *node; + struct fib_nh *nh; - hlist_for_each_entry(nh, node, head, nh_hash) { - struct fib_info *fi = nh-nh_parent; - int dead; + if (force) + scope = -1; - BUG_ON(!fi-fib_nhs); - if (nh-nh_dev != dev || fi == prev_fi) - continue; - prev_fi = fi; - dead = 0; - change_nexthops(fi) { - if (nh-nh_flagsRTNH_F_DEAD) - dead++; - else if (nh-nh_dev == dev -nh-nh_scope != scope
[PATCH 3/6] [NETNS]: Process interface address manipulation routines in the namespace.
The namespace is available when required except rtm_to_ifaddr. Add namespace argument to it. Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] --- net/ipv4/devinet.c | 14 -- 1 files changed, 8 insertions(+), 6 deletions(-) diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c index e55c85e..6a6e92e 100644 --- a/net/ipv4/devinet.c +++ b/net/ipv4/devinet.c @@ -485,7 +485,7 @@ errout: return err; } -static struct in_ifaddr *rtm_to_ifaddr(struct nlmsghdr *nlh) +static struct in_ifaddr *rtm_to_ifaddr(struct net *net, struct nlmsghdr *nlh) { struct nlattr *tb[IFA_MAX+1]; struct in_ifaddr *ifa; @@ -503,7 +503,7 @@ static struct in_ifaddr *rtm_to_ifaddr(struct nlmsghdr *nlh) if (ifm-ifa_prefixlen 32 || tb[IFA_LOCAL] == NULL) goto errout; - dev = __dev_get_by_index(init_net, ifm-ifa_index); + dev = __dev_get_by_index(net, ifm-ifa_index); err = -ENODEV; if (dev == NULL) goto errout; @@ -571,7 +571,7 @@ static int inet_rtm_newaddr(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg if (net != init_net) return -EINVAL; - ifa = rtm_to_ifaddr(nlh); + ifa = rtm_to_ifaddr(net, nlh); if (IS_ERR(ifa)) return PTR_ERR(ifa); @@ -1189,7 +1189,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb) s_ip_idx = ip_idx = cb-args[1]; idx = 0; - for_each_netdev(init_net, dev) { + for_each_netdev(net, dev) { if (idx s_idx) goto cont; if (idx s_idx) @@ -1223,7 +1223,9 @@ static void rtmsg_ifa(int event, struct in_ifaddr* ifa, struct nlmsghdr *nlh, struct sk_buff *skb; u32 seq = nlh ? nlh-nlmsg_seq : 0; int err = -ENOBUFS; + struct net *net; + net = ifa-ifa_dev-dev-nd_net; skb = nlmsg_new(inet_nlmsg_size(), GFP_KERNEL); if (skb == NULL) goto errout; @@ -1235,10 +1237,10 @@ static void rtmsg_ifa(int event, struct in_ifaddr* ifa, struct nlmsghdr *nlh, kfree_skb(skb); goto errout; } - err = rtnl_notify(skb, init_net, pid, RTNLGRP_IPV4_IFADDR, nlh, GFP_KERNEL); + err = rtnl_notify(skb, net, pid, RTNLGRP_IPV4_IFADDR, nlh, GFP_KERNEL); errout: if (err 0) - rtnl_set_sk_err(init_net, RTNLGRP_IPV4_IFADDR, err); + rtnl_set_sk_err(net, RTNLGRP_IPV4_IFADDR, err); } #ifdef CONFIG_SYSCTL -- 1.5.3.rc5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 5/6] [NETNS]: Add a namespace mark to fib_info.
This is required to make fib_info lookups namespace aware. In the other case initial namespace devices are marked as dead in the local routing table during other namespace stop. Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] --- include/net/ip_fib.h |1 + net/ipv4/fib_semantics.c |8 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index 1b2f008..cb0df37 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -69,6 +69,7 @@ struct fib_nh { struct fib_info { struct hlist_node fib_hash; struct hlist_node fib_lhash; + struct net *fib_net; int fib_treeref; atomic_tfib_clntref; int fib_dead; diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 5beff2e..97cc494 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -687,6 +687,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg) struct fib_info *fi = NULL; struct fib_info *ofi; int nhs = 1; + struct net *net = cfg-fc_nlinfo.nl_net; /* Fast check to catch the most weird cases */ if (fib_props[cfg-fc_type].scope cfg-fc_scope) @@ -727,6 +728,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg) goto failure; fib_info_cnt++; + fi-fib_net = net; fi-fib_protocol = cfg-fc_protocol; fi-fib_flags = cfg-fc_flags; fi-fib_priority = cfg-fc_priority; @@ -798,8 +800,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg) if (nhs != 1 || nh-nh_gw) goto err_inval; nh-nh_scope = RT_SCOPE_NOWHERE; - nh-nh_dev = dev_get_by_index(cfg-fc_nlinfo.nl_net, - fi-fib_nh-nh_oif); + nh-nh_dev = dev_get_by_index(net, fi-fib_nh-nh_oif); err = -ENODEV; if (nh-nh_dev == NULL) goto failure; @@ -813,8 +814,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg) if (fi-fib_prefsrc) { if (cfg-fc_type != RTN_LOCAL || !cfg-fc_dst || fi-fib_prefsrc != cfg-fc_dst) - if (inet_addr_type(cfg-fc_nlinfo.nl_net, - fi-fib_prefsrc) != RTN_LOCAL) + if (inet_addr_type(net, fi-fib_prefsrc) != RTN_LOCAL) goto err_inval; } -- 1.5.3.rc5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/6] [IPV4]: Fix memory leak on error path during FIB initialization.
net-ipv4.fib_table_hash is not freed when fib4_rules_init failed. The problem has been introduced by the following commit. commit c8050bf6d84785a7edd2e81591e8f833231477e8 Author: Denis V. Lunev [EMAIL PROTECTED] Date: Thu Jan 10 03:28:24 2008 -0800 Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] --- net/ipv4/fib_frontend.c | 10 +- 1 files changed, 9 insertions(+), 1 deletions(-) diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index d282618..d0507f4 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -975,6 +975,7 @@ static struct notifier_block fib_netdev_notifier = { static int __net_init ip_fib_net_init(struct net *net) { + int err; unsigned int i; net-ipv4.fib_table_hash = kzalloc( @@ -985,7 +986,14 @@ static int __net_init ip_fib_net_init(struct net *net) for (i = 0; i FIB_TABLE_HASHSZ; i++) INIT_HLIST_HEAD(net-ipv4.fib_table_hash[i]); - return fib4_rules_init(net); + err = fib4_rules_init(net); + if (err 0) + goto fail; + return 0; + +fail: + kfree(net-ipv4.fib_table_hash); + return err; } static void __net_exit ip_fib_net_exit(struct net *net) -- 1.5.3.rc5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/6] [IPV4]: Small style cleanup of the error path in rtm_to_ifaddr.
Remove error code assignment inside brackets on failure. The code looks better if the error is assigned before condition check. Also, the compiler treats this better. Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] --- net/ipv4/devinet.c | 21 - 1 files changed, 8 insertions(+), 13 deletions(-) diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c index 21f71bf..9da4c68 100644 --- a/net/ipv4/devinet.c +++ b/net/ipv4/devinet.c @@ -492,39 +492,34 @@ static struct in_ifaddr *rtm_to_ifaddr(struct nlmsghdr *nlh) struct ifaddrmsg *ifm; struct net_device *dev; struct in_device *in_dev; - int err = -EINVAL; + int err; err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFA_MAX, ifa_ipv4_policy); if (err 0) goto errout; ifm = nlmsg_data(nlh); - if (ifm-ifa_prefixlen 32 || tb[IFA_LOCAL] == NULL) { - err = -EINVAL; + err = -EINVAL; + if (ifm-ifa_prefixlen 32 || tb[IFA_LOCAL] == NULL) goto errout; - } dev = __dev_get_by_index(init_net, ifm-ifa_index); - if (dev == NULL) { - err = -ENODEV; + err = -ENODEV; + if (dev == NULL) goto errout; - } in_dev = __in_dev_get_rtnl(dev); - if (in_dev == NULL) { - err = -ENOBUFS; + err = -ENOBUFS; + if (in_dev == NULL) goto errout; - } ifa = inet_alloc_ifa(); - if (ifa == NULL) { + if (ifa == NULL) /* * A potential indev allocation can be left alive, it stays * assigned to its device and is destroy with it. */ - err = -ENOBUFS; goto errout; - } ipv4_devconf_setall(in_dev); in_dev_hold(in_dev); -- 1.5.3.rc5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 6/6] [NETNS]: Lookup in FIB semantic hashes taking into account the namespace.
The namespace is not available in the fib_sync_down_addr, add it as a parameter. Looking up a device by the pointer to it is OK. Looking up using a result from fib_trie/fib_hash table lookup is also safe. No need to fix that at all. So, just fix lookup by address and insertion to the hash table path. Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] --- include/net/ip_fib.h |2 +- net/ipv4/fib_frontend.c |2 +- net/ipv4/fib_semantics.c |6 +- 3 files changed, 7 insertions(+), 3 deletions(-) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index cb0df37..90d1175 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -220,7 +220,7 @@ extern void fib_select_default(struct net *net, const struct flowi *flp, /* Exported by fib_semantics.c */ extern int ip_fib_check_default(__be32 gw, struct net_device *dev); extern int fib_sync_down_dev(struct net_device *dev, int force); -extern int fib_sync_down_addr(__be32 local); +extern int fib_sync_down_addr(struct net *net, __be32 local); extern int fib_sync_up(struct net_device *dev); extern __be32 __fib_res_prefsrc(struct fib_result *res); extern void fib_select_multipath(const struct flowi *flp, struct fib_result *res); diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index d69ffa2..86ff271 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -808,7 +808,7 @@ static void fib_del_ifaddr(struct in_ifaddr *ifa) First of all, we scan fib_info list searching for stray nexthop entries, then ignite fib_flush. */ - if (fib_sync_down_addr(ifa-ifa_local)) + if (fib_sync_down_addr(dev-nd_net, ifa-ifa_local)) fib_flush(dev-nd_net); } } diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 97cc494..a13c847 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -229,6 +229,8 @@ static struct fib_info *fib_find_info(const struct fib_info *nfi) head = fib_info_hash[hash]; hlist_for_each_entry(fi, node, head, fib_hash) { + if (fi-fib_net != nfi-fib_net) + continue; if (fi-fib_nhs != nfi-fib_nhs) continue; if (nfi-fib_protocol == fi-fib_protocol @@ -1031,7 +1033,7 @@ nla_put_failure: referring to it. - device went down - we must shutdown all nexthops going via it. */ -int fib_sync_down_addr(__be32 local) +int fib_sync_down_addr(struct net *net, __be32 local) { int ret = 0; unsigned int hash = fib_laddr_hashfn(local); @@ -1043,6 +1045,8 @@ int fib_sync_down_addr(__be32 local) return 0; hlist_for_each_entry(fi, node, head, fib_lhash) { + if (fi-fib_net != net) + continue; if (fi-fib_prefsrc == local) { fi-fib_flags |= RTNH_F_DEAD; ret++; -- 1.5.3.rc5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [NFS]: Lock daemon start/stop rework.
Christoph Hellwig wrote: > On Wed, Jan 30, 2008 at 02:41:34PM +0300, Denis V. Lunev wrote: >> The pid of the locking daemon can be substituted with a task struct >> without a problem. Namely, the value if filled in the context of the lockd >> thread and used in lockd_up/lockd_down. >> >> It is possible to save task struct instead and use it to kill the process. >> The safety of this operation is guaranteed by the RCU, i.e. task can't >> disappear without passing a quiscent state. > > We have a patch series pending on the nfs list that does this plus a lot > more in the area. > > where can I have to look them? :) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] [NFS]: Lock daemon start/stop rework.
The pid of the locking daemon can be substituted with a task struct without a problem. Namely, the value if filled in the context of the lockd thread and used in lockd_up/lockd_down. It is possible to save task struct instead and use it to kill the process. The safety of this operation is guaranteed by the RCU, i.e. task can't disappear without passing a quiscent state. Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]> --- fs/lockd/svc.c | 38 +- 1 files changed, 25 insertions(+), 13 deletions(-) diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c index 82e2192..4979e70 100644 --- a/fs/lockd/svc.c +++ b/fs/lockd/svc.c @@ -48,7 +48,7 @@ EXPORT_SYMBOL(nlmsvc_ops); static DEFINE_MUTEX(nlmsvc_mutex); static unsigned intnlmsvc_users; -static pid_t nlmsvc_pid; +static struct task_struct *nlmsvc_task; static struct svc_serv *nlmsvc_serv; intnlmsvc_grace_period; unsigned long nlmsvc_timeout; @@ -128,7 +128,8 @@ lockd(struct svc_rqst *rqstp) /* * Let our maker know we're running. */ - nlmsvc_pid = current->pid; + rcu_assign_pointer(nlmsvc_task, current); + nlmsvc_serv = rqstp->rq_server; complete(_start_done); @@ -151,7 +152,7 @@ lockd(struct svc_rqst *rqstp) * NFS mount or NFS daemon has gone away, and we've been sent a * signal, or else another process has taken over our job. */ - while ((nlmsvc_users || !signalled()) && nlmsvc_pid == current->pid) { + while ((nlmsvc_users || !signalled()) && nlmsvc_task == current) { long timeout = MAX_SCHEDULE_TIMEOUT; char buf[RPC_MAX_ADDRBUFLEN]; @@ -200,12 +201,12 @@ lockd(struct svc_rqst *rqstp) * Check whether there's a new lockd process before * shutting down the hosts and clearing the slot. */ - if (!nlmsvc_pid || current->pid == nlmsvc_pid) { + if (nlmsvc_task == NULL || current == nlmsvc_task) { if (nlmsvc_ops) nlmsvc_invalidate_all(); nlm_shutdown_hosts(); - nlmsvc_pid = 0; nlmsvc_serv = NULL; + rcu_assign_pointer(nlmsvc_task, NULL); } else printk(KERN_DEBUG "lockd: new process, skipping host shutdown\n"); @@ -273,7 +274,7 @@ lockd_up(int proto) /* Maybe add a 'family' option when IPv6 is supported ?? */ /* * Check whether we're already up and running. */ - if (nlmsvc_pid) { + if (nlmsvc_task != NULL) { if (proto) error = make_socks(nlmsvc_serv, proto); goto out; @@ -329,38 +330,49 @@ void lockd_down(void) { static int warned; + struct task_struct *tsk; mutex_lock(_mutex); + rcu_read_lock(); + tsk = rcu_dereference(nlmsvc_task); if (nlmsvc_users) { if (--nlmsvc_users) - goto out; + goto out_rcu_unlock; } else - printk(KERN_WARNING "lockd_down: no users! pid=%d\n", nlmsvc_pid); + printk(KERN_WARNING "lockd_down: no users! pid=%d\n", + task_pid_nr(tsk)); - if (!nlmsvc_pid) { + if (tsk == NULL) { if (warned++ == 0) printk(KERN_WARNING "lockd_down: no lockd running.\n"); - goto out; + goto out_rcu_unlock; } warned = 0; - kill_proc(nlmsvc_pid, SIGKILL, 1); + send_sig(SIGKILL, tsk, 1); + rcu_read_unlock(); + /* * Wait for the lockd process to exit, but since we're holding * the lockd semaphore, we can't wait around forever ... */ clear_thread_flag(TIF_SIGPENDING); interruptible_sleep_on_timeout(_exit, HZ); - if (nlmsvc_pid) { + if (nlmsvc_task != NULL) { printk(KERN_WARNING "lockd_down: lockd failed to exit, clearing pid\n"); - nlmsvc_pid = 0; + rcu_assign_pointer(nlmsvc_task, NULL); } spin_lock_irq(>sighand->siglock); recalc_sigpending(); spin_unlock_irq(>sighand->siglock); out: mutex_unlock(_mutex); + return; + +out_rcu_unlock: + rcu_read_unlock(); + goto out; } EXPORT_SYMBOL(lockd_down); -- 1.5.3.rc5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] [NFS]: Lock daemon start/stop rework.
The pid of the locking daemon can be substituted with a task struct without a problem. Namely, the value if filled in the context of the lockd thread and used in lockd_up/lockd_down. It is possible to save task struct instead and use it to kill the process. The safety of this operation is guaranteed by the RCU, i.e. task can't disappear without passing a quiscent state. Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] --- fs/lockd/svc.c | 38 +- 1 files changed, 25 insertions(+), 13 deletions(-) diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c index 82e2192..4979e70 100644 --- a/fs/lockd/svc.c +++ b/fs/lockd/svc.c @@ -48,7 +48,7 @@ EXPORT_SYMBOL(nlmsvc_ops); static DEFINE_MUTEX(nlmsvc_mutex); static unsigned intnlmsvc_users; -static pid_t nlmsvc_pid; +static struct task_struct *nlmsvc_task; static struct svc_serv *nlmsvc_serv; intnlmsvc_grace_period; unsigned long nlmsvc_timeout; @@ -128,7 +128,8 @@ lockd(struct svc_rqst *rqstp) /* * Let our maker know we're running. */ - nlmsvc_pid = current-pid; + rcu_assign_pointer(nlmsvc_task, current); + nlmsvc_serv = rqstp-rq_server; complete(lockd_start_done); @@ -151,7 +152,7 @@ lockd(struct svc_rqst *rqstp) * NFS mount or NFS daemon has gone away, and we've been sent a * signal, or else another process has taken over our job. */ - while ((nlmsvc_users || !signalled()) nlmsvc_pid == current-pid) { + while ((nlmsvc_users || !signalled()) nlmsvc_task == current) { long timeout = MAX_SCHEDULE_TIMEOUT; char buf[RPC_MAX_ADDRBUFLEN]; @@ -200,12 +201,12 @@ lockd(struct svc_rqst *rqstp) * Check whether there's a new lockd process before * shutting down the hosts and clearing the slot. */ - if (!nlmsvc_pid || current-pid == nlmsvc_pid) { + if (nlmsvc_task == NULL || current == nlmsvc_task) { if (nlmsvc_ops) nlmsvc_invalidate_all(); nlm_shutdown_hosts(); - nlmsvc_pid = 0; nlmsvc_serv = NULL; + rcu_assign_pointer(nlmsvc_task, NULL); } else printk(KERN_DEBUG lockd: new process, skipping host shutdown\n); @@ -273,7 +274,7 @@ lockd_up(int proto) /* Maybe add a 'family' option when IPv6 is supported ?? */ /* * Check whether we're already up and running. */ - if (nlmsvc_pid) { + if (nlmsvc_task != NULL) { if (proto) error = make_socks(nlmsvc_serv, proto); goto out; @@ -329,38 +330,49 @@ void lockd_down(void) { static int warned; + struct task_struct *tsk; mutex_lock(nlmsvc_mutex); + rcu_read_lock(); + tsk = rcu_dereference(nlmsvc_task); if (nlmsvc_users) { if (--nlmsvc_users) - goto out; + goto out_rcu_unlock; } else - printk(KERN_WARNING lockd_down: no users! pid=%d\n, nlmsvc_pid); + printk(KERN_WARNING lockd_down: no users! pid=%d\n, + task_pid_nr(tsk)); - if (!nlmsvc_pid) { + if (tsk == NULL) { if (warned++ == 0) printk(KERN_WARNING lockd_down: no lockd running.\n); - goto out; + goto out_rcu_unlock; } warned = 0; - kill_proc(nlmsvc_pid, SIGKILL, 1); + send_sig(SIGKILL, tsk, 1); + rcu_read_unlock(); + /* * Wait for the lockd process to exit, but since we're holding * the lockd semaphore, we can't wait around forever ... */ clear_thread_flag(TIF_SIGPENDING); interruptible_sleep_on_timeout(lockd_exit, HZ); - if (nlmsvc_pid) { + if (nlmsvc_task != NULL) { printk(KERN_WARNING lockd_down: lockd failed to exit, clearing pid\n); - nlmsvc_pid = 0; + rcu_assign_pointer(nlmsvc_task, NULL); } spin_lock_irq(current-sighand-siglock); recalc_sigpending(); spin_unlock_irq(current-sighand-siglock); out: mutex_unlock(nlmsvc_mutex); + return; + +out_rcu_unlock: + rcu_read_unlock(); + goto out; } EXPORT_SYMBOL(lockd_down); -- 1.5.3.rc5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Debugfs compile fix.
Debugfs is not compiled without CONFIG_SYSFS in net-2.6 tree. Move kobject_create_and_add under appropriate ifdef. The fix looks correct from a first glance, but may be the dependency should be added into the Kconfig. Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]> --- fs/debugfs/inode.c |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c index d26e282..61cc937 100644 --- a/fs/debugfs/inode.c +++ b/fs/debugfs/inode.c @@ -432,9 +432,11 @@ static int __init debugfs_init(void) { int retval; +#ifdef CONFIG_SYSFS debug_kobj = kobject_create_and_add("debug", kernel_kobj); if (!debug_kobj) return -EINVAL; +#endif retval = register_filesystem(_fs_type); if (retval) -- 1.5.3.rc5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Debugfs compile fix.
Debugfs is not compiled without CONFIG_SYSFS in net-2.6 tree. Move kobject_create_and_add under appropriate ifdef. The fix looks correct from a first glance, but may be the dependency should be added into the Kconfig. Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] --- fs/debugfs/inode.c |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c index d26e282..61cc937 100644 --- a/fs/debugfs/inode.c +++ b/fs/debugfs/inode.c @@ -432,9 +432,11 @@ static int __init debugfs_init(void) { int retval; +#ifdef CONFIG_SYSFS debug_kobj = kobject_create_and_add(debug, kernel_kobj); if (!debug_kobj) return -EINVAL; +#endif retval = register_filesystem(debug_fs_type); if (retval) -- 1.5.3.rc5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.24-rc4] proc: Remove/Fix proc generic d_revalidate
could you, plz, check patch sent by Eric above in this thread. I have tried it on my test node and it works for module you have provided. The problem exists without it. Regards, Den Petr Vandrovec wrote: > Eric W. Biederman wrote: >> Ultimately to implement /proc perfectly we need an implementation >> of d_revalidate because files and directories can be removed behind >> the back of the VFS, and d_revalidate is the only way we can let >> the VFS know that this has happened. >> >> So until we get a proper test for keeping dentries in the dcache >> fix the current d_revalidate method by completely removing it. This >> returns us to the current status quo. > > Hello, >I know that I'm late to the party, but mount points is not only > problem with d_revalidate. With your patch in place module below gets > refcount incremented by two every time I do 'ls -la /proc/fs/vmblock'. > > > #include > #include > #include > > static int vmblockinit(void) { >struct proc_dir_entry *controlProcDirEntry; > >/* Create /proc/fs/vmblock */ >controlProcDirEntry = proc_mkdir("vmblock", proc_root_fs); >if (!controlProcDirEntry) { > printk(KERN_DEBUG "Bad...\n"); > return -EINVAL; >} >controlProcDirEntry->owner = THIS_MODULE; >return 0; > } > > static void vmblockexit(void) { >remove_proc_entry("vmblock", proc_root_fs); > } > > module_init(vmblockinit); > module_exit(vmblockexit); > > > (code comes from VMware's vmblock module, > http://sourceforge.net/project/showfiles.php?group_id=204462) > Thanks, > Petr > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.24-rc4] proc: Remove/Fix proc generic d_revalidate
could you, plz, check patch sent by Eric above in this thread. I have tried it on my test node and it works for module you have provided. The problem exists without it. Regards, Den Petr Vandrovec wrote: Eric W. Biederman wrote: Ultimately to implement /proc perfectly we need an implementation of d_revalidate because files and directories can be removed behind the back of the VFS, and d_revalidate is the only way we can let the VFS know that this has happened. So until we get a proper test for keeping dentries in the dcache fix the current d_revalidate method by completely removing it. This returns us to the current status quo. Hello, I know that I'm late to the party, but mount points is not only problem with d_revalidate. With your patch in place module below gets refcount incremented by two every time I do 'ls -la /proc/fs/vmblock'. #include linux/kernel.h #include linux/module.h #include linux/proc_fs.h static int vmblockinit(void) { struct proc_dir_entry *controlProcDirEntry; /* Create /proc/fs/vmblock */ controlProcDirEntry = proc_mkdir(vmblock, proc_root_fs); if (!controlProcDirEntry) { printk(KERN_DEBUG Bad...\n); return -EINVAL; } controlProcDirEntry-owner = THIS_MODULE; return 0; } static void vmblockexit(void) { remove_proc_entry(vmblock, proc_root_fs); } module_init(vmblockinit); module_exit(vmblockexit); (code comes from VMware's vmblock module, http://sourceforge.net/project/showfiles.php?group_id=204462) Thanks, Petr -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.24-rc3] Fix /proc/net breakage
Andrew Morton wrote: > On Fri, 07 Dec 2007 04:51:37 + David Woodhouse <[EMAIL PROTECTED]> wrote: > >> On Mon, 2007-11-26 at 15:17 -0700, Eric W. Biederman wrote: >>> Well I clearly goofed when I added the initial network namespace support >>> for /proc/net. Currently things work but there are odd details visible >>> to user space, even when we have a single network namespace. >>> >>> Since we do not cache proc_dir_entry dentries at the moment we can >>> just modify ->lookup to return a different directory inode depending >>> on the network namespace of the process looking at /proc/net, replacing >>> the current technique of using a magic and fragile follow_link method. >>> >>> To accomplish that this patch: >>> - introduces a shadow_proc method to allow different dentries to >>> be returned from proc_lookup. >>> - Removes the old /proc/net follow_link magic >>> - Fixes a weakness in our not caching of proc generic dentries. >>> >>> As shadow_proc uses a task struct to decided which dentry to return we >>> can go back later and fix the proc generic caching without modifying any >>> code that >>> uses the shadow_proc method. >>> >>> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]> >>> --- >>> fs/proc/generic.c | 12 ++- >>> fs/proc/proc_net.c | 86 >>> +++ >>> include/linux/proc_fs.h |3 ++ >>> 3 files changed, 19 insertions(+), 82 deletions(-) >> (commit 2b1e300a9dfc3196ccddf6f1d74b91b7af55e416) >> >> This seems to have broken the use of /proc/bus/usb as a mountpoint. It >> always appears empty now, whatever's supposed to be mounted there. >> > > Yes. Denis and Eric are tossing around competing patches but afaik nobody > is happy with any of them. Guys, could we get this sorted soonish please? > Andrew, I become too relaxed after receiving "Tested-by: Giacomo Catenazzi <[EMAIL PROTECTED]>" Eric, I believe that reverting an original behavior is better than your new one as - you introduce search into the depth by calling have_submounts(dentry) during revalidation for all(!) /proc dentries - your shadowing behavior will be broken if you'll mount something in the depth of shadowed tree (this can be done as a DoS attempt) As a last minute call, may be it will be better to pin network namespace like a pid namespace during mount to avoid this crap at all? Regards, Den -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.24-rc3] Fix /proc/net breakage
Andrew Morton wrote: On Fri, 07 Dec 2007 04:51:37 + David Woodhouse [EMAIL PROTECTED] wrote: On Mon, 2007-11-26 at 15:17 -0700, Eric W. Biederman wrote: Well I clearly goofed when I added the initial network namespace support for /proc/net. Currently things work but there are odd details visible to user space, even when we have a single network namespace. Since we do not cache proc_dir_entry dentries at the moment we can just modify -lookup to return a different directory inode depending on the network namespace of the process looking at /proc/net, replacing the current technique of using a magic and fragile follow_link method. To accomplish that this patch: - introduces a shadow_proc method to allow different dentries to be returned from proc_lookup. - Removes the old /proc/net follow_link magic - Fixes a weakness in our not caching of proc generic dentries. As shadow_proc uses a task struct to decided which dentry to return we can go back later and fix the proc generic caching without modifying any code that uses the shadow_proc method. Signed-off-by: Eric W. Biederman [EMAIL PROTECTED] --- fs/proc/generic.c | 12 ++- fs/proc/proc_net.c | 86 +++ include/linux/proc_fs.h |3 ++ 3 files changed, 19 insertions(+), 82 deletions(-) (commit 2b1e300a9dfc3196ccddf6f1d74b91b7af55e416) This seems to have broken the use of /proc/bus/usb as a mountpoint. It always appears empty now, whatever's supposed to be mounted there. Yes. Denis and Eric are tossing around competing patches but afaik nobody is happy with any of them. Guys, could we get this sorted soonish please? Andrew, I become too relaxed after receiving Tested-by: Giacomo Catenazzi [EMAIL PROTECTED] Eric, I believe that reverting an original behavior is better than your new one as - you introduce search into the depth by calling have_submounts(dentry) during revalidation for all(!) /proc dentries - your shadowing behavior will be broken if you'll mount something in the depth of shadowed tree (this can be done as a DoS attempt) As a last minute call, may be it will be better to pin network namespace like a pid namespace during mount to avoid this crap at all? Regards, Den -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] proc: Do not invalidate dentries with submounts
you have changed the behavior of revalidation by shadows. I think it will be better to restore it and keep new one for shadows (and below) only, which has been done by my yesterday patch. Regards, Den Eric W. Biederman wrote: > If the dcache path to a mount point is ever broken it becomes > impossible to unmount it, and we leak a vfsmount. Therefore it is not > valid to invalidate dentries with mount points at or below them. > > This patch uses the have_submounts test as the other network > filesystem revalidate routines do. > > Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]> > --- > fs/proc/base.c|9 + > fs/proc/generic.c |5 + > 2 files changed, 14 insertions(+), 0 deletions(-) > > diff --git a/fs/proc/base.c b/fs/proc/base.c > index 0e71707..552d752 100644 > --- a/fs/proc/base.c > +++ b/fs/proc/base.c > @@ -1216,6 +1216,9 @@ static int pid_revalidate(struct dentry *dentry, struct > nameidata *nd) > put_task_struct(task); > return 1; > } > + /* Force validity if something is mounted under us */ > + if (inode && S_ISDIR(inode->i_mode) && have_submounts(dentry)) > + return 1; > d_drop(dentry); > return 0; > } > @@ -1393,6 +1396,9 @@ static int tid_fd_revalidate(struct dentry *dentry, > struct nameidata *nd) > } > put_task_struct(task); > } > + /* Force validity if something is mounted under us */ > + if (inode && S_ISDIR(inode->i_mode) && have_submounts(dentry)) > + return 1; > d_drop(dentry); > return 0; > } > @@ -2056,6 +2062,9 @@ static int proc_base_revalidate(struct dentry *dentry, > struct nameidata *nd) > put_task_struct(task); > return 1; > } > + /* Force validity if something is mounted under us */ > + if (inode && S_ISDIR(inode->i_mode) && have_submounts(dentry)) > + return 1; > d_drop(dentry); > return 0; > } > diff --git a/fs/proc/generic.c b/fs/proc/generic.c > index 4abd568..233dcdc 100644 > --- a/fs/proc/generic.c > +++ b/fs/proc/generic.c > @@ -370,6 +370,11 @@ static int proc_delete_dentry(struct dentry * dentry) > > static int proc_revalidate_dentry(struct dentry *dentry, struct nameidata > *nd) > { > + struct inode *inode = dentry->d_inode; > + > + /* Force validity if something is mounted under us */ > + if (inode && S_ISDIR(inode->i_mode) && have_submounts(dentry)) > + return 1; > d_drop(dentry); > return 0; > } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] proc: Do not invalidate dentries with submounts
you have changed the behavior of revalidation by shadows. I think it will be better to restore it and keep new one for shadows (and below) only, which has been done by my yesterday patch. Regards, Den Eric W. Biederman wrote: If the dcache path to a mount point is ever broken it becomes impossible to unmount it, and we leak a vfsmount. Therefore it is not valid to invalidate dentries with mount points at or below them. This patch uses the have_submounts test as the other network filesystem revalidate routines do. Signed-off-by: Eric W. Biederman [EMAIL PROTECTED] --- fs/proc/base.c|9 + fs/proc/generic.c |5 + 2 files changed, 14 insertions(+), 0 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 0e71707..552d752 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1216,6 +1216,9 @@ static int pid_revalidate(struct dentry *dentry, struct nameidata *nd) put_task_struct(task); return 1; } + /* Force validity if something is mounted under us */ + if (inode S_ISDIR(inode-i_mode) have_submounts(dentry)) + return 1; d_drop(dentry); return 0; } @@ -1393,6 +1396,9 @@ static int tid_fd_revalidate(struct dentry *dentry, struct nameidata *nd) } put_task_struct(task); } + /* Force validity if something is mounted under us */ + if (inode S_ISDIR(inode-i_mode) have_submounts(dentry)) + return 1; d_drop(dentry); return 0; } @@ -2056,6 +2062,9 @@ static int proc_base_revalidate(struct dentry *dentry, struct nameidata *nd) put_task_struct(task); return 1; } + /* Force validity if something is mounted under us */ + if (inode S_ISDIR(inode-i_mode) have_submounts(dentry)) + return 1; d_drop(dentry); return 0; } diff --git a/fs/proc/generic.c b/fs/proc/generic.c index 4abd568..233dcdc 100644 --- a/fs/proc/generic.c +++ b/fs/proc/generic.c @@ -370,6 +370,11 @@ static int proc_delete_dentry(struct dentry * dentry) static int proc_revalidate_dentry(struct dentry *dentry, struct nameidata *nd) { + struct inode *inode = dentry-d_inode; + + /* Force validity if something is mounted under us */ + if (inode S_ISDIR(inode-i_mode) have_submounts(dentry)) + return 1; d_drop(dentry); return 0; } -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] lost content of /proc/sys/fs/binfmt_misc
/proc/sys/fs/binfmt_misc dentry disappeared during d_revalidate. d_revalidate only dentries from shadowed one and below. http://bugzilla.kernel.org/show_bug.cgi?id=9504 CC: Eric W. Biederman <[EMAIL PROTECTED]> CC: Marcus Better <[EMAIL PROTECTED]> Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]> diff --git a/fs/proc/generic.c b/fs/proc/generic.c index 5fccfe2..1497ac4 100644 --- a/fs/proc/generic.c +++ b/fs/proc/generic.c @@ -380,12 +380,17 @@ static int proc_revalidate_dentry(struct dentry *dentry, struct nameidata *nd) return 0; } -static struct dentry_operations proc_dentry_operations = +static struct dentry_operations proc_dentry_shadow_operations = { .d_delete = proc_delete_dentry, .d_revalidate = proc_revalidate_dentry, }; +static struct dentry_operations proc_dentry_operations = +{ + .d_delete = proc_delete_dentry, +}; + /* * Don't create negative dentries here, return -ENOENT by hand * instead. @@ -394,6 +399,7 @@ struct dentry *proc_lookup(struct inode * dir, struct dentry *dentry, struct nam { struct inode *inode = NULL; struct proc_dir_entry * de; + int use_shadow = 0; int error = -ENOENT; lock_kernel(); @@ -406,8 +412,10 @@ struct dentry *proc_lookup(struct inode * dir, struct dentry *dentry, struct nam if (!memcmp(dentry->d_name.name, de->name, de->namelen)) { unsigned int ino; - if (de->shadow_proc) + if (de->shadow_proc) { de = de->shadow_proc(current, de); + use_shadow = 1; + } ino = de->low_ino; de_get(de); spin_unlock(_subdir_lock); @@ -423,6 +431,8 @@ struct dentry *proc_lookup(struct inode * dir, struct dentry *dentry, struct nam if (inode) { dentry->d_op = _dentry_operations; + dentry->d_op = use_shadow ? + _dentry_shadow_operations : dentry->d_parent->d_op; d_add(dentry, inode); return NULL; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] lost content of /proc/sys/fs/binfmt_misc
/proc/sys/fs/binfmt_misc dentry disappeared during d_revalidate. d_revalidate only dentries from shadowed one and below. http://bugzilla.kernel.org/show_bug.cgi?id=9504 CC: Eric W. Biederman [EMAIL PROTECTED] CC: Marcus Better [EMAIL PROTECTED] Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] diff --git a/fs/proc/generic.c b/fs/proc/generic.c index 5fccfe2..1497ac4 100644 --- a/fs/proc/generic.c +++ b/fs/proc/generic.c @@ -380,12 +380,17 @@ static int proc_revalidate_dentry(struct dentry *dentry, struct nameidata *nd) return 0; } -static struct dentry_operations proc_dentry_operations = +static struct dentry_operations proc_dentry_shadow_operations = { .d_delete = proc_delete_dentry, .d_revalidate = proc_revalidate_dentry, }; +static struct dentry_operations proc_dentry_operations = +{ + .d_delete = proc_delete_dentry, +}; + /* * Don't create negative dentries here, return -ENOENT by hand * instead. @@ -394,6 +399,7 @@ struct dentry *proc_lookup(struct inode * dir, struct dentry *dentry, struct nam { struct inode *inode = NULL; struct proc_dir_entry * de; + int use_shadow = 0; int error = -ENOENT; lock_kernel(); @@ -406,8 +412,10 @@ struct dentry *proc_lookup(struct inode * dir, struct dentry *dentry, struct nam if (!memcmp(dentry-d_name.name, de-name, de-namelen)) { unsigned int ino; - if (de-shadow_proc) + if (de-shadow_proc) { de = de-shadow_proc(current, de); + use_shadow = 1; + } ino = de-low_ino; de_get(de); spin_unlock(proc_subdir_lock); @@ -423,6 +431,8 @@ struct dentry *proc_lookup(struct inode * dir, struct dentry *dentry, struct nam if (inode) { dentry-d_op = proc_dentry_operations; + dentry-d_op = use_shadow ? + proc_dentry_shadow_operations : dentry-d_parent-d_op; d_add(dentry, inode); return NULL; } -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] AB-BA deadlock in drop_caches sysctl (resend, the one sent was for 2.6.18)
Andrew Morton wrote: > On Mon, 3 Dec 2007 16:52:47 +0300 > "Denis V. Lunev" <[EMAIL PROTECTED]> wrote: > >> There is a AB-BA deadlock regarding drop_caches sysctl. Here are the code >> paths: >> >> drop_pagecache >> spin_lock(_lock); >> invalidate_mapping_pages >> try_to_release_page >> ext3_releasepage >> journal_try_to_free_buffers >> __journal_try_to_free_buffer >> spin_lock(>j_list_lock); >> >> __journal_temp_unlink_buffer (called under journal->j_list_lock by comments) >> mark_buffer_dirty >> __set_page_dirty >> __mark_inode_dirty >> spin_lock(_lock); >> >> The patch tries to address the issue - it drops inode_lock before digging >> into >> invalidate_inode_pages. This seems sane as inode hold should not gone from >> the >> list and should not change its place. >> >> Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]> >> -- >> diff --git a/fs/drop_caches.c b/fs/drop_caches.c >> index 59375ef..4ac80d8 100644 >> --- a/fs/drop_caches.c >> +++ b/fs/drop_caches.c >> @@ -14,15 +14,27 @@ int sysctl_drop_caches; >> >> static void drop_pagecache_sb(struct super_block *sb) >> { >> -struct inode *inode; >> +struct inode *inode, *old; >> >> +old = NULL; >> spin_lock(_lock); >> list_for_each_entry(inode, >s_inodes, i_sb_list) { >> if (inode->i_state & (I_FREEING|I_WILL_FREE)) >> continue; >> -__invalidate_mapping_pages(inode->i_mapping, 0, -1, true); >> +__iget(inode); >> +spin_unlock(_lock); >> + >> +if (old != NULL) >> +iput(old); >> +invalidate_mapping_pages(inode->i_mapping, 0, -1); >> +old = inode; >> + >> +spin_lock(_lock); >> } >> spin_unlock(_lock); >> + >> +if (old != NULL) >> +iput(old); >> } > > We need to hold onto inode_lock while walking sb->s_inodes. Otherwise the > inode which we're currently looking at could get removed from i_sb_list and > bad things will happen (drop_pagecache_sb will go infinite, or will oops, I > guess). as far as I understand, there are the following place removing inode from i_sb_list: - generic_delete_inode (via iput_final) - generic_forget_inode (via iput_final) - hugetlbfs_forget_inode - dispose_list after the check under inode_lock for i_count So, the patch is sane from disappearing point of view: - I hold inode under inode_lock - and iput it after new inode to clean has been found and hold Nevertheless we'll think a bit about ext3 fix. Though other staff like gfs2 etc can also be affected. Regards, Den -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] AB-BA deadlock in drop_caches sysctl (resend, the one sent was for 2.6.18)
Andrew Morton wrote: On Mon, 3 Dec 2007 16:52:47 +0300 Denis V. Lunev [EMAIL PROTECTED] wrote: There is a AB-BA deadlock regarding drop_caches sysctl. Here are the code paths: drop_pagecache spin_lock(inode_lock); invalidate_mapping_pages try_to_release_page ext3_releasepage journal_try_to_free_buffers __journal_try_to_free_buffer spin_lock(journal-j_list_lock); __journal_temp_unlink_buffer (called under journal-j_list_lock by comments) mark_buffer_dirty __set_page_dirty __mark_inode_dirty spin_lock(inode_lock); The patch tries to address the issue - it drops inode_lock before digging into invalidate_inode_pages. This seems sane as inode hold should not gone from the list and should not change its place. Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] -- diff --git a/fs/drop_caches.c b/fs/drop_caches.c index 59375ef..4ac80d8 100644 --- a/fs/drop_caches.c +++ b/fs/drop_caches.c @@ -14,15 +14,27 @@ int sysctl_drop_caches; static void drop_pagecache_sb(struct super_block *sb) { -struct inode *inode; +struct inode *inode, *old; +old = NULL; spin_lock(inode_lock); list_for_each_entry(inode, sb-s_inodes, i_sb_list) { if (inode-i_state (I_FREEING|I_WILL_FREE)) continue; -__invalidate_mapping_pages(inode-i_mapping, 0, -1, true); +__iget(inode); +spin_unlock(inode_lock); + +if (old != NULL) +iput(old); +invalidate_mapping_pages(inode-i_mapping, 0, -1); +old = inode; + +spin_lock(inode_lock); } spin_unlock(inode_lock); + +if (old != NULL) +iput(old); } We need to hold onto inode_lock while walking sb-s_inodes. Otherwise the inode which we're currently looking at could get removed from i_sb_list and bad things will happen (drop_pagecache_sb will go infinite, or will oops, I guess). as far as I understand, there are the following place removing inode from i_sb_list: - generic_delete_inode (via iput_final) - generic_forget_inode (via iput_final) - hugetlbfs_forget_inode - dispose_list after the check under inode_lock for i_count So, the patch is sane from disappearing point of view: - I hold inode under inode_lock - and iput it after new inode to clean has been found and hold Nevertheless we'll think a bit about ext3 fix. Though other staff like gfs2 etc can also be affected. Regards, Den -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] AB-BA deadlock in drop_caches sysctl (resend, the one sent was for 2.6.18)
There is a AB-BA deadlock regarding drop_caches sysctl. Here are the code paths: drop_pagecache spin_lock(_lock); invalidate_mapping_pages try_to_release_page ext3_releasepage journal_try_to_free_buffers __journal_try_to_free_buffer spin_lock(>j_list_lock); __journal_temp_unlink_buffer (called under journal->j_list_lock by comments) mark_buffer_dirty __set_page_dirty __mark_inode_dirty spin_lock(_lock); The patch tries to address the issue - it drops inode_lock before digging into invalidate_inode_pages. This seems sane as inode hold should not gone from the list and should not change its place. Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]> -- diff --git a/fs/drop_caches.c b/fs/drop_caches.c index 59375ef..4ac80d8 100644 --- a/fs/drop_caches.c +++ b/fs/drop_caches.c @@ -14,15 +14,27 @@ int sysctl_drop_caches; static void drop_pagecache_sb(struct super_block *sb) { - struct inode *inode; + struct inode *inode, *old; + old = NULL; spin_lock(_lock); list_for_each_entry(inode, >s_inodes, i_sb_list) { if (inode->i_state & (I_FREEING|I_WILL_FREE)) continue; - __invalidate_mapping_pages(inode->i_mapping, 0, -1, true); + __iget(inode); + spin_unlock(_lock); + + if (old != NULL) + iput(old); + invalidate_mapping_pages(inode->i_mapping, 0, -1); + old = inode; + + spin_lock(_lock); } spin_unlock(_lock); + + if (old != NULL) + iput(old); } void drop_pagecache(void) diff --git a/include/linux/fs.h b/include/linux/fs.h index b3ec4a4..b424dff 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1649,9 +1649,6 @@ extern int __invalidate_device(struct block_device *); extern int invalidate_partition(struct gendisk *, int); #endif extern int invalidate_inodes(struct super_block *); -unsigned long __invalidate_mapping_pages(struct address_space *mapping, - pgoff_t start, pgoff_t end, - bool be_atomic); unsigned long invalidate_mapping_pages(struct address_space *mapping, pgoff_t start, pgoff_t end); diff --git a/mm/truncate.c b/mm/truncate.c index cadc156..5cbe1e7 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -266,8 +266,21 @@ void truncate_inode_pages(struct address_space *mapping, loff_t lstart) } EXPORT_SYMBOL(truncate_inode_pages); -unsigned long __invalidate_mapping_pages(struct address_space *mapping, - pgoff_t start, pgoff_t end, bool be_atomic) +/** + * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode + * @mapping: the address_space which holds the pages to invalidate + * @start: the offset 'from' which to invalidate + * @end: the offset 'to' which to invalidate (inclusive) + * + * This function only removes the unlocked pages, if you want to + * remove all the pages of one inode, you must call truncate_inode_pages. + * + * invalidate_mapping_pages() will not block on IO activity. It will not + * invalidate pages which are dirty, locked, under writeback or mapped into + * pagetables. + */ +unsigned long invalidate_mapping_pages(struct address_space *mapping, + pgoff_t start, pgoff_t end) { struct pagevec pvec; pgoff_t next = start; @@ -308,30 +321,10 @@ unlock: break; } pagevec_release(); - if (likely(!be_atomic)) - cond_resched(); + cond_resched(); } return ret; } - -/** - * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode - * @mapping: the address_space which holds the pages to invalidate - * @start: the offset 'from' which to invalidate - * @end: the offset 'to' which to invalidate (inclusive) - * - * This function only removes the unlocked pages, if you want to - * remove all the pages of one inode, you must call truncate_inode_pages. - * - * invalidate_mapping_pages() will not block on IO activity. It will not - * invalidate pages which are dirty, locked, under writeback or mapped into - * pagetables. - */ -unsigned long invalidate_mapping_pages(struct address_space *mapping, - pgoff_t start, pgoff_t end) -{ - return __invalidate_mapping_pages(mapping, start, end, false); -} EXPORT_SYMBOL(invalidate_mapping_pages); /* -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] AB-BA deadlock in drop_caches sysctl
There is a AB-BA deadlock regarding drop_caches sysctl. Here are the code paths: drop_pagecache spin_lock(_lock); invalidate_mapping_pages try_to_release_page ext3_releasepage journal_try_to_free_buffers __journal_try_to_free_buffer spin_lock(>j_list_lock); journal_commit_transaction spin_lock(>j_list_lock); __journal_remove_checkpoint __journal_refile_buffer __journal_unfile_buffer __journal_temp_unlink_buffer __set_page_dirty_nobuffers __mark_inode_dirt spin_lock(_lock); The patch tries to address the issue - it drops inode_lock before digging into invalidate_inode_pages. This seems sane as inode hold should not gone from the list and should not change its place. Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]> --- ./fs/drop_caches.c.marker 2006-09-20 07:42:06.0 +0400 +++ ./fs/drop_caches.c 2007-12-03 15:43:44.0 +0300 @@ -14,15 +14,27 @@ int sysctl_drop_caches; static void drop_pagecache_sb(struct super_block *sb) { - struct inode *inode; + struct inode *inode, *old; + old = NULL; spin_lock(_lock); list_for_each_entry(inode, >s_inodes, i_sb_list) { if (inode->i_state & (I_FREEING|I_WILL_FREE)) continue; + __iget(inode); + spin_unlock(_lock); + invalidate_inode_pages(inode->i_mapping); + if (old != NULL) + iput(old); + old = inode; + + spin_lock(_lock); } spin_unlock(_lock); + + if (old != NULL) + iput(old); } void drop_pagecache(void) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: namespace support requires network modules to say "GPL"
Patrick McHardy wrote: > Adrian Bunk wrote: >> On Sun, Dec 02, 2007 at 09:03:56PM +0100, Patrick McHardy wrote: >> >>> For all I care binary modules can break, but frankly I don't see >>> how encapsulating a couple of structures and pointers in a new >>> structure and adding a new argument to existing functions shifts >>> the decision about how a function should be usable to the namespace >>> guys. IMO all functions should continue to be usable as before, >>> as decided by whoever actually wrote them. >>> ... >> >> Even ignoring the fact that it's unclear whether distributing modules >> with not GPLv2 compatible licences is legal at all or might bring you >> in jail, > > Agreed, lets ignore that :) > >> your statement has an interesting implication: >> >> Stuff like e.g. the EXPORT_SYMBOL(sk_alloc) predates the >> EXPORT_SYMBOL_GPL stuff. >> >> Who is considered the author of this code? >> >> And when should he state whether he prefers to use EXPORT_SYMBOL_GPL >> but wasn't able to use it at that when he wrote it since his code >> predates it and is glad to be able to decide this now? > > > He can state it when he feels like it, I don't see the point. > Authors generally get to decide whether they use EXPORT_SYMBOL > or EXPORT_SYMBOL_GPL unless in cases where its really clear-cut > that EXPORT_SYMBOL is inapproriate. But thats a different matter. > > If a symbol was OK to be used previously and something using it > would not automatically be considered a derived work, how does > passing _net to the function just to make the compiler > happy, avoid BUG_ONs and generally keep things working as before > make it more of a derived work? We, namely, Pavel Emelyanov and me, if we have some rights as a committers to this staff :), do not mind against change EXPORT_SYMBOL_GPL to EXPORT_SYMBOL. Regards, Den -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: namespace support requires network modules to say GPL
Patrick McHardy wrote: Adrian Bunk wrote: On Sun, Dec 02, 2007 at 09:03:56PM +0100, Patrick McHardy wrote: For all I care binary modules can break, but frankly I don't see how encapsulating a couple of structures and pointers in a new structure and adding a new argument to existing functions shifts the decision about how a function should be usable to the namespace guys. IMO all functions should continue to be usable as before, as decided by whoever actually wrote them. ... Even ignoring the fact that it's unclear whether distributing modules with not GPLv2 compatible licences is legal at all or might bring you in jail, Agreed, lets ignore that :) your statement has an interesting implication: Stuff like e.g. the EXPORT_SYMBOL(sk_alloc) predates the EXPORT_SYMBOL_GPL stuff. Who is considered the author of this code? And when should he state whether he prefers to use EXPORT_SYMBOL_GPL but wasn't able to use it at that when he wrote it since his code predates it and is glad to be able to decide this now? He can state it when he feels like it, I don't see the point. Authors generally get to decide whether they use EXPORT_SYMBOL or EXPORT_SYMBOL_GPL unless in cases where its really clear-cut that EXPORT_SYMBOL is inapproriate. But thats a different matter. If a symbol was OK to be used previously and something using it would not automatically be considered a derived work, how does passing init_net to the function just to make the compiler happy, avoid BUG_ONs and generally keep things working as before make it more of a derived work? We, namely, Pavel Emelyanov and me, if we have some rights as a committers to this staff :), do not mind against change EXPORT_SYMBOL_GPL to EXPORT_SYMBOL. Regards, Den -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] AB-BA deadlock in drop_caches sysctl (resend, the one sent was for 2.6.18)
There is a AB-BA deadlock regarding drop_caches sysctl. Here are the code paths: drop_pagecache spin_lock(inode_lock); invalidate_mapping_pages try_to_release_page ext3_releasepage journal_try_to_free_buffers __journal_try_to_free_buffer spin_lock(journal-j_list_lock); __journal_temp_unlink_buffer (called under journal-j_list_lock by comments) mark_buffer_dirty __set_page_dirty __mark_inode_dirty spin_lock(inode_lock); The patch tries to address the issue - it drops inode_lock before digging into invalidate_inode_pages. This seems sane as inode hold should not gone from the list and should not change its place. Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] -- diff --git a/fs/drop_caches.c b/fs/drop_caches.c index 59375ef..4ac80d8 100644 --- a/fs/drop_caches.c +++ b/fs/drop_caches.c @@ -14,15 +14,27 @@ int sysctl_drop_caches; static void drop_pagecache_sb(struct super_block *sb) { - struct inode *inode; + struct inode *inode, *old; + old = NULL; spin_lock(inode_lock); list_for_each_entry(inode, sb-s_inodes, i_sb_list) { if (inode-i_state (I_FREEING|I_WILL_FREE)) continue; - __invalidate_mapping_pages(inode-i_mapping, 0, -1, true); + __iget(inode); + spin_unlock(inode_lock); + + if (old != NULL) + iput(old); + invalidate_mapping_pages(inode-i_mapping, 0, -1); + old = inode; + + spin_lock(inode_lock); } spin_unlock(inode_lock); + + if (old != NULL) + iput(old); } void drop_pagecache(void) diff --git a/include/linux/fs.h b/include/linux/fs.h index b3ec4a4..b424dff 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1649,9 +1649,6 @@ extern int __invalidate_device(struct block_device *); extern int invalidate_partition(struct gendisk *, int); #endif extern int invalidate_inodes(struct super_block *); -unsigned long __invalidate_mapping_pages(struct address_space *mapping, - pgoff_t start, pgoff_t end, - bool be_atomic); unsigned long invalidate_mapping_pages(struct address_space *mapping, pgoff_t start, pgoff_t end); diff --git a/mm/truncate.c b/mm/truncate.c index cadc156..5cbe1e7 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -266,8 +266,21 @@ void truncate_inode_pages(struct address_space *mapping, loff_t lstart) } EXPORT_SYMBOL(truncate_inode_pages); -unsigned long __invalidate_mapping_pages(struct address_space *mapping, - pgoff_t start, pgoff_t end, bool be_atomic) +/** + * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode + * @mapping: the address_space which holds the pages to invalidate + * @start: the offset 'from' which to invalidate + * @end: the offset 'to' which to invalidate (inclusive) + * + * This function only removes the unlocked pages, if you want to + * remove all the pages of one inode, you must call truncate_inode_pages. + * + * invalidate_mapping_pages() will not block on IO activity. It will not + * invalidate pages which are dirty, locked, under writeback or mapped into + * pagetables. + */ +unsigned long invalidate_mapping_pages(struct address_space *mapping, + pgoff_t start, pgoff_t end) { struct pagevec pvec; pgoff_t next = start; @@ -308,30 +321,10 @@ unlock: break; } pagevec_release(pvec); - if (likely(!be_atomic)) - cond_resched(); + cond_resched(); } return ret; } - -/** - * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode - * @mapping: the address_space which holds the pages to invalidate - * @start: the offset 'from' which to invalidate - * @end: the offset 'to' which to invalidate (inclusive) - * - * This function only removes the unlocked pages, if you want to - * remove all the pages of one inode, you must call truncate_inode_pages. - * - * invalidate_mapping_pages() will not block on IO activity. It will not - * invalidate pages which are dirty, locked, under writeback or mapped into - * pagetables. - */ -unsigned long invalidate_mapping_pages(struct address_space *mapping, - pgoff_t start, pgoff_t end) -{ - return __invalidate_mapping_pages(mapping, start, end, false); -} EXPORT_SYMBOL(invalidate_mapping_pages); /* -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] AB-BA deadlock in drop_caches sysctl
There is a AB-BA deadlock regarding drop_caches sysctl. Here are the code paths: drop_pagecache spin_lock(inode_lock); invalidate_mapping_pages try_to_release_page ext3_releasepage journal_try_to_free_buffers __journal_try_to_free_buffer spin_lock(journal-j_list_lock); journal_commit_transaction spin_lock(journal-j_list_lock); __journal_remove_checkpoint __journal_refile_buffer __journal_unfile_buffer __journal_temp_unlink_buffer __set_page_dirty_nobuffers __mark_inode_dirt spin_lock(inode_lock); The patch tries to address the issue - it drops inode_lock before digging into invalidate_inode_pages. This seems sane as inode hold should not gone from the list and should not change its place. Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] --- ./fs/drop_caches.c.marker 2006-09-20 07:42:06.0 +0400 +++ ./fs/drop_caches.c 2007-12-03 15:43:44.0 +0300 @@ -14,15 +14,27 @@ int sysctl_drop_caches; static void drop_pagecache_sb(struct super_block *sb) { - struct inode *inode; + struct inode *inode, *old; + old = NULL; spin_lock(inode_lock); list_for_each_entry(inode, sb-s_inodes, i_sb_list) { if (inode-i_state (I_FREEING|I_WILL_FREE)) continue; + __iget(inode); + spin_unlock(inode_lock); + invalidate_inode_pages(inode-i_mapping); + if (old != NULL) + iput(old); + old = inode; + + spin_lock(inode_lock); } spin_unlock(inode_lock); + + if (old != NULL) + iput(old); } void drop_pagecache(void) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] .gitignore update for x86 arch
This patch: - makes .gitignore files visible to git - makes arch/x86/kernel/vsyscall_32.lds and arch/i386/boot invisible Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]> diff --git a/.gitignore b/.gitignore index 27c3e83..22fb8fa 100644 --- a/.gitignore +++ b/.gitignore @@ -26,6 +26,7 @@ vmlinux* !vmlinux.lds.S System.map Module.symvers +!.gitignore # # Generated include files diff --git a/arch/i386/.gitignore b/arch/i386/.gitignore new file mode 100644 index 000..36ef4c3 --- /dev/null +++ b/arch/i386/.gitignore @@ -0,0 +1 @@ +boot diff --git a/arch/x86/kernel/.gitignore b/arch/x86/kernel/.gitignore index 40836ad..4ea38a3 100644 --- a/arch/x86/kernel/.gitignore +++ b/arch/x86/kernel/.gitignore @@ -1 +1,2 @@ vsyscall.lds +vsyscall_32.lds - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] .gitignore update for x86 arch
This patch: - makes .gitignore files visible to git - makes arch/x86/kernel/vsyscall_32.lds and arch/i386/boot invisible Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] diff --git a/.gitignore b/.gitignore index 27c3e83..22fb8fa 100644 --- a/.gitignore +++ b/.gitignore @@ -26,6 +26,7 @@ vmlinux* !vmlinux.lds.S System.map Module.symvers +!.gitignore # # Generated include files diff --git a/arch/i386/.gitignore b/arch/i386/.gitignore new file mode 100644 index 000..36ef4c3 --- /dev/null +++ b/arch/i386/.gitignore @@ -0,0 +1 @@ +boot diff --git a/arch/x86/kernel/.gitignore b/arch/x86/kernel/.gitignore index 40836ad..4ea38a3 100644 --- a/arch/x86/kernel/.gitignore +++ b/arch/x86/kernel/.gitignore @@ -1 +1,2 @@ vsyscall.lds +vsyscall_32.lds - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [NET] IPv6 oops bisected
David Miller wrote: > From: "Denis V. Lunev" <[EMAIL PROTECTED]> > Date: Mon, 08 Oct 2007 10:34:23 +0400 > >> OK. I am installing Fedora 7 right now... > > You don't need to install Fedora, just read the code! :-) > > The bug is obvious and it's been explained thoroughly in this > thread. > > When 'dev' is NULL in ip6_route_add() we need to figure out what > namespace and/or loopback device you want to use. > > There is no reason to do an entire dist install to work on fixing this > bug, yikes! I do understand the conditions when the bug happens. Its completely clear :) Though I do not understand how to trigger it from command line to test that the problem is resolved. Jeff was not kind enough to give exact command line :( The unfortunate thing with this place is that original Eric's code is also broken here. I was too optimistic working on the original patchset. In other way, but broken :( So, I must stop and think... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [NET] IPv6 oops bisected
David Miller wrote: > From: Herbert Xu <[EMAIL PROTECTED]> > Date: Mon, 8 Oct 2007 14:19:42 +0800 > >> On Sun, Oct 07, 2007 at 11:16:08PM -0700, David Miller wrote: >>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c >>> index a7db84c..7109ad6 100644 >>> --- a/net/ipv6/route.c >>> +++ b/net/ipv6/route.c >>> @@ -1188,7 +1188,7 @@ int ip6_route_add(struct fib6_config *cfg) >>> if ((cfg->fc_flags & RTF_REJECT) || >>> (dev && (dev->flags_LOOPBACK) && >>> !(addr_type_ADDR_LOOPBACK))) { >>> /* hold loopback dev/idev if we haven't done so. */ >>> - if (dev != dev->nd_net->loopback_dev) { >>> + if (!dev || (dev != dev->nd_net->loopback_dev)) { >>> if (dev) { >>> dev_put(dev); >>> in6_dev_put(idev); >> Unfortunately this'll still oops a few lines down when it tries >> to assign dev->nd_net->loopabck_dev to dev. The issue here is >> which namespace are we in if dev is NULL. > > Good catch. > > I'm just going to revert my bogus fix and the original change > for now. Denis can resubmit the original patch once this > is resolved. OK. I am installing Fedora 7 right now... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [NET] IPv6 oops bisected
David Miller wrote: From: Herbert Xu [EMAIL PROTECTED] Date: Mon, 8 Oct 2007 14:19:42 +0800 On Sun, Oct 07, 2007 at 11:16:08PM -0700, David Miller wrote: diff --git a/net/ipv6/route.c b/net/ipv6/route.c index a7db84c..7109ad6 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -1188,7 +1188,7 @@ int ip6_route_add(struct fib6_config *cfg) if ((cfg-fc_flags RTF_REJECT) || (dev (dev-flagsIFF_LOOPBACK) !(addr_typeIPV6_ADDR_LOOPBACK))) { /* hold loopback dev/idev if we haven't done so. */ - if (dev != dev-nd_net-loopback_dev) { + if (!dev || (dev != dev-nd_net-loopback_dev)) { if (dev) { dev_put(dev); in6_dev_put(idev); Unfortunately this'll still oops a few lines down when it tries to assign dev-nd_net-loopabck_dev to dev. The issue here is which namespace are we in if dev is NULL. Good catch. I'm just going to revert my bogus fix and the original change for now. Denis can resubmit the original patch once this is resolved. OK. I am installing Fedora 7 right now... - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/