Bug#647185: linux-2.6: kernel null pointer dereference while adding SAN path

2011-11-02 Thread Bernd Zeimetz
Hi Ben!


 removing paths to our SAN and adding them back results in
 [...]

 Does the attached patch help?  Instructions for building a patched
 kernel can be found at:

 http://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official
 
 Sorry, you'll need this patch as well.

Thanks for the patches! After applying them, we run into the following oops
instead.
Please note that
- this only seems to happen when there is a partition on the LUN; using LVM on
  the DM device directly doesn't seem to trigger the bug
- you'll have to fix the vserver patch when the two dm patches are applied
  (not the issue here, but I wanted to mention it before you stumble over it)


[ 2000.379681] device-mapper: multipath: Failing path 8:32.
[ 2000.380022] device-mapper: multipath: Failing path 8:48.
[ 2000.381502] device-mapper: table: 254:2: multipath: error getting device
[ 2000.381533] device-mapper: ioctl: error adding target to table
[ 2000.382355] device-mapper: table: 254:2: multipath: error getting device
[ 2000.382385] device-mapper: ioctl: error adding target to table
[ 2000.411143] general protection fault:  [#1] SMP 
[ 2000.411175] last sysfs file: /sys/devices/pci:00/:00:07.0/:0e:00.0/host0/rport-0:0-3/target0:0:3/0:0:3:0/block/sdd/uevent
[ 2000.411229] CPU 4 
[ 2000.411251] Modules linked in: 8021q garp stp ext4 jbd2 crc16 dm_round_robin 
dm_multipath scsi_dh bonding ipmi_devintf ipmi_si ipmi_msghandler ohci_hcd 
snd_pcm snd_timer radeon snd ttm soundcore drm_kms_helper drm i2c_algo_bit 
i2c_core hpilo snd_page_alloc hpwdt joydev pcspkr evdev button power_meter 
processor container psmouse serio_raw ext3 jbd mbcache dm_mod sd_mod crc_t10dif 
sg sr_mod cdrom ata_generic usbhid hid qla2xxx hpsa scsi_transport_fc uhci_hcd 
thermal scsi_tgt ata_piix ehci_hcd libata bnx2 qlcnic usbcore cciss nls_base 
scsi_mod thermal_sys [last unloaded: scsi_wait_scan]
[ 2000.411579] Pid: 8402, comm: multipath Not tainted 2.6.32-5-amd64 #1 ProLiant DL380 G7
[ 2000.411623] RIP: 0010:[8117629b]  [8117629b] elv_drain_elevator+0x13/0x5a
[ 2000.411674] RSP: 0018:880e1b2cfd18  EFLAGS: 00010002
[ 2000.411700] RAX: 880719b0cd80 RBX: 880719a291a0 RCX: 
[ 2000.411729] RDX: 0002 RSI: 0001 RDI: 880719a291a0
[ 2000.411758] RBP: 880719a291a0 R08: 88071a65be70 R09: 88071a701840
[ 2000.411787] R10: 000100067c84 R11: 880713a8a780 R12: 880719a291a0
[ 2000.411816] R13: 0002 R14: 880719707160 R15: 880719707044
[ 2000.411845] FS:  7f3b1d07a7a0() GS:88001a44() knlGS:
[ 2000.411889] CS:  0010 DS:  ES:  CR0: 80050033
[ 2000.411916] CR2: 025ea210 CR3: 000e1a53f000 CR4: 06e0
[ 2000.411945] DR0:  DR1:  DR2: 
[ 2000.411974] DR3:  DR6: 0ff0 DR7: 0400
[ 2000.418573] Process multipath (pid: 8402, threadinfo 880e1b2ce000, task 880e1b93e9f0)
[ 2000.418618] Stack:
[ 2000.418637]  880248cc0018 81176c40 880248cc0018 880248cc0018
[ 2000.418674] 0 880719a291a0 0096 880719707000 8117dec9
[ 2000.418726] 0 880248cc0018 c9000ca8b040 880719a2b4e0 a019492b
[ 2000.418795] Call Trace:
[ 2000.418818]  [81176c40] ? elv_insert+0x91/0x260
[ 2000.418847]  [8117dec9] ? blk_insert_cloned_request+0x4f/0x67
[ 2000.418879]  [a019492b] ? dm_dispatch_request+0x33/0x59 [dm_mod]
[ 2000.418912]  [a0195ef7] ? dm_request_fn+0x121/0x1a2 [dm_mod]
[ 2000.418941]  [8117eef6] ? __blk_run_queue+0x35/0x66
[ 2000.418970]  [a0194a43] ? dm_resume+0xb5/0x123 [dm_mod]
[ 2000.419001]  [a0199071] ? dev_suspend+0x0/0x196 [dm_mod]
[ 2000.419032]  [a01991d0] ? dev_suspend+0x15f/0x196 [dm_mod]
[ 2000.419063]  [a0199c24] ? ctl_ioctl+0x1c6/0x20e [dm_mod]
[ 2000.419092]  [a0199c7a] ? dm_ctl_ioctl+0xe/0x12 [dm_mod]
[ 2000.419124]  [810fab66] ? vfs_ioctl+0x21/0x6c
[ 2000.419150]  [810fb0b4] ? do_vfs_ioctl+0x48d/0x4cb
[ 2000.419178]  [810d066d] ? remove_vma+0x6b/0x72
[ 2000.419205]  [810d1782] ? do_munmap+0x307/0x329
[ 2000.419231]  [810fb143] ? sys_ioctl+0x51/0x70
[ 2000.419258]  [81010b42] ? system_call_fastpath+0x16/0x1b
[ 2000.419285] Code: 41 0f 18 09 75 bb 48 8b 02 48 89 70 08 48 89 06 48 89 56 08 48 89 32 c3 53 48 89 fb 48 8b 43 18 be 01 00 00 00 48 89 df 48 8b 00 ff 50 20 85 c0 75 ea 8b 8b b0 03 00 00 85 c9 74 34 8b 15 ca ea
[ 2000.419478] RIP  [8117629b] elv_drain_elevator+0x13/0x5a
[ 2000.419507]  RSP 880e1b2cfd18
[ 2000.419759] ---[ end trace cff8452e221a0978 ]---
[ 2130.43] qla2xxx :0e:00.0: LIP reset occurred (f700).
[ 2130.647148] qla2xxx :0e:00.1: LOOP DOWN detected (2 3 0).
[ 2130.808220] qla2xxx :0e:00.0: LIP occurred (f700).
[ 2130.808342] qla2xxx 


2011-11-01 Thread Ben Hutchings
On Mon, 2011-10-31 at 14:35 +0100, Bernd Zeimetz wrote:
 Package: linux-2.6
 Version: 2.6.32-38
 
 Hi,
 
 removing paths to our SAN and adding them back results in
[...]

Does the attached patch help?  Instructions for building a patched
kernel can be found at:

http://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official

Ben.

-- 
Ben Hutchings
Sturgeon's Law: Ninety percent of everything is crap.
From: Kiyoshi Ueda <k-u...@ct.jp.nec.com>
Date: Thu, 12 Aug 2010 04:13:54 +0100
Subject: [PATCH] dm: prevent access to md being deleted

commit abdc568b0540bec6d3e0afebac496adef1189b77 upstream.

This patch prevents access to mapped_device which is being deleted.

Currently, even after a mapped_device has been removed from the hash,
it could be accessed through idr_find() using minor number.
That could cause a race and NULL pointer reference below:
  CPU0                                CPU1
  ----------------------------------------------------------------------
  dev_remove(param)
    down_write(&_hash_lock)
    dm_lock_for_deletion(md)
      spin_lock(&_minor_lock)
      set_bit(DMF_DELETING)
      spin_unlock(&_minor_lock)
    __hash_remove(hc)
    up_write(&_hash_lock)
                                      dev_status(param)
                                        md = find_device(param)
                                          down_read(&_hash_lock)
                                          __find_device_hash_cell(param)
                                            dm_get_md(param->dev)
                                              md = dm_find_md(dev)
                                                spin_lock(&_minor_lock)
                                                md = idr_find(MINOR(dev))
                                                spin_unlock(&_minor_lock)
    dm_put(md)
      free_dev(md)
                                              dm_get(md)
                                          up_read(&_hash_lock)
                                        __dev_status(md, param)
                                        dm_put(md)

This patch fixes such problems.

Signed-off-by: Kiyoshi Ueda <k-u...@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nom...@ce.jp.nec.com>
Cc: sta...@kernel.org
Signed-off-by: Alasdair G Kergon <a...@redhat.com>
---
 drivers/md/dm.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a3f21dc..ba6934c 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2136,6 +2136,7 @@ static struct mapped_device *dm_find_md(dev_t dev)
 	md = idr_find(&_minor_idr, minor);
 	if (md && (md == MINOR_ALLOCED ||
 		   (MINOR(disk_devt(dm_disk(md))) != minor) ||
+		   dm_deleting_md(md) ||
 		   test_bit(DMF_FREEING, &md->flags))) {
 		md = NULL;
 		goto out;
-- 
1.7.7
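The race in the commit message above comes down to one missing check in the lookup by minor number. The following is a simplified, single-threaded userspace model, not kernel code: the `fake_*` names, the flag masks and the fixed-size `minor_table` (standing in for `_minor_idr`) are hypothetical, and it deliberately omits `_minor_lock`, `_hash_lock` and the `dm_get()`/`dm_put()` reference counting. It only illustrates that once the removal path has flagged a device, the patched lookup returns NULL instead of handing out a pointer that is about to be freed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag masks, loosely modelled on DMF_DELETING and DMF_FREEING. */
#define FAKE_DELETING (1UL << 0)
#define FAKE_FREEING  (1UL << 1)

struct fake_md {
    unsigned int minor;
    unsigned long flags;
};

/* Hypothetical stand-in for _minor_idr: a fixed table indexed by minor. */
#define FAKE_NR_MINORS 4
static struct fake_md *minor_table[FAKE_NR_MINORS];

/* First step of the removal path (cf. dm_lock_for_deletion()): flag the
 * device. The kernel does this under _minor_lock; this sketch has no lock. */
static void fake_mark_deleting(struct fake_md *md)
{
    assert(md != NULL);
    md->flags |= FAKE_DELETING;
}

/* Lookup by minor number (cf. the patched dm_find_md()). Without the
 * FAKE_DELETING test, a device that the removal path has already flagged
 * would still be returned, and the caller could then take a reference to
 * memory that is about to be freed. */
static struct fake_md *fake_find_md(unsigned int minor)
{
    struct fake_md *md;

    if (minor >= FAKE_NR_MINORS)
        return NULL;

    md = minor_table[minor];
    if (md && (md->minor != minor ||
               (md->flags & FAKE_DELETING) ||
               (md->flags & FAKE_FREEING)))
        md = NULL;
    return md;
}
```

With `minor_table[2]` pointing at a device, `fake_find_md(2)` returns it; after `fake_mark_deleting()` the same call returns NULL, roughly what the extra `dm_deleting_md(md)` condition does for `dm_find_md()`.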






2011-11-01 Thread Ben Hutchings
On Wed, 2011-11-02 at 04:56 +, Ben Hutchings wrote:
 On Mon, 2011-10-31 at 14:35 +0100, Bernd Zeimetz wrote:
  Package: linux-2.6
  Version: 2.6.32-38
  
  Hi,
  
  removing paths to our SAN and adding them back results in
 [...]
 
 Does the attached patch help?  Instructions for building a patched
 kernel can be found at:
 
 http://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official

Sorry, you'll need this patch as well.

Ben.

-- 
Ben Hutchings
From: Mike Anderson <andm...@linux.vnet.ibm.com>
Date: Thu, 10 Dec 2009 23:52:20 +
Subject: [PATCH] dm: add dm_deleting_md function

commit 432a212c0dd0f4ca386cf37c5b740ac9dbda4479 upstream.

Add dm_deleting_md to check whether or not a given mapped
device is currently being deleted.

Signed-off-by: Mike Anderson <andm...@linux.vnet.ibm.com>
Signed-off-by: Alasdair G Kergon <a...@redhat.com>
---
 drivers/md/dm.c |9 +++--
 drivers/md/dm.h |5 +
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 233a2e9..16f759f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -329,6 +329,11 @@ static void __exit dm_exit(void)
 /*
  * Block device functions
  */
+int dm_deleting_md(struct mapped_device *md)
+{
+	return test_bit(DMF_DELETING, &md->flags);
+}
+
 static int dm_blk_open(struct block_device *bdev, fmode_t mode)
 {
 	struct mapped_device *md;
@@ -340,7 +345,7 @@ static int dm_blk_open(struct block_device *bdev, fmode_t mode)
 		goto out;
 
 	if (test_bit(DMF_FREEING, &md->flags) ||
-	test_bit(DMF_DELETING, &md->flags)) {
+	dm_deleting_md(md)) {
 		md = NULL;
 		goto out;
 	}
@@ -2659,7 +2664,7 @@ struct mapped_device *dm_get_from_kobject(struct kobject *kobj)
 		return NULL;
 
 	if (test_bit(DMF_FREEING, &md->flags) ||
-	test_bit(DMF_DELETING, &md->flags))
+	dm_deleting_md(md))
 		return NULL;
 
 	dm_get(md);
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 4a95e8f..604a5f2 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -89,6 +89,11 @@ int dm_target_iterate(void (*iter_func)(struct target_type *tt,
 int dm_split_args(int *argc, char ***argvp, char *input);
 
 /*
+ * Is this mapped_device being deleted?
+ */
+int dm_deleting_md(struct mapped_device *md);
+
+/*
  * The device-mapper can be driven through one of two interfaces;
  * ioctl or filesystem, depending which patch you have applied.
  */
-- 
1.7.7






2011-10-31 Thread Bernd Zeimetz
Package: linux-2.6
Version: 2.6.32-38

Hi,

removing paths to our SAN and adding them back results in

[  951.569561] device-mapper: table: 253:2: sde too small for target: start=0, len=140465493850188, dev_size=627107840
[  951.571750] BUG: unable to handle kernel NULL pointer dereference at (null)
[  951.571876] IP: [(null)] (null)
[  951.571961] PGD 6500c1067 PUD 650135067 PMD 0 
[  951.578673] Oops: 0010 [#1] SMP 
[  951.578788] last sysfs file: /sys/devices/virtual/block/dm-3/uevent
[  951.578846] CPU 16 
[  951.578928] Modules linked in: 8021q garp stp ext4 jbd2 crc16 dm_round_robin 
dm_multipath scsi_dh bonding ipmi_devintf ipmi_si ipmi_msghandler ohci_hcd 
radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core snd_pcm snd_timer snd 
soundcore snd_page_alloc hpilo hpwdt joydev pcspkr psmouse evdev serio_raw 
power_meter container processor button ext3 jbd mbcache dm_mod raid10 raid456 
async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 
raid0 multipath linear md_mod sd_mod crc_t10dif sg usbhid sr_mod hid cdrom 
ata_generic hpsa ata_piix thermal uhci_hcd cciss ehci_hcd qla2xxx 
scsi_transport_fc libata scsi_tgt bnx2 usbcore qlcnic nls_base scsi_mod 
thermal_sys [last unloaded: scsi_wait_scan]
[  951.581772] Pid: 5801, comm: blkid Not tainted 2.6.32-5-amd64 #1 ProLiant DL380 G7
[  951.581845] RIP: 0010:[]  [(null)] (null)
[  951.581934] RSP: 0018:88071b9c5b80  EFLAGS: 00010006
[  951.581989] RAX: 880e1ad3e880 RBX: 880e1a4888d0 RCX: 
[  951.582054] RDX: 0002 RSI: 0001 RDI: 880e1a4888d0
[  951.582116] RBP: 880e1a4888d0 R08: 880719cb33e8 R09: 880719f12840
[  951.582175] R10: 000100027c26 R11: 88065b00 R12: 880e1a4888d0
[  951.582234] R13: 0002 R14: 88071bcc1d60 R15: 88071bcc1c44
[  951.582297] FS:  7f5c1037d740() GS:88001a50() knlGS:
[  951.582372] CS:  0010 DS:  ES:  CR0: 80050033
[  951.582429] CR2:  CR3: 00071b6d2000 CR4: 06e0
[  951.582488] DR0:  DR1:  DR2: 
[  951.582546] DR3:  DR6: 0ff0 DR7: 0400
[  951.582606] Process blkid (pid: 5801, threadinfo 88071b9c4000, task 88071a31bf90)
[  951.582680] Stack:
[  951.582729]  8117629e 88071bbd7dc8 81176c40 88071bbd7dc8
[  951.582885] 0 88071bbd7dc8 880e1a4888d0 0096 88071bcc1c00
[  951.583118] 0 8117dec9 88071bbd7dc8 c9000c8da040 88071a2fac10
[  951.583397] Call Trace:
[  951.583452]  [8117629e] ? elv_drain_elevator+0x16/0x5a
[  951.583510]  [81176c40] ? elv_insert+0x91/0x260
[  951.583568]  [8117dec9] ? blk_insert_cloned_request+0x4f/0x67
[  951.583630]  [a022d90f] ? dm_dispatch_request+0x33/0x59 [dm_mod]
[  951.583691]  [a022eedb] ? dm_request_fn+0x121/0x1a2 [dm_mod]
[  951.583752]  [810b43e3] ? sync_page_killable+0x0/0x2f
[  951.583810]  [8117f07a] ? generic_unplug_device+0x21/0x34
[  951.583870]  [a022dac8] ? dm_unplug_all+0x33/0x4c [dm_mod]
[  951.583928]  [810b43d9] ? sync_page+0x3c/0x46
[  951.583984]  [810b43ec] ? sync_page_killable+0x9/0x2f
[  951.584043]  [812fb80a] ? __wait_on_bit_lock+0x3f/0x84
[  951.584101]  [810b42e8] ? __lock_page_killable+0x5d/0x63
[  951.584160]  [81064fc0] ? wake_bit_function+0x0/0x23
[  951.584217]  [810b42f7] ? lock_page_killable+0x9/0x1f
[  951.584274]  [810b5917] ? generic_file_aio_read+0x363/0x536
[  951.584334]  [810eed05] ? do_sync_read+0xce/0x113
[  951.584391]  [81064f92] ? autoremove_wake_function+0x0/0x2e
[  951.584451]  [810ccd36] ? handle_mm_fault+0x3b8/0x80f
[  951.584508]  [810ef728] ? vfs_read+0xa6/0xff
[  951.584564]  [810ef83d] ? sys_read+0x45/0x6e
[  951.584621]  [81010b42] ? system_call_fastpath+0x16/0x1b
[  951.584677] Code:  Bad RIP value.
[  951.584795] RIP  [(null)] (null)
[  951.584879]  RSP 88071b9c5b80
[  951.584932] CR2: 
[  951.584985] ---[ end trace 71dd7f009a29d813 ]---


As I'm adding back the old paths at pretty much the same time, it seems to
me that blkid is trying to access one of the devices I've just removed.
But that should not result in a NULL pointer dereference, nor should it
leave all paths to the LUN marked faulty, with multipath completely
forgetting what kind of hardware is behind it.

lun_alias (980006470684a65693038) dm-1 ,
size=4.9T features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| |- #:#:#:# -   #:#   active faulty running
| `- #:#:#:# -   #:#   active faulty running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- #:#:#:# -   #:#   active faulty running
  `- #:#:#:# -   #:#   active faulty running


The expected output of multipath -ll would be more like