Thank you. I think I have tracked it down.
Can you please try the attached patches?

On Mon, Apr 27, 2009 at 02:46:37PM +0900, KUWAMURA Shin'ya wrote:
> Hi Yamahata-san,
> 
> On <20090427040951.gk22299%yamah...@valinux.co.jp>,
>  Isaku Yamahata wrote:
> > 
> > On Mon, Apr 27, 2009 at 09:53:55AM +0900, KUWAMURA Shin'ya wrote:
> > > Hi Yamahata-san,
> > > 
> > > On <20090424120910.gg22299%yamah...@valinux.co.jp>,
> > >  Isaku Yamahata wrote:
> > > > 
> > > > Can you please try this patch?
> > > 
> > > NaT consumption occurred on tapdisk while PV and HVM were booting up.
> > > Please see the attachment file.
> > 
> > Thank you. I haven't reproduced it yet.
> > Do you have any logs without the patch?
> 
> Dom0 has no error, but domUs output errors:
>   request_module: runaway loop modprobe binfmt-429d
>   request_module: runaway loop modprobe binfmt-429d
>   request_module: runaway loop modprobe binfmt-429d
>   request_module: runaway loop modprobe binfmt-429d
>   request_module: runaway loop modprobe binfmt-429d
>   # hung up
> 
> I attach the messages from both the PV and HVM domains.
> 
> Best regards,

> Linux version 2.6.18.8-xen (k...@vmi05.sky.yk.fujitsu.co.jp) (gcc version 3.4.4 20050721 (Red Hat 3.4.4-2)) #1 SMP Thu Apr 23 09:10:17 JST 2009
> EFI v1.00 by Xen/ia64: SALsystab=0x2178 ACPI 2.0=0x1000
> booting generic kernel on platform xen
> ACPI: RSDP (v002    XEN                                ) @ 0x0000000000001000
> ACPI: XSDT (v001    XEN Xen/ia64 0x00000000 XEN 0x00030004) @ 0x0000000000001024
> ACPI: FADT (v003    XEN Xen/ia64 0x00000000 XEN 0x00030004) @ 0x0000000000001058
> ACPI: MADT (v002    XEN Xen/ia64 0x00000000 XEN 0x00030004) @ 0x0000000000001478
> ACPI: DSDT (v001    XEN Xen/ia64 0x00000000 XEN 0x00030004) @ 0x0000000000000000
> SAL 0.1: Xen/ia64 Xen/ia64 version 0.0
> SAL: AP wakeup using external interrupt vector 0xf3
> No logical to physical processor mapping available
> ACPI: Local APIC address c0000000fee00000
> ACPI: Error parsing MADT - no IOSAPIC entries
> 2 CPUs available, 2 CPUs total
> Running on Xen! start_info_pfn=0xfffd nr_pages=65536 flags=0x0
> Virtual mem_map starts at 0xa0007fffffc80000
> On node 0 totalpages: 64264
>   DMA zone: 64264 pages, LIFO batch:7
> SMP: Allowing 2 CPUs, 0 hotplug CPUs
> Built 1 zonelists.  Total pages: 64264
> Kernel command line: root=/dev/hda1 ro console=tty console=xvc0
> PID hash table entries: 4096 (order: 12, 32768 bytes)
> CPU 0: base freq=199.459MHz, ITC ratio=8/4, ITC freq=398.919MHz
> Console: colour dummy device 80x25
> Memory: 1022256k/1028224k available (10879k code, 26240k reserved, 5054k data, 688k init)
> McKinley Errata 9 workaround not needed; disabling it
> Calibrating delay loop... 3185.04 BogoMIPS (lpj=15925248)
> Dentry cache hash table entries: 131072 (order: 6, 1048576 bytes)
> Inode-cache hash table entries: 65536 (order: 5, 524288 bytes)
> Mount-cache hash table entries: 1024
> ACPI: Core revision 20060707
> Boot processor id 0x0/0x0
> Fixed BSP b0 value from CPU 1
> CPU 1: synchronized ITC with CPU 0 (last diff 8 cycles, maxerr 160 cycles)
> CPU 1: base freq=199.459MHz, ITC ratio=8/4, ITC freq=398.919MHz
> Calibrating delay loop... 3165.38 BogoMIPS (lpj=15826944)
> Brought up 2 CPUs
> Total of 2 processors activated (6350.43 BogoMIPS).
> migration_cost=12234
> DMI not present or invalid.
> NET: Registered protocol family 16
> ACPI: bus type pci registered
> Brought up 2 CPUs
> ACPI: SCI (ACPI GSI 0) not registered
> ACPI: Interpreter enabled
> ACPI: Using IOSAPIC for interrupt routing
> suspend: event channel 9
> xen_mem: Initialising balloon driver.
> SCSI subsystem initialized
> usbcore: registered new driver usbfs
> usbcore: registered new driver hub
> NET: Registered protocol family 2
> IP route cache hash table entries: 8192 (order: 2, 65536 bytes)
> TCP established hash table entries: 32768 (order: 5, 524288 bytes)
> TCP bind hash table entries: 16384 (order: 4, 262144 bytes)
> TCP: Hash tables configured (established 32768 bind 16384)
> TCP reno registered
> perfmon: version 2.0 IRQ 238
> perfmon: Montecito PMU detected, 27 PMCs, 35 PMDs, 12 counters (47 bits)
> PAL Information Facility v0.5
> perfmon: added sampling format default_format
> perfmon_default_smpl: default_format v2.0 registered
> Installing knfsd (copyright (C) 1996 o...@monad.swb.de).
> SGI XFS with large block/inode numbers, no debug enabled
> Initializing Cryptographic API
> io scheduler noop registered
> io scheduler anticipatory registered (default)
> io scheduler deadline registered
> io scheduler cfq registered
> pci_hotplug: PCI Hot Plug PCI Core version: 0.5
> acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
> ACPI: Power Button (FF) [PWRF]
> ACPI: Sleep Button (FF) [SLPF]
> ACPI Exception (acpi_processor-0721): AE_NOT_FOUND, Processor Device is not present [20060707]
> ACPI: Getting cpuindex for acpiid 0x2
> EFI Time Services Driver v0.4
> Linux agpgart interface v0.101 (c) Dave Jones
> [drm] Initialized drm 1.0.1 20051102
> RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
> loop: loaded (max 8 devices)
> HP CISS Driver (v 3.6.10)
> Intel(R) PRO/1000 Network Driver - version 7.1.9-k4
> Copyright (c) 1999-2006 Intel Corporation.
> e100: Intel(R) PRO/100 Network Driver, 3.5.10-k2-NAPI
> e100: Copyright(c) 1999-2005 Intel Corporation
> tun: Universal TUN/TAP device driver, 1.6
> tun: (C) 1999-2004 Max Krasnyansky <m...@qualcomm.com>
> arcnet loaded.
> netconsole: not configured, aborting
> Linux video capture interface: v2.00
> Xen virtual console successfully installed as xvc0
> Event-channel device installed.
> netfront: Initialising virtual ethernet driver.
> xen-vbd: registered block device major 3
> Console: switching to colour frame buffer device 100x37
> input: Xen Virtual Keyboard as /class/input/input0
> input: Xen Virtual Pointer as /class/input/input1
> Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
> ide: Assuming 50MHz system bus speed for PIO modes; override with idebus=xx
> ide-floppy driver 0.99.newide
> st: Version 20050830, fixed bufsize 32768, s/g segs 256
> osst :I: Tape driver with OnStream support version 0.99.4
> osst :I: $Id: osst.c,v 1.73 2005/01/01 21:13:34 wriede Exp $
> Fusion MPT base driver 3.04.01
> Copyright (c) 1999-2005 LSI Logic Corporation
> Fusion MPT SPI Host driver 3.04.01
> Fusion MPT SAS Host driver 3.04.01
> usbmon: debugfs is not available
> ohci_hcd: 2005 April 22 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI)
> USB Universal Host Controller Interface driver v3.0
> Initializing USB Mass Storage driver...
> usbcore: registered new driver usb-storage
> USB Mass Storage support registered.
> usbcore: registered new driver hiddev
> usbcore: registered new driver usbhid
> /home/kuwa/proj/ia64/linux-2.6.18-xen.hg/drivers/usb/input/hid-core.c: v2.6:USB HID core driver
> i8042.c: i8042 controller self test timeout.
> mice: PS/2 mouse device common for all mice
> i2c /dev entries driver
> device-mapper: ioctl: 4.7.0-ioctl (2006-06-24) initialised: dm-de...@redhat.com
> EFI Variables Facility v0.08 2004-May-17
> efivars: get_next_variable: status=8000000000000003
> Advanced Linux Sound Architecture Driver Version 1.0.12rc1 (Thu Jun 22 13:55:50 2006 UTC).
> no UART detected at 0x1
> specify port
> snd_mpu401: probe of snd_mpu401.0 failed with error -22
> ALSA device list:
>   #0: Dummy 1
>   #1: Virtual MIDI Card 1
> TCP bic registered
> NET: Registered protocol family 1
> NET: Registered protocol family 17
> Bridge firewalling registered
> xen privcmd uses pseudo physical addr range [0x40000000, 0x3ffff000000] (4193264 MB)
> Xen p2m: assign p2m table of [0x0000000000000000, 0x0000000040004000)
> Xen p2m: to [0x0000000040000000, 0x0000000044000000) (65536 KBytes)
> XENBUS: Device with no driver: device/console/0
> kjournald starting.  Commit interval 5 seconds
> EXT3-fs: mounted filesystem with ordered data mode.
> VFS: Mounted root (ext3 filesystem) readonly.
> Freeing unused kernel memory: 688kB freed
> request_module: runaway loop modprobe binfmt-429d
> request_module: runaway loop modprobe binfmt-429d
> request_module: runaway loop modprobe binfmt-429d
> request_module: runaway loop modprobe binfmt-429d
> request_module: runaway loop modprobe binfmt-429d

> Linux version 2.6.9-22.EL (bhcomp...@boris.devel.redhat.com) (gcc version 3.4.4 20050721 (Red Hat 3.4.4-2)) #1 SMP Mon Sep 19 17:54:55 EDT 2005
> Warning: EFI system table major version mismatch: got 2.00, expected 1.00
> EFI v2.00 by TianoCore.org: ACPI 2.0=0xec000 SALsystab=0x1f15aac8 SMBIOS=0x1fc01000
> booting generic kernel on platform dig
> ACPI: RSDP (v002    Xen                                ) @ 0x00000000000ec000
> ACPI: XSDT (v001    Xen      HVM 0x00000000 HVML 0x00000000) @ 0x00000000000ecac0
> ACPI: FADT (v004    Xen      HVM 0x00000000 HVML 0x00000000) @ 0x00000000000ec8c0
> ACPI: MADT (v002    Xen      HVM 0x00000000 HVML 0x00000000) @ 0x00000000000ec9c0
> ACPI: HPET (v001    Xen      HVM 0x00000000 HVML 0x00000000) @ 0x00000000000eca80
> ACPI: DSDT (v002    Xen      HVM 0x00000000 INTL 0x20061109) @ 0x0000000000000000
> Warning: acpi_table_parse(ACPI_SRAT) returned 0!
> Warning: acpi_table_parse(ACPI_SLIT) returned 0!
> efi.trim_top: ignoring 4KB of memory at 0x0 due to granule hole at 0x0
> efi.trim_top: ignoring 636KB of memory at 0x1000 due to granule hole at 0x0
> efi.trim_bottom: ignoring 48KB of memory at 0xe0000 due to granule hole at 0x0
> efi.trim_bottom: ignoring 15432KB of memory at 0xee000 due to granule hole at 0x0
> efi.trim_top: ignoring 128KB of memory at 0x1ff6e000 due to granule hole at 0x20000000
> Initial ramdisk at: 0xe00000001d156000 (1601685 bytes)
> SAL 3.0: Xen/ia64 Tianocore SAL Xen/ia64 SAL version 2.0
> SAL: AP wakeup using external interrupt vector 0xf3
> SAL Platform features: None
> iosapic_system_init: Disabling PC-AT compatible 8259 interrupts
> ACPI: Local APIC address c0000000fee00000
> register_intr: changing vector 47 from IO-SAPIC-edge to IO-SAPIC-level
> register_intr: changing vector 38 from IO-SAPIC-edge to IO-SAPIC-level
> register_intr: changing vector 37 from IO-SAPIC-edge to IO-SAPIC-level
> 8 CPUs available, 8 CPUs total
> Registering legacy COM ports for serial console
> MCA related initialization done
> Virtual mem_map starts at 0xa0007fffffe40000
> On node 0 totalpages: 30397
>   DMA zone: 30397 pages, LIFO batch:4
>   Normal zone: 0 pages, LIFO batch:1
>   HighMem zone: 0 pages, LIFO batch:1
> Built 1 zonelists
> Kernel command line: BOOT_IMAGE=atapi0:\efi\redhat\vmlinuz-2.6.9-22.EL  root=/dev/hda2 console=tty0 console=ttyS0,9600n8r hda=noprobe hdb=noprobe ro
> ide_setup: hda=noprobe
> ide_setup: hdb=noprobe
> PID hash table entries: 1024 (order: 10, 32768 bytes)
> CPU 0: base freq=199.459MHz, ITC ratio=8/4, ITC freq=398.919MHz+/--1ppm
> Console: colour VGA+ 80x25
> Dentry cache hash table entries: 65536 (order: 5, 524288 bytes)
> Inode-cache hash table entries: 32768 (order: 4, 262144 bytes)
> Placing software IO TLB between 0x4a64000 - 0x8a64000
> Memory: 413248k/486352k available (5611k code, 85024k reserved, 2260k data, 384k init)
> McKinley Errata 9 workaround not needed; disabling it
> Calibrating delay loop... 3130.52 BogoMIPS (lpj=1527808)
> Security Scaffold v1.0.0 initialized
> SELinux:  Initializing.
> SELinux:  Starting in permissive mode
> There is already a security framework initialized, register_security failed.
> selinux_register_security:  Registering secondary module capability
> Capability LSM initialized as secondary
> Mount-cache hash table entries: 1024 (order: 0, 16384 bytes)
> Boot processor id 0x0/0x0
> task migration cache decay timeout: 10 msecs.
> CPU 1: synchronized ITC with CPU 0 (last diff 2 cycles, maxerr 307 cycles)
> CPU 1: base freq=199.459MHz, ITC ratio=8/4, ITC freq=398.919MHz+/--1ppm
> Calibrating delay loop... 3139.76 BogoMIPS (lpj=1531904)
> CPU 2: synchronized ITC with CPU 0 (last diff -2 cycles, maxerr 307 cycles)
> CPU 2: base freq=199.459MHz, ITC ratio=8/4, ITC freq=398.919MHz+/--1ppm
> Calibrating delay loop... 3139.76 BogoMIPS (lpj=1531904)
> CPU 3: synchronized ITC with CPU 0 (last diff -2 cycles, maxerr 307 cycles)
> CPU 3: base freq=199.459MHz, ITC ratio=8/4, ITC freq=398.919MHz+/--1ppm
> Calibrating delay loop... 3113.04 BogoMIPS (lpj=1519616)
> CPU 4: synchronized ITC with CPU 0 (last diff -6 cycles, maxerr 307 cycles)
> CPU 4: base freq=199.459MHz, ITC ratio=8/4, ITC freq=398.919MHz+/--1ppm
> Calibrating delay loop... 3139.76 BogoMIPS (lpj=1531904)
> CPU 5: synchronized ITC with CPU 0 (last diff -2 cycles, maxerr 307 cycles)
> CPU 5: base freq=199.459MHz, ITC ratio=8/4, ITC freq=398.919MHz+/--1ppm
> Calibrating delay loop... 3130.52 BogoMIPS (lpj=1527808)
> CPU 6: synchronized ITC with CPU 0 (last diff 2 cycles, maxerr 307 cycles)
> CPU 6: base freq=199.459MHz, ITC ratio=8/4, ITC freq=398.919MHz+/--1ppm
> Calibrating delay loop... 3122.28 BogoMIPS (lpj=1523712)
> CPU 7: synchronized ITC with CPU 0 (last diff -2 cycles, maxerr 308 cycles)
> CPU 7: base freq=199.459MHz, ITC ratio=8/4, ITC freq=398.919MHz+/--1ppm
> Calibrating delay loop... 3130.52 BogoMIPS (lpj=1527808)
> Brought up 8 CPUs
> Total of 8 processors activated (25046.16 BogoMIPS).
> checking if image is initramfs... it is
> Freeing initrd memory: 1552kB freed
> NET: Registered protocol family 16
> ACPI: Subsystem revision 20040816
> ACPI: Interpreter enabled
> ACPI: Using IOSAPIC for interrupt routing
> ACPI: PCI Root Bridge [PCI0] (00:00)
> ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
> usbcore: registered new driver usbfs
> usbcore: registered new driver hub
> PCI: Using ACPI for IRQ routing
> GSI 20 (level, low) -> CPU 0 (0x0000) vector 48
> ACPI: PCI interrupt 0000:00:01.3[A] -> GSI 20 (level, low) -> IRQ 48
> GSI 28 (level, low) -> CPU 1 (0x0100) vector 49
> ACPI: PCI interrupt 0000:00:03.0[A] -> GSI 28 (level, low) -> IRQ 49
> GSI 32 (level, low) -> CPU 2 (0x0200) vector 50
> ACPI: PCI interrupt 0000:00:04.0[A] -> GSI 32 (level, low) -> IRQ 50
> perfmon: version 2.0 IRQ 238
> perfmon: Generic PMU detected, 8 PMCs, 4 PMDs, 4 counters (32 bits)
> PAL Information Facility v0.5
> perfmon: added sampling format default_format
> perfmon_default_smpl: default_format v2.0 registered
> audit: initializing netlink socket (disabled)
> audit(1240842100.314:1): initialized
> Total HugeTLB memory allocated, 0
> VFS: Disk quotas dquot_6.5.1
> Dquot-cache hash table entries: 2048 (order 0, 16384 bytes)
> SELinux:  Registering netfilter hooks
> Initializing Cryptographic API
> ksign: Installing public key data
> Loading keyring
> - Added public key 453B8631FE6159D3
> - User ID: Red Hat, Inc. (Kernel Module GPG key)
> Limiting direct PCI/PCI transfers.
> PCI: PIIX3: Enabling Passive Release on 0000:00:01.0
> Activating ISA DMA hang workarounds.
> pci_hotplug: PCI Hot Plug PCI Core version: 0.5
> ACPI: Processor [CPU0] (supports C1)
> ACPI: Processor [CPU1] (supports C1)
> ACPI: Processor [CPU2] (supports C1)
> ACPI: Processor [CPU3] (supports C1)
> EFI Time Services Driver v0.4
> Linux agpgart interface v0.100 (c) Dave Jones
> serio: i8042 AUX port at 0x60,0x64 irq 36
> serio: i8042 KBD port at 0x60,0x64 irq 32
> Serial: 8250/16550 driver $Revision: 1.90 $ 20 ports, IRQ sharing enabled
> ttyS0 at I/O 0x3f8 (irq = 44) is a 16550A
> RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
> divert: not allocating divert_blk for non-ethernet device lo
> Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
> ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
> PIIX3: IDE controller at PCI slot 0000:00:01.1
> PIIX3: chipset revision 0
> PIIX3: not 100% native mode: will probe irqs later
>     ide0: BM-DMA at 0xc000-0xc007, BIOS settings: hda:pio, hdb:pio
>     ide1: BM-DMA at 0xc008-0xc00f, BIOS settings: hdc:pio, hdd:pio
> Probing IDE interface ide0...
> Probing IDE interface ide1...
> Probing IDE interface ide0...
> Probing IDE interface ide1...
> Probing IDE interface ide2...
> Probing IDE interface ide3...
> Probing IDE interface ide4...
> Probing IDE interface ide5...
> ide-floppy driver 0.99.newide
> usbcore: registered new driver hiddev
> usbcore: registered new driver usbhid
> drivers/usb/input/hid-core.c: v2.0:USB HID core driver
> mice: PS/2 mouse device common for all mice
> input: AT Translated Set 2 keyboard on isa0060/serio0
> input: ImExPS/2 Generic Explorer Mouse on isa0060/serio1
> md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
> EFI Variables Facility v0.08 2004-May-17
> NET: Registered protocol family 2
> IP: routing cache hash table of 8192 buckets, 128Kbytes
> TCP: Hash tables configured (established 65536 bind 65536)
> Initializing IPsec netlink socket
> NET: Registered protocol family 1
> NET: Registered protocol family 17
> Freeing unused kernel memory: 384kB freed
> ACPI: PCI interrupt 0000:00:03.0[A] -> GSI 28 (level, low) -> IRQ 49
> suspend: event channel 11
> xen-vbd: registered block device major 3
> Using cfq io scheduler
>  hda: hda1<6>netfront: Initialising virtual ethernet driver.
>  hda2
> divert: allocating divert_blk for eth0
> kjournald starting.  Commit interval 5 seconds
> EXT3-fs: mounted filesystem with ordered data mode.
> request_module: runaway loop modprobe binfmt-429d
> request_module: runaway loop modprobe binfmt-429d
> request_module: runaway loop modprobe binfmt-429d
> request_module: runaway loop modprobe binfmt-429d
> request_module: runaway loop modprobe binfmt-429d

> _______________________________________________
> Xen-devel mailing list
> xen-de...@lists.xensource.com
> http://lists.xensource.com/xen-devel


-- 
yamahata
blktap: make blktap_kick_user() static.

blktap_kick_user() is only used within blktap.c, so make it static.

Signed-off-by: Isaku Yamahata <yamah...@valinux.co.jp>

diff --git a/drivers/xen/blktap/blktap.c b/drivers/xen/blktap/blktap.c
--- a/drivers/xen/blktap/blktap.c
+++ b/drivers/xen/blktap/blktap.c
@@ -855,7 +855,7 @@ static unsigned int blktap_poll(struct f
 	return 0;
 }
 
-void blktap_kick_user(int idx)
+static void blktap_kick_user(int idx)
 {
 	tap_blkif_t *info;
 
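blktap: fix racy memory reference with ring_ok.

Fix a racy memory reference guarded by ring_ok: clear ring_ok at the
start of release, and add memory barriers so that the flag is ordered
against the data it protects.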

Signed-off-by: Isaku Yamahata <yamah...@valinux.co.jp>
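
For readers who want the barrier pairing spelled out, here is a minimal
userspace sketch of the same publish/check pattern using C11 atomics. It is
not the driver code: struct tap_state, tap_publish() and tap_try_get() are
hypothetical names, and the release store / acquire load only roughly
correspond to "smp_wmb(); ring_ok = 1" on the writer side and
"check ring_ok; smp_rmb()" on the reader side.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for tap_blkif_t: some state plus a ready flag. */
struct tap_state {
	void *payload;       /* data that must be visible before the flag */
	atomic_int ring_ok;  /* plays the role of info->ring_ok */
};

/* Writer: set up the data first, then publish the flag with release
 * ordering (roughly smp_wmb() followed by ring_ok = 1). */
static void tap_publish(struct tap_state *s, void *payload)
{
	assert(s != NULL);
	s->payload = payload;
	atomic_store_explicit(&s->ring_ok, 1, memory_order_release);
}

/* Reader: check the flag with acquire ordering before touching the data
 * (roughly the ring_ok test followed by smp_rmb()). */
static void *tap_try_get(struct tap_state *s)
{
	assert(s != NULL);
	if (!atomic_load_explicit(&s->ring_ok, memory_order_acquire))
		return NULL; /* ring not ready for requests */
	return s->payload;
}
```

The sketch covers only the publish side; it does not model the teardown
ordering in blktap_release().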

diff --git a/drivers/xen/blktap/blktap.c b/drivers/xen/blktap/blktap.c
--- a/drivers/xen/blktap/blktap.c
+++ b/drivers/xen/blktap/blktap.c
@@ -617,6 +617,9 @@ static int blktap_release(struct inode *
 	if (!info)
 		return 0;
 
+	info->ring_ok = 0;
+	smp_wmb();
+
 	info->dev_inuse = 0;
 	DPRINTK("Freeing device [/dev/xen/blktap%d]\n",info->minor);
 
@@ -717,6 +720,7 @@ static int blktap_mmap(struct file *filp
 #endif
 
 	info->vma = vma;
+	smp_wmb();
 	info->ring_ok = 1;
 	return 0;
  fail:
@@ -1390,6 +1394,7 @@ static void dispatch_rw_block_io(blkif_t
 		WPRINTK("blktap: ring not ready for requests!\n");
 		goto fail_response;
 	}
+	smp_rmb();
 
 	if (RING_FULL(&info->ufe_ring)) {
 		WPRINTK("blktap: fe_ring is full, can't add "
blktap: don't use vma->vm_start to calculate offset.

A vm_area_struct can be split, so we can't depend on vm_start.
Instead, use tap_blkif_t::rings_vstart.

Signed-off-by: Isaku Yamahata <yamah...@valinux.co.jp>
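
To illustrate why vm_start is not a stable base, here is a small userspace
sketch of the offset arithmetic. The names (SKETCH_PAGE_SHIFT, page_index())
and the addresses in the usage note are made up for illustration; only the
"(uvaddr - base) >> PAGE_SHIFT" shape is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size of 4 KiB for the sketch; the real PAGE_SHIFT is
 * architecture and configuration dependent. */
#define SKETCH_PAGE_SHIFT 12

/* Page index of uvaddr relative to a base address, mirroring
 * (uvaddr - base) >> PAGE_SHIFT from the patch. */
static size_t page_index(unsigned long base, unsigned long uvaddr)
{
	assert(uvaddr >= base);
	return (size_t)((uvaddr - base) >> SKETCH_PAGE_SHIFT);
}
```

Suppose the mapping originally started at 0x10000 (the saved rings_vstart)
and a partial munmap left a tail vma starting at 0x14000. For the page at
0x15000, page_index(0x10000, 0x15000) is 5, which still indexes the map
array correctly, while page_index(0x14000, 0x15000) is 1, which does not.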

diff --git a/drivers/xen/blktap/blktap.c b/drivers/xen/blktap/blktap.c
--- a/drivers/xen/blktap/blktap.c
+++ b/drivers/xen/blktap/blktap.c
@@ -318,7 +318,7 @@ static pte_t blktap_clear_pte(struct vm_
 	pte_t copy;
 	tap_blkif_t *info;
 	int offset, seg, usr_idx, pending_idx, mmap_idx;
-	unsigned long uvstart = vma->vm_start + (RING_PAGES << PAGE_SHIFT);
+	unsigned long uvstart;
 	unsigned long kvaddr;
 	struct tap_vma_priv *priv;
 	struct page *pg;
@@ -330,11 +330,15 @@ static pte_t blktap_clear_pte(struct vm_
 	 * If the address is before the start of the grant mapped region or
 	 * if vm_file is NULL (meaning mmap failed and we have nothing to do)
 	 */
-	if (uvaddr < uvstart || vma->vm_file == NULL)
+	if (vma->vm_file != NULL) {
+		info = vma->vm_file->private_data;
+		uvstart = info->rings_vstart + (RING_PAGES << PAGE_SHIFT);
+	} else
+		uvstart = uvaddr;	/* make the following if clause true */
+	if (uvaddr < uvstart)
 		return ptep_get_and_clear_full(vma->vm_mm, uvaddr, 
 					       ptep, is_fullmm);
 
-	info = vma->vm_file->private_data;
 	priv = vma->vm_private_data;
 
 	/* TODO Should these be changed to if statements? */
@@ -1210,8 +1214,7 @@ static int blktap_read_ufe_ring(tap_blki
 
 			pg = pfn_to_page(__pa(kvaddr) >> PAGE_SHIFT);
 			ClearPageReserved(pg);
-			offset = (uvaddr - info->vma->vm_start) 
-				>> PAGE_SHIFT;
+			offset = (uvaddr - info->rings_vstart) >> PAGE_SHIFT;
 			priv->map[offset] = NULL;
 		}
 		fast_flush_area(pending_req, pending_idx, usr_idx, info->minor);
@@ -1501,7 +1504,7 @@ static void dispatch_rw_block_io(blkif_t
 			set_phys_to_machine(__pa(kvaddr) >> PAGE_SHIFT,
 					    FOREIGN_FRAME(map[i].dev_bus_addr
 							  >> PAGE_SHIFT));
-			offset = (uvaddr - info->vma->vm_start) >> PAGE_SHIFT;
+			offset = (uvaddr - info->rings_vstart) >> PAGE_SHIFT;
 			pg = pfn_to_page(__pa(kvaddr) >> PAGE_SHIFT);
 			priv->map[offset] = pg;
 		}
@@ -1528,7 +1531,7 @@ static void dispatch_rw_block_io(blkif_t
 			if (ret)
 				continue;
 
-			offset = (uvaddr - info->vma->vm_start) >> PAGE_SHIFT;
+			offset = (uvaddr - info->rings_vstart) >> PAGE_SHIFT;
 			pg = pfn_to_page(__pa(kvaddr) >> PAGE_SHIFT);
 			priv->map[offset] = pg;
 		}
linux/blktap: fix vma_close() for partial munmap.

vm_area_struct::vm_private_data is used by get_user_pages(), so we
can't override it. To make blktap work, set it to an array of
struct page *.

Without holding mm->mmap_sem, the virtual mapping can change, so
remembering the vma that was passed to the mmap callback is bogus:
the vma can later be freed or changed. So don't remember the vma.
Put the necessary information into tap_blkif_t instead, and use
find_vma() to look up the vmas when they are needed.

Signed-off-by: Isaku Yamahata <yamah...@valinux.co.jp>
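
The idea of looking up the vmas on demand and walking an address range that
may now span several of them can be sketched in userspace as follows. This
is an illustration only: struct region, find_region() and walk_range() are
hypothetical stand-ins for vm_area_struct, find_vma() and the range walk,
and the loop condition (r->start < end) is a choice made for this sketch,
not a copy of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for vm_area_struct: [start, end), sorted list. */
struct region {
	unsigned long start, end;
	struct region *next;
};

/* Like find_vma(): first region whose end is above addr, or NULL. */
static struct region *find_region(struct region *head, unsigned long addr)
{
	while (head && head->end <= addr)
		head = head->next;
	return head;
}

/* Visit every mapped piece of [addr, addr + len) and return the number of
 * bytes covered. Holes left by a partial munmap are skipped. */
static unsigned long walk_range(struct region *head,
				unsigned long addr, unsigned long len)
{
	unsigned long end = addr + len, covered = 0;
	struct region *r = find_region(head, addr);

	while (r && r->start < end) {
		unsigned long s = r->start > addr ? r->start : addr;
		unsigned long e = r->end < end ? r->end : end;

		assert(e >= s);
		covered += e - s; /* kernel code would zap [s, e) here */
		r = r->next;
	}
	return covered;
}
```

With regions [0x1000, 0x3000), [0x3000, 0x4000) and [0x6000, 0x8000),
walking [0x2000, 0x7000) covers 0x3000 bytes and skips the unmapped hole
between 0x4000 and 0x6000.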

diff --git a/drivers/xen/blktap/blktap.c b/drivers/xen/blktap/blktap.c
--- a/drivers/xen/blktap/blktap.c
+++ b/drivers/xen/blktap/blktap.c
@@ -99,7 +99,7 @@ typedef struct domid_translate_ext {
 
 /*Data struct associated with each of the tapdisk devices*/
 typedef struct tap_blkif {
-	struct vm_area_struct *vma;   /*Shared memory area                   */
+	struct mm_struct *mm;         /*User address space                   */
 	unsigned long rings_vstart;   /*Kernel memory mapping                */
 	unsigned long user_vstart;    /*User memory mapping                  */
 	unsigned long dev_inuse;      /*One process opens device at a time.  */
@@ -116,6 +116,7 @@ typedef struct tap_blkif {
 					[req id, idx] tuple                  */
 	blkif_t *blkif;               /*Associate blkif with tapdev          */
 	struct domid_translate_ext trans; /*Translation from domid to bus.   */
+	struct page **map;	      /*Mapping page */
 } tap_blkif_t;
 
 static struct tap_blkif *tapfds[MAX_TAP_DEV];
@@ -293,10 +294,6 @@ static inline int OFFSET_TO_SEG(int offs
 /******************************************************************
  * BLKTAP VM OPS
  */
-struct tap_vma_priv {
-	tap_blkif_t *info;
-	struct page *map[];
-};
 
 static struct page *blktap_nopage(struct vm_area_struct *vma,
 				  unsigned long address,
@@ -315,11 +312,10 @@ static pte_t blktap_clear_pte(struct vm_
 			      pte_t *ptep, int is_fullmm)
 {
 	pte_t copy;
-	tap_blkif_t *info;
+	tap_blkif_t *info = NULL;
 	int offset, seg, usr_idx, pending_idx, mmap_idx;
 	unsigned long uvstart;
 	unsigned long kvaddr;
-	struct tap_vma_priv *priv;
 	struct page *pg;
 	struct grant_handle_pair *khandle;
 	struct gnttab_unmap_grant_ref unmap[2];
@@ -338,12 +334,9 @@ static pte_t blktap_clear_pte(struct vm_
 		return ptep_get_and_clear_full(vma->vm_mm, uvaddr, 
 					       ptep, is_fullmm);
 
-	priv = vma->vm_private_data;
-
 	/* TODO Should these be changed to if statements? */
 	BUG_ON(!info);
 	BUG_ON(!info->idx_map);
-	BUG_ON(!priv);
 
 	offset = (int) ((uvaddr - uvstart) >> PAGE_SHIFT);
 	usr_idx = OFFSET_TO_USR_IDX(offset);
@@ -355,7 +348,7 @@ static pte_t blktap_clear_pte(struct vm_
 	kvaddr = idx_to_kaddr(mmap_idx, pending_idx, seg);
 	pg = pfn_to_page(__pa(kvaddr) >> PAGE_SHIFT);
 	ClearPageReserved(pg);
-	priv->map[offset + RING_PAGES] = NULL;
+	info->map[offset + RING_PAGES] = NULL;
 
 	khandle = &pending_handle(mmap_idx, pending_idx, seg);
 
@@ -396,19 +389,43 @@ static pte_t blktap_clear_pte(struct vm_
 	return copy;
 }
 
+static void blktap_vma_open(struct vm_area_struct *vma)
+{
+	tap_blkif_t *info;
+	if (vma->vm_file == NULL)
+		return;
+
+	info = vma->vm_file->private_data;
+	vma->vm_private_data =
+		&info->map[(vma->vm_start - info->rings_vstart) >> PAGE_SHIFT];
+}
+
+/* tricky part
+ * When partially munmapping, ->open() is called only on the split vma that
+ * will be released soon. See split_vma() and do_munmap() in mm/mmap.c.
+ * So there is no chance to fix up vm_private_data of the end vma.
+ */
 static void blktap_vma_close(struct vm_area_struct *vma)
 {
-	struct tap_vma_priv *priv = vma->vm_private_data;
+	tap_blkif_t *info;
+	struct vm_area_struct *next = vma->vm_next;
 
-	if (priv) {
-		priv->info->vma = NULL;
-		kfree(priv);
-	}
+	if (next == NULL ||
+	    vma->vm_ops != next->vm_ops ||
+	    vma->vm_end != next->vm_start ||
+	    vma->vm_file == NULL ||
+	    vma->vm_file != next->vm_file)
+		return;
+
+	info = vma->vm_file->private_data;
+	next->vm_private_data =
+		&info->map[(next->vm_start - info->rings_vstart) >> PAGE_SHIFT];
 }
 
-struct vm_operations_struct blktap_vm_ops = {
+static struct vm_operations_struct blktap_vm_ops = {
 	nopage:   blktap_nopage,
 	zap_pte:  blktap_clear_pte,
+	open:     blktap_vma_open,
 	close:    blktap_vma_close,
 };
 
@@ -455,7 +472,7 @@ static tap_blkif_t *get_next_free_dev(vo
 		info = tapfds[minor];
 		/* we could have failed a previous attempt. */
 		if (!info ||
-		    ((info->dev_inuse == 0) &&
+		    ((!test_bit(0, &info->dev_inuse)) &&
 		     (info->dev_pending == 0)) ) {
 			info->dev_pending = 1;
 			goto found;
@@ -592,7 +609,7 @@ static int blktap_open(struct inode *ino
 	FRONT_RING_INIT(&info->ufe_ring, sring, PAGE_SIZE);
 	
 	filp->private_data = info;
-	info->vma = NULL;
+	info->mm = NULL;
 
 	info->idx_map = kmalloc(sizeof(unsigned long) * MAX_PENDING_REQS, 
 				GFP_KERNEL);
@@ -624,8 +641,10 @@ static int blktap_release(struct inode *
 	info->ring_ok = 0;
 	smp_wmb();
 
-	info->dev_inuse = 0;
-	DPRINTK("Freeing device [/dev/xen/blktap%d]\n",info->minor);
+	mmput(info->mm);
+	info->mm = NULL;
+	kfree(info->map);
+	info->map = NULL;
 
 	/* Free the ring page. */
 	ClearPageReserved(virt_to_page(info->ufe_ring.sring));
@@ -644,6 +663,9 @@ static int blktap_release(struct inode *
 		info->status = CLEANSHUTDOWN;
 	}
 
+	clear_bit(0, &info->dev_inuse);
+	DPRINTK("Freeing device [/dev/xen/blktap%d]\n",info->minor);
+
 	return 0;
 }
 
@@ -669,7 +691,6 @@ static int blktap_release(struct inode *
 static int blktap_mmap(struct file *filp, struct vm_area_struct *vma)
 {
 	int size;
-	struct tap_vma_priv *priv;
 	tap_blkif_t *info = filp->private_data;
 	int ret;
 
@@ -706,16 +727,14 @@ static int blktap_mmap(struct file *filp
 	}
 
 	/* Mark this VM as containing foreign pages, and set up mappings. */
-	priv = kzalloc(sizeof(*priv) + ((vma->vm_end - vma->vm_start)
-					>> PAGE_SHIFT) * sizeof(*priv->map),
-		       GFP_KERNEL);
-	if (priv == NULL) {
+	info->map = kzalloc(((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) *
+			    sizeof(*info->map), GFP_KERNEL);
+	if (info->map == NULL) {
 		WPRINTK("Couldn't alloc VM_FOREIGN map.\n");
 		goto fail;
 	}
-	priv->info = info;
 
-	vma->vm_private_data = priv;
+	vma->vm_private_data = info->map;
 	vma->vm_flags |= VM_FOREIGN;
 	vma->vm_flags |= VM_DONTCOPY;
 
@@ -723,7 +742,7 @@ static int blktap_mmap(struct file *filp
 	vma->vm_mm->context.has_foreign_mappings = 1;
 #endif
 
-	info->vma = vma;
+	info->mm = get_task_mm(current);
 	smp_wmb();
 	info->ring_ok = 1;
 	return 0;
@@ -997,6 +1016,24 @@ static void free_req(pending_req_t *req)
 		wake_up(&pending_free_wq);
 }
 
+static void blktap_zap_page_range(struct mm_struct *mm,
+				  unsigned long uvaddr, int nr_pages)
+{
+	unsigned long end = uvaddr + (nr_pages << PAGE_SHIFT);
+	struct vm_area_struct *vma;
+
+	vma = find_vma(mm, uvaddr);
+	while (vma && uvaddr < end) {
+		unsigned long s = max(uvaddr, vma->vm_start);
+		unsigned long e = min(end, vma->vm_end);
+
+		zap_page_range(vma, s, e - s, NULL);
+
+		uvaddr = e;
+		vma = vma->vm_next;
+	}
+}
+
 static void fast_flush_area(pending_req_t *req, int k_idx, int u_idx,
 			    int tapidx)
 {
@@ -1017,14 +1054,13 @@ static void fast_flush_area(pending_req_
 		return;
 	}
 
-	mm = info->vma ? info->vma->vm_mm : NULL;
+	mm = info->mm;
 
-	if (info->vma != NULL &&
-	    xen_feature(XENFEAT_auto_translated_physmap)) {
+	if (mm != NULL && xen_feature(XENFEAT_auto_translated_physmap)) {
 		down_write(&mm->mmap_sem);
-		zap_page_range(info->vma, 
-			       MMAP_VADDR(info->user_vstart, u_idx, 0), 
-			       req->nr_pages << PAGE_SHIFT, NULL);
+		blktap_zap_page_range(mm,
+				      MMAP_VADDR(info->user_vstart, u_idx, 0),
+				      req->nr_pages);
 		up_write(&mm->mmap_sem);
 		return;
 	}
@@ -1075,13 +1111,12 @@ static void fast_flush_area(pending_req_
 		GNTTABOP_unmap_grant_ref, unmap, invcount);
 	BUG_ON(ret);
 	
-	if (info->vma != NULL &&
-	    !xen_feature(XENFEAT_auto_translated_physmap)) {
+	if (mm != NULL && !xen_feature(XENFEAT_auto_translated_physmap)) {
 		if (!locked++)
 			down_write(&mm->mmap_sem);
-		zap_page_range(info->vma, 
-			       MMAP_VADDR(info->user_vstart, u_idx, 0), 
-			       req->nr_pages << PAGE_SHIFT, NULL);
+		blktap_zap_page_range(mm, 
+				      MMAP_VADDR(info->user_vstart, u_idx, 0), 
+				      req->nr_pages);
 	}
 
 	if (locked)
@@ -1195,7 +1230,6 @@ static int blktap_read_ufe_ring(tap_blki
 		for (j = 0; j < pending_req->nr_pages; j++) {
 
 			unsigned long kvaddr, uvaddr;
-			struct tap_vma_priv *priv = info->vma->vm_private_data;
 			struct page *pg;
 			int offset;
 
@@ -1205,7 +1239,7 @@ static int blktap_read_ufe_ring(tap_blki
 			pg = pfn_to_page(__pa(kvaddr) >> PAGE_SHIFT);
 			ClearPageReserved(pg);
 			offset = (uvaddr - info->rings_vstart) >> PAGE_SHIFT;
-			priv->map[offset] = NULL;
+			info->map[offset] = NULL;
 		}
 		fast_flush_area(pending_req, pending_idx, usr_idx, info->minor);
 		info->idx_map[usr_idx] = INVALID_REQ;
@@ -1267,7 +1301,8 @@ static int do_block_io_op(blkif_t *blkif
 
 	info = tapfds[blkif->dev_num];
 
-	if (blkif->dev_num > MAX_TAP_DEV || !info || !info->dev_inuse) {
+	if (blkif->dev_num > MAX_TAP_DEV || !info ||
+	    !test_bit(0, &info->dev_inuse)) {
 		if (print_dbug) {
 			WPRINTK("Can't get UE info!\n");
 			print_dbug = 0;
@@ -1363,12 +1398,12 @@ static void dispatch_rw_block_io(blkif_t
 	unsigned int nseg;
 	int ret, i, nr_sects = 0;
 	tap_blkif_t *info;
-	struct tap_vma_priv *priv;
 	blkif_request_t *target;
 	int pending_idx = RTN_PEND_IDX(pending_req,pending_req->mem_idx);
 	int usr_idx;
 	uint16_t mmap_idx = pending_req->mem_idx;
 	struct mm_struct *mm;
+	struct vm_area_struct *vma = NULL;
 
 	if (blkif->dev_num < 0 || blkif->dev_num > MAX_TAP_DEV)
 		goto fail_response;
@@ -1413,8 +1448,7 @@ static void dispatch_rw_block_io(blkif_t
 	pending_req->status    = BLKIF_RSP_OKAY;
 	pending_req->nr_pages  = nseg;
 	op = 0;
-	priv = info->vma->vm_private_data;
-	mm = info->vma->vm_mm;
+	mm = info->mm;
 	if (!xen_feature(XENFEAT_auto_translated_physmap))
 		down_write(&mm->mmap_sem);
 	for (i = 0; i < nseg; i++) {
@@ -1497,7 +1531,7 @@ static void dispatch_rw_block_io(blkif_t
 							  >> PAGE_SHIFT));
 			offset = (uvaddr - info->rings_vstart) >> PAGE_SHIFT;
 			pg = pfn_to_page(__pa(kvaddr) >> PAGE_SHIFT);
-			priv->map[offset] = pg;
+			info->map[offset] = pg;
 		}
 	} else {
 		for (i = 0; i < nseg; i++) {
@@ -1524,7 +1558,7 @@ static void dispatch_rw_block_io(blkif_t
 
 			offset = (uvaddr - info->rings_vstart) >> PAGE_SHIFT;
 			pg = pfn_to_page(__pa(kvaddr) >> PAGE_SHIFT);
-			priv->map[offset] = pg;
+			info->map[offset] = pg;
 		}
 	}
 
@@ -1542,9 +1576,23 @@ static void dispatch_rw_block_io(blkif_t
 		pg = pfn_to_page(__pa(kvaddr) >> PAGE_SHIFT);
 		SetPageReserved(pg);
 		if (xen_feature(XENFEAT_auto_translated_physmap)) {
-			ret = vm_insert_page(info->vma,
-					     MMAP_VADDR(info->user_vstart,
-							usr_idx, i), pg);
+			unsigned long uvaddr = MMAP_VADDR(info->user_vstart,
+							  usr_idx, i);
+			if (vma && uvaddr >= vma->vm_end) {
+				vma = vma->vm_next;
+				if (vma &&
+				    (uvaddr < vma->vm_start ||
+				     uvaddr >= vma->vm_end))
+					vma = NULL;
+			}
+			if (vma == NULL) {
+				vma = find_vma(mm, uvaddr);
+				/* this virtual area was already munmapped.
+				   so skip to next page */
+				if (!vma)
+					continue;
+			}
+			ret = vm_insert_page(vma, uvaddr, pg);
 			if (ret) {
 				up_write(&mm->mmap_sem);
 				goto fail_flush;
_______________________________________________
Xen-ia64-devel mailing list
Xen-ia64-devel@lists.xensource.com
http://lists.xensource.com/xen-ia64-devel
