Re: [PATCH v3 00/11] xen: Initial kexec/kdump implementation

2012-12-26 Thread Eric W. Biederman
The syscall ABI still has the wrong semantics.

Aka totally unmaintainable and unmergeable.

The concept of domU support is also strange.  What does domU support even mean,
when the dom0 support is loading a kernel to pick up Xen when Xen falls over?

I expect a lot of decisions about what code can and cannot be shared are going
to be driven by the simple question: what does the syscall mean?

Sharing machine_kexec.c and relocate_kernel.S does not make much sense to me 
when what you are doing is effectively passing your arguments through to the 
Xen version of kexec.

Either Xen has its own version of those routines, or I expect the Xen version
of kexec is buggy.   I can't imagine what sharing that code would mean.  By the
same token I can't see any need to duplicate the code either.

Furthermore since this is just passing data from one version of the syscall to 
another I expect you can share the majority of the code across all 
architectures that implement Xen.  The only part I can see being arch specific 
is the Xen syscall stub.
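
To illustrate the shape of that (schematic only; the struct and command
names follow the Xen kexec interface headers of the time, and the hypercall
wrapper is assumed rather than quoted from real code):

	/* The shared core just repackages the sys_kexec_load() arguments
	 * for the hypervisor; only this stub is arch specific. */
	static int xen_kexec_do_load(struct kimage *image)
	{
		struct xen_kexec_load load = {
			.type = KEXEC_TYPE_DEFAULT,	/* or KEXEC_TYPE_CRASH */
			/* ... segment list translated for the hypervisor ... */
		};

		return HYPERVISOR_kexec_op(KEXEC_CMD_kexec_load, &load);
	}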

With respect to the proposed semantics of silently giving the kexec system call
a different meaning when running under Xen: /sbin/kexec has to act somewhat
differently when loading code into the Xen hypervisor, so there is no point in
not making that explicit in the ABI.

Eric



[PATCH] drivers/tty/serial: check pointer before releasing resources in uart_remove_one_port

2012-12-26 Thread Chen Gang

  For the exported function uart_remove_one_port:
we need to check whether the pointer is NULL before doing the main work,
just like the other exported function uart_add_one_port already does;
uart_add_one_port and uart_remove_one_port are a pair.

  Additional information:
the callers (such as jsm_tty.c and jsm_driver.c in drivers/tty/serial/jsm)
really do assume that they can still call uart_remove_one_port after
uart_add_one_port has failed, so as an exported function we have to
tolerate that (just like kfree tolerates NULL), as illustrated below.
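
Illustration of the caller pattern described above (driver function names
are hypothetical, modeled loosely on jsm):

	static int example_probe(struct uart_driver *drv, struct uart_port *uport)
	{
		int rc = uart_add_one_port(drv, uport);

		if (rc)
			return rc;	/* the port was never added */
		/* ... a later failure here unwinds via example_remove() ... */
		return 0;
	}

	static void example_remove(struct uart_driver *drv, struct uart_port *uport)
	{
		/* May run even though example_probe() failed; with the check
		 * added below, this returns -EINVAL instead of dereferencing
		 * a NULL state->uart_port. */
		uart_remove_one_port(drv, uport);
	}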
 

Signed-off-by: Chen Gang 
---
 drivers/tty/serial/serial_core.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index 2c7230a..bd65549 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -2642,6 +2642,7 @@ int uart_remove_one_port(struct uart_driver *drv, struct uart_port *uport)
 {
 	struct uart_state *state = drv->state + uport->line;
 	struct tty_port *port = &state->port;
+	int ret = 0;
+   int ret = 0;
 
BUG_ON(in_interrupt());
 
@@ -2656,6 +2657,11 @@ int uart_remove_one_port(struct uart_driver *drv, struct uart_port *uport)
 	 * succeeding while we shut down the port.
 	 */
 	mutex_lock(&port->mutex);
+	if (!state->uart_port) {
+		mutex_unlock(&port->mutex);
+		ret = -EINVAL;
+		goto out;
+	}
 	uport->flags |= UPF_DEAD;
 	mutex_unlock(&port->mutex);
 
@@ -2679,9 +2685,10 @@ int uart_remove_one_port(struct uart_driver *drv, struct uart_port *uport)
 	uport->type = PORT_UNKNOWN;
 
 	state->uart_port = NULL;
+out:
 	mutex_unlock(&port_mutex);
 
-	return 0;
+	return ret;
 }
 
 /*
-- 
1.7.10.4


[PATCH] pci-sysfs: replace mutex_lock with mutex_trylock to avoid potential deadlock situation

2012-12-26 Thread Lin Feng
There is a potential deadlock when manipulating the pci-sysfs user interfaces
from different bus hierarchies simultaneously, as described below:

path1: sysfs remove device:              | path2: sysfs rescan device:
sysfs_schedule_callback_work()           | sysfs_write_file()
  remove_callback()                      |   flush_write_buffer()
*1* mutex_lock(&pci_remove_rescan_mutex) | *2* sysfs_get_active(attr_sd)
  ...                                    | dev_attr_store()
device_remove_file()                     |   dev_rescan_store()
  ...                                    | *4* mutex_lock(&pci_remove_rescan_mutex)
*3*   sysfs_deactivate(sd)               | ...
        wait_for_completion()            | *5* sysfs_put_active(attr_sd)
*6* mutex_unlock(&pci_remove_rescan_mutex)

If path1 takes pci_remove_rescan_mutex first at *1*, and path2 then becomes
active and reaches *2* before path1 reaches *3*, we run into a deadlock:
path1 holds the mutex while waiting for path2 to decrease the sysfs_dirent
s_active counter at *5*, but path2 is blocked at *4* trying to take
pci_remove_rescan_mutex. The mutex won't be released by path1 until it
reaches *6*, but path1 is blocked at *3*.

The workaround is to avoid manipulating (remove/rescan) the PCI tree at the
same time: if we find someone else is already manipulating it, we simply
abort the current operation instead of touching the PCI tree concurrently.
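
A minimal sketch of that approach (simplified; the input parsing and exact
error handling of the real sysfs store callback are omitted):

	static ssize_t dev_rescan_store(struct device *dev,
					struct device_attribute *attr,
					const char *buf, size_t count)
	{
		struct pci_dev *pdev = to_pci_dev(dev);

		/* Bail out instead of blocking if the PCI tree is busy. */
		if (!mutex_trylock(&pci_remove_rescan_mutex))
			return -EBUSY;

		pci_rescan_bus(pdev->bus);
		mutex_unlock(&pci_remove_rescan_mutex);

		return count;
	}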

*dmesg info*:
(snip)
e1000e :1c:00.0: eth9: Intel(R) PRO/1000 Network Connection
sd 13:2:0:0: [sdb] Attached SCSI disk
e1000e :1c:00.0: eth9: MAC: 0, PHY: 4, PBA No: D50228-005
e1000e :1c:00.1: Disabling ASPM  L1
e1000e :1c:00.1: Interrupt Throttling Rate (ints/sec) set to dynamic 
conservative mode
e1000e :1c:00.1: irq 143 for MSI/MSI-X
e1000e :1c:00.1: eth10: (PCI Express:2.5GT/s:Width x4) 00:15:17:cd:96:bf
e1000e :1c:00.1: eth10: Intel(R) PRO/1000 Network Connection
e1000e :1c:00.1: eth10: MAC: 0, PHY: 4, PBA No: D50228-005
INFO: task bash:62982 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bashD  0 62982  62978 0x0080
 88038b277db8 0082 88038b277fd8 00013940
 88038b276010 00013940 00013940 00013940
 88038b277fd8 00013940 880377449e30 8806e822c670
Call Trace:
 [] schedule+0x29/0x70
 [] schedule_preempt_disabled+0xe/0x10
 [] __mutex_lock_slowpath+0xd3/0x150
 [] mutex_lock+0x2b/0x50
 [] dev_rescan_store+0x5c/0x80
 [] dev_attr_store+0x20/0x30
 [] sysfs_write_file+0xef/0x170
 [] vfs_write+0xc8/0x190
 [] sys_write+0x51/0x90
 [] system_call_fastpath+0x16/0x1b
INFO: task bash:64141 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bashD 81610460 0 64141  64136 0x0080
 8803540e9db8 0086 8803540e9fd8 00013940
 8803540e8010 00013940 00013940 00013940
 8803540e9fd8 00013940 8807db338a10 8806f09abc60
Call Trace:
 [] schedule+0x29/0x70
 [] schedule_preempt_disabled+0xe/0x10
 [] __mutex_lock_slowpath+0xd3/0x150
 [] mutex_lock+0x2b/0x50
 [] dev_rescan_store+0x5c/0x80
 [] dev_attr_store+0x20/0x30
 [] sysfs_write_file+0xef/0x170
 [] vfs_write+0xc8/0x190
 [] sys_write+0x51/0x90
 [] system_call_fastpath+0x16/0x1b
INFO: task kworker/u:3:64451 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u:3 D 81610460 0 64451  2 0x0080
 8807d51b7a30 0046 8807d51b7fd8 00013940
 8807d51b6010 00013940 00013940 00013940
 8807d51b7fd8 00013940 8807db339420 88037744b250
Call Trace:
 [] schedule+0x29/0x70
 [] schedule_timeout+0x19d/0x220
 [] ? __slab_free+0x1f2/0x2f0
 [] wait_for_common+0x11e/0x190
 [] ? try_to_wake_up+0x2c0/0x2c0
 [] wait_for_completion+0x1d/0x20
 [] sysfs_addrm_finish+0xb8/0xd0
 [] ? sysfs_schedule_callback+0x1e0/0x1e0
 [] sysfs_hash_and_remove+0x60/0xb0
 [] sysfs_remove_file+0x39/0x50
 [] device_remove_file+0x17/0x20
 [] bus_remove_device+0xdc/0x180
 [] device_del+0x120/0x1d0
 [] device_unregister+0x22/0x60
 [] pci_stop_bus_device+0x94/0xa0
 [] pci_stop_bus_device+0x40/0xa0
 [] pci_stop_bus_device+0x40/0xa0
 [] pci_stop_bus_device+0x40/0xa0
 [] pci_stop_and_remove_bus_device+0x16/0x30
 [] remove_callback+0x29/0x40
 [] sysfs_schedule_callback_work+0x24/0x70
 [] process_one_work+0x179/0x4b0
 [] worker_thread+0x12e/0x330
 [] ? manage_workers+0x110/0x110
 [] kthread+0x9e/0xb0
 [] kernel_thread_helper+0x4/0x10
 [] ? kthread_freezable_should_stop+0x70/0x70
 [] ? gs_change+0x13/0x13

Reported-by: Taku Izumi  
Signed-off-by: Lin Feng 
Signed-off-by: Gu Zheng 
---
 drivers/pci/pci-sysfs.c |   42 ++++++++++++++++++++++++++----------------
 1 files changed, 26 insertions(+), 16 deletions(-)

Re: [PATCH 4/8] Thermal: Add Thermal_trip sysfs node

2012-12-26 Thread Hongbo Zhang
On 18 December 2012 17:29, Durgadoss R  wrote:
> This patch adds a thermal_trip directory under
> /sys/class/thermal/zoneX. This directory contains
> the trip point values for sensors bound to this
> zone.
>
> Signed-off-by: Durgadoss R 
> ---
>  drivers/thermal/thermal_sys.c |  237 ++++++++++++++++++++++++++++++++++++-
>  include/linux/thermal.h   |   37 +++
>  2 files changed, 272 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/thermal/thermal_sys.c b/drivers/thermal/thermal_sys.c
> index b39bf97..29ec073 100644
> --- a/drivers/thermal/thermal_sys.c
> +++ b/drivers/thermal/thermal_sys.c
> @@ -448,6 +448,22 @@ static void thermal_zone_device_check(struct work_struct 
> *work)
> thermal_zone_device_update(tz);
>  }
>
> +static int get_sensor_indx_by_kobj(struct thermal_zone *tz, const char *name)
> +{
> +   int i, indx = -EINVAL;
> +
> +   mutex_lock(&sensor_list_lock);
> +   for (i = 0; i < tz->sensor_indx; i++) {
> +   if (!strnicmp(name, kobject_name(tz->kobj_trip[i]),
> +   THERMAL_NAME_LENGTH)) {
> +   indx = i;
> +   break;
> +   }
> +   }
> +   mutex_unlock(&sensor_list_lock);
> +   return indx;
> +}
> +
>  static void remove_sensor_from_zone(struct thermal_zone *tz,
> struct thermal_sensor *ts)
>  {
> @@ -459,9 +475,15 @@ static void remove_sensor_from_zone(struct thermal_zone *tz,
>
> sysfs_remove_link(&tz->device.kobj, kobject_name(&ts->device.kobj));
>
> +   /* Delete this sensor's trip Kobject */
> +   kobject_del(tz->kobj_trip[indx]);
> +
> /* Shift the entries in the tz->sensors array */
> -   for (j = indx; j < MAX_SENSORS_PER_ZONE - 1; j++)
> +   for (j = indx; j < MAX_SENSORS_PER_ZONE - 1; j++) {
> tz->sensors[j] = tz->sensors[j + 1];
> +   tz->sensor_trip[j] = tz->sensor_trip[j + 1];
> +   tz->kobj_trip[j] = tz->kobj_trip[j + 1];
> +   }
>
> tz->sensor_indx--;
>  }
> @@ -875,6 +897,120 @@ policy_show(struct device *dev, struct device_attribute *devattr, char *buf)
> return sprintf(buf, "%s\n", tz->governor->name);
>  }
>
> +static ssize_t
> +active_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
> +{
> +   int i, indx, ret = 0;
> +   struct thermal_zone *tz;
> +   struct device *dev;
> +
> +   /* In this function, for
> +* /sys/class/thermal/zoneX/thermal_trip/sensorY:
> +* attr points to sysfs node 'active'
> +* kobj points to sensorY
> +* kobj->parent points to thermal_trip
> +* kobj->parent->parent points to zoneX
> +*/
> +
> +   /* Get the zone pointer */
> +   dev = container_of(kobj->parent->parent, struct device, kobj);
> +   tz = to_zone(dev);
> +   if (!tz)
> +   return -EINVAL;
> +
> +   /*
> +* We need this because in the sysfs tree, 'sensorY' is
> +* not really the sensor pointer. It just has the name
> +* 'sensorY'; whereas 'zoneX' is actually the zone pointer.
> +* This means container_of(kobj, struct device, kobj) will not
> +* provide the actual sensor pointer.
> +*/
> +   indx = get_sensor_indx_by_kobj(tz, kobject_name(kobj));
> +   if (indx < 0)
> +   return indx;
> +
> +   if (tz->sensor_trip[indx]->num_active_trips <= 0)
> +   return sprintf(buf, "\n");
> +
> +   ret += sprintf(buf, "0x%x", tz->sensor_trip[indx]->active_trip_mask);
> +   for (i = 0; i < tz->sensor_trip[indx]->num_active_trips; i++) {
> +   ret += sprintf(buf + ret, " %d",
> +   tz->sensor_trip[indx]->active_trips[i]);
> +   }
> +
> +   ret += sprintf(buf + ret, "\n");
> +   return ret;
> +}
> +
> +static ssize_t
> +ptrip_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
> +{
> +   int i, indx, ret = 0;
> +   struct thermal_zone *tz;
> +   struct device *dev;
> +
> +   /* Get the zone pointer */
> +   dev = container_of(kobj->parent->parent, struct device, kobj);
> +   tz = to_zone(dev);
> +   if (!tz)
> +   return -EINVAL;
> +
> +   indx = get_sensor_indx_by_kobj(tz, kobject_name(kobj));
> +   if (indx < 0)
> +   return indx;
> +
> +   if (tz->sensor_trip[indx]->num_passive_trips <= 0)
> +   return sprintf(buf, "\n");
> +
> +   for (i = 0; i < tz->sensor_trip[indx]->num_passive_trips; i++) {
> +   ret += sprintf(buf + ret, "%d ",
> +   tz->sensor_trip[indx]->passive_trips[i]);
> +   }
> +
> +   ret += sprintf(buf + ret, "\n");
> +   return ret;
> +}
> +
> +static ssize_t
> +hot_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
> +{
> +   int indx;
> +   struct thermal_zone *tz;
> +   struct device *dev;
> +

RE: [PATCH V2] output the cpu number when printking.

2012-12-26 Thread He, Bo
Thanks, Greg.
I did use this patch to fix many races on SMP, but out of respect for the
maintainer I stopped pushing the patch upstream.


-Original Message-
From: Greg KH [mailto:gre...@linuxfoundation.org] 
Sent: Thursday, December 27, 2012 1:50 AM
To: Yanmin Zhang
Cc: He, Bo; Randy Dunlap; a...@linux-foundation.org; mi...@elte.hu; 
linux-kernel@vger.kernel.org; a.p.zijls...@chello.nl
Subject: Re: [PATCH V2] output the cpu number when printking.

On Tue, Dec 25, 2012 at 09:09:05AM +0800, Yanmin Zhang wrote:
> On Mon, 2012-12-24 at 09:55 -0800, Greg KH wrote:
> > On Mon, Dec 24, 2012 at 01:01:55PM +0800, he, bo wrote:
> > > From: "he, bo" 
> > > 
> > > We often hit kernel panic issues on SMP machines because processes 
> > > race on multiple cpu. By adding a new parameter printk.cpu, kernel 
> > > prints cpu number at printk information line. It’s useful to debug 
> > > what cpus are racing.
> > 
> > How useful is this really for normal developers?
> It's very useful to debug race conditions under SMP environment.
> We applied the patch to our Android build image on our smartphones.

That's fine for your application, and seemed to be useful to others with their 
first interactions with SMP systems.  However, once you start to get to "real" 
numbers of CPUs, this information turns pretty pointless, which is why the 
patch was originally rejected.

sorry,

greg k-h


Re: [PATCH v2] fadvise: perform WILLNEED readahead asynchronously

2012-12-26 Thread Zheng Liu
On Tue, Dec 25, 2012 at 02:22:51AM +, Eric Wong wrote:
> Using fadvise with POSIX_FADV_WILLNEED can be very slow and cause
> user-visible latency.  This hurts interactivity and encourages
> userspace to resort to background threads for readahead (or avoid
> POSIX_FADV_WILLNEED entirely).
> 
> "strace -T" timing on an uncached, one gigabyte file:
> 
>  Before: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <2.484832>
>   After: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <0.000061>
> 
> For a smaller 9.8M request, there is still a significant improvement:
> 
>  Before: fadvise64(3, 0, 10223108, POSIX_FADV_WILLNEED) = 0 <0.005399>
>   After: fadvise64(3, 0, 10223108, POSIX_FADV_WILLNEED) = 0 <0.000059>
> 
> Even with a small 1M request, there is an improvement:
> 
>  Before: fadvise64(3, 0, 1048576, POSIX_FADV_WILLNEED) = 0 <0.000474>
>   After: fadvise64(3, 0, 1048576, POSIX_FADV_WILLNEED) = 0 <0.000063>

I did a simple test on my desktop, reading 128k of data.  Without this patch
the syscall takes 32us; with it, 7us.  You can add:

Tested-by: Zheng Liu 

> 
> While userspace can mimic the effect of this commit by using a
> background thread to perform readahead(), this allows for simpler
> userspace code.
> 
> To mitigate denial-of-service attacks, inflight (but incomplete)
> readahead requests are accounted for when new readahead requests arrive.
> New readahead requests may be reduced or ignored if there are too many
> inflight readahead pages in the workqueue.
> 
> IO priority is also taken into account for workqueue readahead.
> Normal and idle priority tasks share a concurrency-limited workqueue to
> prevent excessive readahead requests from taking place simultaneously.
> This normal workqueue is concurrency-limited to one task per-CPU
> (like AIO).
> 
> Real-time I/O tasks get their own high-priority workqueue independent
> of the normal workqueue.
> 
> The impact of idle tasks is also reduced and they are more likely to
> have advisory readahead requests ignored/dropped when read congestion
> occurs.
> 
> Cc: Alan Cox 
> Cc: Dave Chinner 
> Cc: Zheng Liu 
> Signed-off-by: Eric Wong 
> ---
>   I have not tested on NUMA (since I've no access to NUMA hardware)
>   and do not know how the use of the workqueue affects RA performance.
>   I'm only using WQ_UNBOUND on non-NUMA, though.
> 
>   I'm halfway tempted to make DONTNEED use a workqueue, too.
>   Having perceptible latency on advisory syscalls is unpleasant and
>   keeping the latency makes little sense if we can hide it.
> 
>  include/linux/mm.h |   3 +
>  mm/fadvise.c   |  10 +--
>  mm/readahead.c | 217 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 224 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6320407..90b361c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1536,6 +1536,9 @@ void task_dirty_inc(struct task_struct *tsk);
>  #define VM_MAX_READAHEAD 128 /* kbytes */
>  #define VM_MIN_READAHEAD 16  /* kbytes (includes current page) */
>  
> +void wq_page_cache_readahead(struct address_space *mapping, struct file *filp,
> +			pgoff_t offset, unsigned long nr_to_read);
> +
>  int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
>  			pgoff_t offset, unsigned long nr_to_read);
>  
> diff --git a/mm/fadvise.c b/mm/fadvise.c
> index a47f0f5..cf3bd4c 100644
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -102,12 +102,10 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>   if (!nrpages)
>   nrpages = ~0UL;
>  
> - /*
> -  * Ignore return value because fadvise() shall return
> -  * success even if filesystem can't retrieve a hint,
> -  */
> - force_page_cache_readahead(mapping, f.file, start_index,
> -nrpages);
> + get_file(f.file); /* fput() is called by workqueue */
> +
> + /* queue up the request, don't care if it fails */
> + wq_page_cache_readahead(mapping, f.file, start_index, nrpages);
>   break;
>   case POSIX_FADV_NOREUSE:
>   break;
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 7963f23..f9e0705 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -19,6 +19,45 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +
> +static struct workqueue_struct *ra_be __read_mostly;
> +static struct workqueue_struct *ra_rt __read_mostly;
> +static unsigned long ra_nr_queued;
> +static DEFINE_SPINLOCK(ra_nr_queued_lock);
> +
> +struct wq_ra_req {
> + struct work_struct work;
> + struct address_space *mapping;
> + struct file *file;
> + pgoff_t offset;
> + unsigned long nr_to_read;
> + int ioprio;
> +};
> +
> +static void wq_ra_enqueue(struct wq_ra_req *);
> +
> +/* keep 

[PATCH 2/2] vhost: handle polling failure

2012-12-26 Thread Jason Wang
Currently, polling errors are ignored in vhost. This may lead to issues (e.g. a
kernel crash when passing a tap fd to vhost before calling TUNSETIFF). Fix this
by:

- extending the idea of vhost_net_poll_state to all vhost_polls
- changing the state only when polling succeeds
- making vhost_poll_start() report errors to the caller, so they can be
  handled by the caller or reported to userspace (see the sketch below).
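
A rough sketch of that last point (simplified; the real signatures live in
drivers/vhost/vhost.c and the chosen error value is illustrative):

	int vhost_poll_start(struct vhost_poll *poll, struct file *file)
	{
		unsigned long mask;

		mask = file->f_op->poll(file, &poll->table);
		if (mask & POLLERR)
			return -EINVAL;	/* e.g. a tap fd before TUNSETIFF */
		if (mask)
			vhost_poll_wakeup(&poll->wait, 0, 0, (void *)mask);

		return 0;
	}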

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c   |   75 +
 drivers/vhost/vhost.c |   16 +-
 drivers/vhost/vhost.h |   11 ++-
 3 files changed, 50 insertions(+), 52 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 629d6b5..56e7f5a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -64,20 +64,10 @@ enum {
VHOST_NET_VQ_MAX = 2,
 };
 
-enum vhost_net_poll_state {
-   VHOST_NET_POLL_DISABLED = 0,
-   VHOST_NET_POLL_STARTED = 1,
-   VHOST_NET_POLL_STOPPED = 2,
-};
-
 struct vhost_net {
struct vhost_dev dev;
struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
struct vhost_poll poll[VHOST_NET_VQ_MAX];
-   /* Tells us whether we are polling a socket for TX.
-* We only do this when socket buffer fills up.
-* Protected by tx vq lock. */
-   enum vhost_net_poll_state tx_poll_state;
/* Number of TX recently submitted.
 * Protected by tx vq lock. */
unsigned tx_packets;
@@ -155,24 +145,6 @@ static void copy_iovec_hdr(const struct iovec *from, struct iovec *to,
}
 }
 
-/* Caller must have TX VQ lock */
-static void tx_poll_stop(struct vhost_net *net)
-{
-   if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
-   return;
-   vhost_poll_stop(net->poll + VHOST_NET_VQ_TX);
-   net->tx_poll_state = VHOST_NET_POLL_STOPPED;
-}
-
-/* Caller must have TX VQ lock */
-static void tx_poll_start(struct vhost_net *net, struct socket *sock)
-{
-   if (unlikely(net->tx_poll_state != VHOST_NET_POLL_STOPPED))
-   return;
-   vhost_poll_start(net->poll + VHOST_NET_VQ_TX, sock->file);
-   net->tx_poll_state = VHOST_NET_POLL_STARTED;
-}
-
 /* In case of DMA done not in order in lower device driver for some reason.
  * upend_idx is used to track end of used idx, done_idx is used to track head
  * of used idx. Once lower device DMA done contiguously, we will signal KVM
@@ -252,7 +224,7 @@ static void handle_tx(struct vhost_net *net)
 	wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 	if (wmem >= sock->sk->sk_sndbuf) {
 		mutex_lock(&vq->mutex);
-		tx_poll_start(net, sock);
+		vhost_poll_start(net->poll + VHOST_NET_VQ_TX, sock->file);
 		mutex_unlock(&vq->mutex);
 		return;
 	}
@@ -261,7 +233,7 @@ static void handle_tx(struct vhost_net *net)
 	vhost_disable_notify(&net->dev, vq);
 
 	if (wmem < sock->sk->sk_sndbuf / 2)
-   tx_poll_stop(net);
+   vhost_poll_stop(net->poll + VHOST_NET_VQ_TX);
hdr_size = vq->vhost_hlen;
zcopy = vq->ubufs;
 
@@ -283,7 +255,8 @@ static void handle_tx(struct vhost_net *net)
 
 		wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 		if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
-			tx_poll_start(net, sock);
+			vhost_poll_start(net->poll + VHOST_NET_VQ_TX,
+					 sock->file);
 			set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
 			break;
 		}
@@ -294,7 +267,8 @@ static void handle_tx(struct vhost_net *net)
 			(vq->upend_idx - vq->done_idx) :
 			(vq->upend_idx + UIO_MAXIOV - vq->done_idx);
 		if (unlikely(num_pends > VHOST_MAX_PEND)) {
-			tx_poll_start(net, sock);
+			vhost_poll_start(net->poll + VHOST_NET_VQ_TX,
+					 sock->file);
 			set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
 			break;
 		}
@@ -360,7 +334,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 		vhost_discard_vq_desc(vq, 1);
 		if (err == -EAGAIN || err == -ENOBUFS)
-			tx_poll_start(net, sock);
+			vhost_poll_start(net->poll + VHOST_NET_VQ_TX,
+					 sock->file);
 		break;
 	}
if (err != len)
@@ -623,7 +598,6 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 
vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
-   n->tx_poll_state = 

[PATCH 1/2] vhost_net: correct error handling in vhost_net_set_backend()

2012-12-26 Thread Jason Wang
Fix the leaking of oldubufs and the fd refcount when initialization of the used ring fails.

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c |   14 +++---
 1 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index ebd08b2..629d6b5 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -834,8 +834,10 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
vhost_net_enable_vq(n, vq);
 
r = vhost_init_used(vq);
-   if (r)
-   goto err_vq;
+   if (r) {
+   sock = NULL;
+   goto err_used;
+   }
 
n->tx_packets = 0;
n->tx_zcopy_err = 0;
@@ -859,8 +861,14 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 	mutex_unlock(&n->dev.mutex);
return 0;
 
+err_used:
+   if (oldubufs)
+   vhost_ubuf_put_and_wait(oldubufs);
+   if (oldsock)
+   fput(oldsock->file);
 err_ubufs:
-   fput(sock->file);
+   if (sock)
+   fput(sock->file);
 err_vq:
 	mutex_unlock(&vq->mutex);
 err:
-- 
1.7.1



Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

2012-12-26 Thread Michel Lespinasse
On Wed, Dec 26, 2012 at 11:51 AM, Rik van Riel  wrote:
> On 12/26/2012 02:10 PM, Eric Dumazet wrote:
>> We might try to use a hash on lock address, and an array of 16 different
>> delays so that different spinlocks have a chance of not sharing the same
>> delay.
>>
>> With following patch, I get 982 Mbits/s with same bench, so an increase
>> of 45 % instead of a 13 % regression.

Awesome :)

> I will probably keep it as a separate patch 4/4, with
> your report and performance numbers in it, to preserve
> the reason why we keep multiple hashed values, etc...
>
> There is enough stuff in this code that will be
> indistinguishable from magic if we do not document it
> properly...

If we go with per-spinlock tunings, I feel we'll most likely want to
add an associative cache in order to avoid the 1/16 chance (~6%) of
getting 595Mbit/s instead of 982Mbit/s when there is a hash collision.

I would still prefer if we could make up something that didn't require
per-spinlock tunings, but it's not clear if that'll work. At least we
now know of a simple enough workload to figure it out :)
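
For reference, a sketch of the hashed-delay idea (names are hypothetical;
the point is that two different locks usually land in different slots, so
one lock's tuning does not pollute another's):

	#include <linux/hash.h>
	#include <linux/percpu.h>

	#define DELAY_HASH_BITS 4	/* 16 slots, as in Eric's patch */

	static DEFINE_PER_CPU(u32, spinlock_delay[1 << DELAY_HASH_BITS]);

	static u32 lock_backoff_delay(void *lock)
	{
		/* Same lock -> same slot; unrelated locks rarely collide. */
		u32 slot = hash_ptr(lock, DELAY_HASH_BITS);

		return __this_cpu_read(spinlock_delay[slot]);
	}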

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


RE: [PATCH 0/4] iommu/fsl: Freescale PAMU driver and IOMMU API implementation.

2012-12-26 Thread Sethi Varun-B16395
Hi Joerg,
Do you have any comments on the patchset?

Regards
Varun

> -Original Message-
> From: Sethi Varun-B16395
> Sent: Friday, December 21, 2012 7:17 AM
> To: 'Joerg Roedel'
> Cc: Sethi Varun-B16395; joerg.roe...@amd.com; iommu@lists.linux-
> foundation.org; linuxppc-...@lists.ozlabs.org; linux-
> ker...@vger.kernel.org; Tabi Timur-B04825; Wood Scott-B07421
> Subject: RE: [PATCH 0/4] iommu/fsl: Freescale PAMU driver and IOMMU API
> implementation.
> 
> ping!!
> 
> > -Original Message-
> > From: Sethi Varun-B16395
> > Sent: Friday, December 14, 2012 7:22 PM
> > To: joerg.roe...@amd.com; io...@lists.linux-foundation.org; linuxppc-
> > d...@lists.ozlabs.org; linux-kernel@vger.kernel.org; Tabi Timur-B04825;
> > Wood Scott-B07421
> > Cc: Sethi Varun-B16395
> > Subject: [PATCH 0/4] iommu/fsl: Freescale PAMU driver and IOMMU API
> > implementation.
> >
> > This patchset provides the Freescale PAMU (Peripheral Access
> > Management
> > Unit) driver and the corresponding IOMMU API implementation. PAMU is
> > the IOMMU present on Freescale QorIQ platforms. PAMU can authorize
> > memory access, remap the memory address, and remap the I/O transaction
> type.
> >
> > This set consists of the following patches:
> > 1. Addition of new field in the device (powerpc) archdata structure
> > for storing iommu domain information
> >pointer. This pointer is stored when the device is attached to a
> > particular iommu domain.
> > 2. Add PAMU bypass enable register to the ccsr_guts structure.
> > 3. Addition of domain attributes required by the PAMU driver IOMMU API.
> > 4. PAMU driver and IOMMU API implementation.
> >
> > This patch set is based on the next branch of the iommu git tree
> > maintained by Joerg.
> >
> > Varun Sethi (4):
> >   store iommu domain info in device arch data.
> >   add pamu bypass enable register to guts.
> >   Add iommu attributes for PAMU
> >   FSL PAMU driver.
> >
> >  arch/powerpc/include/asm/device.h   |4 +
> >  arch/powerpc/include/asm/fsl_guts.h |4 +-
> >  drivers/iommu/Kconfig   |8 +
> >  drivers/iommu/Makefile  |1 +
> >  drivers/iommu/fsl_pamu.c| 1152 +++
> >  drivers/iommu/fsl_pamu.h|  398 
> >  drivers/iommu/fsl_pamu_domain.c | 1033 +++
> >  drivers/iommu/fsl_pamu_domain.h |   96 +++
> >  include/linux/iommu.h   |   49 ++
> >  9 files changed, 2744 insertions(+), 1 deletions(-)
> >  create mode 100644 drivers/iommu/fsl_pamu.c
> >  create mode 100644 drivers/iommu/fsl_pamu.h
> >  create mode 100644 drivers/iommu/fsl_pamu_domain.c
> >  create mode 100644 drivers/iommu/fsl_pamu_domain.h
> >
> > --
> > 1.7.4.1




Re: [PATCH] drivers/platform/x86/thinkpad_acpi.c: Handle HKEY event 0x6040

2012-12-26 Thread Borislav Petkov
On Wed, Dec 26, 2012 at 06:46:13PM +0100, Richard Hartmann wrote:
> Handle HKEY event generated on AC power change. The current message
> asks users to submit data related to this event which leads to
> a lot of confusion and noise on the mailing list.
> 
> The following is a list of causes, affected models, and 'message-id'
> from ibm-acpi-de...@lists.sourceforge.net :
> 
> AC plug/unplug:
> 
> X120e - caaaujb5v9dhdbdxdvvhnjog4urzc1tgkqeb_zgpay7q8kzh...@mail.gmail.com
> x121e - 20120817143459.gb3...@x1.osrc.amd.com
> X220  - Confirmed by Richard Hartmann
> X220i - 4f406274.7070...@gmail.com
> X220t - 4f489f5b.9040...@cs.tu-berlin.de
> X230  - CAKx4u7kqvVH0-gstomsiVYdGC0i6=bgxzaq8sq9gbg76tgm...@mail.gmail.com
> T420  - 9c848ee30b006737d0534d906bab0...@niklaas-baudet.net
> T420s - 20120608080824.gs25...@hexapodia.org
> W520  - 20121008181050.gf2...@ericlaptop.home.christensenplace.us
> 
> Lid closed/openend:
> 
> X220  - 4f4124df.5030...@gmail.com
> Could not be confirmed by author
> 
> Signed-off-by: Richard Hartmann 
> ---
>  drivers/platform/x86/thinkpad_acpi.c |   12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/platform/x86/thinkpad_acpi.c b/drivers/platform/x86/thinkpad_acpi.c
> index 75dd651..2645084 100644
> --- a/drivers/platform/x86/thinkpad_acpi.c
> +++ b/drivers/platform/x86/thinkpad_acpi.c
> @@ -209,9 +209,8 @@ enum tpacpi_hkey_event_t {
>   TP_HKEY_EV_ALARM_SENSOR_XHOT= 0x6022, /* sensor critically hot */
>   TP_HKEY_EV_THM_TABLE_CHANGED= 0x6030, /* thermal table changed */
>  
> - TP_HKEY_EV_UNK_6040 = 0x6040, /* Related to AC change?
> -  some sort of APM hint,
> -  W520 */
> + /* AC-related events */
> + TP_HKEY_EV_AC_CHANGED   = 0x6040, /* AC status changed */
>  
>   /* Misc */
>   TP_HKEY_EV_RFKILL_CHANGED   = 0x7000, /* rfkill switch changed */
> @@ -3629,6 +3628,13 @@ static bool hotkey_notify_6xxx(const u32 hkey,
>"a sensor reports something is extremely hot!\n");
>   /* recommended action: immediate sleep/hibernate */
>   break;
> + case TP_HKEY_EV_AC_CHANGED:
> + pr_info("AC status has changed\n");
> + /* X120e, x121e, X220, X220i, X220t, X230, T420, T420s, W520:
> +  * AC status changed; can be triggered by plugging or
> +  * unplugging AC adapter, docking or undocking, or closing
> +  * or opening the lid. */
> + break;

It looks like a pretty useless message to me, AFAICT. If it is only an
APM hint, then we probably shouldn't say anything in dmesg but simply
ignore it.

I mean, do I additionally want to know that I just connected to AC after
I just plugged the cable in? There's this green lamp on the side, doh!
:-)

Thanks.

-- 
Regards/Gruss,
Boris.


Re: [PATCH 1/6 v9] arm: use devicetree to get smp_twd clock

2012-12-26 Thread Prashant Gaikwad

On Friday 07 December 2012 04:12 AM, Mark Langsdorf wrote:

From: Rob Herring 

Signed-off-by: Rob Herring 
Signed-off-by: Mark Langsdorf 
---
Changes from v4, v5, v6, v7, v8
 None.
Changes from v3
 No longer setting *clk to NULL in twd_get_clock().
Changes from v2
 Turned the check for the node pointer into an if-then-else statement.
 Removed the second, redundant clk_get_rate.
Changes from v1
 None.

  arch/arm/kernel/smp_twd.c | 19 +++++++++++--------
  1 file changed, 11 insertions(+), 8 deletions(-)


Hi Mark,

What is the status of this patch?

Regards,
PrashantG


diff --git a/arch/arm/kernel/smp_twd.c b/arch/arm/kernel/smp_twd.c
index b22d700..af46b80 100644
--- a/arch/arm/kernel/smp_twd.c
+++ b/arch/arm/kernel/smp_twd.c
@@ -237,12 +237,15 @@ static irqreturn_t twd_handler(int irq, void *dev_id)
return IRQ_NONE;
  }





Re: [PATCH] fb: Rework locking to fix lock ordering on takeover

2012-12-26 Thread Borislav Petkov
On Wed, Dec 26, 2012 at 01:09:51PM -0500, Sasha Levin wrote:
> > Can this patch fix the following warning we saw?
> > http://lkml.org/lkml/2012/12/22/53
> >
> > I will give it a try.
> 
> Yup, that's the same error I reported a couple of months ago.
> 
> It looks like the fb maintainers are still absent, so it'll probably
> need a different way to get upstream.

Adding to the bug pressure: just got a very similar splat on -rc1 (see
below). Alan, I'll run your patch to verify.

Thanks.

[33946.663968] ==
[33946.663970] [ INFO: possible circular locking dependency detected ]
[33946.663978] 3.8.0-rc1+ #1 Not tainted
[33946.663980] ---
[33946.663986] kworker/1:2/15780 is trying to acquire lock:
[33946.664010]  ((fb_notifier_list).rwsem){.+}, at: [] 
__blocking_notifier_call_chain+0x33/0x60
[33946.664013] 
[33946.664013] but task is already holding lock:
[33946.664029]  (console_lock){+.+.+.}, at: [] 
console_callback+0x13/0x160
[33946.664032] 
[33946.664032] which lock already depends on the new lock.
[33946.664032] 
[33946.664034] 
[33946.664034] the existing dependency chain (in reverse order) is:
[33946.664042] 
[33946.664042] -> #1 (console_lock){+.+.+.}:
[33946.664054][] lock_acquire+0x8a/0x140
[33946.664063][] console_lock+0x5f/0x70
[33946.664072][] register_con_driver+0x39/0x150
[33946.664080][] take_over_console+0x2e/0x70
[33946.664088][] fbcon_takeover+0x5a/0xb0
[33946.664096][] fbcon_event_notify+0x5eb/0x6f0
[33946.664103][] notifier_call_chain+0x4c/0x70
[33946.664111][] 
__blocking_notifier_call_chain+0x4b/0x60
[33946.664119][] 
blocking_notifier_call_chain+0x16/0x20
[33946.664127][] fb_notifier_call_chain+0x1b/0x20
[33946.664136][] register_framebuffer+0x1bc/0x2f0
[33946.664169][] 
drm_fb_helper_single_fb_probe+0x1e3/0x310 [drm_kms_helper]
[33946.664183][] 
drm_fb_helper_initial_config+0x1d1/0x230 [drm_kms_helper]
[33946.664239][] radeon_fbdev_init+0xc1/0x120 [radeon]
[33946.664290][] radeon_modeset_init+0x3a8/0xb90 
[radeon]
[33946.664333][] radeon_driver_load_kms+0xf0/0x180 
[radeon]
[33946.664344][] drm_get_pci_dev+0x186/0x2d0
[33946.664379][] radeon_pci_probe+0xb3/0xf0 [radeon]
[33946.664390][] pci_device_probe+0x9c/0xe0
[33946.664400][] driver_probe_device+0x8b/0x3a0
[33946.664408][] __driver_attach+0xab/0xb0
[33946.664415][] bus_for_each_dev+0x55/0x90
[33946.664422][] driver_attach+0x1e/0x20
[33946.664429][] bus_add_driver+0x1b0/0x2a0
[33946.664437][] driver_register+0x77/0x160
[33946.664445][] __pci_register_driver+0x64/0x70
[33946.664452][] drm_pci_init+0x10c/0x120
[33946.664480][] inet6_ioctl+0x7/0xb0 [ipv6]
[33946.664491][] do_one_initcall+0x122/0x170
[33946.664500][] load_module+0x185f/0x2160
[33946.664507][] sys_init_module+0xae/0x110
[33946.664516][] system_call_fastpath+0x16/0x1b
[33946.664526] 
[33946.664526] -> #0 ((fb_notifier_list).rwsem){.+}:
[33946.664534][] __lock_acquire+0x1ae8/0x1b10
[33946.664542][] lock_acquire+0x8a/0x140
[33946.664549][] down_read+0x34/0x49
[33946.664557][] 
__blocking_notifier_call_chain+0x33/0x60
[33946.664564][] 
blocking_notifier_call_chain+0x16/0x20
[33946.664572][] fb_notifier_call_chain+0x1b/0x20
[33946.664579][] fb_blank+0x3b/0xc0
[33946.664586][] fbcon_blank+0x223/0x2d0
[33946.664595][] do_blank_screen+0x1cb/0x270
[33946.664603][] console_callback+0x6a/0x160
[33946.664612][] process_one_work+0x19d/0x5e0
[33946.664620][] worker_thread+0x15d/0x450
[33946.664628][] kthread+0xea/0xf0
[33946.664636][] ret_from_fork+0x7c/0xb0
[33946.664638] 
[33946.664638] other info that might help us debug this:
[33946.664638] 
[33946.664641]  Possible unsafe locking scenario:
[33946.664641] 
[33946.664643]        CPU0                    CPU1
[33946.664645]        ----                    ----
[33946.664650]   lock(console_lock);
[33946.664656]                                lock((fb_notifier_list).rwsem);
[33946.664661]                                lock(console_lock);
[33946.664666]   lock((fb_notifier_list).rwsem);
[33946.664667] 
[33946.664667]  *** DEADLOCK ***
[33946.664667] 
[33946.664671] 3 locks held by kworker/1:2/15780:
[33946.664686]  #0:  (events){.+.+.+}, at: [] 
process_one_work+0x130/0x5e0
[33946.664701]  #1:  (console_work){+.+.+.}, at: [] 
process_one_work+0x130/0x5e0
[33946.664715]  #2:  (console_lock){+.+.+.}, at: [] 
console_callback+0x13/0x160
[33946.664717] 
[33946.664717] stack backtrace:
[33946.664723] Pid: 15780, comm: kworker/1:2 Not tainted 3.8.0-rc1+ #1
[33946.664726] Call Trace:
[33946.664736]  [] print_circular_bug+0x1fe/0x20f
[33946.664745]  [] 

Re: [PATCH v3 01/11] kexec: introduce kexec firmware support

2012-12-26 Thread Eric W. Biederman
Daniel Kiper  writes:

> Some kexec/kdump implementations (e.g. Xen PVOPS) cannot use the default
> Linux infrastructure and require some support from firmware and/or hypervisor.
> To cope with that problem the kexec firmware infrastructure was introduced.
> It allows a developer to use all kexec/kdump features of a given firmware
> or hypervisor.

As this stands this patch is wrong.

You need to pass an additional flag from userspace through /sbin/kexec
that says load the kexec image in the firmware.  A global variable here
is not ok.
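
For illustration only (the flag value is made up, not a merged ABI):

	/* hypothetical addition to include/uapi/linux/kexec.h */
	#define KEXEC_LOAD_INTO_FIRMWARE	0x00000004

	/* /sbin/kexec would then request the firmware/hypervisor loader
	 * explicitly instead of the kernel guessing from a global:
	 *
	 *   syscall(__NR_kexec_load, entry, nr_segments, segments,
	 *           flags | KEXEC_LOAD_INTO_FIRMWARE);
	 */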

As I understand it you are loading a kexec-on-Xen-panic image, which is
semantically different from a kexec-on-Linux-panic image.  It is not ok
to have a silly global variable kexec_use_firmware.

Furthermore it is not ok to have conditional code outside of header
files.

Eric


Re: kvm lockdep splat with 3.8-rc1+

2012-12-26 Thread Borislav Petkov
Hi Hillf,

On Wed, Dec 26, 2012 at 08:18:13PM +0800, Hillf Danton wrote:
> Can you please test with 5a505085f0 and 4fc3f1d66b reverted?

sure can do, but am travelling ATM so I'll run it with the reverted
commits when I get back next week.

Thanks.

-- 
Regards/Gruss,
Boris.


Re: [PATCH] userns: Allow unprivileged reboot

2012-12-26 Thread Eric W. Biederman
Li Zefan  writes:

> In a container with its own pid namespace and user namespace, rebooting
> the system won't reboot the host, but terminate all the processes in
> it and thus have the container shutdown, so it's safe.
>
> Signed-off-by: Li Zefan 

Applied to my development tree.  It will eventually make it to my
for-next branch.

Eric


Re: [PATCH v3 00/11] xen: Initial kexec/kdump implementation

2012-12-26 Thread H. Peter Anvin

On 12/26/2012 06:18 PM, Daniel Kiper wrote:

Hi,

This set of patches contains the initial kexec/kdump implementation for Xen v3.
Currently only dom0 is supported; however, almost all of the infrastructure
required for domU support is ready.

Jan Beulich suggested merging the Xen x86 assembler code with the bare-metal
x86 code. This could simplify things and reduce kernel code size a bit.
However, this solution requires some changes in the bare-metal x86 code.
First of all, the code which establishes the transition page table should be
moved back from machine_kexec_$(BITS).c to relocate_kernel_$(BITS).S. Another
important thing which should be changed in that case is the format of the
page_list array. The Xen kexec hypercall requires alternating physical
addresses with virtual ones. These and other required changes have not been
done in this version because I am not sure the solution will be accepted by
the kexec/kdump maintainers. I hope that this email sparks discussion about
that topic.



I want a detailed list of the constraints that this assumes and 
therefore imposes on the native implementation as a result of this.  We 
have had way too many patches where Xen PV hacks effectively nailgun 
arbitrary, and sometimes poor, design decisions in place and now we 
can't fix them.


-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH v3 06/11] x86/xen: Add i386 kexec/kdump implementation

2012-12-26 Thread H. Peter Anvin

On 12/26/2012 06:18 PM, Daniel Kiper wrote:

Add i386 kexec/kdump implementation.

v2 - suggestions/fixes:
- allocate transition page table pages below 4 GiB
  (suggested by Jan Beulich).



Why?

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: Regression in kernel 3.8-rc1 bisected to commit adfa79d: I now get many "unable to enumerate USB device" messages

2012-12-26 Thread Alan Stern
On Wed, 26 Dec 2012, Larry Finger wrote:

> On 12/26/2012 10:45 AM, Alan Stern wrote:
> >
> > I see.  Do you happen to have CONFIG_USB_EHCI_HCD=y and
> > CONFIG_USB_EHCI_PCI=m in your .config?  If you do, try changing
> > EHCI_PCI to y.
> 
> One additional data point: When the EHCI and HCD parameters are set to y 
> rather 
> than m as in the list that follows, the enumerate messages do not occur.
> 
> CONFIG_USB_ARCH_HAS_HCD=y
> CONFIG_USB_ARCH_HAS_EHCI=y
> CONFIG_USB_XHCI_HCD=y
> CONFIG_USB_EHCI_HCD=y
> CONFIG_USB_OHCI_HCD=y
> CONFIG_USB_EHCI_PCI=y
> # CONFIG_USB_OHCI_HCD_PLATFORM is not set
> # CONFIG_USB_EHCI_HCD_PLATFORM is not set
> CONFIG_USB_UHCI_HCD=y

This looks like a matter of getting modules to load in the right order.  
Apparently your OHCI controller doesn't work right if the EHCI driver
isn't present.  Before the troublesome commit, this meant ehci-hcd had
to be loaded before ohci-hcd.  Now it means ehci-hcd and ehci-pci both
have to be loaded before ohci-hcd.

In the dmesg log you provided, ehci-hcd was loaded before ohci-hcd but 
ehci-pci was loaded after.  Of course, when everything is built into 
the kernel (not as modules) then questions of loading order don't 
arise.

You can test this hypothesis by booting a kernel without that commit
and blacklisting ehci-hcd, so that it doesn't get loaded automatically.  
See if the errors start to come, and see if they stop when you load
ehci-hcd manually.

Alan Stern



Re: [RFC PATCH] virtio-net: reset virtqueue affinity when doing cpu hotplug

2012-12-26 Thread Wanlong Gao
On 12/27/2012 11:28 AM, Jason Wang wrote:
> On 12/26/2012 06:19 PM, Wanlong Gao wrote:
>> On 12/26/2012 06:06 PM, Jason Wang wrote:
>>> On 12/26/2012 03:06 PM, Wanlong Gao wrote:
 Add a cpu notifier to virtio-net, so that we can reset the
 virtqueue affinity if the cpu hotplug happens. It improve
 the performance through enabling or disabling the virtqueue
 affinity after doing cpu hotplug.
>>> Hi Wanlong:
>>>
>>> Thanks for looking at this.
 Cc: Rusty Russell 
 Cc: "Michael S. Tsirkin" 
 Cc: Jason Wang 
 Cc: virtualizat...@lists.linux-foundation.org
 Cc: net...@vger.kernel.org
 Signed-off-by: Wanlong Gao 
 ---
  drivers/net/virtio_net.c | 39 ++-
  1 file changed, 38 insertions(+), 1 deletion(-)

 diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
 index a6fcf15..9710cf4 100644
 --- a/drivers/net/virtio_net.c
 +++ b/drivers/net/virtio_net.c
 @@ -26,6 +26,7 @@
  #include 
  #include 
  #include 
 +#include 
  
  static int napi_weight = 128;
  module_param(napi_weight, int, 0444);
 @@ -34,6 +35,8 @@ static bool csum = true, gso = true;
  module_param(csum, bool, 0444);
  module_param(gso, bool, 0444);
  
 +static bool cpu_hotplug = false;
 +
  /* FIXME: MTU in config. */
  #define MAX_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
  #define GOOD_COPY_LEN 128
 @@ -1041,6 +1044,26 @@ static void virtnet_set_affinity(struct 
 virtnet_info *vi, bool set)
vi->affinity_hint_set = false;
  }
  
 +static int virtnet_cpu_callback(struct notifier_block *nfb,
 + unsigned long action, void *hcpu)
 +{
 +  switch(action) {
 +  case CPU_ONLINE:
 +  case CPU_ONLINE_FROZEN:
 +  case CPU_DEAD:
 +  case CPU_DEAD_FROZEN:
 +  cpu_hotplug = true;
 +  break;
 +  default:
 +  break;
 +  }
 +  return NOTIFY_OK;
 +}
 +
 +static struct notifier_block virtnet_cpu_notifier = {
 +  .notifier_call = virtnet_cpu_callback,
 +};
 +
  static void virtnet_get_ringparam(struct net_device *dev,
struct ethtool_ringparam *ring)
  {
 @@ -1131,7 +1154,14 @@ static int virtnet_change_mtu(struct net_device 
 *dev, int new_mtu)
   */
  static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff 
 *skb)
  {
 -  int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
 +  int txq;
 +
 +  if (unlikely(cpu_hotplug == true)) {
 +  virtnet_set_affinity(netdev_priv(dev), true);
 +  cpu_hotplug = false;
 +  }
 +
>>> Why don't you just do this in callback?
>> The callback just gives us an "hcpu"; we can't get the virtnet_info from
>> the callback. Am I missing something?
> 
> Well, I think you can just embed the notifier block into virtnet_info,
> then use something like container_of in the callback to make the
> notifier per device. This also solve the concern of Eric.
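
A minimal sketch of that suggestion (field placement hypothetical):

	struct virtnet_info {
		/* ... existing fields ... */
		struct notifier_block nb;
	};

	static int virtnet_cpu_callback(struct notifier_block *nfb,
					unsigned long action, void *hcpu)
	{
		/* Recover the per-device state from the embedded notifier. */
		struct virtnet_info *vi = container_of(nfb, struct virtnet_info, nb);

		switch (action) {
		case CPU_ONLINE:
		case CPU_ONLINE_FROZEN:
		case CPU_DEAD:
		case CPU_DEAD_FROZEN:
			virtnet_set_affinity(vi, true);
			break;
		default:
			break;
		}
		return NOTIFY_OK;
	}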

Yeah, thank you very much for your suggestion. I'll try it.

>>> btw. Does qemu/kvm support cpu-hotplug now?
>> From http://www.linux-kvm.org/page/CPUHotPlug, I saw that qemu-kvm can
>> support hotplug, but that it failed to get merged into qemu.git, right?
> 
> Not sure, I just try latest qemu, it even does not have a cpu_set command.

Adding Igor to CC,

As far as I know, hotplug support was removed from qemu, and Igor wanted to
rework it but hasn't completed it yet? I'm not sure about that. Igor, could
you send out your tech-preview patches?

Thanks,
Wanlong Gao

> 
> Thanks
>>
>> Thanks,
>> Wanlong Gao
>>
> 
> 



[GIT] Networking

2012-12-26 Thread David Miller

1) GRE tunnel drivers don't set the transport header properly; they
   also blindly dereference the inner IPv4 protocol and need some
   checks.  Fixes from Isaku Yamahata.

2) Fix sleeps while atomic in netdevice rename code, from Eric
   Dumazet.

3) Fix double-spinlock in solos-pci driver, from Dan Carpenter.

4) More ARP bug fixes.  Fix lockdep splat in arp_solicit() and
   then the bug accidentally added by that fix.  From Eric Dumazet
   and Cong Wang.

5) Remove some __dev* annotations that slipped back in, as well
   as all HOTPLUG references.  From Greg KH

6) RDS protocol uses wrong interfaces to access scatter-gather
   elements, causing a regression.  From Mike Marciniszyn.

7) Fix build error in cpts driver, from Richard Cochran.

8) Fix arithmetic in packet scheduler, from Stefan Hasko.

9) Similarly, fix association during calculation of random
   backoff in batman-adv.  From Akinobu Mita.

Please pull, thanks a lot!

The following changes since commit c4271c6e37c32105492cbbed35f45330cb327b94:

  NFS: Kill fscache warnings when mounting without -ofsc (2012-12-21 08:32:09 
-0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git master

for you to fetch changes up to ae782bb16c35ce27512beeda9be6024c88f85b08:

  ipv6/ip6_gre: set transport header correctly (2012-12-26 15:19:56 -0800)


Akinobu Mita (1):
  batman-adv: fix random jitter calculation

Cong Wang (1):
  arp: fix a regression in arp_solicit()

Dan Carpenter (1):
  solos-pci: double lock in geos_gpio_store()

Eric Dumazet (5):
  ip_gre: fix possible use after free
  net: devnet_rename_seq should be a seqcount
  tuntap: dont use a private kmem_cache
  ipv4: arp: fix a lockdep splat in arp_solicit()
  tcp: should drop incoming frames without ACK flag set

Gao feng (1):
  bridge: call br_netpoll_disable in br_add_if

Greg KH (2):
  Drivers: network: more __dev* removal
  CONFIG_HOTPLUG removal from networking core

Isaku Yamahata (3):
  ip_gre: make ipgre_tunnel_xmit() not parse network header as IP 
unconditionally
  ipv4/ip_gre: set transport header correctly to gre header
  ipv6/ip6_gre: set transport header correctly

Li Zefan (1):
  netprio_cgroup: define sk_cgrp_prioidx only if NETPRIO_CGROUP is enabled

Marciniszyn, Mike (2):
  IB/rds: Correct ib_api use with gs_dma_address/sg_dma_len
  IB/rds: suppress incompatible protocol when version is known

Richard Cochran (2):
  cpts: fix build error by removing useless code.
  cpts: fix a run time warn_on.

Stefan Hasko (1):
  net: sched: integer overflow fix

Yan Burman (1):
  net/vxlan: Use the underlying device index when joining/leaving multicast 
groups

 drivers/atm/solos-pci.c |  2 +-
 drivers/net/ethernet/marvell/mvmdio.c   |  6 +++---
 drivers/net/ethernet/marvell/mvneta.c   | 19 +--
 drivers/net/ethernet/ti/cpts.c  |  3 +--
 drivers/net/ethernet/ti/cpts.h  |  1 -
 drivers/net/tun.c   | 24 +++-
 drivers/net/vxlan.c |  6 --
 drivers/net/wireless/rtlwifi/rtl8723ae/sw.c |  2 +-
 include/linux/netdevice.h   |  2 +-
 include/net/sock.h  |  2 +-
 net/batman-adv/bat_iv_ogm.c |  2 +-
 net/bridge/br_if.c  |  8 +---
 net/core/dev.c  | 18 +-
 net/core/net-sysfs.c|  4 
 net/core/sock.c |  4 ++--
 net/ipv4/arp.c  | 10 --
 net/ipv4/ip_gre.c   | 13 ++---
 net/ipv4/tcp_input.c| 14 ++
 net/ipv6/ip6_gre.c  |  3 +--
 net/rds/ib_cm.c | 11 +--
 net/rds/ib_recv.c   |  9 ++---
 net/sched/sch_htb.c |  2 +-
 net/wireless/reg.c  |  7 ---
 net/wireless/sysfs.c|  4 
 24 files changed, 78 insertions(+), 98 deletions(-)


[PATCH] userns: Allow unprivileged reboot

2012-12-26 Thread Li Zefan
In a container with its own pid namespace and user namespace, rebooting
the system won't reboot the host, but terminate all the processes in
it and thus have the container shutdown, so it's safe.
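
An illustrative userspace check of the new behavior (assumes the caller
holds CAP_SYS_BOOT in the container's user namespace):

	#include <stdio.h>
	#include <sys/reboot.h>

	int main(void)
	{
		/* Inside such a container this now terminates the container's
		 * processes instead of rebooting the host or failing -EPERM. */
		if (reboot(RB_AUTOBOOT) < 0)
			perror("reboot");
		return 0;
	}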

Signed-off-by: Li Zefan 
---
 kernel/sys.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 265b376..24d1ef5 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -433,11 +433,12 @@ static DEFINE_MUTEX(reboot_mutex);
 SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
void __user *, arg)
 {
+   struct pid_namespace *pid_ns = task_active_pid_ns(current);
char buffer[256];
int ret = 0;
 
/* We only trust the superuser with rebooting the system. */
-   if (!capable(CAP_SYS_BOOT))
+   if (!ns_capable(pid_ns->user_ns, CAP_SYS_BOOT))
return -EPERM;
 
/* For safety, we require "magic" arguments. */
@@ -453,7 +454,7 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
 * pid_namespace, the command is handled by reboot_pid_ns() which will
 * call do_exit().
 */
-   ret = reboot_pid_ns(task_active_pid_ns(current), cmd);
+   ret = reboot_pid_ns(pid_ns, cmd);
if (ret)
return ret;
 
-- 
1.8.0.2


Re: [PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE

2012-12-26 Thread H. Peter Anvin
Hmm... this code is being redone at the moment... this might conflict.

Daniel Kiper  wrote:

>Some implementations (e.g. Xen PVOPS) cannot use part of the identity page
>table to construct the transition page table. This means that they require
>separate PUDs, PMDs and PTEs for the virtual and physical (identity)
>mappings. To satisfy that requirement, add an extra pointer for each of
>PGD, PUD, PMD and PTE and align the existing code.
>
>Signed-off-by: Daniel Kiper 
>---
> arch/x86/include/asm/kexec.h   |   10 +++---
> arch/x86/kernel/machine_kexec_64.c |   12 ++--
> 2 files changed, 13 insertions(+), 9 deletions(-)
>
>diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
>index 6080d26..cedd204 100644
>--- a/arch/x86/include/asm/kexec.h
>+++ b/arch/x86/include/asm/kexec.h
>@@ -157,9 +157,13 @@ struct kimage_arch {
> };
> #else
> struct kimage_arch {
>-  pud_t *pud;
>-  pmd_t *pmd;
>-  pte_t *pte;
>+  pgd_t *pgd;
>+  pud_t *pud0;
>+  pud_t *pud1;
>+  pmd_t *pmd0;
>+  pmd_t *pmd1;
>+  pte_t *pte0;
>+  pte_t *pte1;
> };
> #endif
> 
>diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
>index b3ea9db..976e54b 100644
>--- a/arch/x86/kernel/machine_kexec_64.c
>+++ b/arch/x86/kernel/machine_kexec_64.c
>@@ -137,9 +137,9 @@ out:
> 
> static void free_transition_pgtable(struct kimage *image)
> {
>-  free_page((unsigned long)image->arch.pud);
>-  free_page((unsigned long)image->arch.pmd);
>-  free_page((unsigned long)image->arch.pte);
>+  free_page((unsigned long)image->arch.pud0);
>+  free_page((unsigned long)image->arch.pmd0);
>+  free_page((unsigned long)image->arch.pte0);
> }
> 
> static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
>@@ -157,7 +157,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
>   pud = (pud_t *)get_zeroed_page(GFP_KERNEL);
>   if (!pud)
>   goto err;
>-  image->arch.pud = pud;
>+  image->arch.pud0 = pud;
>   set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
>   }
>   pud = pud_offset(pgd, vaddr);
>@@ -165,7 +165,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
>   pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL);
>   if (!pmd)
>   goto err;
>-  image->arch.pmd = pmd;
>+  image->arch.pmd0 = pmd;
>   set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
>   }
>   pmd = pmd_offset(pud, vaddr);
>@@ -173,7 +173,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
>   pte = (pte_t *)get_zeroed_page(GFP_KERNEL);
>   if (!pte)
>   goto err;
>-  image->arch.pte = pte;
>+  image->arch.pte0 = pte;
>   set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE));
>   }
>   pte = pte_offset_kernel(pmd, vaddr);

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.


Re: [RFC PATCH] virtio-net: reset virtqueue affinity when doing cpu hotplug

2012-12-26 Thread Jason Wang
On 12/26/2012 06:46 PM, Michael S. Tsirkin wrote:
> On Wed, Dec 26, 2012 at 03:06:54PM +0800, Wanlong Gao wrote:
>> Add a cpu notifier to virtio-net, so that we can reset the
>> virtqueue affinity if the cpu hotplug happens. It improve
>> the performance through enabling or disabling the virtqueue
>> affinity after doing cpu hotplug.
>>
>> Cc: Rusty Russell 
>> Cc: "Michael S. Tsirkin" 
>> Cc: Jason Wang 
>> Cc: virtualizat...@lists.linux-foundation.org
>> Cc: net...@vger.kernel.org
>> Signed-off-by: Wanlong Gao 
> Thanks for looking into this.
> Some comments:
>
> 1. Looks like the logic in
> virtnet_set_affinity (and in virtnet_select_queue)
> will not work very well when CPU IDs are not
> consecutive. This can happen with hot unplug.
>
> Maybe we should add a VQ allocator, and define
> a per-cpu variable specifying the VQ instead
> of using CPU ID.

Yes, and generate the affinity hint based on the mapping. Btw, what does
a VQ allocator mean here?
>
>
> 2. The below code seems racy e.g. when CPU is added
>   during device init.
>
> 3. using a global cpu_hotplug seems inelegant.
> In any case we should document what is the
> meaning of this variable.
>
>> ---
>>  drivers/net/virtio_net.c | 39 ++-
>>  1 file changed, 38 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index a6fcf15..9710cf4 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -26,6 +26,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  static int napi_weight = 128;
>>  module_param(napi_weight, int, 0444);
>> @@ -34,6 +35,8 @@ static bool csum = true, gso = true;
>>  module_param(csum, bool, 0444);
>>  module_param(gso, bool, 0444);
>>  
>> +static bool cpu_hotplug = false;
>> +
>>  /* FIXME: MTU in config. */
>>  #define MAX_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
>>  #define GOOD_COPY_LEN   128
>> @@ -1041,6 +1044,26 @@ static void virtnet_set_affinity(struct virtnet_info 
>> *vi, bool set)
>>  vi->affinity_hint_set = false;
>>  }
>>  
>> +static int virtnet_cpu_callback(struct notifier_block *nfb,
>> +   unsigned long action, void *hcpu)
>> +{
>> +switch(action) {
>> +case CPU_ONLINE:
>> +case CPU_ONLINE_FROZEN:
>> +case CPU_DEAD:
>> +case CPU_DEAD_FROZEN:
>> +cpu_hotplug = true;
>> +break;
>> +default:
>> +break;
>> +}
>> +return NOTIFY_OK;
>> +}
>> +
>> +static struct notifier_block virtnet_cpu_notifier = {
>> +.notifier_call = virtnet_cpu_callback,
>> +};
>> +
>>  static void virtnet_get_ringparam(struct net_device *dev,
>>  struct ethtool_ringparam *ring)
>>  {
>> @@ -1131,7 +1154,14 @@ static int virtnet_change_mtu(struct net_device *dev, 
>> int new_mtu)
>>   */
>>  static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
>>  {
>> -int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
>> +int txq;
>> +
>> +if (unlikely(cpu_hotplug == true)) {
>> +virtnet_set_affinity(netdev_priv(dev), true);
>> +cpu_hotplug = false;
>> +}
>> +
>> +txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
>>smp_processor_id();
>>  
>>  while (unlikely(txq >= dev->real_num_tx_queues))
>> @@ -1248,6 +1278,8 @@ static void virtnet_del_vqs(struct virtnet_info *vi)
>>  {
>>  struct virtio_device *vdev = vi->vdev;
>>  
>> +unregister_hotcpu_notifier(&virtnet_cpu_notifier);
>> +
>>  virtnet_set_affinity(vi, false);
>>  
>>  vdev->config->del_vqs(vdev);
>> @@ -1372,6 +1404,11 @@ static int init_vqs(struct virtnet_info *vi)
>>  goto err_free;
>>  
>>  virtnet_set_affinity(vi, true);
>> +
>> +ret = register_hotcpu_notifier(&virtnet_cpu_notifier);
>> +if (ret)
>> +goto err_free;
>> +
>>  return 0;
>>  
>>  err_free:
>> -- 
>> 1.8.0



Re: [RFC PATCH] virtio-net: reset virtqueue affinity when doing cpu hotplug

2012-12-26 Thread Jason Wang
On 12/26/2012 06:19 PM, Wanlong Gao wrote:
> On 12/26/2012 06:06 PM, Jason Wang wrote:
>> On 12/26/2012 03:06 PM, Wanlong Gao wrote:
>>> Add a cpu notifier to virtio-net, so that we can reset the
>>> virtqueue affinity if cpu hotplug happens. It improves
>>> performance by enabling or disabling the virtqueue
>>> affinity after a cpu hotplug event.
>> Hi Wanlong:
>>
>> Thanks for looking at this.
>>> Cc: Rusty Russell 
>>> Cc: "Michael S. Tsirkin" 
>>> Cc: Jason Wang 
>>> Cc: virtualizat...@lists.linux-foundation.org
>>> Cc: net...@vger.kernel.org
>>> Signed-off-by: Wanlong Gao 
>>> ---
>>>  drivers/net/virtio_net.c | 39 ++-
>>>  1 file changed, 38 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index a6fcf15..9710cf4 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -26,6 +26,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  
>>>  static int napi_weight = 128;
>>>  module_param(napi_weight, int, 0444);
>>> @@ -34,6 +35,8 @@ static bool csum = true, gso = true;
>>>  module_param(csum, bool, 0444);
>>>  module_param(gso, bool, 0444);
>>>  
>>> +static bool cpu_hotplug = false;
>>> +
>>>  /* FIXME: MTU in config. */
>>>  #define MAX_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
>>>  #define GOOD_COPY_LEN  128
>>> @@ -1041,6 +1044,26 @@ static void virtnet_set_affinity(struct virtnet_info 
>>> *vi, bool set)
>>> vi->affinity_hint_set = false;
>>>  }
>>>  
>>> +static int virtnet_cpu_callback(struct notifier_block *nfb,
>>> +  unsigned long action, void *hcpu)
>>> +{
>>> +   switch(action) {
>>> +   case CPU_ONLINE:
>>> +   case CPU_ONLINE_FROZEN:
>>> +   case CPU_DEAD:
>>> +   case CPU_DEAD_FROZEN:
>>> +   cpu_hotplug = true;
>>> +   break;
>>> +   default:
>>> +   break;
>>> +   }
>>> +   return NOTIFY_OK;
>>> +}
>>> +
>>> +static struct notifier_block virtnet_cpu_notifier = {
>>> +   .notifier_call = virtnet_cpu_callback,
>>> +};
>>> +
>>>  static void virtnet_get_ringparam(struct net_device *dev,
>>> struct ethtool_ringparam *ring)
>>>  {
>>> @@ -1131,7 +1154,14 @@ static int virtnet_change_mtu(struct net_device 
>>> *dev, int new_mtu)
>>>   */
>>>  static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff 
>>> *skb)
>>>  {
>>> -   int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
>>> +   int txq;
>>> +
>>> +   if (unlikely(cpu_hotplug == true)) {
>>> +   virtnet_set_affinity(netdev_priv(dev), true);
>>> +   cpu_hotplug = false;
>>> +   }
>>> +
>> Why don't you just do this in callback?
> The callback can just give us an "hcpu"; we can't get the virtnet_info from it.
> Am I missing something?

Well, I think you can just embed the notifier block into virtnet_info,
then use something like container_of in the callback to make the
notifier per device. This also solves Eric's concern.
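
Something like this, perhaps (an untested sketch; it assumes a new "nb"
notifier_block member embedded in struct virtnet_info, which is not in
the patch as posted):

	static int virtnet_cpu_callback(struct notifier_block *nfb,
					unsigned long action, void *hcpu)
	{
		struct virtnet_info *vi = container_of(nfb, struct virtnet_info, nb);

		switch (action & ~CPU_TASKS_FROZEN) {
		case CPU_ONLINE:
		case CPU_DEAD:
			/* regenerate the affinity hints for this device only */
			virtnet_set_affinity(vi, true);
			break;
		default:
			break;
		}

		return NOTIFY_OK;
	}
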
>> btw. Does qemu/kvm support cpu-hotplug now?
> From http://www.linux-kvm.org/page/CPUHotPlug, I saw that qemu-kvm can
> support hotplug, but it was never merged into qemu.git, right?

Not sure; I just tried the latest qemu, and it does not even have a cpu_set command.

Thanks
>
> Thanks,
> Wanlong Gao
>



Re: [PATCH v5 02/14] memory-hotplug: check whether all memory blocks are offlined or not when removing memory

2012-12-26 Thread Tang Chen
On 12/26/2012 11:10 AM, Kamezawa Hiroyuki wrote:
> (2012/12/24 21:09), Tang Chen wrote:
>> From: Yasuaki Ishimatsu
>>
>> We remove the memory like this:
>> 1. lock memory hotplug
>> 2. offline a memory block
>> 3. unlock memory hotplug
>> 4. repeat 1-3 to offline all memory blocks
>> 5. lock memory hotplug
>> 6. remove memory(TODO)
>> 7. unlock memory hotplug
>>
>> All memory blocks must be offlined before removing memory. But we don't hold
>> the lock across the whole operation, so we should check whether all memory blocks
>> are offlined before step 6. Otherwise, the kernel may panic.
>>
>> Signed-off-by: Wen Congyang
>> Signed-off-by: Yasuaki Ishimatsu
> 
> Acked-by: KAMEZAWA Hiroyuki
> 
> a nitpick below.
> 
>> +
>> +for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
> 
> I prefer adding mem = NULL at the start of this for().

Hi Kamezawa-san,

Added, thanks. :)

> 
>> +section_nr = pfn_to_section_nr(pfn);
>> +if (!present_section_nr(section_nr))
>> +continue;
>> +
>> +section = __nr_to_section(section_nr);
>> +/* same memblock? */
>> +if (mem)
>> +if ((section_nr >= mem->start_section_nr) &&
>> +(section_nr <= mem->end_section_nr))
>> +continue;
>> +
> 
> Thanks,
> -Kame
> 
> 
> 



Re: [PATCH v5 03/14] memory-hotplug: remove redundant codes

2012-12-26 Thread Tang Chen
Hi Kamezawa-san,

Thanks for the reviewing. Please see below. :)

On 12/26/2012 11:20 AM, Kamezawa Hiroyuki wrote:
> (2012/12/24 21:09), Tang Chen wrote:
>> From: Wen Congyang
>>
>> offlining memory blocks and checking whether memory blocks are offlined
>> are very similar. This patch introduces a new function to remove
>> redundant codes.
>>
>> Signed-off-by: Wen Congyang
>> ---
>>mm/memory_hotplug.c |  101 
>> ---
>>1 files changed, 55 insertions(+), 46 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index d43d97b..dbb04d8 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1381,20 +1381,14 @@ int offline_pages(unsigned long start_pfn, unsigned 
>> long nr_pages)
>>  return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
>>}
>>
>> -int remove_memory(u64 start, u64 size)
> 
> please add an explanation of this function here. If (*func) returns a value
> other than 0, this function will fail and return the callback's return
> value... right?
> 

Yes, it will always return func()'s return value. I'll add the
comment here. :)

> 
>> +static int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
>> +void *arg, int (*func)(struct memory_block *, void *))
>>{
>>  struct memory_block *mem = NULL;
>>  struct mem_section *section;
>> -unsigned long start_pfn, end_pfn;
>>  unsigned long pfn, section_nr;
>>  int ret;
>> -int return_on_error = 0;
>> -int retry = 0;
>> -
>> -start_pfn = PFN_DOWN(start);
>> -end_pfn = start_pfn + PFN_DOWN(size);
>>
>> -repeat:
> 
> Shouldn't we check that the lock is held here?
> (VM_BUG_ON(!mutex_is_locked(&mem_hotplug_mutex));)

Well, I think after applying this patch, walk_memory_range() will be
a separate function, and it could be used somewhere else where we don't
hold this lock. But for now, we can do this check. :)

> 
> 
>>  for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>>  section_nr = pfn_to_section_nr(pfn);
>>  if (!present_section_nr(section_nr))
>> @@ -1411,22 +1405,61 @@ repeat:
>>  if (!mem)
>>  continue;
>>
>> -ret = offline_memory_block(mem);
>> +ret = func(mem, arg);
>>  if (ret) {
>> -if (return_on_error) {
>> -kobject_put(&mem->dev.kobj);
>> -return ret;
>> -} else {
>> -retry = 1;
>> -}
>> +kobject_put(&mem->dev.kobj);
>> +return ret;
>>  }
>>  }
>>
>>  if (mem)
>>  kobject_put(&mem->dev.kobj);
>>
>> -if (retry) {
>> -return_on_error = 1;
>> +return 0;
>> +}
>> +
>> +static int offline_memory_block_cb(struct memory_block *mem, void *arg)
>> +{
>> +int *ret = arg;
>> +int error = offline_memory_block(mem);
>> +
>> +if (error != 0 && *ret == 0)
>> +*ret = error;
>> +
>> +return 0;
> 
> Always returns 0 and runs through all mem blocks for scan-and-retry, right?
> You need an explanation here!

Yes, I'll add the comment. :)

> 
> 
>> +}
>> +
>> +static int is_memblock_offlined_cb(struct memory_block *mem, void *arg)
>> +{
>> +int ret = !is_memblock_offlined(mem);
>> +
>> +if (unlikely(ret))
>> +pr_warn("removing memory fails, because memory "
>> +"[%#010llx-%#010llx] is onlined\n",
>> +PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
>> +PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1);
>> +
>> +return ret;
>> +}
>> +
>> +int remove_memory(u64 start, u64 size)
>> +{
>> +unsigned long start_pfn, end_pfn;
>> +int ret = 0;
>> +int retry = 1;
>> +
>> +start_pfn = PFN_DOWN(start);
>> +end_pfn = start_pfn + PFN_DOWN(size);
>> +
>> +repeat:
> 
> please explan why you repeat here .

This repeat was added in patch 1. It aims to solve the problem we were
talking about in patch 1. I'll add the comment here. :)

> 
>> +walk_memory_range(start_pfn, end_pfn, &ret,
>> +  offline_memory_block_cb);
>> +if (ret) {
>> +if (!retry)
>> +return ret;
>> +
>> +retry = 0;
>> +ret = 0;
>>  goto repeat;
>>  }
>>
>> @@ -1444,37 +1477,13 @@ repeat:
>>   * memory blocks are offlined.
>>   */
>>
>> -for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>> -section_nr = pfn_to_section_nr(pfn);
>> -if (!present_section_nr(section_nr))
>> -continue;
>> -
>> -section = __nr_to_section(section_nr);
>> -/* same memblock? */
>> -if (mem)
>> -if ((section_nr >= mem->start_section_nr) &&
>> -(section_nr <= mem->end_section_nr))
>> - 

Re: [PATCH v5 04/14] memory-hotplug: remove /sys/firmware/memmap/X sysfs

2012-12-26 Thread Tang Chen
On 12/26/2012 11:30 AM, Kamezawa Hiroyuki wrote:
>> @@ -41,6 +42,7 @@ struct firmware_map_entry {
>>  const char  *type;  /* type of the memory range */
>>  struct list_headlist;   /* entry for the linked list */
>>  struct kobject  kobj;   /* kobject for each entry */
>> +unsigned int bootmem:1; /* allocated from bootmem */
>>};
> 
> Can't we detect where the object was allocated from, slab or bootmem?
> 
> Hm, for example,
> 
>  PageReserved(virt_to_page(address_of_obj)) ?
>  PageSlab(virt_to_page(address_of_obj)) ?
> 

Hi Kamezawa-san,

I think we can detect it without a new member; the bootmem:1 member
is just there for convenience. I think I can remove it. :)

Thanks. :)
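
For reference, a minimal sketch of the suggested detection (the helper
name is hypothetical, not in the patch):

	static bool entry_is_bootmem(struct firmware_map_entry *entry)
	{
		/* bootmem pages stay marked PG_reserved; slab pages do not */
		return PageReserved(virt_to_page(entry));
	}
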


[PATCH 2/2] tracing: Use sched_clock_cpu for trace_clock_global

2012-12-26 Thread Namhyung Kim
From: Namhyung Kim 

For systems with an unstable sched_clock, all cpu_clock() does is
disable/enable local irqs around the call to sched_clock().  For stable
systems the two are the same.

Since trace_clock_global() already disables local irqs, calling
sched_clock_cpu() directly is appropriate.
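
For reference, on CONFIG_HAVE_UNSTABLE_SCHED_CLOCK kernels cpu_clock()
is roughly the following (a simplified sketch, not the exact scheduler
source):

	u64 cpu_clock(int cpu)
	{
		unsigned long flags;
		u64 clock;

		local_irq_save(flags);
		clock = sched_clock_cpu(cpu);
		local_irq_restore(flags);

		return clock;
	}

so with irqs already disabled in trace_clock_global(), calling
sched_clock_cpu() directly skips a redundant save/restore.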

Cc: Steven Rostedt 
Cc: Frederic Weisbecker 
Cc: Ingo Molnar 
Signed-off-by: Namhyung Kim 
---
 kernel/trace/trace_clock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_clock.c b/kernel/trace/trace_clock.c
index 394783531cbb..795f077978a8 100644
--- a/kernel/trace/trace_clock.c
+++ b/kernel/trace/trace_clock.c
@@ -86,7 +86,7 @@ u64 notrace trace_clock_global(void)
local_irq_save(flags);
 
this_cpu = raw_smp_processor_id();
-   now = cpu_clock(this_cpu);
+   now = sched_clock_cpu(this_cpu);
/*
 * If in an NMI context then dont risk lockups and return the
 * cpu_clock() time:
-- 
1.7.11.7



[PATCH 1/2] watchdog: Use local_clock for get_timestamp()

2012-12-26 Thread Namhyung Kim
From: Namhyung Kim 

The get_timestamp() function is always called for the current cpu, so
using local_clock() is more appropriate, and it makes the code
shorter and cleaner IMHO.

Cc: Don Zickus 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Signed-off-by: Namhyung Kim 
---
 kernel/watchdog.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 75a2ab3d0b02..082ca6878a3f 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -112,9 +112,9 @@ static int get_softlockup_thresh(void)
  * resolution, and we don't need to waste time with a big divide when
  * 2^30ns == 1.074s.
  */
-static unsigned long get_timestamp(int this_cpu)
+static unsigned long get_timestamp(void)
 {
-   return cpu_clock(this_cpu) >> 30LL;  /* 2^30 ~= 10^9 */
+   return local_clock() >> 30LL;  /* 2^30 ~= 10^9 */
 }
 
 static void set_sample_period(void)
@@ -132,9 +132,7 @@ static void set_sample_period(void)
 /* Commands for resetting the watchdog */
 static void __touch_watchdog(void)
 {
-   int this_cpu = smp_processor_id();
-
-   __this_cpu_write(watchdog_touch_ts, get_timestamp(this_cpu));
+   __this_cpu_write(watchdog_touch_ts, get_timestamp());
 }
 
 void touch_softlockup_watchdog(void)
@@ -195,7 +193,7 @@ static int is_hardlockup(void)
 
 static int is_softlockup(unsigned long touch_ts)
 {
-   unsigned long now = get_timestamp(smp_processor_id());
+   unsigned long now = get_timestamp();
 
/* Warn about unreasonable delays: */
if (time_after(now, touch_ts + get_softlockup_thresh()))
-- 
1.7.11.7



[PATCH v3 01/11] kexec: introduce kexec firmware support

2012-12-26 Thread Daniel Kiper
Some kexec/kdump implementations (e.g. Xen PVOPS) cannot use the default
Linux infrastructure and require some support from firmware and/or hypervisor.
To cope with that problem the kexec firmware infrastructure was introduced.
It allows a developer to use all kexec/kdump features of a given firmware
or hypervisor.
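
A sketch of how such a split is typically wired up at the syscall
boundary (illustrative only -- the exact dispatch may differ from this
patch; kexec_use_firmware is the flag introduced below):

	SYSCALL_DEFINE4(kexec_load, unsigned long, entry,
			unsigned long, nr_segments,
			struct kexec_segment __user *, segments,
			unsigned long, flags)
	{
		if (kexec_use_firmware)
			return firmware_sys_kexec_load(entry, nr_segments,
						       segments, flags);

		/* ... default Linux implementation ... */
	}
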

v3 - suggestions/fixes:
   - replace kexec_ops struct by kexec firmware infrastructure
 (suggested by Eric Biederman).

v2 - suggestions/fixes:
   - add comment for kexec_ops.crash_alloc_temp_store member
 (suggested by Konrad Rzeszutek Wilk),
   - simplify kexec_ops usage
 (suggested by Konrad Rzeszutek Wilk).

Signed-off-by: Daniel Kiper 
---
 include/linux/kexec.h   |   26 ++-
 kernel/Makefile |1 +
 kernel/kexec-firmware.c |  743 +++
 kernel/kexec.c  |   46 +++-
 4 files changed, 809 insertions(+), 7 deletions(-)
 create mode 100644 kernel/kexec-firmware.c

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index d0b8458..9568457 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -116,17 +116,34 @@ struct kimage {
 #endif
 };
 
-
-
 /* kexec interface functions */
 extern void machine_kexec(struct kimage *image);
 extern int machine_kexec_prepare(struct kimage *image);
 extern void machine_kexec_cleanup(struct kimage *image);
+extern struct page *mf_kexec_kimage_alloc_pages(gfp_t gfp_mask,
+   unsigned int order,
+   unsigned long limit);
+extern void mf_kexec_kimage_free_pages(struct page *page);
+extern unsigned long mf_kexec_page_to_pfn(struct page *page);
+extern struct page *mf_kexec_pfn_to_page(unsigned long mfn);
+extern unsigned long mf_kexec_virt_to_phys(volatile void *address);
+extern void *mf_kexec_phys_to_virt(unsigned long address);
+extern int mf_kexec_prepare(struct kimage *image);
+extern int mf_kexec_load(struct kimage *image);
+extern void mf_kexec_cleanup(struct kimage *image);
+extern void mf_kexec_unload(struct kimage *image);
+extern void mf_kexec_shutdown(void);
+extern void mf_kexec(struct kimage *image);
 extern asmlinkage long sys_kexec_load(unsigned long entry,
unsigned long nr_segments,
struct kexec_segment __user *segments,
unsigned long flags);
+extern long firmware_sys_kexec_load(unsigned long entry,
+   unsigned long nr_segments,
+   struct kexec_segment __user *segments,
+   unsigned long flags);
 extern int kernel_kexec(void);
+extern int firmware_kernel_kexec(void);
 #ifdef CONFIG_COMPAT
 extern asmlinkage long compat_sys_kexec_load(unsigned long entry,
unsigned long nr_segments,
@@ -135,7 +152,10 @@ extern asmlinkage long compat_sys_kexec_load(unsigned long 
entry,
 #endif
 extern struct page *kimage_alloc_control_pages(struct kimage *image,
unsigned int order);
+extern struct page *firmware_kimage_alloc_control_pages(struct kimage *image,
+   unsigned int order);
 extern void crash_kexec(struct pt_regs *);
+extern void firmware_crash_kexec(struct pt_regs *);
 int kexec_should_crash(struct task_struct *);
 void crash_save_cpu(struct pt_regs *regs, int cpu);
 void crash_save_vmcoreinfo(void);
@@ -168,6 +188,8 @@ unsigned long paddr_vmcoreinfo_note(void);
 #define VMCOREINFO_CONFIG(name) \
vmcoreinfo_append_str("CONFIG_%s=y\n", #name)
 
+extern bool kexec_use_firmware;
+
 extern struct kimage *kexec_image;
 extern struct kimage *kexec_crash_image;
 
diff --git a/kernel/Makefile b/kernel/Makefile
index 6c072b6..bc96b2f 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -58,6 +58,7 @@ obj-$(CONFIG_MODULE_SIG) += module_signing.o modsign_pubkey.o 
modsign_certificat
 obj-$(CONFIG_KALLSYMS) += kallsyms.o
 obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
 obj-$(CONFIG_KEXEC) += kexec.o
+obj-$(CONFIG_KEXEC_FIRMWARE) += kexec-firmware.o
 obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
diff --git a/kernel/kexec-firmware.c b/kernel/kexec-firmware.c
new file mode 100644
index 000..f6ddd4c
--- /dev/null
+++ b/kernel/kexec-firmware.c
@@ -0,0 +1,743 @@
+/*
+ * Copyright (C) 2002-2004 Eric Biederman  
+ * Copyright (C) 2012 Daniel Kiper, Oracle Corporation
+ *
+ * Most of the code here is a copy of kernel/kexec.c.
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+/*
+ * KIMAGE_NO_DEST is an impossible destination address..., for
+ * allocating pages whose 

[PATCH v3 10/11] drivers/xen: Export vmcoreinfo through sysfs

2012-12-26 Thread Daniel Kiper
Export vmcoreinfo through sysfs.

Signed-off-by: Daniel Kiper 
---
 drivers/xen/sys-hypervisor.c |   42 +-
 1 files changed, 41 insertions(+), 1 deletions(-)

diff --git a/drivers/xen/sys-hypervisor.c b/drivers/xen/sys-hypervisor.c
index 96453f8..9dd290c 100644
--- a/drivers/xen/sys-hypervisor.c
+++ b/drivers/xen/sys-hypervisor.c
@@ -368,6 +368,41 @@ static void xen_properties_destroy(void)
sysfs_remove_group(hypervisor_kobj, &xen_properties_group);
 }
 
+#ifdef CONFIG_KEXEC_FIRMWARE
+static ssize_t vmcoreinfo_show(struct hyp_sysfs_attr *attr, char *buffer)
+{
+   return sprintf(buffer, "%lx %lx\n", xen_vmcoreinfo_maddr,
+   xen_vmcoreinfo_max_size);
+}
+
+HYPERVISOR_ATTR_RO(vmcoreinfo);
+
+static int __init xen_vmcoreinfo_init(void)
+{
+   if (!xen_vmcoreinfo_max_size)
+   return 0;
+
+   return sysfs_create_file(hypervisor_kobj, &vmcoreinfo_attr.attr);
+}
+
+static void xen_vmcoreinfo_destroy(void)
+{
+   if (!xen_vmcoreinfo_max_size)
+   return;
+
+   sysfs_remove_file(hypervisor_kobj, &vmcoreinfo_attr.attr);
+}
+#else
+static int __init xen_vmcoreinfo_init(void)
+{
+   return 0;
+}
+
+static void xen_vmcoreinfo_destroy(void)
+{
+}
+#endif
+
 static int __init hyper_sysfs_init(void)
 {
int ret;
@@ -390,9 +425,14 @@ static int __init hyper_sysfs_init(void)
ret = xen_properties_init();
if (ret)
goto prop_out;
+   ret = xen_vmcoreinfo_init();
+   if (ret)
+   goto vmcoreinfo_out;
 
goto out;
 
+vmcoreinfo_out:
+   xen_properties_destroy();
 prop_out:
xen_sysfs_uuid_destroy();
 uuid_out:
@@ -407,12 +447,12 @@ out:
 
 static void __exit hyper_sysfs_exit(void)
 {
+   xen_vmcoreinfo_destroy();
xen_properties_destroy();
xen_compilation_destroy();
xen_sysfs_uuid_destroy();
xen_sysfs_version_destroy();
xen_sysfs_type_destroy();
-
 }
 module_init(hyper_sysfs_init);
 module_exit(hyper_sysfs_exit);
-- 
1.5.6.5



[PATCH v3 11/11] x86: Add Xen kexec control code size check to linker script

2012-12-26 Thread Daniel Kiper
Add Xen kexec control code size check to linker script.

Signed-off-by: Daniel Kiper 
---
 arch/x86/kernel/vmlinux.lds.S |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 22a1530..f18786a 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -360,5 +360,10 @@ INIT_PER_CPU(irq_stack_union);
 
 . = ASSERT(kexec_control_code_size <= KEXEC_CONTROL_CODE_MAX_SIZE,
"kexec control code size is too big");
-#endif
 
+#ifdef CONFIG_XEN
+. = ASSERT(xen_kexec_control_code_size - xen_relocate_kernel <=
+   KEXEC_CONTROL_CODE_MAX_SIZE,
+   "Xen kexec control code size is too big");
+#endif
+#endif
-- 
1.5.6.5



[PATCH v3 09/11] x86/xen/enlighten: Add init and crash kexec/kdump hooks

2012-12-26 Thread Daniel Kiper
Add init and crash kexec/kdump hooks.

Signed-off-by: Daniel Kiper 
---
 arch/x86/xen/enlighten.c |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 138e566..5025bba 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1276,6 +1277,12 @@ static void xen_machine_power_off(void)
 
 static void xen_crash_shutdown(struct pt_regs *regs)
 {
+#ifdef CONFIG_KEXEC_FIRMWARE
+   if (kexec_crash_image) {
+   crash_save_cpu(regs, safe_smp_processor_id());
+   return;
+   }
+#endif
xen_reboot(SHUTDOWN_crash);
 }
 
@@ -1353,6 +1360,10 @@ asmlinkage void __init xen_start_kernel(void)
 
xen_init_mmu_ops();
 
+#ifdef CONFIG_KEXEC_FIRMWARE
+   kexec_use_firmware = true;
+#endif
+
/* Prevent unwanted bits from being set in PTEs. */
__supported_pte_mask &= ~_PAGE_GLOBAL;
 #if 0
-- 
1.5.6.5



[PATCH v3 05/11] x86/xen: Register resources required by kexec-tools

2012-12-26 Thread Daniel Kiper
Register resources required by kexec-tools.

v2 - suggestions/fixes:
   - change logging level
 (suggested by Konrad Rzeszutek Wilk).

Signed-off-by: Daniel Kiper 
---
 arch/x86/xen/kexec.c |  150 ++
 1 files changed, 150 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/xen/kexec.c

diff --git a/arch/x86/xen/kexec.c b/arch/x86/xen/kexec.c
new file mode 100644
index 000..7ec4c45
--- /dev/null
+++ b/arch/x86/xen/kexec.c
@@ -0,0 +1,150 @@
+/*
+ * Copyright (c) 2011 Daniel Kiper
+ * Copyright (c) 2012 Daniel Kiper, Oracle Corporation
+ *
+ * kexec/kdump implementation for Xen was written by Daniel Kiper.
+ * Initial work on it was sponsored by Google under Google Summer
+ * of Code 2011 program and Citrix. Konrad Rzeszutek Wilk from Oracle
+ * was the mentor for this project.
+ *
+ * Some ideas are taken from:
+ *   - native kexec/kdump implementation,
+ *   - kexec/kdump implementation for Xen Linux Kernel Ver. 2.6.18,
+ *   - PV-GRUB.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program.  If not, see .
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#include 
+
+unsigned long xen_vmcoreinfo_maddr = 0;
+unsigned long xen_vmcoreinfo_max_size = 0;
+
+static int __init xen_init_kexec_resources(void)
+{
+   int rc;
+   static struct resource xen_hypervisor_res = {
+   .name = "Hypervisor code and data",
+   .flags = IORESOURCE_BUSY | IORESOURCE_MEM
+   };
+   struct resource *cpu_res;
+   struct xen_kexec_range xkr;
+   struct xen_platform_op cpuinfo_op;
+   uint32_t cpus, i;
+
+   if (!xen_initial_domain())
+   return 0;
+
+   if (strstr(boot_command_line, "crashkernel="))
+   pr_warn("kexec: Ignoring crashkernel option. "
+   "It should be passed to Xen hypervisor.\n");
+
+   /* Register Crash kernel resource. */
+   xkr.range = KEXEC_RANGE_MA_CRASH;
+   rc = HYPERVISOR_kexec_op(KEXEC_CMD_kexec_get_range, &xkr);
+
+   if (rc) {
+   pr_warn("kexec: %s: HYPERVISOR_kexec_op(KEXEC_RANGE_MA_CRASH)"
+   ": %i\n", __func__, rc);
+   return rc;
+   }
+
+   if (!xkr.size)
+   return 0;
+
+   crashk_res.start = xkr.start;
+   crashk_res.end = xkr.start + xkr.size - 1;
+   insert_resource(&iomem_resource, &crashk_res);
+
+   /* Register Hypervisor code and data resource. */
+   xkr.range = KEXEC_RANGE_MA_XEN;
+   rc = HYPERVISOR_kexec_op(KEXEC_CMD_kexec_get_range, &xkr);
+
+   if (rc) {
+   pr_warn("kexec: %s: HYPERVISOR_kexec_op(KEXEC_RANGE_MA_XEN)"
+   ": %i\n", __func__, rc);
+   return rc;
+   }
+
+   xen_hypervisor_res.start = xkr.start;
+   xen_hypervisor_res.end = xkr.start + xkr.size - 1;
+   insert_resource(&iomem_resource, &xen_hypervisor_res);
+
+   /* Determine maximum number of physical CPUs. */
+   cpuinfo_op.cmd = XENPF_get_cpuinfo;
+   cpuinfo_op.u.pcpu_info.xen_cpuid = 0;
+   rc = HYPERVISOR_dom0_op(&cpuinfo_op);
+
+   if (rc) {
+   pr_warn("kexec: %s: HYPERVISOR_dom0_op(): %i\n", __func__, rc);
+   return rc;
+   }
+
+   cpus = cpuinfo_op.u.pcpu_info.max_present + 1;
+
+   /* Register CPUs Crash note resources. */
+   cpu_res = kcalloc(cpus, sizeof(struct resource), GFP_KERNEL);
+
+   if (!cpu_res) {
+   pr_warn("kexec: %s: kcalloc(): %i\n", __func__, -ENOMEM);
+   return -ENOMEM;
+   }
+
+   for (i = 0; i < cpus; ++i) {
+   xkr.range = KEXEC_RANGE_MA_CPU;
+   xkr.nr = i;
+   rc = HYPERVISOR_kexec_op(KEXEC_CMD_kexec_get_range, &xkr);
+
+   if (rc) {
+   pr_warn("kexec: %s: cpu: %u: HYPERVISOR_kexec_op"
+   "(KEXEC_RANGE_MA_XEN): %i\n", __func__, i, rc);
+   continue;
+   }
+
+   cpu_res->name = "Crash note";
+   cpu_res->start = xkr.start;
+   cpu_res->end = xkr.start + xkr.size - 1;
+   cpu_res->flags = IORESOURCE_BUSY | IORESOURCE_MEM;
+   insert_resource(&iomem_resource, cpu_res++);
+   }
+
+   /* Get vmcoreinfo address and maximum allowed size. */
+   xkr.range = 

[PATCH v3 08/11] x86/xen: Add kexec/kdump Kconfig and makefile rules

2012-12-26 Thread Daniel Kiper
Add kexec/kdump Kconfig and makefile rules.

Signed-off-by: Daniel Kiper 
---
 arch/x86/Kconfig  |3 +++
 arch/x86/xen/Kconfig  |1 +
 arch/x86/xen/Makefile |3 +++
 3 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 79795af..e2746c4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1600,6 +1600,9 @@ config KEXEC_JUMP
  Jump between original kernel and kexeced kernel and invoke
  code in physical address mode via KEXEC
 
+config KEXEC_FIRMWARE
+   def_bool n
+
 config PHYSICAL_START
hex "Physical address where the kernel is loaded" if (EXPERT || 
CRASH_DUMP)
default "0x100"
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index 131dacd..8469c1c 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -7,6 +7,7 @@ config XEN
select PARAVIRT
select PARAVIRT_CLOCK
select XEN_HAVE_PVMMU
+   select KEXEC_FIRMWARE if KEXEC
depends on X86_64 || (X86_32 && X86_PAE && !X86_VISWS)
depends on X86_TSC
help
diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
index 96ab2c0..99952d7 100644
--- a/arch/x86/xen/Makefile
+++ b/arch/x86/xen/Makefile
@@ -22,3 +22,6 @@ obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= spinlock.o
 obj-$(CONFIG_XEN_DEBUG_FS) += debugfs.o
 obj-$(CONFIG_XEN_DOM0) += apic.o vga.o
 obj-$(CONFIG_SWIOTLB_XEN)  += pci-swiotlb-xen.o
+obj-$(CONFIG_KEXEC_FIRMWARE)   += kexec.o
+obj-$(CONFIG_KEXEC_FIRMWARE)   += machine_kexec_$(BITS).o
+obj-$(CONFIG_KEXEC_FIRMWARE)   += relocate_kernel_$(BITS).o
-- 
1.5.6.5



[PATCH v3 07/11] x86/xen: Add x86_64 kexec/kdump implementation

2012-12-26 Thread Daniel Kiper
Add x86_64 kexec/kdump implementation.

Signed-off-by: Daniel Kiper 
---
 arch/x86/xen/machine_kexec_64.c   |  318 +
 arch/x86/xen/relocate_kernel_64.S |  309 +++
 2 files changed, 627 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/xen/machine_kexec_64.c
 create mode 100644 arch/x86/xen/relocate_kernel_64.S

diff --git a/arch/x86/xen/machine_kexec_64.c b/arch/x86/xen/machine_kexec_64.c
new file mode 100644
index 000..2600342
--- /dev/null
+++ b/arch/x86/xen/machine_kexec_64.c
@@ -0,0 +1,318 @@
+/*
+ * Copyright (c) 2011 Daniel Kiper
+ * Copyright (c) 2012 Daniel Kiper, Oracle Corporation
+ *
+ * kexec/kdump implementation for Xen was written by Daniel Kiper.
+ * Initial work on it was sponsored by Google under Google Summer
+ * of Code 2011 program and Citrix. Konrad Rzeszutek Wilk from Oracle
+ * was the mentor for this project.
+ *
+ * Some ideas are taken from:
+ *   - native kexec/kdump implementation,
+ *   - kexec/kdump implementation for Xen Linux Kernel Ver. 2.6.18,
+ *   - PV-GRUB.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program.  If not, see .
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#define __ma(vaddr)(virt_to_machine(vaddr).maddr)
+
+static void init_level2_page(pmd_t *pmd, unsigned long addr)
+{
+   unsigned long end_addr = addr + PUD_SIZE;
+
+   while (addr < end_addr) {
+   native_set_pmd(pmd++, native_make_pmd(addr | 
__PAGE_KERNEL_LARGE_EXEC));
+   addr += PMD_SIZE;
+   }
+}
+
+static int init_level3_page(struct kimage *image, pud_t *pud,
+   unsigned long addr, unsigned long last_addr)
+{
+   pmd_t *pmd;
+   struct page *page;
+   unsigned long end_addr = addr + PGDIR_SIZE;
+
+   while ((addr < last_addr) && (addr < end_addr)) {
+   page = firmware_kimage_alloc_control_pages(image, 0);
+
+   if (!page)
+   return -ENOMEM;
+
+   pmd = page_address(page);
+   init_level2_page(pmd, addr);
+   native_set_pud(pud++, native_make_pud(__ma(pmd) | 
_KERNPG_TABLE));
+   addr += PUD_SIZE;
+   }
+
+   /* Clear the unused entries. */
+   while (addr < end_addr) {
+   native_pud_clear(pud++);
+   addr += PUD_SIZE;
+   }
+
+   return 0;
+}
+
+
+static int init_level4_page(struct kimage *image, pgd_t *pgd,
+   unsigned long addr, unsigned long last_addr)
+{
+   int rc;
+   pud_t *pud;
+   struct page *page;
+   unsigned long end_addr = addr + PTRS_PER_PGD * PGDIR_SIZE;
+
+   while ((addr < last_addr) && (addr < end_addr)) {
+   page = firmware_kimage_alloc_control_pages(image, 0);
+
+   if (!page)
+   return -ENOMEM;
+
+   pud = page_address(page);
+   rc = init_level3_page(image, pud, addr, last_addr);
+
+   if (rc)
+   return rc;
+
+   native_set_pgd(pgd++, native_make_pgd(__ma(pud) | 
_KERNPG_TABLE));
+   addr += PGDIR_SIZE;
+   }
+
+   /* Clear the unused entries. */
+   while (addr < end_addr) {
+   native_pgd_clear(pgd++);
+   addr += PGDIR_SIZE;
+   }
+
+   return 0;
+}
+
+static void free_transition_pgtable(struct kimage *image)
+{
+   free_page((unsigned long)image->arch.pgd);
+   free_page((unsigned long)image->arch.pud0);
+   free_page((unsigned long)image->arch.pud1);
+   free_page((unsigned long)image->arch.pmd0);
+   free_page((unsigned long)image->arch.pmd1);
+   free_page((unsigned long)image->arch.pte0);
+   free_page((unsigned long)image->arch.pte1);
+}
+
+static int alloc_transition_pgtable(struct kimage *image)
+{
+   image->arch.pgd = (pgd_t *)get_zeroed_page(GFP_KERNEL);
+
+   if (!image->arch.pgd)
+   goto err;
+
+   image->arch.pud0 = (pud_t *)get_zeroed_page(GFP_KERNEL);
+
+   if (!image->arch.pud0)
+   goto err;
+
+   image->arch.pud1 = (pud_t *)get_zeroed_page(GFP_KERNEL);
+
+   if (!image->arch.pud1)
+   goto err;
+
+   image->arch.pmd0 = (pmd_t *)get_zeroed_page(GFP_KERNEL);
+
+   if (!image->arch.pmd0)

[PATCH v3 06/11] x86/xen: Add i386 kexec/kdump implementation

2012-12-26 Thread Daniel Kiper
Add i386 kexec/kdump implementation.

v2 - suggestions/fixes:
   - allocate transition page table pages below 4 GiB
 (suggested by Jan Beulich).

Signed-off-by: Daniel Kiper 
---
 arch/x86/xen/machine_kexec_32.c   |  226 ++
 arch/x86/xen/relocate_kernel_32.S |  323 +
 2 files changed, 549 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/xen/machine_kexec_32.c
 create mode 100644 arch/x86/xen/relocate_kernel_32.S

diff --git a/arch/x86/xen/machine_kexec_32.c b/arch/x86/xen/machine_kexec_32.c
new file mode 100644
index 000..011a5e8
--- /dev/null
+++ b/arch/x86/xen/machine_kexec_32.c
@@ -0,0 +1,226 @@
+/*
+ * Copyright (c) 2011 Daniel Kiper
+ * Copyright (c) 2012 Daniel Kiper, Oracle Corporation
+ *
+ * kexec/kdump implementation for Xen was written by Daniel Kiper.
+ * Initial work on it was sponsored by Google under Google Summer
+ * of Code 2011 program and Citrix. Konrad Rzeszutek Wilk from Oracle
+ * was the mentor for this project.
+ *
+ * Some ideas are taken from:
+ *   - native kexec/kdump implementation,
+ *   - kexec/kdump implementation for Xen Linux Kernel Ver. 2.6.18,
+ *   - PV-GRUB.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program.  If not, see .
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#define __ma(vaddr)(virt_to_machine(vaddr).maddr)
+
+static void *alloc_pgtable_page(struct kimage *image)
+{
+   struct page *page;
+
+   page = firmware_kimage_alloc_control_pages(image, 0);
+
+   if (!page || !page_address(page))
+   return NULL;
+
+   memset(page_address(page), 0, PAGE_SIZE);
+
+   return page_address(page);
+}
+
+static int alloc_transition_pgtable(struct kimage *image)
+{
+   image->arch.pgd = alloc_pgtable_page(image);
+
+   if (!image->arch.pgd)
+   return -ENOMEM;
+
+   image->arch.pmd0 = alloc_pgtable_page(image);
+
+   if (!image->arch.pmd0)
+   return -ENOMEM;
+
+   image->arch.pmd1 = alloc_pgtable_page(image);
+
+   if (!image->arch.pmd1)
+   return -ENOMEM;
+
+   image->arch.pte0 = alloc_pgtable_page(image);
+
+   if (!image->arch.pte0)
+   return -ENOMEM;
+
+   image->arch.pte1 = alloc_pgtable_page(image);
+
+   if (!image->arch.pte1)
+   return -ENOMEM;
+
+   return 0;
+}
+
+struct page *mf_kexec_kimage_alloc_pages(gfp_t gfp_mask,
+   unsigned int order,
+   unsigned long limit)
+{
+   struct page *pages;
+   unsigned int address_bits, i;
+
+   pages = alloc_pages(gfp_mask, order);
+
+   if (!pages)
+   return NULL;
+
+   address_bits = (limit == ULONG_MAX) ? BITS_PER_LONG : ilog2(limit);
+
+   /* Relocate set of pages below given limit. */
+   if (xen_create_contiguous_region((unsigned long)page_address(pages),
+   order, address_bits)) {
+   __free_pages(pages, order);
+   return NULL;
+   }
+
+   BUG_ON(PagePrivate(pages));
+
+   pages->mapping = NULL;
+   set_page_private(pages, order);
+
+   for (i = 0; i < (1 << order); ++i)
+   SetPageReserved(pages + i);
+
+   return pages;
+}
+
+void mf_kexec_kimage_free_pages(struct page *page)
+{
+   unsigned int i, order;
+
+   order = page_private(page);
+
+   for (i = 0; i < (1 << order); ++i)
+   ClearPageReserved(page + i);
+
+   xen_destroy_contiguous_region((unsigned long)page_address(page), order);
+   __free_pages(page, order);
+}
+
+unsigned long mf_kexec_page_to_pfn(struct page *page)
+{
+   return pfn_to_mfn(page_to_pfn(page));
+}
+
+struct page *mf_kexec_pfn_to_page(unsigned long mfn)
+{
+   return pfn_to_page(mfn_to_pfn(mfn));
+}
+
+unsigned long mf_kexec_virt_to_phys(volatile void *address)
+{
+   return virt_to_machine(address).maddr;
+}
+
+void *mf_kexec_phys_to_virt(unsigned long address)
+{
+   return phys_to_virt(machine_to_phys(XMADDR(address)).paddr);
+}
+
+int mf_kexec_prepare(struct kimage *image)
+{
+#ifdef CONFIG_KEXEC_JUMP
+   if (image->preserve_context) {
+   pr_info_once("kexec: Context preservation is not "
+ 

[PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE

2012-12-26 Thread Daniel Kiper
Some implementations (e.g. Xen PVOPS) cannot use part of the identity page
table to construct the transition page table. This means they require separate
PUDs, PMDs and PTEs for the virtual and physical (identity) mappings. To
satisfy that requirement, add extra pointers to the PGD, PUD, PMD and PTE and
adjust the existing code.
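
An illustrative sketch of what the extra entries are for (vaddr/paddr
and the set_pte() usage here are hypothetical, not part of this patch):
the control page must be mapped both at its kernel virtual address and
at its physical (identity) address, and under Xen PVOPS those two
mappings need their own table pages:

	/* map the control page at its virtual address */
	set_pte(image->arch.pte0 + pte_index(vaddr),
		pfn_pte(paddr >> PAGE_SHIFT, PAGE_KERNEL_EXEC));

	/* map it again at its physical address (identity mapping) */
	set_pte(image->arch.pte1 + pte_index(paddr),
		pfn_pte(paddr >> PAGE_SHIFT, PAGE_KERNEL_EXEC));
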

Signed-off-by: Daniel Kiper 
---
 arch/x86/include/asm/kexec.h   |   10 +++---
 arch/x86/kernel/machine_kexec_64.c |   12 ++--
 2 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 6080d26..cedd204 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -157,9 +157,13 @@ struct kimage_arch {
 };
 #else
 struct kimage_arch {
-   pud_t *pud;
-   pmd_t *pmd;
-   pte_t *pte;
+   pgd_t *pgd;
+   pud_t *pud0;
+   pud_t *pud1;
+   pmd_t *pmd0;
+   pmd_t *pmd1;
+   pte_t *pte0;
+   pte_t *pte1;
 };
 #endif
 
diff --git a/arch/x86/kernel/machine_kexec_64.c 
b/arch/x86/kernel/machine_kexec_64.c
index b3ea9db..976e54b 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -137,9 +137,9 @@ out:
 
 static void free_transition_pgtable(struct kimage *image)
 {
-   free_page((unsigned long)image->arch.pud);
-   free_page((unsigned long)image->arch.pmd);
-   free_page((unsigned long)image->arch.pte);
+   free_page((unsigned long)image->arch.pud0);
+   free_page((unsigned long)image->arch.pmd0);
+   free_page((unsigned long)image->arch.pte0);
 }
 
 static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
@@ -157,7 +157,7 @@ static int init_transition_pgtable(struct kimage *image, 
pgd_t *pgd)
pud = (pud_t *)get_zeroed_page(GFP_KERNEL);
if (!pud)
goto err;
-   image->arch.pud = pud;
+   image->arch.pud0 = pud;
set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
}
pud = pud_offset(pgd, vaddr);
@@ -165,7 +165,7 @@ static int init_transition_pgtable(struct kimage *image, 
pgd_t *pgd)
pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL);
if (!pmd)
goto err;
-   image->arch.pmd = pmd;
+   image->arch.pmd0 = pmd;
set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
}
pmd = pmd_offset(pud, vaddr);
@@ -173,7 +173,7 @@ static int init_transition_pgtable(struct kimage *image, 
pgd_t *pgd)
pte = (pte_t *)get_zeroed_page(GFP_KERNEL);
if (!pte)
goto err;
-   image->arch.pte = pte;
+   image->arch.pte0 = pte;
set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE));
}
pte = pte_offset_kernel(pmd, vaddr);
-- 
1.5.6.5



[PATCH v3 03/11] xen: Introduce architecture independent data for kexec/kdump

2012-12-26 Thread Daniel Kiper
Introduce the architecture independent constants and structures
required by the Xen kexec/kdump implementation.

Signed-off-by: Daniel Kiper 
---
 include/xen/interface/xen.h |   33 +
 1 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/include/xen/interface/xen.h b/include/xen/interface/xen.h
index 886a5d8..09c16ab 100644
--- a/include/xen/interface/xen.h
+++ b/include/xen/interface/xen.h
@@ -57,6 +57,7 @@
 #define __HYPERVISOR_event_channel_op 32
 #define __HYPERVISOR_physdev_op   33
 #define __HYPERVISOR_hvm_op   34
+#define __HYPERVISOR_kexec_op 37
 #define __HYPERVISOR_tmem_op  38
 
 /* Architecture-specific hypercall definitions. */
@@ -231,7 +232,39 @@ DEFINE_GUEST_HANDLE_STRUCT(mmuext_op);
 #define VMASST_TYPE_pae_extended_cr3 3
 #define MAX_VMASST_TYPE 3
 
+/*
+ * Commands to HYPERVISOR_kexec_op().
+ */
+#define KEXEC_CMD_kexec0
+#define KEXEC_CMD_kexec_load   1
+#define KEXEC_CMD_kexec_unload 2
+#define KEXEC_CMD_kexec_get_range  3
+
+/*
+ * Memory ranges for kdump (utilized by HYPERVISOR_kexec_op()).
+ */
+#define KEXEC_RANGE_MA_CRASH   0
+#define KEXEC_RANGE_MA_XEN 1
+#define KEXEC_RANGE_MA_CPU 2
+#define KEXEC_RANGE_MA_XENHEAP 3
+#define KEXEC_RANGE_MA_BOOT_PARAM  4
+#define KEXEC_RANGE_MA_EFI_MEMMAP  5
+#define KEXEC_RANGE_MA_VMCOREINFO  6
+
 #ifndef __ASSEMBLY__
+struct xen_kexec_exec {
+   int type;
+};
+
+struct xen_kexec_range {
+   int range;
+   int nr;
+   unsigned long size;
+   unsigned long start;
+};
+
+extern unsigned long xen_vmcoreinfo_maddr;
+extern unsigned long xen_vmcoreinfo_max_size;
 
 typedef uint16_t domid_t;
 
-- 
1.5.6.5



[PATCH v3 04/11] x86/xen: Introduce architecture dependent data for kexec/kdump

2012-12-26 Thread Daniel Kiper
Introduce the architecture dependent constants, structures and
functions required by the Xen kexec/kdump implementation.

Signed-off-by: Daniel Kiper 
---
 arch/x86/include/asm/xen/hypercall.h |6 +++
 arch/x86/include/asm/xen/kexec.h |   79 ++
 2 files changed, 85 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/xen/kexec.h

diff --git a/arch/x86/include/asm/xen/hypercall.h 
b/arch/x86/include/asm/xen/hypercall.h
index c20d1ce..e76a1b8 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -459,6 +459,12 @@ HYPERVISOR_hvm_op(int op, void *arg)
 }
 
 static inline int
+HYPERVISOR_kexec_op(unsigned long op, void *args)
+{
+   return _hypercall2(int, kexec_op, op, args);
+}
+
+static inline int
 HYPERVISOR_tmem_op(
struct tmem_op *op)
 {
diff --git a/arch/x86/include/asm/xen/kexec.h b/arch/x86/include/asm/xen/kexec.h
new file mode 100644
index 000..d09b52f
--- /dev/null
+++ b/arch/x86/include/asm/xen/kexec.h
@@ -0,0 +1,79 @@
+/*
+ * Copyright (c) 2011 Daniel Kiper
+ * Copyright (c) 2012 Daniel Kiper, Oracle Corporation
+ *
+ * kexec/kdump implementation for Xen was written by Daniel Kiper.
+ * Initial work on it was sponsored by Google under Google Summer
+ * of Code 2011 program and Citrix. Konrad Rzeszutek Wilk from Oracle
+ * was the mentor for this project.
+ *
+ * Some ideas are taken from:
+ *   - native kexec/kdump implementation,
+ *   - kexec/kdump implementation for Xen Linux Kernel Ver. 2.6.18,
+ *   - PV-GRUB.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program.  If not, see .
+ */
+
+#ifndef _ASM_X86_XEN_KEXEC_H
+#define _ASM_X86_XEN_KEXEC_H
+
+#define KEXEC_XEN_NO_PAGES 17
+
+#define XK_MA_CONTROL_PAGE 0
+#define XK_VA_CONTROL_PAGE 1
+#define XK_MA_PGD_PAGE 2
+#define XK_VA_PGD_PAGE 3
+#define XK_MA_PUD0_PAGE4
+#define XK_VA_PUD0_PAGE5
+#define XK_MA_PUD1_PAGE6
+#define XK_VA_PUD1_PAGE7
+#define XK_MA_PMD0_PAGE8
+#define XK_VA_PMD0_PAGE9
+#define XK_MA_PMD1_PAGE10
+#define XK_VA_PMD1_PAGE11
+#define XK_MA_PTE0_PAGE12
+#define XK_VA_PTE0_PAGE13
+#define XK_MA_PTE1_PAGE14
+#define XK_VA_PTE1_PAGE15
+#define XK_MA_TABLE_PAGE   16
+
+#ifndef __ASSEMBLY__
+struct xen_kexec_image {
+   unsigned long page_list[KEXEC_XEN_NO_PAGES];
+   unsigned long indirection_page;
+   unsigned long start_address;
+};
+
+struct xen_kexec_load {
+   int type;
+   struct xen_kexec_image image;
+};
+
+extern unsigned int xen_kexec_control_code_size;
+
+#ifdef CONFIG_X86_32
+extern void xen_relocate_kernel(unsigned long indirection_page,
+   unsigned long *page_list,
+   unsigned long start_address,
+   unsigned int has_pae,
+   unsigned int preserve_context);
+#else
+extern void xen_relocate_kernel(unsigned long indirection_page,
+   unsigned long *page_list,
+   unsigned long start_address,
+   unsigned int preserve_context);
+#endif
+#endif
+#endif /* _ASM_X86_XEN_KEXEC_H */
-- 
1.5.6.5



[PATCH v3 00/11] xen: Initial kexec/kdump implementation

2012-12-26 Thread Daniel Kiper

Hi,

This set of patches contains the initial kexec/kdump implementation for Xen v3.
Currently only dom0 is supported; however, almost all of the infrastructure
required for domU support is ready.

Jan Beulich suggested merging the Xen x86 assembler code with the baremetal x86
code. This could simplify things and slightly reduce the size of the kernel
code. However, this solution requires some changes in the baremetal x86 code.
First of all, the code which establishes the transition page table should be
moved back from machine_kexec_$(BITS).c to relocate_kernel_$(BITS).S. Another
important thing which would need to change in that case is the format of the
page_list array: the Xen kexec hypercall requires physical addresses to
alternate with virtual ones. These and the other required changes have not been
made in this version because I am not sure the solution would be accepted by
the kexec/kdump maintainers. I hope that this email sparks discussion on that
topic.
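
For reference, the alternating layout mentioned above is what the paired
XK_MA_*/XK_VA_* indices in patch 04 encode; filling the list looks
roughly like this (an illustrative sketch using the __ma() helper from
patches 06/07, with xki and control_page as hypothetical locals):

	struct xen_kexec_image xki;

	xki.page_list[XK_MA_CONTROL_PAGE] = __ma(control_page);
	xki.page_list[XK_VA_CONTROL_PAGE] = (unsigned long)control_page;
	xki.page_list[XK_MA_PGD_PAGE] = __ma(image->arch.pgd);
	xki.page_list[XK_VA_PGD_PAGE] = (unsigned long)image->arch.pgd;
	/* ... and so on for the PUD/PMD/PTE pages ... */
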

Daniel

 arch/x86/Kconfig |3 +
 arch/x86/include/asm/kexec.h |   10 +-
 arch/x86/include/asm/xen/hypercall.h |6 +
 arch/x86/include/asm/xen/kexec.h |   79 
 arch/x86/kernel/machine_kexec_64.c   |   12 +-
 arch/x86/kernel/vmlinux.lds.S|7 +-
 arch/x86/xen/Kconfig |1 +
 arch/x86/xen/Makefile|3 +
 arch/x86/xen/enlighten.c |   11 +
 arch/x86/xen/kexec.c |  150 +++
 arch/x86/xen/machine_kexec_32.c  |  226 +++
 arch/x86/xen/machine_kexec_64.c  |  318 +++
 arch/x86/xen/relocate_kernel_32.S|  323 +++
 arch/x86/xen/relocate_kernel_64.S|  309 ++
 drivers/xen/sys-hypervisor.c |   42 ++-
 include/linux/kexec.h|   26 ++-
 include/xen/interface/xen.h  |   33 ++
 kernel/Makefile  |1 +
 kernel/kexec-firmware.c  |  743 ++
 kernel/kexec.c   |   46 ++-
 20 files changed, 2331 insertions(+), 18 deletions(-)

Daniel Kiper (11):
  kexec: introduce kexec firmware support
  x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and 
PTE
  xen: Introduce architecture independent data for kexec/kdump
  x86/xen: Introduce architecture dependent data for kexec/kdump
  x86/xen: Register resources required by kexec-tools
  x86/xen: Add i386 kexec/kdump implementation
  x86/xen: Add x86_64 kexec/kdump implementation
  x86/xen: Add kexec/kdump Kconfig and makefile rules
  x86/xen/enlighten: Add init and crash kexec/kdump hooks
  drivers/xen: Export vmcoreinfo through sysfs
  x86: Add Xen kexec control code size check to linker script


[PATCH 03/32] gadget: remove only user of aio retry

2012-12-26 Thread Kent Overstreet
From: Zach Brown 

This removes the only in-tree user of aio retry.  This will let us
remove the retry code from the aio core.

Removing retry is relatively easy as the USB gadget wasn't using it to
retry IOs at all.  It always fully submitted the IO in the context of
the initial io_submit() call.  It only used the AIO retry facility to
get the submitter's mm context for copying the result of a read back to
user space.  This is easy to implement with use_mm() and a work struct,
much like kvm does with async_pf_execute() for get_user_pages().

Signed-off-by: Zach Brown 
Signed-off-by: Kent Overstreet 
---
 drivers/usb/gadget/inode.c | 38 +-
 1 file changed, 29 insertions(+), 9 deletions(-)

diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
index 76494ca..2a3f001 100644
--- a/drivers/usb/gadget/inode.c
+++ b/drivers/usb/gadget/inode.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -514,6 +515,9 @@ static long ep_ioctl(struct file *fd, unsigned code, 
unsigned long value)
 struct kiocb_priv {
struct usb_request  *req;
struct ep_data  *epdata;
+   struct kiocb*iocb;
+   struct mm_struct*mm;
+   struct work_struct  work;
void*buf;
const struct iovec  *iv;
unsigned long   nr_segs;
@@ -541,15 +545,12 @@ static int ep_aio_cancel(struct kiocb *iocb, struct 
io_event *e)
return value;
 }
 
-static ssize_t ep_aio_read_retry(struct kiocb *iocb)
+static ssize_t ep_copy_to_user(struct kiocb_priv *priv)
 {
-   struct kiocb_priv   *priv = iocb->private;
ssize_t len, total;
void*to_copy;
int i;
 
-   /* we "retry" to get the right mm context for this: */
-
/* copy stuff into user buffers */
total = priv->actual;
len = 0;
@@ -569,9 +570,26 @@ static ssize_t ep_aio_read_retry(struct kiocb *iocb)
if (total == 0)
break;
}
+
+   return len;
+}
+
+static void ep_user_copy_worker(struct work_struct *work)
+{
+   struct kiocb_priv *priv = container_of(work, struct kiocb_priv, work);
+   struct mm_struct *mm = priv->mm;
+   struct kiocb *iocb = priv->iocb;
+   size_t ret;
+
+   use_mm(mm);
+   ret = ep_copy_to_user(priv);
+   unuse_mm(mm);
+
+   /* completing the iocb can drop the ctx and mm, don't touch mm after */
+   aio_complete(iocb, ret, ret);
+
kfree(priv->buf);
kfree(priv);
-   return len;
 }
 
 static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req)
@@ -597,14 +615,14 @@ static void ep_aio_complete(struct usb_ep *ep, struct 
usb_request *req)
aio_complete(iocb, req->actual ? req->actual : req->status,
req->status);
} else {
-   /* retry() won't report both; so we hide some faults */
+   /* ep_copy_to_user() won't report both; we hide some faults */
if (unlikely(0 != req->status))
DBG(epdata->dev, "%s fault %d len %d\n",
ep->name, req->status, req->actual);
 
priv->buf = req->buf;
priv->actual = req->actual;
-   kick_iocb(iocb);
+   schedule_work(&priv->work);
}
spin_unlock(>dev->lock);
 
@@ -634,8 +652,10 @@ fail:
return value;
}
iocb->private = priv;
+   priv->iocb = iocb;
priv->iv = iv;
priv->nr_segs = nr_segs;
+   INIT_WORK(&priv->work, ep_user_copy_worker);
 
value = get_ready_ep(iocb->ki_filp->f_flags, epdata);
if (unlikely(value < 0)) {
@@ -647,6 +667,7 @@ fail:
get_ep(epdata);
priv->epdata = epdata;
priv->actual = 0;
+   priv->mm = current->mm; /* mm teardown waits for iocbs in exit_aio() */
 
/* each kiocb is coupled to one usb_request, but we can't
 * allocate or submit those if the host disconnected.
@@ -675,7 +696,7 @@ fail:
kfree(priv);
put_ep(epdata);
} else
-   value = (iv ? -EIOCBRETRY : -EIOCBQUEUED);
+   value = -EIOCBQUEUED;
return value;
 }
 
@@ -693,7 +714,6 @@ ep_aio_read(struct kiocb *iocb, const struct iovec *iov,
if (unlikely(!buf))
return -ENOMEM;
 
-   iocb->ki_retry = ep_aio_read_retry;
return ep_aio_rwtail(iocb, buf, iocb->ki_left, epdata, iov, nr_segs);
 }
 
-- 
1.7.12



[PATCH 00/32] AIO performance improvements/cleanups, v3

2012-12-26 Thread Kent Overstreet
Last posting: http://article.gmane.org/gmane.linux.kernel.aio.general/3242

As before, changes should mostly be noted in the patch descriptions. 

Some random bits:
 * flush_dcache_page() patch is new
 * Rewrote the aio_read_evt() stuff again
 * Fixed a few comments
 * Included some more patches, notably the batch completion stuff

My git repo has Jens' aio/dio patches on top of this stuff. As of the
latest version, I'm seeing a couple percent better throughput with the
ring buffer, and I think Jens was seeing a couple percent better with
his linked list approach - at this point I think the difference is
noise, we're both testing with fairly crappy drivers.

Patch series is on top of v3.7, git repo is at
http://evilpiepirate.org/git/linux-bcache.git aio-upstream


Kent Overstreet (27):
  aio: Kill return value of aio_complete()
  aio: kiocb_cancel()
  aio: Move private stuff out of aio.h
  aio: dprintk() -> pr_debug()
  aio: do fget() after aio_get_req()
  aio: Make aio_put_req() lockless
  aio: Refcounting cleanup
  wait: Add wait_event_hrtimeout()
  aio: Make aio_read_evt() more efficient, convert to hrtimers
  aio: Use flush_dcache_page()
  aio: Use cancellation list lazily
  aio: Change reqs_active to include unreaped completions
  aio: Kill batch allocation
  aio: Kill struct aio_ring_info
  aio: Give shared kioctx fields their own cachelines
  aio: reqs_active -> reqs_available
  aio: percpu reqs_available
  Generic dynamic per cpu refcounting
  aio: Percpu ioctx refcount
  aio: use xchg() instead of completion_lock
  aio: Don't include aio.h in sched.h
  aio: Kill ki_key
  aio: Kill ki_retry
  block, aio: Batch completion for bios/kiocbs
  virtio-blk: Convert to batch completion
  mtip32xx: Convert to batch completion
  aio: Smoosh struct kiocb

Zach Brown (5):
  mm: remove old aio use_mm() comment
  aio: remove dead code from aio.h
  gadget: remove only user of aio retry
  aio: remove retry-based AIO
  char: add aio_{read,write} to /dev/{null,zero}

 arch/s390/hypfs/inode.c  |1 +
 block/blk-core.c |   34 +-
 block/blk-flush.c|2 +-
 block/blk.h  |3 +-
 block/scsi_ioctl.c   |1 +
 drivers/block/mtip32xx/mtip32xx.c|   68 +-
 drivers/block/mtip32xx/mtip32xx.h|8 +-
 drivers/block/swim3.c|2 +-
 drivers/block/virtio_blk.c   |   31 +-
 drivers/char/mem.c   |   36 +
 drivers/infiniband/hw/ipath/ipath_file_ops.c |1 +
 drivers/infiniband/hw/qib/qib_file_ops.c |2 +-
 drivers/md/dm.c  |2 +-
 drivers/staging/android/logger.c |1 +
 drivers/usb/gadget/inode.c   |   42 +-
 fs/9p/vfs_addr.c |1 +
 fs/afs/write.c   |1 +
 fs/aio.c | 1766 +++---
 fs/bio.c |   52 +-
 fs/block_dev.c   |1 +
 fs/btrfs/file.c  |1 +
 fs/btrfs/inode.c |1 +
 fs/ceph/file.c   |1 +
 fs/compat.c  |1 +
 fs/direct-io.c   |   21 +-
 fs/ecryptfs/file.c   |1 +
 fs/ext2/inode.c  |1 +
 fs/ext3/inode.c  |1 +
 fs/ext4/file.c   |1 +
 fs/ext4/indirect.c   |1 +
 fs/ext4/inode.c  |1 +
 fs/ext4/page-io.c|1 +
 fs/fat/inode.c   |1 +
 fs/fuse/dev.c|1 +
 fs/fuse/file.c   |1 +
 fs/gfs2/aops.c   |1 +
 fs/gfs2/file.c   |1 +
 fs/hfs/inode.c   |1 +
 fs/hfsplus/inode.c   |1 +
 fs/jfs/inode.c   |1 +
 fs/nilfs2/inode.c|2 +-
 fs/ntfs/file.c   |1 +
 fs/ntfs/inode.c  |1 +
 fs/ocfs2/aops.h  |2 +
 fs/ocfs2/dlmglue.c   |2 +-
 fs/ocfs2/inode.h |2 +
 fs/pipe.c|1 +
 fs/read_write.c  |   35 +-
 fs/reiserfs/inode.c  |1 +
 fs/ubifs/file.c  |1 +
 fs/udf/inode.c   |1 +
 fs/xfs/xfs_aops.c|1 +
 fs/xfs/xfs_file.c|1 +
 include/linux/aio.h  |  

[PATCH 19/32] aio: Kill struct aio_ring_info

2012-12-26 Thread Kent Overstreet
struct aio_ring_info was kind of odd; the only place it's used is embedded
in struct kioctx, so there's no real need for it.

The next patch rearranges struct kioctx and puts various things on their
own cachelines - getting rid of struct aio_ring_info now makes that
reordering a bit clearer.
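
To make the mechanical part of the conversion concrete, a sketch
(illustrative only, not part of the diff):

	/* before: one level of indirection through the embedded struct */
	struct aio_ring_info *info = &ctx->ring_info;
	put_page(info->ring_pages[i]);

	/* after: the same fields live directly on the kioctx */
	put_page(ctx->ring_pages[i]);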

Signed-off-by: Kent Overstreet 
---
 fs/aio.c | 149 ++-
 1 file changed, 71 insertions(+), 78 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 5ca383e..96fbd6b 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -58,18 +58,6 @@ struct aio_ring {
 }; /* 128 bytes + ring size */
 
 #define AIO_RING_PAGES 8
-struct aio_ring_info {
-	unsigned long		mmap_base;
-	unsigned long		mmap_size;
-
-	struct page		**ring_pages;
-	struct mutex		ring_lock;
-	long			nr_pages;
-
-	unsigned		nr, tail;
-
-	struct page		*internal_pages[AIO_RING_PAGES];
-};
 
 struct kioctx {
	atomic_t		users;
@@ -86,12 +74,27 @@ struct kioctx {
	atomic_t		reqs_active;
	struct list_head	active_reqs;	/* used for cancellation */

+	unsigned		nr;
+
	/* sys_io_setup currently limits this to an unsigned int */
	unsigned		max_reqs;

-	struct aio_ring_info	ring_info;
+	unsigned long		mmap_base;
+	unsigned long		mmap_size;
+
+	struct page		**ring_pages;
+	long			nr_pages;

-	spinlock_t		completion_lock;
+	struct {
+		struct mutex	ring_lock;
+	} ____cacheline_aligned_in_smp;
+
+	struct {
+		unsigned	tail;
+		spinlock_t	completion_lock;
+	} ____cacheline_aligned_in_smp;
+
+	struct page		*internal_pages[AIO_RING_PAGES];

	struct rcu_head		rcu_head;
	struct work_struct	rcu_work;
@@ -123,26 +126,21 @@ __initcall(aio_setup);
 
 static void aio_free_ring(struct kioctx *ctx)
 {
-	struct aio_ring_info *info = &ctx->ring_info;
	long i;

-	for (i=0; i<info->nr_pages; i++)
-		put_page(info->ring_pages[i]);
+	for (i = 0; i < ctx->nr_pages; i++)
+		put_page(ctx->ring_pages[i]);
 
-   if (info->mmap_size) {
-   vm_munmap(info->mmap_base, info->mmap_size);
-   }
+   if (ctx->mmap_size)
+   vm_munmap(ctx->mmap_base, ctx->mmap_size);
 
-   if (info->ring_pages && info->ring_pages != info->internal_pages)
-   kfree(info->ring_pages);
-   info->ring_pages = NULL;
-   info->nr = 0;
+   if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages)
+   kfree(ctx->ring_pages);
 }
 
 static int aio_setup_ring(struct kioctx *ctx)
 {
struct aio_ring *ring;
-	struct aio_ring_info *info = &ctx->ring_info;
unsigned nr_events = ctx->max_reqs;
struct mm_struct *mm = current->mm;
unsigned long size;
@@ -160,42 +158,42 @@ static int aio_setup_ring(struct kioctx *ctx)
 
nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / 
sizeof(struct io_event);
 
-   info->nr = 0;
-   info->ring_pages = info->internal_pages;
+   ctx->nr = 0;
+   ctx->ring_pages = ctx->internal_pages;
if (nr_pages > AIO_RING_PAGES) {
-   info->ring_pages = kcalloc(nr_pages, sizeof(struct page *), 
GFP_KERNEL);
-   if (!info->ring_pages)
+   ctx->ring_pages = kcalloc(nr_pages, sizeof(struct page *), 
GFP_KERNEL);
+   if (!ctx->ring_pages)
return -ENOMEM;
}
 
-   info->mmap_size = nr_pages * PAGE_SIZE;
-   pr_debug("attempting mmap of %lu bytes\n", info->mmap_size);
+   ctx->mmap_size = nr_pages * PAGE_SIZE;
+   pr_debug("attempting mmap of %lu bytes\n", ctx->mmap_size);
	down_write(&mm->mmap_sem);
-   info->mmap_base = do_mmap_pgoff(NULL, 0, info->mmap_size, 
-   PROT_READ|PROT_WRITE,
-   MAP_ANONYMOUS|MAP_PRIVATE, 0);
-   if (IS_ERR((void *)info->mmap_base)) {
+   ctx->mmap_base = do_mmap_pgoff(NULL, 0, ctx->mmap_size,
+  PROT_READ|PROT_WRITE,
+  MAP_ANONYMOUS|MAP_PRIVATE, 0);
+   if (IS_ERR((void *)ctx->mmap_base)) {
		up_write(&mm->mmap_sem);
-   info->mmap_size = 0;
+   ctx->mmap_size = 0;
aio_free_ring(ctx);
return -EAGAIN;
}
 
-   pr_debug("mmap address: 0x%08lx\n", info->mmap_base);
-   info->nr_pages = get_user_pages(current, mm, info->mmap_base, nr_pages, 
-   1, 0, info->ring_pages, NULL);
+   pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base);
+   ctx->nr_pages = 

[PATCH 16/32] aio: Use cancellation list lazily

2012-12-26 Thread Kent Overstreet
Cancelling kiocbs requires adding them to a per kioctx linked list,
which is one of the few things we need to take the kioctx lock for in
the fast path. But most kiocbs can't be cancelled - so if we just do
this lazily, we can avoid quite a bit of locking overhead.

While we're at it, instead of using a flag bit, switch to using ki_cancel
itself to indicate that a kiocb has been cancelled/completed. This lets
us get rid of ki_flags entirely.
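
A sketch of the resulting ki_cancel state machine (illustrative;
KIOCB_CANCELLED is assumed to be a sentinel pointer value that can never
alias a real function):

	/* ki_cancel == NULL            -> never made cancellable, not on any list
	 * ki_cancel == a real fn       -> on ctx->active_reqs, may be cancelled
	 * ki_cancel == KIOCB_CANCELLED -> cancel or completion already claimed it
	 */
	kiocb_cancel_fn *cancel = xchg(&kiocb->ki_cancel, KIOCB_CANCELLED);
	if (cancel && cancel != KIOCB_CANCELLED)
		cancel(kiocb, &res);	/* exactly one caller can win this race */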

Signed-off-by: Kent Overstreet 
---
 drivers/usb/gadget/inode.c |  3 +-
 fs/aio.c   | 95 +-
 include/linux/aio.h| 16 
 3 files changed, 59 insertions(+), 55 deletions(-)

diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
index 7640e01..3bf0c35 100644
--- a/drivers/usb/gadget/inode.c
+++ b/drivers/usb/gadget/inode.c
@@ -534,7 +534,6 @@ static int ep_aio_cancel(struct kiocb *iocb, struct 
io_event *e)
local_irq_disable();
epdata = priv->epdata;
	// spin_lock(&epdata->dev->lock);
-   kiocbSetCancelled(iocb);
if (likely(epdata && epdata->ep && priv->req))
value = usb_ep_dequeue (epdata->ep, priv->req);
else
@@ -664,7 +663,7 @@ fail:
goto fail;
}
 
-   iocb->ki_cancel = ep_aio_cancel;
+   kiocb_set_cancel_fn(iocb, ep_aio_cancel);
get_ep(epdata);
priv->epdata = epdata;
priv->actual = 0;
diff --git a/fs/aio.c b/fs/aio.c
index c1047c8..276c6ea 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -97,6 +97,8 @@ struct kioctx {
 
	struct aio_ring_info	ring_info;
 
+   spinlock_t  completion_lock;
+
struct rcu_head rcu_head;
struct work_struct  rcu_work;
 };
@@ -217,25 +219,40 @@ static int aio_setup_ring(struct kioctx *ctx)
 #define AIO_EVENTS_FIRST_PAGE  ((PAGE_SIZE - sizeof(struct aio_ring)) / 
sizeof(struct io_event))
 #define AIO_EVENTS_OFFSET  (AIO_EVENTS_PER_PAGE - AIO_EVENTS_FIRST_PAGE)
 
+void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel)
+{
+   if (!req->ki_list.next) {
+   struct kioctx *ctx = req->ki_ctx;
+   unsigned long flags;
+
+		spin_lock_irqsave(&ctx->ctx_lock, flags);
+		list_add(&req->ki_list, &ctx->active_reqs);
+		spin_unlock_irqrestore(&ctx->ctx_lock, flags);
+   }
+
+   req->ki_cancel = cancel;
+}
+EXPORT_SYMBOL(kiocb_set_cancel_fn);
+
 static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb,
struct io_event *res)
 {
-   int (*cancel)(struct kiocb *, struct io_event *);
+   kiocb_cancel_fn *cancel;
int ret = -EINVAL;
 
-   cancel = kiocb->ki_cancel;
-   kiocbSetCancelled(kiocb);
-   if (cancel) {
-		atomic_inc(&kiocb->ki_users);
-		spin_unlock_irq(&ctx->ctx_lock);
+	cancel = xchg(&kiocb->ki_cancel, KIOCB_CANCELLED);
+	if (!cancel || cancel == KIOCB_CANCELLED)
+		return ret;
+
+	atomic_inc(&kiocb->ki_users);
+	spin_unlock_irq(&ctx->ctx_lock);
 
-   memset(res, 0, sizeof(*res));
-   res->obj = (u64) kiocb->ki_obj.user;
-   res->data = kiocb->ki_user_data;
-   ret = cancel(kiocb, res);
+   memset(res, 0, sizeof(*res));
+   res->obj = (u64) kiocb->ki_obj.user;
+   res->data = kiocb->ki_user_data;
+   ret = cancel(kiocb, res);
 
-		spin_lock_irq(&ctx->ctx_lock);
-	}
+	spin_lock_irq(&ctx->ctx_lock);
 
return ret;
 }
@@ -323,6 +340,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
	atomic_set(&ctx->users, 2);
	atomic_set(&ctx->dead, 0);
	spin_lock_init(&ctx->ctx_lock);
+	spin_lock_init(&ctx->completion_lock);
	mutex_init(&ctx->ring_info.ring_lock);
	init_waitqueue_head(&ctx->wait);
 
@@ -465,20 +483,12 @@ static struct kiocb *__aio_get_req(struct kioctx *ctx)
 {
struct kiocb *req = NULL;
 
-   req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
+   req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
if (unlikely(!req))
return NULL;
 
-   req->ki_flags = 0;
	atomic_set(&req->ki_users, 2);
-   req->ki_key = 0;
req->ki_ctx = ctx;
-   req->ki_cancel = NULL;
-   req->ki_retry = NULL;
-   req->ki_dtor = NULL;
-   req->private = NULL;
-   req->ki_iovec = NULL;
-   req->ki_eventfd = NULL;
 
return req;
 }
@@ -509,7 +519,6 @@ static void kiocb_batch_free(struct kioctx *ctx, struct 
kiocb_batch *batch)
	spin_lock_irq(&ctx->ctx_lock);
	list_for_each_entry_safe(req, n, &batch->head, ki_batch) {
		list_del(&req->ki_batch);
-		list_del(&req->ki_list);
		kmem_cache_free(kiocb_cachep, req);
		atomic_dec(&ctx->reqs_active);
}
@@ -555,10 +564,7 @@ static int kiocb_batch_refill(struct kioctx *ctx, struct 
kiocb_batch *batch)
}
 
batch->count -= allocated;
-	list_for_each_entry(req, &batch->head, ki_batch) {
-

[PATCH 04/32] aio: remove retry-based AIO

2012-12-26 Thread Kent Overstreet
From: Zach Brown 

This removes the retry-based AIO infrastructure now that nothing in tree
is using it.

We want to remove retry-based AIO because it is fundamentally unsafe.
It retries IO submission from a kernel thread that has only assumed the
mm of the submitting task.  All other task_struct references in the IO
submission path will see the kernel thread, not the submitting task.
This design flaw means that nothing of any meaningful complexity can use
retry-based AIO.
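
To make the flaw concrete, a sketch (not code from the tree) of what the
retry thread could and could not borrow from the submitter:

	use_mm(mm);	/* page tables: copy_*_user() now hits the right mm */

	/*
	 * But `current' is still the aio kernel thread, so anything the
	 * retried submission path reads off the task is wrong:
	 *
	 *	current->files  - fd lookups resolve in the wrong table
	 *	current->cred   - permission checks use kthread credentials
	 *	current->signal - rlimits etc. are the kthread's
	 */

	unuse_mm(mm);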

This removes all the code and data associated with the retry machinery.
The most significant benefit of this is the removal of the locking
around the unused run list in the submission path.

This has only been compiled.

Signed-off-by: Kent Overstreet 
---
 fs/aio.c  | 348 --
 fs/ocfs2/dlmglue.c|   2 +-
 fs/read_write.c   |  34 +
 include/linux/aio.h   |  22 
 include/linux/errno.h |   1 -
 5 files changed, 29 insertions(+), 378 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 71f613c..1de4f78 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -54,11 +54,6 @@ unsigned long aio_max_nr = 0x1; /* system wide maximum 
number of aio request
 static struct kmem_cache   *kiocb_cachep;
 static struct kmem_cache   *kioctx_cachep;
 
-static struct workqueue_struct *aio_wq;
-
-static void aio_kick_handler(struct work_struct *);
-static void aio_queue_work(struct kioctx *);
-
 /* aio_setup
  * Creates the slab caches used by the aio routines, panic on
  * failure as this is done early during the boot sequence.
@@ -68,9 +63,6 @@ static int __init aio_setup(void)
kiocb_cachep = KMEM_CACHE(kiocb, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
kioctx_cachep = KMEM_CACHE(kioctx,SLAB_HWCACHE_ALIGN|SLAB_PANIC);
 
-   aio_wq = alloc_workqueue("aio", 0, 1);  /* used to limit concurrency */
-   BUG_ON(!aio_wq);
-
pr_debug("aio_setup: sizeof(struct page) = %d\n", (int)sizeof(struct 
page));
 
return 0;
@@ -86,7 +78,6 @@ static void aio_free_ring(struct kioctx *ctx)
put_page(info->ring_pages[i]);
 
if (info->mmap_size) {
-   BUG_ON(ctx->mm != current->mm);
vm_munmap(info->mmap_base, info->mmap_size);
}
 
@@ -101,6 +92,7 @@ static int aio_setup_ring(struct kioctx *ctx)
struct aio_ring *ring;
struct aio_ring_info *info = >ring_info;
unsigned nr_events = ctx->max_reqs;
+   struct mm_struct *mm = current->mm;
unsigned long size;
int nr_pages;
 
@@ -126,22 +118,21 @@ static int aio_setup_ring(struct kioctx *ctx)
 
info->mmap_size = nr_pages * PAGE_SIZE;
dprintk("attempting mmap of %lu bytes\n", info->mmap_size);
-	down_write(&ctx->mm->mmap_sem);
+	down_write(&mm->mmap_sem);
info->mmap_base = do_mmap_pgoff(NULL, 0, info->mmap_size, 
PROT_READ|PROT_WRITE,
MAP_ANONYMOUS|MAP_PRIVATE, 0);
if (IS_ERR((void *)info->mmap_base)) {
-		up_write(&ctx->mm->mmap_sem);
+		up_write(&mm->mmap_sem);
info->mmap_size = 0;
aio_free_ring(ctx);
return -EAGAIN;
}
 
dprintk("mmap address: 0x%08lx\n", info->mmap_base);
-   info->nr_pages = get_user_pages(current, ctx->mm,
-   info->mmap_base, nr_pages, 
+   info->nr_pages = get_user_pages(current, mm, info->mmap_base, nr_pages, 
1, 0, info->ring_pages, NULL);
-	up_write(&ctx->mm->mmap_sem);
+	up_write(&mm->mmap_sem);
 
if (unlikely(info->nr_pages != nr_pages)) {
aio_free_ring(ctx);
@@ -203,10 +194,7 @@ static void __put_ioctx(struct kioctx *ctx)
unsigned nr_events = ctx->max_reqs;
BUG_ON(ctx->reqs_active);
 
-	cancel_delayed_work_sync(&ctx->wq);
aio_free_ring(ctx);
-   mmdrop(ctx->mm);
-   ctx->mm = NULL;
if (nr_events) {
	spin_lock(&aio_nr_lock);
BUG_ON(aio_nr - nr_events > aio_nr);
@@ -234,7 +222,7 @@ static inline void put_ioctx(struct kioctx *kioctx)
  */
 static struct kioctx *ioctx_alloc(unsigned nr_events)
 {
-   struct mm_struct *mm;
+   struct mm_struct *mm = current->mm;
struct kioctx *ctx;
int err = -ENOMEM;
 
@@ -253,8 +241,6 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
return ERR_PTR(-ENOMEM);
 
ctx->max_reqs = nr_events;
-   mm = ctx->mm = current->mm;
-	atomic_inc(&mm->mm_count);
 
	atomic_set(&ctx->users, 2);
	spin_lock_init(&ctx->ctx_lock);
@@ -262,8 +248,6 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
	init_waitqueue_head(&ctx->wait);

	INIT_LIST_HEAD(&ctx->active_reqs);
-	INIT_LIST_HEAD(&ctx->run_list);
-	INIT_DELAYED_WORK(&ctx->wq, aio_kick_handler);
 
if (aio_setup_ring(ctx) < 0)
goto out_freectx;
@@ -284,14 +268,13 @@ static 

[PATCH 13/32] wait: Add wait_event_hrtimeout()

2012-12-26 Thread Kent Overstreet
Analogous to wait_event_timeout() and friends, this adds
wait_event_hrtimeout() and wait_event_interruptible_hrtimeout().

Note that unlike the versions that use regular timers, these don't
return the amount of time remaining when they return - instead, they
return 0 or -ETIME if they timed out.
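
Usage sketch (assuming a driver-private waitqueue `wq' and condition
`done'):

	ktime_t timeout = ktime_set(0, 10 * NSEC_PER_MSEC);	/* 10ms */
	int ret = wait_event_interruptible_hrtimeout(wq, done, timeout);

	switch (ret) {
	case 0:			/* condition became true */
		break;
	case -ETIME:		/* timed out - note: no "time remaining" here */
		break;
	case -ERESTARTSYS:	/* interrupted by a signal */
		break;
	}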

Signed-off-by: Kent Overstreet 
---
 include/linux/wait.h | 86 
 1 file changed, 86 insertions(+)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 168dfe1..3088723 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -330,6 +330,92 @@ do {   
\
__ret;  \
 })
 
+#define __wait_event_hrtimeout(wq, condition, timeout, state)  \
+({ \
+   int __ret = 0;  \
+   DEFINE_WAIT(__wait);\
+   struct hrtimer_sleeper __t; \
+   \
+   hrtimer_init_on_stack(&__t.timer, CLOCK_MONOTONIC,  \
+ HRTIMER_MODE_REL);\
+   hrtimer_init_sleeper(&__t, current);\
+   if ((timeout).tv64 != KTIME_MAX)\
+   hrtimer_start_range_ns(&__t.timer, timeout, \
+  current->timer_slack_ns, \
+  HRTIMER_MODE_REL);   \
+   \
+   for (;;) {  \
+		prepare_to_wait(&wq, &__wait, state);			\
+   if (condition)  \
+   break;  \
+   if (state == TASK_INTERRUPTIBLE &&  \
+   signal_pending(current)) {  \
+   __ret = -ERESTARTSYS;   \
+   break;  \
+   }   \
+   if (!__t.task) {\
+   __ret = -ETIME; \
+   break;  \
+   }   \
+   schedule(); \
+   }   \
+   \
+   hrtimer_cancel(&__t.timer); \
+   destroy_hrtimer_on_stack(&__t.timer);   \
+	finish_wait(&wq, &__wait);					\
+   __ret;  \
+})
+
+/**
+ * wait_event_hrtimeout - sleep until a condition gets true or a timeout 
elapses
+ * @wq: the waitqueue to wait on
+ * @condition: a C expression for the event to wait for
+ * @timeout: timeout, as a ktime_t
+ *
+ * The process is put to sleep (TASK_UNINTERRUPTIBLE) until the
+ * @condition evaluates to true or the timeout elapses.
+ * The @condition is checked each time the waitqueue @wq is woken up.
+ *
+ * wake_up() has to be called after changing any variable that could
+ * change the result of the wait condition.
+ *
+ * The function returns 0 if @condition became true, or -ETIME if the timeout
+ * elapsed.
+ */
+#define wait_event_hrtimeout(wq, condition, timeout)   \
+({ \
+   int __ret = 0;  \
+   if (!(condition))   \
+   __ret = __wait_event_hrtimeout(wq, condition, timeout,  \
+  TASK_UNINTERRUPTIBLE);   \
+   __ret;  \
+})
+
+/**
+ * wait_event_interruptible_hrtimeout - sleep until a condition gets true or a 
timeout elapses
+ * @wq: the waitqueue to wait on
+ * @condition: a C expression for the event to wait for
+ * @timeout: timeout, as a ktime_t
+ *
+ * The process is put to sleep (TASK_INTERRUPTIBLE) until the
+ * @condition evaluates to true, a signal is received, or the timeout elapses.
+ * The @condition is checked each time the waitqueue @wq is woken up.
+ *
+ * wake_up() has to be called after changing any variable that could
+ * change the result of the wait condition.
+ *
+ * The 

[PATCH 06/32] aio: Kill return value of aio_complete()

2012-12-26 Thread Kent Overstreet
Nothing used the return value, and it probably wasn't possible to use it
safely for the locked versions (aio_complete(), aio_put_req()). Just
kill it.

Acked-by: Zach Brown 
Signed-off-by: Kent Overstreet 
---
 fs/aio.c| 21 +++--
 include/linux/aio.h |  8 
 2 files changed, 11 insertions(+), 18 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 1de4f78..0b85822 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -528,7 +528,7 @@ static inline void really_put_req(struct kioctx *ctx, 
struct kiocb *req)
 /* __aio_put_req
  * Returns true if this put was the last user of the request.
  */
-static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
+static void __aio_put_req(struct kioctx *ctx, struct kiocb *req)
 {
dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
		req, atomic_long_read(&req->ki_filp->f_count));
@@ -538,7 +538,7 @@ static int __aio_put_req(struct kioctx *ctx, struct kiocb 
*req)
req->ki_users--;
BUG_ON(req->ki_users < 0);
if (likely(req->ki_users))
-   return 0;
+   return;
	list_del(&req->ki_list);	/* remove from active_reqs */
req->ki_cancel = NULL;
req->ki_retry = NULL;
@@ -546,21 +546,18 @@ static int __aio_put_req(struct kioctx *ctx, struct kiocb 
*req)
fput(req->ki_filp);
req->ki_filp = NULL;
really_put_req(ctx, req);
-   return 1;
 }
 
 /* aio_put_req
  * Returns true if this put was the last user of the kiocb,
  * false if the request is still in use.
  */
-int aio_put_req(struct kiocb *req)
+void aio_put_req(struct kiocb *req)
 {
struct kioctx *ctx = req->ki_ctx;
-   int ret;
	spin_lock_irq(&ctx->ctx_lock);
-	ret = __aio_put_req(ctx, req);
+	__aio_put_req(ctx, req);
	spin_unlock_irq(&ctx->ctx_lock);
-   return ret;
 }
 EXPORT_SYMBOL(aio_put_req);
 
@@ -591,10 +588,8 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
 
 /* aio_complete
  * Called when the io request on the given iocb is complete.
- * Returns true if this is the last user of the request.  The 
- * only other user of the request can be the cancellation code.
  */
-int aio_complete(struct kiocb *iocb, long res, long res2)
+void aio_complete(struct kiocb *iocb, long res, long res2)
 {
struct kioctx   *ctx = iocb->ki_ctx;
	struct aio_ring_info	*info;
@@ -602,7 +597,6 @@ int aio_complete(struct kiocb *iocb, long res, long res2)
struct io_event *event;
unsigned long   flags;
unsigned long   tail;
-   int ret;
 
/*
 * Special case handling for sync iocbs:
@@ -616,7 +610,7 @@ int aio_complete(struct kiocb *iocb, long res, long res2)
iocb->ki_user_data = res;
iocb->ki_users = 0;
wake_up_process(iocb->ki_obj.tsk);
-   return 1;
+   return;
}
 
	info = &ctx->ring_info;
@@ -675,7 +669,7 @@ int aio_complete(struct kiocb *iocb, long res, long res2)
 
 put_rq:
/* everything turned out well, dispose of the aiocb. */
-   ret = __aio_put_req(ctx, iocb);
+   __aio_put_req(ctx, iocb);
 
/*
 * We have to order our ring_info tail store above and test
@@ -689,7 +683,6 @@ put_rq:
		wake_up(&ctx->wait);

	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
-   return ret;
 }
 EXPORT_SYMBOL(aio_complete);
 
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 019204e..615d55a 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -167,16 +167,16 @@ struct kioctx {
 /* prototypes */
 #ifdef CONFIG_AIO
 extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
-extern int aio_put_req(struct kiocb *iocb);
-extern int aio_complete(struct kiocb *iocb, long res, long res2);
+extern void aio_put_req(struct kiocb *iocb);
+extern void aio_complete(struct kiocb *iocb, long res, long res2);
 struct mm_struct;
 extern void exit_aio(struct mm_struct *mm);
 extern long do_io_submit(aio_context_t ctx_id, long nr,
 struct iocb __user *__user *iocbpp, bool compat);
 #else
 static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
-static inline int aio_put_req(struct kiocb *iocb) { return 0; }
-static inline int aio_complete(struct kiocb *iocb, long res, long res2) { 
return 0; }
+static inline void aio_put_req(struct kiocb *iocb) { }
+static inline void aio_complete(struct kiocb *iocb, long res, long res2) { }
 struct mm_struct;
 static inline void exit_aio(struct mm_struct *mm) { }
 static inline long do_io_submit(aio_context_t ctx_id, long nr,
-- 
1.7.12



[PATCH 01/32] mm: remove old aio use_mm() comment

2012-12-26 Thread Kent Overstreet
From: Zach Brown 

use_mm() is used in more places than just aio.  There's no need to
mention callers when describing the function.

Signed-off-by: Zach Brown 
Signed-off-by: Kent Overstreet 
---
 mm/mmu_context.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 3dcfaf4..8a8cd02 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -14,9 +14,6 @@
  * use_mm
  * Makes the calling kernel thread take on the specified
  * mm context.
- * Called by the retry thread execute retries within the
- * iocb issuer's mm context, so that copy_from/to_user
- * operations work seamlessly for aio.
  * (Note: this routine is intended to be called only
  * from a kernel thread context)
  */
-- 
1.7.12



[PATCH 26/32] aio: Don't include aio.h in sched.h

2012-12-26 Thread Kent Overstreet
Faster kernel compiles by way of fewer unnecessary includes.

Signed-off-by: Kent Overstreet 
---
 arch/s390/hypfs/inode.c  | 1 +
 block/scsi_ioctl.c   | 1 +
 drivers/char/mem.c   | 1 +
 drivers/infiniband/hw/ipath/ipath_file_ops.c | 1 +
 drivers/infiniband/hw/qib/qib_file_ops.c | 2 +-
 drivers/staging/android/logger.c | 1 +
 fs/9p/vfs_addr.c | 1 +
 fs/afs/write.c   | 1 +
 fs/block_dev.c   | 1 +
 fs/btrfs/file.c  | 1 +
 fs/btrfs/inode.c | 1 +
 fs/ceph/file.c   | 1 +
 fs/compat.c  | 1 +
 fs/direct-io.c   | 1 +
 fs/ecryptfs/file.c   | 1 +
 fs/ext2/inode.c  | 1 +
 fs/ext3/inode.c  | 1 +
 fs/ext4/file.c   | 1 +
 fs/ext4/indirect.c   | 1 +
 fs/ext4/inode.c  | 1 +
 fs/ext4/page-io.c| 1 +
 fs/fat/inode.c   | 1 +
 fs/fuse/dev.c| 1 +
 fs/fuse/file.c   | 1 +
 fs/gfs2/aops.c   | 1 +
 fs/gfs2/file.c   | 1 +
 fs/hfs/inode.c   | 1 +
 fs/hfsplus/inode.c   | 1 +
 fs/jfs/inode.c   | 1 +
 fs/nilfs2/inode.c| 2 +-
 fs/ntfs/file.c   | 1 +
 fs/ntfs/inode.c  | 1 +
 fs/ocfs2/aops.h  | 2 ++
 fs/ocfs2/inode.h | 2 ++
 fs/pipe.c| 1 +
 fs/read_write.c  | 1 +
 fs/reiserfs/inode.c  | 1 +
 fs/ubifs/file.c  | 1 +
 fs/udf/inode.c   | 1 +
 fs/xfs/xfs_aops.c| 1 +
 fs/xfs/xfs_file.c| 1 +
 include/linux/cgroup.h   | 1 +
 include/linux/sched.h| 2 --
 kernel/fork.c| 1 +
 kernel/printk.c  | 1 +
 kernel/ptrace.c  | 1 +
 mm/page_io.c | 1 +
 mm/shmem.c   | 1 +
 mm/swap.c| 1 +
 security/keys/internal.h | 2 ++
 security/keys/keyctl.c   | 1 +
 sound/core/pcm_native.c  | 2 +-
 52 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/arch/s390/hypfs/inode.c b/arch/s390/hypfs/inode.c
index 06ea69b..c6c6f43 100644
--- a/arch/s390/hypfs/inode.c
+++ b/arch/s390/hypfs/inode.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include <linux/aio.h>
 #include 
 #include "hypfs.h"
 
diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index 9a87daa..a5ffcc9 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include <linux/aio.h>
 #include 
 
 #include 
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 968ae6e..6447854 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include <linux/aio.h>
 
 #include 
 
diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c 
b/drivers/infiniband/hw/ipath/ipath_file_ops.c
index 3eb7e45..62edc41 100644
--- a/drivers/infiniband/hw/ipath/ipath_file_ops.c
+++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include <linux/aio.h>
 #include 
 #include 
 #include 
diff --git a/drivers/infiniband/hw/qib/qib_file_ops.c 
b/drivers/infiniband/hw/qib/qib_file_ops.c
index 959a5c4..488300c 100644
--- a/drivers/infiniband/hw/qib/qib_file_ops.c
+++ b/drivers/infiniband/hw/qib/qib_file_ops.c
@@ -39,7 +39,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/staging/android/logger.c b/drivers/staging/android/logger.c
index 1d5ed47..c79c101 100644
--- a/drivers/staging/android/logger.c
+++ b/drivers/staging/android/logger.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include <linux/aio.h>
 #include "logger.h"
 
 #include 
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 0ad61c6..055562c 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include <linux/aio.h>
 #include 
 #include 
 
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 9aa52d9..5151ea3 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include <linux/aio.h>
 #include "internal.h"
 
 static int afs_write_back_from_locked_page(struct afs_writeback *wb,
diff --git a/fs/block_dev.c 

[PATCH 15/32] aio: Use flush_dcache_page()

2012-12-26 Thread Kent Overstreet
Signed-off-by: Kent Overstreet 
---
 fs/aio.c | 45 +
 1 file changed, 17 insertions(+), 28 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 06e1dd0..c1047c8 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -208,33 +208,15 @@ static int aio_setup_ring(struct kioctx *ctx)
ring->incompat_features = AIO_RING_INCOMPAT_FEATURES;
ring->header_length = sizeof(struct aio_ring);
kunmap_atomic(ring);
+   flush_dcache_page(info->ring_pages[0]);
 
return 0;
 }
 
-
-/* aio_ring_event: returns a pointer to the event at the given index from
- * kmap_atomic().  Release the pointer with put_aio_ring_event();
- */
 #define AIO_EVENTS_PER_PAGE(PAGE_SIZE / sizeof(struct io_event))
 #define AIO_EVENTS_FIRST_PAGE  ((PAGE_SIZE - sizeof(struct aio_ring)) / 
sizeof(struct io_event))
 #define AIO_EVENTS_OFFSET  (AIO_EVENTS_PER_PAGE - AIO_EVENTS_FIRST_PAGE)
 
-#define aio_ring_event(info, nr) ({\
-   unsigned pos = (nr) + AIO_EVENTS_OFFSET;\
-   struct io_event *__event;   \
-   __event = kmap_atomic(  \
-   (info)->ring_pages[pos / AIO_EVENTS_PER_PAGE]); \
-   __event += pos % AIO_EVENTS_PER_PAGE;   \
-   __event;\
-})
-
-#define put_aio_ring_event(event) do { \
-   struct io_event *__event = (event); \
-   (void)__event;  \
-   kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK)); \
-} while(0)
-
 static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb,
struct io_event *res)
 {
@@ -645,9 +627,9 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
struct kioctx   *ctx = iocb->ki_ctx;
	struct aio_ring_info	*info;
struct aio_ring *ring;
-   struct io_event *event;
+   struct io_event *ev_page, *event;
unsigned long   flags;
-   unsigned long   tail;
+   unsigned tail, pos;
 
/*
 * Special case handling for sync iocbs:
@@ -686,19 +668,24 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
if (kiocbIsCancelled(iocb))
goto put_rq;
 
-   ring = kmap_atomic(info->ring_pages[0]);
-
tail = info->tail;
-   event = aio_ring_event(info, tail);
+   pos = tail + AIO_EVENTS_OFFSET;
+
if (++tail >= info->nr)
tail = 0;
 
+   ev_page = kmap_atomic(info->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
+   event = ev_page + pos % AIO_EVENTS_PER_PAGE;
+
event->obj = (u64)(unsigned long)iocb->ki_obj.user;
event->data = iocb->ki_user_data;
event->res = res;
event->res2 = res2;
 
-   pr_debug("%p[%lu]: %p: %p %Lx %lx %lx\n",
+   kunmap_atomic(ev_page);
+   flush_dcache_page(info->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
+
+   pr_debug("%p[%u]: %p: %p %Lx %lx %lx\n",
 ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
 res, res2);
 
@@ -708,12 +695,13 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
smp_wmb();  /* make event visible before updating tail */
 
info->tail = tail;
-   ring->tail = tail;
 
-   put_aio_ring_event(event);
+   ring = kmap_atomic(info->ring_pages[0]);
+   ring->tail = tail;
kunmap_atomic(ring);
+   flush_dcache_page(info->ring_pages[0]);
 
-   pr_debug("added to ring %p at [%lu]\n", iocb, tail);
+   pr_debug("added to ring %p at [%u]\n", iocb, tail);
 
/*
 * Check if the user asked us to deliver the result through an
@@ -805,6 +793,7 @@ static int aio_read_events_ring(struct kioctx *ctx,
ring = kmap_atomic(info->ring_pages[0]);
ring->head = head;
kunmap_atomic(ring);
+   flush_dcache_page(info->ring_pages[0]);
 
pr_debug("%d  h%u t%u\n", ret, head, info->tail);
 out:
-- 
1.7.12



Re: [PATCH] cma: use unsigned type for count argument

2012-12-26 Thread Michal Nazarewicz
> On Sat, 22 Dec 2012, Michal Nazarewicz wrote:
>> So I think just adding the following, should be sufficient to make
>> everyone happy:
>> 
>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
>> index e34e3e0..e91743b 100644
>> --- a/drivers/base/dma-contiguous.c
>> +++ b/drivers/base/dma-contiguous.c
>> @@ -320,7 +320,7 @@ struct page *dma_alloc_from_contiguous(struct device 
>> *dev, unsigned int count,
>>  pr_debug("%s(cma %p, count %u, align %u)\n", __func__, (void *)cma,
>>   count, align);
>>  
>> -if (!count)
>> +if (!count || count > INT_MAX)
>>  return NULL;
>>  
>>  mask = (1 << align) - 1;
>
On Thu, Dec 27 2012, David Rientjes  wrote:
> How is this different than leaving the formal to have a signed type, i.e. 
> drop your patch, and testing for count <= 0 instead?

Not much different I guess.  I don't have strong opinions to be honest,
except that I feel unsigned is the proper type to use, on top of which
I think bitmap_set() should use unsigned, so in case anyone ever bothers
to change it, CMA will be ready. :P
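
For what it's worth, a userspace illustration of the wraparound the
INT_MAX check guards against, given that bitmap_set() still takes an int
(compilable sketch, not kernel code):

	#include <limits.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int count = (unsigned int)-1;	/* "negative" count from a caller */
		int narrowed = (int)count;		/* what bitmap_set() would see */

		printf("count=%u narrowed=%d\n", count, narrowed); /* 4294967295 -1 */
		return !(count > INT_MAX);		/* the proposed check catches it */
	}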

-- 
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of  o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz(o o)
ooo +--ooO--(_)--Ooo--

pgp0BAwjJmwz7.pgp
Description: PGP signature


[PATCH 05/32] char: add aio_{read,write} to /dev/{null,zero}

2012-12-26 Thread Kent Overstreet
From: Zach Brown 

These are handy for measuring the cost of the aio infrastructure with
operations that do very little and complete immediately.
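
For example, a measurement loop along these lines (userspace sketch,
assumes libaio and omits error handling) exercises little more than the
aio core itself:

	#include <libaio.h>
	#include <fcntl.h>

	int main(void)
	{
		io_context_t ioctx = 0;
		struct iocb cb, *cbs[1] = { &cb };
		struct io_event ev;
		char buf[4096];
		int i, fd = open("/dev/null", O_RDONLY);

		io_setup(128, &ioctx);
		for (i = 0; i < 1000000; i++) {
			io_prep_pread(&cb, fd, buf, sizeof(buf), 0);
			io_submit(ioctx, 1, cbs);	/* completes immediately */
			io_getevents(ioctx, 1, 1, &ev, NULL);
		}
		io_destroy(ioctx);
		return 0;
	}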

Signed-off-by: Zach Brown 
Signed-off-by: Kent Overstreet 
---
 drivers/char/mem.c | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 0537903..968ae6e 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -627,6 +627,18 @@ static ssize_t write_null(struct file *file, const char 
__user *buf,
return count;
 }
 
+static ssize_t aio_read_null(struct kiocb *iocb, const struct iovec *iov,
+unsigned long nr_segs, loff_t pos)
+{
+   return 0;
+}
+
+static ssize_t aio_write_null(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos)
+{
+   return iov_length(iov, nr_segs);
+}
+
 static int pipe_to_null(struct pipe_inode_info *info, struct pipe_buffer *buf,
struct splice_desc *sd)
 {
@@ -670,6 +682,24 @@ static ssize_t read_zero(struct file *file, char __user 
*buf,
return written ? written : -EFAULT;
 }
 
+static ssize_t aio_read_zero(struct kiocb *iocb, const struct iovec *iov,
+unsigned long nr_segs, loff_t pos)
+{
+   size_t written = 0;
+   unsigned long i;
+   ssize_t ret;
+
+   for (i = 0; i < nr_segs; i++) {
+		ret = read_zero(iocb->ki_filp, iov[i].iov_base, iov[i].iov_len,
+				&pos);
+   if (ret < 0)
+   break;
+   written += ret;
+   }
+
+   return written ? written : -EFAULT;
+}
+
 static int mmap_zero(struct file *file, struct vm_area_struct *vma)
 {
 #ifndef CONFIG_MMU
@@ -738,6 +768,7 @@ static int open_port(struct inode * inode, struct file * 
filp)
 #define full_lseek  null_lseek
 #define write_zero write_null
 #define read_full   read_zero
+#define aio_write_zero aio_write_null
 #define open_mem   open_port
 #define open_kmem  open_mem
 #define open_oldmemopen_mem
@@ -766,6 +797,8 @@ static const struct file_operations null_fops = {
.llseek = null_lseek,
.read   = read_null,
.write  = write_null,
+   .aio_read   = aio_read_null,
+   .aio_write  = aio_write_null,
.splice_write   = splice_write_null,
 };
 
@@ -782,6 +815,8 @@ static const struct file_operations zero_fops = {
.llseek = zero_lseek,
.read   = read_zero,
.write  = write_zero,
+   .aio_read   = aio_read_zero,
+   .aio_write  = aio_write_zero,
.mmap   = mmap_zero,
 };
 
-- 
1.7.12



[PATCH 17/32] aio: Change reqs_active to include unreaped completions

2012-12-26 Thread Kent Overstreet
The aio code tries really hard to avoid having to deal with the
completion ringbuffer overflowing. To do that, it has to keep track of
the number of outstanding kiocbs, and the number of completions
currently in the ringbuffer - and it's got to check that every time we
allocate a kiocb. Ouch.

But - we can improve this quite a bit if we just change reqs_active to
mean "number of outstanding requests and unreaped completions" - that
means kiocb allocation doesn't have to look at the ringbuffer, which is
a fairly significant win.
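
Concretely (illustrative numbers):

	/*
	 * reqs_active = in-flight kiocbs + completed-but-unreaped events
	 *
	 * Ring of nr = 128 entries, 10 kiocbs in flight, 20 events unreaped:
	 *
	 *	reqs_active == 30
	 *	avail = ctx->ring_info.nr - atomic_read(&ctx->reqs_active)
	 *	      = 128 - 30 = 98
	 *
	 * so allocation can hand out 98 more kiocbs without ever mapping
	 * the ring - every completion slot is already accounted for, so
	 * the ring can't overflow.
	 */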

Signed-off-by: Kent Overstreet 
---
 fs/aio.c | 38 +-
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 276c6ea..b1be0cf 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -71,12 +71,6 @@ struct aio_ring_info {
struct page *internal_pages[AIO_RING_PAGES];
 };
 
-static inline unsigned aio_ring_avail(struct aio_ring_info *info,
-   struct aio_ring *ring)
-{
-   return (ring->head + info->nr - 1 - ring->tail) % info->nr;
-}
-
 struct kioctx {
	atomic_t		users;
	atomic_t		dead;
@@ -270,8 +264,11 @@ static void free_ioctx_rcu(struct rcu_head *head)
  */
 static void free_ioctx(struct kioctx *ctx)
 {
+	struct aio_ring_info *info = &ctx->ring_info;
+	struct aio_ring *ring;
	struct io_event res;
	struct kiocb *req;
+	unsigned head, avail;

	spin_lock_irq(&ctx->ctx_lock);

@@ -285,7 +282,21 @@ static void free_ioctx(struct kioctx *ctx)

	spin_unlock_irq(&ctx->ctx_lock);

-	wait_event(ctx->wait, !atomic_read(&ctx->reqs_active));
+   ring = kmap_atomic(info->ring_pages[0]);
+   head = ring->head;
+   kunmap_atomic(ring);
+
+	while (atomic_read(&ctx->reqs_active) > 0) {
+		wait_event(ctx->wait, head != info->tail);
+
+		avail = (head < info->tail ? info->tail : info->nr) - head;
+
+		atomic_sub(avail, &ctx->reqs_active);
+		head += avail;
+		head %= info->nr;
+	}
+
+	WARN_ON(atomic_read(&ctx->reqs_active) < 0);
 
aio_free_ring(ctx);
 
@@ -534,7 +545,6 @@ static int kiocb_batch_refill(struct kioctx *ctx, struct 
kiocb_batch *batch)
unsigned short allocated, to_alloc;
long avail;
struct kiocb *req, *n;
-   struct aio_ring *ring;
 
to_alloc = min(batch->count, KIOCB_BATCH_SIZE);
for (allocated = 0; allocated < to_alloc; allocated++) {
@@ -549,9 +559,8 @@ static int kiocb_batch_refill(struct kioctx *ctx, struct 
kiocb_batch *batch)
goto out;
 
	spin_lock_irq(&ctx->ctx_lock);
-	ring = kmap_atomic(ctx->ring_info.ring_pages[0]);

-	avail = aio_ring_avail(&ctx->ring_info, ring) - atomic_read(&ctx->reqs_active);
+	avail = ctx->ring_info.nr - atomic_read(&ctx->reqs_active);
BUG_ON(avail < 0);
if (avail < allocated) {
/* Trim back the number of requests. */
@@ -566,7 +575,6 @@ static int kiocb_batch_refill(struct kioctx *ctx, struct 
kiocb_batch *batch)
batch->count -= allocated;
	atomic_add(allocated, &ctx->reqs_active);

-	kunmap_atomic(ring);
	spin_unlock_irq(&ctx->ctx_lock);
 
 out:
@@ -673,8 +681,11 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 * when the event got cancelled.
 */
	if (unlikely(xchg(&iocb->ki_cancel,
-			  KIOCB_CANCELLED) == KIOCB_CANCELLED))
+			  KIOCB_CANCELLED) == KIOCB_CANCELLED)) {
+		atomic_dec(&ctx->reqs_active);
+   /* Still need the wake_up in case free_ioctx is waiting */
goto put_rq;
+   }
 
/*
 * Add a completion event to the ring buffer. Must be done holding
@@ -731,7 +742,6 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 put_rq:
/* everything turned out well, dispose of the aiocb. */
aio_put_req(iocb);
-   atomic_dec(>reqs_active);
 
/*
 * We have to order our ring_info tail store above and test
@@ -812,6 +822,8 @@ static int aio_read_events_ring(struct kioctx *ctx,
flush_dcache_page(info->ring_pages[0]);
 
pr_debug("%d  h%u t%u\n", ret, head, info->tail);
+
+	atomic_sub(ret, &ctx->reqs_active);
 out:
	mutex_unlock(&info->ring_lock);
 
-- 
1.7.12



[PATCH 11/32] aio: Make aio_put_req() lockless

2012-12-26 Thread Kent Overstreet
Freeing a kiocb needed to touch the kioctx for three things:

 * Pull it off the reqs_active list
 * Decrementing reqs_active
 * Issuing a wakeup, if the kioctx was in the process of being freed.

This patch moves these to aio_complete(), for a couple reasons:

 * aio_complete() already has to issue the wakeup, so if we drop the
   kioctx refcount before aio_complete does its wakeup we don't have to
   do it twice.
 * aio_complete currently has to take the kioctx lock, so it makes sense
   for it to pull the kiocb off the reqs_active list too.
 * A later patch is going to change reqs_active to include unreaped
   completions - this will mean allocating a kiocb doesn't have to look
   at the ringbuffer. So taking the decrement of reqs_active out of
   kiocb_free() is useful prep work for that patch.

This doesn't really affect cancellation, since existing (usb) code that
implements a cancel function still calls aio_complete() - we just have
to make sure that aio_complete does the necessary teardown for cancelled
kiocbs.

It does affect code paths where we free kiocbs that were never
submitted; they need to decrement reqs_active and pull the kiocb off the
reqs_active list. This occurs in two places: kiocb_batch_free(), which
is going away in a later patch, and the error path in io_submit_one.
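
The payoff, roughly, is that the common-case put shrinks to a sketch like
this (kiocb_free() is a stand-in name for the final-free path, not code
from this patch):

	void aio_put_req(struct kiocb *req)
	{
		if (atomic_dec_and_test(&req->ki_users))
			kiocb_free(req);	/* no ctx_lock on the fast path */
	}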

Signed-off-by: Kent Overstreet 
---
 fs/aio.c| 85 +
 include/linux/aio.h |  4 +--
 2 files changed, 35 insertions(+), 54 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index db6cb02..37eac67 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -89,7 +89,7 @@ struct kioctx {
 
spinlock_t  ctx_lock;
 
-	int			reqs_active;
+	atomic_t		reqs_active;
	struct list_head	active_reqs;	/* used for cancellation */
 
/* sys_io_setup currently limits this to an unsigned int */
@@ -247,7 +247,7 @@ static void ctx_rcu_free(struct rcu_head *head)
 static void __put_ioctx(struct kioctx *ctx)
 {
unsigned nr_events = ctx->max_reqs;
-   BUG_ON(ctx->reqs_active);
+	BUG_ON(atomic_read(&ctx->reqs_active));
 
aio_free_ring(ctx);
if (nr_events) {
@@ -281,7 +281,7 @@ static int kiocb_cancel(struct kioctx *ctx, struct kiocb 
*kiocb,
cancel = kiocb->ki_cancel;
kiocbSetCancelled(kiocb);
if (cancel) {
-   kiocb->ki_users++;
+		atomic_inc(&kiocb->ki_users);
		spin_unlock_irq(&ctx->ctx_lock);
 
memset(res, 0, sizeof(*res));
@@ -380,12 +380,12 @@ static void kill_ctx(struct kioctx *ctx)
		kiocb_cancel(ctx, req, &res);
}
 
-   if (!ctx->reqs_active)
+	if (!atomic_read(&ctx->reqs_active))
		goto out;

	add_wait_queue(&ctx->wait, &wait);
	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-	while (ctx->reqs_active) {
+	while (atomic_read(&ctx->reqs_active)) {
		spin_unlock_irq(&ctx->ctx_lock);
io_schedule();
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
@@ -403,9 +403,9 @@ out:
  */
 ssize_t wait_on_sync_kiocb(struct kiocb *iocb)
 {
-   while (iocb->ki_users) {
+	while (atomic_read(&iocb->ki_users)) {
		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (!iocb->ki_users)
+		if (!atomic_read(&iocb->ki_users))
break;
io_schedule();
}
@@ -435,7 +435,7 @@ void exit_aio(struct mm_struct *mm)
printk(KERN_DEBUG
"exit_aio:ioctx still alive: %d %d %d\n",
			atomic_read(&ctx->users), ctx->dead,
-			ctx->reqs_active);
+			atomic_read(&ctx->reqs_active));
/*
 * We don't need to bother with munmap() here -
 * exit_mmap(mm) is coming and it'll unmap everything.
@@ -450,11 +450,11 @@ void exit_aio(struct mm_struct *mm)
 }
 
 /* aio_get_req
- * Allocate a slot for an aio request.  Increments the users count
+ * Allocate a slot for an aio request.  Increments the ki_users count
  * of the kioctx so that the kioctx stays around until all requests are
  * complete.  Returns NULL if no requests are free.
  *
- * Returns with kiocb->users set to 2.  The io submit code path holds
+ * Returns with kiocb->ki_users set to 2.  The io submit code path holds
  * an extra reference while submitting the i/o.
  * This prevents races between the aio code path referencing the
  * req (after submitting it) and aio_complete() freeing the req.
@@ -468,7 +468,7 @@ static struct kiocb *__aio_get_req(struct kioctx *ctx)
return NULL;
 
req->ki_flags = 0;
-   req->ki_users = 2;
+   atomic_set(>ki_users, 2);
req->ki_key = 0;
req->ki_ctx = ctx;
req->ki_cancel = NULL;
@@ -509,9 +509,9 @@ static void kiocb_batch_free(struct kioctx *ctx, struct 
kiocb_batch *batch)
   

[PATCH 08/32] aio: Move private stuff out of aio.h

2012-12-26 Thread Kent Overstreet
Signed-off-by: Kent Overstreet 
---
 drivers/usb/gadget/inode.c |  1 +
 fs/aio.c   | 61 ++
 include/linux/aio.h| 61 --
 3 files changed, 62 insertions(+), 61 deletions(-)

diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
index 2a3f001..7640e01 100644
--- a/drivers/usb/gadget/inode.c
+++ b/drivers/usb/gadget/inode.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include <linux/aio.h>
 
 #include 
 #include 
diff --git a/fs/aio.c b/fs/aio.c
index e1d4084..8fcea98 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -45,6 +45,67 @@
 #define dprintk(x...)  do { ; } while (0)
 #endif
 
+#define AIO_RING_MAGIC 0xa10a10a1
+#define AIO_RING_COMPAT_FEATURES   1
+#define AIO_RING_INCOMPAT_FEATURES 0
+struct aio_ring {
+	unsigned		id;	/* kernel internal index number */
+	unsigned		nr;	/* number of io_events */
+	unsigned		head;
+	unsigned		tail;
+
+	unsigned		magic;
+	unsigned		compat_features;
+	unsigned		incompat_features;
+	unsigned		header_length;	/* size of aio_ring */
+
+
+	struct io_event		io_events[0];
+}; /* 128 bytes + ring size */
+
+#define AIO_RING_PAGES 8
+struct aio_ring_info {
+   unsigned long   mmap_base;
+   unsigned long   mmap_size;
+
+	struct page		**ring_pages;
+	spinlock_t		ring_lock;
+	long			nr_pages;
+
+	unsigned		nr, tail;
+
+	struct page		*internal_pages[AIO_RING_PAGES];
+};
+
+static inline unsigned aio_ring_avail(struct aio_ring_info *info,
+   struct aio_ring *ring)
+{
+   return (ring->head + info->nr - 1 - ring->tail) % info->nr;
+}
+
+struct kioctx {
+	atomic_t		users;
+	int			dead;
+
+	/* This needs improving */
+	unsigned long		user_id;
+	struct hlist_node	list;
+
+	wait_queue_head_t	wait;
+
+	spinlock_t		ctx_lock;
+
+	int			reqs_active;
+	struct list_head	active_reqs;	/* used for cancellation */
+
+	/* sys_io_setup currently limits this to an unsigned int */
+	unsigned		max_reqs;
+
+	struct aio_ring_info	ring_info;
+
+	struct rcu_head		rcu_head;
+};
+
 /*-- sysctl variables*/
 static DEFINE_SPINLOCK(aio_nr_lock);
 unsigned long aio_nr;  /* current system wide number of aio requests */
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 615d55a..7b1eb23 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -103,67 +103,6 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, 
struct file *filp)
};
 }
 
-#define AIO_RING_MAGIC 0xa10a10a1
-#define AIO_RING_COMPAT_FEATURES   1
-#define AIO_RING_INCOMPAT_FEATURES 0
-struct aio_ring {
-	unsigned		id;	/* kernel internal index number */
-	unsigned		nr;	/* number of io_events */
-	unsigned		head;
-	unsigned		tail;
-
-	unsigned		magic;
-	unsigned		compat_features;
-	unsigned		incompat_features;
-	unsigned		header_length;	/* size of aio_ring */
-
-
-	struct io_event		io_events[0];
-}; /* 128 bytes + ring size */
-
-#define AIO_RING_PAGES 8
-struct aio_ring_info {
-   unsigned long   mmap_base;
-   unsigned long   mmap_size;
-
-   struct page **ring_pages;
-	spinlock_t		ring_lock;
-	long			nr_pages;
-
-	unsigned		nr, tail;
-
-   struct page *internal_pages[AIO_RING_PAGES];
-};
-
-static inline unsigned aio_ring_avail(struct aio_ring_info *info,
-   struct aio_ring *ring)
-{
-   return (ring->head + info->nr - 1 - ring->tail) % info->nr;
-}
-
-struct kioctx {
-	atomic_t		users;
-	int			dead;
-
-	/* This needs improving */
-	unsigned long		user_id;
-	struct hlist_node	list;
-
-	wait_queue_head_t	wait;
-
-	spinlock_t		ctx_lock;
-
-	int			reqs_active;
-	struct list_head	active_reqs;	/* used for cancellation */
-
-	/* sys_io_setup currently limits this to an unsigned int */
-	unsigned		max_reqs;
-
-	struct aio_ring_info	ring_info;
-
-	struct rcu_head		rcu_head;
-};
-
 /* prototypes */
 #ifdef CONFIG_AIO
 extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
-- 
1.7.12


[PATCH 10/32] aio: do fget() after aio_get_req()

2012-12-26 Thread Kent Overstreet
aio_get_req() will fail if we have the maximum number of requests
outstanding, which depending on the application may not be uncommon. So
avoid doing an unnecessary fget().
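
I.e. the submission path becomes (sketch; see the diff below):

	req = aio_get_req(ctx, batch);		/* the likely failure, under load */
	if (unlikely(!req))
		return -EAGAIN;

	req->ki_filp = fget(iocb->aio_fildes);	/* only fails on a bad fd */
	if (unlikely(!req->ki_filp)) {
		ret = -EBADF;
		goto out_put_req;		/* drops both kiocb refs */
	}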

Signed-off-by: Kent Overstreet 
---
 fs/aio.c | 22 +-
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 868ac0a..db6cb02 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -584,6 +584,8 @@ static inline void really_put_req(struct kioctx *ctx, 
struct kiocb *req)
 {
assert_spin_locked(>ctx_lock);
 
+	if (req->ki_filp)
+		fput(req->ki_filp);
if (req->ki_eventfd != NULL)
eventfd_ctx_put(req->ki_eventfd);
if (req->ki_dtor)
@@ -602,9 +604,6 @@ static inline void really_put_req(struct kioctx *ctx, 
struct kiocb *req)
  */
 static void __aio_put_req(struct kioctx *ctx, struct kiocb *req)
 {
-   pr_debug("(%p): f_count=%ld\n",
-		 req, atomic_long_read(&req->ki_filp->f_count));
-
assert_spin_locked(>ctx_lock);
 
req->ki_users--;
@@ -615,8 +614,6 @@ static void __aio_put_req(struct kioctx *ctx, struct kiocb 
*req)
req->ki_cancel = NULL;
req->ki_retry = NULL;
 
-   fput(req->ki_filp);
-   req->ki_filp = NULL;
really_put_req(ctx, req);
 }
 
@@ -1264,7 +1261,6 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
 bool compat)
 {
struct kiocb *req;
-   struct file *file;
ssize_t ret;
 
/* enforce forwards compatibility on users */
@@ -1283,16 +1279,16 @@ static int io_submit_one(struct kioctx *ctx, struct 
iocb __user *user_iocb,
return -EINVAL;
}
 
-   file = fget(iocb->aio_fildes);
-   if (unlikely(!file))
-   return -EBADF;
-
req = aio_get_req(ctx, batch);  /* returns with 2 references to req */
-   if (unlikely(!req)) {
-   fput(file);
+   if (unlikely(!req))
return -EAGAIN;
+
+   req->ki_filp = fget(iocb->aio_fildes);
+   if (unlikely(!req->ki_filp)) {
+   ret = -EBADF;
+   goto out_put_req;
}
-   req->ki_filp = file;
+
if (iocb->aio_flags & IOCB_FLAG_RESFD) {
/*
 * If the IOCB_FLAG_RESFD flag of aio_flags is set, get an
-- 
1.7.12



[PATCH 12/32] aio: Refcounting cleanup

2012-12-26 Thread Kent Overstreet
The usage of ctx->dead was fubar - it makes no sense to explicitly
check it all over the place, especially when we're already using RCU.

Now, ctx->dead only indicates whether we've dropped the initial
refcount. The new teardown sequence is:
set ctx->dead
hlist_del_rcu();
synchronize_rcu();

Now we know no system calls can take a new ref, and it's safe to drop
the initial ref:
put_ioctx();

We also need to ensure there are no more outstanding kiocbs. This was
done incorrectly - it was being done in kill_ctx(), and before dropping
the initial refcount. At this point, other syscalls may still be
submitting kiocbs!

Now, we cancel and wait for outstanding kiocbs in free_ioctx(), after
kioctx->users has dropped to 0 and we know no more iocbs could be
submitted.

v2: Kill a bogus BUG_ON(ctx->dead) in lookup_ioctx, use
list_first_entry() instead of list_kiocb(), and convert
synchronize_rcu() calls to call_rcu() (and document them)
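
For reference, a simplified sketch of the lookup side this sequence pairs
with (3.7-era hlist_for_each_entry_rcu() still takes the extra node
argument; details differ from the real lookup_ioctx()):

	struct hlist_node *n;
	struct kioctx *ctx;

	rcu_read_lock();
	hlist_for_each_entry_rcu(ctx, n, &mm->ioctx_list, list) {
		if (ctx->user_id == ctx_id &&
		    atomic_inc_not_zero(&ctx->users)) {
			/* got a ref; free_ioctx() can't run until we put it */
			break;
		}
	}
	rcu_read_unlock();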

Signed-off-by: Kent Overstreet 
---
 fs/aio.c | 275 ---
 1 file changed, 120 insertions(+), 155 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 37eac67..e0eb23d 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -79,7 +79,7 @@ static inline unsigned aio_ring_avail(struct aio_ring_info 
*info,
 
 struct kioctx {
	atomic_t		users;
-	int			dead;
+	atomic_t		dead;
 
/* This needs improving */
unsigned long   user_id;
@@ -98,6 +98,7 @@ struct kioctx {
struct aio_ring_inforing_info;
 
struct rcu_head rcu_head;
+   struct work_struct  rcu_work;
 };
 
 /*-- sysctl variables*/
@@ -234,44 +235,6 @@ static int aio_setup_ring(struct kioctx *ctx)
kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK)); \
 } while(0)
 
-static void ctx_rcu_free(struct rcu_head *head)
-{
-   struct kioctx *ctx = container_of(head, struct kioctx, rcu_head);
-   kmem_cache_free(kioctx_cachep, ctx);
-}
-
-/* __put_ioctx
- * Called when the last user of an aio context has gone away,
- * and the struct needs to be freed.
- */
-static void __put_ioctx(struct kioctx *ctx)
-{
-   unsigned nr_events = ctx->max_reqs;
-	BUG_ON(atomic_read(&ctx->reqs_active));
-
-   aio_free_ring(ctx);
-   if (nr_events) {
-		spin_lock(&aio_nr_lock);
-   BUG_ON(aio_nr - nr_events > aio_nr);
-   aio_nr -= nr_events;
-   spin_unlock(_nr_lock);
-   }
-   pr_debug("freeing %p\n", ctx);
-	call_rcu(&ctx->rcu_head, ctx_rcu_free);
-}
-
-static inline int try_get_ioctx(struct kioctx *kioctx)
-{
-	return atomic_inc_not_zero(&kioctx->users);
-}
-
-static inline void put_ioctx(struct kioctx *kioctx)
-{
-	BUG_ON(atomic_read(&kioctx->users) <= 0);
-	if (unlikely(atomic_dec_and_test(&kioctx->users)))
-   __put_ioctx(kioctx);
-}
-
 static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb,
struct io_event *res)
 {
@@ -295,6 +258,61 @@ static int kiocb_cancel(struct kioctx *ctx, struct kiocb 
*kiocb,
return ret;
 }
 
+static void free_ioctx_rcu(struct rcu_head *head)
+{
+   struct kioctx *ctx = container_of(head, struct kioctx, rcu_head);
+   kmem_cache_free(kioctx_cachep, ctx);
+}
+
+/*
+ * When this function runs, the kioctx has been removed from the "hash table"
+ * and ctx->users has dropped to 0, so we know no more kiocbs can be submitted 
-
+ * now it's safe to cancel any that need to be.
+ */
+static void free_ioctx(struct kioctx *ctx)
+{
+   struct io_event res;
+   struct kiocb *req;
+
+	spin_lock_irq(&ctx->ctx_lock);
+
+	while (!list_empty(&ctx->active_reqs)) {
+		req = list_first_entry(&ctx->active_reqs,
+				       struct kiocb, ki_list);
+
+		list_del_init(&req->ki_list);
+		kiocb_cancel(ctx, req, &res);
+	}
+
+	spin_unlock_irq(&ctx->ctx_lock);
+
+	wait_event(ctx->wait, !atomic_read(&ctx->reqs_active));
+
+   aio_free_ring(ctx);
+
+	spin_lock(&aio_nr_lock);
+	BUG_ON(aio_nr - ctx->max_reqs > aio_nr);
+	aio_nr -= ctx->max_reqs;
+	spin_unlock(&aio_nr_lock);
+
+   pr_debug("freeing %p\n", ctx);
+
+   /*
+* Here the call_rcu() is between the wait_event() for reqs_active to
+* hit 0, and freeing the ioctx.
+*
+* aio_complete() decrements reqs_active, but it has to touch the ioctx
+* after to issue a wakeup so we use rcu.
+*/
+	call_rcu(&ctx->rcu_head, free_ioctx_rcu);
+}
+
+static void put_ioctx(struct kioctx *ctx)
+{
+	if (unlikely(atomic_dec_and_test(&ctx->users)))
+   free_ioctx(ctx);
+}
+
 /* ioctx_alloc
  * Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
  */
@@ -321,6 +339,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
ctx->max_reqs = nr_events;
 
	atomic_set(&ctx->users, 2);
+	atomic_set(&ctx->dead, 0);

[PATCH 14/32] aio: Make aio_read_evt() more efficient, convert to hrtimers

2012-12-26 Thread Kent Overstreet
Previously, aio_read_evt() pulled a single completion off the
ringbuffer at a time, locking and unlocking each time.  Changed it to
pull off as many events as it can at a time, and copy them directly to
userspace.

This also fixes a bug where if copying the event to userspace failed,
we'd lose the event.

Also convert it to wait_event_interruptible_hrtimeout(), which
simplifies it quite a bit.

v3: Convert to wait_event_interruptible_hrtimeout()
v2: Move finish_wait() call so we're not calling copy_to_user in
TASK_INTERRUPTIBLE state
v2: Restructure the code so we're not calling prepare_to_wait() until
after we've done everything that might block, also got rid of the
separate fast path
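
The batching works out as follows (simplified; the AIO_EVENTS_OFFSET
handling is elided):

	/*
	 * Events are contiguous except at the ring-end and page boundaries,
	 * so each pass copies
	 *
	 *	i = min(run up to tail or ring end, nr still wanted,
	 *		events left on the current page)
	 *
	 * with a single copy_to_user().  E.g. head = 120, tail = 10,
	 * nr = 128: pass one copies events 120..127 and wraps head to 0,
	 * pass two copies events 0..9.
	 */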

Signed-off-by: Kent Overstreet 
---
 fs/aio.c | 228 ++-
 1 file changed, 78 insertions(+), 150 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index e0eb23d..06e1dd0 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -63,7 +63,7 @@ struct aio_ring_info {
unsigned long   mmap_size;
 
struct page **ring_pages;
-	spinlock_t		ring_lock;
+	struct mutex		ring_lock;
	long			nr_pages;

	unsigned		nr, tail;
@@ -341,7 +341,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
atomic_set(>users, 2);
atomic_set(>dead, 0);
spin_lock_init(>ctx_lock);
-	spin_lock_init(&ctx->ring_info.ring_lock);
+	mutex_init(&ctx->ring_info.ring_lock);
init_waitqueue_head(>wait);
 
INIT_LIST_HEAD(>active_reqs);
@@ -744,187 +744,115 @@ put_rq:
 }
 EXPORT_SYMBOL(aio_complete);
 
-/* aio_read_evt
- * Pull an event off of the ioctx's event ring.  Returns the number of 
- * events fetched (0 or 1 ;-)
- * FIXME: make this use cmpxchg.
- * TODO: make the ringbuffer user mmap()able (requires FIXME).
+/* aio_read_events
+ * Pull an event off of the ioctx's event ring.  Returns the number of
+ * events fetched
  */
-static int aio_read_evt(struct kioctx *ioctx, struct io_event *ent)
+static int aio_read_events_ring(struct kioctx *ctx,
+   struct io_event __user *event, long nr)
 {
-	struct aio_ring_info *info = &ioctx->ring_info;
+	struct aio_ring_info *info = &ctx->ring_info;
struct aio_ring *ring;
-   unsigned long head;
-   int ret = 0;
+   unsigned head, pos;
+   int ret = 0, copy_ret;
+
+   if (!mutex_trylock(&info->ring_lock)) {
+   __set_current_state(TASK_RUNNING);
+   mutex_lock(&info->ring_lock);
+   }
 
ring = kmap_atomic(info->ring_pages[0]);
-   pr_debug("h%u t%u m%u\n", ring->head, ring->tail, ring->nr);
+   head = ring->head;
+   kunmap_atomic(ring);
+
+   pr_debug("h%u t%u m%u\n", head, info->tail, info->nr);
 
-   if (ring->head == ring->tail)
+   if (head == info->tail)
goto out;
 
-   spin_lock(&info->ring_lock);
-
-   head = ring->head % info->nr;
-   if (head != ring->tail) {
-   struct io_event *evp = aio_ring_event(info, head);
-   *ent = *evp;
-   head = (head + 1) % info->nr;
-   smp_mb(); /* finish reading the event before updatng the head */
-   ring->head = head;
-   ret = 1;
-   put_aio_ring_event(evp);
+   __set_current_state(TASK_RUNNING);
+
+   while (ret < nr) {
+   unsigned i = (head < info->tail ? info->tail : info->nr) - head;
+   struct io_event *ev;
+   struct page *page;
+
+   if (head == info->tail)
+   break;
+
+   i = min_t(int, i, nr - ret);
+   i = min_t(int, i, AIO_EVENTS_PER_PAGE -
+ ((head + AIO_EVENTS_OFFSET) % AIO_EVENTS_PER_PAGE));
+
+   pos = head + AIO_EVENTS_OFFSET;
+   page = info->ring_pages[pos / AIO_EVENTS_PER_PAGE];
+   pos %= AIO_EVENTS_PER_PAGE;
+
+   ev = kmap(page);
+   copy_ret = copy_to_user(event + ret, ev + pos, sizeof(*ev) * i);
+   kunmap(page);
+
+   if (unlikely(copy_ret)) {
+   ret = -EFAULT;
+   goto out;
+   }
+
+   ret += i;
+   head += i;
+   head %= info->nr;
}
-   spin_unlock(&info->ring_lock);
 
-out:
+   ring = kmap_atomic(info->ring_pages[0]);
+   ring->head = head;
kunmap_atomic(ring);
-   pr_debug("%d  h%u t%u\n", ret, ring->head, ring->tail);
+
+   pr_debug("%d  h%u t%u\n", ret, head, info->tail);
+out:
+   mutex_unlock(&info->ring_lock);
+
return ret;
 }
 
-struct aio_timeout {
-   struct timer_list   timer;
-   int timed_out;
-   struct task_struct  *p;
-};
-
-static void timeout_func(unsigned long data)
+static bool aio_read_events(struct kioctx *ctx, long min_nr, long nr,
+   

[PATCH 07/32] aio: kiocb_cancel()

2012-12-26 Thread Kent Overstreet
Minor refactoring, to get rid of some duplicated code

v2: Fix return value for NULL kiocb, so it matches old code; change
synchronization to use xchg() instead of a bit in ki_flags, so we can
get rid of ki_flags.

Signed-off-by: Kent Overstreet 
---
 fs/aio.c | 79 +++-
 1 file changed, 43 insertions(+), 36 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 0b85822..e1d4084 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -217,6 +217,29 @@ static inline void put_ioctx(struct kioctx *kioctx)
__put_ioctx(kioctx);
 }
 
+static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb,
+   struct io_event *res)
+{
+   int (*cancel)(struct kiocb *, struct io_event *);
+   int ret = -EINVAL;
+
+   cancel = kiocb->ki_cancel;
+   kiocbSetCancelled(kiocb);
+   if (cancel) {
+   kiocb->ki_users++;
+   spin_unlock_irq(&ctx->ctx_lock);
+
+   memset(res, 0, sizeof(*res));
+   res->obj = (u64) kiocb->ki_obj.user;
+   res->data = kiocb->ki_user_data;
+   ret = cancel(kiocb, res);
+
+   spin_lock_irq(&ctx->ctx_lock);
+   }
+
+   return ret;
+}
+
 /* ioctx_alloc
  * Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
  */
@@ -287,25 +310,19 @@ out_freectx:
  */
 static void kill_ctx(struct kioctx *ctx)
 {
-   int (*cancel)(struct kiocb *, struct io_event *);
struct task_struct *tsk = current;
DECLARE_WAITQUEUE(wait, tsk);
struct io_event res;
+   struct kiocb *req;
 
spin_lock_irq(&ctx->ctx_lock);
ctx->dead = 1;
while (!list_empty(&ctx->active_reqs)) {
-   struct list_head *pos = ctx->active_reqs.next;
-   struct kiocb *iocb = list_kiocb(pos);
-   list_del_init(&iocb->ki_list);
-   cancel = iocb->ki_cancel;
-   kiocbSetCancelled(iocb);
-   if (cancel) {
-   iocb->ki_users++;
-   spin_unlock_irq(&ctx->ctx_lock);
-   cancel(iocb, &res);
-   spin_lock_irq(&ctx->ctx_lock);
-   }
+   req = list_first_entry(&ctx->active_reqs,
+   struct kiocb, ki_list);
+
+   list_del_init(&req->ki_list);
+   kiocb_cancel(ctx, req, &res);
}
 
if (!ctx->reqs_active)
@@ -1409,7 +1426,7 @@ static struct kiocb *lookup_kiocb(struct kioctx *ctx, 
struct iocb __user *iocb,
 SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,
struct io_event __user *, result)
 {
-   int (*cancel)(struct kiocb *iocb, struct io_event *res);
+   struct io_event res;
struct kioctx *ctx;
struct kiocb *kiocb;
u32 key;
@@ -1424,32 +1441,22 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, 
struct iocb __user *, iocb,
return -EINVAL;
 
spin_lock_irq(&ctx->ctx_lock);
-   ret = -EAGAIN;
+
kiocb = lookup_kiocb(ctx, iocb, key);
-   if (kiocb && kiocb->ki_cancel) {
-   cancel = kiocb->ki_cancel;
-   kiocb->ki_users ++;
-   kiocbSetCancelled(kiocb);
-   } else
-   cancel = NULL;
+   if (kiocb)
+   ret = kiocb_cancel(ctx, kiocb, &res);
+   else
+   ret = -EINVAL;
+
spin_unlock_irq(&ctx->ctx_lock);
 
-   if (NULL != cancel) {
-   struct io_event tmp;
-   pr_debug("calling cancel\n");
-   memset(&tmp, 0, sizeof(tmp));
-   tmp.obj = (u64)(unsigned long)kiocb->ki_obj.user;
-   tmp.data = kiocb->ki_user_data;
-   ret = cancel(kiocb, &tmp);
-   if (!ret) {
-   /* Cancellation succeeded -- copy the result
-* into the user's buffer.
-*/
-   if (copy_to_user(result, &tmp, sizeof(tmp)))
-   ret = -EFAULT;
-   }
-   } else
-   ret = -EINVAL;
+   if (!ret) {
+   /* Cancellation succeeded -- copy the result
+* into the user's buffer.
+*/
+   if (copy_to_user(result, &res, sizeof(res)))
+   ret = -EFAULT;
+   }
 
put_ioctx(ctx);
 
-- 
1.7.12

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 25/32] aio: use xchg() instead of completion_lock

2012-12-26 Thread Kent Overstreet
So, for sticking kiocb completions on the kioctx ringbuffer, we need a
lock - it unfortunately can't be lockless.

When the kioctx is shared between threads on different cpus and the rate
of completions is high, this lock sees quite a bit of contention - in
terms of cacheline contention it's the hottest thing in the aio
subsystem.

That means, with a regular spinlock, we're going to take a cache miss
to grab the lock, then another cache miss when we touch the data the
lock protects - if it's on the same cacheline as the lock, other cpus
spinning on the lock are going to be pulling it out from under us as
we're using it.

So, we use an old trick to get rid of this second forced cache miss -
make the data the lock protects be the lock itself, so we grab them both
at once.
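
A minimal sketch of the idiom (hypothetical helper names - in the patch
itself the ring's tail index is the lock word, UINT_MAX is the "locked"
sentinel, and callers run with irqs disabled around the critical
section):

	#define TAIL_LOCKED	UINT_MAX	/* never a valid tail */

	/* Lock and fetch the tail in one atomic op - one cache miss
	 * instead of two: */
	static unsigned lock_tail(struct kioctx *ctx)
	{
		unsigned tail;

		while ((tail = xchg(&ctx->tail, TAIL_LOCKED)) == TAIL_LOCKED)
			cpu_relax();
		return tail;
	}

	/* Publishing the new tail is also the unlock: */
	static void unlock_tail(struct kioctx *ctx, unsigned tail)
	{
		smp_wmb();	/* order ring writes before the new tail */
		ctx->tail = tail;
	}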

Signed-off-by: Kent Overstreet 
---
 fs/aio.c | 44 
 1 file changed, 20 insertions(+), 24 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index b26ad5c..fcd1f38 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -102,11 +102,11 @@ struct kioctx {
struct {
struct mutex        ring_lock;
wait_queue_head_t wait;
+   unsigned        shadow_tail;
} cacheline_aligned_in_smp;
 
struct {
unsigned        tail;
-   spinlock_t  completion_lock;
} cacheline_aligned_in_smp;
 
struct page *internal_pages[AIO_RING_PAGES];
@@ -308,9 +308,9 @@ static void free_ioctx(struct kioctx *ctx)
kunmap_atomic(ring);
 
while (atomic_read(&ctx->reqs_available) < ctx->nr) {
-   wait_event(ctx->wait, head != ctx->tail);
+   wait_event(ctx->wait, head != ctx->shadow_tail);

-   avail = (head < ctx->tail ? ctx->tail : ctx->nr) - head;
+   avail = (head < ctx->shadow_tail ? ctx->shadow_tail : ctx->nr) - head;

atomic_add(avail, &ctx->reqs_available);
head += avail;
@@ -375,7 +375,6 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
rcu_read_unlock();
 
spin_lock_init(&ctx->ctx_lock);
-   spin_lock_init(&ctx->completion_lock);
mutex_init(&ctx->ring_lock);
init_waitqueue_head(&ctx->wait);
 
@@ -673,18 +672,19 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 * free_ioctx()
 */
atomic_inc(&ctx->reqs_available);
+   smp_mb__after_atomic_inc();
/* Still need the wake_up in case free_ioctx is waiting */
goto put_rq;
}
 
/*
-* Add a completion event to the ring buffer. Must be done holding
-* ctx->ctx_lock to prevent other code from messing with the tail
-* pointer since we might be called from irq context.
+* Add a completion event to the ring buffer; ctx->tail is both our lock
+* and the canonical version of the tail pointer.
 */
-   spin_lock_irqsave(&ctx->completion_lock, flags);
+   local_irq_save(flags);
+   while ((tail = xchg(&ctx->tail, UINT_MAX)) == UINT_MAX)
+   cpu_relax();
 
-   tail = ctx->tail;
pos = tail + AIO_EVENTS_OFFSET;
 
if (++tail >= ctx->nr)
@@ -710,14 +710,18 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 */
smp_wmb();  /* make event visible before updating tail */
 
-   ctx->tail = tail;
+   ctx->shadow_tail = tail;
 
ring = kmap_atomic(ctx->ring_pages[0]);
ring->tail = tail;
kunmap_atomic(ring);
flush_dcache_page(ctx->ring_pages[0]);
 
-   spin_unlock_irqrestore(&ctx->completion_lock, flags);
+   /* unlock, make new tail visible before checking waitlist */
+   smp_mb();
+
+   ctx->tail = tail;
+   local_irq_restore(flags);
 
pr_debug("added to ring %p at [%u]\n", iocb, tail);
 
@@ -733,14 +737,6 @@ put_rq:
/* everything turned out well, dispose of the aiocb. */
aio_put_req(iocb);
 
-   /*
-* We have to order our ring_info tail store above and test
-* of the wait list below outside the wait lock.  This is
-* like in wake_up_bit() where clearing a bit has to be
-* ordered with the unlocked test.
-*/
-   smp_mb();
-
if (waitqueue_active(&ctx->wait))
wake_up(&ctx->wait);
 
@@ -768,19 +764,19 @@ static int aio_read_events_ring(struct kioctx *ctx,
head = ring->head;
kunmap_atomic(ring);
 
-   pr_debug("h%u t%u m%u\n", head, ctx->tail, ctx->nr);
+   pr_debug("h%u t%u m%u\n", head, ctx->shadow_tail, ctx->nr);
 
-   if (head == ctx->tail)
+   if (head == ctx->shadow_tail)
goto out;
 
__set_current_state(TASK_RUNNING);
 
while (ret < nr) {
-   unsigned i = (head < ctx->tail ? ctx->tail : ctx->nr) - head;
+   unsigned i = (head < ctx->shadow_tail ? ctx->shadow_tail : ctx->nr) - head;
struct io_event *ev;

[PATCH 24/32] aio: Percpu ioctx refcount

2012-12-26 Thread Kent Overstreet
This just converts the ioctx refcount to the new generic dynamic percpu
refcount code.

Signed-off-by: Kent Overstreet 
---
 fs/aio.c | 27 ---
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index e415b33..b26ad5c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -65,8 +66,7 @@ struct kioctx_cpu {
 };
 
 struct kioctx {
-   atomic_t        users;
-   atomic_t        dead;
+   struct percpu_ref   users;
 
/* This needs improving */
unsigned long   user_id;
@@ -340,7 +340,7 @@ static void free_ioctx(struct kioctx *ctx)
 
 static void put_ioctx(struct kioctx *ctx)
 {
-   if (unlikely(atomic_dec_and_test(&ctx->users)))
+   if (percpu_ref_put(&ctx->users))
free_ioctx(ctx);
 }
 
@@ -369,8 +369,11 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 
ctx->max_reqs = nr_events;
 
-   atomic_set(&ctx->users, 2);
-   atomic_set(&ctx->dead, 0);
+   percpu_ref_init(&ctx->users);
+   rcu_read_lock();
+   percpu_ref_get(&ctx->users);
+   rcu_read_unlock();
+
spin_lock_init(&ctx->ctx_lock);
spin_lock_init(&ctx->completion_lock);
mutex_init(&ctx->ring_lock);
@@ -442,7 +445,7 @@ static void kill_ioctx_rcu(struct rcu_head *head)
  */
 static void kill_ioctx(struct kioctx *ctx)
 {
-   if (!atomic_xchg(&ctx->dead, 1)) {
+   if (percpu_ref_kill(&ctx->users)) {
hlist_del_rcu(&ctx->list);
/* Between hlist_del_rcu() and dropping the initial ref */
synchronize_rcu();
@@ -488,12 +491,6 @@ void exit_aio(struct mm_struct *mm)
struct hlist_node *p, *n;
 
hlist_for_each_entry_safe(ctx, p, n, &mm->ioctx_list, list) {
-   if (1 != atomic_read(&ctx->users))
-   printk(KERN_DEBUG
-   "exit_aio:ioctx still alive: %d %d %d\n",
-   atomic_read(&ctx->users),
-   atomic_read(&ctx->dead),
-   atomic_read(&ctx->reqs_available));
/*
 * We don't need to bother with munmap() here -
 * exit_mmap(mm) is coming and it'll unmap everything.
@@ -504,7 +501,7 @@ void exit_aio(struct mm_struct *mm)
 */
ctx->mmap_size = 0;
 
-   if (!atomic_xchg(&ctx->dead, 1)) {
+   if (percpu_ref_kill(&ctx->users)) {
hlist_del_rcu(&ctx->list);
call_rcu(&ctx->rcu_head, kill_ioctx_rcu);
}
@@ -616,7 +613,7 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
 
hlist_for_each_entry_rcu(ctx, n, &mm->ioctx_list, list)
if (ctx->user_id == ctx_id){
-   atomic_inc(&ctx->users);
+   percpu_ref_get(&ctx->users);
ret = ctx;
break;
}
@@ -830,7 +827,7 @@ static bool aio_read_events(struct kioctx *ctx, long 
min_nr, long nr,
if (ret > 0)
*i += ret;
 
-   if (unlikely(atomic_read(&ctx->dead)))
+   if (unlikely(percpu_ref_dead(&ctx->users)))
ret = -EINVAL;
 
if (!*i)
-- 
1.7.12

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 27/32] aio: Kill ki_key

2012-12-26 Thread Kent Overstreet
ki_key wasn't actually used for anything previously - it was always 0.
Drop it to trim struct kiocb a bit.

Signed-off-by: Kent Overstreet 
---
 fs/aio.c| 7 +--
 include/linux/aio.h | 9 -
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index fcd1f38..f6bf227 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1193,7 +1193,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
}
}
 
-   ret = put_user(req->ki_key, &user_iocb->aio_key);
+   ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
if (unlikely(ret)) {
pr_debug("EFAULT: aio_key\n");
goto out_put_req;
@@ -1314,10 +1314,13 @@ static struct kiocb *lookup_kiocb(struct kioctx *ctx, 
struct iocb __user *iocb,
 
assert_spin_locked(&ctx->ctx_lock);
 
+   if (key != KIOCB_KEY)
+   return NULL;
+
/* TODO: use a hash or array, this sucks. */
list_for_each(pos, &ctx->active_reqs) {
struct kiocb *kiocb = list_kiocb(pos);
-   if (kiocb->ki_obj.user == iocb && kiocb->ki_key == key)
+   if (kiocb->ki_obj.user == iocb)
return kiocb;
}
return NULL;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 58adc56..76a6e59 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -12,7 +12,7 @@
 struct kioctx;
 struct kiocb;
 
-#define KIOCB_SYNC_KEY (~0U)
+#define KIOCB_KEY  0
 
 #define KIOCB_CANCELLED((void *) (~0ULL))
 
@@ -45,10 +45,9 @@ typedef int (kiocb_cancel_fn)(struct kiocb *, struct 
io_event *);
  */
 struct kiocb {
atomic_tki_users;
-   unsignedki_key; /* id of this request */
 
struct file *ki_filp;
-   struct kioctx   *ki_ctx;/* may be NULL for sync ops */
+   struct kioctx   *ki_ctx;/* NULL for sync ops */
kiocb_cancel_fn *ki_cancel;
ssize_t (*ki_retry)(struct kiocb *);
void(*ki_dtor)(struct kiocb *);
@@ -84,14 +83,14 @@ struct kiocb {
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
 {
-   return kiocb->ki_key == KIOCB_SYNC_KEY;
+   return kiocb->ki_ctx == NULL;
 }
 
 static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
 {
*kiocb = (struct kiocb) {
.ki_users = ATOMIC_INIT(1),
-   .ki_key = KIOCB_SYNC_KEY,
+   .ki_ctx = NULL,
.ki_filp = filp,
.ki_obj.tsk = current,
};
-- 
1.7.12

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 29/32] block, aio: Batch completion for bios/kiocbs

2012-12-26 Thread Kent Overstreet
When completing a kiocb, there's some fixed overhead from touching the
kioctx's ring buffer the kiocb belongs to. Some newer high end block
devices can complete multiple IOs per interrupt, much like many network
interfaces have been for some time.

This plumbs through infrastructure so we can take advantage of multiple
completions at the interrupt level, and complete multiple kiocbs at the
same time.

Drivers have to be converted to take advantage of this, but it's a
simple change and the next patches will convert a few drivers.

To use it, an interrupt handler (or any code that completes bios or
requests) declares and initializes a struct batch_complete:

struct batch_complete batch;
batch_complete_init(&batch);

Then, instead of calling bio_endio(), it calls
bio_endio_batch(bio, err, &batch). This just adds the bio to a list in
the batch_complete.

At the end, it calls

batch_complete(&batch);

This completes all the bios all at once, building up a list of kiocbs;
then the list of kiocbs are completed all at once.

Also, in order to batch up the kiocbs we have to add a different
bio_endio function to struct bio, that takes a pointer to the
batch_complete - this patch converts the dio code's bio_endio function.
In order to avoid changing every bio_endio function in the kernel (there
are many), we currently use a union and a flag to indicate what kind of
bio endio function to call. This is admittedly a hack, but should
suffice for now.

For batching to work through say md or dm devices, the md/dm bio_endio
functions would have to be converted, much like the dio code. That is
left for future patches.
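
Putting the pieces above together, a converted driver's completion path
ends up shaped roughly like this (a sketch only - "my_dev" and
my_dev_next_completed_bio() are made-up stand-ins for however the
driver walks the completions one interrupt covers):

	static irqreturn_t my_dev_irq(int irq, void *data)
	{
		struct my_dev *dev = data;
		struct bio *bio;
		struct batch_complete batch;

		batch_complete_init(&batch);

		/* Queue up every bio this interrupt completed... */
		while ((bio = my_dev_next_completed_bio(dev)) != NULL)
			bio_endio_batch(bio, 0, &batch);

		/* ...then complete them, and their kiocbs, in one pass. */
		batch_complete(&batch);

		return IRQ_HANDLED;
	}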

Signed-off-by: Kent Overstreet 
---
 block/blk-core.c  |  34 ---
 block/blk-flush.c |   2 +-
 block/blk.h   |   3 +-
 drivers/block/swim3.c |   2 +-
 drivers/md/dm.c   |   2 +-
 fs/aio.c  | 254 +++---
 fs/bio.c  |  52 ++
 fs/direct-io.c|  20 ++--
 include/linux/aio.h   |  22 +++-
 include/linux/bio.h   |  36 ++-
 include/linux/blk_types.h |  11 +-
 include/linux/blkdev.h|  12 ++-
 12 files changed, 311 insertions(+), 139 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3c95c4d..4fac6ddb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -151,7 +151,8 @@ void blk_rq_init(struct request_queue *q, struct request 
*rq)
 EXPORT_SYMBOL(blk_rq_init);
 
 static void req_bio_endio(struct request *rq, struct bio *bio,
- unsigned int nbytes, int error)
+ unsigned int nbytes, int error,
+ struct batch_complete *batch)
 {
if (error)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
@@ -175,7 +176,7 @@ static void req_bio_endio(struct request *rq, struct bio 
*bio,
 
/* don't actually finish bio if it's part of flush sequence */
if (bio->bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
-   bio_endio(bio, error);
+   bio_endio_batch(bio, error, batch);
 }
 
 void blk_dump_rq_flags(struct request *rq, char *msg)
@@ -2215,7 +2216,8 @@ EXPORT_SYMBOL(blk_fetch_request);
  * %false - this request doesn't have any more data
  * %true  - this request has more data
  **/
-bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
+bool blk_update_request(struct request *req, int error, unsigned int nr_bytes,
+   struct batch_complete *batch)
 {
int total_bytes, bio_nbytes, next_idx = 0;
struct bio *bio;
@@ -2271,7 +2273,7 @@ bool blk_update_request(struct request *req, int error, 
unsigned int nr_bytes)
if (nr_bytes >= bio->bi_size) {
req->bio = bio->bi_next;
nbytes = bio->bi_size;
-   req_bio_endio(req, bio, nbytes, error);
+   req_bio_endio(req, bio, nbytes, error, batch);
next_idx = 0;
bio_nbytes = 0;
} else {
@@ -2333,7 +2335,7 @@ bool blk_update_request(struct request *req, int error, 
unsigned int nr_bytes)
 * if the request wasn't completed, update state
 */
if (bio_nbytes) {
-   req_bio_endio(req, bio, bio_nbytes, error);
+   req_bio_endio(req, bio, bio_nbytes, error, batch);
bio->bi_idx += next_idx;
bio_iovec(bio)->bv_offset += nr_bytes;
bio_iovec(bio)->bv_len -= nr_bytes;
@@ -2370,14 +2372,15 @@ EXPORT_SYMBOL_GPL(blk_update_request);
 
 static bool blk_update_bidi_request(struct request *rq, int error,
unsigned int nr_bytes,
-   unsigned int bidi_bytes)
+   unsigned int bidi_bytes,
+   struct batch_complete *batch)
 {
-   if (blk_update_request(rq, error, nr_bytes))
+   if 

[PATCH 28/32] aio: Kill ki_retry

2012-12-26 Thread Kent Overstreet
Thanks to Zach Brown's work to rip out the retry infrastructure, we
don't need this anymore - ki_retry was only called right after the kiocb
was initialized.

This also refactors and trims some duplicated code, as well as cleaning
up the refcounting/error handling a bit.

Signed-off-by: Kent Overstreet 
---
 fs/aio.c| 223 +++-
 include/linux/aio.h |  26 --
 2 files changed, 83 insertions(+), 166 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index f6bf227..fedd8f6 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -574,7 +574,7 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
if (unlikely(!req))
goto out_put;
 
-   atomic_set(&req->ki_users, 2);
+   atomic_set(&req->ki_users, 1);
req->ki_ctx = ctx;
return req;
 out_put:
@@ -941,24 +941,15 @@ static void aio_advance_iovec(struct kiocb *iocb, ssize_t 
ret)
BUG_ON(ret > 0 && iocb->ki_left == 0);
 }
 
-static ssize_t aio_rw_vect_retry(struct kiocb *iocb)
+typedef ssize_t (aio_rw_op)(struct kiocb *, const struct iovec *,
+   unsigned long, loff_t);
+
+static ssize_t aio_rw_vect_retry(struct kiocb *iocb, int rw, aio_rw_op *rw_op)
 {
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
-   ssize_t (*rw_op)(struct kiocb *, const struct iovec *,
-unsigned long, loff_t);
ssize_t ret = 0;
-   unsigned short opcode;
-
-   if ((iocb->ki_opcode == IOCB_CMD_PREADV) ||
-   (iocb->ki_opcode == IOCB_CMD_PREAD)) {
-   rw_op = file->f_op->aio_read;
-   opcode = IOCB_CMD_PREADV;
-   } else {
-   rw_op = file->f_op->aio_write;
-   opcode = IOCB_CMD_PWRITEV;
-   }
 
/* This matches the pread()/pwrite() logic */
if (iocb->ki_pos < 0)
@@ -974,7 +965,7 @@ static ssize_t aio_rw_vect_retry(struct kiocb *iocb)
/* retry all partial writes.  retry partial reads as long as its a
 * regular file. */
} while (ret > 0 && iocb->ki_left > 0 &&
-(opcode == IOCB_CMD_PWRITEV ||
+(rw == WRITE ||
  (!S_ISFIFO(inode->i_mode) && !S_ISSOCK(inode->i_mode;
 
/* This means we must have transferred all that we could */
@@ -984,7 +975,7 @@ static ssize_t aio_rw_vect_retry(struct kiocb *iocb)
 
/* If we managed to write some out we return that, rather than
 * the eventual error. */
-   if (opcode == IOCB_CMD_PWRITEV
+   if (rw == WRITE
&& ret < 0 && ret != -EIOCBQUEUED
&& iocb->ki_nbytes - iocb->ki_left)
ret = iocb->ki_nbytes - iocb->ki_left;
@@ -992,73 +983,41 @@ static ssize_t aio_rw_vect_retry(struct kiocb *iocb)
return ret;
 }
 
-static ssize_t aio_fdsync(struct kiocb *iocb)
-{
-   struct file *file = iocb->ki_filp;
-   ssize_t ret = -EINVAL;
-
-   if (file->f_op->aio_fsync)
-   ret = file->f_op->aio_fsync(iocb, 1);
-   return ret;
-}
-
-static ssize_t aio_fsync(struct kiocb *iocb)
-{
-   struct file *file = iocb->ki_filp;
-   ssize_t ret = -EINVAL;
-
-   if (file->f_op->aio_fsync)
-   ret = file->f_op->aio_fsync(iocb, 0);
-   return ret;
-}
-
-static ssize_t aio_setup_vectored_rw(int type, struct kiocb *kiocb, bool 
compat)
+static ssize_t aio_setup_vectored_rw(int rw, struct kiocb *kiocb, bool compat)
 {
ssize_t ret;
 
+   kiocb->ki_nr_segs = kiocb->ki_nbytes;
+
 #ifdef CONFIG_COMPAT
if (compat)
-   ret = compat_rw_copy_check_uvector(type,
+   ret = compat_rw_copy_check_uvector(rw,
(struct compat_iovec __user *)kiocb->ki_buf,
-   kiocb->ki_nbytes, 1, &kiocb->ki_inline_vec,
+   kiocb->ki_nr_segs, 1, &kiocb->ki_inline_vec,
&kiocb->ki_iovec);
else
 #endif
-   ret = rw_copy_check_uvector(type,
+   ret = rw_copy_check_uvector(rw,
(struct iovec __user *)kiocb->ki_buf,
-   kiocb->ki_nbytes, 1, &kiocb->ki_inline_vec,
+   kiocb->ki_nr_segs, 1, &kiocb->ki_inline_vec,
&kiocb->ki_iovec);
if (ret < 0)
-   goto out;
-
-   ret = rw_verify_area(type, kiocb->ki_filp, &kiocb->ki_pos, ret);
-   if (ret < 0)
-   goto out;
+   return ret;
 
-   kiocb->ki_nr_segs = kiocb->ki_nbytes;
-   kiocb->ki_cur_seg = 0;
-   /* ki_nbytes/left now reflect bytes instead of segs */
+   /* ki_nbytes now reflect bytes instead of segs */
kiocb->ki_nbytes = ret;
-   kiocb->ki_left = ret;
-
-   ret = 0;
-out:
-   return ret;
+   return 0;
 }
 
-static ssize_t aio_setup_single_vector(int type, struct file * file, struct 
kiocb 

[PATCH 22/32] aio: percpu reqs_available

2012-12-26 Thread Kent Overstreet
See the previous patch for why we want to do this - this basically
implements a per cpu allocator for reqs_available that doesn't actually
allocate anything.

Note that we need to increase the size of the ringbuffer we allocate,
since a single thread won't necessarily be able to use all the
reqs_available slots - some (up to about half) might be on other per cpu
lists, unavailable for the current thread.

We size the ringbuffer based on the nr_events userspace passed to
io_setup(), so this is a slight behaviour change - but nr_events wasn't
being used as a hard limit before; it was just rounded up to the next
page, so the actual semantics don't change.

Signed-off-by: Kent Overstreet 
---
 fs/aio.c | 92 +++-
 1 file changed, 85 insertions(+), 7 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index d384eb2..e415b33 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -59,6 +60,10 @@ struct aio_ring {
 
 #define AIO_RING_PAGES 8
 
+struct kioctx_cpu {
unsigned        reqs_available;
+};
+
 struct kioctx {
atomic_t        users;
atomic_t        dead;
@@ -67,6 +72,10 @@ struct kioctx {
unsigned long   user_id;
struct hlist_node   list;
 
+   struct __percpu kioctx_cpu *cpu;
+
+   unsigned        req_batch;
+
unsigned        nr;
 
/* sys_io_setup currently limits this to an unsigned int */
@@ -149,6 +158,9 @@ static int aio_setup_ring(struct kioctx *ctx)
unsigned long size;
int nr_pages;
 
+   nr_events = max(nr_events, num_possible_cpus() * 4);
+   nr_events *= 2;
+
/* Compensate for the ring buffer's head/tail overlap entry */
nr_events += 2; /* 1 is required, 2 for good luck */
 
@@ -255,6 +267,8 @@ static int kiocb_cancel(struct kioctx *ctx, struct kiocb 
*kiocb,
 static void free_ioctx_rcu(struct rcu_head *head)
 {
struct kioctx *ctx = container_of(head, struct kioctx, rcu_head);
+
+   free_percpu(ctx->cpu);
kmem_cache_free(kioctx_cachep, ctx);
 }
 
@@ -268,7 +282,7 @@ static void free_ioctx(struct kioctx *ctx)
struct aio_ring *ring;
struct io_event res;
struct kiocb *req;
-   unsigned head, avail;
+   unsigned cpu, head, avail;
 
spin_lock_irq(&ctx->ctx_lock);
 
@@ -282,6 +296,13 @@ static void free_ioctx(struct kioctx *ctx)
 
spin_unlock_irq(&ctx->ctx_lock);
 
+   for_each_possible_cpu(cpu) {
+   struct kioctx_cpu *kcpu = per_cpu_ptr(ctx->cpu, cpu);
+
+   atomic_add(kcpu->reqs_available, &ctx->reqs_available);
+   kcpu->reqs_available = 0;
+   }
+
ring = kmap_atomic(ctx->ring_pages[0]);
head = ring->head;
kunmap_atomic(ring);
@@ -357,10 +378,16 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 
INIT_LIST_HEAD(&ctx->active_reqs);
 
-   if (aio_setup_ring(ctx) < 0)
+   ctx->cpu = alloc_percpu(struct kioctx_cpu);
+   if (!ctx->cpu)
goto out_freectx;
 
+   if (aio_setup_ring(ctx) < 0)
+   goto out_freepcpu;
+
atomic_set(&ctx->reqs_available, ctx->nr);
+   ctx->req_batch = ctx->nr / (num_possible_cpus() * 4);
+   BUG_ON(!ctx->req_batch);
 
/* limit the number of system wide aios */
spin_lock(&aio_nr_lock);
@@ -384,6 +411,8 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 out_cleanup:
err = -EAGAIN;
aio_free_ring(ctx);
+out_freepcpu:
+   free_percpu(ctx->cpu);
 out_freectx:
kmem_cache_free(kioctx_cachep, ctx);
pr_debug("error allocating ioctx %d\n", err);
@@ -482,6 +511,52 @@ void exit_aio(struct mm_struct *mm)
}
 }
 
+static void put_reqs_available(struct kioctx *ctx, unsigned nr)
+{
+   struct kioctx_cpu *kcpu;
+
+   preempt_disable();
+   kcpu = this_cpu_ptr(ctx->cpu);
+
+   kcpu->reqs_available += nr;
+   while (kcpu->reqs_available >= ctx->req_batch * 2) {
+   kcpu->reqs_available -= ctx->req_batch;
+   atomic_add(ctx->req_batch, &ctx->reqs_available);
+   }
+
+   preempt_enable();
+}
+
+static bool get_reqs_available(struct kioctx *ctx)
+{
+   struct kioctx_cpu *kcpu;
+   bool ret = false;
+
+   preempt_disable();
+   kcpu = this_cpu_ptr(ctx->cpu);
+
+   if (!kcpu->reqs_available) {
+   int old, avail = atomic_read(&ctx->reqs_available);
+
+   do {
+   if (avail < ctx->req_batch)
+   goto out;
+
+   old = avail;
+   avail = atomic_cmpxchg(&ctx->reqs_available,
+  avail, avail - ctx->req_batch);
+   } while (avail != old);
+
+   kcpu->reqs_available += ctx->req_batch;
+   }
+
+   ret = true;
+   

[PATCH 31/32] mtip32xx: Convert to batch completion

2012-12-26 Thread Kent Overstreet
Signed-off-by: Kent Overstreet 
---
 drivers/block/mtip32xx/mtip32xx.c | 68 ++-
 drivers/block/mtip32xx/mtip32xx.h |  8 ++---
 2 files changed, 34 insertions(+), 42 deletions(-)

diff --git a/drivers/block/mtip32xx/mtip32xx.c 
b/drivers/block/mtip32xx/mtip32xx.c
index 9694dd9..5a9982b 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -159,11 +159,9 @@ static void mtip_command_cleanup(struct driver_data *dd)
command = &port->commands[commandindex];
 
if (atomic_read(>active)
-   && (command->async_callback)) {
-   command->async_callback(command->async_data,
-   -ENODEV);
-   command->async_callback = NULL;
-   command->async_data = NULL;
+   && (command->bio)) {
+   bio_endio(command->bio, -ENODEV);
+   command->bio = NULL;
}
 
dma_unmap_sg(&port->dd->pdev->dev,
@@ -603,11 +601,9 @@ static void mtip_timeout_function(unsigned long int data)
writel(1 << bit, port->completed[group]);
 
/* Call the async completion callback. */
-   if (likely(command->async_callback))
-   command->async_callback(command->async_data,
-   -EIO);
-   command->async_callback = NULL;
-   command->comp_func = NULL;
+   if (likely(command->bio))
+   bio_endio(command->bio, -EIO);
+   command->bio = NULL;
 
/* Unmap the DMA scatter list entries */
dma_unmap_sg(&port->dd->pdev->dev,
@@ -675,7 +671,8 @@ static void mtip_timeout_function(unsigned long int data)
 static void mtip_async_complete(struct mtip_port *port,
int tag,
void *data,
-   int status)
+   int status,
+   struct batch_complete *batch)
 {
struct mtip_cmd *command;
struct driver_data *dd = data;
@@ -692,11 +689,10 @@ static void mtip_async_complete(struct mtip_port *port,
}
 
/* Upper layer callback */
-   if (likely(command->async_callback))
-   command->async_callback(command->async_data, cb_status);
+   if (likely(command->bio))
+   bio_endio_batch(command->bio, cb_status, batch);
 
-   command->async_callback = NULL;
-   command->comp_func = NULL;
+   command->bio = NULL;
 
/* Unmap the DMA scatter list entries */
dma_unmap_sg(&dd->pdev->dev,
@@ -729,24 +725,22 @@ static void mtip_async_complete(struct mtip_port *port,
 static void mtip_completion(struct mtip_port *port,
int tag,
void *data,
-   int status)
+   int status,
+   struct batch_complete *batch)
 {
-   struct mtip_cmd *command = &port->commands[tag];
struct completion *waiting = data;
if (unlikely(status == PORT_IRQ_TF_ERR))
dev_warn(&port->dd->pdev->dev,
"Internal command %d completed with TFE\n", tag);
 
-   command->async_callback = NULL;
-   command->comp_func = NULL;
-
complete(waiting);
 }
 
 static void mtip_null_completion(struct mtip_port *port,
int tag,
void *data,
-   int status)
+   int status,
+   struct batch_complete *batch)
 {
return;
 }
@@ -792,7 +786,7 @@ static void mtip_handle_tfe(struct driver_data *dd)
atomic_inc(&cmd->active); /* active > 1 indicates error */
if (cmd->comp_data && cmd->comp_func) {
cmd->comp_func(port, MTIP_TAG_INTERNAL,
-   cmd->comp_data, PORT_IRQ_TF_ERR);
+   cmd->comp_data, PORT_IRQ_TF_ERR, NULL);
}
goto handle_tfe_exit;
}
@@ -825,7 +819,7 @@ static void mtip_handle_tfe(struct driver_data *dd)
cmd->comp_func(port,
 tag,
 cmd->comp_data,
-0);
+0, NULL);
} else {
dev_err(&port->dd->pdev->dev,
"Missing completion func for tag %d",
@@ -912,7 +906,7 @@ static void mtip_handle_tfe(struct driver_data *dd)
  

[PATCH 30/32] virtio-blk: Convert to batch completion

2012-12-26 Thread Kent Overstreet
Signed-off-by: Kent Overstreet 
---
 drivers/block/virtio_blk.c | 31 ---
 1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 0bdde8f..6b659d1 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -210,7 +210,8 @@ static void virtblk_bio_send_flush_work(struct work_struct 
*work)
virtblk_bio_send_flush(vbr);
 }
 
-static inline void virtblk_request_done(struct virtblk_req *vbr)
+static inline void virtblk_request_done(struct virtblk_req *vbr,
+   struct batch_complete *batch)
 {
struct virtio_blk *vblk = vbr->vblk;
struct request *req = vbr->req;
@@ -224,11 +225,12 @@ static inline void virtblk_request_done(struct 
virtblk_req *vbr)
req->errors = (error != 0);
}
 
-   __blk_end_request_all(req, error);
+   blk_end_request_all_batch(req, error, batch);
mempool_free(vbr, vblk->pool);
 }
 
-static inline void virtblk_bio_flush_done(struct virtblk_req *vbr)
+static inline void virtblk_bio_flush_done(struct virtblk_req *vbr,
+ struct batch_complete *batch)
 {
struct virtio_blk *vblk = vbr->vblk;
 
@@ -237,12 +239,13 @@ static inline void virtblk_bio_flush_done(struct 
virtblk_req *vbr)
INIT_WORK(&vbr->work, virtblk_bio_send_data_work);
queue_work(virtblk_wq, &vbr->work);
} else {
-   bio_endio(vbr->bio, virtblk_result(vbr));
+   bio_endio_batch(vbr->bio, virtblk_result(vbr), batch);
mempool_free(vbr, vblk->pool);
}
 }
 
-static inline void virtblk_bio_data_done(struct virtblk_req *vbr)
+static inline void virtblk_bio_data_done(struct virtblk_req *vbr,
+struct batch_complete *batch)
 {
struct virtio_blk *vblk = vbr->vblk;
 
@@ -252,17 +255,18 @@ static inline void virtblk_bio_data_done(struct 
virtblk_req *vbr)
INIT_WORK(&vbr->work, virtblk_bio_send_flush_work);
queue_work(virtblk_wq, &vbr->work);
} else {
-   bio_endio(vbr->bio, virtblk_result(vbr));
+   bio_endio_batch(vbr->bio, virtblk_result(vbr), batch);
mempool_free(vbr, vblk->pool);
}
 }
 
-static inline void virtblk_bio_done(struct virtblk_req *vbr)
+static inline void virtblk_bio_done(struct virtblk_req *vbr,
+   struct batch_complete *batch)
 {
if (unlikely(vbr->flags & VBLK_IS_FLUSH))
-   virtblk_bio_flush_done(vbr);
+   virtblk_bio_flush_done(vbr, batch);
else
-   virtblk_bio_data_done(vbr);
+   virtblk_bio_data_done(vbr, batch);
 }
 
 static void virtblk_done(struct virtqueue *vq)
@@ -272,16 +276,19 @@ static void virtblk_done(struct virtqueue *vq)
struct virtblk_req *vbr;
unsigned long flags;
unsigned int len;
+   struct batch_complete batch;
+
+   batch_complete_init(&batch);
 
spin_lock_irqsave(vblk->disk->queue->queue_lock, flags);
do {
virtqueue_disable_cb(vq);
while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
if (vbr->bio) {
-   virtblk_bio_done(vbr);
+   virtblk_bio_done(vbr, &batch);
bio_done = true;
} else {
-   virtblk_request_done(vbr);
+   virtblk_request_done(vbr, &batch);
req_done = true;
}
}
@@ -291,6 +298,8 @@ static void virtblk_done(struct virtqueue *vq)
blk_start_queue(vblk->disk->queue);
spin_unlock_irqrestore(vblk->disk->queue->queue_lock, flags);
 
+   batch_complete(&batch);
+
if (bio_done)
wake_up(&vblk->queue_wait);
 }
-- 
1.7.12

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 23/32] Generic dynamic per cpu refcounting

2012-12-26 Thread Kent Overstreet
This implements a refcount with similar semantics to
atomic_get()/atomic_dec_and_test(), that starts out as just an atomic_t
but dynamically switches to per cpu refcounting when the rate of
gets/puts becomes too high.

It also implements two stage shutdown, as we need it to tear down the
percpu counts. Before dropping the initial refcount, you must call
percpu_ref_kill(); this puts the refcount in "shutting down mode" and
switches back to a single atomic refcount with the appropriate barriers
(synchronize_rcu()).

It's also legal to call percpu_ref_kill() multiple times - it only
returns true once, so callers don't have to reimplement shutdown
synchronization.

For the sake of simplicity/efficiency, the heuristic is pretty simple -
it just switches to percpu refcounting if there are more than x gets
in one second (completely arbitrarily, 4096).

It'd be more correct to count the number of cache misses or something
else more profile driven, but doing so would require accessing the
shared ref twice per get - by just counting the number of gets(), we can
stick that counter in the high bits of the refcount and increment both
with a single atomic64_add(). But I expect this'll be good enough in
practice.
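
For context, the intended calling pattern looks something like this
(a sketch against the API added below - "my_obj", its lookup structure
and my_obj_free() are hypothetical; the packing described above means
the fast-path get is a single atomic64_add() on the shared word until
the percpu counters exist):

	struct my_obj {
		struct percpu_ref	ref;
		/* ... */
	};

	percpu_ref_init(&obj->ref);	/* count starts at 1, the initial ref */

	/* Gets happen under rcu_read_lock(), since the percpu counters
	 * are installed and torn down with RCU: */
	rcu_read_lock();
	percpu_ref_get(&obj->ref);
	rcu_read_unlock();

	if (percpu_ref_put(&obj->ref))	/* nonzero only for the final put */
		my_obj_free(obj);

	/* Teardown: exactly one caller sees percpu_ref_kill() return
	 * true, so shutdown needs no extra synchronization: */
	if (percpu_ref_kill(&obj->ref)) {
		remove_obj_from_lookup(obj);
		if (percpu_ref_put(&obj->ref))	/* drop the initial ref */
			my_obj_free(obj);
	}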

Signed-off-by: Kent Overstreet 
---
 include/linux/percpu-refcount.h |  29 +++
 lib/Makefile|   2 +-
 lib/percpu-refcount.c   | 164 
 3 files changed, 194 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/percpu-refcount.h
 create mode 100644 lib/percpu-refcount.c

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
new file mode 100644
index 000..1268010
--- /dev/null
+++ b/include/linux/percpu-refcount.h
@@ -0,0 +1,29 @@
+#ifndef _LINUX_PERCPU_REFCOUNT_H
+#define _LINUX_PERCPU_REFCOUNT_H
+
+#include 
+#include 
+
+struct percpu_ref {
+   atomic64_t  count;
+   unsigned __percpu   *pcpu_count;
+};
+
+void percpu_ref_init(struct percpu_ref *ref);
+void __percpu_ref_get(struct percpu_ref *ref, bool alloc);
+int percpu_ref_put(struct percpu_ref *ref);
+
+int percpu_ref_kill(struct percpu_ref *ref);
+int percpu_ref_dead(struct percpu_ref *ref);
+
+static inline void percpu_ref_get(struct percpu_ref *ref)
+{
+   __percpu_ref_get(ref, true);
+}
+
+static inline void percpu_ref_get_noalloc(struct percpu_ref *ref)
+{
+   __percpu_ref_get(ref, false);
+}
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index a08b791..48a8d26 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 idr.o int_sqrt.o extable.o \
 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
-is_single_threaded.o plist.o decompress.o
+is_single_threaded.o plist.o decompress.o percpu-refcount.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
new file mode 100644
index 000..522b2df
--- /dev/null
+++ b/lib/percpu-refcount.c
@@ -0,0 +1,164 @@
+#define pr_fmt(fmt) "%s: " fmt "\n", __func__
+
+#include 
+#include 
+#include 
+
+#define PCPU_COUNT_BITS50
+#define PCPU_COUNT_MASK((1LL << PCPU_COUNT_BITS) - 1)
+
+#define PCPU_STATUS_BITS   2
+#define PCPU_STATUS_MASK   ((1 << PCPU_STATUS_BITS) - 1)
+
+#define PCPU_REF_PTR   0
+#define PCPU_REF_NONE  1
+#define PCPU_REF_DYING 2
+#define PCPU_REF_DEAD  3
+
+#define REF_STATUS(count)  ((unsigned long) count & PCPU_STATUS_MASK)
+
+void percpu_ref_init(struct percpu_ref *ref)
+{
+   unsigned long now = jiffies;
+
+   atomic64_set(&ref->count, 1);
+
+   now <<= PCPU_STATUS_BITS;
+   now |= PCPU_REF_NONE;
+
+   ref->pcpu_count = (void *) now;
+}
+
+static void percpu_ref_alloc(struct percpu_ref *ref, unsigned __user 
*pcpu_count)
+{
+   unsigned __percpu *new;
+   unsigned long last = (unsigned long) pcpu_count;
+   unsigned long now = jiffies;
+
+   now <<= PCPU_STATUS_BITS;
+   now |= PCPU_REF_NONE;
+
+   if (now - last <= HZ << PCPU_STATUS_BITS) {
+   rcu_read_unlock();
+   new = alloc_percpu(unsigned);
+   rcu_read_lock();
+
+   if (!new)
+   goto update_time;
+
+   BUG_ON(((unsigned long) new) & PCPU_STATUS_MASK);
+
+   if (cmpxchg(&ref->pcpu_count, pcpu_count, new) != pcpu_count)
+   free_percpu(new);
+   else
+   pr_debug("created");
+   } else {
+update_time:   new = (void *) now;
+   cmpxchg(&ref->pcpu_count, pcpu_count, new);
+   }
+}
+
+void __percpu_ref_get(struct percpu_ref *ref, bool alloc)
+{
+   unsigned __percpu *pcpu_count;
+   uint64_t v;
+
+   pcpu_count = rcu_dereference(ref->pcpu_count);
+
+   if 

[PATCH 32/32] aio: Smoosh struct kiocb

2012-12-26 Thread Kent Overstreet
This patch squishes struct kiocb down to 160 bytes, from 208 previously
- mainly, some of the fields aren't needed until after aio_complete() is
called.

Also, reorder the fields to reduce the amount of memory that has to be
zeroed in aio_get_req(), and to keep members next to each other that are
used in the same place.

Signed-off-by: Kent Overstreet 
---
 fs/aio.c| 22 +++
 include/linux/aio.h | 61 +
 2 files changed, 46 insertions(+), 37 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 0e70b0e..6b05ddb 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -570,12 +570,13 @@ static inline struct kiocb *aio_get_req(struct kioctx 
*ctx)
if (!get_reqs_available(ctx))
return NULL;
 
-   req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
+   req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
if (unlikely(!req))
goto out_put;
 
-   atomic_set(&req->ki_users, 1);
+   memset(req, 0, offsetof(struct kiocb, ki_ctx));
req->ki_ctx = ctx;
+   atomic_set(&req->ki_users, 1);
return req;
 out_put:
put_reqs_available(ctx, 1);
@@ -633,8 +634,8 @@ static inline unsigned kioctx_ring_put(struct kioctx *ctx, 
struct kiocb *req,
ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
event = ev_page + pos % AIO_EVENTS_PER_PAGE;
 
-   event->obj  = (u64) req->ki_obj.user;
event->data = req->ki_user_data;
+   event->obj  = (u64) req->ki_obj.user;
event->res  = req->ki_res;
event->res2 = req->ki_res2;
 
@@ -1245,13 +1246,16 @@ static int io_submit_one(struct kioctx *ctx, struct 
iocb __user *user_iocb,
goto out_put_req;
}
 
-   req->ki_obj.user = user_iocb;
-   req->ki_user_data = iocb->aio_data;
-   req->ki_pos = iocb->aio_offset;
+   req->ki_user_data   = iocb->aio_data;
+   req->ki_obj.user= user_iocb;
 
-   req->ki_buf = (char __user *)(unsigned long)iocb->aio_buf;
-   req->ki_left = req->ki_nbytes = iocb->aio_nbytes;
-   req->ki_opcode = iocb->aio_lio_opcode;
+   req->ki_opcode  = iocb->aio_lio_opcode;
+   req->ki_pos = iocb->aio_offset;
+   req->ki_nbytes  = iocb->aio_nbytes;
+   req->ki_left= iocb->aio_nbytes;
+   req->ki_buf = (char __user *) iocb->aio_buf;
+   req->ki_nr_segs = 0;
+   req->ki_cur_seg = 0;
 
ret = aio_run_iocb(req, compat);
if (ret)
diff --git a/include/linux/aio.h b/include/linux/aio.h
index db6b856..f9ffee3 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -20,45 +20,50 @@ struct batch_complete;
 typedef int (kiocb_cancel_fn)(struct kiocb *, struct io_event *);
 
 struct kiocb {
-   struct rb_node  ki_node;
+   struct list_head    ki_list;    /* the aio core uses this
+* for cancellation */
+   kiocb_cancel_fn *ki_cancel;
+   void(*ki_dtor)(struct kiocb *);
+   void*private;
+   struct iovec*ki_iovec;
+
+   /*
+* If the aio_resfd field of the userspace iocb is not zero,
+* this is the underlying eventfd context to deliver events to.
+*/
+   struct eventfd_ctx  *ki_eventfd;
+   struct kioctx   *ki_ctx;/* NULL for sync ops */
+   struct file *ki_filp;
 
atomic_tki_users;
 
-   struct file *ki_filp;
-   struct kioctx   *ki_ctx;/* NULL for sync ops */
-   kiocb_cancel_fn *ki_cancel;
-   void(*ki_dtor)(struct kiocb *);
+   /* State that we remember to be able to restart/retry  */
+   unsigned        ki_opcode;
 
+   __u64   ki_user_data;   /* user's data for completion */
union {
void __user *user;
struct task_struct  *tsk;
} ki_obj;
 
-   __u64   ki_user_data;   /* user's data for completion */
-   longki_res;
-   longki_res2;
-
-   loff_t  ki_pos;
+   union {
+   struct {
+   loff_t  ki_pos;
+   size_t  ki_nbytes;  /* copy of iocb->aio_nbytes */
+   size_t  ki_left;/* remaining bytes */
+   char __user *ki_buf;/* remaining iocb->aio_buf */
+   unsigned long   ki_nr_segs;
+   unsigned long   ki_cur_seg;
+   };
+
+   struct {
+   longki_res;
+   longki_res2;
+   struct rb_node  ki_node;
+   };
+   };
 
-   void*private;
-   /* State that we remember to be able to 

[PATCH 18/32] aio: Kill batch allocation

2012-12-26 Thread Kent Overstreet
Previously, allocating a kiocb required touching quite a few global
(well, per kioctx) cachelines... so batching up allocation to amortize
those was worthwhile. But we've gotten rid of some of those, and in
another couple of patches kiocb allocation won't require writing to any
shared cachelines, so that means we can just rip this code out.

Signed-off-by: Kent Overstreet 
---
 fs/aio.c| 116 +++-
 include/linux/aio.h |   1 -
 2 files changed, 15 insertions(+), 102 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index b1be0cf..5ca383e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -490,108 +490,27 @@ void exit_aio(struct mm_struct *mm)
  * This prevents races between the aio code path referencing the
  * req (after submitting it) and aio_complete() freeing the req.
  */
-static struct kiocb *__aio_get_req(struct kioctx *ctx)
+static inline struct kiocb *aio_get_req(struct kioctx *ctx)
 {
-   struct kiocb *req = NULL;
+   struct kiocb *req;
+
+   if (atomic_read(&ctx->reqs_active) >= ctx->ring_info.nr)
+   return NULL;
+
+   if (atomic_inc_return(&ctx->reqs_active) > ctx->ring_info.nr)
+   goto out_put;
 
req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
if (unlikely(!req))
-   return NULL;
+   goto out_put;
 
atomic_set(&req->ki_users, 2);
req->ki_ctx = ctx;
 
return req;
-}
-
-/*
- * struct kiocb's are allocated in batches to reduce the number of
- * times the ctx lock is acquired and released.
- */
-#define KIOCB_BATCH_SIZE   32L
-struct kiocb_batch {
-   struct list_head head;
-   long count; /* number of requests left to allocate */
-};
-
-static void kiocb_batch_init(struct kiocb_batch *batch, long total)
-{
-   INIT_LIST_HEAD(&batch->head);
-   batch->count = total;
-}
-
-static void kiocb_batch_free(struct kioctx *ctx, struct kiocb_batch *batch)
-{
-   struct kiocb *req, *n;
-
-   if (list_empty(>head))
-   return;
-
-   spin_lock_irq(&ctx->ctx_lock);
-   list_for_each_entry_safe(req, n, &batch->head, ki_batch) {
-   list_del(&req->ki_batch);
-   kmem_cache_free(kiocb_cachep, req);
-   atomic_dec(&ctx->reqs_active);
-   }
-   spin_unlock_irq(&ctx->ctx_lock);
-}
-
-/*
- * Allocate a batch of kiocbs.  This avoids taking and dropping the
- * context lock a lot during setup.
- */
-static int kiocb_batch_refill(struct kioctx *ctx, struct kiocb_batch *batch)
-{
-   unsigned short allocated, to_alloc;
-   long avail;
-   struct kiocb *req, *n;
-
-   to_alloc = min(batch->count, KIOCB_BATCH_SIZE);
-   for (allocated = 0; allocated < to_alloc; allocated++) {
-   req = __aio_get_req(ctx);
-   if (!req)
-   /* allocation failed, go with what we've got */
-   break;
-   list_add(&req->ki_batch, &batch->head);
-   }
-
-   if (allocated == 0)
-   goto out;
-
-   spin_lock_irq(&ctx->ctx_lock);
-
-   avail = ctx->ring_info.nr - atomic_read(&ctx->reqs_active);
-   BUG_ON(avail < 0);
-   if (avail < allocated) {
-   /* Trim back the number of requests. */
-   list_for_each_entry_safe(req, n, &batch->head, ki_batch) {
-   list_del(&req->ki_batch);
-   kmem_cache_free(kiocb_cachep, req);
-   if (--allocated <= avail)
-   break;
-   }
-   }
-
-   batch->count -= allocated;
-   atomic_add(allocated, &ctx->reqs_active);
-
-   spin_unlock_irq(&ctx->ctx_lock);
-
-out:
-   return allocated;
-}
-
-static inline struct kiocb *aio_get_req(struct kioctx *ctx,
-   struct kiocb_batch *batch)
-{
-   struct kiocb *req;
-
-   if (list_empty(>head))
-   if (kiocb_batch_refill(ctx, batch) == 0)
-   return NULL;
-   req = list_first_entry(&batch->head, struct kiocb, ki_batch);
-   list_del(&req->ki_batch);
-   return req;
+out_put:
+   atomic_dec(&ctx->reqs_active);
+   return NULL;
 }
 
 static void kiocb_free(struct kiocb *req)
@@ -1162,8 +1081,7 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool 
compat)
 }
 
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
-struct iocb *iocb, struct kiocb_batch *batch,
-bool compat)
+struct iocb *iocb, bool compat)
 {
struct kiocb *req;
ssize_t ret;
@@ -1184,7 +1102,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
return -EINVAL;
}
 
-   req = aio_get_req(ctx, batch);  /* returns with 2 references to req */
+   req = aio_get_req(ctx);  /* returns with 2 references to req */
if (unlikely(!req))
return -EAGAIN;
 
@@ -1256,7 +1174,6 @@ long do_io_submit(aio_context_t ctx_id, long 

[PATCH 20/32] aio: Give shared kioctx fields their own cachelines

2012-12-26 Thread Kent Overstreet
Signed-off-by: Kent Overstreet 
---
 fs/aio.c | 27 +++
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 96fbd6b..fa87732 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -67,13 +67,6 @@ struct kioctx {
unsigned long   user_id;
struct hlist_node   list;
 
-   wait_queue_head_t   wait;
-
-   spinlock_t  ctx_lock;
-
-   atomic_t        reqs_active;
-   struct list_head    active_reqs;    /* used for cancellation */
-
unsignednr;
 
/* sys_io_setup currently limits this to an unsigned int */
@@ -85,19 +78,29 @@ struct kioctx {
struct page **ring_pages;
longnr_pages;
 
+   struct rcu_head rcu_head;
+   struct work_struct  rcu_work;
+
struct {
-   struct mutex        ring_lock;
+   atomic_t        reqs_active;
} cacheline_aligned;
 
struct {
+   spinlock_t  ctx_lock;
+   struct list_head active_reqs;   /* used for cancellation */
+   } cacheline_aligned_in_smp;
+
+   struct {
+   struct mutex        ring_lock;
+   wait_queue_head_t wait;
+   } cacheline_aligned_in_smp;
+
+   struct {
unsigned        tail;
spinlock_t  completion_lock;
-   } cacheline_aligned;
+   } cacheline_aligned_in_smp;
 
struct page *internal_pages[AIO_RING_PAGES];
-
-   struct rcu_head rcu_head;
-   struct work_struct  rcu_work;
 };
 
 /*-- sysctl variables*/
-- 
1.7.12

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 21/32] aio: reqs_active -> reqs_available

2012-12-26 Thread Kent Overstreet
The number of outstanding kiocbs is one of the few shared things left
that has to be touched for every kiocb - it'd be nice to make it percpu.

We can make it per cpu by treating it like an allocation problem: we
have a maximum number of kiocbs that can be outstanding (i.e. slots) -
then we just allocate and free slots, and we know how to write per cpu
allocators.

So as prep work for that, we convert reqs_active to reqs_available.
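
As a sketch, "allocating" and "freeing" a slot against the shared
counter is just the following (hypothetical helper names; the hunks
below open-code this):

	/* alloc: claim a slot iff one is free */
	static bool get_slot(struct kioctx *ctx)
	{
		return atomic_dec_if_positive(&ctx->reqs_available) >= 0;
	}

	/* free: hand the slot back */
	static void put_slot(struct kioctx *ctx)
	{
		atomic_inc(&ctx->reqs_available);
	}

The next patch then turns this into a percpu allocator by letting each
cpu cache a batch of slots locally.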

Signed-off-by: Kent Overstreet 
---
 fs/aio.c | 27 +--
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index fa87732..d384eb2 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -82,7 +82,7 @@ struct kioctx {
struct work_struct  rcu_work;
 
struct {
-   atomic_t        reqs_active;
+   atomic_t        reqs_available;
} cacheline_aligned;
 
struct {
@@ -286,17 +286,17 @@ static void free_ioctx(struct kioctx *ctx)
head = ring->head;
kunmap_atomic(ring);
 
-   while (atomic_read(&ctx->reqs_active) > 0) {
+   while (atomic_read(&ctx->reqs_available) < ctx->nr) {
wait_event(ctx->wait, head != ctx->tail);
 
avail = (head < ctx->tail ? ctx->tail : ctx->nr) - head;
 
-   atomic_sub(avail, &ctx->reqs_active);
+   atomic_add(avail, &ctx->reqs_available);
head += avail;
head %= ctx->nr;
}
 
-   WARN_ON(atomic_read(&ctx->reqs_active) < 0);
+   WARN_ON(atomic_read(&ctx->reqs_available) > ctx->nr);
 
aio_free_ring(ctx);
 
@@ -360,6 +360,8 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
if (aio_setup_ring(ctx) < 0)
goto out_freectx;
 
+   atomic_set(&ctx->reqs_available, ctx->nr);
+
/* limit the number of system wide aios */
spin_lock(&aio_nr_lock);
if (aio_nr + nr_events > aio_max_nr ||
@@ -462,7 +464,7 @@ void exit_aio(struct mm_struct *mm)
"exit_aio:ioctx still alive: %d %d %d\n",
atomic_read(&ctx->users),
atomic_read(&ctx->dead),
-   atomic_read(&ctx->reqs_active));
+   atomic_read(&ctx->reqs_available));
/*
 * We don't need to bother with munmap() here -
 * exit_mmap(mm) is coming and it'll unmap everything.
@@ -494,12 +496,9 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
 {
struct kiocb *req;
 
-   if (atomic_read(>reqs_active) >= ctx->nr)
+   if (atomic_dec_if_positive(&ctx->reqs_available) <= 0)
return NULL;
 
-   if (atomic_inc_return(&ctx->reqs_active) > ctx->nr)
-   goto out_put;
-
req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
if (unlikely(!req))
goto out_put;
@@ -509,7 +508,7 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
 
return req;
 out_put:
-   atomic_dec(&ctx->reqs_active);
+   atomic_inc(&ctx->reqs_available);
return NULL;
 }
 
@@ -580,7 +579,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 
/*
 * Take rcu_read_lock() in case the kioctx is being destroyed, as we
-* need to issue a wakeup after decrementing reqs_active.
+* need to issue a wakeup after incrementing reqs_available.
 */
rcu_read_lock();
 
@@ -598,7 +597,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 */
if (unlikely(xchg(&iocb->ki_cancel,
  KIOCB_CANCELLED) == KIOCB_CANCELLED)) {
-   atomic_dec(&ctx->reqs_active);
+   atomic_inc(&ctx->reqs_available);
/* Still need the wake_up in case free_ioctx is waiting */
goto put_rq;
}
@@ -738,7 +737,7 @@ static int aio_read_events_ring(struct kioctx *ctx,
 
pr_debug("%d  h%u t%u\n", ret, head, ctx->tail);
 
-   atomic_sub(ret, &ctx->reqs_active);
+   atomic_add(ret, &ctx->reqs_available);
 out:
mutex_unlock(>ring_lock);
 
@@ -1157,7 +1156,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
return 0;
 
 out_put_req:
-   atomic_dec(&ctx->reqs_active);
+   atomic_inc(&ctx->reqs_available);
aio_put_req(req);   /* drop extra ref to req */
aio_put_req(req);   /* drop i/o ref to req */
return ret;
-- 
1.7.12

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 02/32] aio: remove dead code from aio.h

2012-12-26 Thread Kent Overstreet
From: Zach Brown 

Signed-off-by: Zach Brown 
Signed-off-by: Kent Overstreet 
---
 include/linux/aio.h | 24 
 1 file changed, 24 deletions(-)

diff --git a/include/linux/aio.h b/include/linux/aio.h
index 31ff6db..b46a09f 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -9,44 +9,22 @@
 
 #include 
 
-#define AIO_MAXSEGS4
-#define AIO_KIOGRP_NR_ATOMIC   8
-
 struct kioctx;
 
-/* Notes on cancelling a kiocb:
- * If a kiocb is cancelled, aio_complete may return 0 to indicate 
- * that cancel has not yet disposed of the kiocb.  All cancel 
- * operations *must* call aio_put_req to dispose of the kiocb 
- * to guard against races with the completion code.
- */
-#define KIOCB_C_CANCELLED  0x01
-#define KIOCB_C_COMPLETE   0x02
-
 #define KIOCB_SYNC_KEY (~0U)
 
 /* ki_flags bits */
-/*
- * This may be used for cancel/retry serialization in the future, but
- * for now it's unused and we probably don't want modules to even
- * think they can use it.
- */
-/* #define KIF_LOCKED  0 */
 #define KIF_KICKED 1
 #define KIF_CANCELLED  2
 
-#define kiocbTryLock(iocb) test_and_set_bit(KIF_LOCKED, &(iocb)->ki_flags)
 #define kiocbTryKick(iocb) test_and_set_bit(KIF_KICKED, &(iocb)->ki_flags)
 
-#define kiocbSetLocked(iocb)   set_bit(KIF_LOCKED, &(iocb)->ki_flags)
 #define kiocbSetKicked(iocb)   set_bit(KIF_KICKED, &(iocb)->ki_flags)
#define kiocbSetCancelled(iocb)    set_bit(KIF_CANCELLED, 
&(iocb)->ki_flags)
 
-#define kiocbClearLocked(iocb) clear_bit(KIF_LOCKED, &(iocb)->ki_flags)
 #define kiocbClearKicked(iocb) clear_bit(KIF_KICKED, &(iocb)->ki_flags)
 #define kiocbClearCancelled(iocb)  clear_bit(KIF_CANCELLED, 
&(iocb)->ki_flags)
 
-#define kiocbIsLocked(iocb)test_bit(KIF_LOCKED, &(iocb)->ki_flags)
 #define kiocbIsKicked(iocb)test_bit(KIF_KICKED, &(iocb)->ki_flags)
 #define kiocbIsCancelled(iocb) test_bit(KIF_CANCELLED, &(iocb)->ki_flags)
 
@@ -207,8 +185,6 @@ struct kioctx {
 };
 
 /* prototypes */
-extern unsigned aio_max_size;
-
 #ifdef CONFIG_AIO
 extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
 extern int aio_put_req(struct kiocb *iocb);
-- 
1.7.12

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 09/32] aio: dprintk() -> pr_debug()

2012-12-26 Thread Kent Overstreet
Signed-off-by: Kent Overstreet 
---
 fs/aio.c | 57 -
 1 file changed, 24 insertions(+), 33 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 8fcea98..868ac0a 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -8,6 +8,8 @@
  *
  * See ../COPYING for licensing terms.
  */
+#define pr_fmt(fmt) "%s: " fmt, __func__
+
 #include 
 #include 
 #include 
@@ -18,8 +20,6 @@
 #include 
 #include 
 
-#define DEBUG 0
-
 #include 
 #include 
 #include 
@@ -39,12 +39,6 @@
 #include 
 #include 
 
-#if DEBUG > 1
-#define dprintkprintk
-#else
-#define dprintk(x...)  do { ; } while (0)
-#endif
-
 #define AIO_RING_MAGIC 0xa10a10a1
 #define AIO_RING_COMPAT_FEATURES   1
 #define AIO_RING_INCOMPAT_FEATURES 0
@@ -124,7 +118,7 @@ static int __init aio_setup(void)
kiocb_cachep = KMEM_CACHE(kiocb, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
kioctx_cachep = KMEM_CACHE(kioctx,SLAB_HWCACHE_ALIGN|SLAB_PANIC);
 
-   pr_debug("aio_setup: sizeof(struct page) = %d\n", (int)sizeof(struct 
page));
+   pr_debug("sizeof(struct page) = %zu\n", sizeof(struct page));
 
return 0;
 }
@@ -178,7 +172,7 @@ static int aio_setup_ring(struct kioctx *ctx)
}
 
info->mmap_size = nr_pages * PAGE_SIZE;
-   dprintk("attempting mmap of %lu bytes\n", info->mmap_size);
+   pr_debug("attempting mmap of %lu bytes\n", info->mmap_size);
down_write(>mmap_sem);
info->mmap_base = do_mmap_pgoff(NULL, 0, info->mmap_size, 
PROT_READ|PROT_WRITE,
@@ -190,7 +184,7 @@ static int aio_setup_ring(struct kioctx *ctx)
return -EAGAIN;
}
 
-   dprintk("mmap address: 0x%08lx\n", info->mmap_base);
+   pr_debug("mmap address: 0x%08lx\n", info->mmap_base);
info->nr_pages = get_user_pages(current, mm, info->mmap_base, nr_pages, 
1, 0, info->ring_pages, NULL);
up_write(>mmap_sem);
@@ -262,7 +256,7 @@ static void __put_ioctx(struct kioctx *ctx)
aio_nr -= nr_events;
spin_unlock(_nr_lock);
}
-   pr_debug("__put_ioctx: freeing %p\n", ctx);
+   pr_debug("freeing %p\n", ctx);
call_rcu(>rcu_head, ctx_rcu_free);
 }
 
@@ -351,7 +345,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
hlist_add_head_rcu(>list, >ioctx_list);
spin_unlock(>ioctx_lock);
 
-   dprintk("aio: allocated ioctx %p[%ld]: mm=%p mask=0x%x\n",
+   pr_debug("allocated ioctx %p[%ld]: mm=%p mask=0x%x\n",
ctx, ctx->user_id, mm, ctx->ring_info.nr);
return ctx;
 
@@ -360,7 +354,7 @@ out_cleanup:
aio_free_ring(ctx);
 out_freectx:
kmem_cache_free(kioctx_cachep, ctx);
-   dprintk("aio: error allocating ioctx %d\n", err);
+   pr_debug("error allocating ioctx %d\n", err);
return ERR_PTR(err);
 }
 
@@ -608,8 +602,8 @@ static inline void really_put_req(struct kioctx *ctx, 
struct kiocb *req)
  */
 static void __aio_put_req(struct kioctx *ctx, struct kiocb *req)
 {
-   dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
-   req, atomic_long_read(>ki_filp->f_count));
+   pr_debug("(%p): f_count=%ld\n",
+req, atomic_long_read(>ki_filp->f_count));
 
assert_spin_locked(>ctx_lock);
 
@@ -720,9 +714,9 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
event->res = res;
event->res2 = res2;
 
-   dprintk("aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n",
-   ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
-   res, res2);
+   pr_debug("%p[%lu]: %p: %p %Lx %lx %lx\n",
+ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
+res, res2);
 
/* after flagging the request as done, we
 * must never even look at it again
@@ -778,9 +772,7 @@ static int aio_read_evt(struct kioctx *ioctx, struct 
io_event *ent)
int ret = 0;
 
ring = kmap_atomic(info->ring_pages[0]);
-   dprintk("in aio_read_evt h%lu t%lu m%lu\n",
-(unsigned long)ring->head, (unsigned long)ring->tail,
-(unsigned long)ring->nr);
+   pr_debug("h%u t%u m%u\n", ring->head, ring->tail, ring->nr);
 
if (ring->head == ring->tail)
goto out;
@@ -801,8 +793,7 @@ static int aio_read_evt(struct kioctx *ioctx, struct 
io_event *ent)
 
 out:
kunmap_atomic(ring);
-   dprintk("leaving aio_read_evt: %d  h%lu t%lu\n", ret,
-(unsigned long)ring->head, (unsigned long)ring->tail);
+   pr_debug("%d  h%u t%u\n", ret, ring->head, ring->tail);
return ret;
 }
 
@@ -865,13 +856,13 @@ static int read_events(struct kioctx *ctx,
if (unlikely(ret <= 0))
break;
 
-   dprintk("read event: %Lx %Lx %Lx %Lx\n",
-   ent.data, ent.obj, ent.res, ent.res2);
+   pr_debug("%Lx %Lx %Lx %Lx\n",
+ent.data, ent.obj, ent.res, ent.res2);

Re: [PATCH] cma: use unsigned type for count argument

2012-12-26 Thread David Rientjes
On Sat, 22 Dec 2012, Michal Nazarewicz wrote:

> So I think just adding the following, should be sufficient to make
> everyone happy:
> 
> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
> index e34e3e0..e91743b 100644
> --- a/drivers/base/dma-contiguous.c
> +++ b/drivers/base/dma-contiguous.c
> @@ -320,7 +320,7 @@ struct page *dma_alloc_from_contiguous(struct device 
> *dev, unsigned int count,
>   pr_debug("%s(cma %p, count %u, align %u)\n", __func__, (void *)cma,
>count, align);
>  
> - if (!count)
> + if (!count || count > INT_MAX)
>   return NULL;
>  
>   mask = (1 << align) - 1;

How is this different than leaving the formal to have a signed type, i.e. 
drop your patch, and testing for count <= 0 instead?
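
For illustration only (this helper appears in neither patch), the two
guards reject the same bad callers; the only difference is where the
negative value shows up:

	/* unsigned formal: a negative int from the caller wraps to a
	 * huge value here, which the INT_MAX test catches */
	static bool count_valid_unsigned(unsigned int count)
	{
		return count != 0 && count <= INT_MAX;
	}

	/* signed formal: the same bad value stays negative and the
	 * <= 0 test catches it directly */
	static bool count_valid_signed(int count)
	{
		return count > 0;
	}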


Re: [tip:x86/build] x86: Default to ARCH= x86 to avoid overriding CONFIG_64BIT

2012-12-26 Thread David Rientjes
On Wed, 26 Dec 2012, David Woodhouse wrote:

> Thanks. I'll look into this. I presume it was *always* failing, but
> nobody happened to come across it because our test coverage of x86
> configs without CONFIG_64BIT wasn't particularly good?
> 

Purely for selfish reasons, 32-bit isn't interesting for me.  I'm sure
they existed before, as you said, but only got exposed to me because "make
randconfig" now allows such configurations.  I've added ARCH=x86_64 to my
scripts.


Re: [PATCH v7 2/3] aerdrv: Enhanced AER logging

2012-12-26 Thread Bjorn Helgaas
On Wed, Dec 5, 2012 at 9:30 AM, Borislav Petkov  wrote:
> On Wed, Dec 05, 2012 at 04:14:14PM +, Ortiz, Lance E wrote:
>> I removed the prefix argument because it was never used by its caller
>> and never set. The reason I added the prefix variable and set it to
>> NULL was to help in breaking up the patch and adding it would help the
>> intermittent patch build without changing too much code. I knew I was
>> actually going to use the variable in patch 3.
>
> No, the correct way to do that is to keep all changes that belong
> logically together in a single patch for ease of reviewing and avoid
> breakages. And in your case this should be pretty easy: simply move all
> the 'prefix' touching code to patch #3 and you're done, AFAICT.

Lance, you didn't add all the "prefix" stuff in AER, but since you're
touching it, I think things will make more sense if you clean it up at
the same time.  There are a lot of printk() calls there that should be
converted to dev_printk().

I think I see why it was done that way -- it sticks either
KERN_WARNING or KERN_ERR at the beginning of the prefix, then uses
plain printk() later.  I guess that means you only have to pass around
the prefix argument, rather than both a "level" and a "struct pci_dev
*".  But I think it will be simpler overall to pass both and take
advantage of dev_printk() and stop emulating it.
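
Something along these lines is what I have in mind (untested sketch; the
helper name and fields are made up, only dev_printk() itself is real):

	static void aer_report(const char *level, struct pci_dev *dev,
			       u32 status)
	{
		/* dev_printk() emits the "device: " prefix for us, so no
		 * hand-built prefix string needs to be passed around */
		dev_printk(level, &dev->dev, "error status: 0x%08x\n", status);
	}

and callers then pass the level explicitly, e.g.
aer_report(KERN_ERR, pdev, status).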

Bjorn


Re: [PATCH fix-3.8] video: vt8500: Fix X crash when initializing framebuffer.

2012-12-26 Thread Florian Tobias Schandinat
On 12/27/2012 12:25 AM, Tony Prisk wrote:
> This patch adds support for .fb_check_var which is required when
> X attempts to initialize the framebuffer. The only supported
> resolution is the native resolution of the LCD panel, so we test
> against the resolution supplied from the DT panel definition.

Nack. As far as I understand this driver behaves as it is supposed to
according to drivers/video/skeletonfb.c. The framebuffer code seems to do
what is documented as well. If the X driver cannot cope with a different
var set than requested it needs to be fixed anyway, as check_var is
always allowed to alter the parameters requested.
Strange, I thought I already saw X complaining about this when I added
30bpp mode to viafb.
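
For example, instead of returning -EINVAL, check_var could simply write
the panel's native mode back into var (untested sketch, reusing the
lcd_params struct from your patch):

	static int wm8505fb_check_var(struct fb_var_screeninfo *var,
				      struct fb_info *info)
	{
		struct wm8505fb_info *fbi = to_wm8505fb_info(info);

		/* clamp any request to the only mode the panel supports */
		var->xres = fbi->lcd_params.pixel_width;
		var->yres = fbi->lcd_params.pixel_height;
		var->bits_per_pixel = fbi->lcd_params.color_depth;

		return 0;
	}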


Best regards,

Florian Tobias Schandinat


> 
> Signed-off-by: Tony Prisk 
> ---
>  drivers/video/wm8505fb.c |   25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/drivers/video/wm8505fb.c b/drivers/video/wm8505fb.c
> index 77539c1..c84e376 100644
> --- a/drivers/video/wm8505fb.c
> +++ b/drivers/video/wm8505fb.c
> @@ -41,10 +41,18 @@
>  
>  #define to_wm8505fb_info(__info) container_of(__info, \
>   struct wm8505fb_info, fb)
> +
> +struct lcd_params {
> + u32 pixel_width;
> + u32 pixel_height;
> + u32 color_depth;
> +};
> +
>  struct wm8505fb_info {
>   struct fb_info  fb;
>   void __iomem*regbase;
>   unsigned intcontrast;
> + struct lcd_params   lcd_params;
>  };
>  
>  
> @@ -248,8 +256,21 @@ static int wm8505fb_blank(int blank, struct fb_info 
> *info)
>   return 0;
>  }
>  
> +static int wm8505fb_check_var(struct fb_var_screeninfo *var,
> +			      struct fb_info *info)
> +{
> +	struct wm8505fb_info *fbi = to_wm8505fb_info(info);
> +
> +	/* check the requested mode (var), not the current info->var */
> +	if (var->bits_per_pixel != fbi->lcd_params.color_depth)
> +		return -EINVAL;
> +	if (var->xres != fbi->lcd_params.pixel_width)
> +		return -EINVAL;
> +	if (var->yres != fbi->lcd_params.pixel_height)
> +		return -EINVAL;
> +	return 0;
> +}
> +
>  static struct fb_ops wm8505fb_ops = {
>   .owner  = THIS_MODULE,
> + .fb_check_var   = wm8505fb_check_var,
>   .fb_set_par = wm8505fb_set_par,
>   .fb_setcolreg   = wm8505fb_setcolreg,
>   .fb_fillrect= wmt_ge_fillrect,
> @@ -354,6 +375,10 @@ static int __devinit wm8505fb_probe(struct 
> platform_device *pdev)
>   goto failed_free_res;
>   }
>  
> + fbi->lcd_params.pixel_width = of_mode.xres;
> + fbi->lcd_params.pixel_height = of_mode.yres;
> + fbi->lcd_params.color_depth = bpp;
> +
>   of_mode.vmode = FB_VMODE_NONINTERLACED;
>   fb_videomode_to_var(>fb.var, _mode);
>  



Re: [tip] config: Add 'make kvmconfig'

2012-12-26 Thread Randy Dunlap
On 12/26/12 15:32, David Woodhouse wrote:
> On Tue, 2012-12-25 at 22:32 -0800, David Rientjes wrote:
>>
>> This creates quite a few build failures on auto-latest:
>>
>> arch/x86/built-in.o: In function `hpet_setup_msi_irq':
>> hpet.c:(.text+0x34638): undefined reference to `arch_setup_hpet_msi'
>> hpet.c:(.text+0x34651): undefined reference to `destroy_irq'
>> arch/x86/built-in.o: In function `hpet_msi_capability_lookup':
>> hpet.c:(.text+0x347ff): undefined reference to `create_irq_nr'
>> arch/x86/built-in.o:(.data+0xd1c): undefined reference to 
>> `native_setup_msi_irqs'
>> arch/x86/built-in.o:(.data+0xd20): undefined reference to 
>> `native_teardown_msi_irq'

I reported these build errors in linux-next on Nov. 7, 2011 !!!

> This one is actually caused by commit 3b08ed026 (config: Add 'make
> kvmconfig'), which selects PCI_MSI even on a 32-bit config where it's
> invalid to do so.
> 
> Ew, that commit seems like a *completely* wrong-headed idea. That abuse
> of 'select' is just begging for this kind of breakage. We have other
> ways to merge configs and turn certain options on, without doing it this
> way.

but I didn't diagnose the problem.  Thanks for that.

Yes, 'make kvmconfig' takes liberties and shortcuts.  :(

-- 
~Randy


[GIT PULL] namespace fixes for v3.8-rc2

2012-12-26 Thread Eric W. Biederman

Linus,

Please pull the for-linus git tree from:

  git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus

  HEAD: 48c6d1217e3dc743e7d3ad9b9def8d4810d13a85 f2fs: Don't assign e_id in f2fs_acl_from_disk

  This tree is against v3.8-rc1

This tree includes two bug fixes for problems Oleg spotted on his review
of the recent pid namespace work.  A small fix to not enable bottom
halves with irqs disabled, and a trivial build fix for f2fs with user
namespaces enabled.

Eric W. Biederman (4):
  pidns: Outlaw thread creation after unshare(CLONE_NEWPID)
  pidns: Stop pid allocation when init dies
  proc: Allow proc_free_inum to be called from any context
  f2fs: Don't assign e_id in f2fs_acl_from_disk

 fs/f2fs/acl.c |1 -
 fs/proc/generic.c |   13 +++--
 include/linux/pid.h   |1 +
 include/linux/pid_namespace.h |4 +++-
 kernel/fork.c |8 
 kernel/pid.c  |   15 ---
 kernel/pid_namespace.c|4 
 7 files changed, 35 insertions(+), 11 deletions(-)


[PATCH] drm: make frame duration time calculation more precise

2012-12-26 Thread Daniel Kurtz
It is a bit more precise to compute the total number of pixels first and
then divide, rather than multiplying the line pixel count by the
already-rounded line duration.
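
For example, with a hypothetical 1344x806 mode at a 65 MHz dot clock,
linedur_ns truncates to 20676, so multiplying gives 806 * 20676 =
16664856 ns per frame, while dividing the full frame size gives
1344 * 806 * 10^9 / 65000000 = 16665600 ns, about 744 ns closer to the
true frame duration on every refresh.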

Signed-off-by: Daniel Kurtz 
---
 drivers/gpu/drm/drm_irq.c |6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
index 19c01ca..05c91e0 100644
--- a/drivers/gpu/drm/drm_irq.c
+++ b/drivers/gpu/drm/drm_irq.c
@@ -505,6 +505,7 @@ void drm_calc_timestamping_constants(struct drm_crtc *crtc)
 
/* Valid dotclock? */
if (dotclock > 0) {
+   int frame_size;
/* Convert scanline length in pixels and video dot clock to
 * line duration, frame duration and pixel duration in
 * nanoseconds:
@@ -512,7 +513,10 @@ void drm_calc_timestamping_constants(struct drm_crtc *crtc)
pixeldur_ns = (s64) div64_u64(10, dotclock);
linedur_ns  = (s64) div64_u64(((u64) crtc->hwmode.crtc_htotal *
  10), dotclock);
-   framedur_ns = (s64) crtc->hwmode.crtc_vtotal * linedur_ns;
+   frame_size = crtc->hwmode.crtc_htotal *
+   crtc->hwmode.crtc_vtotal;
+   framedur_ns = (s64) div64_u64((u64) frame_size * 10,
+ dotclock);
} else
DRM_ERROR("crtc %d: Can't calculate constants, dotclock = 0!\n",
  crtc->base.id);
-- 
1.7.7.3



Re: [PATCH 0/3] Thunderbolt workarounds

2012-12-26 Thread Bjorn Helgaas
On Thu, Dec 13, 2012 at 12:25 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" 
>
> I had chance to test two PC setups with Thunderbolt: Acer Aspire S5 and
> Intel DZ77RE-75K motherboard. Unfortunately, both of them are broken in
> different ways.
>
> This patchset contains workarounds for the issues.
>
> Kirill A. Shutemov (3):
>   PCI Hotplug: workaround for Thunderbolt on Acer Aspire S5
>   PCI Hotplug: convert acpiphp_hp_work to use delayed work
>   PCI Hotplug: workaround for Thunderbolt on Intel DZ77RE-75K
> motherboard
>
>  drivers/pci/hotplug/acpi_pcihp.c   |   13 +
>  drivers/pci/hotplug/acpiphp_glue.c |   33 +
>  2 files changed, 38 insertions(+), 8 deletions(-)

I'm ignoring these for now.  [1/3] has a hardcoded BIOS path that I
don't think is the right approach and [3/3] has a timeout issue to be
addressed.  [2/3] might be worthwhile by itself, but I'd rather merge
a complete solution when it's ready.

Bjorn


Re: [Qemu-devel] [PATCH 0/4] AER-KVM: Error containment of PCI pass-thru devices assigned to KVM guests

2012-12-26 Thread Bjorn Helgaas
On Mon, Nov 26, 2012 at 11:46 PM, Gleb Natapov  wrote:
> On Mon, Nov 26, 2012 at 09:46:12PM -0200, Marcelo Tosatti wrote:
>> On Tue, Nov 20, 2012 at 02:09:46PM +, Pandarathil, Vijaymohan R wrote:
>> >
>> >
>> > > -Original Message-
>> > > From: Stefan Hajnoczi [mailto:stefa...@gmail.com]
>> > > Sent: Tuesday, November 20, 2012 5:41 AM
>> > > To: Pandarathil, Vijaymohan R
>> > > Cc: k...@vger.kernel.org; linux-...@vger.kernel.org; 
>> > > qemu-de...@nongnu.org;
>> > > linux-kernel@vger.kernel.org
>> > > Subject: Re: [PATCH 0/4] AER-KVM: Error containment of PCI pass-thru
>> > > devices assigned to KVM guests
>> > >
>> > > On Tue, Nov 20, 2012 at 06:31:48AM +, Pandarathil, Vijaymohan R 
>> > > wrote:
>> > > > Add support for error containment when a PCI pass-thru device assigned 
>> > > > to
>> > > a KVM
>> > > > guest encounters an error. This is for PCIe devices/drivers that 
>> > > > support
>> > > AER
>> > > > functionality. When the OS is notified of an error in a device either
>> > > > through the firmware first approach or through an interrupt handled by
>> > > the AER
>> > > > root port driver, concerned subsystems are notified by invoking 
>> > > > callbacks
>> > > > registered by these subsystems. The device is also marked as tainted 
>> > > > till
>> > > the
>> > > > corresponding driver recovery routines are successful.
>> > > >
>> > > > KVM module registers for a notification of such errors. In the KVM
>> > > callback
>> > > > routine, a global counter is incremented to keep track of the error
>> > > > notification. Before each CPU enters guest mode to execute guest code,
>> > > > appropriate checks are done to see if the impacted device belongs to 
>> > > > the
>> > > guest
>> > > > or not. If the device belongs to the guest, qemu hypervisor for the 
>> > > > guest
>> > > is
>> > > > informed and the guest is immediately brought down, thus preventing or
>> > > > minimizing chances of any bad data being written out by the guest 
>> > > > driver
>> > > > after the device has encountered an error.
>> > >
>> > > I'm surprised that the hypervisor would shut down the guest when PCIe
>> > > AER kicks in for a pass-through device.  Shouldn't we pass the AER event
>> > > into the guest and deal with it there?
>> >
>> > Agreed. That would be the ideal behavior and is planned in a future patch.
>> > Lack of control over the capabilities/type of the OS/drivers running in
>> > the guest is also a concern in passing along the event to the guest.
>> >
>> > My understanding is that in the current implementation of Linux/KVM, these
>> > errors are not handled at all and can potentially cause a guest hang or
>> > crash or even data corruption depending on the implementation of the guest
>> > driver for the device. As a first step, these patches make the behavior
>> > better by doing error containment with a predictable behavior when such
>> > errors occur.
>>
>> For both ACPI notifications and Linux PCI AER driver there is a way for
>> the PCI driver to receive a notification, correct?
>>
>> Can just have virt/kvm/assigned-dev.c code register such a notifier (as
>> a "PCI driver") and then perform appropriate action?
>>
>> Also the semantics of "tainted driver" is not entirely clear.
>>
>> Is there any reason for not having this feature for VFIO only, as KVM
>> device assigment is being phased out?
>>
> Exactly. We shouldn't add checks to guest entry code and introduce new
> userspace ABI to add minor feature to deprecated code. New userspace ABI
> means that QEMU changes are needed, so the feature will be fully functional
> only with latest QEMU which is capable of using VFIO anyway.

I'm ignoring these patches for now.  Please address the review
comments if you think we still need to do something here.

Bjorn


Re: kernel BUG at mm/huge_memory.c:1798!

2012-12-26 Thread Alexander Beregalov
On 25 December 2012 16:05, Hillf Danton  wrote:
> On Tue, Dec 25, 2012 at 12:38 PM, Zhouping Liu  wrote:
>> Hello all,
>>
>> I found the below kernel bug using latest mainline(637704cbc95),
>> my hardware has 2 numa nodes, and it's easy to reproduce the issue
>> using LTP test case: "# ./mmap10 -a -s -c 200":
>
> Can you test with 5a505085f0 and 4fc3f1d66b1 reverted?
>

Hello,
does it look like the same problem?

mapcount 0 page_mapcount 1
[ cut here ]
kernel BUG at mm/huge_memory.c:1798!
invalid opcode:  [#1] PREEMPT SMP
Modules linked in: r8169 radeon cfbfillrect cfbimgblt cfbcopyarea
i2c_algo_bit backlight drm_kms_helper ttm drm agpgart
CPU 3
Pid: 15825, comm: firefox Not tainted 3.8.0-rc1-4-g637704c #1
Gigabyte Technology Co., Ltd. P35-DS3/P35-DS3
RIP: 0010:[]  [] split_huge_page+0x739/0x7a0
RSP: 0018:880193b43b78  EFLAGS: 00010297
RAX: 0001 RBX: ea0002fd RCX: 8175e078
RDX: 003e RSI: ea0002fd RDI: 0246
RBP: 880193b43c48 R08:  R09: 
R10: 02d5 R11:  R12: 
R13: 880173533464 R14: 7f097300 R15: ea0002fd
FS:  7f09b8db6740() GS:88019fd8() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7ff210e78008 CR3: 000195379000 CR4: 07e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process firefox (pid: 15825, threadinfo 880193b42000, task 880198af9f90)
Stack:
  880193b43e1c  0019
 880193b43c08 8801 88017af80180 8801
 880173533400 880198af9f90 9fc91540 88017af801b0
Call Trace:
 [] __split_huge_page_pmd+0xe4/0x280
 [] ? free_hot_cold_page_list+0x3e/0x60
 [] unmap_single_vma+0x77d/0x820
 [] zap_page_range+0xa4/0xe0
 [] ? sys_recvfrom+0xd6/0x120
 [] sys_madvise+0x31d/0x660
 [] system_call_fastpath+0x1a/0x1f
Code: 83 39 00 f3 90 49 8b 45 00 a9 00 00 80 00 75 f3 41 ff 84 24 44
e0 ff ff f0 41 0f ba 6d 00 17 19 c0 85 c0 0f 84 d7 fa ff ff eb c8 <0f>
0b 8b 53 18 8b 75 9c ff c2 48 c7 c7 60 95 5c 81 31 c0 e8 ac
RIP  [] split_huge_page+0x739/0x7a0
 RSP 


Re: [PATCH v4] usb: phy: samsung: Add support to set pmu isolation

2012-12-26 Thread Russell King - ARM Linux
On Wed, Dec 26, 2012 at 05:58:32PM +0530, Vivek Gautam wrote:
> + if (!ret)
> + sphy->phyctrl_pmureg = ioremap(reg[0], reg[1]);
> +
> + of_node_put(usbphy_pmu);
> +
> + if (IS_ERR_OR_NULL(sphy->phyctrl_pmureg)) {

No.  Learn what the error return values are from functions.  Using the
wrong ones is buggy.  ioremap() only ever returns NULL on error.  You
must check against NULL, and not use the IS_ERR stuff.
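
I.e. something like:

	sphy->phyctrl_pmureg = ioremap(reg[0], reg[1]);
	if (!sphy->phyctrl_pmureg) {
		dev_err(sphy->dev, "ioremap of PMU register failed\n");
		return -ENOMEM;
	}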

> +/*
> + * Set isolation here for phy.
> + * SOCs control this by controlling corresponding PMU registers
> + */
> +static void samsung_usbphy_set_isolation(struct samsung_usbphy *sphy, int on)
> +{
> + u32 reg;
> + int en_mask;
> +
> + if (!sphy->phyctrl_pmureg) {
> + dev_warn(sphy->dev, "Can't set pmu isolation\n");
> + return;
> + }
> +
> + reg = readl(sphy->phyctrl_pmureg);
> +
> + en_mask = sphy->drv_data->devphy_en_mask;
> +
> + if (on)
> + writel(reg & ~en_mask, sphy->phyctrl_pmureg);
> + else
> + writel(reg | en_mask, sphy->phyctrl_pmureg);

What guarantees that this read-modify-write sequence of this register safe?
And, btw, this can't be optimised very well because of the barrier inside
writel().  This would be better:

if (on)
reg &= ~en_mask;
else
reg |= en_mask;

writel(reg, sphy->phyctrl_pmureg);
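
(And if nothing else serializes writers of this register, the usual
answer is a lock around the whole read-modify-write sequence. Sketch
only; "lock" would be a new spinlock_t field in struct samsung_usbphy,
and flags an unsigned long local:)

	spin_lock_irqsave(&sphy->lock, flags);
	reg = readl(sphy->phyctrl_pmureg);
	if (on)
		reg &= ~en_mask;
	else
		reg |= en_mask;
	writel(reg, sphy->phyctrl_pmureg);
	spin_unlock_irqrestore(&sphy->lock, flags);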

> +static inline struct samsung_usbphy_drvdata
> +*samsung_usbphy_get_driver_data(struct platform_device *pdev)
>  {
>   if (pdev->dev.of_node) {
>   const struct of_device_id *match;
>   match = of_match_node(samsung_usbphy_dt_match,
>   pdev->dev.of_node);
> - return (int) match->data;
> + return (struct samsung_usbphy_drvdata *) match->data;

match->data is a const void pointer.  Is there a reason you need this
cast here?  What if you made the returned pointer from this function
also const and fixed up all its users (no user should modify this
data.)
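
I.e. roughly (based only on the hunk quoted above; the non-DT path is
not shown here):

	static inline const struct samsung_usbphy_drvdata
	*samsung_usbphy_get_driver_data(struct platform_device *pdev)
	{
		if (pdev->dev.of_node) {
			const struct of_device_id *match;

			match = of_match_node(samsung_usbphy_dt_match,
					pdev->dev.of_node);
			/* const void * converts to a const pointer type
			 * without any cast */
			return match->data;
		}

		return NULL;	/* non-DT lookup omitted in this sketch */
	}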

>  #ifdef CONFIG_OF
>  static const struct of_device_id samsung_usbphy_dt_match[] = {
>   {
>   .compatible = "samsung,s3c64xx-usbphy",
> - .data = (void *)TYPE_S3C64XX,
> + .data = (void *)_s3c64xx,

Why do you need this cast?

>   }, {
>   .compatible = "samsung,exynos4210-usbphy",
> - .data = (void *)TYPE_EXYNOS4210,
> + .data = (void *)_exynos4,

Ditto.


[PATCH fix-3.8] video: vt8500: Fix X crash when initializing framebuffer.

2012-12-26 Thread Tony Prisk
This patch adds support for .fb_check_var which is required when
X attempts to initialize the framebuffer. The only supported
resolution is the native resolution of the LCD panel, so we test
against the resolution supplied from the DT panel definition.

Signed-off-by: Tony Prisk 
---
 drivers/video/wm8505fb.c |   25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/drivers/video/wm8505fb.c b/drivers/video/wm8505fb.c
index 77539c1..c84e376 100644
--- a/drivers/video/wm8505fb.c
+++ b/drivers/video/wm8505fb.c
@@ -41,10 +41,18 @@
 
 #define to_wm8505fb_info(__info) container_of(__info, \
struct wm8505fb_info, fb)
+
+struct lcd_params {
+   u32 pixel_width;
+   u32 pixel_height;
+   u32 color_depth;
+};
+
 struct wm8505fb_info {
struct fb_info  fb;
void __iomem*regbase;
unsigned intcontrast;
+   struct lcd_params   lcd_params;
 };
 
 
@@ -248,8 +256,21 @@ static int wm8505fb_blank(int blank, struct fb_info *info)
return 0;
 }
 
+static int wm8505fb_check_var(struct fb_var_screeninfo *var,
+			      struct fb_info *info)
+{
+	struct wm8505fb_info *fbi = to_wm8505fb_info(info);
+
+	/* check the requested mode (var), not the current info->var */
+	if (var->bits_per_pixel != fbi->lcd_params.color_depth)
+		return -EINVAL;
+	if (var->xres != fbi->lcd_params.pixel_width)
+		return -EINVAL;
+	if (var->yres != fbi->lcd_params.pixel_height)
+		return -EINVAL;
+	return 0;
+}
+
 static struct fb_ops wm8505fb_ops = {
.owner  = THIS_MODULE,
+   .fb_check_var   = wm8505fb_check_var,
.fb_set_par = wm8505fb_set_par,
.fb_setcolreg   = wm8505fb_setcolreg,
.fb_fillrect= wmt_ge_fillrect,
@@ -354,6 +375,10 @@ static int __devinit wm8505fb_probe(struct platform_device 
*pdev)
goto failed_free_res;
}
 
+   fbi->lcd_params.pixel_width = of_mode.xres;
+   fbi->lcd_params.pixel_height = of_mode.yres;
+   fbi->lcd_params.color_depth = bpp;
+
of_mode.vmode = FB_VMODE_NONINTERLACED;
fb_videomode_to_var(>fb.var, _mode);
 
-- 
1.7.9.5



[PATCH fix-3.8] rtc: vt8500: Correct handling of CR_24H bitfield

2012-12-26 Thread Tony Prisk
Control register bitfield for 12H/24H mode is handled incorrectly.
Setting CR_24H actually enables 12H mode. This patch renames the
define and changes the initialization code to correctly set
24H mode.

Signed-off-by: Tony Prisk 
---
 drivers/rtc/rtc-vt8500.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/rtc/rtc-vt8500.c b/drivers/rtc/rtc-vt8500.c
index 14e2d8c..387edf6 100644
--- a/drivers/rtc/rtc-vt8500.c
+++ b/drivers/rtc/rtc-vt8500.c
@@ -70,7 +70,7 @@
| ALARM_SEC_BIT)
 
 #define VT8500_RTC_CR_ENABLE   (1 << 0)/* Enable RTC */
-#define VT8500_RTC_CR_24H  (1 << 1)/* 24h time format */
+#define VT8500_RTC_CR_12H  (1 << 1)/* 12h time format */
 #define VT8500_RTC_CR_SM_ENABLE(1 << 2)/* Enable periodic irqs */
 #define VT8500_RTC_CR_SM_SEC   (1 << 3)/* 0: 1Hz/60, 1: 1Hz */
 #define VT8500_RTC_CR_CALIB(1 << 4)/* Enable calibration */
@@ -247,7 +247,7 @@ static int __devinit vt8500_rtc_probe(struct 
platform_device *pdev)
}
 
/* Enable RTC and set it to 24-hour mode */
-   writel(VT8500_RTC_CR_ENABLE | VT8500_RTC_CR_24H,
+   writel(VT8500_RTC_CR_ENABLE,
   vt8500_rtc->regbase + VT8500_RTC_CR);
 
vt8500_rtc->rtc = rtc_device_register("vt8500-rtc", >dev,
-- 
1.7.9.5



[PATCH 3/3] clk: vt8500: Fix division-by-0 when requested rate=0

2012-12-26 Thread Tony Prisk
A request to vt8500_dclk_(round_rate/set_rate) with rate=0 results
in a division-by-0 in the kernel.

Signed-off-by: Tony Prisk 
---
 drivers/clk/clk-vt8500.c |   14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/clk/clk-vt8500.c b/drivers/clk/clk-vt8500.c
index 3306c2b..db7d41f 100644
--- a/drivers/clk/clk-vt8500.c
+++ b/drivers/clk/clk-vt8500.c
@@ -121,7 +121,12 @@ static long vt8500_dclk_round_rate(struct clk_hw *hw, 
unsigned long rate,
unsigned long *prate)
 {
struct clk_device *cdev = to_clk_device(hw);
-   u32 divisor = *prate / rate;
+   u32 divisor;
+
+   if (rate == 0)
+   return 0;
+
+   divisor = *prate / rate;
 
/* If prate / rate would be decimal, incr the divisor */
if (rate * divisor < *prate)
@@ -142,9 +147,14 @@ static int vt8500_dclk_set_rate(struct clk_hw *hw, 
unsigned long rate,
unsigned long parent_rate)
 {
struct clk_device *cdev = to_clk_device(hw);
-   u32 divisor = parent_rate / rate;
+   u32 divisor;
unsigned long flags = 0;
 
+   if (rate == 0)
+   return 0;
+
+   divisor = parent_rate / rate;
+
/* If parent_rate / rate would be decimal, incr the divisor */
if (rate * divisor < parent_rate)
divisor++;
-- 
1.7.9.5



[PATCH 1/3] clk: vt8500: Fix error in PLL calculations on non-exact match.

2012-12-26 Thread Tony Prisk
When a PLL frequency calculation is performed and a non-exact match
is found the wrong multiplier and divisors are returned.

Signed-off-by: Tony Prisk 
---
 drivers/clk/clk-vt8500.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/clk/clk-vt8500.c b/drivers/clk/clk-vt8500.c
index fe25570..0cb26be 100644
--- a/drivers/clk/clk-vt8500.c
+++ b/drivers/clk/clk-vt8500.c
@@ -361,9 +361,9 @@ static void wm8650_find_pll_bits(unsigned long rate, 
unsigned long parent_rate,
/* if we got here, it wasn't an exact match */
pr_warn("%s: requested rate %lu, found rate %lu\n", __func__, rate,
rate - best_err);
-   *multiplier = mul;
-   *divisor1 = div1;
-   *divisor2 = div2;
+   *multiplier = best_mul;
+   *divisor1 = best_div1;
+   *divisor2 = best_div2;
 }
 
 static int vtwm_pll_set_rate(struct clk_hw *hw, unsigned long rate,
-- 
1.7.9.5



[PATCH 2/3] clk: vt8500: Fix device clock divisor calculations

2012-12-26 Thread Tony Prisk
When calculating device clock divisor values in set_rate and
round_rate, we do a simple integer divide. If parent_rate / rate
has a fraction, this is dropped which results in the device clock
being set too high.

This patch corrects the problem by adding 1 to the calculated
divisor if the division would have had a decimal result.
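
For example, with a hypothetical 24 MHz parent clock and a requested
rate of 7 MHz: 24000000 / 7000000 truncates to a divisor of 3, which
would program 8 MHz, above the requested rate. Incrementing the divisor
to 4 gives 6 MHz, the closest rate that does not exceed the request.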

Signed-off-by: Tony Prisk 
---
 drivers/clk/clk-vt8500.c |8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/clk/clk-vt8500.c b/drivers/clk/clk-vt8500.c
index 0cb26be..3306c2b 100644
--- a/drivers/clk/clk-vt8500.c
+++ b/drivers/clk/clk-vt8500.c
@@ -123,6 +123,10 @@ static long vt8500_dclk_round_rate(struct clk_hw *hw, 
unsigned long rate,
struct clk_device *cdev = to_clk_device(hw);
u32 divisor = *prate / rate;
 
+   /* If prate / rate would be decimal, incr the divisor */
+   if (rate * divisor < *prate)
+   divisor++;
+
/*
 * If this is a request for SDMMC we have to adjust the divisor
 * when >31 to use the fixed predivisor
@@ -141,6 +145,10 @@ static int vt8500_dclk_set_rate(struct clk_hw *hw, 
unsigned long rate,
u32 divisor = parent_rate / rate;
unsigned long flags = 0;
 
+   /* If parent_rate / rate would be decimal, incr the divisor */
+   if (rate * divisor < parent_rate)
+   divisor++;
+
if (divisor == cdev->div_mask + 1)
divisor = 0;
 
-- 
1.7.9.5



[PATCH 0/3] clk fixes for 3.8

2012-12-26 Thread Tony Prisk
Mike,

Three bugfixes for 3.8.

#1 was a boo-boo on my part, function returned the wrong variables.
#2 is a truncation problem which results in a higher-than-requested clock rate.
#3 became apparent when the MMC driver started requesting rate=0 during init.

Tony Prisk (3):
  clk: vt8500: Fix error in PLL calculations on non-exact match.
  clk: vt8500: Fix device clock divisor calculations
  clk: vt8500: Fix division-by-0 when requested rate=0

 drivers/clk/clk-vt8500.c |   28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

-- 
1.7.9.5



Re: [Alternative][PATCH] ACPI / PCI: Set root bridge ACPI handle in advance

2012-12-26 Thread Yinghai Lu
On Wed, Dec 26, 2012 at 2:36 PM, Rafael J. Wysocki  wrote:
> On Wednesday, December 26, 2012 12:41:05 PM Yinghai Lu wrote:
>> On Wed, Dec 26, 2012 at 12:16 PM, Yinghai Lu  wrote:
>> > On Wed, Dec 26, 2012 at 12:10 PM, Bjorn Helgaas  
>> > wrote:
>> >> Do you have a reference for this?  I think this might have been true
>> >> in the past, but I don't think it's true for any version of gcc we
>> >> support for building Linux.
>> >
>> > http://lkml.indiana.edu/hypermail/linux/kernel/0804.3/3600.html
>>
>> the problem is already addressed by:
>>
>> | commit f9d14250071eda9972e4c9cea745a11185952114
>> | Author: Linus Torvalds 
>> | Date:   Fri Jan 2 09:29:43 2009 -0800
>> |
>> |Disallow gcc versions 4.1.{0,1}
>> |
>> |These compiler versions are known to miscompile __weak functions and
>> |thus generate kernels that don't necessarily work correctly.  If a weak
>> |function is int he same compilation unit as a caller, gcc may end up
>> |inlining it, and thus binding the weak function too early.
>> |
>> |See
>> |
>> |http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27781
>> |
>> |for details.
>>
>> so it is ok to put the __weak in the same file now.
>
> Cool, thanks for checking and for the ACK!

Wait, we have a problem on systems where the root bus is not exported via DSDT ...

One of my Nehalem systems has uncore CPU devices that are not exported via ACPI.

Also, there will be a problem when the system boots with acpi=off.


+int pcibios_root_bridge_prepare(struct pci_host_bridge *bridge)
+{
+   struct pci_sysdata *sd = bridge->bus->sysdata;
+   struct pci_root_info *info = container_of(sd, struct pci_root_info, sd);
+
+   ACPI_HANDLE_SET(>dev, info->bridge->handle);
+   return 0;
+}

will get wrong info via sd, as their sd is standalone.

Thanks

Yinghai


Re: [tip] config: Add 'make kvmconfig'

2012-12-26 Thread David Woodhouse
On Tue, 2012-12-25 at 22:32 -0800, David Rientjes wrote:
> 
> This creates quite a few build failures on auto-latest:
> 
> arch/x86/built-in.o: In function `hpet_setup_msi_irq':
> hpet.c:(.text+0x34638): undefined reference to `arch_setup_hpet_msi'
> hpet.c:(.text+0x34651): undefined reference to `destroy_irq'
> arch/x86/built-in.o: In function `hpet_msi_capability_lookup':
> hpet.c:(.text+0x347ff): undefined reference to `create_irq_nr'
> arch/x86/built-in.o:(.data+0xd1c): undefined reference to 
> `native_setup_msi_irqs'
> arch/x86/built-in.o:(.data+0xd20): undefined reference to 
> `native_teardown_msi_irq'

This one is actually caused by commit 3b08ed026 (config: Add 'make
kvmconfig'), which selects PCI_MSI even on a 32-bit config where it's
invalid to do so.

Ew, that commit seems like a *completely* wrong-headed idea. That abuse
of 'select' is just begging for this kind of breakage. We have other
ways to merge configs and turn certain options on, without doing it this
way.
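
E.g. a config fragment applied with the in-tree helper (assuming
scripts/kconfig/merge_config.sh; the fragment contents here are
illustrative only):

	$ cat kvm_guest.config
	CONFIG_VIRTIO=y
	CONFIG_VIRTIO_PCI=y
	CONFIG_VIRTIO_BLK=y
	CONFIG_VIRTIO_NET=y
	$ scripts/kconfig/merge_config.sh .config kvm_guest.config

That route goes through the normal Kconfig dependency resolution and
warns when a requested symbol didn't stick, instead of 'select' forcing
on options whose dependencies aren't met.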

-- 
dwmw2





Re: [PATCH 04/26] aio: remove retry-based AIO

2012-12-26 Thread Kent Overstreet
On Wed, Dec 19, 2012 at 08:04:11PM +0800, Hillf Danton wrote:
> >@@ -52,15 +46,6 @@ struct kioctx;
> >  * not ask the method again -- ki_retry must ensure forward progress.
> >  * aio_complete() must be called once and only once in the future, multiple
> >  * calls may result in undefined behaviour.
> >- *
> >- * If ki_retry returns -EIOCBRETRY it has made a promise that kick_iocb()
> >- * will be called on the kiocb pointer in the future.  This may happen
> >- * through generic helpers that associate kiocb->ki_wait with a wait
> >- * queue head that ki_retry uses via current->io_wait.  It can also happen
> >- * with custom tracking and manual calls to kick_iocb(), though that is
> >- * discouraged.  In either case, kick_iocb() must be called once and only
> >- * once.  ki_retry must ensure forward progress, the AIO core will wait
> >- * indefinitely for kick_iocb() to be called.
> >  */
> > struct kiocb {
> > struct list_headki_run_list;
> 
> Then you can also erase ki_run_list if no longer used.

Thanks, fixed that and also the comments you pointed out.


Re: [PATCH BUGFIX] pkt_sched: fix little service anomalies and possible crashes of qfq+

2012-12-26 Thread David Miller
From: Paolo valente 
Date: Wed, 19 Dec 2012 18:31:06 +0100

> + /*
> +  * The next assignment may let
> +  * agg->initial_budget > agg->budgetmax
> +  * hold, but this does not cause any harm
> +  */

Please format comments in the networking:

/* Like
 * this.
 */

and

/*
 * Never
 * like this.
 */

I know this file is full of exceptions, but that error is to be
corrected rather than expanded.

> + /*
> +  * If lmax is lowered, through qfq_change_class, for a class
> +  * owning pending packets with larger size than the new value of lmax,
> +  * then the following condition may hold.
> +  */

Likewise.

And I'm not applying this until someone familiar with this code
does some review of this patch.  These are seriously non-trivial
changes.


Re: [tip:x86/build] x86: Default to ARCH= x86 to avoid overriding CONFIG_64BIT

2012-12-26 Thread H. Peter Anvin

On 12/26/2012 02:00 PM, David Rientjes wrote:
>
> In the past, "make randconfig" would always generate a
> kernel that _should_ boot on that machine unless there was an underlying
> bug that should be fixed.
>

Not even remotely true.  There are tons of options which may not be set
that your machine needs, or you might set options that exclude support
for your CPU, for example.


Seriously, this is a bad joke.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [tip:x86/build] x86: Default to ARCH= x86 to avoid overriding CONFIG_64BIT

2012-12-26 Thread David Woodhouse
On Wed, 2012-12-26 at 14:00 -0800, David Rientjes wrote:
> I do quite a bit of automated config and boot tests to try out 
> combinations that others may not have tested when developing their
> code; staging branches such as in tip are interesting to try because
> they haven't yet reached Linus and it's helpful to catch breakages
> before it reaches mainline.

Indeed. It seems quite broken at the moment.

Your config builds fine in the tip of Linus' current tree (with
ARCH=i386 or ARCH=x86 of course). But breaks on tip.git as you describe.
The first commit I hit when attempting to bisect (2bd24259f78) is
*differently* buggered:

  CC  arch/x86/kernel/ptrace.o
arch/x86/kernel/ptrace.c:1350:17: error: conflicting types for 
‘syscall_trace_enter’
In file included from /ssd/git/tip/arch/x86/include/asm/vm86.h:130:0,
 from /ssd/git/tip/arch/x86/include/asm/processor.h:10,
 from /ssd/git/tip/arch/x86/include/asm/thread_info.h:22,
 from include/linux/thread_info.h:56,
 from include/linux/preempt.h:9,
 from include/linux/spinlock.h:50,
 from include/linux/seqlock.h:29,
 from include/linux/time.h:8,
 from include/linux/timex.h:56,
 from include/linux/sched.h:56,
 from arch/x86/kernel/ptrace.c:8:
/ssd/git/tip/arch/x86/include/asm/ptrace.h:146:13: note: previous declaration 
of ‘syscall_trace_enter’ was here

I'll persist, but the build failure you describe looks like it's a
simple failure of the 32-bit build in tip.git, which 'randconfig' was
*designed* to catch... but of course it wasn't doing its job very well
until I fixed it. It's entirely inappropriate to *blame* it on my patch.

-- 
dwmw2





Re: [PATCH v4] usb: phy: samsung: Add support to set pmu isolation

2012-12-26 Thread Sylwester Nawrocki

Hi,

On 12/26/2012 02:56 PM, Vivek Gautam wrote:
> On Wed, Dec 26, 2012 at 5:58 PM, Vivek Gautam wrote:
>> Adding support to parse device node data in order to get
>> required properties to set pmu isolation for usb-phy.
>>
>> Signed-off-by: Vivek Gautam
>> ---
>
> Hope these changes align with the architectural changes you had suggested?


It looks much better now, thanks! I had a few additional comments, please
see my other reply.

