Re: Alignment Issue with Direct IO to NVMe Drive

2012-11-27 Thread Matthew Wilcox
On Tue, Nov 27, 2012 at 01:09:46PM +0100, Jens Axboe wrote:
> On 2012-11-27 01:35, Laine Walker-Avina wrote:
> > Hi all,
> > 
> > We are experiencing an issue with doing direct IO to a NVMe device I'm
> > helping to develop. Every so often, the physical address given by
> > sg_dma_address() is aligned to 0x800 instead of 0x1000 as specified by
> > blk_queue_dma_alignement(queue, 4095) when the queue is initialized.

FYI, this is a modification to the driver that Laine has made; presumably
for a limitation of the prototype hardware he's working with.  The NVMe
spec requires the device to be able to do I/Os to 4 byte boundaries.

Laine, when this occurs, what is the alignment of 'offset' in the sg
entry you're looking at?  If userspace is passing in an unaligned address,
I don't think there's anything we do to try to align it.

> > The request is also split over multiple segments to make up for the
> > missing space (eg: for a 4k IO it's split into two segments 2k in
> > size, and for an 8k IO it's split into 3 segments--2k,4k,2k). Our
> > design requires the physical segments given to the device be aligned
> > to 4k boundaries and be multiples of 4k in size. When not doing direct
> > IO the physical addresses appear to always be 4k aligned as expected.
> > One possible issue is the kernel we're primarily testing against is
> > 2.6.32-220 from CentOS, but we have observed similar behavior from a
> > vanilla 3.3 kernel as well. Any help would be greatly appreciated.
> 
> I'm assuming you set the hardware sector size to 4k as well?
> 
> -- 
> Jens Axboe
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Alignment Issue with Direct IO to NVMe Drive

2012-11-27 Thread Matthew Wilcox
On Tue, Nov 27, 2012 at 09:47:48AM -0800, Laine Walker-Avina wrote:
> On Tue, Nov 27, 2012 at 9:05 AM, Matthew Wilcox  wrote:
> > Laine, when this occurs, what is the alignment of 'offset' in the sg
> > entry you're looking at?  If userspace is passing in an unaligned address,
> > I don't think there's anything we do to try to align it.
> 
> I thought this was the case as well, but it appears to be aligned
> looking at the request in the user space program (fio using libaio in
> this case) before the IO is submitted.

OK, so we have an aligned virtual address being translated into an
unaligned DMA address.  I have a suspicion ... are you using the swiotlb,
or are you using a real IOMMU?  Grep dmesg for 'IOMMU' if you're not sure.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] VFS: File System Mount Wide O_DIRECT Support

2012-09-04 Thread Matthew Wilcox
On Tue, Sep 04, 2012 at 06:57:14AM -0400, Christoph Hellwig wrote:
> On Tue, Sep 04, 2012 at 06:17:47PM +0800, Li Wang wrote:
> > For file system created on file-backed loop device, there will be 
> > two-levels of 
> > page cache present, which typically doubles the memory consumption. 
> 
> And the right fix is to not use buffer I/O on the backing file instead
> of hacks like this.

That was my initial reaction too, but for the case of two VMs operating on
the same device, it's better for it to be cached once in the hype-rvisor
than twice in the VMs.  Is that a common case worth optimising for?
Probably not ...

-- 
Matthew Wilcox  Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: DMA mapping on SCSI device?

2008-01-30 Thread Matthew Wilcox
On Tue, Jan 29, 2008 at 02:09:52PM -0800, Luben Tuikov wrote:
> > The ideal solution would be to do mapping against a
> > different struct 
> > device for each port, so that we could maintain the proper
> > DMA mask for 
> > each of them at all times. However I'm not sure if
> > that's possible. The 
> > thought of using the SCSI struct device for DMA mapping was
> > brought up 
> > at one point.. any thoughts on that?
> 
> The reason for this is that the object that a struct scsi_dev
> represents has nothing to do with HW DMA engines.

It really would work, once the few remaining architectures move away
from asserting that the 'struct device' passed in is a pci device.
It seems like the best way forward to me.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Pull request: TASK_KILLABLE

2008-01-31 Thread Matthew Wilcox
On Tue, Jan 29, 2008 at 10:41:24PM +1100, Linus Torvalds wrote:
> On Mon, 28 Jan 2008, Matthew Wilcox wrote:
> > 
> > I'd like you to pull the task_killable branch of 
> > git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc.git
> 
> Well, Andrew already pointed out some of this, but I do want more 
> information. There's a "git request-pull" script that even comes with git 
> and that generates a shortlog, a proper url+branch description, and a 
> diffstat of the changes.
> 
> So that is already much more descriptive, but in addition to that, if this 
> isn't one of the "regular" pulls (which it obviously isn't), I really want 
> a human description too so that I know what I'm pulling and what it's all 
> about.
> 
> I do *not* want to be in the position of having to fetch first, just to be 
> able to look at what I'm fetching. By the time I pull a branch, I want to 
> basically know (a) what I'm pulling and (b) why I _want_ to pull it.

OK.  Hopefully this satisfies you:


Introduce TASK_KILLABLE.

The basic idea was Linus', back in 2002.  We can't normally return a
short read because it breaks applications.  But if a task has received
a fatal signal, it doesn't have a chance to notice that it didn't get
all the data it asked for.

To allow tasks to be interrupted by fatal signals, we introduce a new
TASK_* bit; TASK_WAKEKILL.  We also add a predicate fatal_signal_pending;
the counterpart of signal_pending().  Then we add killable versions
of lock_page(), mutex_lock(), schedule_timeout(), wait_event(), and
wait_for_completion().  Finally, we can make the NFS 'intr' mount option
a no-op.

At the end of this patch set, you can now kill a task in readdir(), read()
and write(), at least most of the time.  There is still much more work
that can (and should) be done, such as implementing down_killable(),
lock_kernel_killable() and even spin_lock_killable().

While my main motivation has been to fix problems I have with flaky
wireless networks and/or NFS servers, people like Ingo and Nick Piggin
want to see this infrastructure for helping them improve the OOM killer.

The following changes since commit 8af03e782cae1e0a0f530ddd22301cdd12cf9dc0:
  Linus Torvalds (1):
Merge branch 'for-2.6.25' of git://git.kernel.org/.../paulus/powerpc

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc.git task_killable

Liam R. Howlett (2):
  Add mutex_lock_killable
  Use mutex_lock_killable in vfs_readdir

Matthew Wilcox (20):
  Use wake_up_locked() in eventpoll
  Add macros to replace direct uses of TASK_ flags
  perfmon: Use task_is_*
  proc/array.c: Use TASK_REPORT
  proc/base.c: Use task_is_*
  wait: Use TASK_NORMAL
  power: Use task_is_*
  ptrace: Use task_is_*
  sched: Use task_contributes_to_load, TASK_ALL and TASK_NORMAL
  signal: Use task_is_*
  exit: Use task_is_*
  Add TASK_WAKEKILL
  Add fatal_signal_pending
  Add lock_page_killable
  Use lock_page_killable
  Add schedule_timeout_killable
  Add wait_event_killable
  Add wait_for_completion_killable
  NFS: Switch from intr mount option to TASK_KILLABLE
  Remove commented-out code copied from NFS

 arch/ia64/kernel/perfmon.c   |4 +-
 fs/eventpoll.c   |   11 ++---
 fs/nfs/client.c  |6 +--
 fs/nfs/direct.c  |   10 +
 fs/nfs/inode.c   |6 +--
 fs/nfs/mount_clnt.c  |2 +-
 fs/nfs/nfs3proc.c|7 +--
 fs/nfs/nfs4proc.c|   27 +++-
 fs/nfs/nfsroot.c |3 -
 fs/nfs/pagelist.c|   18 ++--
 fs/nfs/read.c|5 --
 fs/nfs/super.c   |4 --
 fs/nfs/write.c   |7 +---
 fs/proc/array.c  |7 +---
 fs/proc/base.c   |2 +-
 fs/readdir.c |5 ++-
 fs/smbfs/request.c   |2 +-
 include/linux/completion.h   |1 +
 include/linux/mutex.h|5 ++
 include/linux/nfs_fs.h   |9 +
 include/linux/nfs_mount.h|2 +-
 include/linux/pagemap.h  |   14 +++
 include/linux/sched.h|   36 -
 include/linux/sunrpc/clnt.h  |2 -
 include/linux/sunrpc/sched.h |2 -
 include/linux/wait.h |   52 ++--
 kernel/exit.c|   88 ++---
 kernel/mutex.c   |   36 -
 kernel/power/process.c   |6 +-
 kernel/ptrace.c  |8 ++--
 kernel/sched.c   |   28 +-
 kernel/signal.c  |   19 ++---
 kernel/timer.c   |7 +++
 kernel/wait.c|2 +-
 mm/filemap.c |   25 ++--
 net/sunrpc/clnt.c

Re: [PATCH?][arch/parisc/kernel/pci-dma.c] pcxl_dma_ops.alloc_noncoherent = pa11_dma_alloc_consistent?

2008-02-11 Thread Matthew Wilcox
On Mon, Feb 11, 2008 at 05:23:33PM +0100, Roel Kluin wrote:
> duplicate pa11_dma_alloc_consistent; more appropriate appears
> pa11_dma_alloc_noncoherent here. 
> 
> Not tested, please confirm that this fix is correct

I don't think it is.  The memories are fading, so I don't recall why it
is we do it this way, but I'm pretty sure it's correct the way it is.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: raw_pci_read in quirk_intel_irqbalance

2008-02-10 Thread Matthew Wilcox
On Sun, Feb 10, 2008 at 04:02:04PM -0700, Matthew Wilcox wrote:
> The line in question reads:
> 
> /* read xTPR register */
> raw_pci_read(0, 0, 0x40, 0x4c, 2, &word);
> 
> That's domain 0, bus 0, device 8, function 0, address 0x4c, length 2.
> 
> I've checked the public E7525 and E7520 MCH datasheets, and they don't
> document the xTPR registers; nor do any of the devices in the datasheet
> have registers documented at 0x4c.
> 
> You can see from my lspci above that I don't _have_ a device 8 on bus 0.
> The aforementioned documentation says:
> 
> "A disabled or non-existent device's configuration register space is
> hidden. A disabled or non-existent device will return all ones for reads
> and will drop writes just as if the cycle terminated with a Master Abort
> on PCI."

I'd like to thank Grant for pointing out to me that this is exactly what
the write immediately above this is doing -- enabling device 8 to
respond to config space cycles.

> Now, my E7525 isn't affected by this quirk as it has a revision greater
> than 0x9.  So maybe it's expected that device 8 is hidden on my machine;
> that it's only present on revisions up to 0x9.  But maybe device 8 is
> always hidden, and that's why the author used raw_pci_ops?
> 
> We can still do better than this, though.  We can do:
> 
> - raw_pci_read(0, 0, 0x40, 0x4c, 2, &word);
> + pci_bus_read_config_word(dev->bus, PCI_DEVFN(8, 0), 0x4c, &word);
> 
> Using PCI_DEVFN tells people you really did mean device 8, and it's not
> a braino for device 4 or 2 (how many bits for slot and function again?)

Here's the patch to implement the above two suggestions:



>From f565b65591a3f90a272b1d511e4ab1728861fe77 Mon Sep 17 00:00:00 2001
From: Matthew Wilcox <[EMAIL PROTECTED]>
Date: Sun, 10 Feb 2008 23:18:15 -0500
Subject: [PATCH] Use proper abstractions in quirk_intel_irqbalance

Since we may not have a pci_dev for the device we need to access, we can't
use pci_read_config_word.  But raw_pci_read is an internal implementation
detail; it's better to use the architected pci_bus_read_config_word
interface.  Using PCI_DEVFN instead of a mysterious constant helps
reassure everyone that we really do intend to access device 8.

Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]>
---
 arch/x86/kernel/quirks.c |9 ++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
index 1941482..c47208f 100644
--- a/arch/x86/kernel/quirks.c
+++ b/arch/x86/kernel/quirks.c
@@ -11,7 +11,7 @@
 static void __devinit quirk_intel_irqbalance(struct pci_dev *dev)
 {
u8 config, rev;
-   u32 word;
+   u16 word;
 
/* BIOS may enable hardware IRQ balancing for
 * E7520/E7320/E7525(revision ID 0x9 and below)
@@ -26,8 +26,11 @@ static void __devinit quirk_intel_irqbalance(struct pci_dev 
*dev)
pci_read_config_byte(dev, 0xf4, &config);
pci_write_config_byte(dev, 0xf4, config|0x2);
 
-   /* read xTPR register */
-   raw_pci_read(0, 0, 0x40, 0x4c, 2, &word);
+   /*
+* read xTPR register.  We may not have a pci_dev for device 8
+* because it might be hidden until the above write.
+*/
+   pci_bus_read_config_word(dev->bus, PCI_DEVFN(8, 0), 0x4c, &word);
 
if (!(word & (1 << 13))) {
dev_info(&dev->dev, "Intel E7520/7320/7525 detected; "
-- 
1.5.2.5

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: raw_pci_read in quirk_intel_irqbalance

2008-02-11 Thread Matthew Wilcox
On Mon, Feb 11, 2008 at 12:49:54AM -0700, Grant Grundler wrote:
> Can you also add a comment which points at the Intel documentation?
> 
> http://download.intel.com/design/chipsets/datashts/30300702.pdf
> Page 34 documents 0xf4 register.

I'm told that these URLs are not guaranteed to be stable.  And
remembering the pain we had when HP decided to relocate all of their
documents, I'm really not inclined to embed a link to a URL in the
source code.

> And I just doubled checked that the 0xf4 register value is restored later
> in the quirk (obvious when you look at the code but not from the patch).

Yep, I checked that too ;-)

> Acked-by: Grant Grundler <[EMAIL PROTECTED]>

Thanks.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Announce: Linux-next (Or Andrew's dream :-))

2008-02-12 Thread Matthew Wilcox
On Tue, Feb 12, 2008 at 12:48:13PM -0800, Greg KH wrote:
> I have tried, and successfully done this many times in the past.  The
> kobject change was one example: add a new function, migrate all users of
> a direct pointer over to that function, after that work is all done and
> in, change the structure and do the needed work afterward.  All is
> bisectable completly, with no big "flag day" needed.

That takes care of bisectability, but doesn't take care of the problem
of new stuff being merged which used the old API and is now broken.

Maybe the last step (removal of old API) needs to be taken a lot later
than it currently is (maybe a second merge window between -rc1 and -rc2
for removal of interfaces that became unused during the first merge
window before -rc1?)

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: "ide=reverse" do we still need this?

2008-02-15 Thread Matthew Wilcox
On Thu, Feb 14, 2008 at 12:16:42PM -0500, Bill Davidsen wrote:
> There are any number of things you can do when the system is booted, but 
> the only thing you can do when the system won't boot is use kernel boot 
> options.

Greg's not removing your option to boot the system using an old kernel
to set up a new kernel properly ;-)

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] gdth: convert to PCI hotplug API

2008-02-15 Thread Matthew Wilcox
On Fri, Feb 15, 2008 at 10:44:52AM -0500, Jeff Garzik wrote:
> I kept those array for one reason:  you need it to preserve the existing 
> in-driver PCI device sort.

Just get rid of it.  I got rid of it for sym2 during 2.5 and very few
people have complained.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] 2.6.25-rc1-git2: GDT SCSI: change drivers/scsi/gdth.c into using pci_get device

2008-02-16 Thread Matthew Wilcox
On Wed, Feb 13, 2008 at 10:57:37AM +0200, Boaz Harrosh wrote:
> I still don't have a card for testing myself. Again anyone
> wants to send me a card. Intel people anybody home?

Apparently Intel sold this line of cards to Adaptec.  The copyright
notice in the file backs this up:

 * Copyright (C) 1995-06 ICP vortex GmbH, Achim Leubner *
 * Copyright (C) 2002-04 Intel Corporation  *
 * Copyright (C) 2003-06 Adaptec Inc.   *
 * <[EMAIL PROTECTED]>  *

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB: Increasing partial pages

2008-02-16 Thread Matthew Wilcox
On Tue, Jan 22, 2008 at 12:00:00PM -0800, Christoph Lameter wrote:
> Patches that I would recommend to test individually if you could do it 
> (get the series via git pull 
> git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm.git performance):

With these patches applied to 2.6.24-rc8, the perf team are seeing
oopses while running the benchmark.  They're currently trying to narrow
down which of the patches it is.  I'll get an oops for you to study when
they've figured that out.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] NVMe: Add a character device for each nvme device

2012-07-27 Thread Matthew Wilcox
On Fri, Jul 27, 2012 at 10:44:18AM -0600, Keith Busch wrote:
> Registers a character device for the nvme module and creates character
> files as /dev/nvmeN for each nvme device probed, where N is the device
> instance. The character devices support nvme admin ioctl commands so
> that nvme devices without namespaces can be managed.

I don't see a problem here, but I'm no expert at sysfs / character devices.
Alan, Greg, anyone else see any problems with how this character device is
created / destroyed?

> 
> Signed-off-by: Keith Busch 
> ---
>  drivers/block/nvme.c |   55 
> +-
>  1 files changed, 54 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/block/nvme.c b/drivers/block/nvme.c
> index 7bcd882..8a16ac8 100644
> --- a/drivers/block/nvme.c
> +++ b/drivers/block/nvme.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -45,6 +46,7 @@
>  #define SQ_SIZE(depth)   (depth * sizeof(struct nvme_command))
>  #define CQ_SIZE(depth)   (depth * sizeof(struct nvme_completion))
>  #define NVME_MINORS 64
> +#define NVME_MAX_DEVS 1024
>  #define NVME_IO_TIMEOUT  (5 * HZ)
>  #define ADMIN_TIMEOUT(60 * HZ)
>  
> @@ -54,9 +56,13 @@ module_param(nvme_major, int, 0);
>  static int use_threaded_interrupts;
>  module_param(use_threaded_interrupts, int, 0);
>  
> +static int nvme_char_major;
> +module_param(nvme_char_major, int, 0);
> +
>  static DEFINE_SPINLOCK(dev_list_lock);
>  static LIST_HEAD(dev_list);
>  static struct task_struct *nvme_thread;
> +static struct class *nvme_char_cl;
>  
>  /*
>   * Represents an NVM Express device.  Each nvme_dev is a PCI function.
> @@ -1222,6 +1228,35 @@ static const struct block_device_operations nvme_fops 
> = {
>   .compat_ioctl   = nvme_ioctl,
>  };
>  
> +static long nvme_char_ioctl(struct file *f, unsigned int cmd, unsigned long 
> arg)
> +{
> + struct nvme_dev *dev;
> + int instance = iminor(f->f_dentry->d_inode);
> +
> + spin_lock(&dev_list_lock);
> + list_for_each_entry(dev, &dev_list, node) {
> + if (dev->instance == instance)
> + break;
> + }
> + spin_unlock(&dev_list_lock);
> +
> + if (&dev->node == &dev_list)
> + return -ENOTTY;
> + 
> + switch (cmd) {
> + case NVME_IOCTL_ADMIN_CMD:
> + return nvme_user_admin_cmd(dev, (void __user *)arg);
> + default:
> + return -ENOTTY;
> + }
> +}
> +
> +static const struct file_operations nvme_char_fops = {
> + .owner  = THIS_MODULE,
> + .unlocked_ioctl = nvme_char_ioctl,
> + .compat_ioctl   = nvme_char_ioctl,
> +};
> +
>  static void nvme_timeout_ios(struct nvme_queue *nvmeq)
>  {
>   int depth = nvmeq->q_depth - 1;
> @@ -1632,6 +1667,8 @@ static int __devinit nvme_probe(struct pci_dev *pdev,
>   if (result)
>   goto delete;
>  
> + device_create(nvme_char_cl, NULL, MKDEV(nvme_char_major, dev->instance),
> + NULL, "nvme%d", dev->instance);
>   return 0;
>  
>   delete:
> @@ -1660,6 +1697,7 @@ static void __devexit nvme_remove(struct pci_dev *pdev)
>  {
>   struct nvme_dev *dev = pci_get_drvdata(pdev);
>   nvme_dev_remove(dev);
> + device_destroy(nvme_char_cl, MKDEV(nvme_char_major, dev->instance));
>   pci_disable_msix(pdev);
>   iounmap(dev->bar);
>   nvme_release_instance(dev);
> @@ -1721,11 +1759,24 @@ static int __init nvme_init(void)
>   else if (result > 0)
>   nvme_major = result;
>  
> + result = __register_chrdev(nvme_char_major, 0, NVME_MAX_DEVS, "nvme",
> + &nvme_char_fops);
> + if (result < 0)
> + goto unregister_blkdev;
> + else if (result > 0)
> + nvme_char_major = result;
> + nvme_char_cl = class_create(THIS_MODULE, "nvme");
> + if (!nvme_char_cl)
> + goto unregister_chrdev;
>   result = pci_register_driver(&nvme_driver);
>   if (result)
> - goto unregister_blkdev;
> + goto destroy_class;
>   return 0;
>  
> + destroy_class:
> + class_destroy(nvme_char_cl);
> + unregister_chrdev:
> + __unregister_chrdev(nvme_char_major, 0, NVME_MAX_DEVS, "nvme");
>   unregister_blkdev:
>   unregister_blkdev(nvme_major, "nvme");
>   kill_kthread:
> @@ -1736,6 +1787,8 @@ static int __init nvme_init(void)
>  static void __exit nvme_exit(void)
>  {
>   pci_unregister_driver(&nvme_driver);
> + class_destroy(nvme_char_cl);
> + __unregister_chrdev(nvme_char_major, 0, NVME_MAX_DEVS, "nvme");
>   unregister_blkdev(nvme_major, "nvme");
>   kthread_stop(nvme_thread);
>  }
> -- 
> 1.7.0.4
> 
> 
> ___
> Linux-nvme mailing list
> linux-n...@lists.infradead.org
> http://merlin.infradead.org/mailman/listinfo/linux-nvme
--
To unsubscribe from t

Re: [PATCH] NVMe: Add a character device for each nvme device

2012-07-27 Thread Matthew Wilcox
On Fri, Jul 27, 2012 at 11:25:46AM -0700, Greg KH wrote:
> On Fri, Jul 27, 2012 at 02:12:12PM -0400, Matthew Wilcox wrote:
> > I don't see a problem here, but I'm no expert at sysfs / character devices.
> > Alan, Greg, anyone else see any problems with how this character device is
> > created / destroyed?
> 
> Yes, see below:

Thanks!

> > > + device_create(nvme_char_cl, NULL, MKDEV(nvme_char_major, dev->instance),
> > > + NULL, "nvme%d", dev->instance);
> 
> You just created a device at the "root" of sysfs, which is wrong,
> especially when you do have a parent device here.  Please use it.

OK, that makes sense; this device should be the child of the pci_dev that
it belongs to.

> Also, why are you creating your own class?  Can't this just be a misc
> device?  And if you want to create your own class, please don't, use a
> bus, as that is what is really happening here, right?  We are trying to
> move away from using 'struct class' wherever possible (one of these days
> we'll just remove it...)

What we're trying to achieve here is to create one character device
per NVMe controller that gets plugged in.  Each NVMe controller is-a
PCI function.  The reason we're trying to do this is so that we can send
commands to the NVMe controller, even when there is no storage present
(eg a drive is shipped from the factory with no configured storage).

So we have no particular desire to create a new struct class, or struct
bus.  If we can create a misc device per PCIe function that's bound to our
driver, that's great!  Can you recommend a driver that does this already?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] NVMe: Add a character device for each nvme device

2012-07-27 Thread Matthew Wilcox
On Fri, Jul 27, 2012 at 03:28:25PM -0400, Jeff Garzik wrote:
> On 07/27/2012 02:12 PM, Matthew Wilcox wrote:
> >On Fri, Jul 27, 2012 at 10:44:18AM -0600, Keith Busch wrote:
> >>Registers a character device for the nvme module and creates character
> >>files as /dev/nvmeN for each nvme device probed, where N is the device
> >>instance. The character devices support nvme admin ioctl commands so
> >>that nvme devices without namespaces can be managed.
> >
> >I don't see a problem here, but I'm no expert at sysfs / character devices.
> >Alan, Greg, anyone else see any problems with how this character device is
> >created / destroyed?
> 
> This seems like something normally done via a control device that is
> addressible via bsg.

I'm not convinced about that.  bsg requires a request_queue, and we
don't have one in the absence of any storage.  There doesn't even seem
to be a standard way of sending commands to SCSI hosts, let alone block
device controllers.

Maybe we should design such a mechanism, but maybe we shouldn't ... as we
find common things to do, we tend to move those to sysfs, not ioctls,
and the kinds of commands that are being sent here are essentially
vendor-specific NVMe commands; it's not clear they'd fit neatly into a
generic mechanism.

> This is -not- a NAK, but maybe the storage folks have a different
> preference for an admin-command path.
> 
>   Jeff
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] NVMe: Add a character device for each nvme device

2012-07-27 Thread Matthew Wilcox
On Fri, Jul 27, 2012 at 12:21:00PM -0700, Greg KH wrote:
> > > Also, why are you creating your own class?  Can't this just be a misc
> > > device?  And if you want to create your own class, please don't, use a
> > > bus, as that is what is really happening here, right?  We are trying to
> > > move away from using 'struct class' wherever possible (one of these days
> > > we'll just remove it...)
> > 
> > What we're trying to achieve here is to create one character device
> > per NVMe controller that gets plugged in.  Each NVMe controller is-a
> > PCI function.  The reason we're trying to do this is so that we can send
> > commands to the NVMe controller, even when there is no storage present
> > (eg a drive is shipped from the factory with no configured storage).
> > 
> > So we have no particular desire to create a new struct class, or struct
> > bus.  If we can create a misc device per PCIe function that's bound to our
> > driver, that's great!  Can you recommend a driver that does this already?
> 
> I don't think there is one, but it shouldn't be that hard to just create
> a 'struct misdevice' for each one of the devices you want to create,
> would it?
> 
> But, as you really are a "specific type", a bus_type might be overkill,
> so the original use of device_create() should be fine.  Just be sure to
> fix the parent pointer issue, and you should be fine, right?

Works for me.  I don't think we're going to have any software that
depends on it being a class or a bus, so it's easy to change later.
All we really care about is /dev/nvmeN being created.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-21 Thread Matthew Wilcox
On Fri, Sep 21, 2012 at 05:31:02PM +0800, Guo Chao wrote:
> This patchset optimizes several places which take the per inode spin lock.
> They have not been fully tested yet, thus they are marked as RFC. 
> 
> I do limited tests after all patches applied: use two 'find' to traverse the 
> filesystems and touch all files in parallel. This runs for several days in a 
> virtual machine, no suspicious log appears. 

Have you done any performance testing?  Taking and releasing a lock which
isn't contended is not particularly expensive.

-- 
Matthew Wilcox  Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RFC Block Layer Extensions to Support NV-DIMMs

2013-09-05 Thread Matthew Wilcox
On Thu, Sep 05, 2013 at 08:12:05AM -0400, Jeff Moyer wrote:
> If the memory is available to be mapped into the address space of the
> kernel or a user process, then I don't see why we should have a block
> device at all.  I think it would make more sense to have a different
> driver class for these persistent memory devices.

We already have at least two block devices in the tree that provide
this kind of functionality (arch/powerpc/sysdev/axonram.c and
drivers/s390/block/dcssblk.c).  Looking at how they're written, it
seems like implementing either of them as a block device on top of a
character device that extended their functionality in the direction we
want would be a pretty major bloating factor for no real benefit (not
even a particularly cleaner architecture).

> > Different applications, filesystem and drivers may wish to share
> > ranges of PMEM.  This is analogous to partitioning a disk that is
> > using multiple and different filesystems.   Since PMEM is addressed
> > on a byte basis rather than a block basis the existing partitioning
> > model does not fit well.  As a result there needs to be a way to
> > describe PMEM ranges.
> >
> > struct pmem_layout *(*getpmem)(struct block_device *bdev);
> 
> If existing partitioning doesn't work well, then it sounds like a block
> device isn't the right fit (again).  Ignoring that detail, what about
> requesting and releasing ranges of persistent memory, as in your
> "partitioning" example?  Would that not also be a function of the
> driver?

"existing partitioning" doesn't even work well for existing drives!
Nobody actually builds a drive with fixed C/H/S any more.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RFC Block Layer Extensions to Support NV-DIMMs

2013-09-05 Thread Matthew Wilcox
On Thu, Sep 05, 2013 at 01:15:40PM -0400, Jeff Moyer wrote:
> Matthew Wilcox  writes:
> 
> > On Thu, Sep 05, 2013 at 08:12:05AM -0400, Jeff Moyer wrote:
> >> If the memory is available to be mapped into the address space of the
> >> kernel or a user process, then I don't see why we should have a block
> >> device at all.  I think it would make more sense to have a different
> >> driver class for these persistent memory devices.
> >
> > We already have at least two block devices in the tree that provide
> > this kind of functionality (arch/powerpc/sysdev/axonram.c and
> > drivers/s390/block/dcssblk.c).  Looking at how they're written, it
> > seems like implementing either of them as a block device on top of a
> > character device that extended their functionality in the direction we
> > want would be a pretty major bloating factor for no real benefit (not
> > even a particularly cleaner architecture).
> 
> Fun examples to read, thanks for the pointers.  I'll note that neither
> required extensions to the block device operations.  ;-)  I do agree with
> you that neither would benefit from changing.

Ah, but they did require extensions to the block device operations!
See commit 420edbcc09008342c7b2665453f6b370739aadb0.  The next obvious
question is, "Why isn't this API suitable for us".  And the answer is
that this API suits an existing filesystem design like ext2 (with the
xip mount option) that wants to stick very closely to the idiom of
reading one block at a time.  It's not really suited to wanting to
"read" multiple vectors at once.  Sure, we could ask for something
like preadv/pwritev in the block device operations, but we thought that
allowing the filesystem to ask for the geometry and then map the pieces
it needed itself was more elegant.

> There are a couple of things in this proposal that cause me grief,
> centered around the commitpmem call:
> 
> >>void (*commitpmem)(struct block_device *bdev, void *addr);
> 
> For block devices, when you want to flush something out, you submit a
> bio with REQ_FLUSH set.  Or, you could have submitted one or more I/Os
> with REQ_FUA.  Here, you want to add another method to accomplish the
> same thing, but outside of the data path.  So, who would the caller of
> this commitpmem function be?  Let's assume that we have a file system
> layered on top of this block device.  Will the file system need to call
> commitpmem in addition to sending down the appropriate flags with the
> I/Os?

Existing filesystems work as-is, without using commitpmem.  commitpmem
is only for a filesystem that isn't using bios.  We could use a bio to
commit writes ... but we'd like to be able to commit writes that are
as small as a cacheline, and I don't think Linux supports block sizes
smaller than 512 bytes.

> This brings me to the other thing.  If the caller of commitpmem is a
> persistent memory-aware file system, then it seems awkward to call into
> a block driver at all.  You are basically turning the block device into
> a sort of hybrid thing, where you can access stuff behind it in myriad
> ways.  That's the part that doesn't make sense to me.

If we get the API right, the filesystem shouldn't care whether these
things are done with bios or with direct calls to the block driver.
Look at how we did sb_issue_discard().  That could be switched to a
direct call, but we choose to use bios because we want the discard bios
queued along with regular reads and writes.  

> So, that's why I suggested that maybe pmem is different from a block
> device, but a block device could certainly be layered on top of it.

Architecturally, it absolutely could be done, but I think in terms of
writing the driver it is more complex, and in terms of the filesystem,
it makes no difference (assuming we get the APIs right).  I think we
should proceed as-is, and we can look at migrating to a block-layer-free
version later, if that ever makes sense.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PULL REQUEST] NVM Express updates

2013-09-06 Thread Matthew Wilcox
Hi Linus,

Please pull the NVMe tree.

The following changes since commit a2648ebb7ed69ef209d9c8a76fadeb3252d9a023:

  Merge branch 'for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs (2013-06-13 
22:34:14 -0700)

are available in the git repository at:


  git://git.infradead.org/users/willy/linux-nvme.git master

for you to fetch changes up to d82e8bfdef9afae83b894be49af4644d9ac3c359:
  NVMe: Merge issue on character device bring-up (2013-09-06 16:26:58 -0400)


Keith Busch (12):
  NVMe: Disk IO statistics
  NVMe: Update nvme_id_power_state with latest spec
  NVMe: Fix checkpatch issues
  NVMe: Bring up cdev on set feature failure
  NVMe: Disk stats for read/write commands only
  NVMe: Group pci related actions in functions
  NVMe: Separate queue alloc/free from create/delete
  NVMe: Separate controller init from disk discovery
  NVMe: Use normal shutdown
  NVMe: Add pci suspend/resume driver callbacks
  NVMe: Handle ioremap failure
  NVMe: Merge issue on character device bring-up

Matthew Wilcox (6):
  NVMe: Restructure MSI / MSI-X setup
  NVMe: Return correct value from interrupt handler
  NVMe: Remove "process_cq did something" message
  NVMe: Call nvme_process_cq from submission path
  NVMe: Split header file into user-visible and kernel-visible pieces
  NVMe: Namespace IDs are unsigned

Tushar Behera (1):
  NVMe: Use kzalloc instead of kmalloc+memset

 drivers/block/nvme-core.c | 585 +++---
 drivers/block/nvme-scsi.c |  24 +-
 include/linux/nvme.h  | 466 +---
 include/uapi/linux/Kbuild |   1 +
 include/uapi/linux/nvme.h | 477 +
 5 files changed, 895 insertions(+), 658 deletions(-)
 create mode 100644 include/uapi/linux/nvme.h


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC 0/2] Convert from bio-based to blk-mq

2013-10-08 Thread Matthew Wilcox
On Tue, Oct 08, 2013 at 11:34:20AM +0200, Matias Bjørling wrote:
> The nvme driver implements itself as a bio-based driver. This primarily 
> because
> of high lock congestion for high-performance nvm devices. To remove the
> congestion, a multi-queue block layer is being implemented.

Um, no.  You'll crater performance by adding another memory allocation
(of the struct request).  multi-queue is not the solution.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RFC Block Layer Extensions to Support NV-DIMMs

2013-09-26 Thread Matthew Wilcox
On Thu, Sep 26, 2013 at 02:56:17PM +, Zuckerman, Boris wrote:
> To work with persistent memory as efficiently as we can work with RAM we need 
> a bit more than "commit". It's reasonable to expect that we get some 
> additional support from CPUs that goes beyond mfence and mflush. That may 
> include discovery, transactional support, etc. Encapsulating that in a 
> special class sooner than later seams a right thing to do...

If it's something CPU-specific, then we wouldn't handle it as part of
the "class", we'd handle it as an architecture abstraction.  It's only
operations which are device-specific which would need to be exposed
through an operations vector.  For example, suppose you buy one device
from IBM and another device from HP, and plug them both into your SPARC
system.  The code you compile needs to run on SPARC, doing whatever
CPU operations are supported, but if HP and IBM have different ways of
handling a "commit" operation, we need that operation to be part of an
operations vector.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Use ecryptfs_dentry_to_lower_path in a couple of places

2013-06-15 Thread Matthew Wilcox

There are two places in ecryptfs that benefit from using
ecryptfs_dentry_to_lower_path() instead of separate calls to
ecryptfs_dentry_to_lower() and ecryptfs_dentry_to_lower_mnt().  Both
sites use fewer instructions and less stack (determined by examining
objdump output).

Signed-off-by: Matthew Wilcox 

diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c
index a7abbea..0943f6f 100644
--- a/fs/ecryptfs/file.c
+++ b/fs/ecryptfs/file.c
@@ -49,7 +49,7 @@ static ssize_t ecryptfs_read_update_atime(struct kiocb *iocb,
unsigned long nr_segs, loff_t pos)
 {
ssize_t rc;
-   struct path lower;
+   struct path *path;
struct file *file = iocb->ki_filp;
 
rc = generic_file_aio_read(iocb, iov, nr_segs, pos);
@@ -60,9 +60,8 @@ static ssize_t ecryptfs_read_update_atime(struct kiocb *iocb,
if (-EIOCBQUEUED == rc)
rc = wait_on_sync_kiocb(iocb);
if (rc >= 0) {
-   lower.dentry = ecryptfs_dentry_to_lower(file->f_path.dentry);
-   lower.mnt = ecryptfs_dentry_to_lower_mnt(file->f_path.dentry);
-   touch_atime(&lower);
+   path = ecryptfs_dentry_to_lower_path(file->f_path.dentry);
+   touch_atime(path);
}
return rc;
 }
diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
index e924cf4..eb1c597 100644
--- a/fs/ecryptfs/main.c
+++ b/fs/ecryptfs/main.c
@@ -120,16 +120,15 @@ static int ecryptfs_init_lower_file(struct dentry *dentry,
struct file **lower_file)
 {
const struct cred *cred = current_cred();
-   struct dentry *lower_dentry = ecryptfs_dentry_to_lower(dentry);
-   struct vfsmount *lower_mnt = ecryptfs_dentry_to_lower_mnt(dentry);
+   struct path *path = ecryptfs_dentry_to_lower_path(dentry);
int rc;
 
-   rc = ecryptfs_privileged_open(lower_file, lower_dentry, lower_mnt,
+   rc = ecryptfs_privileged_open(lower_file, path->dentry, path->mnt,
  cred);
if (rc) {
printk(KERN_ERR "Error opening lower file "
   "for lower_dentry [0x%p] and lower_mnt [0x%p]; "
-  "rc = [%d]\n", lower_dentry, lower_mnt, rc);
+  "rc = [%d]\n", path->dentry, path->mnt, rc);
(*lower_file) = NULL;
}
return rc;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-20 Thread Matthew Wilcox

A paper at FAST2012
(http://static.usenix.org/events/fast12/tech/full_papers/Yang.pdf) pointed
out the performance overhead of taking interrupts for low-latency block
I/Os.  The solution the author investigated was to spin waiting for each
I/O to complete.  This is inefficient as Linux submits many I/Os which
are not latency-sensitive, and even when we do submit latency-sensitive
I/Os (eg swap-in), we frequently submit several I/Os before waiting.

This RFC takes a different approach, only spinning when we would
otherwise sleep.  To implement this, I add an 'io_poll' function pointer
to backing_dev_info.  I include a sample implementation for the NVMe
driver.  Next, I add an io_wait() function which will call io_poll()
if it is set.  It falls back to calling io_schedule() if anything goes
wrong with io_poll() or the task exceeds its timeslice.  Finally, all
that is left is to judiciously replace calls to io_schedule() with
calls to io_wait().  I think I've covered the main contenders with
sleep_on_page(), sleep_on_buffer() and the DIO path.

I've measured the performance benefits of this with a Chatham NVMe
prototype device and a simple
# dd if=/dev/nvme0n1 of=/dev/null iflag=direct bs=512 count=100
The latency of each I/O reduces by about 2.5us (from around 8.0us to
around 5.5us).  This matches up quite well with the performance numbers
shown in the FAST2012 paper (which used a similar device).

Is backing_dev_info the right place for this function pointer?
Have I made any bad assumptions about the correct way to retrieve
the backing_dev_info from (eg) a struct page, buffer_head or kiocb?
Should I pass a 'state' into io_wait() instead of retrieving it from
'current'?  Is kernel/sched/core.c the right place for io_wait()?
Should it be bdi_wait() instead?  Should it be marked with __sched?
Where should I add documentation for the io_poll() function pointer?

NB: The NVMe driver piece of this is for illustrative purposes only and
should not be applied.  I've rearranged the diff so that the interesting
pieces appear first.

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c388155..97f8fde 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -68,6 +68,7 @@ struct backing_dev_info {
unsigned int capabilities; /* Device capabilities */
congested_fn *congested_fn; /* Function pointer if device is md/dm */
void *congested_data;   /* Pointer to aux data for congested func */
+   int (*io_poll)(struct backing_dev_info *);
 
char *name;
 
@@ -126,6 +127,8 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
 void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
 
+void io_wait(struct backing_dev_info *bdi);
+
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 58453b8..4840065 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4527,6 +4527,36 @@ long __sched io_schedule_timeout(long timeout)
return ret;
 }
 
+/*
+ * Wait for an I/O to complete against this backing_dev_info.  If the
+ * task exhausts its timeslice polling for completions, call io_schedule()
+ * anyway.  If a signal comes pending, return so the task can handle it.
+ * If the io_poll returns an error, give up and call io_schedule(), but
+ * swallow the error.  We may miss an I/O completion (eg if the interrupt
+ * handler gets to it first).  Guard against this possibility by returning
+ * if we've been set back to TASK_RUNNING.
+ */
+void io_wait(struct backing_dev_info *bdi)
+{
+   if (bdi && bdi->io_poll) {
+   long state = current->state;
+   while (!need_resched()) {
+   int ret = bdi->io_poll(bdi);
+   if ((ret > 0) || signal_pending_state(state, current)) {
+   set_current_state(TASK_RUNNING);
+   return;
+   }
+   if (current->state == TASK_RUNNING)
+   return;
+   if (ret < 0)
+   break;
+   cpu_relax();
+   }
+   }
+
+   io_schedule();
+}
+
 /**
  * sys_sched_get_priority_max - return maximum RT priority.
  * @policy: scheduling class.
diff --git a/fs/aio.c b/fs/aio.c
index 2bbcacf..7b20397 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -453,11 +453,15 @@ static void kill_ioctx(struct kioctx *ctx)
  */
 ssize_t wait_on_sync_kiocb(struct kiocb *iocb)
 {
+   struct backing_dev_info *bdi = NULL;
+
+   if (iocb->ki_filp)
+   bdi = iocb->ki_filp->f_mapping->backing_dev_info;
while (atomic_read(&iocb->ki_users)) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (!atomic_read(&iocb->ki_users))
break;
-   io_schedule();
+ 

Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-24 Thread Matthew Wilcox
On Mon, Jun 24, 2013 at 09:15:45AM +0200, Jens Axboe wrote:
> Willy, I think the general design is fine, hooking in via the bdi is the
> only way to get back to the right place from where you need to sleep.
> Some thoughts:
> 
> - This should be hooked in via blk-iopoll, both of them should call into
>   the same driver hook for polling completions.

I actually started working on this, then I realised that it's actually
a bad idea.  blk-iopoll's poll function is to poll the single I/O queue
closest to this CPU.  The iowait poll function is to poll all queues
that the I/O for this address_space might complete on.

I'm reluctant to ask drivers to define two poll functions, but I'm even
more reluctant to ask them to define one function with two purposes.

> - It needs to be more intelligent in when you want to poll and when you
>   want regular irq driven IO.

Oh yeah, absolutely.  While the example patch didn't show it, I wouldn't
enable it for all NVMe devices; only ones with sufficiently low latency.
There's also the ability for the driver to look at the number of
outstanding I/Os and return an error (eg -EBUSY) to stop spinning.

> - With the former note, the app either needs to opt in (and hence
>   willingly sacrifice CPU cycles of its scheduling slice) or it needs to
>   be nicer in when it gives up and goes back to irq driven IO.

Yup.  I like the way you framed it.  If the task *wants* to spend its
CPU cycles on polling for I/O instead of giving up the remainder of its
time slice, then it should be able to do that.  After all, it already can;
it can submit an I/O request via AIO, and then call io_getevents in a
tight loop.

So maybe the right way to do this is with a task flag?  If we go that
route, I'd like to further develop this option to allow I/Os to be
designated as "low latency" vs "normal".  Taking a page fault would be
"low latency" for all tasks, not just ones that choose to spin for I/O.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-24 Thread Matthew Wilcox
On Mon, Jun 24, 2013 at 08:11:02PM -0400, Steven Rostedt wrote:
> What about hooking into the idle_balance code? That happens if we are
> about to go to idle but before the full schedule switch to the idle
> task.
> 
> 
> In __schedule(void):
> 
>   if (unlikely(!rq->nr_running))
>   idle_balance(cpu, rq);

That may be a great place to try it from the PoV of the scheduler, but are
you OK with me threading a struct backing_dev_info * all the way through
the scheduler to idle_balance()?  :-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-24 Thread Matthew Wilcox
On Mon, Jun 24, 2013 at 10:07:51AM +0200, Ingo Molnar wrote:
> I'm wondering, how will this scheme work if the IO completion latency is a 
> lot more than the 5 usecs in the testcase? What if it takes 20 usecs or 
> 100 usecs or more?

There's clearly a threshold at which it stops making sense, and our
current NAND-based SSDs are almost certainly on the wrong side of that
threshold!  I can't wait for one of the "post-NAND" technologies to make
it to market in some form that makes it economical to use in an SSD.

The problem is that some of the people who are looking at those
technologies are crazy.  They want to "bypass the kernel" and "do user
space I/O" because "the kernel is too slow".  This patch is part of an
effort to show them how crazy they are.  And even if it doesn't convince
them, at least users who refuse to rewrite their applications to take
advantage of magical userspace I/O libraries will see real performance
benefits.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PULL REQUEST] NVMe driver fixes

2013-06-11 Thread Matthew Wilcox

Hi Linus,

The following bugfixes have been accumulating for the last few weeks, and
it would be nice if they could be in 3.10.

The following changes since commit ec50f2a97a4a7098a81b40030e0bfe28bdc43740:

  Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux 
(2013-05-16 19:01:46 -0700)

are available in the git repository at:


  git://git.infradead.org/users/willy/linux-nvme.git fixes-3.10

for you to fetch changes up to fa08a396647767abd24a9e7015cb177121d0cf15:

  NVMe: Add MSI support (2013-05-31 11:45:52 -0400)


Chayan Biswas (1):
  Return the result from user admin command IOCTL even in case of failure

Dan Carpenter (2):
  MAINTAINERS: update NVM EXPRESS DRIVER file list
  NVMe: check for integer overflow in nvme_map_user_pages()

Keith Busch (1):
  NVMe: Do not cancel command multiple times

Matthew Wilcox (1):
  NVMe: Use dma_set_mask() correctly

Ramachandra Rao Gajula (1):
  NVMe: Add MSI support

Sachin Kamat (1):
  NVMe: Remove redundant version.h header include

Vishal Verma (1):
  NVMe: Fix a signedness bug in nvme_trans_modesel_get_mp

Wei Yongjun (1):
  NVMe: fix error return code in nvme_submit_bio_queue()

 MAINTAINERS   |2 +-
 drivers/block/nvme-core.c |   62 +++--
 drivers/block/nvme-scsi.c |3 +--
 3 files changed, 50 insertions(+), 17 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/8] mm: drop actor argument of do_generic_file_read()

2013-07-16 Thread Matthew Wilcox
On Mon, Jul 15, 2013 at 01:47:47PM +0300, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" 
> 
> There's only one caller of do_generic_file_read() and the only actor is
> file_read_actor(). No reason to have a callback parameter.
> 
> Signed-off-by: Kirill A. Shutemov 
> Acked-by: Dave Hansen 

Would it make sense to do the same thing to do_shmem_file_read()?

From: Matthew Wilcox 

There's only one caller of do_shmem_file_read() and the only actor is
file_read_actor(). No reason to have a callback parameter.

Signed-off-by: Matthew Wilcox 

diff --git a/mm/shmem.c b/mm/shmem.c
index 5e6a842..6a9c325 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1464,7 +1464,7 @@ shmem_write_end(struct file *file, struct address_space 
*mapping,
return copied;
 }
 
-static void do_shmem_file_read(struct file *filp, loff_t *ppos, 
read_descriptor_t *desc, read_actor_t actor)
+static void do_shmem_file_read(struct file *filp, loff_t *ppos, 
read_descriptor_t *desc)
 {
struct inode *inode = file_inode(filp);
struct address_space *mapping = inode->i_mapping;
@@ -1546,13 +1546,14 @@ static void do_shmem_file_read(struct file *filp, 
loff_t *ppos, read_descriptor_
 * Ok, we have the page, and it's up-to-date, so
 * now we can copy it to user space...
 *
-* The actor routine returns how many bytes were actually used..
+* The file_read_actor routine returns how many bytes were 
actually
+* used..
 * NOTE! This may not be the same as how much of a user buffer
 * we filled up (we may be padding etc), so we can only update
-* "pos" here (the actor routine has to update the user buffer
+* "pos" here (file_read_actor has to update the user buffer
 * pointers and the remaining count).
 */
-   ret = actor(desc, page, offset, nr);
+   ret = file_read_actor(desc, page, offset, nr);
offset += ret;
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;
@@ -1590,7 +1591,7 @@ static ssize_t shmem_file_aio_read(struct kiocb *iocb,
if (desc.count == 0)
continue;
desc.error = 0;
-   do_shmem_file_read(filp, ppos, &desc, file_read_actor);
+   do_shmem_file_read(filp, ppos, &desc);
retval += desc.written;
if (desc.error) {
retval = retval ?: desc.error;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements

2013-08-06 Thread Matthew Wilcox
On Mon, Aug 05, 2013 at 01:17:39PM +0200, Jan Kara wrote:
> On Sun 04-08-13 05:17:03, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" 
> > The radix tree is variable-height, so an insert operation not only has
> > to build the branch to its corresponding item, it also has to build the
> > branch to existing items if the size has to be increased (by
> > radix_tree_extend).
> > @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> >   * The worst case is a zero height tree with just a single item at index 0,
> >   * and then inserting an item at index ULONG_MAX. This requires 2 new 
> > branches
> >   * of RADIX_TREE_MAX_PATH size to be created, with only the root node 
> > shared.
> > + *
> > + * Worst case for adding N contiguous items is adding entries at indexes
> > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single 
> > worst-case
> > + * item plus extra nodes if you cross the boundary from one node to the 
> > next.
> > + *
> >   * Hence:
> >   */
> > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MAX \
> > +   (RADIX_TREE_PRELOAD_MIN + \
> > +DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
>
>   Umm, is this really correct? I see two problems:
> 1) You may need internal tree nodes at various levels but you seem to
> account only for the level 1.
> 2) The rounding doesn't seem right because RADIX_TREE_MAP_SIZE+2 nodes may
> require 3 nodes at level 1 if the indexes are like:
> i_0 | i_1 .. i_{RADIX_TREE_MAP_SIZE} | i_{RADIX_TREE_MAP_SIZE+1}
> ^^
> node boundarynode boundary
> 
>   Otherwise the patch looks good.

You are correct that in the fully general case, these things are needed,
and the patch undercounts the number of nodes needed.  However, in the
specific case of THP pagecache, insertions are naturally aligned, and
we end up needing very few internal nodes (so few that we've never hit
the end of this array in some fairly heavy testing).

There are two penalties for getting the general case correct.  One is
that the calculation becomes harder to understand, and the other is
that we consume more per-CPU memory.  I think we should document that
the current code requires "natural alignment", and include a note about
what things will need to change in order to support arbitrary alignment
in case anybody needs to do it in the future, but not include support
for arbitrary alignment in this patchset.

What do you think?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/4] fsfreeze: manage kill signal when sb_start_pagefault is called

2013-04-06 Thread Matthew Wilcox
On Sat, Apr 06, 2013 at 12:05:52PM +0200, Marco Stornelli wrote:
> In every place where sb_start_pagefault was called now we must manage
> the error code and return VM_FAULT_RETRY.

Erm ... in patch 1/4:

 static inline void sb_start_pagefault(struct super_block *sb)
 {
-   __sb_start_write(sb, SB_FREEZE_PAGEFAULT, true);
+   __sb_start_write_wait(sb, SB_FREEZE_PAGEFAULT, false);
 }

>  
> - sb_start_pagefault(inode->i_sb);
> + ret = sb_start_pagefault(inode->i_sb);
> + if (ret)
> + return VM_FAULT_RETRY;
>   ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);

Does the compiler not warn that you're assigning void to 'ret'?  Or was
there some other SNAFU sending these patches?

-- 
Matthew Wilcox  Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/4] fsfreeze: manage kill signal when sb_start_write is called

2013-04-06 Thread Matthew Wilcox
On Sat, Apr 06, 2013 at 12:04:52PM +0200, Marco Stornelli wrote:
> In every place where sb_start_write was called now we must manage
> the error code and return -EINTR.

If we must manage the error code, then these functions should be marked
__must_check.

-- 
Matthew Wilcox  Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PULL REQUEST] NVMe driver updates

2013-05-09 Thread Matthew Wilcox

Hi Linus,

Lots of exciting new features in the NVM Express driver this time,
including support for emulating SCSI commands, discard support and the
ability to submit per-sector metadata with I/Os.  It's still mostly
bugfixes though!

The following changes since commit a12183c62717ac4579319189a00f5883a18dff08:

  Merge branch 'timers-urgent-for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (2013-03-25 18:03:34 
-0700)

are available in the git repository at:


  git://git.infradead.org/users/willy/linux-nvme.git master

for you to fetch changes up to 94f370cab6e5ac514b658c6b2b3aa308cefc5c7a:

  NVMe: Use user defined admin ioctl timeout (2013-05-09 16:03:50 -0400)


Arjan van de Ven (2):
  NVMe: Use round_jiffies_relative() for the periodic, once-per-second timer
  NVMe: Set TASK_INTERRUPTIBLE before processing queues

Keith Busch (13):
  NVMe: Add discard support for capable devices
  NVMe: Add a character device for each nvme device
  NVMe: queue usage fixes in nvme-scsi
  NVMe: Add scsi unmap to SG_IO
  NVMe: Free admin queue on request_irq error
  NVMe: Fix error clean-up on nvme_alloc_queue
  NVMe: Check for NULL memory in nvme_dev_add
  NVMe: Remove dead code in nvme_dev_add
  NVMe: Split non-mergeable bio requests
  NVMe: Device specific stripe size handling
  NVMe: Meta-data support in NVME_IOCTL_SUBMIT_IO
  NVMe: Schedule timeout for sync commands
  NVMe: Use user defined admin ioctl timeout

Matthew Wilcox (7):
  NVMe: Abstract out sector to block number conversion
  NVMe: Don't fail initialisation unnecessarily
  NVMe: Fix I/O cancellation status on big-endian machines
  NVMe: Fix endian-related problems in user I/O submission path
  NVMe: Wait for device to acknowledge shutdown
  NVMe: Only clear the enable bit when disabling controller
  NVMe: Simplify Firmware Activate code slightly

Vishal Verma (5):
  NVMe: Rename nvme.c to nvme-core.c
  NVMe: Move structures & definitions to header file
  NVMe: Add definitions for format command
  NVMe: Add nvme-scsi.c
  NVMe: Fix sparse warnings in scsi emulation

 drivers/block/Makefile|1 +
 drivers/block/{nvme.c => nvme-core.c} |  594 +--
 drivers/block/nvme-scsi.c | 3053 +
 include/linux/nvme.h  |  158 +-
 4 files changed, 3640 insertions(+), 166 deletions(-)
 rename drivers/block/{nvme.c => nvme-core.c} (79%)
 create mode 100644 drivers/block/nvme-scsi.c


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/4] fsfreeze: added new file_start_write_killable

2013-04-26 Thread Matthew Wilcox
On Fri, Apr 26, 2013 at 10:50:52AM +0200, Marco Stornelli wrote:
> Replace file_start_write with file_start_write_killable where
> possible.

I feel like I'm missing context here.  Possibly because you only cc'd me
on patch 2/4.  In particular, file_start_write doesn't exist upstream,
so I'm not sure what it's for.  But returning 1 for non-regular files
looks dodgy:

> +static inline int file_start_write_killable(struct file *file)
> +{
> + if (!S_ISREG(file_inode(file)->i_mode))
> + return 1;
> + return sb_start_write_killable(file_inode(file)->i_sb);
> +}

> +++ b/fs/aio.c
> @@ -1103,8 +1103,11 @@ static ssize_t aio_rw_vect_retry(struct kiocb *iocb, 
> int rw, aio_rw_op *rw_op)
>   if (iocb->ki_pos < 0)
>   return -EINVAL;
>  
> - if (rw == WRITE)
> - file_start_write(file);
> + if (rw == WRITE) {
> + ret = file_start_write_killable(file);
> + if (ret < 0)
> + return ret;
> + }
>   do {

So ... it's OK to do this write to pipes/directories/devices/... ?  Or is
that check always taken care of elsewhere?  If so, why do we need this
check?  I'm confused.  None of the callers check for the 'ret == 1' case,
so I'm sure there's something wrong here, I just can't tell what it is.

> +++ b/fs/read_write.c
> @@ -438,17 +438,19 @@ ssize_t vfs_write(struct file *file, const char __user 
> *buf, size_t count, loff_
>   ret = rw_verify_area(WRITE, file, pos, count);
>   if (ret >= 0) {
>   count = ret;
> - file_start_write(file);
> - if (file->f_op->write)
> - ret = file->f_op->write(file, buf, count, pos);
> - else
> - ret = do_sync_write(file, buf, count, pos);
> + ret = file_start_write_killable(file);
>   if (ret > 0) {
> - fsnotify_modify(file);
> - add_wchar(current, ret);
> + if (file->f_op->write)
> + ret = file->f_op->write(file, buf, count, pos);
> + else
> + ret = do_sync_write(file, buf, count, pos);
> + if (ret > 0) {
> + fsnotify_modify(file);
> + add_wchar(current, ret);
> + }
> + inc_syscw(current);
> + file_end_write(file);
>   }
> - inc_syscw(current);
> - file_end_write(file);
>   }
>  
>   return ret;

I don't like it that you've increased the indentation here.  Better to do
a preliminary patch which just converts to our normal style with gotos.  ie:

-   if (ret >= 0) {
-   count = ret;
-   file_start_write(file);
-   if (file->f_op->write)
-   ret = file->f_op->write(file, buf, count, pos);
-   else
-   ret = do_sync_write(file, buf, count, pos);
-   if (ret > 0) {
-   fsnotify_modify(file);
-   add_wchar(current, ret);
-   }
-   inc_syscw(current);
-   file_end_write(file);
+   if (ret < 0)
+   goto out;
+   count = ret;
+   file_start_write(file);
+   if (file->f_op->write)
+   ret = file->f_op->write(file, buf, count, pos);
+   else
+   ret = do_sync_write(file, buf, count, pos);
+   if (ret > 0) {
+   fsnotify_modify(file);
+   add_wchar(current, ret);
}
+   inc_syscw(current);
+   file_end_write(file);
+ out: 
return ret;

and then this patch would just add another if ... goto out case.

-- 
Matthew Wilcox  Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] MAINTAINERS: update NVM EXPRESS DRIVER file list

2013-05-17 Thread Matthew Wilcox
On Mon, May 13, 2013 at 06:00:42PM +0300, Dan Carpenter wrote:
> There isn't an nvme.c file any more.  It has been split into multiple
> files.
> 
> Signed-off-by: Dan Carpenter 

Applied
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch -resend] NVMe: check for integer overflow in nvme_map_user_pages()

2013-05-17 Thread Matthew Wilcox
On Mon, May 13, 2013 at 05:59:50PM +0300, Dan Carpenter wrote:
> You need to have CAP_SYS_ADMIN to trigger this overflow but it makes the
> static checkers complain so we should fix it.  The worry is that
> "length" comes from copy_from_user() so we need to check that "length +
> offset" can't overflow.
> 
> I also changed the min_t() cast to be unsigned instead of signed.  Now
> that we cap "length" to INT_MAX it doesn't make a difference, but it's a
> little easier for reviewers to know that large values aren't cast to
> negative.
> 
> Signed-off-by: Dan Carpenter 

Applied
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] NVMe: fix error return code in nvme_submit_bio_queue()

2013-05-17 Thread Matthew Wilcox
On Mon, May 13, 2013 at 10:29:04PM +0800, Wei Yongjun wrote:
> From: Wei Yongjun 
> 
> Fix to return -ENOMEM in the nvem iod alloc error handling case instead
> of 0(possible overwrite to 0 by above nvme_submit_flush_data()), as done
> elsewhere in this function.
> 
> Signed-off-by: Wei Yongjun 

Applied, slightly modified to fit the style a little better.

> v1 -> v2: send to mail list from MAINTAINERS
> ---
>  drivers/block/nvme-core.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> index 8efdfaa..750c337 100644
> --- a/drivers/block/nvme-core.c
> +++ b/drivers/block/nvme-core.c
> @@ -629,7 +629,7 @@ static int nvme_submit_bio_queue(struct nvme_queue 
> *nvmeq, struct nvme_ns *ns,
>   struct nvme_command *cmnd;
>   struct nvme_iod *iod;
>   enum dma_data_direction dma_dir;
> - int cmdid, length, result = -ENOMEM;
> + int cmdid, length, result;
>   u16 control;
>   u32 dsmgmt;
>   int psegs = bio_phys_segments(ns->queue, bio);
> @@ -641,8 +641,10 @@ static int nvme_submit_bio_queue(struct nvme_queue 
> *nvmeq, struct nvme_ns *ns,
>   }
>  
>   iod = nvme_alloc_iod(psegs, bio->bi_size, GFP_ATOMIC);
> - if (!iod)
> + if (!iod) {
> + result = -ENOMEM;
>   goto nomem;
> + }
>   iod->private = bio;
>  
>   result = -EBUSY;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] NVMe: Fix incompatible pointer type warnings

2013-05-17 Thread Matthew Wilcox
On Fri, May 17, 2013 at 04:09:40PM +0200, Emil Goode wrote:
> Commit 6a4d55146334 ("block: prep work for batch completion")
> added a struct batch_complete * as a third argument to bi_end_io.
> This patch adds the missing third argument to the nvme_bio_pair_endio
> function.

That patch isn't in Linus' tree yet, so we can't apply this patch.  It needs
to go to linux-next.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 2/2] mm: Batch page_check_references in shrink_page_list sharing the same i_mmap_mutex

2012-08-23 Thread Matthew Wilcox
On Tue, Aug 21, 2012 at 05:48:20PM -0700, Tim Chen wrote:
> Thanks to Matthew's suggestions on improving the patch. Here's the
> updated version.  It seems to be sane when I booted my machine up.  I
> will put it through more testing when I get a chance.

Looks good.

> +int __try_to_munlock(struct page *page)

Nit: I think this can be static.  There aren't any users of it other
than try_to_munlock() itself.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: pci-disable-decoding-during-sizing-of-bars.patch no longer needed

2008-02-16 Thread Matthew Wilcox
On Sat, Feb 16, 2008 at 09:50:51PM +, Daniel Drake wrote:
> Hi,
> 
> The patch titled pci-disable-decoding-during-sizing-of-bars.patch in -mm 
> was previously used as a candidate to fix a boot hang with Intel's Q35 
> chipset: https://bugs.gentoo.org/show_bug.cgi?id=198810
> 
> However, that particular issue is solved by commit a0ca9909609 in Linus 
> tree:
>   PCI x86: always use conf1 to access config space below 256 bytes
> 
> So unless there are other reasons for keeping 
> pci-disable-decoding-during-sizing-of-bars around, I think we can drop it.

I agree it should be dropped.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/5] signal(ia64): add a signal stack overflow check

2008-02-18 Thread Matthew Wilcox
On Mon, Feb 18, 2008 at 06:26:23PM +0800, Shi Weihua wrote:
> + if (!rbs_on_sig_stack(scr->pt.ar_bspstore))
> + new_rbs = (current->sas_ss_sp +
> +   sizeof(long) - 1) & ~(sizeof(long) - 
> 1);

I know you're only moving this code, but how about fixing it to use
ALIGN at the same time?

+   if (!rbs_on_sig_stack(scr->pt.ar_bspstore))
+   new_rbs = ALIGN(current->sas_ss_sp,
+   sizeof(long));

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 03/10] PCI: AMD SATA IDE mode quirk

2008-02-21 Thread Matthew Wilcox
On Thu, Feb 21, 2008 at 03:47:33PM -0800, Greg Kroah-Hartman wrote:
> +static void __devinit quirk_amd_ide_mode(struct pci_dev *pdev)
>  {
> - /* set sb600 sata to ahci mode */
> - if ((pdev->class >> 8) == PCI_CLASS_STORAGE_IDE) {
> - u8 tmp;
> + /* set sb600/sb700/sb800 sata to ahci mode */
> + u8 tmp;
>  
> + pci_read_config_byte(pdev, PCI_CLASS_DEVICE, &tmp);
> + if (tmp == 0x01) {
>   pci_read_config_byte(pdev, 0x40, &tmp);

This seems like a dis-improvement.  Why are we reading a config byte for
something we already have in the pci_dev?  Why are we now checking
against 0x01 instead of a symbolic constant?  Why are we no longer
checking that this is PCI_BASE_CLASS_STORAGE?

> -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_IXP600_SATA, 
> quirk_sb600_sata);
> -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_IXP700_SATA, 
> quirk_sb600_sata);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_IXP600_SATA, 
> quirk_amd_ide_mode);
> +DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_IXP600_SATA, 
> quirk_amd_ide_mode);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_IXP700_SATA, 
> quirk_amd_ide_mode);
> +DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_IXP700_SATA, 
> quirk_amd_ide_mode);

Nothing in the changelog entry suggests why we now need FIXUP_RESUME
entries when we didn't before.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 03/10] PCI: AMD SATA IDE mode quirk

2008-02-22 Thread Matthew Wilcox
On Fri, Feb 22, 2008 at 01:49:20PM +0800, Cai, Crane wrote:
> > On Thu, Feb 21, 2008 at 03:47:33PM -0800, Greg Kroah-Hartman wrote:
> > > +static void __devinit quirk_amd_ide_mode(struct pci_dev *pdev)
> > >  {
> > > - /* set sb600 sata to ahci mode */
> > > - if ((pdev->class >> 8) == PCI_CLASS_STORAGE_IDE) {
> > > - u8 tmp;
> > > + /* set sb600/sb700/sb800 sata to ahci mode */
> > > + u8 tmp;
> > >  
> > > + pci_read_config_byte(pdev, PCI_CLASS_DEVICE, &tmp);
> > > + if (tmp == 0x01) {
> > >   pci_read_config_byte(pdev, 0x40, &tmp);
> > 
> > This seems like a dis-improvement.  Why are we reading a 
> > config byte for something we already have in the pci_dev?  
> > Why are we now checking against 0x01 instead of a symbolic 
> > constant?  Why are we no longer checking that this is 
> > PCI_BASE_CLASS_STORAGE?
> It is a quirk. In pci_ids.h did have PCI_CLASS_STORAGE_IDE 
> and PCI_BASE_CLASS_STORAGE, these can not represent 
> the right situation we want to check. 0x01 represents 
> PCI_CLASS_STORAGE_IDE last 2 bit. Also because it 
> is a quirk, I do not think we need to change pci_ids.h. So 0x01 
> used. 

You haven't explained what is wrong with the original code:

if ((pdev->class >> 8) == PCI_CLASS_STORAGE_IDE) {

> > Nothing in the changelog entry suggests why we now need 
> > FIXUP_RESUME entries when we didn't before.
> > 
> PCI configuration space will be changed by BIOS and then in pci
> init and restore. So resume also needed.

That information needed to be in the changelog.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Pull request: TASK_KILLABLE

2008-02-05 Thread Matthew Wilcox
On Thu, Jan 31, 2008 at 06:02:05PM -0800, Andrew Morton wrote:
> No such export was needed in the patches which I added to -mm.  So
> something changed between then and now.

Not sure abouut that problem -- still on holiday, so just checking ym
mail quickly.

> And going back through the mailing list all I can find is a series of five
> patches in October - it's unclear where and when the other 17 were
> reviewed, if they were.

A large number of these patches are just a resplit of the patches sent
back in October -- you complained they weren't split up enough.  So I
resplit them.  And sent them to you.  Asking if this was how you
preferred it.  Which you didn't reply to.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Pull request: DMA pool updates

2008-02-05 Thread Matthew Wilcox
Hi Linus,

Could I ask you to pull the DMA Pool changes detailed below?

All the patches have been posted to linux-kernel before, and various
comments (and acks) have been taken into account.  (see
http://thread.gmane.org/gmane.linux.kernel/609943)

It's a fairly nice performance improvement, so would be good to get in.
It's survived a few hours of *mumble* high-stress database benchmark,
so I have high confidence in its stability.

The following changes since commit 21511abd0a248a3f225d3b611cfabb93124605a7:
  Linus Torvalds (1):
Merge branch 'release' of git://git.kernel.org/.../aegl/linux-2.6

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc.git dmapool

Matthew Wilcox (7):
  Move dmapool.c to mm/ directory
  dmapool: Fix style problems
  Avoid taking waitqueue lock in dmapool
  dmapool: Validate parameters to dma_pool_create
  dmapool: Tidy up includes and add comments
  Change dmapool free block management
  pool: Improve memory usage for devices which can't cross boundaries

 drivers/base/Makefile  |2 +-
 drivers/base/dmapool.c |  481 --
 mm/Makefile|1 +
 mm/dmapool.c   |  500 
 4 files changed, 502 insertions(+), 482 deletions(-)
 delete mode 100644 drivers/base/dmapool.c
 create mode 100644 mm/dmapool.c

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: down_killable implementations for every architecture

2008-02-05 Thread Matthew Wilcox
On Tue, Jan 29, 2008 at 05:23:52AM +0100, Andi Kleen wrote:
> On Tuesday 29 January 2008 00:19, Matthew Wilcox wrote:
> > As part of the TASK_KILLABLE changes, we're going to need
> > down_killable().  Unfortunately, semaphores are implemented for every
> > architecture, which we should probably fix at some point.
> 
> It would be best to just change it now before doing further changes. Right now
> we have the bizarre situation that semaphores are more optimized
> with fast path inline assembly code than the far more critical spinlocks.
> But that clearly doesn't make much sense. So the best approach would
> be likely to just pick some generic C implementation from some architecture
> and use it everywhere.

We don't really have an appropriate one.  So I've invented my own.

 104 files changed, 338 insertions(+), 7335 deletions(-)

(and 228k).  That seems inappropriate to post.

Here's the whole patch (includes deletions from every architecture):
http://www.parisc-linux.org/~willy/generic-semaphore.diff

Here's the interesting/useful bit.  I've only tested it briefly on my
laptop -- it could be full of holes, but I've tried very hard to think
of all the interesting edge/race conditions with multiple
sleepers/wakers.

Please review.  I won't be around to respond to comments for another
four days.

diff --git a/include/linux/semaphore.h b/include/linux/semaphore.h
new file mode 100644
index 000..8e563bc
--- /dev/null
+++ b/include/linux/semaphore.h
@@ -0,0 +1,81 @@
+/*
+ * Copyright (c) 2008 Intel Corporation
+ * Author: Matthew Wilcox <[EMAIL PROTECTED]>
+ *
+ * Distributed under the terms of the GNU GPL, version 2
+ *
+ * Counting semaphores allow up to  tasks to acquire the semaphore
+ * simultaneously.
+ */
+#ifndef __LINUX_SEMAPHORE_H
+#define __LINUX_SEMAPHORE_H
+
+#include 
+#include 
+
+/*
+ * The spinlock controls access to the other members of the semaphore.
+ * 'count' is decremented by every task which calls down*() and incremented
+ * by every call to up().  Thus, if it is positive, it indicates how many
+ * more tasks may acquire the lock.  If it is negative, it indicates how
+ * many tasks are waiting for the lock.  Tasks waiting for the lock are
+ * kept on the wait_list.
+ */
+struct semaphore {
+   spinlock_t  lock;
+   int count;
+   struct list_headwait_list;
+};
+
+#define __SEMAPHORE_INITIALIZER(name, n)   \
+{  \
+   .lock   = __SPIN_LOCK_UNLOCKED((name).lock),\
+   .count  = n,\
+   .wait_list  = LIST_HEAD_INIT((name).wait_list), \
+}
+
+#define __DECLARE_SEMAPHORE_GENERIC(name, count) \
+   struct semaphore name = __SEMAPHORE_INITIALIZER(name, count)
+
+#define DECLARE_MUTEX(name)__DECLARE_SEMAPHORE_GENERIC(name, 1)
+ 
+static inline void sema_init(struct semaphore *sem, int val)
+{
+   *sem = (struct semaphore) __SEMAPHORE_INITIALIZER(*sem, val);
+}
+
+#define init_MUTEX(sem)sema_init(sem, 1)
+#define init_MUTEX_LOCKED(sem) sema_init(sem, 0)
+
+/*
+ * Attempt to acquire the semaphore.  If another task is already holding the
+ * semaphore, sleep until the semaphore is released.
+ */
+extern void fastcall down(struct semaphore *sem);
+
+/*
+ * As down(), except the sleep may be interrupted by a signal.  If it is,
+ * this function will return -EINTR.
+ */
+extern int __must_check fastcall down_interruptible(struct semaphore *sem);
+
+/*
+ * As down_interruptible(), except the sleep may only be interrupted by
+ * signals which are fatal to this process.
+ */
+extern int __must_check fastcall down_killable(struct semaphore *sem);
+
+/*
+ * As down, except this function will not sleep.  It will return 0 if it
+ * acquired the semaphore and 1 if the semaphore was contended.  This
+ * function may be called from any context, including interrupt and softirq.
+ */
+extern int __must_check fastcall down_trylock(struct semaphore *sem);
+
+/*
+ * Release the semaphore.  Unlike mutexes, up() may be called from any
+ * context and even by tasks which have never called down().
+ */
+extern void fastcall up(struct semaphore *sem);
+
+#endif /* __LINUX_SEMAPHORE_H */
diff --git a/kernel/semaphore.c b/kernel/semaphore.c
new file mode 100644
index 000..94f65a1
--- /dev/null
+++ b/kernel/semaphore.c
@@ -0,0 +1,208 @@
+/*
+ * Copyright (c) 2008 Intel Corporation
+ * Author: Matthew Wilcox <[EMAIL PROTECTED]>
+ *
+ * Distributed under the terms of the GNU GPL, version 2
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Some notes on the implementation:
+ *
+ * down_trylock() and up() can be called from interrupt context.
+ * So we have to disable interrupts when taking the lock.
+ *
+

Re: [PATCH] Change pci_raw_ops to pci_raw_read/write

2008-02-09 Thread Matthew Wilcox
On Thu, Feb 07, 2008 at 10:54:05AM -0500, Tony Camuso wrote:
> Matthew,
> 
> Perhaps I missed it, but did you address Yinghai's concerns?

No, I was on holiday.

> Yinghai Lu wrote:
> >On Jan 28, 2008 7:03 PM, Matthew Wilcox <[EMAIL PROTECTED]> wrote:
> >>
> >>-int pci_conf1_write(unsigned int seg, unsigned int bus,
> >>+static int pci_conf1_write(unsigned int seg, unsigned int bus,
> >>   unsigned int devfn, int reg, int len, u32 
> >>   value)
> >
> >any reason to change pci_conf1_read/write to static?

Yes -- it no longer needs to be called from outside this file.

> >>+config ATA_RAM
> >>+   tristate "ATA RAM driver"
> >>+
> >
> >related?

No.  An unrelated patch that I didn't trim out.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Change pci_raw_ops to pci_raw_read/write

2008-02-10 Thread Matthew Wilcox
On Sat, Feb 09, 2008 at 11:21:16PM -0800, Greg KH wrote:
> Can I get a revised version of this, without the incorrect hunk?

Sure.  I've even rebased it against current HEAD.  Damn whitespace
cleanup introducing unnecessary conflicts 

I suggest Ivan's patch be merged ASAP as it actually fixes bugs.
This patch is just cleanup (and takes care of some future concerns).

>From ad4c3f135cda6f5210735231d30ef8e9dbd58c7c Mon Sep 17 00:00:00 2001
From: Matthew Wilcox <[EMAIL PROTECTED]>
Date: Sun, 10 Feb 2008 09:45:28 -0500
Subject: [PATCH] Change pci_raw_ops to pci_raw_read/write

We want to allow different implementations of pci_raw_ops for standard
and extended config space on x86.  Rather than clutter generic code with
knowledge of this, we make pci_raw_ops private to x86 and use it to
implement the new raw interface -- raw_pci_read() and raw_pci_write().

Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]>
---
 arch/ia64/pci/pci.c   |   25 -
 arch/ia64/sn/pci/tioce_provider.c |   16 
 arch/x86/kernel/quirks.c  |2 +-
 arch/x86/pci/common.c |   25 +++--
 arch/x86/pci/direct.c |4 ++--
 arch/x86/pci/fixup.c  |6 --
 arch/x86/pci/legacy.c |2 +-
 arch/x86/pci/mmconfig-shared.c|6 +++---
 arch/x86/pci/mmconfig_32.c|   10 ++
 arch/x86/pci/mmconfig_64.c|8 +---
 arch/x86/pci/pci.h|   15 +++
 arch/x86/pci/visws.c  |3 ---
 drivers/acpi/osl.c|   25 ++---
 include/linux/pci.h   |   16 
 14 files changed, 78 insertions(+), 85 deletions(-)

diff --git a/arch/ia64/pci/pci.c b/arch/ia64/pci/pci.c
index 488e48a..8fd7e82 100644
--- a/arch/ia64/pci/pci.c
+++ b/arch/ia64/pci/pci.c
@@ -43,8 +43,7 @@
 #define PCI_SAL_EXT_ADDRESS(seg, bus, devfn, reg)  \
(((u64) seg << 28) | (bus << 20) | (devfn << 12) | (reg))
 
-static int
-pci_sal_read (unsigned int seg, unsigned int bus, unsigned int devfn,
+int raw_pci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
  int reg, int len, u32 *value)
 {
u64 addr, data = 0;
@@ -68,8 +67,7 @@ pci_sal_read (unsigned int seg, unsigned int bus, unsigned 
int devfn,
return 0;
 }
 
-static int
-pci_sal_write (unsigned int seg, unsigned int bus, unsigned int devfn,
+int raw_pci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
   int reg, int len, u32 value)
 {
u64 addr;
@@ -91,24 +89,17 @@ pci_sal_write (unsigned int seg, unsigned int bus, unsigned 
int devfn,
return 0;
 }
 
-static struct pci_raw_ops pci_sal_ops = {
-   .read = pci_sal_read,
-   .write =pci_sal_write
-};
-
-struct pci_raw_ops *raw_pci_ops = &pci_sal_ops;
-
-static int
-pci_read (struct pci_bus *bus, unsigned int devfn, int where, int size, u32 
*value)
+static int pci_read(struct pci_bus *bus, unsigned int devfn, int where,
+   int size, u32 *value)
 {
-   return raw_pci_ops->read(pci_domain_nr(bus), bus->number,
+   return raw_pci_read(pci_domain_nr(bus), bus->number,
 devfn, where, size, value);
 }
 
-static int
-pci_write (struct pci_bus *bus, unsigned int devfn, int where, int size, u32 
value)
+static int pci_write(struct pci_bus *bus, unsigned int devfn, int where,
+   int size, u32 value)
 {
-   return raw_pci_ops->write(pci_domain_nr(bus), bus->number,
+   return raw_pci_write(pci_domain_nr(bus), bus->number,
  devfn, where, size, value);
 }
 
diff --git a/arch/ia64/sn/pci/tioce_provider.c 
b/arch/ia64/sn/pci/tioce_provider.c
index e1a3e19..999f14f 100644
--- a/arch/ia64/sn/pci/tioce_provider.c
+++ b/arch/ia64/sn/pci/tioce_provider.c
@@ -752,13 +752,13 @@ tioce_kern_init(struct tioce_common *tioce_common)
 * Determine the secondary bus number of the port2 logical PPB.
 * This is used to decide whether a given pci device resides on
 * port1 or port2.  Note:  We don't have enough plumbing set up
-* here to use pci_read_config_xxx() so use the raw_pci_ops vector.
+* here to use pci_read_config_xxx() so use raw_pci_read().
 */
 
seg = tioce_common->ce_pcibus.bs_persist_segment;
bus = tioce_common->ce_pcibus.bs_persist_busnum;
 
-   raw_pci_ops->read(seg, bus, PCI_DEVFN(2, 0), PCI_SECONDARY_BUS, 1,&tmp);
+   raw_pci_read(seg, bus, PCI_DEVFN(2, 0), PCI_SECONDARY_BUS, 1,&tmp);
tioce_kern->ce_port1_secondary = (u8) tmp;
 
/*
@@ -799,11 +799,11 @@ tioce_kern_init(struct tioce_common *tioce_common)
 
/* mem base/limit */
 
-   raw_pci_o

Re: [PATCH] Change pci_raw_ops to pci_raw_read/write

2008-02-10 Thread Matthew Wilcox
On Sun, Feb 10, 2008 at 12:13:13PM -0700, Grant Grundler wrote:
> Just wondering...why don't we just pass "struct bus*" through to the
> raw_pci* ops?
> My thinking is if a PCI bus controller or bridge is discovered, then we should
> always create a matching "struct bus *".

ACPI may need to access PCI config space before we've done a PCI bus
walk.  There's an opregion that AML may access that is for PCI config
space, and an apparently unrelated method might happen to contain such a
piece of AML.

> > --- a/arch/x86/kernel/quirks.c
> > +++ b/arch/x86/kernel/quirks.c
> > @@ -27,7 +27,7 @@ static void __devinit quirk_intel_irqbalance(struct 
> > pci_dev *dev)
> > pci_write_config_byte(dev, 0xf4, config|0x2);
> >  
> > /* read xTPR register */
> > -   raw_pci_ops->read(0, 0, 0x40, 0x4c, 2, &word);
> > +   raw_pci_read(0, 0, 0x40, 0x4c, 2, &word);
> 
> Why are we using raw_pci_read here instead of pci_read_config_dword()?
> If the pci_write_config_byte() above works, then I expect the read
> to work too.

I have no idea.  I didn't want to change the semantics in this patch.
Presumably the original author would have an idea why they needed to do
this.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Change pci_raw_ops to pci_raw_read/write

2008-02-10 Thread Matthew Wilcox
On Sun, Feb 10, 2008 at 12:16:43PM -0800, Yinghai Lu wrote:
> On Feb 10, 2008 6:51 AM, Matthew Wilcox <[EMAIL PROTECTED]> wrote:
> > On Sat, Feb 09, 2008 at 11:21:16PM -0800, Greg KH wrote:
> > > Can I get a revised version of this, without the incorrect hunk?
> >
> > Sure.  I've even rebased it against current HEAD.  Damn whitespace
> > cleanup introducing unnecessary conflicts 
> >
> > I suggest Ivan's patch be merged ASAP as it actually fixes bugs.
> > This patch is just cleanup (and takes care of some future concerns).
> 
> your patch and Ivan's patch should be merged in one...

Why?

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Change pci_raw_ops to pci_raw_read/write

2008-02-10 Thread Matthew Wilcox
On Sun, Feb 10, 2008 at 12:24:18PM -0800, Linus Torvalds wrote:
> On Sun, 10 Feb 2008, Yinghai Lu wrote:
> > >
> > > I suggest Ivan's patch be merged ASAP as it actually fixes bugs.
> > > This patch is just cleanup (and takes care of some future concerns).
> > 
> > your patch and Ivan's patch should be merged in one...
> 
> I really don't care whether they get merges as one or separately, but I 
> think it should be merged _now_ (-rc1 is already delayed), and I'd like to 
> see the final versions of both. Does anybody have them in a final 
> agreed-upon format (preferably with that oddness in quirk_intel_irqbalance 
> also fixed?)

I just looked at fixing that -- the reason seems to be that we don't
actually have the struct pci_dev at that point.  I can fix it, but I
think it's actually buggy.  I want to look at some chipset docs to
confirm that though.

I've attached the two patches that I believe are the ones we want.  We
can (and should) fix quirk_intel_irqbalance separately.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
>From fc1e24786c764bf7e7a73128f839b58a0559739c Mon Sep 17 00:00:00 2001
From: Ivan Kokshaysky <[EMAIL PROTECTED]>
Date: Mon, 14 Jan 2008 17:31:09 -0500
Subject: [PATCH] PCI x86: always use conf1 to access config space below 256 
bytes

Thanks to Loic Prylli <[EMAIL PROTECTED]>, who originally proposed
this idea.

Always using legacy configuration mechanism for the legacy config space
and extended mechanism (mmconf) for the extended config space is
a simple and very logical approach. It's supposed to resolve all
known mmconf problems. It still allows per-device quirks (tweaking
dev->cfg_size). It also allows to get rid of mmconf fallback code.

Signed-off-by: Ivan Kokshaysky <[EMAIL PROTECTED]>
Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]>

---
 arch/x86/pci/mmconfig-shared.c |   35 ---
 arch/x86/pci/mmconfig_32.c |   22 +-
 arch/x86/pci/mmconfig_64.c |   22 ++
 arch/x86/pci/pci.h |7 ---
 4 files changed, 19 insertions(+), 67 deletions(-)

diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c
index 4df637e..6b521d3 100644
--- a/arch/x86/pci/mmconfig-shared.c
+++ b/arch/x86/pci/mmconfig-shared.c
@@ -22,42 +22,9 @@
 #define MMCONFIG_APER_MIN  (2 * 1024*1024)
 #define MMCONFIG_APER_MAX  (256 * 1024*1024)
 
-DECLARE_BITMAP(pci_mmcfg_fallback_slots, 32*PCI_MMCFG_MAX_CHECK_BUS);
-
 /* Indicate if the mmcfg resources have been placed into the resource table. */
 static int __initdata pci_mmcfg_resources_inserted;
 
-/* K8 systems have some devices (typically in the builtin northbridge)
-   that are only accessible using type1
-   Normally this can be expressed in the MCFG by not listing them
-   and assigning suitable _SEGs, but this isn't implemented in some BIOS.
-   Instead try to discover all devices on bus 0 that are unreachable using MM
-   and fallback for them. */
-static void __init unreachable_devices(void)
-{
-   int i, bus;
-   /* Use the max bus number from ACPI here? */
-   for (bus = 0; bus < PCI_MMCFG_MAX_CHECK_BUS; bus++) {
-   for (i = 0; i < 32; i++) {
-   unsigned int devfn = PCI_DEVFN(i, 0);
-   u32 val1, val2;
-
-   pci_conf1_read(0, bus, devfn, 0, 4, &val1);
-   if (val1 == 0x)
-   continue;
-
-   if (pci_mmcfg_arch_reachable(0, bus, devfn)) {
-   raw_pci_ops->read(0, bus, devfn, 0, 4, &val2);
-   if (val1 == val2)
-   continue;
-   }
-   set_bit(i + 32 * bus, pci_mmcfg_fallback_slots);
-   printk(KERN_NOTICE "PCI: No mmconfig possible on device"
-  " %02x:%02x\n", bus, i);
-   }
-   }
-}
-
 static const char __init *pci_mmcfg_e7520(void)
 {
u32 win;
@@ -270,8 +237,6 @@ void __init pci_mmcfg_init(int type)
return;
 
if (pci_mmcfg_arch_init()) {
-   if (type == 1)
-   unreachable_devices();
if (known_bridge)
pci_mmcfg_insert_resources(IORESOURCE_BUSY);
pci_probe = (pci_probe & ~PCI_PROBE_MASK) | PCI_PROBE_MMCONF;
diff --git a/arch/x86/pci/mmconfig_32.c b/arch/x86/pci/mmconfig_32.c
index 1bf5816..7b75e65 100644
--- a/arch/x86/pci/mmconfig_32.c
+++ b/arch/x86/pci/mmconfig_32.c
@@ -30,10 

Re: [PATCH] Change pci_raw_ops to pci_raw_read/write

2008-02-10 Thread Matthew Wilcox
On Sun, Feb 10, 2008 at 12:25:02PM -0800, Yinghai Lu wrote:
> Even Greg didn't know that there was another patch need to be applied
> before this one yesterday.

I don't believe you.  For example:

On Mon, Jan 28, 2008 at 02:53:34PM -0800, Greg KH wrote:
> Please send me patches, in a form that can be merged, along with a
> proper changelog entry, in the order in which you wish them to be
> applied, so I know exactly what changes you are referring to.

Which I then did.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


raw_pci_read in quirk_intel_irqbalance

2008-02-10 Thread Matthew Wilcox
On Sun, Feb 10, 2008 at 01:45:57PM -0700, Matthew Wilcox wrote:
> I just looked at fixing that -- the reason seems to be that we don't
> actually have the struct pci_dev at that point.  I can fix it, but I
> think it's actually buggy.  I want to look at some chipset docs to
> confirm that though.

I don't think I fully understand what's going on here.  So here's what
I've been able to glean; hopefully someone who understands this better
can help out.

I happen to have an E7525-based machine, so here's an lspci of bus 0:

00:00.0 Host bridge: Intel Corporation E7525 Memory Controller Hub (rev 0a)
00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A (rev 
0a)
00:03.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A1 
(rev 0a)
00:04.0 PCI bridge: Intel Corporation E7525/E7520 PCI Express Port B (rev 0a)
00:1d.0 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI 
Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI 
Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI 
Controller #3 (rev 02)
00:1d.3 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI 
Controller #4 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI 
Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2)
00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface 
Bridge (rev 02)
00:1f.1 IDE interface: Intel Corporation 82801EB/ER (ICH5/ICH5R) IDE Controller 
(rev 02)
00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA Controller (rev 02)
00:1f.5 Multimedia audio controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) 
AC'97 Audio Controller (rev 02)

The line in question reads:

/* read xTPR register */
raw_pci_read(0, 0, 0x40, 0x4c, 2, &word);

That's domain 0, bus 0, device 8, function 0, address 0x4c, length 2.

I've checked the public E7525 and E7520 MCH datasheets, and they don't
document the xTPR registers; nor do any of the devices in the datasheet
have registers documented at 0x4c.

You can see from my lspci above that I don't _have_ a device 8 on bus 0.
The aforementioned documentation says:

"A disabled or non-existent device's configuration register space is
hidden. A disabled or non-existent device will return all ones for reads
and will drop writes just as if the cycle terminated with a Master Abort
on PCI."

Now, my E7525 isn't affected by this quirk as it has a revision greater
than 0x9.  So maybe it's expected that device 8 is hidden on my machine;
that it's only present on revisions up to 0x9.  But maybe device 8 is
always hidden, and that's why the author used raw_pci_ops?

We can still do better than this, though.  We can do:

-   raw_pci_read(0, 0, 0x40, 0x4c, 2, &word);
+   pci_bus_read_config_word(dev->bus, PCI_DEVFN(8, 0), 0x4c, &word);

Using PCI_DEVFN tells people you really did mean device 8, and it's not
a braino for device 4 or 2 (how many bits for slot and function again?)

I'll see if I can dig up the internal documentation for the xTPR register
when I'm at work on Monday.  But I've never gone looking for internal
documentation before, so I have no idea how easy it will be to find ;-)

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] scsi_error: Fix language abuse.

2008-02-10 Thread Matthew Wilcox
On Sat, Feb 09, 2008 at 01:50:20PM +, Alan Cox wrote:
> On Fri, 08 Feb 2008 20:32:54 -0500 Douglas Gilbert <[EMAIL PROTECTED]> wrote:
> > Alan Cox wrote:
> > > The word "illegal" has a precise dictionary meaning of "prohibited by
> > > law". 
> > 
> > Also "contrary to or forbidden by official rules, regulations, etc".
> 
> The OED I have here doesn't seem to think so, however if the words are
> the ones used in the T10 documentation then I'm happy to drop the patch.

Not everyone believes in the OED's definitions.  I prefer Chambers (in
general), which has the (online) definition:

illegal adj
1 against the law; not legal. Also called unlawful.
2 not authorized by law or by specific rules which apply.
ETYMOLOGY: 17c: from Latin illegalis.

I believe the second definition would apply here.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] vfs: simply the code logic

2012-07-21 Thread Matthew Wilcox
On Sat, Jul 21, 2012 at 11:38:19PM +0800, zwu.ker...@gmail.com wrote:
>   if (offset >= 0 && offset <= size) {
> - if (offset != file->f_pos) {
> - file->f_pos = offset;
> - }
> - retval = offset;
> + retval = file->f_pos = offset;
>   }

But now you're writing to f_pos unconditionally.  That may cause
cacheline bouncing.  NAK.

Plus, you're now using a less-well-known feature of C, so while it's fewer
lines of code, it's not necessarily a simplification.

-- 
Matthew Wilcox  Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 2/2] mm: Batch page_check_references in shrink_page_list sharing the same i_mmap_mutex

2012-08-21 Thread Matthew Wilcox
On Mon, Aug 20, 2012 at 09:43:02AM -0700, Tim Chen wrote:
> > In shrink_page_list, call to page_referenced_file will causes the
> > acquisition/release of mapping->i_mmap_mutex for each page in the page
> > list.  However, it is very likely that successive pages in the list
> > share the same mapping and we can reduce the frequency of i_mmap_mutex
> > acquisition by holding the mutex in shrink_page_list. This improves the
> > performance when the system has a lot page reclamations for file mapped
> > pages if workloads are using a lot of memory for page cache.

Is there a (performant) way to avoid passing around the
'mmap_mutex_locked' state?

For example, does it hurt to have all the callers hold the i_mmap_mutex()
over the entire call, or do we rely on being able to execute large chunks
of this in parallel?

Here's what I'm thinking:

1. Rename the existing page_referenced implementation to __page_referenced().
2. Add:

int needs_page_mmap_mutex(struct page *page)
{
return page->mapping && page_mapped(page) && page_rmapping(page) &&
!PageKsm(page) && !PageAnon(page);
}

int page_referenced(struct page *page, int is_locked, struct mem_cgroup *memcg,
unsigned long *vm_flags)
{
int result, needs_lock;

needs_lock = needs_page_mmap_mutex(page);
if (needs_lock)
mutex_lock(&page->mapping->i_mmap_mutex);
result = __page_referenced(page, is_locked, memcg, vm_flags);
if (needs_lock)
mutex_unlock(&page->mapping->i_mmap_mutex);
return result;
}

3. Rename the existing try_to_unmap() to __try_to_unmap()
4. Add:

int try_to_unmap(struct page *page, enum ttu_flags flags)
{
int result, needs_lock;

needs_lock = needs_page_mmap_mutex(page);
if (needs_lock)
mutex_lock(&page->mapping->i_mmap_mutex);
result = __try_to_unmap(page, is_locked, memcg, vm_flags);
if (needs_lock)
mutex_unlock(&page->mapping->i_mmap_mutex);
return result;
}

5. Change page_check_references to always call __page_referenced (since it
now always holds the mutex)
6. Replace the mutex_lock() calls in page_referenced_file() and
try_to_unmap_file() with
BUG_ON(!mutex_is_locked(&mapping->i_mmap_mutex));
7. I think you can simplify this:

> - references = page_check_references(page, mz, sc);
> + if (page->mapping) {
> + if (i_mmap_mutex == &page->mapping->i_mmap_mutex) {
> + mmap_mutex_locked = 1;
> + } else {
> + if (page_mapped(page) && page_rmapping(page)
> + && !PageKsm(page) && !PageAnon(page)) {
> + if (i_mmap_mutex)
> + mutex_unlock(i_mmap_mutex);
> + i_mmap_mutex = 
> &page->mapping->i_mmap_mutex;
> + mutex_lock(i_mmap_mutex);
> + mmap_mutex_locked = 1;
> + }
> + }
> + }
> +
> + references = page_check_references(page, mz, sc,
> +mmap_mutex_locked);

to just:

+   if (needs_page_mmap_mutex(page) &&
+   i_mmap_mutex != &page->mapping->i_mmap_mutex) {
+   if (i_mmap_mutex)
+   mutex_unlock(i_mmap_mutex);
+   i_mmap_mutex = &page->mapping->i_mmap_mutex;
+   mutex_lock(i_mmap_mutex);
+   }
references = page_check_references(page, mz, sc);

The only clunky bit would seem to be this bit:

>   if (page_mapped(page) && mapping) {
> - switch (try_to_unmap(page, TTU_UNMAP)) {
> + switch (try_to_unmap(page, TTU_UNMAP,
> + mmap_mutex_locked)) {

Which I think has to look like this:

if (page_mapped(page) && mapping) {
-   switch (try_to_unmap(page, TTU_UNMAP)) {
+   int result;
+   if (i_mmap_mutex)
+   result = __try_to_unmap(page, TTU_UNMAP);
+   else
+   result = try_to_unmap(page, TTU_UNMAP);
+   switch (result) {

Either that, or use a function pointer.  But ... bleugh.

> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index fd07c45..f0174ae 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -156,8 +156,8 @@ static inline void page_dup_rmap(struct page *page)
>  /*
>   * Called from mm/vmscan.c to handle paging out
>   */
> -int page_referenced(struct page *, int is_locked,
> - struct mem_c

[GIT PULL] Fixes for the NVMe driver

2012-09-28 Thread Matthew Wilcox

Hi Linus,

Now that actual hardware has been released (don't have any yet myself),
people are starting to want some of these fixes merged.  Could I trouble
you to pull?


The following changes since commit df3481399042200792822b6243e36a95a557b57e:

  NVMe: Set number of queues correctly (2012-01-11 09:22:24 -0500)

are available in the git repository at:

  git://git.infradead.org/users/willy/linux-nvme.git master

for you to fetch changes up to a09115b23e2002bb35b7bfd337683f00875671ec:

  NVMe: Cancel outstanding IOs on queue deletion (2012-08-07 15:56:23 -0400)


Dan Carpenter (1):
  NVMe: handle allocation failure in nvme_map_user_pages()

Keith Busch (7):
  NVMe: Set request queue logical block size
  NVMe: Fix nvme module init when nvme_major is set
  NVMe: replace nvme_ns with nvme_dev for user admin
  NVMe: use namespace id for nvme_get_features
  NVMe: Set block queue max sectors
  NVMe: Do not set IO queue depth beyond device max
  NVMe: Fix uninitialized iod compiler warning

Matthew Wilcox (3):
  NVMe: Fix whitespace damage in nvme_init
  NVMe: Free admin queue memory on initialisation failure
  NVMe: Cancel outstanding IOs on queue deletion

Quoc-Son Anh (1):
  NVMe: Use ida for nvme device instance

 drivers/block/nvme.c |  153 +-
 include/linux/nvme.h |2 +
 2 files changed, 105 insertions(+), 50 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] scsi: megaraid_sas - Fix random failure of DCDB cmds with sense info

2008-02-22 Thread Matthew Wilcox
On Fri, Feb 22, 2008 at 09:17:33AM -0700, Yang, Bo wrote:
> James,
> 
> What is the status for this patch?  We need to submit more patches based
> on the acceptance of this patch.

I've NAKed it.  You need to use compat_ioctl.  That way your code will
work even if you run a 32-bit x86 application on ia64 (or a 64-bit
application on x86).

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


ata_ram driver

2008-02-22 Thread Matthew Wilcox

I've ported the scsi_ram driver [1] to libata.  It could use a lot more
work -- there's a lot of stuff in the identify page that I haven't
filled in, and there's a lot of commands it doesn't even try to execute.

For example, when you unload the driver, you get the mildly disturbing
messages:

sd 12:0:0:0: [sdb] Stopping disk
sd 12:0:0:0: [sdb] START_STOP FAILED
sd 12:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET 
driverbyte=DRIVER_OK,SUGGEST_OK

However, it's still useful at the current stage of development, so I
thought it was worth posting to get some feedback.

[1] http://marc.info/?l=linux-scsi&m=120331663227540&w=2

Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]>

diff --git a/drivers/ata/Kconfig b/drivers/ata/Kconfig
index ba8f7f4..a3dfb50 100644
--- a/drivers/ata/Kconfig
+++ b/drivers/ata/Kconfig
@@ -40,6 +40,9 @@ config ATA_ACPI
  You can disable this at kernel boot time by using the
  option libata.noacpi=1
 
+config ATA_RAM
+   tristate "ATA RAM driver"
+
 config SATA_AHCI
tristate "AHCI SATA support"
depends on PCI
diff --git a/drivers/ata/Makefile b/drivers/ata/Makefile
index 701651e..a0a5a8c 100644
--- a/drivers/ata/Makefile
+++ b/drivers/ata/Makefile
@@ -77,6 +77,9 @@ obj-$(CONFIG_ATA_GENERIC) += ata_generic.o
 # Should be last libata driver
 obj-$(CONFIG_PATA_LEGACY)  += pata_legacy.o
 
+# A fake ata driver.  Can it be postultimate?
+obj-$(CONFIG_ATA_RAM)  += ata_ram.o
+
 libata-objs:= libata-core.o libata-scsi.o libata-sff.o libata-eh.o \
   libata-pmp.o
 libata-$(CONFIG_ATA_ACPI)  += libata-acpi.o
diff --git a/drivers/ata/ata_ram.c b/drivers/ata/ata_ram.c
new file mode 100644
index 000..7935bd1
--- /dev/null
+++ b/drivers/ata/ata_ram.c
@@ -0,0 +1,674 @@
+/*
+ * ata_ram.c - A RAM-based ATA driver for Linux
+ *
+ * This driver is intended to run as fast as possible, hence the options
+ * to discard writes and reads.
+ * By default, it'll allocate half a gigabyte of RAM to use as a ramdisc;
+ * you can change this with the `capacity' module parameter.
+ *
+ * (C) Copyright 2007-2008 Intel Corporation
+ * Author: Matthew Wilcox <[EMAIL PROTECTED]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#undef DEBUG
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Matthew Wilcox <[EMAIL PROTECTED]>");
+
+#define DRV_NAME "ata_ram"
+
+static unsigned int sector_size = 512;
+module_param(sector_size, uint, 0444);
+MODULE_PARM_DESC(sector_size, "Size of sectors");
+
+static unsigned int capacity = 1024 * 1024;
+module_param(capacity, uint, 0444);
+MODULE_PARM_DESC(capacity, "Number of logical blocks in device");
+
+static int throw_away_writes;
+module_param(throw_away_writes, int, 0644);
+MODULE_PARM_DESC(throw_away_writes, "Discard all writes to the device");
+
+static int throw_away_reads;
+module_param(throw_away_reads, int, 0644);
+MODULE_PARM_DESC(throw_away_reads, "Don't actually read data from the device");
+
+
+struct ata_ram_host {
+   struct ata_host *host;
+   struct ata_queued_cmd *cmds[ATA_MAX_QUEUE];
+   int first, last;
+   struct task_struct *thread;
+   struct ata_taskfile shadow_tf;
+};
+
+static void copy_buffer(struct ata_queued_cmd *qc, char *buf, int len)
+{
+   char *p;
+   struct scatterlist *sg;
+   unsigned int i;
+
+   for_each_sg(qc->sg, sg, qc->n_elem, i) {
+   int tocopy = sg->length;
+   if (tocopy > len)
+   tocopy = len;
+
+   p = kmap_atomic(sg_page(sg), KM_USER0);
+   memcpy(p + sg->offset, buf, tocopy);
+   kunmap_atomic(p, KM_USER0);
+
+   len -= tocopy;
+   if (!len)
+   break;
+   buf += tocopy;
+   }
+}
+
+static void ata_ram_setup_result(struct ata_queued_cmd *qc)
+{
+   struct ata_ram_host *arhost = qc->ap->private_data;
+   arhost->shadow_tf.flags = ATA_TFLAG_LBA48 | ATA_TFLAG_LBA;
+   arhost->shadow_tf.protocol = ATA_PROT_DMA;
+   arhost->shadow_tf.ctl = 0;
+   arhost->shadow_tf.feature = 0;
+   arhost->shadow_tf.device = 0;
+   arhost->shadow_tf.command = 0;
+}
+
+static void ata_put_dword(u16 *d, u32 v)
+{
+   d[0] = cpu_to_le16(v & 0x);
+   d[1] = cpu_to_le16(v >> 16);
+}
+
+static u32 ata_get_lba_28(struct ata_taskfile *tf)
+{
+   return tf->lbal | (tf->lbam << 8) | (tf->lbah << 16);
+}
+
+static u64 ata_get_l

Accessor macros vs reference counting

2008-02-23 Thread Matthew Wilcox
On Sat, Feb 23, 2008 at 04:14:08PM +0800, WANG Cong wrote:
> Use get_personality() macro instead of explicit reference
> for parisc code.
> - if (personality(current->personality) == PER_LINUX32
> + if (personality(get_personality()) == PER_LINUX32

Hm.  We have an interesting clash of conventions here.

On the one hand, we have the java-style accessor convention.
get_personality()/set_personality().

On the other hand, we have reference counting.  get_cpu()/put_cpu().

When I first saw this patch, I assumed the latter (and went looking for
the missing put_), but _personality is of the former style.  I have one
suggestion (that nobody is going to like), which is that we suffix
reference-counting functions with _ref, ie:

get_cpu_ref()
put_cpu_ref()
pci_get_device_ref()
pci_put_device_ref()
pci_dev_get_ref()
pci_dev_put_ref()

(PCI is actually a good example here; it has "get" referring both to
refcounting and to accessors)

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Accessor macros vs reference counting

2008-02-23 Thread Matthew Wilcox
On Sat, Feb 23, 2008 at 11:16:22PM +1030, David Newall wrote:
> Is get_personality() thought to be better than current->personality?  To
> me, the latter seems more meaningful, and I'd rather see it than the former.

I suppose it's possible we might want to move 'personality' out of
'current' one day, and having an accessor macro makes it easier to do
that.  I agree it doesn't seem like a big win though.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] [SCSI] gdth: misc cleanups, preparation for ISA/EISA hotplug API

2008-02-23 Thread Matthew Wilcox
On Sun, Feb 24, 2008 at 12:31:17AM -0500, Christoph Hellwig wrote:
> On Sun, Feb 24, 2008 at 12:18:23AM -0500, Jeff Garzik wrote:
> > hm.  We'll see how it plays out...  on the remove side, the above is 
> > exact what happens in gdth_remove_one() without my patch, thus 
> > consolidating two cases of the same code into one.  There is a less-strong 
> > argument for doing the allocation that way, but it may turn out to be 
> > useful anyway once the ISA/EISA API conversion is complete.
> 
> EISA ->remove has a different prototype from PCI ->remove from ISA
> ->remove, so gdth_remove_one will be split up eventually.  Having the
> scsi_host_put duplicated in each shouldn't be too much of a problem :)

Shouldn't need to duplicate it ... free bus-specific things in the
->remove method, and call a common helper.  See advansys_release() and
its callers in advansys.c for how I did it.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] Revert "IB/fmr_pool: ib_fmr_pool_flush() should flush all dirty FMRs"

2008-02-26 Thread Matthew Wilcox
On Tue, Feb 26, 2008 at 11:23:01AM -0800, Benny Halevy wrote:
> Pete, the subject says "PATCH 1/2" but I didn't see any follow-up message
> for PATCH 2/2. Just wondering :)

I think the problem's on your end ... I got it and so did marc:
http://marc.info/?l=linux-scsi&m=120405067313933&w=2

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: OK, let's try cleaning up another nit. Is anyone paying attention?

2001-04-19 Thread Matthew Wilcox

On Thu, Apr 19, 2001 at 18:50:34 EDT, Eric S. Raymond wrote:
> Remove dead CONFIG_BINFMT_JAVA symbol.

Please don't do this, it just makes merging our patches with Linus harder.
This has already been done in our tree since Feb 1.  In fact, please
don't touch anything in the tree which is PA specific; we have a large
arch update pending.

http://puffin.external.hp.com/cvs/linux/arch/parisc/config.in?log=y

shows the current state of our config.in, if you're curious.  If you
have any changes you want to make, don't hesitate to coordinate with us
by mailing [EMAIL PROTECTED]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: OK, let's try cleaning up another nit. Is anyone paying attention?

2001-04-19 Thread Matthew Wilcox

On Thu, Apr 19, 2001 at 11:00:09PM -0400, Eric S. Raymond wrote:
> What is the right procedure for doing changes like this?  Is "don't
> touch that tree" a permanent condition, or am I going to get a chance
> to clean up the global CONFIG_ namespace after your next merge-down?

Our current status is that we've got a patch with Alan that's been sitting
in his tree for a while (things got trickier than he expected and he
hasn't been able to merge that upstream to Linus yet).  Meanwhile we've
carried on development as normal.  So even after the patches in Alan's
tree land, we've still got a fair hunk of changes to go in.

My preference would be for you to fetch our tree 

cvs -d :pserver:[EMAIL PROTECTED]:/home/cvs/parisc login
[no password]

cvs -d :pserver:[EMAIL PROTECTED]:/home/cvs/parisc co linux

and submit patches to us, which will get to Linus in the fullness of time.
I'm aware this might not be terribly satisfactory for you, but we're
doing our best not to lose our way amid the churn of development right
now and having patches which haven't followed a progression through
us makes that significantly harder.

> Could I ask you to audit your tree and change the prefix on any 
> CONFIG_ symbols that are private over there?  This would make life 
> easier for my auditing tools (kxref and Stephen Cole's ach script).

I don't think we have any of those.  We certainly have symbols which are
defined for symmetry and may not actually be used yet (CONFIG_PA11 might not
be, perhaps).  But that's what happens when you're developing software :-)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: OK, let's try cleaning up another nit. Is anyone paying attention?

2001-04-19 Thread Matthew Wilcox

On Thu, Apr 19, 2001 at 10:07:22PM -0600, james rich wrote:
> Doesn't this seem a little like the problems occurring with lvm right now?
> A separate tree maintained with the maintainers not wanting others
> submitting patches that conflict with their particular tree?  It seems
> that any project should be able to submit any patch against The One True
> Tree: Linus' tree.

every single architecture has their own development tree.  the pa project
has not been running as long as the other ports, and has a large amount of
development going on.  i count 28 commits for april (so far), 75 commits
for march, 187 for february and 112 for january (to the kernel tree, other
parts of the port also have commit messages).  linus would go insane if
we sent him every single one of those patches individually.  and we'd
go insane trying to keep up with what he'd taken and what he'd dropped.

until you've actually tried doing this, please don't attempt to criticise.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [parisc-linux] Re: OK, let's try cleaning up another nit. Is anyone paying attention?

2001-04-20 Thread Matthew Wilcox

On Fri, Apr 20, 2001 at 11:15:12AM -0500, Bob McElrath wrote:
> This may be a dumb question, but is there some place where the arch
> maintainers are listed?  Where the arch-specific trees are kept?  Where
> would I go to get the latest set of relevant patches for alpha?

http://www.kernel.org/ has a list of architecture websites.  Also the
CREDITS / MAINTAINERS files tend to list the people who are involved.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [parisc-linux] Re: OK, let's try cleaning up another nit. Is anyone paying attention?

2001-04-20 Thread Matthew Wilcox

On Fri, Apr 20, 2001 at 02:00:00PM -0500, Jeff Dike wrote:
> [EMAIL PROTECTED] said:
> > http://www.kernel.org/ has a list of architecture websites.  Also the
> > CREDITS / MAINTAINERS files tend to list the people who are involved. 
> 
> Except it's restricted to processor ports, which would leave you not knowing 
> about UML.

Have you tried mailing [EMAIL PROTECTED] and asking to be added?

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: OK, let's try cleaning up another nit. Is anyone paying attention?

2001-04-20 Thread Matthew Wilcox

On Fri, Apr 20, 2001 at 03:47:43PM -0400, Eric S. Raymond wrote:
> CONFIG_BINFMT_SOM: arch/parisc/config.in arch/parisc/defconfig
> 
> Not used in code anywhere.  Can you get rid of this one?

Code not merged yet.

> CONFIG_DMB_TRAP: arch/parisc/kernel/sba_iommu.c
> CONFIG_FUNC_SIZE: arch/parisc/kernel/sba_iommu.c
> 
> Would you please take these out of the CONFIG_ namespace?  Changing the 
> prefix to CONFIGURE would do nicely.

Grant?  This is your code.

> CONFIG_GENRTC: arch/parisc/defconfig
> 
> This is a typo for GEN_RTC.  Please fix it or get rid of it.

in our tree it's in drivers/char/Makefile:

obj-$(CONFIG_GENRTC) += genrtc.o

this code was written by Sam Creasey as part of the sun3 port, and he
schlepped it into our tree too.  we have no GEN_RTC in our tree.

> CONFIG_HIL: arch/parisc/defconfig
> 
> Looks like an orphan.  Can you get rid of it?

code not yet merged.

> CONFIG_SERIAL_GSC: drivers/char/serial.c arch/parisc/defconfig
> 
> That reference pattern looks kind of weird.  Is this a bug?

it's old and needs to die properly.  i haven't had time to fix that yet.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Architecture-specific include files

2001-04-22 Thread Matthew Wilcox


Something which came up in one of the hallway discussions at the
kernelsummit was that a lot of the architecture maintainers would find
it more convenient if the arch-specific header files were moved from
include/asm-$ARCH to arch/$ARCH/include.  Since we use a symlink _anyway_,
no global changes to include statements are necessary, we'd merely need
to change Makefile from

symlinks:
rm -f include/asm
( cd include ; ln -sf asm-$(ARCH) asm)

to

symlinks:
rm -f include/asm
( cd include ; ln -sf ../arch/$(ARCH)/include asm)

Would anyone have a problem with this change?  It'll make for a hell
of a big patch from Linus, but it really will simplify the lives of the
architecture maintainers.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: question on atomic_t

2001-04-22 Thread Matthew Wilcox


> can I assume that a member of a structure of type atomic_t is 0 after
> using memset to zero on the structure ?

You must not do this.  use atomic_init instead.  on parisc, the locked value
of a spinlock is 0, so no more updates to that atomic value would ever work.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[PATCH] Add documentation for string functions

2001-03-03 Thread Matthew Wilcox

diff -urNX dontdiff linux-2.4.2/Documentation/DocBook/Makefile 
linux-willy/Documentation/DocBook/Makefile
--- linux-2.4.2/Documentation/DocBook/Makefile  Sat Feb  3 13:43:55 2001
+++ linux-willy/Documentation/DocBook/Makefile  Sat Mar  3 18:01:07 2001
@@ -82,6 +82,8 @@
$(TOPDIR)/kernel/pm.c \
$(TOPDIR)/kernel/ksyms.c \
$(TOPDIR)/kernel/kmod.c \
+   $(TOPDIR)/lib/string.c \
+   $(TOPDIR)/lib/vsprintf.c \
$(TOPDIR)/net/netsyms.c
  
 kernel-api.sgml: kernel-api.tmpl $(APISOURCES)
diff -urNX dontdiff linux-2.4.2/Documentation/DocBook/kernel-api.tmpl 
linux-willy/Documentation/DocBook/kernel-api.tmpl
--- linux-2.4.2/Documentation/DocBook/kernel-api.tmpl   Sat Dec 30 19:16:13 2000
+++ linux-willy/Documentation/DocBook/kernel-api.tmpl   Sat Mar  3 18:00:33 2001
@@ -41,6 +41,16 @@
  
   
 
+  
+ Basic C Library Functions
+ String Conversions
+!Ilib/vsprintf.c
+ 
+ String Manipulation
+!Ilib/string.c
+ 
+  
+
   
  Memory Management in Linux
  The Slab Cache
diff -urNX dontdiff linux-2.4.2/lib/string.c linux-willy/lib/string.c
--- linux-2.4.2/lib/string.cThu Aug 10 14:10:14 2000
+++ linux-willy/lib/string.cSat Mar  3 19:56:04 2001
@@ -20,6 +20,12 @@
 #include 
 
 #ifndef __HAVE_ARCH_STRNICMP
+/**
+ * strnicmp - Case insensitive, length-limited string comparison
+ * @s1: One string
+ * @s2: The other string
+ * @len: the maximum number of characters to compare
+ */
 int strnicmp(const char *s1, const char *s2, size_t len)
 {
/* Yes, Virginia, it had better be unsigned */
@@ -49,6 +55,11 @@
 char * ___strtok;
 
 #ifndef __HAVE_ARCH_STRCPY
+/**
+ * strcpy - Copy a %NUL terminated string
+ * @dest: Where to copy the string to
+ * @src: Where to copy the string from
+ */
 char * strcpy(char * dest,const char *src)
 {
char *tmp = dest;
@@ -60,6 +71,16 @@
 #endif
 
 #ifndef __HAVE_ARCH_STRNCPY
+/**
+ * strncpy - Copy a length-limited, %NUL-terminated string
+ * @dest: Where to copy the string to
+ * @src: Where to copy the string from
+ * @count: The maximum number of bytes to copy
+ *
+ * Note that unlike userspace strncpy, this does not %NUL-pad the buffer.
+ * However, the result is not %NUL-terminated if the source exceeds
+ * @count bytes.
+ */
 char * strncpy(char * dest,const char *src,size_t count)
 {
char *tmp = dest;
@@ -72,6 +93,11 @@
 #endif
 
 #ifndef __HAVE_ARCH_STRCAT
+/**
+ * strcat - Append one %NUL-terminated string to another
+ * @dest: The string to be appended to
+ * @src: The string to append to it
+ */
 char * strcat(char * dest, const char * src)
 {
char *tmp = dest;
@@ -86,6 +112,15 @@
 #endif
 
 #ifndef __HAVE_ARCH_STRNCAT
+/**
+ * strncat - Append a length-limited, %NUL-terminated string to another
+ * @dest: The string to be appended to
+ * @src: The string to append to it
+ * @count: The maximum numbers of bytes to copy
+ *
+ * Note that in contrast to strncpy, strncat ensures the result is
+ * terminated.
+ */
 char * strncat(char *dest, const char *src, size_t count)
 {
char *tmp = dest;
@@ -106,6 +141,11 @@
 #endif
 
 #ifndef __HAVE_ARCH_STRCMP
+/**
+ * strcmp - Compare two strings
+ * @cs: One string
+ * @ct: Another string
+ */
 int strcmp(const char * cs,const char * ct)
 {
register signed char __res;
@@ -120,6 +160,12 @@
 #endif
 
 #ifndef __HAVE_ARCH_STRNCMP
+/**
+ * strncmp - Compare two length-limited strings
+ * @cs: One string
+ * @ct: Another string
+ * @count: The maximum number of bytes to compare
+ */
 int strncmp(const char * cs,const char * ct,size_t count)
 {
register signed char __res = 0;
@@ -135,6 +181,11 @@
 #endif
 
 #ifndef __HAVE_ARCH_STRCHR
+/**
+ * strchr - Find the first occurrence of a character in a string
+ * @s: The string to be searched
+ * @c: The character to search for
+ */
 char * strchr(const char * s, int c)
 {
for(; *s != (char) c; ++s)
@@ -145,6 +196,11 @@
 #endif
 
 #ifndef __HAVE_ARCH_STRRCHR
+/**
+ * strrchr - Find the last occurrence of a character in a string
+ * @s: The string to be searched
+ * @c: The character to search for
+ */
 char * strrchr(const char * s, int c)
 {
const char *p = s + strlen(s);
@@ -157,6 +213,10 @@
 #endif
 
 #ifndef __HAVE_ARCH_STRLEN
+/**
+ * strlen - Find the length of a string
+ * @s: The string to be sized
+ */
 size_t strlen(const char * s)
 {
const char *sc;
@@ -168,6 +228,11 @@
 #endif
 
 #ifndef __HAVE_ARCH_STRNLEN
+/**
+ * strnlen - Find the length of a length-limited string
+ * @s: The string to be sized
+ * @count: The maximum number of bytes to search
+ */
 size_t strnlen(const char * s, size_t count)
 {
const char *sc;
@@ -179,6 +244,12 @@
 #endif
 
 #ifndef __HAVE_ARCH_STRSPN
+/**
+ * strspn - Calculate the length of the initial substring of @s which only
+ * contain letters in @accept
+ * @s: The string to be searched
+ * @accept: The string to search for
+ */
 size_t strspn(const char *

[PATCH] Documentation for bitops

2001-03-04 Thread Matthew Wilcox


This is just the kernel-doc parts; the .tmpl file entry is missing until
i get the chance to sync up with Alan again.

--- linux-2.4.2/include/asm-i386/bitops.h   Wed Feb 21 17:09:56 2001
+++ linux-willy/include/asm-i386/bitops.h   Mon Mar  5 00:39:28 2001
@@ -23,6 +23,16 @@
 
 #define ADDR (*(volatile long *) addr)
 
+/**
+ * set_bit - Atomically set a bit in memory
+ * @nr: the bit to set
+ * @addr: the address to start counting from
+ *
+ * This function is atomic and may not be reordered.  See __set_bit()
+ * if you do not require the atomic guarantees.
+ * Note that @nr may be almost arbitrarily large; this function is not
+ * restricted to acting on a single-word quantity.
+ */
 static __inline__ void set_bit(int nr, volatile void * addr)
 {
__asm__ __volatile__( LOCK_PREFIX
@@ -31,7 +41,15 @@
:"Ir" (nr));
 }
 
-/* WARNING: non atomic and it can be reordered! */
+/**
+ * __set_bit - Set a bit in memory
+ * @nr: the bit to set
+ * @addr: the address to start counting from
+ *
+ * Unlike set_bit(), this function is non-atomic and may be reordered.
+ * If it's called on the same region of memory simultaneously, the effect
+ * may be that only one operation succeeds.
+ */
 static __inline__ void __set_bit(int nr, volatile void * addr)
 {
__asm__(
@@ -40,11 +58,16 @@
:"Ir" (nr));
 }
 
-/*
- * clear_bit() doesn't provide any barrier for the compiler.
+/**
+ * clear_bit - Clears a bit in memory
+ * @nr: Bit to clear
+ * @addr: Address to start counting from
+ *
+ * clear_bit() is atomic and may not be reordered.  However, it does
+ * not contain a memory barrier, so if it is used for locking purposes,
+ * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
+ * in order to ensure changes are visible on other processors.
  */
-#define smp_mb__before_clear_bit() barrier()
-#define smp_mb__after_clear_bit()  barrier()
 static __inline__ void clear_bit(int nr, volatile void * addr)
 {
__asm__ __volatile__( LOCK_PREFIX
@@ -52,7 +75,18 @@
:"=m" (ADDR)
:"Ir" (nr));
 }
+#define smp_mb__before_clear_bit() barrier()
+#define smp_mb__after_clear_bit()  barrier()
 
+/**
+ * change_bit - Toggle a bit in memory
+ * @nr: Bit to clear
+ * @addr: Address to start counting from
+ *
+ * change_bit() is atomic and may not be reordered.
+ * Note that @nr may be almost arbitrarily large; this function is not
+ * restricted to acting on a single-word quantity.
+ */
 static __inline__ void change_bit(int nr, volatile void * addr)
 {
__asm__ __volatile__( LOCK_PREFIX
@@ -61,10 +95,13 @@
:"Ir" (nr));
 }
 
-/*
- * It will also imply a memory barrier, thus it must clobber memory
- * to make sure to reload anything that was cached into registers
- * outside _this_ critical section.
+/**
+ * test_and_set_bit - Set a bit and return its old value
+ * @nr: Bit to set
+ * @addr: Address to count from
+ *
+ * This operation is atomic and cannot be reordered.  
+ * It also implies a memory barrier.
  */
 static __inline__ int test_and_set_bit(int nr, volatile void * addr)
 {
@@ -77,7 +114,15 @@
return oldbit;
 }
 
-/* WARNING: non atomic and it can be reordered! */
+/**
+ * __test_and_set_bit - Set a bit and return its old value
+ * @nr: Bit to set
+ * @addr: Address to count from
+ *
+ * This operation is non-atomic and can be reordered.  
+ * If two examples of this operation race, one can appear to succeed
+ * but actually fail.  You must protect multiple accesses with a lock.
+ */
 static __inline__ int __test_and_set_bit(int nr, volatile void * addr)
 {
int oldbit;
@@ -89,6 +134,14 @@
return oldbit;
 }
 
+/**
+ * test_and_clear_bit - Clear a bit and return its old value
+ * @nr: Bit to set
+ * @addr: Address to count from
+ *
+ * This operation is atomic and cannot be reordered.  
+ * It also implies a memory barrier.
+ */
 static __inline__ int test_and_clear_bit(int nr, volatile void * addr)
 {
int oldbit;
@@ -100,7 +153,15 @@
return oldbit;
 }
 
-/* WARNING: non atomic and it can be reordered! */
+/**
+ * __test_and_clear_bit - Clear a bit and return its old value
+ * @nr: Bit to set
+ * @addr: Address to count from
+ *
+ * This operation is non-atomic and can be reordered.  
+ * If two examples of this operation race, one can appear to succeed
+ * but actually fail.  You must protect multiple accesses with a lock.
+ */
 static __inline__ int __test_and_clear_bit(int nr, volatile void * addr)
 {
int oldbit;
@@ -112,6 +173,14 @@
return oldbit;
 }
 
+/**
+ * test_and_change_bit - Change a bit and return its new value
+ * @nr: Bit to set
+ * @addr: Address to count from
+ *
+ * This operation is atomic and cannot be reordered.  
+ * It also implies a memory barrier.
+ */
 static __inline__ int test_and_change_bit(int nr, volatile void * addr)
 {
int oldbit;
@@ -123,9 +192,15 @@
return oldbit;
 }
 
-/*
- * Thi

super_operations in -pre7

2001-01-16 Thread Matthew Wilcox


several new operations have been added to super_operations, presumably
as part of the reiserfs merge.  write_super_lockfs and unlockfs are
never called.  can we remove them?

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: hfs support for blocksize != 512

2000-08-29 Thread Matthew Wilcox

On Tue, Aug 29, 2000 at 06:08:04PM +0200, Roman Zippel wrote:
> Anyway, I'm happy about any bug reports, that you can't reproduce with
> hfs on a drive with 512 byte sectors (for that I still trying to fully
> understand hfs btrees :-) ). I don't think this patch should be included

last time i looked (somewhere around 2.3.4x), all the B-tree directory
implementations in the kernel were broken.  That's HFS, HPFS and NTFS.
None of them consider the race where an insert occurs into the tree
while you're doing a readdir.  I thought about how to fix it for ext2
btrees but I haven't come up with a satisfactory solution yet.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



[PATCH] make config w/ Pentium-II

2000-12-12 Thread Matthew Wilcox


I built 2.4.0-test10 for my laptop and got caught out by `make config'.
When I typed `Pentium-II' for my CPU type, it selected Pentium-III and
my kernel wouldn't boot.  To prevent errors of this type, please apply
this patch.  As a bonus, `Celeron' will now work as an answer too.

Please apply.

diff -urNX dontdiff linux/scripts/Configure linux-test10/scripts/Configure
--- linux/scripts/Configure Mon Oct 30 22:44:29 2000
+++ linux-test10/scripts/Configure  Tue Dec 12 03:39:41 2000
@@ -479,11 +479,14 @@
while [ -n "$1" ]; do
name=$(echo $1 | tr a-z A-Z)
case "$name" in
-   "$ans"* )
-   if [ "$name" = "$ans" ]; then
-   val="$2"
-   break   # stop on exact match
-   fi
+   "$ans"* | */"$ans"* )
+   case "$name" in
+   "$ans" | */"$ans"/* | \
+   "$ans"/* | */"$ans" )
+   val="$2"
+   break # exact match
+   ;;
+   esac
if [ -n "$val" ]; then
echo;echo \
"  Sorry, \"$ans\" is ambiguous; please enter a longer string."

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



[PATCH] find_vma_prev rewrite

2000-12-23 Thread Matthew Wilcox


find_vma_prev doesn't return a pointer to the `prev' vma if the address
is greater than the last existing vma.  This doesn't matter unless you're
on a PA-RISC machine :-)

This rewrite should speed up & make find_vma_prev simpler, as well as
fixing the previous behaviour.

--- linux-t10/mm/mmap.c Fri Oct 13 20:10:30 2000
+++ linux-mine/mm/mmap.cWed Dec 20 15:22:53 2000
@@ -417,54 +417,48 @@
 struct vm_area_struct * find_vma_prev(struct mm_struct * mm, unsigned long addr,
  struct vm_area_struct **pprev)
 {
-   if (mm) {
-   if (!mm->mmap_avl) {
-   /* Go through the linear list. */
-   struct vm_area_struct * prev = NULL;
-   struct vm_area_struct * vma = mm->mmap;
-   while (vma && vma->vm_end <= addr) {
-   prev = vma;
-   vma = vma->vm_next;
-   }
-   *pprev = prev;
-   return vma;
-   } else {
-   /* Go through the AVL tree quickly. */
-   struct vm_area_struct * vma = NULL;
-   struct vm_area_struct * last_turn_right = NULL;
-   struct vm_area_struct * prev = NULL;
-   struct vm_area_struct * tree = mm->mmap_avl;
-   for (;;) {
-   if (tree == vm_avl_empty)
+   struct vm_area_struct *vma = NULL;
+   struct vm_area_struct *prev = NULL;
+   if (!mm)
+   goto out;
+   prev = mm->mmap_cache;
+   if (prev) {
+   vma = prev->vm_next;
+   if (prev->vm_end < addr &&
+   ((vma == NULL) || (addr < vma->vm_end)))
+   goto out;
+   prev = NULL;
+   }
+   vma = mm->mmap; /* guard against there being no prev */
+   if (!mm->mmap_avl) {
+   /* Go through the linear list. */
+   while (vma && vma->vm_end <= addr) {
+   prev = vma;
+   vma = vma->vm_next;
+   }
+   } else {
+   /* Go through the AVL tree quickly. */
+   struct vm_area_struct * tree = mm->mmap_avl;
+   while (tree != vm_avl_empty) {
+   if (addr < tree->vm_end) {
+   tree = tree->vm_avl_left;
+   } else {
+   prev = tree;
+   if (tree->vm_next == NULL)
break;
-   if (tree->vm_end > addr) {
-   vma = tree;
-   prev = last_turn_right;
-   if (tree->vm_start <= addr)
-   break;
-   tree = tree->vm_avl_left;
-   } else {
-   last_turn_right = tree;
-   tree = tree->vm_avl_right;
-   }
-   }
-   if (vma) {
-   if (vma->vm_avl_left != vm_avl_empty) {
-   prev = vma->vm_avl_left;
-   while (prev->vm_avl_right != vm_avl_empty)
-   prev = prev->vm_avl_right;
-   }
-   if ((prev ? prev->vm_next : mm->mmap) != vma)
-   printk("find_vma_prev: tree inconsistent with 
list\n");
-   *pprev = prev;
-   return vma;
+   if (addr < tree->vm_next->vm_end)
+   break;
+   tree = tree->vm_avl_right;
}
}
}
-   *pprev = NULL;
-   return NULL;
+   mm->mmap_cache = prev;
+out:
+   *pprev = prev;
+   return prev ? prev->vm_next : vma;
 }
 
+/* XXX: Needs to be fixed for PA-RISC -- won't grow our stack. */
 struct vm_area_struct * find_extend_vma(struct mm_struct * mm, unsigned long addr)
 {
struct vm_area_struct * vma;

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



[PATCH] kmalloc(HIGHUSER) crashes

2000-12-28 Thread Matthew Wilcox


doh, i'm a moron.

-- This is a copy of the message --

Date: Thu, 28 Dec 2000 14:51:25 +
From: Matthew Wilcox <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: [PATCH] kmalloc(HIGHUSER) crashes

The slab cache accepts the __GFP_HIGHMEM flag, but it will then die
horribly, tracing it through:


#define SLAB_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_HIGHMEM)

slab.c:kmem_cache_grow()
if (flags & ~(SLAB_DMA|SLAB_LEVEL_MASK|SLAB_NO_GROW))
BUG();
(...)
slab.c:kmem_getpages()
addr = (void*) __get_free_pages(flags, cachep->gfporder);

__get_free_pages calls page_address() on the allocated pages, but:

/*
 * Permanent address of a page. Obviously must never be
 * called on a highmem page.
 */
#define page_address(page) ((page)->virtual)


so I think we should apply the patch at the bottom of this email.

Ideally, I think slab.c should call alloc_pages and manage them directly
-- it calls virt_to_page often enough that it should become more intimate
with the linux struct page *.  But I don't think we should allow HIGHMEM
kmallocs anyway; we'd run out of kmap space too readily.

Index: include/linux/slab.h
===
RCS file: /var/cvs/linux/include/linux/slab.h,v
retrieving revision 1.1.1.1
diff -u -p -r1.1.1.1 include/linux/slab.h
--- include/linux/slab.h2000/10/31 19:18:05 1.1.1.1
+++ include/linux/slab.h2000/12/28 11:06:49
@@ -22,7 +22,7 @@ typedef struct kmem_cache_s kmem_cache_t
 #defineSLAB_NFSGFP_NFS
 #defineSLAB_DMAGFP_DMA
 
-#define SLAB_LEVEL_MASK(__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_HIGHMEM)
+#define SLAB_LEVEL_MASK(__GFP_WAIT|__GFP_HIGH|__GFP_IO)
 #defineSLAB_NO_GROW0x1000UL/* don't grow a cache */
 
 /* flags to pass to kmem_cache_create().

-- 
Revolutions do not require corporate support.

- End forwarded message -

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] move xchg/cmpxchg to atomic.h

2001-01-02 Thread Matthew Wilcox

On Tue, Jan 02, 2001 at 01:03:48AM -0800, David S. Miller wrote:
> If you require an external agent (f.e. your spinlock) because you
> cannot implement xchg with a real atomic sequence, this breaks the
> above assumptions.

We really can't.  We _only_ have load-and-zero.  And it has to be 16-byte
aligned.  xchg() is just not something the CPU implements.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Semantics of remount

2000-09-24 Thread Matthew Wilcox

On Sat, Sep 09, 2000 at 03:18:57PM +0200, Andreas Gruenbacher wrote:
> Hi,
> 
> What are the intended semantics for a remount:
> 
> (a) equivalent to a mount, resetting all mount options that
> might be set
> (b) change mount options relative to the current mount Aoptions
> 
> For ext2, the 2.2.17 kernel implements (a),
> while 2.4.0-test8 implements (b).

I don't believe there are any standards documents which tell us what
to do.  (b) seems more intuitive than (a).

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [PATCH] af_rose.c: s/suser/capable/ + micro cleanups

2000-08-31 Thread Matthew Wilcox

On Thu, Aug 31, 2000 at 12:26:29AM +0200, Rogier Wolff wrote:
>int mr (unsigned int rate, int r) 
>{
>  int e = 16+9;
>  static int round[4]={0, 0, 0x, 0x8000};
>  if (!rate) return 0;
>  for (;  rate & 0xfc00 ;rate >>= 1, e++);
>  for (;!(rate & 0xfe00);rate <<= 1, e--);
>  return ((rate & ~0x0200) | (e << (16+9)) + round[r]) >> 16;
>}
> 
> Dense code, right? Floating point in the kernel Aaargh. 

you've failed Chapter 3 of CodingStyle:

HOWEVER, while mixed-case names are frowned upon, descriptive names for
global variables are a must.  To call a global function "foo" is a
shooting offense. 

what the hell does a function called `mr' do?

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



dontdiff updates

2000-10-04 Thread Matthew Wilcox


hi tigran.  can we add some more entries to dontdiff?

*.ver
sm_tbl_*.h

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



[RFC] Imminent death of /proc/locks predicted; film at 11

2000-09-30 Thread Matthew Wilcox


Does anyone actually want /proc/locks to stay?  The data structure
which allows it to be generated is now only used by the code to print
out /proc/locks.  If it could be deleted, a lot of code and data pointers
could go away.  I don't think any program depends on its existance and
it's a pretty ugly file anyway (exposing kernel pointers to userspace?
looks like pure debug code).

Speak now, or it shall be gone.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Imminent death of /proc/locks predicted; film at 11

2000-09-30 Thread Matthew Wilcox

On Sun, Oct 01, 2000 at 12:29:27AM +0100, Alan Cox wrote:
> > Does anyone actually want /proc/locks to stay?  The data structure
> 
> I'd like a way to view the locks that exist - its useful for debugging
> weird app stuff
> 
> > out /proc/locks.  If it could be deleted, a lot of code and data pointers
> > could go away.  I don't think any program depends on its existance and
> > it's a pretty ugly file anyway (exposing kernel pointers to userspace?
> > looks like pure debug code).
> > 
> > Speak now, or it shall be gone.
> 
> If it makes the code far cleaner then I have no objection. If we can do a
> simple (different format even) /proc/locks to replace it that scores double
> points ;)

This was the sort of objection I was hoping to receive :-)

When debugging this kind of problem, you're not interested in the
non-conflicting locks, only the ones which are blocked waiting for
another lock, right?

If so, then we need that structure around anyway for doing the crappy
POSIX deadlock detection.  And I don't have a problem with exposing that
to userspace.

If you did want all locks, we could walk all inodes in core and print
out all the locks held on them :-)  That might even be more scalable than
the current approach...

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Imminent death of /proc/locks predicted; film at 11

2000-09-30 Thread Matthew Wilcox

On Sun, Oct 01, 2000 at 12:55:30AM +0100, Alan Cox wrote:
> > When debugging this kind of problem, you're not interested in the
> > non-conflicting locks, only the ones which are blocked waiting for
> > another lock, right?
> 
> All I care about is what locks are currently in force in the system. So for
> example I can do something like
> 
> showlocks /var/spool/mail/alan

there's already a F_GETLK fcntl.  however, it's not entirely suitable
for use as a debugging tool since it dosen't report the blocked locks
and it might not find all the locks in force (or if it does,it could
take 2^64 calls to fcntl to find them all).

given an inode, it's trivial to find the locks in force on it (and the
locks blocked on the locks in force upon it).  the difficulty lies in
reporting this data to userspace.

if fcntl took a 4th argument specifying the length of the buffer, i'd
recommend a F_GETLKS fcntl.  a horrid work around for this would be that
the first 4 bytes of the buffer pointed to by the third argument of the
fcntl is the length of the buffer.

i'd appreciate anyone coming up with a nicer interface.  How do other
unices provide this information?  grovelling through /proc/kcore?

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: dontdiff updates

2000-10-04 Thread Matthew Wilcox

On Wed, Oct 04, 2000 at 03:30:45PM +0100, Tigran Aivazian wrote:
> They were in fact there already. Maybe you had an old version. To remind,
> I keep it at:
> 
> http://www.moses.uklinux.net/patches/dontdiff

Ah, you've moved it then :-)  I found an old posting from you with
a different `canonical location', let me see.. ah yes.  searching for
`tigran dontdiff' on google gets me ocston.org/~tigran/patches/dontdiff.
Several times over, so trying to sneak the moses url into google's
database may be too hard :-) Could you either keep that one up to date
or remove it?

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Imminent death of /proc/locks predicted; film at 11

2000-09-30 Thread Matthew Wilcox

On Sat, Sep 30, 2000 at 08:23:37PM -0500, Peter Samuelson wrote:
> 
> [Matthew Wilcox]
> > if fcntl took a 4th argument specifying the length of the buffer, i'd
> > recommend a F_GETLKS fcntl.  a horrid work around for this would be
> > that the first 4 bytes of the buffer pointed to by the third argument
> > of the fcntl is the length of the buffer.
> 
> E!  You're right, it's horrid.  Anyway, I think this one can be
> solved in userspace -- as long as you don't need atomicity.  Untested
> code with no error checking:

thanks for snipping the part of my email where i explain this won't work.
examples:

process 1 locks bytes 1 to 7 nonexclusively
process 2 locks bytes 2 to 5 nonexclusively

you now can't see the second lock.

and there's no way of seeing the blocked lock.  This needs kernel support
some how.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: (struct dentry *)->vfsmnt;

2001-03-14 Thread Matthew Wilcox

On Wed, Mar 14, 2001 at 10:26:50AM -0700, Andreas Dilger wrote:
> > Let me put it that way: I don't understand why (if it is useful at all)
> > it is done in the fs. Looks like a wrong level...
> 
> For the same reason that the UUID and LABEL are stored in the superblock:
> you want this infomation kept with the filesystem and not anywhere else,
> otherwise it will quickly get out-of-date.  Wherever you mounted the
> filesystem last is where it would be mounted if you import the VG on
> another system.  You can obviously edit /etc/fstab afterwards if it is
> wrong, and then remount the filesystem(s), and this will store the
> correct mountpoint into the filesystem for the next vgimport.

Al is saying `why not do this in mount(8) instead of mount(2)?'  I haven't
seen you answer that yet.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [RFC] sane access to per-fs metadata (was Re: [PATCH] Documentation/ioctl-number.txt)

2001-03-23 Thread Matthew Wilcox

On Fri, Mar 23, 2001 at 09:56:47AM -0700, Bryan Henderson wrote:
> There's a lot of cool simplicity in this, both in implementation and 
> application, but it leaves something to be desired in functionality.  This 
> is partly because the price you pay for being able to use existing, 
> well-worn Unix interfaces is the ancient limitations of those interfaces 
> -- like the inability to return adequate error information.

hmm... open("defrag-error") first, then read from it if it fails?

> effective the defrag was?  And bear in mind that multiple processes may be 
> issuing commands to /mnt/control simultaneously.

you should probably serialise them.  you probably have to do this anyway.

> With ioctl, I can easily match a response of any kind to a request.  I can 
> even return an English text message if I want to be friendly.

yes, one of the nice plan9 changes was the change to returning strings
instead of numerics.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: 64-bit block sizes on 32-bit systems

2001-03-26 Thread Matthew Wilcox

On Mon, Mar 26, 2001 at 08:39:21AM -0800, LA Walsh wrote:
> I vaguely remember a discussion about this a few months back.
> If I remember, the reasoning was it would unnecessarily slow
> down smaller systems that would never have block devices in
> the 4-28T range attached.  

4k page size * 2GB = 8TB.

i consider it much more likely on such systems that the page size will
be increased to maybe 16 or 64k which would give us 32TB or 128TB.
you keep on trying to increase the size of types without looking at
what gcc outputs in the way of code that manipulates 64-bit types.
seriously, why don't you just try it?  see what the performance is.
see what the code size is.  then come back with some numbers.  and i mean
numbers, not `it doesn't feel any slower'.

personally, i'm going to see what the situation looks like in 5 years time
and try to solve the problem then.  there're enough real problems with the
VFS today that i don't feel inclined to fix tomorrow's potential problems.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: 64-bit block sizes on 32-bit systems

2001-03-26 Thread Matthew Wilcox

On Mon, Mar 26, 2001 at 08:01:21PM +0200, Manfred Spraul wrote:
> drivers/block/ll_rw_blk.c, in submit_bh()
> >bh->b_rsector = bh->b_blocknr * (bh->b_size >> 9);
> 
> But it shouldn't cause data corruptions:
> It was discussed a few months ago, and iirc LVM refuses to create too
> large volumes.

Ah yes, I'd forgotten the block layer still works in terms of 512-byte blocks.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: 64-bit block sizes on 32-bit systems

2001-03-26 Thread Matthew Wilcox

On Mon, Mar 26, 2001 at 10:47:13AM -0700, Andreas Dilger wrote:
> What do you mean by problems 5 years down the road?  The real issue is that
> this 32-bit block count limit affects composite devices like MD RAID and
> LVM today, not just individual disks.  There have been several postings
> I have seen with people having a problem _today_ with a 2TB limit on
> devices.

people who can afford 2TB of disc can afford to buy a 64-bit processor.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: 64-bit block sizes on 32-bit systems

2001-03-27 Thread Matthew Wilcox

On Mon, Mar 26, 2001 at 08:39:21AM -0800, LA Walsh wrote:
> So...is it the plan, or has it been though about -- 'abstracting'
> block numbes as a typedef 'block_nr', then at compile time
> having it be selectable as to whether or not this was to
> be a 32-bit or 64 bit quantity -- that way older systems would

Oh, did no-one mention the words `Module ABI' yet?

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] fs/fcntl.c : don't test unsigned value for less than zero

2005-04-14 Thread Matthew Wilcox
On Fri, Apr 15, 2005 at 03:07:42AM +0200, Jesper Juhl wrote:
> 'arg' is unsigned so it can never be less than zero, so testing for that 
> is pointless and also generates a warning when building with gcc -W. This 
> patch eliminates the pointless check.

Didn't Linus already reject this one 6 months ago?

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] fs/fcntl.c : don't test unsigned value for less than zero

2005-04-15 Thread Matthew Wilcox
On Fri, Apr 15, 2005 at 09:21:50AM +0100, Christoph Hellwig wrote:
> On Fri, Apr 15, 2005 at 02:31:00AM +0100, Matthew Wilcox wrote:
> > On Fri, Apr 15, 2005 at 03:07:42AM +0200, Jesper Juhl wrote:
> > > 'arg' is unsigned so it can never be less than zero, so testing for that 
> > > is pointless and also generates a warning when building with gcc -W. This 
> > > patch eliminates the pointless check.
> > 
> > Didn't Linus already reject this one 6 months ago?
> 
> I think Linux only complained if we're using some typedef that actually
> may be signed.  For fcntl that 'arg' argument is unsigned and that's hardcoded
> in the ABI.  So the check doesn't make sense at all.

No, it was exactly this patch:
http://www.ussg.iu.edu/hypermail/linux/kernel/0401.0/1816.html

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] fs/fcntl.c : don't test unsigned value for less than zero

2005-04-15 Thread Matthew Wilcox
On Fri, Apr 15, 2005 at 10:03:05PM +1000, Herbert Xu wrote:
> I suppose it could be smart and stay quiet about
> 
> val < 0 || val > BOUND
> 
> However, gcc is slow enough as it is without adding unnecessary
> smarts like this.

It only warns with -W on, not with -Wall, so I see no compelling
reason to fix this.  I think the real problem here is that 'arg'
is declared 2 pages earlier in the function prototype (aka the
function-growth-hormone-imbalance syndrome).

There's two good ways of fixing this, adding a f_setsig() function:

static inline int f_setsig(struct file *filp, unsigned long arg)
{
if (arg > _NSIG)
return -EINVAL;

filp->f_owner.signum = arg;
return 0;
}
...
case F_SETSIG:
err = f_setsig(filp, arg);
break;

or add a function that checks a variable to see if it's a valid signal number:

#define valid_signal(arg)   ((unsigned long)arg <= _NSIG)
...
case F_SETSIG:
if (!valid_signal(arg))
break;
err = 0;
filp->f_owner.signum = arg;
break;

Looks like futex.c, ptrace.c, signal.c, sys.c and almost every
architecture's ptrace code could easily make use of the latter, but not
the former.  It also looks like we have a few off-by-one errors.  For
example, in h8300's ptrace code:

case PTRACE_SYSCALL:
case PTRACE_CONT: {
ret = -EIO;
if ((unsigned long) data >= _NSIG)
break ;

but

case PTRACE_SINGLESTEP: {
ret = -EIO;
if ((unsigned long) data > _NSIG)
break;

so I'd recommend the second solution.  But be careful not to "fix up"
cases like:

./kernel/exit.c:if (sig < 1 || sig > _NSIG)

where we really don't want to allow zero.

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-19 Thread Matthew Wilcox

On Sat, May 19, 2001 at 10:22:55PM -0400, Richard Gooch wrote:
> The transaction(2) syscall can be just as easily abused as ioctl(2) in
> this respect.

But read() and write() cannot.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-22 Thread Matthew Wilcox

On Tue, May 22, 2001 at 08:49:04AM +0100, Alan Cox wrote:
> > For _devices_, though?  I don't expect my mouse to work if gpm and xfree
> > both try to consume device events from the same filp.  Heck, it doesn't
> > even work when they try to consume events from the same inode :-)  I think
> > this is a reasonable restriction for the class of devices in question.
> 
> Not really. Think about basic things like full duplex audio with two threads

`the class of devices in question' was cryptographic devices, and possibly
other transactional DSPs.  I don't consider audio to be transactional.
in any case, you can do transactional things with two threads, as long
as they each have their own fd on the device.  Think of the fd as your
transaction handle.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-22 Thread Matthew Wilcox

On Tue, May 22, 2001 at 04:31:37PM +0100, Alan Cox wrote:
> > `the class of devices in question' was cryptographic devices, and possibly
> > other transactional DSPs.  I don't consider audio to be transactional.
> > in any case, you can do transactional things with two threads, as long
> > as they each have their own fd on the device.  Think of the fd as your
> > transaction handle.
> 
> Thats a bit pathetic. So I have to fill my app with expensive pthread locks
> or hack all the drivers and totally change the multi-open sematics in the ABI

huh?

void thread_init(void) {
int fd = open("/dev/crypto");
real_thread_init(fd);
}

where was that lock again?

and notice this idea is only for transactional things -- what
transactional things do sound drivers do?

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



  1   2   3   4   5   6   7   8   9   10   >