Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 16:24 -0600, Jason Gunthorpe wrote:
> Basically, all this list processing is a huge overhead compared to
> just putting a helper call in the existing sg iteration loop of the
> actual op.  Particularly if the actual op is a no-op like no-mmu x86
> would use.

Yes, I'm leaning toward that approach too.

The helper itself could hang off the devmap though.

> Since dma mapping is a performance path we must be careful not to
> create intrinsic inefficiencies with otherwise nice layering :)
> 
> Jason


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 17:21 -0600, Jason Gunthorpe wrote:
> Splitting the sgl is different from iommu batching.
> 
> As an example, an O_DIRECT write of 1 MB with a single 4K P2P page in
> the middle.
> 
> The optimum behavior is to allocate a 1MB-4K iommu range and fill it
> with the CPU memory. Then return an SGL with three entries, two
> pointing into the range and one to the p2p.
> 
> It is creating each range which tends to be expensive, so creating
> two ranges (or worse, if every SGL entry created a range it would be
> 255) is very undesirable.

I think it's easier, to get us started, to just use a helper and
stick it in the existing sglist processing loop of the architecture.

As we noticed, stacking dma_ops is actually non-trivial and opens quite
the can of worms.

As Jerome mentioned, you can end up with I/O ops containing an sglist
that is a mix of regular memory and GPU pages, for example.
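For illustration only, the "helper in the existing sg iteration loop" idea might look like the sketch below, written in userspace C with stand-in types. None of these names are the real kernel API, and the normal-path mapping is just a placeholder:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;

/* Stand-ins for kernel structures; fields reduced to what the sketch needs. */
struct page {
	bool is_p2p;		/* would really be a ZONE_DEVICE/devmap test */
	dma_addr_t bus_addr;	/* alternate bus address for the p2p case */
};

struct scatterlist {
	struct page *page;
	unsigned int length;
	dma_addr_t dma_address;
};

/*
 * Hypothetical helper hanging off the devmap: returns true if the page
 * is device memory and it produced the bus address itself.
 */
static bool devmap_p2p_map(const struct page *pg, dma_addr_t *out)
{
	if (!pg->is_p2p)
		return false;
	*out = pg->bus_addr;
	return true;
}

/* Simplified arch dma_map_sg loop with the per-page helper call added. */
static int arch_dma_map_sg(struct scatterlist *sgl, int nents)
{
	for (int i = 0; i < nents; i++) {
		dma_addr_t addr;

		if (devmap_p2p_map(sgl[i].page, &addr))
			sgl[i].dma_address = addr;	/* p2p: bus address */
		else
			/* placeholder for the arch's normal mapping */
			sgl[i].dma_address = (dma_addr_t)(uintptr_t)sgl[i].page;
	}
	return nents;
}
```

The appeal of this shape is that the no-op case (e.g. no-mmu x86) costs one predictable branch per entry.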

Cheers,
Ben.


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 15:22 -0600, Jason Gunthorpe wrote:
> On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
> > > I think this opens an even bigger can of worms..
> > 
> > No, I don't think it does. You'd only shim when the target page is
> > backed by a device, not host memory, and you can figure this out by
> > an is_zone_device_page()-style lookup.
> 
> The bigger can of worms is how do you meaningfully stack dma_ops.
> 
> What does the p2p provider do when it detects a p2p page?

Yeah I think we don't really want to stack dma_ops... thinking more
about it.

As I just wrote, it looks like we might need a more specialised hook
in the devmap to be used by the main dma_op, on a per-page basis.

Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 15:03 -0600, Jason Gunthorpe wrote:
> I don't follow, when does get_dma_ops() return a p2p aware provider?
> It has no way to know if the DMA is going to involve p2p, get_dma_ops
> is called with the device initiating the DMA.
> 
> So you'd always return the P2P shim on a system that has registered
> P2P memory?
> 
> Even so, how does this shim work? dma_ops are not really intended to
> be stacked. How would we make unmap work, for instance? What happens
> when the underlying iommu dma ops actually natively understands p2p
> and doesn't want the shim?

Good point. We only know on a per-page basis ... ugh.

So we really need to change the arch main dma_ops. I'm not opposed to
that. What we then need to do is have that main arch dma_map_sg,
when it encounters a "device" page, call into a helper attached to
the devmap to handle *that page*, providing sufficient context.

That helper wouldn't perform the actual iommu mapping. It would simply
return something along the lines of:

 - "use that alternate bus address and don't map in the iommu"
 - "use that alternate bus address and do map in the iommu"
 - "proceed as normal"
 - "fail"
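Those four outcomes could be expressed as a small result enum for the hypothetical per-page devmap hook. The decision logic below is illustrative only, not a proposal for the actual policy:

```c
#include <assert.h>
#include <stdbool.h>

/* The four possible answers from the per-page devmap helper. */
enum p2p_map_decision {
	P2P_MAP_BUS_DIRECT,	/* alternate bus address, don't map in the iommu */
	P2P_MAP_BUS_IOMMU,	/* alternate bus address, do map in the iommu */
	P2P_MAP_NORMAL,		/* proceed as normal */
	P2P_MAP_FAIL,		/* fail the mapping */
};

/*
 * Illustrative decision: normal pages pass through; device pages map
 * directly when under the same bridge, via the iommu otherwise, and
 * fail when p2p isn't possible at all.
 */
static enum p2p_map_decision p2p_decide(bool is_device_page,
					bool same_bridge, bool p2p_ok)
{
	if (!is_device_page)
		return P2P_MAP_NORMAL;
	if (!p2p_ok)
		return P2P_MAP_FAIL;
	return same_bridge ? P2P_MAP_BUS_DIRECT : P2P_MAP_BUS_IOMMU;
}
```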

What do you think ?

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 14:48 -0600, Logan Gunthorpe wrote:
> > ...and that dma_map goes through get_dma_ops(), so I don't see the conflict?
> 
> The main conflict is in dma_map_sg which only does get_dma_ops once but
> the sg may contain memory of different types.

We can handle that in our "overridden" dma ops.

It's a bit tricky, but they *could* break the sglist down into segments
and forward portions back to the original dma ops.
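Breaking a mixed sglist into homogeneous runs reduces to a simple scan. This userspace sketch (hypothetical names) just counts the runs an overriding dma_map_sg would forward separately:

```c
#include <assert.h>
#include <stdbool.h>

struct sg_ent {
	bool is_p2p;	/* stand-in for "page is device memory" */
};

/*
 * Count the homogeneous segments of a scatterlist. An overriding
 * dma_map_sg would forward each run either to the original dma ops
 * (host memory) or to the p2p path (device memory).
 */
static int count_runs(const struct sg_ent *sg, int nents)
{
	int runs = 0;

	for (int i = 0; i < nents; i++)
		if (i == 0 || sg[i].is_p2p != sg[i - 1].is_p2p)
			runs++;
	return runs;
}
```

Note this is exactly the splitting that the earlier 1 MB O_DIRECT example shows to be expensive when each run needs its own iommu range.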

Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 12:00 -0600, Jason Gunthorpe wrote:
> - All platforms can succeed if the PCI devices are under the same
>   'segment', but where segments begin is somewhat platform specific
>   knowledge. (this is 'same switch' idea Logan has talked about)

We also need to be careful whether P2P is enabled in the switch
or not.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 10:27 -0700, Dan Williams wrote:
> > FWIW, RDMA probably wouldn't want to use a p2mem device either, we
> > already have APIs that map BAR memory to user space, and would like to
> > keep using them. An 'enable P2P for bar' helper function sounds better
> > to me.
> 
> ...and I think it's not a helper function as much as asking the bus
> provider "can these two device dma to each other". The "helper" is the
> dma api redirecting through a software-iommu that handles bus address
> translation differently than it would handle host memory dma mapping.

Do we even need that function ? The dma_ops have a dma_supported()
call...

If we have those override ops built into the "dma_target" object,
then these things can make that decision knowing both the source
and target device.
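A dma_supported()-style check living in the "dma_target" override ops would, unlike today's per-initiator dma_ops, see both endpoints. A minimal sketch, with entirely hypothetical names and policy:

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced stand-in for the information the check would need per device. */
struct p2p_endpoint {
	int segment;		/* PCI segment/domain the device sits on */
	bool p2p_enabled;	/* switch/bridge configured to allow p2p */
};

/*
 * Hypothetical dma_supported()-style predicate: it knows both the
 * initiator and the target, so it can answer "can these two devices
 * DMA to each other" rather than only "can this device DMA at all".
 */
static bool p2p_dma_supported(const struct p2p_endpoint *initiator,
			      const struct p2p_endpoint *target)
{
	return initiator->segment == target->segment &&
	       initiator->p2p_enabled && target->p2p_enabled;
}
```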

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Mon, 2017-04-17 at 23:43 -0600, Logan Gunthorpe wrote:
> 
> On 17/04/17 03:11 PM, Benjamin Herrenschmidt wrote:
> > Is it ? Again, you create a "concept" the user may have no idea about,
> > "p2pmem memory". So now any kind of memory buffer on a device that
> > could be used for p2p, but also potentially for a bunch of other
> > things, becomes special and called "p2pmem" ...
> 
> The user is going to have to have an idea about it if they are designing
> systems to make use of it. I've said it before many times: this is an
> optimization with significant trade-offs so the user does have to make
> decisions regarding when to enable it.

Not necessarily. There are many cases where the "end user" won't have
any idea. In any case, I think it comes down to two points: conflating
the allocator with the peer-to-peer DMA, and the lack of generality in
the approach to solving the peer-to-peer DMA problem.

> > But what do you have in p2pmem that somebody benefits from ? Again I
> > don't understand what that "p2pmem" device buys you in terms of
> > functionality vs. having the device just instantiate the pages.
> 
> Well thanks for just taking a big shit on all of our work without even
> reading the patches. Bravo.

Now now now ... calm down. We are being civil here. I'm not shitting
on anything; I'm asking what seems to be a reasonable question in terms
of the benefits of the approach you have chosen. Using that sort of
language will not get you anywhere.

> > Now having some kind of way to override the dma_ops, yes I do get that,
> > and it could be that this "p2pmem" is typically the way to do it, but
> > at the moment you don't even have that. So I'm a bit at a loss here.
> 
> Yes, we've already said many times that this is something we will need
> to add.
> 
> > But it doesn't *have* to be. Again, take my GPU example. The fact that
> > a NIC might be able to DMA into it doesn't make it specifically "p2p
> > memory".
> 
> Just because you use it for other things doesn't mean it can't also
> provide the service of a "p2pmem" device.

But there is no such thing as a "p2pmem" device... that's what I'm
trying to tell you...

As both Jerome and I tried to explain, there are many reasons why one
may want to do peer DMA into some device memory; that doesn't make that
memory some kind of "p2pmem". It's trying to stick a generic label onto
something that isn't.

That's why I'm suggesting we disconnect the two aspects. On one hand
the problem of handling p2p DMA, whether the target is some memory,
some MMIO registers, etc...

On the other hand, some generic "utility" that can optionally be used
by drivers to manage a pool of DMA memory in the device, essentially a
simple allocator.

The two things are completely orthogonal.

> > So now your "p2pmem" device needs to also be laid out on top of those
> > MMIO registers ? It's becoming weird.
> 
> Yes, Max Gurtovoy has also expressed an interest in expanding this work
> to cover things other than memory. He's suggested simply calling it a
> p2p device, but until we figure out what exactly that all means we can't
> really finalize a name.

Possibly. In any case, I think it should be separate from the
allocation.

> > See, basically, doing peer 2 peer between devices has 3 main challenges
> > today: The DMA API needing struct pages, the MMIO translation issues
> > and the IOMMU translation issues.
> > 
> > You seem to create that added device as some kind of "owner" for the
> > struct pages, solving #1, but leave #2 and #3 alone.
> 
> Well there are other challenges too. Like figuring out when it's
> appropriate to use, tying together the device that provides the memory
> with the driver trying to use it in DMA transactions, etc, etc. Our patch
> set tackles these latter issues.

But it tries to conflate the allocation, which is basically the fact
that this is some kind of "memory pool", with the problem of doing peer
DMA.

I'm advocating for separating the concepts.

> > If we go down that path, though, rather than calling it p2pmem I would
> > call it something like dma_target which I find much clearer especially
> > since it doesn't have to be just memory.
> 
> I'm not set on the name. My arguments have been specifically for the
> existence of an independent struct device. But I'm not really interested
> in getting into bike shedding arguments over what to call it at this
> time when we don't even really know what it's going to end up doing in
> the end.

It's not bike shedding. It's about taking out the allocator part and
making it clear that this isn't something to lay out on top of a p

Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-17 Thread Benjamin Herrenschmidt
On Mon, 2017-04-17 at 10:52 -0600, Logan Gunthorpe wrote:
> 
> On 17/04/17 01:20 AM, Benjamin Herrenschmidt wrote:
> > But is it ? For example take a GPU, does it, in your scheme, need an
> > additional "p2pmem" child ? Why can't the GPU driver just use some
> > helper to instantiate the necessary struct pages ? What does having an
> > actual "struct device" child buy you ?
> 
> Yes, in this scheme, it needs an additional p2pmem child. Why is that an
> issue? It certainly makes it a lot easier for the user to understand the
> p2pmem memory in the system (through the sysfs tree) and reason about
> the topology and when to use it. This is important.

Is it ? Again, you create a "concept" the user may have no idea about,
"p2pmem memory". So now any kind of memory buffer on a device that
could be used for p2p, but also potentially for a bunch of other
things, becomes special and called "p2pmem" ...

> > > 2) In order to create the struct pages we use the ZONE_DEVICE
> > > infrastructure which requires a struct device. (See
> > > devm_memremap_pages.)
> > 
> > Yup, but you already have one in the actual pci_dev ... What is the
> > benefit of adding a second one ?
> 
> But that would tie all of this very tightly to be pci only and may get
> hard to differentiate if more users of ZONE_DEVICE crop up who happen to
> be using a pci device.

But what do you have in p2pmem that somebody benefits from ? Again I
don't understand what that "p2pmem" device buys you in terms of
functionality vs. having the device just instantiate the pages.

Now having some kind of way to override the dma_ops, yes I do get that,
and it could be that this "p2pmem" is typically the way to do it, but
at the moment you don't even have that. So I'm a bit at a loss here.
 
>  Having a specific class for this makes it very
> clear how this memory would be handled.

But it doesn't *have* to be. Again, take my GPU example. The fact that
a NIC might be able to DMA into it doesn't make it specifically "p2p
memory".

Essentially you are saying that any device that happens to have a piece
of mappable "memory" (or something that behaves like it) and can be
DMA'ed into should now have that "p2pmem" thing attached to it.

Now take an example where that becomes really awkward (it's also a real
example of something people want to do). I have a NIC and a GPU, the
NIC DMA's data to/from the GPU, but they also want to poke at each
other's doorbells: the GPU to kick the NIC into action when data is
ready to send, the NIC to poke the GPU when data has been received.

Those doorbells are MMIO registers.

So now your "p2pmem" device needs to also be laid out on top of those
MMIO registers ? It's becoming weird.

See, basically, doing peer 2 peer between devices has 3 main challenges
today: The DMA API needing struct pages, the MMIO translation issues
and the IOMMU translation issues.

You seem to create that added device as some kind of "owner" for the
struct pages, solving #1, but leave #2 and #3 alone.

Now, as I said, it could very well be that having the devmap pointer
point to some specific device-type with a well known structure to
provide solutions for #2 and #3 such as dma_ops overrides, is indeed
the right way to solve these problems.

If we go down that path, though, rather than calling it p2pmem I would
call it something like dma_target which I find much clearer especially
since it doesn't have to be just memory.

For the sole case of creating struct page's however, I fail to see the
point.

>  For example, although I haven't
> looked into it, this could very well be a point of conflict with HMM. If
> they were to use the pci device to populate the dev_pagemap then we
> couldn't also use the pci device. I feel it's much better for users of
> dev_pagemap to have their struct devices they own to avoid such conflicts.

If we are going to create some sort of struct dma_target, HMM could
potentially just look for the parent if it needs the PCI device.

> > >  This amazingly gets us the get_dev_pagemap
> > > architecture which also uses a struct device. So by using a p2pmem
> > > device we can go from struct page to struct device to p2pmem device
> > > quickly and effortlessly.
> > 
> > Which isn't terribly useful in itself right ? What you care about is
> > the "enclosing" pci_dev no ? Or am I missing something ?
> 
> Sure it is. What if we want to someday support p2pmem that's on another bus?

But why not directly use that other bus' device in that case ?

> > > 3) You wouldn't want to use the pci's struct device because it doesn't
> > > really describe what's going on. For example, there may be multiple
> > > devices on the pci devi

Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-17 Thread Benjamin Herrenschmidt
On Sun, 2017-04-16 at 23:13 -0600, Logan Gunthorpe wrote:
> 
> > 
> > I'm still not 100% sure why you need a "p2mem device", mind you ...
> 
> Well, you don't "need" it but it is a design choice that I think makes a
> lot of sense for the following reasons:
> 
> 1) p2pmem is in fact a device on the pci bus. A pci driver will need to
> set it up and create the device and thus it will have a natural parent
> pci device. Instantiating a struct device for it means it will appear in
> the device hierarchy and one can use that to reason about its position
> in the topology.

But is it ? For example take a GPU, does it, in your scheme, need an
additional "p2pmem" child ? Why can't the GPU driver just use some
helper to instantiate the necessary struct pages ? What does having an
actual "struct device" child buy you ?

> 2) In order to create the struct pages we use the ZONE_DEVICE
> infrastructure which requires a struct device. (See
> devm_memremap_pages.)

Yup, but you already have one in the actual pci_dev ... What is the
benefit of adding a second one ?

>  This amazingly gets us the get_dev_pagemap
> architecture which also uses a struct device. So by using a p2pmem
> device we can go from struct page to struct device to p2pmem device
> quickly and effortlessly.

Which isn't terribly useful in itself right ? What you care about is
the "enclosing" pci_dev no ? Or am I missing something ?

> 3) You wouldn't want to use the pci's struct device because it doesn't
> really describe what's going on. For example, there may be multiple
> devices on the pci device in question: e.g. an NVMe card and some p2pmem.

What is "some p2pmem" ?

> Or it could be a NIC with some p2pmem.

Again what is "some p2pmem" ?

That a device might have some memory-like buffer space is all well and
good but does it need to be specifically distinguished at the device
level ? It could be inherent to what the device is... to take the GPU
example again, why would you call the FB memory "p2pmem" ?

>  Or it could just be p2pmem by itself. And the logic to figure out what
>  memory is available and where
> the address is will be non-standard so it's really straightforward to
> have any pci driver just instantiate a p2pmem device.

Again I'm not sure why it needs to "instantiate a p2pmem" device.
Maybe it's the term "p2pmem" that puts me off. If p2pmem provided a
standard way to look up the various offsets etc. I mentioned earlier,
then yes, it would make sense to have it as a staging point. As-is, I
don't know.

> It is probably worth you reading the RFC patches at this point to get a
> better feel for this.

Yup, I'll have another look a bit more in depth.

Cheers,
Ben.


> Logan


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-16 Thread Benjamin Herrenschmidt
On Sun, 2017-04-16 at 10:34 -0600, Logan Gunthorpe wrote:
> 
> On 16/04/17 09:53 AM, Dan Williams wrote:
> > ZONE_DEVICE allows you to redirect via get_dev_pagemap() to retrieve
> > context about the physical address in question. I'm thinking you can
> > hang bus address translation data off of that structure. This seems
> > vaguely similar to what HMM is doing.
> 
> Thanks! I didn't realize you had the infrastructure to look up a device
> from a pfn/page. That would really come in handy for us.

It does indeed. I won't be able to play with that much for a few weeks
(see my other email) so if you're going to tackle this while I'm away,
can you work with Jerome to make sure you don't conflict with HMM ?

I really want a way for HMM to be able to lay out struct pages over
the GPU BARs rather than in "allocated free space" for the case where
the BAR is big enough to cover all of the GPU memory.

In general, I'd like a simple & generic way for any driver to ask the
core to lay out DMA'able struct pages over BAR space. I am not
convinced this requires a "p2mem device" to be created on top of this,
but that's a different discussion.

Of course the actual ability to perform the DMA mapping will be subject
to various restrictions that will have to be implemented in the actual
"dma_ops override" backend. We can have generic code to handle the case
where devices reside on the same domain, which can deal with switch
configuration etc... we will need to have iommu specific code to handle
the case going through the fabric. 

Virtualization is a separate can of worms due to how qemu completely
fakes the MMIO space, we can look into that later.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-16 Thread Benjamin Herrenschmidt
On Sun, 2017-04-16 at 10:47 -0600, Logan Gunthorpe wrote:
> > I think you need to give other archs a chance to support this with a
> > design that considers the offset case as a first class citizen rather
> > than an afterthought.
> 
> I'll consider this. Given the fact I can use your existing
> get_dev_pagemap infrastructure to look up the p2pmem device this
> probably isn't as hard as I thought it would be anyway (we probably
> don't even need a page flag). We'd just have lookup the dev_pagemap,
> test if it's a p2pmem device, and if so, call a p2pmem_dma_map function
> which could apply the offset or do any other arch specific logic (if
> necessary).

I'm still not 100% sure why you need a "p2mem device", mind you ...

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-16 Thread Benjamin Herrenschmidt
On Sun, 2017-04-16 at 08:53 -0700, Dan Williams wrote:
> > Just thinking out loud ... I don't have a firm idea or a design. But
> > peer to peer is definitely a problem we need to tackle generically, the
> > demand for it keeps coming up.
> 
> ZONE_DEVICE allows you to redirect via get_dev_pagemap() to retrieve
> context about the physical address in question. I'm thinking you can
> hang bus address translation data off of that structure. This seems
> vaguely similar to what HMM is doing.

Ok, that's interesting. That would be a way to handle the "lookup" I
was mentioning in the email I sent a few minutes ago.

We would probably need to put some "structure" to that context.

I'm very short on time to look into the details of this for at least
a month (I'm taking about 3 weeks off for personal reasons next week),
but I'm happy to dive more into this when I'm back and sort out with
Jerome how to make it all co-habitate nicely with HMM.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-16 Thread Benjamin Herrenschmidt
On Sun, 2017-04-16 at 08:44 -0700, Dan Williams wrote:
> The difference is that there was nothing fundamental in the core
> design of pmem + DAX that prevented other archs from growing pmem
> support.

Indeed. In fact we have work-in-progress support for pmem on power
using experimental HW.

> THP and memory hotplug existed on other architectures and
> they just need to plug in their arch-specific enabling. p2p support
> needs the same starting point of something more than one architecture
> can plug into, and handling the bus address offset case needs to be
> incorporated into the design.
> 
> pmem + dax did not change the meaning of what a dma_addr_t is, p2p does.

The more I think about it, the more I tend toward something along the
lines of having the arch DMA ops be able to quickly differentiate
between "normal" memory (which includes non-PCI pmem in some cases,
it's an architecture choice I suppose) and "special device" memory
(page flag ? pfn bit ? ... there are options).

From there, we keep our existing fast path for the normal case.

For the special case, we need to provide a fast lookup mechanism
(assuming we can't stash enough stuff in struct page or the pfn)
to get back to a struct of some sort that provides the necessary
information to resolve the translation.

This *could* be something like a struct p2mem device that carries
a special set of DMA ops, though we probably shouldn't make the generic
structure PCI specific.

This is a slightly slower path, but that "stub" structure allows the
special DMA ops to provide the necessary bus-specific knowledge, which
for PCI for example, can check whether the devices are on the same
segment, whether the switches are configured to allow p2p, etc...

What form should that fast lookup take ? It's not completely clear to
me at that point. We could start with a simple linear lookup I suppose
and improve in a second stage.
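The "simple linear lookup" starting point might be nothing more than a scan of registered device ranges. Everything below is a userspace sketch with made-up names:

```c
#include <assert.h>
#include <stddef.h>

/* One registered window of device memory, keyed by pfn. */
struct p2p_range {
	unsigned long pfn_start;	/* inclusive */
	unsigned long pfn_end;		/* exclusive */
	int id;				/* stand-in for a backing struct */
};

/*
 * Linear fallback lookup: find the device range a pfn belongs to, or
 * NULL for ordinary memory, which stays on the existing fast path.
 * A tree or sorted table could replace this later without changing
 * the callers.
 */
static const struct p2p_range *p2p_lookup(const struct p2p_range *tbl,
					  size_t n, unsigned long pfn)
{
	for (size_t i = 0; i < n; i++)
		if (pfn >= tbl[i].pfn_start && pfn < tbl[i].pfn_end)
			return &tbl[i];
	return NULL;
}
```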

Of course this feeds into the old discussion about disconnecting
the DMA ops from struct page. If we keep struct page, any device that
wants to be a potential DMA target will need to do something "special"
to create those struct pages etc., though we could make that a simple
pci helper that sets up the necessary bits and pieces for a given BAR
& range.

If we don't need struct page, then it might be possible to hide it all
in the PCI infrastructure.

> > Virtualization specifically would be a _lot_ more difficult than simply
> > supporting offsets. The actual topology of the bus will probably be lost
> > on the guest OS and it would therefore have a difficult time figuring out
> > when it's acceptable to use p2pmem. I also have a difficult time seeing
> > a use case for it and thus I have a hard time with the argument that we
> > can't support use cases that do want it because use cases that don't
> > want it (perhaps yet) won't work.
> > 
> > > This is an interesting experiement to look at I suppose, but if you
> > > ever want this upstream I would like at least for you to develop a
> > > strategy to support the wider case, if not an actual implementation.
> > 
> > I think there are plenty of avenues forward to support offsets, etc.
> > It's just work. Nothing we'd be proposing would be incompatible with it.
> > We just don't want to have to do it all upfront especially when no one
> > really knows how well various architecture's hardware supports this or
> > if anyone even wants to run it on systems such as those. (Keep in mind
> > this is a pretty specific optimization that mostly helps systems
> > designed in specific ways -- not a general "everybody gets faster" type
> > situation.) Get the cases working we know will work, can easily support
> > and people actually want.  Then expand it to support others as people
> > come around with hardware to test and use cases for it.
> 
> I think you need to give other archs a chance to support this with a
> design that considers the offset case as a first class citizen rather
> than an afterthought.

Thanks :-) There's a reason why I'm insisting on this. We have constant
requests for this today. We have hacks in the GPU drivers to do it for
GPUs behind a switch, but those are just that, ad-hoc hacks in the
drivers. We have similar grossness around the corner with some CAPI
NICs trying to DMA to GPUs. I have people trying to use PLX DMA engines
to whack NVMe devices.

I'm very interested in a more generic solution to deal with the problem
of P2P between devices. I'm happy to contribute with code to handle the
powerpc bits but we need to agree on the design first :)

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-15 Thread Benjamin Herrenschmidt
On Sat, 2017-04-15 at 15:09 -0700, Dan Williams wrote:
> I'm wondering, since this is limited to support behind a single
> switch, if you could have a software-iommu hanging off that switch
> device object that knows how to catch and translate the non-zero
> offset bus address case. We have something like this with VMD driver,
> and I toyed with a soft pci bridge when trying to support AHCI+NVME
> bar remapping. When the dma api looks up the iommu for its device it
> hits this soft-iommu and that driver checks if the page is host memory
> or device memory to do the dma translation. You wouldn't need a bit in
> struct page, just a lookup to the hosting struct dev_pagemap in the
> is_zone_device_page() case and that can point you to p2p details.

I was thinking about a hook in the arch DMA ops, but that kind of
wrapper might indeed work instead. However, I'm not sure what the best
way to "instantiate" it would be.

The main issue is that the DMA ops are a function of the initiator,
not the target (since the target is supposed to be memory) so things
are a bit awkward.

One (user ?) would have to know that a given device "intends" to DMA
directly to another device.

This is awkward because in the ideal scenario, this isn't something the
device knows. For example, one could want to have an existing NIC DMA
directly to/from NVME pages or GPU pages.

The NIC itself doesn't know the characteristic of these pages, but
*something* needs to insert itself in the DMA ops of that bridge to
make it possible.

That's why I wonder if it's the struct page of the target that should
be "marked" in such a way that the arch dma_ops can immediately catch
that it belongs to a device and might require "wrapped" operations.

Are ZONE_DEVICE pages identifiable based on the struct page alone ? (a
flag ?)

That would allow us to keep a fast path for normal memory targets, but
also have some kind of way to handle the special cases of such peer 2
peer (or also handle other type of peer to peer that don't necessarily
involve PCI address wrangling but could require additional iommu bits).
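On the identification question: the kernel encodes the zone number in page->flags, which is what is_zone_device_page() tests, so the struct page alone is enough. The sketch below mimics that check with simplified, made-up bit positions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified stand-in: the real kernel packs the zone number into the
 * top bits of page->flags; the shift and zone value here are made up.
 */
#define SKETCH_ZONE_SHIFT	60
#define SKETCH_ZONE_DEVICE	3ULL

struct page {
	uint64_t flags;
};

/* Mimics is_zone_device_page(): decided from the struct page alone. */
static bool sketch_is_zone_device_page(const struct page *pg)
{
	return (pg->flags >> SKETCH_ZONE_SHIFT) == SKETCH_ZONE_DEVICE;
}
```

Because the test is a flag comparison, the normal-memory fast path pays only one branch per page.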

Just thinking out loud ... I don't have a firm idea or a design. But
peer to peer is definitely a problem we need to tackle generically, the
demand for it keeps coming up.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-15 Thread Benjamin Herrenschmidt
On Sat, 2017-04-15 at 11:41 -0600, Logan Gunthorpe wrote:
> Thanks, Benjamin, for the summary of some of the issues.
> 
> On 14/04/17 04:07 PM, Benjamin Herrenschmidt wrote
> > So I assume the p2p code provides a way to address that too via special
> > dma_ops ? Or wrappers ?
> 
> Not at this time. We will probably need a way to ensure the iommus do
> not attempt to remap these addresses.

You can't. If the iommu is on, everything is remapped. Or do you mean
to have dma_map_* not do a remapping ?

That's the problem again, same as before, for that to work, the
dma_map_* ops would have to do something special that depends on *both*
the source and target device.

The current DMA infrastructure doesn't have anything like that. It's a
rather fundamental issue to your design that you need to address.

The dma_ops today are architecture specific and have no way to
differentiate between normal pages and those special P2P DMA pages.

> Though if it does, I'd expect
> everything would still work you just wouldn't get the performance or
> traffic flow you are looking for. We've been testing with the software
> iommu which doesn't have this problem.

So first, no, it's more than "you wouldn't get the performance". On
some systems it may also just not work. Also what do you mean by "the
SW iommu doesn't have this problem" ? It catches the fact that
addresses don't point to RAM and maps differently ?

> > The problem is that the latter while seemingly easier, is also slower
> > and not supported by all platforms and architectures (for example,
> > POWER currently won't allow it, or rather only allows a store-only
> > subset of it under special circumstances).
> 
> Yes, I think situations where we have to cross host bridges will remain
> unsupported by this work for a long time. There are too many cases where
> it just doesn't work or it performs too poorly to be useful.

And the situation where you don't cross bridges is the one where you
need to also take into account the offsets.

*Both* cases mean that you need to somehow intervene at the dma_ops
level to handle this, which means having a way to identify your special
struct pages or PFNs so the arch can add a special case to the
dma_ops.

> > I don't fully understand how p2pmem "solves" that by creating struct
> > pages. The offset problem is one issue. But there's the iommu issue as
> > well, the driver cannot just use the normal dma_map ops.
> 
> We are not using a proper iommu and we are dealing with systems that
> have zero offset. This case is also easily supported. I expect fixing
> the iommus to not map these addresses would also be reasonably achievable.

So you are designing something that is built from scratch to only work
on a specific limited category of systems and is also incompatible with
virtualization.

This is an interesting experiment to look at I suppose, but if you
ever want this upstream I would like at least for you to develop a
strategy to support the wider case, if not an actual implementation.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-14 Thread Benjamin Herrenschmidt
On Fri, 2017-04-14 at 14:04 -0500, Bjorn Helgaas wrote:
> I'm a little hesitant about excluding offset support, so I'd like to
> hear more about this.
> 
> Is the issue related to PCI BARs that are not completely addressable
> by the CPU?  If so, that sounds like a first-class issue that should
> be resolved up front because I don't think the PCI core in general
> would deal well with that.
> 
> If all the PCI memory of interest is in fact addressable by the CPU,
> I would think it would be pretty straightforward to support offsets
> --
> everywhere you currently use a PCI bus address, you just use the
> corresponding CPU physical address instead.

It's not *that* easy sadly. The reason is that normal dma map APIs
assume the "target" of the DMAs are system memory, there is no way to
pass it "another device", in a way that allows it to understand the
offsets if needed.

That said, dma_map will already be problematic when doing p2p behind
the same bridge due to the fact that the iommu is not in the way so you
can't use the arch standard ops there.

So I assume the p2p code provides a way to address that too via special
dma_ops ? Or wrappers ?

Basically there are two very different ways you can do p2p, either
behind the same host bridge or across two host bridges:

 - Behind the same host bridge, you don't go through the iommu, which
means that even if your target looks like a struct page, you can't just
use dma_map_* on it because what you'll get back from that is an iommu
token, not a sibling BAR offset. Additionally, if you use the target
struct resource address, you will need to offset it to get back to the
actual BAR value on the PCIe bus.

 - Behind different host bridges, then you go through the iommu and the
host remapping. IE. that's actually the easy case. You can probably
just use the normal iommu path and normal dma mapping ops, you just
need to have your struct page representing the target device BAR *in
CPU space* this time. So no offsetting required either.

The problem is that the latter while seemingly easier, is also slower
and not supported by all platforms and architectures (for example,
POWER currently won't allow it, or rather only allows a store-only
subset of it under special circumstances).

So what people practically want to do is to have 2 devices behind a
switch DMA'ing to/from each other.

But that brings the 2 problems above.

I don't fully understand how p2pmem "solves" that by creating struct
pages. The offset problem is one issue. But there's the iommu issue as
well, the driver cannot just use the normal dma_map ops.

I haven't had a chance to look at the details of the patches but it's
not clear from the description in patch 0 how that is solved.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-14 Thread Benjamin Herrenschmidt
On Thu, 2017-04-13 at 22:40 -0600, Logan Gunthorpe wrote:
> 
> On 13/04/17 10:16 PM, Jason Gunthorpe wrote:
> > I'd suggest just detecting if there is any translation in bus
> > addresses anywhere and just hard disabling P2P on such systems.
> 
> That's a fantastic suggestion. It simplifies things significantly.
> Unless there are any significant objections I think I will plan on
> doing
> that.

I object.

> > On modern hardware with 64 bit BARs there is very little reason to
> > have translation, so I think this is a legacy feature.
> 
> Yes, p2pmem users are likely to be designing systems around it (ie
> JBOFs) and not trying to shoehorn it onto legacy architectures.
> 
> At the very least, it makes sense to leave it out and if someone
> comes
> along who cares they can put in the effort to support the address
> translation.
> 
> Thanks,
> 
> Logan


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-14 Thread Benjamin Herrenschmidt
On Fri, 2017-04-14 at 21:37 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2017-04-13 at 22:40 -0600, Logan Gunthorpe wrote:
> > 
> > On 13/04/17 10:16 PM, Jason Gunthorpe wrote:
> > > I'd suggest just detecting if there is any translation in bus
> > > addresses anywhere and just hard disabling P2P on such systems.
> > 
> > That's a fantastic suggestion. It simplifies things significantly.
> > Unless there are any significant objections I think I will plan on
> > doing
> > that.
> 
> I object.

Note: It would also make your stuff fundamentally incompatible with
KVM guest pass-through since KVM plays with remapping BARs all over
the place.

Ben.

> > > On modern hardware with 64 bit BARs there is very little reason
> > > to
> > > have translation, so I think this is a legacy feature.
> > 
> > Yes, p2pmem users are likely to be designing systems around it (ie
> > JBOFs) and not trying to shoehorn it onto legacy architectures.
> > 
> > At the very least, it makes sense to leave it out and if someone
> > comes
> > along who cares they can put in the effort to support the address
> > translation.
> > 
> > Thanks,
> > 
> > Logan



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-14 Thread Benjamin Herrenschmidt
On Thu, 2017-04-13 at 22:16 -0600, Jason Gunthorpe wrote:
> > Any caller of pci_add_resource_offset() uses CPU addresses different from
> > the PCI bus addresses (unless the offset is zero, of course).  All ACPI
> > platforms also support this translation (see "translation_offset"), though
> > in most x86 systems the offset is zero.  I'm aware of one x86 system that
> > was tested with a non-zero offset but I don't think it was shipped that
> > way.
> 
> I'd suggest just detecting if there is any translation in bus
> addresses anywhere and just hard disabling P2P on such systems.

I object to designing a subsystem that by design cannot work on whole
categories of architectures out there.

> On modern hardware with 64 bit BARs there is very little reason to
> have translation, so I think this is a legacy feature.

No it's not.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-13 Thread Benjamin Herrenschmidt
On Thu, 2017-04-13 at 15:22 -0600, Logan Gunthorpe wrote:
> 
> On 12/04/17 03:55 PM, Benjamin Herrenschmidt wrote:
> > Look at pcibios_resource_to_bus() and pcibios_bus_to_resource(). They
> > will perform the conversion between the struct resource content (CPU
> > physical address) and the actual PCI bus side address.
> 
> Ah, thanks for the tip! On my system, this translation returns the same
> address so it was not necessary. And, yes, that means this would have to
> find its way into the dma mapping routine somehow. This means we'll
> eventually need a way to look-up the p2pmem device from the struct page.
> Which means we will likely need a new flag bit in the struct page or
> something. The big difficulty I see is testing. Do you know what
> architectures or in what circumstances are these translations used?

I think a bunch of non-x86 architectures but I don't know which ones
outside of powerpc.

> > When behind the same switch you need to use PCI addresses. If one tries
> > later to do P2P between host bridges (via the CPU fabric) things get
> > more complex and one will have to use either CPU addresses or something
> > else altogether (probably would have to teach the arch DMA mapping
> > routines to work with those struct pages you create and return the
> > right thing).
> 
> Probably for starters we'd want to explicitly deny cases between host
> bridges and add that later if someone wants to do the testing.

Cheers,
Ben.

> Thanks,
> 
> Logan


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-12 Thread Benjamin Herrenschmidt
On Wed, 2017-04-12 at 11:09 -0600, Logan Gunthorpe wrote:
> 
> > Do you handle funky address translation too ? IE. the fact that the PCI
> > addresses aren't the same as the CPU physical addresses for a BAR ?
> 
> No, we use the CPU physical address of the BAR. If it's not mapped that
> way we can't use it.

Ok, you need to fix that or a bunch of architectures won't work. 

Look at pcibios_resource_to_bus() and pcibios_bus_to_resource(). They
will perform the conversion between the struct resource content (CPU
physical address) and the actual PCI bus side address.

When behind the same switch you need to use PCI addresses. If one tries
later to do P2P between host bridges (via the CPU fabric) things get
more complex and one will have to use either CPU addresses or something
else altogether (probably would have to teach the arch DMA mapping
routines to work with those struct pages you create and return the
right thing).

> > > This will mean many setups that could likely
> > > work well will not be supported so that we can be more confident it
> > > will work and not place any responsibility on the user to understand
> > > their topology. (We've chosen to go this route based on feedback we
> > > received at LSF).
> > > 
> > > In order to enable this functionality we introduce a new p2pmem device
> > > which can be instantiated by PCI drivers. The device will register some
> > > PCI memory as ZONE_DEVICE and provide a genalloc-based allocator for
> > > users of these devices to get buffers.
> > 
> > I don't completely understand this. This is actual memory on the PCI
> > bus ? Where does it come from ? Or are you just trying to create struct
> > pages that cover your PCIe DMA target ?
> 
> Yes, the memory is on the PCI bus in a BAR. For now we have a special
> PCI card for this, but in the future it would likely be the CMB in an
> NVMe card. These patches create struct pages to map these BAR addresses
> using ZONE_DEVICE.

Ok.

So ideally we'd want things like dma_map_* to be able to be fed those
struct pages and do the right thing which is ... tricky, especially
with the address translation I mentioned since the address will be
different whether the initiator is on the same host bridge as the
target or not.

> > So correct me if I'm wrong, you are trying to create struct page's that
> > map a PCIe BAR right ? I'm trying to understand how that interacts with
> > what Jerome is doing for HMM.
> 
> Yes, well we are using ZONE_DEVICE in the exact same way as the dax code
> is. These patches use the existing API with no modifications. As I
> understand it, HMM was using ZONE_DEVICE in a way that was quite
> different to how it was originally designed.

Sort-of. I don't see why there would be a conflict with the struct
pages use though. Jerome can you chime in ? Jerome: It looks like they
are just laying out struct page over a BAR which is the same thing I
think you should do when the BAR is "large enough" for the GPU memory.

The case where HMM uses "holes" in the address space for its struct
page is somewhat orthogonal but I also see no conflict here.

> > The reason is that the HMM currently creates the struct pages with
> > "fake" PFNs pointing to a hole in the address space rather than
> > covering the actual PCIe memory of the GPU. He does that to deal with
> > the fact that some GPUs have a smaller aperture on PCIe than their
> > total memory.
> 
> I'm aware of what HMM is trying to do and although I'm not familiar with
> the intimate details, I saw it as fairly orthogonal to what we are
> attempting to do.

Right.

> > However, I have asked him to only apply that policy if the aperture is
> > indeed smaller, and if not, create struct pages that directly cover the
> > PCIe BAR of the GPU instead, which will work better on systems or
> > architecture that don't have a "pinhole" window limitation.
> > However he was under the impression that this was going to collide with
> > what you guys are doing, so I'm trying to understand how. 
> 
> I'm not sure I understand how either. However, I suspect if you collide
> with these patches then you'd also be breaking dax too.

Possibly but as I said, I don't see why so I'll let Jerome chime in
since he was under the impression that there was a conflict here :-)

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-12 Thread Benjamin Herrenschmidt
On Thu, 2017-03-30 at 16:12 -0600, Logan Gunthorpe wrote:
> Hello,
> 
> As discussed at LSF/MM we'd like to present our work to enable
> copy offload support in NVMe fabrics RDMA targets. We'd appreciate
> some review and feedback from the community on our direction.
> This series is not intended to go upstream at this point.
> 
> The concept here is to use memory that's exposed on a PCI BAR as
> data buffers in the NVME target code such that data can be transferred
> from an RDMA NIC to the special memory and then directly to an NVMe
> device avoiding system memory entirely. The upside of this is better
> QoS for applications running on the CPU utilizing memory and lower
> PCI bandwidth required to the CPU (such that systems could be designed
> with fewer lanes connected to the CPU). However, presently, the trade-off
> is currently a reduction in overall throughput. (Largely due to hardware
> issues that would certainly improve in the future).

Another issue of course is that not all systems support P2P
between host bridges :-) (Though almost all switches can enable it).

> Due to these trade-offs we've designed the system to only enable using
> the PCI memory in cases where the NIC, NVMe devices and memory are all
> behind the same PCI switch.

Ok. I suppose that's a reasonable starting point. I haven't looked
at the patches in detail yet but it would be nice if that policy was in
a well isolated component so it can potentially be affected by
arch/platform code.

Do you handle funky address translation too ? IE. the fact that the PCI
addresses aren't the same as the CPU physical addresses for a BAR ?

> This will mean many setups that could likely
> work well will not be supported so that we can be more confident it
> will work and not place any responsibility on the user to understand
> their topology. (We've chosen to go this route based on feedback we
> received at LSF).
> 
> In order to enable this functionality we introduce a new p2pmem device
> which can be instantiated by PCI drivers. The device will register some
> PCI memory as ZONE_DEVICE and provide a genalloc-based allocator for
> users of these devices to get buffers.

I don't completely understand this. This is actual memory on the PCI
bus ? Where does it come from ? Or are you just trying to create struct
pages that cover your PCIe DMA target ?
 
> We give an example of enabling
> p2p memory with the cxgb4 driver, however currently these devices have
> some hardware issues that prevent their use so we will likely be
> dropping this patch in the future. Ideally, we'd want to enable this
> functionality with NVME CMB buffers, however we don't have any hardware
> with this feature at this time.

So correct me if I'm wrong, you are trying to create struct page's that
map a PCIe BAR right ? I'm trying to understand how that interacts with
what Jerome is doing for HMM.

The reason is that the HMM currently creates the struct pages with
"fake" PFNs pointing to a hole in the address space rather than
covering the actual PCIe memory of the GPU. He does that to deal with
the fact that some GPUs have a smaller aperture on PCIe than their
total memory.

However, I have asked him to only apply that policy if the aperture is
indeed smaller, and if not, create struct pages that directly cover the
PCIe BAR of the GPU instead, which will work better on systems or
architecture that don't have a "pinhole" window limitation.

However he was under the impression that this was going to collide with
what you guys are doing, so I'm trying to understand how. 

> In nvmet-rdma, we attempt to get an appropriate p2pmem device at
> queue creation time and if a suitable one is found we will use it for
> all the (non-inlined) memory in the queue. An 'allow_p2pmem' configfs
> attribute is also created which is required to be set before any p2pmem
> is attempted.
> 
> This patchset also includes a more controversial patch which provides an
> interface for userspace to obtain p2pmem buffers through an mmap call on
> a cdev. This enables userspace to fairly easily use p2pmem with RDMA and
> O_DIRECT interfaces. However, the user would be entirely responsible for
> knowing what they're doing and inspecting sysfs to understand the pci
> topology and only using it in sane situations.
> 
> Thanks,
> 
> Logan
> 
> 
> Logan Gunthorpe (6):
>   Introduce Peer-to-Peer memory (p2pmem) device
>   nvmet: Use p2pmem in nvme target
>   scatterlist: Modify SG copy functions to support io memory.
>   nvmet: Be careful about using iomem accesses when dealing with p2pmem
>   p2pmem: Support device removal
>   p2pmem: Added char device user interface
> 
> Steve Wise (2):
>   cxgb4: setup pcie memory window 4 and create p2pmem region
>   p2pmem: Add debugfs "stats" file
> 
>  drivers/memory/Kconfig  |   5 +
>  drivers/memory/Makefile |   2 +
>  drivers/memory/p2pmem.c | 697 
> 
>  

Re: [PATCH] scsi: lpfc: Add shutdown method for kexec

2017-03-06 Thread Benjamin Herrenschmidt
On Mon, 2017-03-06 at 22:46 -0500, Martin K. Petersen wrote:
> > > > > > "Mauricio" == Mauricio Faria de Oliveira  > > > > > et.ibm.com> writes:
> 
> Mauricio> On 02/12/2017 07:49 PM, Anton Blanchard wrote:
> > > We see lpfc devices regularly fail during kexec. Fix this by
> > > adding a
> > > shutdown method which mirrors the remove method.
> 
> Mauricio> @mkp, @jejb: may you please flag this patch for
> stable?  Thank
> Mauricio> you.
> 
> I don't recall a consensus being reached on this patch.

What would be the opposition ? Without it kexec breaks. With it, it
works ...

Now we all seem to agree that kexec should be overhauled to not use
shutdown but instead unbind drivers, but that's a more long term
project that nobody yet had a chance to tackle.

In the meantime, some systems need a functioning kexec to boot.

Cheers,
Ben.



Re: [PATCH] scsi: lpfc: Add shutdown method for kexec

2017-02-13 Thread Benjamin Herrenschmidt
On Tue, 2017-02-14 at 15:45 +1300, Eric W. Biederman wrote:
> The only difference ever that should exist between shutdown and remove
> is do you clean up kernel data structures.  The shutdown method is
> allowed to skip the cleanup up kernel data structures that the remove
> method needs to make.
> 
> Assuming the kernel does not have corrupted data structures I don't see
> any practical reason for distinguishing between the two.  There may be
> some real world gotchas we have to deal with but semantically I don't
> see any concerns.

We had historical additions to shutdown actually shutting things down,
like spinning platters down to park disk heads, switching devices
into D3 on some systems, etc... that remove never had (and we don't
want for kexec).

Cheers,
Ben.


Re: [PATCH] scsi: lpfc: Add shutdown method for kexec

2017-02-13 Thread Benjamin Herrenschmidt
On Mon, 2017-02-13 at 15:57 -0600, Brian King wrote:
> If we do transition to use remove rather than shutdown, I think we
> want
> some way for a device driver to know whether we are doing kexec or
> not.
> A RAID adapter with a write cache is going to want to flush its write
> cache on a PCI hotplug remove, but for a kexec, its going to want to
> skip
> that so the kexec is faster. Today, since kexec looks like a reboot,
> rather than a shutdown, we can skip the flush on a reboot, since its
> technically not needed there either.

What happens if a non-flushed adapter gets a PERST ?

I wouldn't trust that 'don't have to flush' magic ...

Cheers,
Ben.



Re: [PATCH] scsi: lpfc: Add shutdown method for kexec

2017-02-12 Thread Benjamin Herrenschmidt
On Mon, 2017-02-13 at 13:21 +1300, Eric W. Biederman wrote:
> > Good point, at the very least we should call remove if shutdown doesn't
> > exist. Eric: could we make the changes Ben suggests?
> 
> Definitely.  That was the original design of the kexec interface
> but people were worried about calling remove during reboot might
> introduce regressions on the reboot functionality.  So shutdown was
> added to be remove without the cleaning up the kernel data structures.

Right. Part of the problem was also that remove was dependent on CONFIG_HOTPLUG
though that's no longer the case anymore.

The problem is that a bunch of drivers either don't have a shutdown or
worse, have one that actually shuts the HW down rather than "idle" it
which puts it in a state that the new kernel can't deal with.

> I am all for changing the core to call remove.  That seems to be a more
> tested code path (because remove tends to be part of the development
> path for modules) and thus much more likely to work in practice.
> 
> The challenge with changes in this area is that when the infrastructure
> is changed for everyone someone needs to baby sit it until all of the
> unexpected issues are resolved.  I was hoping a couple of years ago that
> Ben could be that person.

Correct. And I lack time. But since we are a platform that uses kexec
daily as our main booting mechanism we should probably be the ones
driving that.

I had a patch at least to fallback to remove in absence of shutdown
which I can try to dig out.

We can add a config option to make it do the other way around that
we should start testing internally. I'm sure we will find 1 or 2
regressions as drivers do weird things.

Cheers,
Ben.



Re: [PATCH] scsi: lpfc: Add shutdown method for kexec

2017-02-12 Thread Benjamin Herrenschmidt
On Mon, 2017-02-13 at 08:49 +1100, Anton Blanchard wrote:
> From: Anton Blanchard 
> 
> We see lpfc devices regularly fail during kexec. Fix this by adding
> a shutdown method which mirrors the remove method.

Or instead finally do what I've been advocating for years (and even
sent patches for) which is to have kexec call remove instead of
shutdown.

Shutdown is and has *always* been the wrong thing to do.

> Signed-off-by: Anton Blanchard 
> ---
>  drivers/scsi/lpfc/lpfc_init.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/scsi/lpfc/lpfc_init.c
> b/drivers/scsi/lpfc/lpfc_init.c
> index 4776fd8..10f75ad 100644
> --- a/drivers/scsi/lpfc/lpfc_init.c
> +++ b/drivers/scsi/lpfc/lpfc_init.c
> @@ -11447,6 +11447,7 @@ static struct pci_driver lpfc_driver = {
>   .id_table   = lpfc_id_table,
>   .probe  = lpfc_pci_probe_one,
>   .remove = lpfc_pci_remove_one,
> + .shutdown   = lpfc_pci_remove_one,
>   .suspend= lpfc_pci_suspend_one,
>   .resume = lpfc_pci_resume_one,
>   .err_handler= &lpfc_err_handler,


Re: [PATCH] ibmvscsi: add write memory barrier to CRQ processing

2016-12-09 Thread Benjamin Herrenschmidt
On Wed, 2016-12-07 at 17:31 -0600, Tyrel Datwyler wrote:
> The first byte of each CRQ entry is used to indicate whether an entry is
> a valid response or free for the VIOS to use. After processing a
> response the driver sets the valid byte to zero to indicate the entry is
> now free to be reused. Add a memory barrier after this write to ensure
> no other stores are reordered when updating the valid byte.

Which "other stores" specifically ? This smells fishy without that
precision. It's important to always understand what exactly barriers
order with.

Cheers,
Ben.

> Signed-off-by: Tyrel Datwyler 
> ---
>  drivers/scsi/ibmvscsi/ibmvscsi.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c 
> b/drivers/scsi/ibmvscsi/ibmvscsi.c
> index d9534ee..2f5b07e 100644
> --- a/drivers/scsi/ibmvscsi/ibmvscsi.c
> +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
> @@ -232,6 +232,7 @@ static void ibmvscsi_task(void *data)
> >     while ((crq = crq_queue_next_crq(&hostdata->queue)) != NULL) {
> >     ibmvscsi_handle_crq(crq, hostdata);
> >     crq->valid = VIOSRP_CRQ_FREE;
> > +   wmb();
> >     }
>  
> >     vio_enable_interrupts(vdev);
> @@ -240,6 +241,7 @@ static void ibmvscsi_task(void *data)
> >     vio_disable_interrupts(vdev);
> >     ibmvscsi_handle_crq(crq, hostdata);
> >     crq->valid = VIOSRP_CRQ_FREE;
> > +   wmb();
> >     } else {
> >     done = 1;
> >     }

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] scsi/ipr: Fix runaway IRQs when falling back from MSI to LSI

2016-11-23 Thread Benjamin Herrenschmidt
LSIs must be ack'ed with an MMIO otherwise they remain asserted
forever. This is controlled by the "clear_isr" flag.

While we set that flag properly when deciding initially whether
to use LSIs or MSIs, we fail to set it if we first chose MSIs,
the test fails, then fallback to LSIs.

Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
---
 drivers/scsi/ipr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/scsi/ipr.c b/drivers/scsi/ipr.c
index 5324741..5dd3194 100644
--- a/drivers/scsi/ipr.c
+++ b/drivers/scsi/ipr.c
@@ -10213,6 +10213,7 @@ static int ipr_probe_ioa(struct pci_dev *pdev,
}
 
ioa_cfg->intr_flag = IPR_USE_LSI;
+   ioa_cfg->clear_isr = 1;
ioa_cfg->nvectors = 1;
}
else if (rc)



Re: [PATCH v4 2/3] cxlflash: Superpipe support

2015-08-10 Thread Benjamin Herrenschmidt
On Mon, 2015-08-10 at 12:09 -0500, Matthew R. Ochs wrote:
 Add superpipe supporting infrastructure to device driver for the IBM CXL
 Flash adapter. This patch allows userspace applications to take advantage
 of the accelerated I/O features that this adapter provides and bypass the
 traditional filesystem stack.

 So in a similar vein to the previous review, I am missing a lot of
context here but a few things did spring to me eyes:

 +/**
 + * lookup_local() - find a local LUN information structure by WWID
 + * @cfg: Internal structure associated with the host.
 + * @wwid: WWID associated with LUN.
 + *
 + * Return: Found local lun_info structure on success, NULL on failure
 + */
 +static struct llun_info *lookup_local(struct cxlflash_cfg *cfg, u8 *wwid)
 +{
 + struct llun_info *lli, *temp;
 + ulong lock_flags;
 +
 + spin_lock_irqsave(&cfg->slock, lock_flags);
 +
 + list_for_each_entry_safe(lli, temp, &cfg->lluns, list)
 + if (!memcmp(lli->wwid, wwid, DK_CXLFLASH_MANAGE_LUN_WWID_LEN)) {
 + lli->newly_created = false;

This is weird ... a lookup effectively changes the state of the object
looked up... what for ? There is something oddball here.

It might be legit but in that case, you should really add a comment
explaining the logic around that 'newly_created' field.

Also you drop the lock right below but you have no refcounting, are
these objects ever disposed of ?

In general, can you provide a clearer explanation of what are global
vs local LUNs ?

 + spin_unlock_irqrestore(&cfg->slock, lock_flags);
 + return lli;
 + }
 +
 + spin_unlock_irqrestore(&cfg->slock, lock_flags);
 + return NULL;
 +}
 +
 +/**
 + * lookup_global() - find a global LUN information structure by WWID
 + * @wwid: WWID associated with LUN.
 + *
 + * Return: Found global lun_info structure on success, NULL on failure
 + */
 +static struct glun_info *lookup_global(u8 *wwid)
 +{
 + struct glun_info *gli, *temp;
 + ulong lock_flags;
 +
 + spin_lock_irqsave(&global.slock, lock_flags);
 +
 + list_for_each_entry_safe(gli, temp, &global.gluns, list)
 + if (!memcmp(gli->wwid, wwid, DK_CXLFLASH_MANAGE_LUN_WWID_LEN)) {
 + spin_unlock_irqrestore(&global.slock, lock_flags);
 + return gli;
 + }
 +
 + spin_unlock_irqrestore(&global.slock, lock_flags);
 + return NULL;
 +}

Same ...

 +/**
 + * lookup_lun() - find or create a local LUN information structure
 + * @sdev: SCSI device associated with LUN.
 + * @wwid: WWID associated with LUN.
 + *
 + * When a local LUN is not found and a global LUN is also not found, both
 + * a global LUN and local LUN are created. The global LUN is added to the
 + * global list and the local LUN is returned.

Make the function name more explicit: find_or_create_lun() for example.
I very much dislike a function called lookup that has side effects.

 + * Return: Found/Allocated local lun_info structure on success, NULL on 
 failure
 + */
 +static struct llun_info *lookup_lun(struct scsi_device *sdev, u8 *wwid)
 +{
 + struct llun_info *lli = NULL;
 + struct glun_info *gli = NULL;
 + struct Scsi_Host *shost = sdev->host;
 + struct cxlflash_cfg *cfg = shost_priv(shost);
 + ulong lock_flags;
 +
 + if (unlikely(!wwid))
 + goto out;
 +
 + lli = lookup_local(cfg, wwid);
 + if (lli)
 + goto out;
 +
 + lli = create_local(sdev, wwid);
 + if (unlikely(!lli))
 + goto out;

Similar question to earlier, you have no refcounting, should I assume
these things never get removed ?

 + gli = lookup_global(wwid);
 + if (gli) {
 + lli->parent = gli;
 + spin_lock_irqsave(&cfg->slock, lock_flags);
 + list_add(&lli->list, &cfg->lluns);
 + spin_unlock_irqrestore(&cfg->slock, lock_flags);
 + goto out;
 + }
 +
 + gli = create_global(sdev, wwid);
 + if (unlikely(!gli)) {
 + kfree(lli);
 + lli = NULL;
 + goto out;
 + }
 +
 + lli->parent = gli;
 + spin_lock_irqsave(&cfg->slock, lock_flags);
 + list_add(&lli->list, &cfg->lluns);
 + spin_unlock_irqrestore(&cfg->slock, lock_flags);
 +
 + spin_lock_irqsave(&global.slock, lock_flags);
 + list_add(&gli->list, &global.gluns);
 + spin_unlock_irqrestore(&global.slock, lock_flags);

Your locks are extremely fine grained... too much ? Any reason why you
don't have a simple/single lock handling all these ? IE, do you expect
frequent accesses ?

Also, a function called lookup_something that has the side effect of
adding that something to two lists doesn't look great to me. You may
want to review the function naming a bit.

Finally, what happens if two processes call this trying to effectively
create the same global LUN simultaneously ?

IE, can't you have a case where both lookups fail, then they both hit the
create_global() 

Re: [PATCH v4 1/3] cxlflash: Base error recovery support

2015-08-10 Thread Benjamin Herrenschmidt
On Mon, 2015-08-10 at 12:09 -0500, Matthew R. Ochs wrote:
 Introduce support for enhanced I/O error handling.
 
 Signed-off-by: Matthew R. Ochs mro...@linux.vnet.ibm.com
 Signed-off-by: Manoj N. Kumar ma...@linux.vnet.ibm.com
 ---

So I'm not necessarily very qualified to review SCSI bits as I haven't
done anything close to the Linux SCSI code for almost a decade but I
have a couple of questions and nits...

   wait_queue_head_t tmf_waitq;
   bool tmf_active;
 - u8 err_recovery_active:1;
 + wait_queue_head_t limbo_waitq;
 + enum cxlflash_state state;
  };
  
  struct afu_cmd {
 diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
 index 76a7286..18359d4 100644
 --- a/drivers/scsi/cxlflash/main.c
 +++ b/drivers/scsi/cxlflash/main.c
 @@ -380,6 +380,18 @@ static int cxlflash_queuecommand(struct Scsi_Host *host, 
 struct scsi_cmnd *scp)
   }
  spin_unlock_irqrestore(&cfg->tmf_waitq.lock, lock_flags);
 
 + switch (cfg->state) {
 + case STATE_LIMBO:
 + pr_debug_ratelimited("%s: device in limbo!\n", __func__);
 + rc = SCSI_MLQUEUE_HOST_BUSY;
 + goto out;
 + case STATE_FAILTERM:
 + pr_debug_ratelimited("%s: device has failed!\n", __func__);
 + goto error;
 + default:
 + break;
 + }

I noticed that you mostly read and write that new state outside of your
tmf_waitq.lock. Is there any other lock or mutex protecting you ? In
this specific case ?

I know in the old day queuecommand was called with a host lock, is that
still the case ?

Or you just don't care about an occasional spurrious
SCSI_MLQUEUE_HOST_BUSY ?

   cmd = cxlflash_cmd_checkout(afu);
   if (unlikely(!cmd)) {
   pr_err("%s: could not get a free command\n", __func__);
 @@ -428,6 +440,10 @@ static int cxlflash_queuecommand(struct Scsi_Host *host, 
 struct scsi_cmnd *scp)
  
  out:
   return rc;
 +error:
 + scp->result = (DID_NO_CONNECT << 16);
 + scp->scsi_done(scp);
 + return 0;
  }
  
  /**
 @@ -455,9 +471,21 @@ static int cxlflash_eh_device_reset_handler(struct 
 scsi_cmnd *scp)
   get_unaligned_be32(&((u32 *)scp->cmnd)[2]),
   get_unaligned_be32(&((u32 *)scp->cmnd)[3]));
  
 - rcr = send_tmf(afu, scp, TMF_LUN_RESET);
 - if (unlikely(rcr))
 + switch (cfg->state) {
 + case STATE_NORMAL:
 + rcr = send_tmf(afu, scp, TMF_LUN_RESET);
 + if (unlikely(rcr))
 + rc = FAILED;
 + break;
 + case STATE_LIMBO:
 + wait_event(cfg->limbo_waitq, cfg->state != STATE_LIMBO);
 + if (cfg->state == STATE_NORMAL)
 + break;
 + /* fall through */
 + default:
   rc = FAILED;
 + break;
 + }

Same here, since you are doing wait_event, I assume no lock is held
(unless it's a mutex) and in subsequent places I am not listing.

As I said, it could well be that it all works fine but it would be worth
mentioning in this case because it's not obvious from simply reading the
code.
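For illustration only, here is one shape such a serialization could take, assuming the existing tmf_waitq.lock were pressed into service as the state lock (whether that is the right lock is precisely the open question above):

```c
/* Hypothetical helper: read cfg->state under the same spinlock the
 * tmf path already takes.  cxlflash_cfg, tmf_waitq and the STATE_*
 * values come from the patch; reusing tmf_waitq.lock as the state
 * lock is an assumption made purely for this sketch. */
static enum cxlflash_state cxlflash_get_state(struct cxlflash_cfg *cfg)
{
	enum cxlflash_state state;
	unsigned long flags;

	spin_lock_irqsave(&cfg->tmf_waitq.lock, flags);
	state = cfg->state;
	spin_unlock_irqrestore(&cfg->tmf_waitq.lock, flags);

	return state;
}
```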

Cheers,
Ben.



--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Concerns about mpt2sas: Added Reply Descriptor Post Queue (RDPQ) Array support

2015-04-01 Thread Benjamin Herrenschmidt
On Thu, 2015-02-19 at 21:45 -0800, James Bottomley wrote:

 Ben, this is legal by design.  It was specifically designed for the
 aic79xx SCSI card, but can be used for a variety of other reasons.  The
 aic79xx hardware problem was that the DMA engine could address the whole
 of memory (it had two address modes, a 39 bit one and a 64 bit one) but
 the script engine that runs the mailboxes only had a 32 bit activation
 register (the activating write points at the physical address of the
 script to begin executing).  This meant that the scripts that run in
 memory had to be in the first 4GB of physical memory, hence the split
 mask.  The DMA mask specifies that the card can transfer from anywhere
 in physical memory, but the consistent_dma_mask says that the consistent
 allocation used to get scripts memory must come from the lower 4GB.

So looking at that again...

This is interesting ... basically any driver using a different mask has
been broken on powerpc for basically ever. The whole concept was poorly
designed, for example,  the set_consistent_mask isn't a dma_map_ops
unlike everything else.

In some cases, what we want is convey a base+offset information to
drivers but we can't do that.

This stuff cannot work with setups like a lot of our iommus where we
have a remapped region at the bottom of the DMA address space and a
bypass (direct map) region high up.

Basically, we can fix it, at least for most platforms, but it will be
hard, invasive, *and* will need to go to stable. Grmbl.

We'll have to replace our direct DMA ops (which just apply an offset)
which we use for devices that set a 64-bit mask on platform that support
a bypass window, with some smart-ass hybrid variant that selectively
shoot stuff up to the bypass window or down via the iommu remapped based
on the applicable mask for a given operation.

It would be nice if we could come up with a way to inform the driver
that we support that sort of bypass region with an offset. That would
allow drivers that have that 64-bit base + 32-bit offset scheme to work
much more efficiently for us. The driver could configure the base to be
our bypass window offset, and we could use ZONE_DMA32 for consistent
DMAs.

It would also help us with things like some GPUs that can only do 40-bit
DMA (which won't allow them to reach our bypass region normally) but do
have a way to configure the generated top bits of all DMA addresses in
some fixed register.

Any idea what such an interface might look like ?

Cheers,
Ben.




Re: Concerns about mpt2sas: Added Reply Descriptor Post Queue (RDPQ) Array support

2015-02-19 Thread Benjamin Herrenschmidt
On Fri, 2015-02-20 at 16:22 +1100, Benjamin Herrenschmidt wrote:

 Looking a bit more closely, you basically do
 
  - set_dma_mask(64-bit)
  - set_consistent_dma_mask(32-bit)
 
 Now, I don't know how x86 will react to the conflicting masks, but on
 ppc64, I'm pretty sure the second one will barf. IE, the first one will
 establish a set of direct mapping ops which give you a bypass of the
 iommu to all of memory. The second one will then do a
 dma_supported(mask) call which will hit the direct ops, and they will
 fail since a 32-bit mask cannot address the bypass completely.
 
 Are architectures really required to support such mismatching dma_mask
 and consistent_dma_mask ? what a bloody trainwreck ... :-(

Oh well, looks like x86 supports it and it won't be too hard to support
it on ppc64 as well. We even had some code along those lines for FSL
platforms with an ifdef due to the amount of drivers that used to fail
setting the consistent mask properly but that seems to be mostly fixed
nowadays.

Ben.




Re: Concerns about mpt2sas: Added Reply Descriptor Post Queue (RDPQ) Array support

2015-02-19 Thread Benjamin Herrenschmidt
On Thu, 2015-02-19 at 21:45 -0800, James Bottomley wrote:
 Ben, this is legal by design.  It was specifically designed for the
 aic79xx SCSI card, but can be used for a variety of other reasons.  The
 aic79xx hardware problem was that the DMA engine could address the whole
 of memory (it had two address modes, a 39 bit one and a 64 bit one) but
 the script engine that runs the mailboxes only had a 32 bit activation
 register (the activating write points at the physical address of the
 script to begin executing).  This meant that the scripts that run in
 memory had to be in the first 4GB of physical memory, hence the split
 mask.  The DMA mask specifies that the card can transfer from anywhere
 in physical memory, but the consistent_dma_mask says that the consistent
 allocation used to get scripts memory must come from the lower 4GB.

Right, ok, it looks like it's easy enough to support with ZONE_DMA32, I'm
testing patches to create it unconditionally on ppc64 (it used to depend
on us using swiotlb on embedded platforms) and I'll shoot that upstream
if it passes.

Ben.




Concerns about mpt2sas: Added Reply Descriptor Post Queue (RDPQ) Array support

2015-02-19 Thread Benjamin Herrenschmidt
Hi Sreekanth !

While looking at some (unrelated) issue where mpt2sas seems to be using
32-bit DMA instead of 64-bit DMA on some POWER platforms, I noticed this
patch which was merged as 5fb1bf8aaa832e1e9ca3198de7bbecb8eff7db9c.

Can you confirm my understanding that you are:

 - Setting the DMA mask to 32-bit

 - Mapping pages for DMA

 - Changing the DMA mask to 64-bit

?

If yes, then I don't think this is a legal thing to do and definitely
not something supported by all architectures. It might work by accident,
but there is no telling that any translation/DMA mapping provided before
a call to set_dma_mask() is still valid after that call.
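In DMA-API terms, the safe pattern is to settle on the final mask once, before any mapping is created. A sketch with the PCI wrappers of that era (the function name and error handling are illustrative, not from any specific driver):

```c
/* Sketch: pick the mask up front, then map.  Re-calling
 * pci_set_dma_mask() after pci_map_*() has no defined effect on
 * mappings that already exist, which is the concern raised above. */
static int example_setup_dma(struct pci_dev *pdev, void *buf, size_t len,
			     dma_addr_t *handle)
{
	if (pci_set_dma_mask(pdev, DMA_BIT_MASK(64)))
		return -EIO;		/* final mask, decided once */

	*handle = pci_map_single(pdev, buf, len, PCI_DMA_BIDIRECTIONAL);
	if (pci_dma_mapping_error(pdev, *handle))
		return -EIO;

	/* ... no further pci_set_dma_mask() calls while *handle lives ... */
	return 0;
}
```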

The architecture might have to completely reconfigure the iommu, for
example on some PowerPC platforms, we switch from a remapped mapping to
a direct linear map of all memory, all translations established before
the switch might be lost (it depends on the specific implementation).

How does it work on x86 with DMAR ?

Cheers,
Ben.




Re: Concerns about mpt2sas: Added Reply Descriptor Post Queue (RDPQ) Array support

2015-02-19 Thread Benjamin Herrenschmidt
On Fri, 2015-02-20 at 16:06 +1100, Benjamin Herrenschmidt wrote:

 Note that even on powerpc platforms where it would work because we
 maintain both 32-bit and 64-bit bypass windows in the device address
 space simultaneously, you will leak iommu entries unless you also switch
 back to 32-bit when freeing the 32-bit mappings... (and you would
 probably crash if you tried to free a 64-bit mapping while in 32-bit
 mode).
 
 The iommu APIs weren't designed with that switching mask facility in
 mind...

Looking a bit more closely, you basically do

 - set_dma_mask(64-bit)
 - set_consistent_dma_mask(32-bit)

Now, I don't know how x86 will react to the conflicting masks, but on
ppc64, I'm pretty sure the second one will barf. IE, the first one will
establish a set of direct mapping ops which give you a bypass of the
iommu to all of memory. The second one will then do a
dma_supported(mask) call which will hit the direct ops, and they will
fail since a 32-bit mask cannot address the bypass completely.

Are architectures really required to support such mismatching dma_mask
and consistent_dma_mask ? what a bloody trainwreck ... :-(
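For readers following along, the driver-side pattern being questioned looks like this (a schematic of the two calls, not mpt2sas's exact code):

```c
/* The mismatched-mask pattern under discussion: streaming DMA gets a
 * 64-bit mask while coherent allocations are confined below 4GB. */
if (pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) ||
    pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32)))
	goto out_failed;
```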

Cheers,
Ben.




Re: [RESEND][PATCH 1/2] lib/scatterlist: Make ARCH_HAS_SG_CHAIN an actual Kconfig

2014-03-23 Thread Benjamin Herrenschmidt
On Sun, 2014-03-23 at 00:03 -0700, Christoph Hellwig wrote:
 On Sun, Mar 23, 2014 at 02:04:46PM +1100, Benjamin Herrenschmidt wrote:
  
   diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
   index 1594945..8122294 100644
   --- a/arch/arm/Kconfig
   +++ b/arch/arm/Kconfig
   @@ -82,6 +82,7 @@ config ARM
   http://www.arm.linux.org.uk/.

config ARM_HAS_SG_CHAIN
   + select ARCH_HAS_SG_CHAIN
 bool

  
  Heh, a self-selecting config option... I didn't know that trick !
 
 ARM vs ARCH.  However the arm variant of the variable should probably be
 consolidated into the ARCH one as a follow up.

Oh right... sorry for the noise.

Ben.




Re: [RESEND][PATCH 1/2] lib/scatterlist: Make ARCH_HAS_SG_CHAIN an actual Kconfig

2014-03-22 Thread Benjamin Herrenschmidt

 diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
 index 1594945..8122294 100644
 --- a/arch/arm/Kconfig
 +++ b/arch/arm/Kconfig
 @@ -82,6 +82,7 @@ config ARM
 http://www.arm.linux.org.uk/.
  
  config ARM_HAS_SG_CHAIN
 + select ARCH_HAS_SG_CHAIN
   bool
  

Heh, a self-selecting config option... I didn't know that trick !

Ben.




Re: [RESEND][PATCH 1/2] lib/scatterlist: Make ARCH_HAS_SG_CHAIN an actual Kconfig

2014-03-22 Thread Benjamin Herrenschmidt
On Sat, 2014-03-22 at 11:13 -0700, Laura Abbott wrote:
 Rather than have architectures #define ARCH_HAS_SG_CHAIN in an architecture
 specific scatterlist.h, make it a proper Kconfig option and use that
 instead. At the same time, remove the header files that are now mostly
 useless and just include asm-generic/scatterlist.h.
 
 Cc: Russell King li...@arm.linux.org.uk
 Cc: Tony Luck tony.l...@intel.com
 Cc: Fenghua Yu fenghua...@intel.com

For powerpc

Acked-by: Benjamin Herrenschmidt b...@kernel.crashing.org

 Cc: Paul Mackerras pau...@samba.org
 Cc: Ingo Molnar mi...@redhat.com
 Cc: H. Peter Anvin h...@zytor.com
 Cc: James E.J. Bottomley jbottom...@parallels.com
 Cc: Fenghua Yu fenghua...@intel.com
 Cc: Tony Luck tony.l...@intel.com
 Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
 Cc: Paul Mackerras pau...@samba.org
 Cc: Martin Schwidefsky schwidef...@de.ibm.com
 Cc: Heiko Carstens heiko.carst...@de.ibm.com
 Cc: Andrew Morton a...@linux-foundation.org
 Signed-off-by: Laura Abbott lau...@codeaurora.org




Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-08 Thread Benjamin Herrenschmidt
On Tue, 2013-10-08 at 20:55 -0700, H. Peter Anvin wrote:
 Why not add a minimum number to pci_enable_msix(), i.e.:
 
 pci_enable_msix(pdev, msix_entries, nvec, minvec)
 
 ... which means nvec is the number of interrupts *requested*, and
 minvec is the minimum acceptable number (otherwise fail).

Which is exactly what Ben (the other Ben :-) suggested and which I
support...
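For the record, this min/max shape is essentially what later landed in the kernel as pci_enable_msix_range(); a sketch of its use:

```c
/* Request anywhere between minvec and maxvec MSI-X vectors in one
 * call; the core retries internally, so the driver-side loop (and
 * its race against a shrinking quota) disappears. */
int nvec = pci_enable_msix_range(pdev, msix_entries, minvec, maxvec);
if (nvec < 0)
	return nvec;	/* could not get even minvec vectors */
/* nvec vectors were granted, with minvec <= nvec <= maxvec */
```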

Cheers,
Ben.




Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-07 Thread Benjamin Herrenschmidt
On Mon, 2013-10-07 at 14:01 -0400, Tejun Heo wrote:
 I don't think the same race condition would happen with the loop.  The
 problem case is where multiple msi(x) allocation fails completely
 because the global limit went down before inquiry and allocation.  In
 the loop based interface, it'd retry with the lower number.
 
 As long as the number of drivers which need this sort of adaptive
 allocation isn't too high and the common cases can be made simple, I
 don't think the complex part of interface is all that important.
 Maybe we can have reserve / cancel type interface or just keep the
 loop with more explicit function names (ie. try_enable or something
 like that).

I'm thinking a better API overall might just have been to request
individual MSI-X one by one :-)

We want to be able to request an MSI-X at runtime anyway ... if I want
to dynamically add a queue to my network interface, I want it to be able
to pop a new arbitrary MSI-X.

And we don't want to lock drivers into contiguous MSI-X sets either.

And for the cleanup ... well that's what the pcim functions are for,
we can just make MSI-X variants.

Ben.




Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-06 Thread Benjamin Herrenschmidt
On Sun, 2013-10-06 at 08:02 +0200, Alexander Gordeev wrote:
 On Sun, Oct 06, 2013 at 08:46:26AM +1100, Benjamin Herrenschmidt wrote:
  On Sat, 2013-10-05 at 16:20 +0200, Alexander Gordeev wrote:
   So my point is - drivers should first obtain a number of MSIs they *can*
   get, then *derive* a number of MSIs the device is fine with and only then
   request that number. Not terribly different from memory or any other type
   of resource allocation ;)
  
  What if the limit is for a group of devices ? Your interface is racy in
  that case, another driver could have eaten into the limit in between the
  calls.
 
 Well, the another driver has had a better karma ;) But seriously, the
 current scheme with a loop is not race-safe wrt to any other type of
 resource which might exhaust. What makes the quota so special so we
 should care about it and should not care i.e. about lack of msi_desc's?

I'm not saying the current scheme is better but I prefer the option of
passing a min,max to the request function.

 Yeah, I know the quota might hit more likely. But why it is not addressed
 right now then? Not a single function in chains...
   rtas_msi_check_device() -> msi_quota_for_device() -> traverse_pci_devices()
   rtas_setup_msi_irqs() -> msi_quota_for_device() -> traverse_pci_devices()
 ...is race-safe. So if it has not been bothering anyone until now then 
 no reason to start worrying now :)

 In fact, in the current design to address the quota race decently the
 drivers would have to protect the *loop* to prevent the quota change
 between a pci_enable_msix() returned a positive number and the next
 call to pci_enable_msix() with that number. Is it doable?

I am not advocating for the current design, simply saying that your
proposal doesn't address this issue while Ben's does.

Cheers,
Ben.

  Ben.
  
  
 




Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-05 Thread Benjamin Herrenschmidt
On Sat, 2013-10-05 at 16:20 +0200, Alexander Gordeev wrote:
 So my point is - drivers should first obtain a number of MSIs they *can*
 get, then *derive* a number of MSIs the device is fine with and only then
 request that number. Not terribly different from memory or any other type
 of resource allocation ;)

What if the limit is for a group of devices ? Your interface is racy in
that case, another driver could have eaten into the limit in between the
calls.

Ben.




Re: [PATCH v2, part 1 3/9] PCI: Convert alloc_pci_dev(void) to pci_alloc_dev(bus) instead

2013-05-15 Thread Benjamin Herrenschmidt
On Wed, 2013-05-15 at 22:46 +0800, Liu Jiang wrote:
I don't know any OF experts, could you please help to CC
 some OF experts?

I wrote that code I think. Sorry, I've missed the beginning of the
thread, what is the problem ?

Cheers,
Ben.




Re: [PATCH] scsi/ibmvscsi: add module alias for ibmvscsic

2012-07-30 Thread Benjamin Herrenschmidt
On Mon, 2012-07-30 at 21:06 +0200, Olaf Hering wrote:
  So while this would work, I do wonder however whether we could
 instead
  fix it by simplifying the whole thing as follows since iSeries is now
  gone and so we don't need split backends anymore:
  
  scsi/ibmvscsi: Remove backend abstraction
 
 I can't test these things myself anymore.

Brian, can somebody from your side own these ?

Cheers,
Ben.




Re: [PATCH] scsi/ibmvscsi: add module alias for ibmvscsic

2012-07-29 Thread Benjamin Herrenschmidt
On Wed, 2012-07-18 at 18:49 +0200, o...@aepfle.de wrote:
 From: Olaf Hering o...@aepfle.de
 
 The driver is named ibmvscsic, but at runtime its name is advertised as
 ibmvscsi. For this reason mkinitrd won't pick up the driver properly.
 Reported by IBM during SLES11 beta testing:
 
 https://bugzilla.novell.com/show_bug.cgi?id=459933
 LTC50724

So while this would work, I do wonder however whether we could instead
fix it by simplifying the whole thing as follows since iSeries is now
gone and so we don't need split backends anymore:

scsi/ibmvscsi: Remove backend abstraction

Now that the iSeries code is gone the backend abstraction
in this driver is no longer necessary, which allows us to
consolidate the driver in one file.

The side effect is that the module name is now ibmvscsi.ko
which matches the driver hotplug name and fixes auto-load
issues.

Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org
---
---
 drivers/scsi/ibmvscsi/Makefile|6 +-
 drivers/scsi/ibmvscsi/ibmvscsi.c  |  348 +--
 drivers/scsi/ibmvscsi/ibmvscsi.h  |   22 ---
 drivers/scsi/ibmvscsi/rpa_vscsi.c |  368 -
 4 files changed, 330 insertions(+), 414 deletions(-)
 delete mode 100644 drivers/scsi/ibmvscsi/rpa_vscsi.c

diff --git a/drivers/scsi/ibmvscsi/Makefile b/drivers/scsi/ibmvscsi/Makefile
index ff5b5c5..cb150d1 100644
--- a/drivers/scsi/ibmvscsi/Makefile
+++ b/drivers/scsi/ibmvscsi/Makefile
@@ -1,7 +1,3 @@
-obj-$(CONFIG_SCSI_IBMVSCSI)+= ibmvscsic.o
-
-ibmvscsic-y+= ibmvscsi.o
-ibmvscsic-$(CONFIG_PPC_PSERIES)+= rpa_vscsi.o 
-
+obj-$(CONFIG_SCSI_IBMVSCSI)+= ibmvscsi.o
 obj-$(CONFIG_SCSI_IBMVSCSIS)   += ibmvstgt.o
 obj-$(CONFIG_SCSI_IBMVFC)  += ibmvfc.o
diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
index 3a6c474..d2bd2c0 100644
--- a/drivers/scsi/ibmvscsi/ibmvscsi.c
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
@@ -93,13 +93,13 @@ static int max_requests = IBMVSCSI_MAX_REQUESTS_DEFAULT;
 static int max_events = IBMVSCSI_MAX_REQUESTS_DEFAULT + 2;
 static int fast_fail = 1;
 static int client_reserve = 1;
+static char partition_name[97] = "UNKNOWN";
+static unsigned int partition_number = -1;
 
 static struct scsi_transport_template *ibmvscsi_transport_template;
 
 #define IBMVSCSI_VERSION "1.5.9"
 
-static struct ibmvscsi_ops *ibmvscsi_ops;
-
 MODULE_DESCRIPTION("IBM Virtual SCSI");
 MODULE_AUTHOR("Dave Boutcher");
 MODULE_LICENSE("GPL");
@@ -118,6 +118,315 @@ MODULE_PARM_DESC(fast_fail, "Enable fast fail. [Default=1]");
 module_param_named(client_reserve, client_reserve, int, S_IRUGO );
 MODULE_PARM_DESC(client_reserve, "Attempt client managed reserve/release");
 
+static void ibmvscsi_handle_crq(struct viosrp_crq *crq,
+   struct ibmvscsi_host_data *hostdata);
+
+/* 
+ * Routines for managing the command/response queue
+ */
+/**
+ * ibmvscsi_handle_event: - Interrupt handler for crq events
+ * @irq:   number of irq to handle, not used
+ * @dev_instance: ibmvscsi_host_data of host that received interrupt
+ *
+ * Disables interrupts and schedules srp_task
+ * Always returns IRQ_HANDLED
+ */
+static irqreturn_t ibmvscsi_handle_event(int irq, void *dev_instance)
+{
+   struct ibmvscsi_host_data *hostdata =
+   (struct ibmvscsi_host_data *)dev_instance;
+   vio_disable_interrupts(to_vio_dev(hostdata->dev));
+   tasklet_schedule(&hostdata->srp_task);
+   return IRQ_HANDLED;
+}
+
+/**
+ * release_crq_queue: - Deallocates data and unregisters CRQ
+ * @queue: crq_queue to initialize and register
+ * @host_data: ibmvscsi_host_data of host
+ *
+ * Frees irq, deallocates a page for messages, unmaps dma, and unregisters
+ * the crq with the hypervisor.
+ */
+static void ibmvscsi_release_crq_queue(struct crq_queue *queue,
+  struct ibmvscsi_host_data *hostdata,
+  int max_requests)
+{
+   long rc = 0;
+   struct vio_dev *vdev = to_vio_dev(hostdata->dev);
+   free_irq(vdev->irq, (void *)hostdata);
+   tasklet_kill(&hostdata->srp_task);
+   do {
+   if (rc)
+   msleep(100);
+   rc = plpar_hcall_norets(H_FREE_CRQ, vdev->unit_address);
+   } while ((rc == H_BUSY) || (H_IS_LONG_BUSY(rc)));
+   dma_unmap_single(hostdata->dev,
+queue->msg_token,
+queue->size * sizeof(*queue->msgs), DMA_BIDIRECTIONAL);
+   free_page((unsigned long)queue->msgs);
+}
+
+/**
+ * crq_queue_next_crq: - Returns the next entry in message queue
+ * @queue: crq_queue to use
+ *
+ * Returns pointer to next entry in queue, or NULL if there are no new 
+ * entries in the CRQ.
+ */
+static struct viosrp_crq *crq_queue_next_crq(struct crq_queue *queue)
+{
+   struct viosrp_crq *crq;
+   unsigned long flags;
+
+   spin_lock_irqsave(&queue->lock

Re: [PATCH] scsi/ibmvscsi: /sys/class/scsi_host/hostX/config doesn't show any information

2012-07-29 Thread Benjamin Herrenschmidt
On Wed, 2012-07-18 at 18:49 +0200, o...@aepfle.de wrote:
 From: Linda Xie lx...@us.ibm.com
 
 Expected result:
 It should show something like this:
 x1521p4:~ # cat /sys/class/scsi_host/host1/config
 PARTITIONNAME='x1521p4'
 NWSDNAME='X1521P4'
 HOSTNAME='X1521P4'
 DOMAINNAME='RCHLAND.IBM.COM'
 NAMESERVERS='9.10.244.100 9.10.244.200'
 
 Actual result:
 x1521p4:~ # cat /sys/class/scsi_host/host0/config
 x1521p4:~ #
 
 This patch changes the size of the buffer used for transfering config
 data to 4K. It was tested against 2.6.19-rc2 tree.
 
 Reported by IBM during SLES11 beta testing:

So this patch just seems to blindly replace all occurrences of PAGE_SIZE
with HOST_PAGE_SIZE which is utterly wrong. Only one of those needs to
be changed, the one passed to ibmvscsi_do_host_config() which is what's
visible to the server, all the rest is just sysfs attributes and should
remain as-is.

Additionally (not even mentioning that there is no explanation as to
what the real problem is anywhere in the changeset) I don't like the
fix. The root of the problem is that the MAD header has a 16-bit length
field, so writing 0x10000 (64K PAGE_SIZE) into it doesn't quite work.

So in addition to a better comment, I would suggest a fix more like
this:

scsi/ibmvscsi: Fix host config length field overflow

The length field in the host config packet is only 16-bit long, so
passing it 0x10000 (64K, which is our standard PAGE_SIZE) doesn't
work and results in an empty config from the server.

Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org
CC: sta...@vger.kernel.org
---

diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
index 3a6c474..337e8b3 100644
--- a/drivers/scsi/ibmvscsi/ibmvscsi.c
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
@@ -1541,6 +1541,9 @@ static int ibmvscsi_do_host_config(struct 
ibmvscsi_host_data *hostdata,
 
host_config = &evt_struct->iu.mad.host_config;
 
+   /* The transport length field is only 16-bit */
+   length = min(0xffff, length);
+
/* Set up a lun reset SRP command */
memset(host_config, 0x00, sizeof(*host_config));
host_config->common.type = VIOSRP_HOST_CONFIG_TYPE;




Re: [PATCH] scsi/ibmvscsi: /sys/class/scsi_host/hostX/config doesn't show any information

2012-07-26 Thread Benjamin Herrenschmidt
On Wed, 2012-07-18 at 18:49 +0200, o...@aepfle.de wrote:
 From: Linda Xie lx...@us.ibm.com

James, can I assume you're picking up those two ?

Cheers,
Ben.

 Expected result:
 It should show something like this:
 x1521p4:~ # cat /sys/class/scsi_host/host1/config
 PARTITIONNAME='x1521p4'
 NWSDNAME='X1521P4'
 HOSTNAME='X1521P4'
 DOMAINNAME='RCHLAND.IBM.COM'
 NAMESERVERS='9.10.244.100 9.10.244.200'
 
 Actual result:
 x1521p4:~ # cat /sys/class/scsi_host/host0/config
 x1521p4:~ #
 
 This patch changes the size of the buffer used for transfering config
 data to 4K. It was tested against 2.6.19-rc2 tree.
 
 Reported by IBM during SLES11 beta testing:
 
 https://bugzilla.novell.com/show_bug.cgi?id=439970
 LTC49349
 
 Signed-off-by: Olaf Hering o...@aepfle.de
 
 diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c 
 b/drivers/scsi/ibmvscsi/ibmvscsi.c
 index e580aa4..1513ca8 100644
 --- a/drivers/scsi/ibmvscsi/ibmvscsi.c
 +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
 @@ -93,6 +93,8 @@ static int max_requests = IBMVSCSI_MAX_REQUESTS_DEFAULT;
  static int max_events = IBMVSCSI_MAX_REQUESTS_DEFAULT + 2;
  static int fast_fail = 1;
  static int client_reserve = 1;
 +/* host data buffer size */
 +#define HOST_BUFFER_SIZE 4096
  
  static struct scsi_transport_template *ibmvscsi_transport_template;
  
 @@ -1666,7 +1668,7 @@ static ssize_t show_host_srp_version(struct device *dev,
   struct ibmvscsi_host_data *hostdata = shost_priv(shost);
   int len;
  
 - len = snprintf(buf, PAGE_SIZE, "%s\n",
 +   len = snprintf(buf, HOST_BUFFER_SIZE, "%s\n",
  hostdata->madapter_info.srp_version);
   return len;
  }
 @@ -1687,7 +1689,7 @@ static ssize_t show_host_partition_name(struct device 
 *dev,
   struct ibmvscsi_host_data *hostdata = shost_priv(shost);
   int len;
  
 - len = snprintf(buf, PAGE_SIZE, "%s\n",
 +   len = snprintf(buf, HOST_BUFFER_SIZE, "%s\n",
  hostdata->madapter_info.partition_name);
   return len;
  }
 @@ -1708,7 +1710,7 @@ static ssize_t show_host_partition_number(struct device 
 *dev,
   struct ibmvscsi_host_data *hostdata = shost_priv(shost);
   int len;
  
 - len = snprintf(buf, PAGE_SIZE, "%d\n",
 +   len = snprintf(buf, HOST_BUFFER_SIZE, "%d\n",
  hostdata->madapter_info.partition_number);
   return len;
  }
 @@ -1728,7 +1730,7 @@ static ssize_t show_host_mad_version(struct device *dev,
   struct ibmvscsi_host_data *hostdata = shost_priv(shost);
   int len;
  
 - len = snprintf(buf, PAGE_SIZE, "%d\n",
 +   len = snprintf(buf, HOST_BUFFER_SIZE, "%d\n",
  hostdata->madapter_info.mad_version);
   return len;
  }
 @@ -1748,7 +1750,7 @@ static ssize_t show_host_os_type(struct device *dev,
   struct ibmvscsi_host_data *hostdata = shost_priv(shost);
   int len;
  
 - len = snprintf(buf, PAGE_SIZE, "%d\n", hostdata->madapter_info.os_type);
 +   len = snprintf(buf, HOST_BUFFER_SIZE, "%d\n", 
 hostdata->madapter_info.os_type);
   return len;
  }
  
 @@ -1767,7 +1769,7 @@ static ssize_t show_host_config(struct device *dev,
   struct ibmvscsi_host_data *hostdata = shost_priv(shost);
  
   /* returns null-terminated host config data */
 - if (ibmvscsi_do_host_config(hostdata, buf, PAGE_SIZE) == 0)
 +   if (ibmvscsi_do_host_config(hostdata, buf, HOST_BUFFER_SIZE) == 0)
   return strlen(buf);
   else
   return 0;




Re: [PATCH 1/2] DMA buffer alignment annotations

2007-12-21 Thread Benjamin Herrenschmidt

On Fri, 2007-12-21 at 09:39 +, Russell King wrote:

  +#ifndef ARCH_MIN_DMA_ALIGNMENT
  +#define __dma_aligned
  +#define __dma_buffer
  +#else
  +#define __dma_aligned  __attribute__((aligned(ARCH_MIN_DMA_ALIGNMENT)))
  +#define __dma_buffer   __dma_buffer_line(__LINE__)
  +#define __dma_buffer_line(line) __dma_aligned;\
  +   char __dma_pad_##line[0] __dma_aligned
 
 You introduce __dma_buffer_line() if ARCH_MIN_DMA_ALIGNMENT is set but
 not if it isn't...

Yup, it's not meant to be used outside of __dma_buffer...

 .../...

  +then dev-buffer will be safe for DMA on all architectures.  On a
  +cache-coherent architecture the members of dev will be aligned exactly
  +as they would have been without __dma_buffer; on a non-cache-coherent
  +architecture buffer and field2 will be aligned so that buffer does not
  +share a cache line with any other data.
  +
 
 ... but it's not described.  What's the purpose of it, and why would it
 only be used on CPUs with ARCH_MIN_DMA_ALIGNMENT set?

Hrm, I'm not the best at writing explanations, care to send a patch ? :-)

Cheers,
Ben




Re: [PATCH 2/2] scsi: Use new __dma_buffer to align sense buffer in scsi_cmnd

2007-12-21 Thread Benjamin Herrenschmidt

On Fri, 2007-12-21 at 10:33 +, Alan Cox wrote:
 On Fri, 21 Dec 2007 13:30:08 +1100
 Benjamin Herrenschmidt [EMAIL PROTECTED] wrote:
 
  The sense buffer in scsi_cmnd can nowadays be DMA'ed into directly
  by some low level drivers (that typically happens with USB mass
  storage).
 
 Should that not be fixed in USB storage by using pci_alloc_coherent on the
 PCI device of the hub not peeing directly into kernel space ?

All dumb SCSI drivers have this problem, USB storage is just one of
them. This would also allow removing the bounce buffering that some
non-dumb ones are in fact doing.

There's another solution jejb was talking about involving reworking the
allocation of the sense buffer to make it always under driver control
etc... but that's the kind of SCSI surgery that I'm not prepared to do
especially not for .25 and without much HW to test with.

 It's also incomplete as a fix because I don't see what guarantees the
 buffer size will always exceed cache line size

How is that a problem ? The annotation will make sure the buffer doesn't
share cache lines. It forces alignment before and pads after.

Ben.




Re: [PATCH 2/2] scsi: Use new __dma_buffer to align sense buffer in scsi_cmnd

2007-12-21 Thread Benjamin Herrenschmidt

On Fri, 2007-12-21 at 06:16 -0700, Matthew Wilcox wrote:
 On Fri, Dec 21, 2007 at 10:33:26AM +, Alan Cox wrote:
  On Fri, 21 Dec 2007 13:30:08 +1100
  Benjamin Herrenschmidt [EMAIL PROTECTED] wrote:
  
   The sense buffer in scsi_cmnd can nowadays be DMA'ed into directly
   by some low level drivers (that typically happens with USB mass
   storage).
  
  Should that not be fixed in USB storage by using pci_alloc_coherent on the
  PCI device of the hub not peeing directly into kernel space ?
 
 That's what I said, but Ben seems fixated on this particular fix.

That means more understanding of the SCSI code than I have right now
and a _lot_ more time than I have right now. It's not only USB storage,
it's anything that does DMA and uses the generic error handling.

Ben.




[PATCH 2/2] scsi: Use new __dma_buffer to align sense buffer in scsi_cmnd

2007-12-20 Thread Benjamin Herrenschmidt
The sense buffer in scsi_cmnd can nowadays be DMA'ed into directly
by some low level drivers (that typically happens with USB mass
storage).

This is a problem on non cache coherent architectures such as
embedded PowerPCs where the sense buffer can share cache lines with
other structure members, which leads to various forms of corruption.

This uses the newly defined __dma_buffer annotation to enforce that
on such platforms, the sense_buffer is contained within its own
cache line. This has no effect on cache coherent architectures.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
---

 include/scsi/scsi_cmnd.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-merge.orig/include/scsi/scsi_cmnd.h	2007-12-21 13:07:14.000000000 +1100
+++ linux-merge/include/scsi/scsi_cmnd.h	2007-12-21 13:07:29.000000000 +1100
@@ -88,7 +88,7 @@ struct scsi_cmnd {
   working on */
 
 #define SCSI_SENSE_BUFFERSIZE  96
-   unsigned char sense_buffer[SCSI_SENSE_BUFFERSIZE];
+   unsigned char sense_buffer[SCSI_SENSE_BUFFERSIZE] __dma_buffer;
/* obtained by REQUEST SENSE when
 * CHECK CONDITION is received on original
 * command (auto-sense) */


[PATCH 1/2] DMA buffer alignment annotations

2007-12-20 Thread Benjamin Herrenschmidt
This patch based on some earlier work by Roland Dreier introduces
a pair of annotations that can be used to enforce alignment of
objects that can be DMA'ed into, and to enforce that an DMA'able
object within a structure isn't sharing a cache line with some
other object.

Such sharing of a data structure between DMA and non-DMA objects
isn't a recommended practice, but it does happen and in some case
might even make sense, so we now have a way to make it work
properly.

The current patch only enables such alignment for some PowerPC
platforms that do not have coherent caches. Other platforms such
as ARM, MIPS, etc... can define ARCH_MIN_DMA_ALIGNMENT if they
want to benefit from this, I don't know them well enough to do
it myself.

The initial issue I'm fixing (in a second patch) by using these
is the SCSI sense buffer which is currently part of the scsi
command structure and can be DMA'ed to. On non-coherent platforms,
this causes various corruptions as this cache line is shared with
various other fields of the scsi_cmnd data structure.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
---

 Documentation/DMA-mapping.txt |   32 
 include/asm-generic/page.h|   10 ++
 include/asm-powerpc/page.h|8 
 3 files changed, 50 insertions(+)

--- linux-merge.orig/include/asm-generic/page.h	2007-07-27 13:44:45.000000000 +1000
+++ linux-merge/include/asm-generic/page.h	2007-12-21 13:07:28.000000000 +1100
@@ -20,6 +20,16 @@ static __inline__ __attribute_const__ in
return order;
 }
 
+#ifndef ARCH_MIN_DMA_ALIGNMENT
+#define __dma_aligned
+#define __dma_buffer
+#else
+#define __dma_aligned  __attribute__((aligned(ARCH_MIN_DMA_ALIGNMENT)))
+#define __dma_buffer   __dma_buffer_line(__LINE__)
+#define __dma_buffer_line(line) __dma_buffer_expand_line(line)
+#define __dma_buffer_expand_line(line) __dma_aligned; \
+	char __dma_pad_##line[0] __dma_aligned
+#endif
+
 #endif /* __ASSEMBLY__ */
 #endif /* __KERNEL__ */
 
Index: linux-merge/include/asm-powerpc/page.h
===
--- linux-merge.orig/include/asm-powerpc/page.h	2007-09-28 11:42:10.000000000 +1000
+++ linux-merge/include/asm-powerpc/page.h	2007-12-21 13:15:02.000000000 +1100
@@ -77,6 +77,14 @@
 #define VM_DATA_DEFAULT_FLAGS64(VM_READ | VM_WRITE | \
 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
 
+/*
+ * On non cache coherent platforms, we enforce cache aligned DMA
+ * buffers inside of structures
+ */
+#ifdef CONFIG_NOT_COHERENT_CACHE
+#define ARCH_MIN_DMA_ALIGNMENT L1_CACHE_BYTES
+#endif
+
 #ifdef __powerpc64__
 #include <asm/page_64.h>
 #else
Index: linux-merge/Documentation/DMA-mapping.txt
===
--- linux-merge.orig/Documentation/DMA-mapping.txt	2007-12-21 13:17:14.000000000 +1100
+++ linux-merge/Documentation/DMA-mapping.txt	2007-12-21 13:20:00.000000000 +1100
@@ -75,6 +75,38 @@ What about block I/O and networking buff
 networking subsystems make sure that the buffers they use are valid
 for you to DMA from/to.
 
+Note that on non-cache-coherent architectures, having a DMA buffer
+that shares a cache line with other data can lead to memory
+corruption.
+
+The __dma_buffer macro exists to allow safe DMA buffers to be declared
+easily and portably as part of larger structures without causing bloat
+on cache-coherent architectures. To get this macro, architectures have
+to define ARCH_MIN_DMA_ALIGNMENT to the requested alignment value in
+their asm/page.h before including asm-generic/page.h
+
+Of course these structures must be contained in memory that can be
+used for DMA as described above.
+
+To use __dma_buffer, just declare a struct like:
+
+   struct mydevice {
+   int field1;
+   char buffer[BUFFER_SIZE] __dma_buffer;
+   int field2;
+   };
+
+If this is used in code like:
+
+   struct mydevice *dev;
+   dev = kmalloc(sizeof *dev, GFP_KERNEL);
+
+then dev->buffer will be safe for DMA on all architectures.  On a
+cache-coherent architecture the members of dev will be aligned exactly
+as they would have been without __dma_buffer; on a non-cache-coherent
+architecture buffer and field2 will be aligned so that buffer does not
+share a cache line with any other data.
+
DMA addressing limitations
 
 Does your device have any DMA addressing limitations?  For example, is


Re: SCSI breakage on non-cache coherent architectures

2007-11-20 Thread Benjamin Herrenschmidt

On Tue, 2007-11-20 at 15:10 -0600, James Bottomley wrote:
 We're talking about trying to fix this for 2.6.24, which is already at
 -rc3 ... Is an entire arch change for dma alignment really a merge
 candidate at this stage?

Well, as I said before... it's a matter of which option seems the least
likely to break something, right ?

On one side, I'm doing surgery on code I barely know, the scsi error
handling, and now it seems I also have to fix up a handful of drivers
that aren't the most obvious pieces of code around.

On the other side, Roland's proposal is basically just adding a macro that
can be empty for everybody but a handful of archs, and sticking it onto one
field in one structure...

The latter has about zero chance to actually break something or cause a
regression. I wouldn't say that about the former.

Now, I will see if I manage to fix up the NCR drivers to pass a
pre-allocated buffer (USB storage, I think, can pass NULL as it's not
calling prep in atomic context). But then, it complicates the matter
because that means restore will have to know whether prep allocated
the buffer or not, thus more fields to add to the save struct; it's
getting messy, unless we decide -all- callers are responsible for the
buffer allocation (hrm... maybe the best approach).

Ben.




Re: SCSI breakage on non-cache coherent architectures

2007-11-19 Thread Benjamin Herrenschmidt

On Mon, 2007-11-19 at 00:38 -0800, David Miller wrote:
 From: Benjamin Herrenschmidt [EMAIL PROTECTED]
 Date: Mon, 19 Nov 2007 16:35:23 +1100
 
  I'm not sure what is the best way to fix that. Internally, I've done
  some test whacking some cacheline_aligned in the scsi_cmnd data
  structure to verify I no longer get random SLAB corruption when using my
  USB but that significantly bloats the size of the structure on archs
  such as ppc64 that don't need it and have a large cache line size.
  
  Unfortunately, I don't think there's any existing Kconfig symbol or arch
  provided #define to tell us that we are on a non-coherent arch afaik
  that could be used to make that conditional.
  
  Another option would be to kmalloc the buffer (wasn't it the case before
  btw ?) but I suppose some people will scream at the idea due to how the
  command pools are done...
 
 You could make a dma_cacheline_aligned and use that.
 It seems pretty reasonable.

I was thinking about that. What archs would need it ? arm, mips, what
else ?

Cheers,
Ben.



Re: SCSI breakage on non-cache coherent architectures

2007-11-19 Thread Benjamin Herrenschmidt

On Mon, 2007-11-19 at 05:32 -0700, Matthew Wilcox wrote:
 On Mon, Nov 19, 2007 at 04:35:23PM +1100, Benjamin Herrenschmidt wrote:
  The other one I'm hitting now is that the SCSI layer nowadays embeds the
 
 'nowadays'?  It has always been so.

Wasn't it kmalloc'ed at one point ?

  sense_buffer inside the scsi_cmnd structure without any kind of
  alignment whatsoever. I've been hitting irregulary is a crash on SCSI 
  command
  completion that seems to be related to corruption of the request
  pointer in struct scsi_cmnd and I think it might be the cause.
  I'm now trying to setup a proper repro-case.
 
 What other drivers do is DMA to their own allocation and then memcpy to
 the sense buffer.

What other drivers ? Those architectures use the same drivers as
everything else.

 There is a movement to allocate the sense data as its own sg list, but
 I don't think that patch has even been posted yet.

I've seen code creating an sglist from the scsi_cmnd->sense_buffer and
passing that to drivers. That breaks.

Ben.



Re: SCSI breakage on non-cache coherent architectures

2007-11-19 Thread Benjamin Herrenschmidt

On Mon, 2007-11-19 at 09:09 -0600, James Bottomley wrote:
  What other drivers do is DMA to their own allocation and then memcpy to
  the sense buffer.
  
  There is a movement to allocate the sense data as its own sg list, but
  I don't think that patch has even been posted yet.
 
 I'd like to be rid of it inside the command for various reasons:  every
 command has one of these, and they're expensive in the allocation (at 96
 bytes).  There's no reason we have to allocate and free that amount of
 space with every command.  In theory, the number of these is bounded at
 the queue depth, in practice, there's usually only one, and this DMA
  alignment issue does require most drivers to double copy.

And most drivers don't and break. Take USB storage: I -think- (the code
paths aren't trivial to follow) it just gets the sglist cooked up by the
code in scsi_error.c, no ? That just points to the buffer in scsi_cmnd. It
then passes that for DMA to the USB stack.

Ben.




Re: SCSI breakage on non-cache coherent architectures

2007-11-19 Thread Benjamin Herrenschmidt

 I'd like to be rid of it inside the command for various reasons:  every
 command has one of these, and they're expensive in the allocation (at 96
 bytes).  There's no reason we have to allocate and free that amount of
 space with every command.  In theory, the number of these is bounded at
 the queue depth, in practice, there's usually only one, and this DMA
 alignment issue does require most drivers to double copy.

Do you have a plan for short term here ? I'd like something fixed
for .25, so I may have to introduce a __dma_aligned macro of some sort
to deal with that in the meantime...

Ben.




Re: SCSI breakage on non-cache coherent architectures

2007-11-19 Thread Benjamin Herrenschmidt

On Mon, 2007-11-19 at 13:43 -0800, Roland Dreier wrote:
  I've been debugging various issues on the PowerPC 44x embedded
   architecture which happens to have non-coherent PCI DMA.
   
  One of the problems I'm hitting is that one really needs to enforce
  kmalloc alignment to cache lines or bad things will happen (among
  others with USB); for some reason, powerpc failed to do so, I fixed it.
 
 Heh... I hit the same problem literally 5 years ago:
 http://lwn.net/Articles/1783/
 
 I implemented the __dma_buffer annotation:
 http://lwn.net/Articles/2269/
 
 But DaveM said we should just use the PCI pool code instead:
 http://lwn.net/Articles/2270/

Heh, well... 

In this case, DaveM just proposed something akin to your
__dma_buffer :-)

On the other hand, after discussing with James, it looks like we'll just
be reverting the patch that removed the kmalloc of the sense buffer
since non cache-coherent archs are supposed to enforce kmalloc alignment
to cache lines.

__dma_buffer still seems like a good thing to have if too many other
drivers are hitting that but for this specific problem, it's not the
approach that we'll be taking.

Cheers,
Ben.




Re: SCSI breakage on non-cache coherent architectures

2007-11-19 Thread Benjamin Herrenschmidt

On Mon, 2007-11-19 at 14:31 -0800, David Miller wrote:
 From: Benjamin Herrenschmidt [EMAIL PROTECTED]
 Date: Tue, 20 Nov 2007 06:51:14 +1100
 
  On Mon, 2007-11-19 at 00:38 -0800, David Miller wrote:
   From: Benjamin Herrenschmidt [EMAIL PROTECTED]
   Date: Mon, 19 Nov 2007 16:35:23 +1100
   
   You could make a dma_cacheline_aligned and use that.
   It seems pretty reasonable.
  
  I was thinking about that. What archs would need it ? arm, mips, what
  else ?
 
 The sparc32 port would need it too.

James preference seem to go for a revert of the patch that removed the
kmalloc of the buffer instead. Sounds definitely like an easier plan
for .24 (and maybe even backport to stable).

I'll produce a patch for that later today or tomorrow.

Do you still think we should introduce this __dma_cacheline_aligned ? Do
you see other cases of drivers where it would be useful ? I tend to
agree with your earlier statement that drivers doing that are broken and
should be using a separate allocator for DMA'ble objects (in fact, on
non-cache coherent archs, kmalloc is just fine).
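The "separate allocator" idea can be sketched in userspace C. This is only an illustration of the layout argument, not the real SCSI API: aligned_alloc stands in for a cache-line-aligned kmalloc, and demo_cmd, SENSE_SIZE, and DMA_ALIGN are made-up names:

```c
#include <stdlib.h>
#include <string.h>

#define SENSE_SIZE 96
#define DMA_ALIGN  32  /* assumed cache-line / minimum DMA alignment */

/* Command structure whose DMA-visible sense buffer lives in its own
 * allocation, so it can never share a cache line with the command's
 * bookkeeping fields. */
struct demo_cmd {
	void *request;
	unsigned char *sense_buffer;
	int result;
};

static struct demo_cmd *demo_cmd_alloc(void)
{
	struct demo_cmd *cmd = calloc(1, sizeof(*cmd));
	if (!cmd)
		return NULL;
	/* aligned_alloc wants the size rounded up to the alignment. */
	cmd->sense_buffer = aligned_alloc(DMA_ALIGN,
			(SENSE_SIZE + DMA_ALIGN - 1) / DMA_ALIGN * DMA_ALIGN);
	if (!cmd->sense_buffer) {
		free(cmd);
		return NULL;  /* callers must now handle allocation failure */
	}
	memset(cmd->sense_buffer, 0, SENSE_SIZE);
	return cmd;
}

static void demo_cmd_free(struct demo_cmd *cmd)
{
	if (cmd) {
		free(cmd->sense_buffer);
		free(cmd);
	}
}
```

The cost is exactly what this thread worries about: every prep path can now fail, so each caller has to check a return code.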

Cheers,
Ben.




Re: SCSI breakage on non-cache coherent architectures

2007-11-19 Thread Benjamin Herrenschmidt

On Mon, 2007-11-19 at 16:46 -0800, David Miller wrote:
 
 1) Require that entire buffers are commited by call sites,
and thus embedding DMA'd within non-DMA stuff isn't allowed
 
 2) Add the __dma_cacheline_aligned tag.
 
 But note that with #2 it could get quite ugly because the
 alignment and size both have a minimum that needs to be
 enforced, not just the alignment alone.  So either:

Yup.

 struct foo {
 unsigned int other_unrelated_stuff;
 
 struct object dma_thing __dma_cacheline_aligned;
 
 unsigned int more_nondma_stuff __dma_cacheline_aligned;
 };

In my tests, I had used a fuckton_t object defined to be an empty
thing with alignment constraint, seemed to work :-) But I'd rather
require #1.

BTW. What is the status nowadays with skb's ?

Cheers,
Ben.



Re: SCSI breakage on non-cache coherent architectures

2007-11-19 Thread Benjamin Herrenschmidt

On Mon, 2007-11-19 at 18:10 -0800, Roland Dreier wrote:
 
 I wrapped this ugliness up inside the macro back in what I posted in
 2002 (http://lkml.org/lkml/2002/6/12/234):
 
 #define __dma_buffer __dma_buffer_line(__LINE__)
 #define __dma_buffer_line(line) __dma_buffer_expand_line(line)
 #define __dma_buffer_expand_line(line) \
 __attribute__ ((aligned(L1_CACHE_BYTES))); \
 char __dma_pad_ ## line [0] __attribute__
 ((aligned(L1_CACHE_BYTES)))
 
 then you just need to tag the actual member like:
 
 char foo[3] __dma_buffer;

That's actually not too bad ... 

I'm having a problem with reverting SCSI to use an allocation for the
sense buffer, because it can fail and the new scsi_eh_prep_cmnd() isn't
supposed to return a failure. I've changed that but now I get into
trying to fix the various drivers that use it without checking the
result code and it's becoming much more complicated than initially
thought.

So I may do the above instead and revive your patch.

Any objection ? James ? David ?

Cheers,
Ben.




Re: SCSI breakage on non-cache coherent architectures

2007-11-19 Thread Benjamin Herrenschmidt
FYI, Here's what I have for the SCSI change. I haven't updated drivers
to care for the new return code though, help appreciated with that as I
don't know much about these drivers.

Index: linux-work/drivers/scsi/scsi_error.c
===
--- linux-work.orig/drivers/scsi/scsi_error.c	2007-11-20 13:26:18.000000000 +1100
+++ linux-work/drivers/scsi/scsi_error.c	2007-11-20 13:43:05.000000000 +1100
@@ -602,8 +602,9 @@ static void scsi_abort_eh_cmnd(struct sc
  * @cmnd is ignored and this functions sets up a REQUEST_SENSE command
  * and cmnd buffers to read @sense_bytes into @scmd->sense_buffer.
  **/
-void scsi_eh_prep_cmnd(struct scsi_cmnd *scmd, struct scsi_eh_save *ses,
-		       unsigned char *cmnd, int cmnd_size, unsigned sense_bytes)
+int scsi_eh_prep_cmnd(struct scsi_cmnd *scmd, struct scsi_eh_save *ses,
+		      unsigned char *cmnd, int cmnd_size, unsigned sense_bytes,
+		      gfp_t gfp_mask)
 {
 	struct scsi_device *sdev = scmd->device;
 
@@ -622,12 +623,20 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd 
 	ses->use_sg = scmd->use_sg;
 	ses->resid = scmd->resid;
 	ses->result = scmd->result;
+	sg_init_table(&ses->sense_sgl, 1);
 
 	if (sense_bytes) {
+		struct page *pg;
+
+		if (sdev->host->hostt->unchecked_isa_dma)
+			gfp_mask |= __GFP_DMA;
 		scmd->request_bufflen = min_t(unsigned,
 			sizeof(scmd->sense_buffer), sense_bytes);
-		sg_init_one(&ses->sense_sgl, scmd->sense_buffer,
-			    scmd->request_bufflen);
+		pg = alloc_page(gfp_mask);
+		if (!pg)
+			return FAILED;
+		memset(page_address(pg), 0, scmd->request_bufflen);
+		sg_set_page(&ses->sense_sgl, pg, scmd->request_bufflen, 0);
 		scmd->request_buffer = &ses->sense_sgl;
 		scmd->sc_data_direction = DMA_FROM_DEVICE;
 		scmd->use_sg = 1;
@@ -658,6 +667,8 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd 
 	 * untransferred sense data should be interpreted as being zero.
 	 */
 	memset(scmd->sense_buffer, 0, sizeof(scmd->sense_buffer));
+
+	return SUCCESS;
 }
 EXPORT_SYMBOL(scsi_eh_prep_cmnd);
 
@@ -670,9 +681,17 @@ EXPORT_SYMBOL(scsi_eh_prep_cmnd);
  **/
 void scsi_eh_restore_cmnd(struct scsi_cmnd* scmd, struct scsi_eh_save *ses)
 {
+	struct page *pg;
+
 	/*
 	 * Restore original data
 	 */
+	pg = sg_page(&ses->sense_sgl);
+	if (pg) {
+		memcpy(scmd->sense_buffer, page_address(pg),
+		       scmd->request_bufflen);
+		__free_page(pg);
+	}
 	scmd->cmd_len = ses->cmd_len;
 	memcpy(scmd->cmnd, ses->cmnd, sizeof(scmd->cmnd));
 	scmd->sc_data_direction = ses->data_direction;
@@ -709,7 +728,10 @@ static int scsi_send_eh_cmnd(struct scsi
 	struct scsi_eh_save ses;
 	int rtn;
 
-	scsi_eh_prep_cmnd(scmd, &ses, cmnd, cmnd_size, sense_bytes);
+	if (scsi_eh_prep_cmnd(scmd, &ses, cmnd, cmnd_size, sense_bytes,
+			      GFP_KERNEL) != SUCCESS)
+		return FAILED;
+
 	shost->eh_action = &done;
 
 	spin_lock_irqsave(shost->host_lock, flags);
Index: linux-work/include/scsi/scsi_eh.h
===
--- linux-work.orig/include/scsi/scsi_eh.h	2007-11-20 13:36:44.000000000 +1100
+++ linux-work/include/scsi/scsi_eh.h	2007-11-20 13:42:49.000000000 +1100
@@ -81,9 +81,10 @@ struct scsi_eh_save {
struct scatterlist sense_sgl;
 };
 
-extern void scsi_eh_prep_cmnd(struct scsi_cmnd *scmd,
+extern int scsi_eh_prep_cmnd(struct scsi_cmnd *scmd,
struct scsi_eh_save *ses, unsigned char *cmnd,
-   int cmnd_size, unsigned sense_bytes);
+   int cmnd_size, unsigned sense_bytes,
+   gfp_t gfp_mask);
 
 extern void scsi_eh_restore_cmnd(struct scsi_cmnd* scmd,
struct scsi_eh_save *ses);
Index: linux-work/drivers/scsi/NCR5380.c
===
--- linux-work.orig/drivers/scsi/NCR5380.c	2007-11-20 13:46:41.000000000 +1100
+++ linux-work/drivers/scsi/NCR5380.c	2007-11-20 13:46:47.000000000 +1100
@@ -2283,7 +2283,7 @@ static void NCR5380_information_transfer
 			}
 
 			if ((cmd->cmnd[0] != REQUEST_SENSE) &&
 			    (status_byte(cmd->SCp.Status) == CHECK_CONDITION)) {
-				scsi_eh_prep_cmnd(cmd, &hostdata->ses, NULL, 0, ~0);
+				scsi_eh_prep_cmnd(cmd, &hostdata->ses, NULL, 0, ~0, GFP_ATOMIC);
 
 				dprintk(NDEBUG_AUTOSENSE, ("scsi%d : performing request sense\n",

SCSI breakage on non-cache coherent architectures

2007-11-18 Thread Benjamin Herrenschmidt
Hi James !

(Please CC me on replies as I'm not subscribed to linux-scsi)

I've been debugging various issues on the PowerPC 44x embedded
architecture which happens to have non-coherent PCI DMA.

One of the problems I'm hitting is that one really needs to enforce
kmalloc alignment to cache lines or bad things will happen (among
others with USB); for some reason, powerpc failed to do so, I fixed it.

The other one I'm hitting now is that the SCSI layer nowadays embeds the
sense_buffer inside the scsi_cmnd structure without any kind of
alignment whatsoever. What I've been hitting irregularly is a crash on SCSI command
completion that seems to be related to corruption of the request
pointer in struct scsi_cmnd and I think it might be the cause.
I'm now trying to setup a proper repro-case.

The sense buffer is something that is written to by the device, thus it
gets cache invalidated when the DMA happens, potentially losing whatever
was sharing the cache line, which includes, among other things, that
request pointer field.

There are other issues as well if any of the fields sharing the cache
line happens to be read while the DMA is in progress, it would
populate the cache with memory content prior to the DMA being completed.

It's fairly important on such architectures not to share cache lines
between objects being DMA'd to/from and the rest of the system. If the
DMA is strictly outgoing, it's generally ok, but not if the DMA is
bidirectional or the device is writing to memory.
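The sharing hazard can be made concrete with a little userspace arithmetic (a minimal model under assumptions: a 32-byte line and the cmd_unaligned layout are illustrative, not the real scsi_cmnd):

```c
#include <stddef.h>

#define CACHE_LINE 32  /* assumed L1 line size of the embedded core */

/* Unannotated layout mirroring the problem: the DMA-written sense
 * buffer is packed right next to ordinary CPU-owned fields. */
struct cmd_unaligned {
	void *request;                   /* dirtied by the CPU */
	unsigned char sense_buffer[96];  /* written by the device via DMA */
	int result;
};

/* 1 if two byte offsets fall within the same cache line. */
static int same_line(size_t a, size_t b)
{
	return a / CACHE_LINE == b / CACHE_LINE;
}

/* The invalidate covering the buffer's first line also covers the
 * request pointer, so a dirty pointer value can simply be lost. */
_Static_assert(offsetof(struct cmd_unaligned, request) / CACHE_LINE ==
	       offsetof(struct cmd_unaligned, sense_buffer) / CACHE_LINE,
	       "layout no longer demonstrates the hazard");
```

Both ends of the buffer are exposed: its first line overlaps the request pointer, and its last line overlaps whatever field follows.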

I'm not sure what the best way to fix that is. Internally, I've done
some tests whacking some cacheline_aligned annotations into the scsi_cmnd data
structure to verify I no longer get random SLAB corruption when using my
USB but that significantly bloats the size of the structure on archs
such as ppc64 that don't need it and have a large cache line size.

Unfortunately, I don't think there's any existing Kconfig symbol or arch
provided #define to tell us that we are on a non-coherent arch afaik
that could be used to make that conditional.

Another option would be to kmalloc the buffer (wasn't it the case before
btw ?) but I suppose some people will scream at the idea due to how the
command pools are done...

What do you suggest ?

Cheers,
Ben.




Re: [patch 5/6] ps3: BD/DVD/CD-ROM Storage Driver

2007-07-16 Thread Benjamin Herrenschmidt

 Upon closer look, while flush_kernel_dcache_page() is a no-op on ppc64,
 flush_dcache_page() isn't. So I'd prefer to not call it if not really needed.
 
 And according to James, flush_kernel_dcache_page() should be sufficient...
 
 So I'm getting puzzled again...

flush_dcache_page() handles icache vs. dcache coherency by clearing the
PG_arch_1 bit in the struct page that indicates that the page is cache
clean.

You -must- call it if you're going to use any form of CPU access to
write to the page (basically dirtying the data cache) and that page can
be ever mapped into user space and executed from.

I don't know what flush_kernel_dcache_page() does and if it needs a
similar treatment; it's a new interface, so maybe Jens and/or James can
tell me more about it..

Cheers,
Ben.



Re: [patch 5/6] ps3: BD/DVD/CD-ROM Storage Driver

2007-07-16 Thread Benjamin Herrenschmidt
On Mon, 2007-07-16 at 17:03 -0500, James Bottomley wrote:
 On Tue, 2007-07-17 at 07:49 +1000, Benjamin Herrenschmidt wrote:
   No ... that was the point of flush_kernel_dcache_page().  The page in
   question is page cache backed and contains user mappings.  However, the
   block layer has already done a flush_dcache_page() in get_user_pages()
   and the user shouldn't be touching memory under I/O (unless they want
   self induced aliasing problems) so we're free to assume all the user
   cachelines are purged, hence all we have to do is flush the kernel alias
   to bring the page up to date and make the users see it correctly.
  
  The block layer will have done that even in the swap-out path ? (Just
  asking... I'm not very familiar with the block layer)
 
 Er ... not really, this is the I/O path for user initiated I/O.  The
 page out path, by definition, can't have any extant user mappings.  For
 page out, the relevant page is flushed before its mapping is detached,
 and then it can be paged to the backing store (or for anonymous pages to
 the swap device) when no mappings remain.

Ok, thanks.

Ben.




Re: [patch 5/6] ps3: BD/DVD/CD-ROM Storage Driver

2007-07-13 Thread Benjamin Herrenschmidt
On Fri, 2007-07-13 at 09:02 -0400, James Bottomley wrote:
 On Wed, 2007-07-04 at 15:22 +0200, Geert Uytterhoeven wrote:
 +   kaddr = kmap_atomic(sgpnt->page, KM_USER0);
 +   if (!kaddr)
 +   return -1;
 +   len = sgpnt->length;
 +   if ((req_len + len) > buflen) {
 +   active = 0;
 +   len = buflen - req_len;
 +   }
 +   memcpy(kaddr + sgpnt->offset, buf + req_len, len);
 +   kunmap_atomic(kaddr, KM_USER0);
 
 This isn't a SCSI objection, but this sequence appears several times in
 this driver.  It's wrong for a non-PIPT architecture (and I believe the
 PS3 is VIPT) because you copy into the kernel alias for the page, which
 dirties the line in the cache of that alias (the user alias cache line
 was already invalidated).  However, unless you flush the kernel alias to
 main memory, the user could read stale data.  The way this is supposed
 to be done is to do a 

Nah, we have no cache aliasing on ppc, at least none that matters here,
and none on Cell.

Ben.



Re: [patch 5/6] ps3: BD/DVD/CD-ROM Storage Driver

2007-07-13 Thread Benjamin Herrenschmidt
On Fri, 2007-07-13 at 16:19 +0200, Arnd Bergmann wrote:
 I'm pretty sure that no ppc64 machine needs alias resolution in the kernel,
 although some are VIPT. Last time we discussed this, Segher explained it
 to me, but I don't remember which way Cell does it. IIRC, it automatically
 flushes cache lines that are accessed through aliases. 

Ah yes, I remember reading about this automatic flushing thing. I don't
know how the caches actually work on most of our PPC's, but the fact is,
we don't have aliasing issues, so I can safely ignore it for a bit
longer :-)

There are some aliasing issues with the instruction cache specifically
on some 4xx models but that's irrelevant to this discussion (and I think
we handle them elsewhere anyway).

Ben.




Re: [patch 1/6] ps3: Preallocate bootmem memory for the PS3 FLASH ROM storage driver

2007-06-15 Thread Benjamin Herrenschmidt
On Fri, 2007-06-15 at 13:39 +0200, Geert Uytterhoeven wrote:
 plain text document attachment (ps3-stable)
 Preallocate 256 KiB of bootmem memory for the PS3 FLASH ROM storage driver.

I still very much dislike the #ifdef xxx_MODULE in main kernel code.

At the end of the day, is it realistic to ever use a PS3 without the
storage driver ? I would suggest just allocating those unconditionally.

Ben.

 Signed-off-by: Geert Uytterhoeven [EMAIL PROTECTED]
 Signed-off-by: Geoff Levand [EMAIL PROTECTED]
 ---
  arch/powerpc/platforms/ps3/setup.c |   19 ++-
  include/asm-powerpc/ps3.h  |1 +
  2 files changed, 19 insertions(+), 1 deletion(-)
 
 --- a/arch/powerpc/platforms/ps3/setup.c
 +++ b/arch/powerpc/platforms/ps3/setup.c
 @@ -107,7 +107,8 @@ static void ps3_panic(char *str)
   while(1);
  }
  
 -#if defined(CONFIG_FB_PS3) || defined(CONFIG_FB_PS3_MODULE)
 +#if defined(CONFIG_FB_PS3) || defined(CONFIG_FB_PS3_MODULE) || \
 +defined(CONFIG_PS3_FLASH) || defined(CONFIG_PS3_FLASH_MODULE)
  static void prealloc(struct ps3_prealloc *p)
  {
    if (!p->size)
  @@ -123,7 +124,9 @@ static void prealloc(struct ps3_prealloc
    printk(KERN_INFO "%s: %lu bytes at %p\n", p->name, p->size,
   p->address);
   }
 +#endif
  
 +#if defined(CONFIG_FB_PS3) || defined(CONFIG_FB_PS3_MODULE)
  struct ps3_prealloc ps3fb_videomemory = {
   .name = "ps3fb videomemory",
   .size = CONFIG_FB_PS3_DEFAULT_SIZE_M*1024*1024,
 @@ -146,6 +149,18 @@ early_param(ps3fb, early_parse_ps3fb);
  #define prealloc_ps3fb_videomemory() do { } while (0)
  #endif
  
 +#if defined(CONFIG_PS3_FLASH) || defined(CONFIG_PS3_FLASH_MODULE)
 +struct ps3_prealloc ps3flash_bounce_buffer = {
 + .name = "ps3flash bounce buffer",
 + .size = 256*1024,
 + .align = 256*1024
 +};
 +EXPORT_SYMBOL_GPL(ps3flash_bounce_buffer);
 +#define prealloc_ps3flash_bounce_buffer() prealloc(ps3flash_bounce_buffer)
 +#else
 +#define prealloc_ps3flash_bounce_buffer() do { } while (0)
 +#endif
 +
  static int ps3_set_dabr(u64 dabr)
  {
   enum {DABR_USER = 1, DABR_KERNEL = 2,};
 @@ -175,6 +190,8 @@ static void __init ps3_setup_arch(void)
  #endif
  
   prealloc_ps3fb_videomemory();
 + prealloc_ps3flash_bounce_buffer();
 +
   ppc_md.power_save = ps3_power_save;
  
    DBG(" <- %s:%d\n", __func__, __LINE__);
 --- a/include/asm-powerpc/ps3.h
 +++ b/include/asm-powerpc/ps3.h
 @@ -427,6 +427,7 @@ struct ps3_prealloc {
  };
  
  extern struct ps3_prealloc ps3fb_videomemory;
 +extern struct ps3_prealloc ps3flash_bounce_buffer;
  
 
  #endif
 
 ___
 Linuxppc-dev mailing list
 [EMAIL PROTECTED]
 https://ozlabs.org/mailman/listinfo/linuxppc-dev

-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 6/7] ps3: ROM Storage Driver

2007-05-30 Thread Benjamin Herrenschmidt
On Wed, 2007-05-30 at 12:13 +0200, Christoph Hellwig wrote:
 
 For any sane hypervisor or hardware the copy should be worse
 than that.  Then again, sane hardware or a sane hypervisor would support
 SG requests..

Agreed... Sony should fix that, it's a bit ridiculous.

Ben.




Re: [patch 6/7] ps3: ROM Storage Driver

2007-05-29 Thread Benjamin Herrenschmidt
On Tue, 2007-05-29 at 13:11 +0200, Geert Uytterhoeven wrote:
  This looks very inefficient.  Just set sg_tablesize of your driver
  to 1 to avoid getting multiple segments.
 
 The disadvantage of setting sg_tablesize = 1 is that the driver will
 get small
 requests (PAGE_SIZE) most of the time, which is very bad for
 performance.

And the joke is that not only can the HW do scatter & gather, but you
also have an iommu ...

Ben.




Re: qla_wxyz pci_set_mwi question

2007-04-12 Thread Benjamin Herrenschmidt
On Thu, 2007-04-12 at 14:04 -0600, Matthew Wilcox wrote:
 On Thu, Apr 12, 2007 at 12:37:13PM -0700, Andrew Vasquez wrote:
  On Thu, 12 Apr 2007, Matthew Wilcox wrote:
   Why should it fail?  If there's a platform which can't support a
   cacheline size that the qla2xyz card can handle, it should be able to
   happily fall back to doing plain writes instead of MWIs.  IMO, it should
   just call pci_set_mwi() and ignore the result.
  
  I believe there were some errata on some ISP2xxx chips where MWI
  needed to be set for proper operation.  I'll go back, verify and
  update the patch accordingly.
 
 Hmm.  The thing is that pci_set_mwi() returns success on machines where
 MWI is disabled (currently only PPC64).  Perhaps it needs to fail
 instead.

MWI isn't disabled on ppc64... or did I miss something?

Ben.



Re: qla_wxyz pci_set_mwi question

2007-04-12 Thread Benjamin Herrenschmidt

 Willy was referring to this from include/asm-powerpc/pci.h:
 
 #ifdef CONFIG_PPC64
 
 /*
  * We want to avoid touching the cacheline size or MWI bit.
  * pSeries firmware sets the cacheline size (which is not the cpu cacheline
  * size in all cases) and hardware treats MWI the same as memory write.
  */
 #define PCI_DISABLE_MWI
 
 
 which makes pci_set_mwi() do nothing other than return 0;

Interesting... I think I missed that we had that bit for some time :-)

Well, I suppose that on pSeries and probably pmac too, the firmware will
set the MWI bit for us anyway, but it's a bit dodgy to apply that to
all of ppc64... they aren't all pSeries.

I'll have to look into that again one of these days.

Cheers,
Ben.




Re: [PATCH 18/59] sysctl: ipmi remove unnecessary insert_at_head flag

2007-01-16 Thread Benjamin Herrenschmidt
On Tue, 2007-01-16 at 09:39 -0700, Eric W. Biederman wrote:
 From: Eric W. Biederman [EMAIL PROTECTED] - unquoted
 
 With unique sysctl binary numbers setting insert_at_head is pointless.
 
 Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]

Acked-by: Benjamin Herrenschmidt [EMAIL PROTECTED]

 ---
  drivers/char/ipmi/ipmi_poweroff.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)
 
 diff --git a/drivers/char/ipmi/ipmi_poweroff.c 
 b/drivers/char/ipmi/ipmi_poweroff.c
 index 9d23136..b3ae65e 100644
 --- a/drivers/char/ipmi/ipmi_poweroff.c
 +++ b/drivers/char/ipmi/ipmi_poweroff.c
 @@ -686,7 +686,7 @@ static int ipmi_poweroff_init (void)
   printk(KERN_INFO PFX "Power cycle is enabled.\n");
  
  #ifdef CONFIG_PROC_FS
 - ipmi_table_header = register_sysctl_table(ipmi_root_table, 1);
 + ipmi_table_header = register_sysctl_table(ipmi_root_table, 0);
   if (!ipmi_table_header) {
   printk(KERN_ERR PFX "Unable to register powercycle sysctl\n");
   rv = -ENOMEM;



Re: [PATCH 36/59] sysctl: C99 convert ctl_tables entries in arch/ppc/kernel/ppc_htab.c

2007-01-16 Thread Benjamin Herrenschmidt
On Tue, 2007-01-16 at 09:39 -0700, Eric W. Biederman wrote:
 From: Eric W. Biederman [EMAIL PROTECTED] - unquoted
 
 And make the mode of the kernel directory 0555, as no one is allowed
 to write to sysctl directories.
 
 Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]

Acked-by: Benjamin Herrenschmidt [EMAIL PROTECTED]

 ---
  arch/ppc/kernel/ppc_htab.c |   11 ---
  1 files changed, 8 insertions(+), 3 deletions(-)
 
 diff --git a/arch/ppc/kernel/ppc_htab.c b/arch/ppc/kernel/ppc_htab.c
 index bd129d3..77b20ff 100644
 --- a/arch/ppc/kernel/ppc_htab.c
 +++ b/arch/ppc/kernel/ppc_htab.c
 @@ -442,11 +442,16 @@ static ctl_table htab_ctl_table[]={
   .mode   = 0644,
   .proc_handler   = proc_dol2crvec,
   },
 - { 0, },
 + {}
  };
  static ctl_table htab_sysctl_root[] = {
 - { 1, "kernel", NULL, 0, 0755, htab_ctl_table, },
 - { 0,},
 + {
 + .ctl_name   = CTL_KERN,
 + .procname   = "kernel",
 + .mode   = 0555,
 + .child  = htab_ctl_table,
 + },
 + {}
  };
  
  static int __init



Re: [PATCH 35/59] sysctl: C99 convert ctl_tables in arch/powerpc/kernel/idle.c

2007-01-16 Thread Benjamin Herrenschmidt
On Tue, 2007-01-16 at 09:39 -0700, Eric W. Biederman wrote:
 From: Eric W. Biederman [EMAIL PROTECTED] - unquoted
 
 This was partially done already and there was no ABI breakage; what
 a relief.
 
 Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]

Acked-by: Benjamin Herrenschmidt [EMAIL PROTECTED]

 ---
  arch/powerpc/kernel/idle.c |   11 ---
  1 files changed, 8 insertions(+), 3 deletions(-)
 
 diff --git a/arch/powerpc/kernel/idle.c b/arch/powerpc/kernel/idle.c
 index 8994af3..8b27bb1 100644
 --- a/arch/powerpc/kernel/idle.c
 +++ b/arch/powerpc/kernel/idle.c
 @@ -110,11 +110,16 @@ static ctl_table powersave_nap_ctl_table[]={
   .mode   = 0644,
   .proc_handler   = proc_dointvec,
   },
 - { 0, },
 + {}
  };
  static ctl_table powersave_nap_sysctl_root[] = {
 - { 1, "kernel", NULL, 0, 0755, powersave_nap_ctl_table, },
 - { 0,},
 + {
 + .ctl_name   = CTL_KERN,
 + .procname   = "kernel",
 + .mode   = 0755,
 + .child  = powersave_nap_ctl_table,
 + },
 + {}
  };
  
  static int __init



Re: [PATCH 2.6.15.4 rel.2 1/1] libata: add hotswap to sata_svw

2006-11-28 Thread Benjamin Herrenschmidt
On Tue, 2006-11-28 at 23:22 +, David Woodhouse wrote:
 On Thu, 2006-02-16 at 16:09 +0100, Martin Devera wrote:
  From: Martin Devera [EMAIL PROTECTED]
  
  Add hotswap capability to Serverworks/Broadcom SATA controllers. The
  controller has a SIM register which selects which bits in the SATA_ERROR
  register fire interrupts.
  The solution hooks on COMWAKE (plug), PHYRDY change and 10B8B decode
  error (unplug) and calls into Lukasz's hotswap framework.
  The code got one day of testing on a dual core Athlon64 H8SSL Supermicro
  MoBo with HT-1000 SATA, an SMP kernel and two CaviarRE SATA HDDs in
  hotswap bays.
  
  Signed-off-by: Martin Devera [EMAIL PROTECTED]
 
 What became of this?

I might be to blame for not testing it... The Xserve I had on my desk
was too noisy for most of my co-workers, so I kept delaying and forgot
about it ...

Also the Xserve I have only has one disk, which makes hotplug testing a
bit harder :-)

Ben.




Re: iomapping a big endian area

2005-04-04 Thread Benjamin Herrenschmidt
On Sat, 2005-04-02 at 22:27 -0600, James Bottomley wrote:
 On Sat, 2005-04-02 at 20:08 -0800, David S. Miller wrote:
   Did anyone have a preference for the API?  I was thinking
   ioread32_native, but ioread32be is fine too.
  
  I think doing foo{be,le}{8,16,32}() would be consistent with
  our byteorder.h interface names.
 
 Thinking about this some more, I know of no case of a BE bus connected
 to a LE system, nor do I think anyone would ever create such a beast, so
 our only missing interface is for a BE bus on a BE system.

It's more a matter of the device than the bus, imho...

 Thus, I think io{read,write}{16,32}_native are better interfaces ...

I disagree. The driver will never know ...

 they basically mean pass memory operations without byte swaps, so
 they're well defined on both BE and LE systems and correspond exactly to
 our existing _raw_{read,write}{w,l} calls (principle of least surprise).

I don't think it's sane. You know that your device is BE or LE and use
the appropriate interface. "native" doesn't make sense to me in this
context.

Ben.

