Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Tue, 2017-04-18 at 16:24 -0600, Jason Gunthorpe wrote: > Basically, all this list processing is a huge overhead compared to > just putting a helper call in the existing sg iteration loop of the > actual op. Particularly if the actual op is a no-op like no-mmu x86 > would use. Yes, I'm leaning toward that approach too. The helper itself could hang off the devmap though. > Since dma mapping is a performance path we must be careful not to > create intrinsic inefficiencies with otherwise nice layering :) > > Jason
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Tue, 2017-04-18 at 17:21 -0600, Jason Gunthorpe wrote: > Splitting the sgl is different from iommu batching. > > As an example, an O_DIRECT write of 1 MB with a single 4K P2P page in > the middle. > > The optimum behavior is to allocate a 1MB-4K iommu range and fill it > with the CPU memory. Then return a SGL with three entires, two > pointing into the range and one to the p2p. > > It is creating each range which tends to be expensive, so creating > two > ranges (or worse, if every SGL created a range it would be 255) is > very undesired. I think it's easier to get us started to just use a helper and stick it in the existing sglist processing loop of the architecture. As we noticed, stacking dma_ops is actually non-trivial and opens quite the can of worms. As Jerome mentioned, you can end up with IOs ops containing an sglist that is a collection of memory and GPU pages for example. Cheers, Ben.
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Tue, 2017-04-18 at 15:22 -0600, Jason Gunthorpe wrote: > On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote: > > > I think this opens an even bigger can of worms.. > > > > No, I don't think it does. You'd only shim when the target page is > > backed by a device, not host memory, and you can figure this out by > > a > > is_zone_device_page()-style lookup. > > The bigger can of worms is how do you meaningfully stack dma_ops. > > What does the p2p provider do when it detects a p2p page? Yeah I think we don't really want to stack dma_ops... thinking more about it. As I just wrote, it looks like we might need a more specialised hook in the devmap to be used by the main dma_op, on a per-page basis. Ben.
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Tue, 2017-04-18 at 15:03 -0600, Jason Gunthorpe wrote: > I don't follow, when does get_dma_ops() return a p2p aware provider? > It has no way to know if the DMA is going to involve p2p, get_dma_ops > is called with the device initiating the DMA. > > So you'd always return the P2P shim on a system that has registered > P2P memory? > > Even so, how does this shim work? dma_ops are not really intended to > be stacked. How would we make unmap work, for instance? What happens > when the underlying iommu dma ops actually natively understands p2p > and doesn't want the shim? Good point. We only know on a per-page basis ... ugh. So we really need to change the arch main dma_ops. I'm not opposed to that. What we then need to do is have that main arch dma_map_sg, when it encounters a "device" page, call into a helper attached to the devmap to handle *that page*, providing sufficient context. That helper wouldn't perform the actual iommu mapping. It would simply return something along the lines of: - "use that alternate bus address and don't map in the iommu" - "use that alternate bus address and do map in the iommu" - "proceed as normal" - "fail" What do you think ? Cheers, Ben.
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Tue, 2017-04-18 at 14:48 -0600, Logan Gunthorpe wrote: > > ...and that dma_map goes through get_dma_ops(), so I don't see the conflict? > > The main conflict is in dma_map_sg which only does get_dma_ops once but > the sg may contain memory of different types. We can handle that in our "overriden" dma ops. It's a bit tricky but it *could* break it down into segments and forward portions back to the original dma ops. Ben.
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Tue, 2017-04-18 at 12:00 -0600, Jason Gunthorpe wrote: > - All platforms can succeed if the PCI devices are under the same > 'segment', but where segments begin is somewhat platform specific > knowledge. (this is 'same switch' idea Logan has talked about) We also need to be careful whether P2P is enabled in the switch or not. Cheers, Ben.
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Tue, 2017-04-18 at 10:27 -0700, Dan Williams wrote: > > FWIW, RDMA probably wouldn't want to use a p2mem device either, we > > already have APIs that map BAR memory to user space, and would like to > > keep using them. A 'enable P2P for bar' helper function sounds better > > to me. > > ...and I think it's not a helper function as much as asking the bus > provider "can these two device dma to each other". The "helper" is the > dma api redirecting through a software-iommu that handles bus address > translation differently than it would handle host memory dma mapping. Do we even need tat function ? The dma_ops have a dma_supported() call... If we have those override ops built into the "dma_target" object, then these things can make that decision knowing both the source and target device. Cheers, Ben.
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Mon, 2017-04-17 at 23:43 -0600, Logan Gunthorpe wrote: > > On 17/04/17 03:11 PM, Benjamin Herrenschmidt wrote: > > Is it ? Again, you create a "concept" the user may have no idea about, > > "p2pmem memory". So now any kind of memory buffer on a device can could > > be use for p2p but also potentially a bunch of other things becomes > > special and called "p2pmem" ... > > The user is going to have to have an idea about it if they are designing > systems to make use of it. I've said it before many times: this is an > optimization with significant trade-offs so the user does have to make > decisions regarding when to enable it. Not necessarily. There are many cases where the "end user" won't have any idea. In any case, I think we bring the story down to those two points of conflating the allocator with the peer to peer DMA, and the lack of generality in the approach to solve the peer to peer DMA problem. > > But what do you have in p2pmem that somebody benefits from. Again I > > don't understand what that "p2pmem" device buys you in term of > > functionality vs. having the device just instanciate the pages. > > Well thanks for just taking a big shit on all of our work without even > reading the patches. Bravo. Now now now calm down. We are being civil here. I'm not shitting on anything, I'm asking what seems to be a reasonable question in term of benefits of the approach you have chosen. Using that sort of language will not get you anywhere. > > Now having some kind of way to override the dma_ops, yes I do get that, > > and it could be that this "p2pmem" is typically the way to do it, but > > at the moment you don't even have that. So I'm a bit at a loss here. > > Yes, we've already said many times that this is something we will need > to add. > > > But it doesn't *have* to be. Again, take my GPU example. The fact that > > a NIC might be able to DMA into it doesn't make it specifically "p2p > > memory". > > Just because you use it for other things doesn't mean it can't also > provide the service of a "p2pmem" device. But there is no such thing as a "p2pmem" device.. that's what I'm trying to tell you... As both Jerome and I tried to explain, there are many reason why one may want to do peer DMA into some device memory, that doesn't make that memory some kind of "p2pmem". It's trying to stick a generic label onto something that isn't. That's why I'm suggesting we disconnect the two aspects. On one hand the problem of handling p2p DMA, whether the target is some memory, some MMIO registers, etc... On the other hand, some generic "utility" that can optionally be used by drivers to manage a pool of DMA memory in the device, essentially a simple allocator. The two things are completely orthogonal. > > So now your "p2pmem" device needs to also be laid out on top of those > > MMIO registers ? It's becoming weird. > > Yes, Max Gurtovoy has also expressed an interest in expanding this work > to cover things other than memory. He's suggested simply calling it a > p2p device, but until we figure out what exactly that all means we can't > really finalize a name. Possibly. In any case, I think it should be separate from the allocation. > > See, basically, doing peer 2 peer between devices has 3 main challenges > > today: The DMA API needing struct pages, the MMIO translation issues > > and the IOMMU translation issues. > > > > You seem to create that added device as some kind of "owner" for the > > struct pages, solving #1, but leave #2 and #3 alone. > > Well there are other challenges too. Like figuring out when it's > appropriate to use, tying together the device that provides the memory > with the driver tring to use it in DMA transactions, etc, etc. Our patch > set tackles these latter issues. But it tries to conflate the allocation, which is basically the fact that this is some kind of "memory pool" with the problem of doing peer DMA. I'm advocating for separating the concepts. > > If we go down that path, though, rather than calling it p2pmem I would > > call it something like dma_target which I find much clearer especially > > since it doesn't have to be just memory. > > I'm not set on the name. My arguments have been specifically for the > existence of an independent struct device. But I'm not really interested > in getting into bike shedding arguments over what to call it at this > time when we don't even really know what it's going to end up doing in > the end. It's not bike shedding. It's about taking out the allocator part and making it clear that this isn't something to lay out on top of a p
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Mon, 2017-04-17 at 10:52 -0600, Logan Gunthorpe wrote: > > On 17/04/17 01:20 AM, Benjamin Herrenschmidt wrote: > > But is it ? For example take a GPU, does it, in your scheme, need an > > additional "p2pmem" child ? Why can't the GPU driver just use some > > helper to instantiate the necessary struct pages ? What does having an > > actual "struct device" child buys you ? > > Yes, in this scheme, it needs an additional p2pmem child. Why is that an > issue? It certainly makes it a lot easier for the user to understand the > p2pmem memory in the system (through the sysfs tree) and reason about > the topology and when to use it. This is important. Is it ? Again, you create a "concept" the user may have no idea about, "p2pmem memory". So now any kind of memory buffer on a device can could be use for p2p but also potentially a bunch of other things becomes special and called "p2pmem" ... > > > 2) In order to create the struct pages we use the ZONE_DEVICE > > > infrastructure which requires a struct device. (See > > > devm_memremap_pages.) > > > > Yup, but you already have one in the actual pci_dev ... What is the > > benefit of adding a second one ? > > But that would tie all of this very tightly to be pci only and may get > hard to differentiate if more users of ZONE_DEVICE crop up who happen to > be using a pci device. But what do you have in p2pmem that somebody benefits from. Again I don't understand what that "p2pmem" device buys you in term of functionality vs. having the device just instanciate the pages. Now having some kind of way to override the dma_ops, yes I do get that, and it could be that this "p2pmem" is typically the way to do it, but at the moment you don't even have that. So I'm a bit at a loss here. > Having a specific class for this makes it very > clear how this memory would be handled. But it doesn't *have* to be. Again, take my GPU example. The fact that a NIC might be able to DMA into it doesn't make it specifically "p2p memory". Essentially you are saying that any device that happens to have a piece of mappable "memory" (or something that behaves like it) and can be DMA'ed into should now have that "p2pmem" thing attached to it. Now take an example where that becomes really awkward (it's also a real example of something people want to do). I have a NIC and a GPU, the NIC DMA's data to/from the GPU, but they also want to poke at each other doorbell, the GPU to kick the NIC into action when data is ready to send, the NIC to poke the GPU when data has been received. Those doorbells are MMIO registers. So now your "p2pmem" device needs to also be laid out on top of those MMIO registers ? It's becoming weird. See, basically, doing peer 2 peer between devices has 3 main challenges today: The DMA API needing struct pages, the MMIO translation issues and the IOMMU translation issues. You seem to create that added device as some kind of "owner" for the struct pages, solving #1, but leave #2 and #3 alone. Now, as I said, it could very well be that having the devmap pointer point to some specific device-type with a well known structure to provide solutions for #2 and #3 such as dma_ops overrides, is indeed the right way to solve these problems. If we go down that path, though, rather than calling it p2pmem I would call it something like dma_target which I find much clearer especially since it doesn't have to be just memory. For the sole case of creating struct page's however, I fail to see the point. > For example, although I haven't > looked into it, this could very well be a point of conflict with HMM. If > they were to use the pci device to populate the dev_pagemap then we > couldn't also use the pci device. I feel it's much better for users of > dev_pagemap to have their struct devices they own to avoid such conflicts. If we are going to create some sort of struct dma_target, HMM could potentially just look for the parent if it needs the PCI device. > > > This amazingly gets us the get_dev_pagemap > > > architecture which also uses a struct device. So by using a p2pmem > > > device we can go from struct page to struct device to p2pmem device > > > quickly and effortlessly. > > > > Which isn't terribly useful in itself right ? What you care about is > > the "enclosing" pci_dev no ? Or am I missing something ? > > Sure it is. What if we want to someday support p2pmem that's on another bus? But why not directly use that other bus' device in that case ? > > > 3) You wouldn't want to use the pci's struct device because it doesn't > > > really describe what's going on. For example, there may be multiple > > > devices on the pci devi
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Sun, 2017-04-16 at 23:13 -0600, Logan Gunthorpe wrote: > > > > > I'm still not 100% why do you need a "p2mem device" mind you ... > > Well, you don't "need" it but it is a design choice that I think makes a > lot of sense for the following reasons: > > 1) p2pmem is in fact a device on the pci bus. A pci driver will need to > set it up and create the device and thus it will have a natural parent > pci device. Instantiating a struct device for it means it will appear in > the device hierarchy and one can use that to reason about its position > in the topology. But is it ? For example take a GPU, does it, in your scheme, need an additional "p2pmem" child ? Why can't the GPU driver just use some helper to instantiate the necessary struct pages ? What does having an actual "struct device" child buys you ? > 2) In order to create the struct pages we use the ZONE_DEVICE > infrastructure which requires a struct device. (See > devm_memremap_pages.) Yup, but you already have one in the actual pci_dev ... What is the benefit of adding a second one ? > This amazingly gets us the get_dev_pagemap > architecture which also uses a struct device. So by using a p2pmem > device we can go from struct page to struct device to p2pmem device > quickly and effortlessly. Which isn't terribly useful in itself right ? What you care about is the "enclosing" pci_dev no ? Or am I missing something ? > 3) You wouldn't want to use the pci's struct device because it doesn't > really describe what's going on. For example, there may be multiple > devices on the pci device in question: eg. an NVME card and some p2pmem. What is "some p2pmem" ? > Or it could be a NIC with some p2pmem. Again what is "some p2pmem" ? That a device might have some memory-like buffer space is all well and good but does it need to be specifically distinguished at the device level ? It could be inherent to what the device is... for example again take the GPU example, why would you call the FB memory "p2pmem" ? > Or it could just be p2pmem by itself. And the logic to figure out what > memory is available and where > the address is will be non-standard so it's really straightforward to > have any pci driver just instantiate a p2pmem device. Again I'm not sure why it needs to "instanciate a p2pmem" device. Maybe it's the term "p2pmem" that offputs me. If p2pmem allowed to have a standard way to lookup the various offsets etc... I mentioned earlier, then yes, it would make sense to have it as a staging point. As-is, I don't know. > It is probably worth you reading the RFC patches at this point to get a > better feel for this. Yup, I'll have another look a bit more in depth. Cheers, Ben. > Logan
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Sun, 2017-04-16 at 10:34 -0600, Logan Gunthorpe wrote: > > On 16/04/17 09:53 AM, Dan Williams wrote: > > ZONE_DEVICE allows you to redirect via get_dev_pagemap() to retrieve > > context about the physical address in question. I'm thinking you can > > hang bus address translation data off of that structure. This seems > > vaguely similar to what HMM is doing. > > Thanks! I didn't realize you had the infrastructure to look up a device > from a pfn/page. That would really come in handy for us. It does indeed. I won't be able to play with that much for a few weeks (see my other email) so if you're going to tackle this while I'm away, can you work with Jerome to make sure you don't conflict with HMM ? I really want a way for HMM to be able to layout struct pages over the GPU BARs rather than in "allocated free space" for the case where the BAR is big enough to cover all of the GPU memory. In general, I'd like a simple & generic way for any driver to ask the core to layout DMA'ble struct pages over BAR space. I an not convinced this requires a "p2mem device" to be created on top of this though but that's a different discussion. Of course the actual ability to perform the DMA mapping will be subject to various restrictions that will have to be implemented in the actual "dma_ops override" backend. We can have generic code to handle the case where devices reside on the same domain, which can deal with switch configuration etc... we will need to have iommu specific code to handle the case going through the fabric. Virtualization is a separate can of worms due to how qemu completely fakes the MMIO space, we can look into that later. Cheers, Ben.
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Sun, 2017-04-16 at 10:47 -0600, Logan Gunthorpe wrote: > > I think you need to give other archs a chance to support this with a > > design that considers the offset case as a first class citizen rather > > than an afterthought. > > I'll consider this. Given the fact I can use your existing > get_dev_pagemap infrastructure to look up the p2pmem device this > probably isn't as hard as I thought it would be anyway (we probably > don't even need a page flag). We'd just have lookup the dev_pagemap, > test if it's a p2pmem device, and if so, call a p2pmem_dma_map function > which could apply the offset or do any other arch specific logic (if > necessary). I'm still not 100% why do you need a "p2mem device" mind you ... Cheers, Ben.
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Sun, 2017-04-16 at 08:53 -0700, Dan Williams wrote: > > Just thinking out loud ... I don't have a firm idea or a design. But > > peer to peer is definitely a problem we need to tackle generically, the > > demand for it keeps coming up. > > ZONE_DEVICE allows you to redirect via get_dev_pagemap() to retrieve > context about the physical address in question. I'm thinking you can > hang bus address translation data off of that structure. This seems > vaguely similar to what HMM is doing. Ok, that's interesting. That would be a way to handle the "lookup" I was mentioning in the email I sent a few minutes ago. We would probably need to put some "structure" to that context. I'm very short on time to look into the details of this for at least a month (I'm taking about 3 weeks off for personal reasons next week), but I'm happy to dive more into this when I'm back and sort out with Jerome how to make it all co-habitate nicely with HMM. Cheers, Ben.
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Sun, 2017-04-16 at 08:44 -0700, Dan Williams wrote: > The difference is that there was nothing fundamental in the core > design of pmem + DAX that prevented other archs from growing pmem > support. Indeed. In fact we have work in progress support for pmem on power using experimental HW. > THP and memory hotplug existed on other architectures and > they just need to plug in their arch-specific enabling. p2p support > needs the same starting point of something more than one architecture > can plug into, and handling the bus address offset case needs to be > incorporated into the design. > > pmem + dax did not change the meaning of what a dma_addr_t is, p2p does. The more I think about it, the more I tend toward something along the lines of having the arch DMA ops being able to quickly differentiate between "normal" memory (which includes non-PCI pmem in some cases, it's an architecture choice I suppose) and "special device" (page flag ? pfn bit ? ... there are options). >From there, we keep our existing fast path for the normal case. For the special case, we need to provide a fast lookup mechanism (assuming we can't stash enough stuff in struct page or the pfn) to get back to a struct of some sort that provides the necessary information to resolve the translation. This *could* be something like a struct p2mem device that carries a special set of DMA ops, though we probably shouldn't make the generic structure PCI specific. This is a slightly slower path, but that "stub" structure allows the special DMA ops to provide the necessary bus-specific knowledge, which for PCI for example, can check whether the devices are on the same segment, whether the switches are configured to allow p2p, etc... What form should that fast lookup take ? It's not completely clear to me at that point. We could start with a simple linear lookup I suppose and improve in a second stage. Of course this pipes into the old discussion about disconnecting the DMA ops from struct page. If we keep struct page, any device that wants to be a potential DMA target will need to do something "special" to create those struct pages etc.. though we could make that a simple pci helper that pops the necessary bits and pieces for a given BAR & range. If we don't need struct page, then it might be possible to hide it all in the PCI infrastructure. > > Virtualization specifically would be a _lot_ more difficult than simply > > supporting offsets. The actual topology of the bus will probably be lost > > on the guest OS and it would therefor have a difficult time figuring out > > when it's acceptable to use p2pmem. I also have a difficult time seeing > > a use case for it and thus I have a hard time with the argument that we > > can't support use cases that do want it because use cases that don't > > want it (perhaps yet) won't work. > > > > > This is an interesting experiement to look at I suppose, but if you > > > ever want this upstream I would like at least for you to develop a > > > strategy to support the wider case, if not an actual implementation. > > > > I think there are plenty of avenues forward to support offsets, etc. > > It's just work. Nothing we'd be proposing would be incompatible with it. > > We just don't want to have to do it all upfront especially when no one > > really knows how well various architecture's hardware supports this or > > if anyone even wants to run it on systems such as those. (Keep in mind > > this is a pretty specific optimization that mostly helps systems > > designed in specific ways -- not a general "everybody gets faster" type > > situation.) Get the cases working we know will work, can easily support > > and people actually want. Then expand it to support others as people > > come around with hardware to test and use cases for it. > > I think you need to give other archs a chance to support this with a > design that considers the offset case as a first class citizen rather > than an afterthought. Thanks :-) There's a reason why I'm insisting on this. We have constant requests for this today. We have hacks in the GPU drivers to do it for GPUs behind a switch, but those are just that, ad-hoc hacks in the drivers. We have similar grossness around the corner with some CAPI NICs trying to DMA to GPUs. I have people trying to use PLX DMA engines to whack nVME devices. I'm very interested in a more generic solution to deal with the problem of P2P between devices. I'm happy to contribute with code to handle the powerpc bits but we need to agree on the design first :) Cheers, Ben.
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Sat, 2017-04-15 at 15:09 -0700, Dan Williams wrote: > I'm wondering, since this is limited to support behind a single > switch, if you could have a software-iommu hanging off that switch > device object that knows how to catch and translate the non-zero > offset bus address case. We have something like this with VMD driver, > and I toyed with a soft pci bridge when trying to support AHCI+NVME > bar remapping. When the dma api looks up the iommu for its device it > hits this soft-iommu and that driver checks if the page is host memory > or device memory to do the dma translation. You wouldn't need a bit in > struct page, just a lookup to the hosting struct dev_pagemap in the > is_zone_device_page() case and that can point you to p2p details. I was thinking about a hook in the arch DMA ops but that kind of wrapper might work instead indeed. However I'm not sure what's the best way to "instantiate" it. The main issue is that the DMA ops are a function of the initiator, not the target (since the target is supposed to be memory) so things are a bit awkward. One (user ?) would have to know that a given device "intends" to DMA directly to another device. This is awkward because in the ideal scenario, this isn't something the device knows. For example, one could want to have an existing NIC DMA directly to/from NVME pages or GPU pages. The NIC itself doesn't know the characteristic of these pages, but *something* needs to insert itself in the DMA ops of that bridge to make it possible. That's why I wonder if it's the struct page of the target that should be "marked" in such a way that the arch dma'ops can immediately catch that they belong to a device and might require "wrapped" operations. Are ZONE_DEVICE pages identifiable based on the struct page alone ? (a flag ?) That would allow us to keep a fast path for normal memory targets, but also have some kind of way to handle the special cases of such peer 2 peer (or also handle other type of peer to peer that don't necessarily involve PCI address wrangling but could require additional iommu bits). Just thinking out loud ... I don't have a firm idea or a design. But peer to peer is definitely a problem we need to tackle generically, the demand for it keeps coming up. Cheers, Ben.
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Sat, 2017-04-15 at 11:41 -0600, Logan Gunthorpe wrote: > Thanks, Benjamin, for the summary of some of the issues. > > On 14/04/17 04:07 PM, Benjamin Herrenschmidt wrote > > So I assume the p2p code provides a way to address that too via special > > dma_ops ? Or wrappers ? > > Not at this time. We will probably need a way to ensure the iommus do > not attempt to remap these addresses. You can't. If the iommu is on, everything is remapped. Or do you mean to have dma_map_* not do a remapping ? That's the problem again, same as before, for that to work, the dma_map_* ops would have to do something special that depends on *both* the source and target device. The current DMA infrastructure doesn't have anything like that. It's a rather fundamental issue to your design that you need to address. The dma_ops today are architecture specific and have no way to differenciate between normal and those special P2P DMA pages. > Though if it does, I'd expect > everything would still work you just wouldn't get the performance or > traffic flow you are looking for. We've been testing with the software > iommu which doesn't have this problem. So first, no, it's more than "you wouldn't get the performance". On some systems it may also just not work. Also what do you mean by "the SW iommu doesn't have this problem" ? It catches the fact that addresses don't point to RAM and maps differently ? > > The problem is that the latter while seemingly easier, is also slower > > and not supported by all platforms and architectures (for example, > > POWER currently won't allow it, or rather only allows a store-only > > subset of it under special circumstances). > > Yes, I think situations where we have to cross host bridges will remain > unsupported by this work for a long time. There are two many cases where > it just doesn't work or it performs too poorly to be useful. And the situation where you don't cross bridges is the one where you need to also take into account the offsets. *both* cases mean that you need somewhat to intervene at the dma_ops level to handle this. Which means having a way to identify your special struct pages or PFNs to allow the arch to add a special case to the dma_ops. > > I don't fully understand how p2pmem "solves" that by creating struct > > pages. The offset problem is one issue. But there's the iommu issue as > > well, the driver cannot just use the normal dma_map ops. > > We are not using a proper iommu and we are dealing with systems that > have zero offset. This case is also easily supported. I expect fixing > the iommus to not map these addresses would also be reasonably achievable. So you are designing something that is built from scratch to only work on a specific limited category of systems and is also incompatible with virtualization. This is an interesting experiement to look at I suppose, but if you ever want this upstream I would like at least for you to develop a strategy to support the wider case, if not an actual implementation. Cheers, Ben.
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Fri, 2017-04-14 at 14:04 -0500, Bjorn Helgaas wrote: > I'm a little hesitant about excluding offset support, so I'd like to > hear more about this. > > Is the issue related to PCI BARs that are not completely addressable > by the CPU? If so, that sounds like a first-class issue that should > be resolved up front because I don't think the PCI core in general > would deal well with that. > > If all the PCI memory of interest is in fact addressable by the CPU, > I would think it would be pretty straightforward to support offsets > -- > everywhere you currently use a PCI bus address, you just use the > corresponding CPU physical address instead. It's not *that* easy sadly. The reason is that normal dma map APIs assume the "target" of the DMAs are system memory, there is no way to pass it "another device", in a way that allows it to understand the offsets if needed. That said, dma_map will already be problematic when doing p2p behind the same bridge due to the fact that the iommu is not in the way so you can't use the arch standard ops there. So I assume the p2p code provides a way to address that too via special dma_ops ? Or wrappers ? Basically there are two very different ways you can do p2p, either behind the same host bridge or accross two host bridges: - Behind the same host bridge, you don't go through the iommu, which means that even if your target looks like a struct page, you can't just use dma_map_* on it because what you'll get back from that is an iommu token, not a sibling BAR offset. Additionally, if you use the target struct resource address, you will need to offset it to get back to the actual BAR value on the PCIe bus. - Behind different host bridges, then you go through the iommu and the host remapping. IE. that's actually the easy case. You can probably just use the normal iommu path and normal dma mapping ops, you just need to have your struct page representing the target device BAR *in CPU space* this time. So no offsetting required either. The problem is that the latter while seemingly easier, is also slower and not supported by all platforms and architectures (for example, POWER currently won't allow it, or rather only allows a store-only subset of it under special circumstances). So what people practically want to do is to have 2 devices behind a switch DMA'ing to/from each other. But that brings the 2 problems above. I don't fully understand how p2pmem "solves" that by creating struct pages. The offset problem is one issue. But there's the iommu issue as well, the driver cannot just use the normal dma_map ops. I haven't had a chance to look at the details of the patches but it's not clear from the description in patch 0 how that is solved. Cheers, Ben.
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Thu, 2017-04-13 at 22:40 -0600, Logan Gunthorpe wrote: > > On 13/04/17 10:16 PM, Jason Gunthorpe wrote: > > I'd suggest just detecting if there is any translation in bus > > addresses anywhere and just hard disabling P2P on such systems. > > That's a fantastic suggestion. It simplifies things significantly. > Unless there are any significant objections I think I will plan on > doing > that. I object. > > On modern hardware with 64 bit BARs there is very little reason to > > have translation, so I think this is a legacy feature. > > Yes, p2pmem users are likely to be designing systems around it (ie > JBOFs) and not trying to shoehorn it onto legacy architectures. > > At the very least, it makes sense to leave it out and if someone > comes > along who cares they can put in the effort to support the address > translation. > > Thanks, > > Logan
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Fri, 2017-04-14 at 21:37 +1000, Benjamin Herrenschmidt wrote: > On Thu, 2017-04-13 at 22:40 -0600, Logan Gunthorpe wrote: > > > > On 13/04/17 10:16 PM, Jason Gunthorpe wrote: > > > I'd suggest just detecting if there is any translation in bus > > > addresses anywhere and just hard disabling P2P on such systems. > > > > That's a fantastic suggestion. It simplifies things significantly. > > Unless there are any significant objections I think I will plan on > > doing > > that. > > I object. Note: It would also make your stuff fundamentally incompatible with KVM guest pass-through since KVM plays with remapping BARs all over the place. Ben. > > > On modern hardware with 64 bit BARs there is very little reason > > > to > > > have translation, so I think this is a legacy feature. > > > > Yes, p2pmem users are likely to be designing systems around it (ie > > JBOFs) and not trying to shoehorn it onto legacy architectures. > > > > At the very least, it makes sense to leave it out and if someone > > comes > > along who cares they can put in the effort to support the address > > translation. > > > > Thanks, > > > > Logan
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Thu, 2017-04-13 at 22:16 -0600, Jason Gunthorpe wrote: > > Any caller of pci_add_resource_offset() uses CPU addresses different from > > the PCI bus addresses (unless the offset is zero, of course). All ACPI > > platforms also support this translation (see "translation_offset"), though > > in most x86 systems the offset is zero. I'm aware of one x86 system that > > was tested with a non-zero offset but I don't think it was shipped that > > way. > > I'd suggest just detecting if there is any translation in bus > addresses anywhere and just hard disabling P2P on such systems. I object to designing a subsystem that by design cannot work on whole categories of architectures out there. > On modern hardware with 64 bit BARs there is very little reason to > have translation, so I think this is a legacy feature. No it's not. Cheers, Ben.
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Thu, 2017-04-13 at 15:22 -0600, Logan Gunthorpe wrote: > > On 12/04/17 03:55 PM, Benjamin Herrenschmidt wrote: > > Look at pcibios_resource_to_bus() and pcibios_bus_to_resource(). They > > will perform the conversion between the struct resource content (CPU > > physical address) and the actual PCI bus side address. > > Ah, thanks for the tip! On my system, this translation returns the same > address so it was not necessary. And, yes, that means this would have to > find its way into the dma mapping routine somehow. This means we'll > eventually need a way to look-up the p2pmem device from the struct page. > Which means we will likely need a new flag bit in the struct page or > something. The big difficulty I see is testing. Do you know what > architectures or in what circumstances are these translations used? I think a bunch of non-x86 architectures but I don't know which ones outside of powerpc. > > When behind the same switch you need to use PCI addresses. If one tries > > later to do P2P between host bridges (via the CPU fabric) things get > > more complex and one will have to use either CPU addresses or something > > else alltogether (probably would have to teach the arch DMA mapping > > routines to work with those struct pages you create and return the > > right thing). > > Probably for starters we'd want to explicitly deny cases between host > bridges and add that later if someone wants to do the testing. Cheers, Ben. > Thanks, > > Logan
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Wed, 2017-04-12 at 11:09 -0600, Logan Gunthorpe wrote: > > > Do you handle funky address translation too ? IE. the fact that the PCI > > addresses aren't the same as the CPU physical addresses for a BAR ? > > No, we use the CPU physical address of the BAR. If it's not mapped that > way we can't use it. Ok, you need to fix that or a bunch of architectures won't work. Look at pcibios_resource_to_bus() and pcibios_bus_to_resource(). They will perform the conversion between the struct resource content (CPU physical address) and the actual PCI bus side address. When behind the same switch you need to use PCI addresses. If one tries later to do P2P between host bridges (via the CPU fabric) things get more complex and one will have to use either CPU addresses or something else alltogether (probably would have to teach the arch DMA mapping routines to work with those struct pages you create and return the right thing). > > > This will mean many setups that could likely > > > work well will not be supported so that we can be more confident it > > > will work and not place any responsibility on the user to understand > > > their topology. (We've chosen to go this route based on feedback we > > > received at LSF). > > > > > > In order to enable this functionality we introduce a new p2pmem device > > > which can be instantiated by PCI drivers. The device will register some > > > PCI memory as ZONE_DEVICE and provide an genalloc based allocator for > > > users of these devices to get buffers. > > > > I don't completely understand this. This is actual memory on the PCI > > bus ? Where does it come from ? Or are you just trying to create struct > > pages that cover your PCIe DMA target ? > > Yes, the memory is on the PCI bus in a BAR. For now we have a special > PCI card for this, but in the future it would likely be the CMB in an > NVMe card. These patches create struct pages to map these BAR addresses > using ZONE_DEVICE. Ok. So ideally we'd want things like dma_map_* to be able to be fed those struct pages and do the right thing which is ... tricky, especially with the address translation I mentioned since the address will be different whether the initiator is on the same host bridge as the target or not. > > So correct me if I'm wrong, you are trying to create struct page's that > > map a PCIe BAR right ? I'm trying to understand how that interacts with > > what Jerome is doing for HMM. > > Yes, well we are using ZONE_DEVICE in the exact same way as the dax code > is. These patches use the existing API with no modifications. As I > understand it, HMM was using ZONE_DEVICE in a way that was quite > different to how it was originally designed. Sort-of. I don't see why there would be a conflict with the struct pages use though. Jerome can you chime in ? Jerome: It looks like they are just laying out struct page over a BAR which is the same thing I think you should do when the BAR is "large enough" for the GPU memory. The case where HMM uses "holes" in the address space for its struct page is somewhat orthogonal but I also see no conflict here. > > The reason is that the HMM currently creates the struct pages with > > "fake" PFNs pointing to a hole in the address space rather than > > covering the actual PCIe memory of the GPU. He does that to deal with > > the fact that some GPUs have a smaller aperture on PCIe than their > > total memory. > > I'm aware of what HMM is trying to do and although I'm not familiar with > the intimate details, I saw it as fairly orthogonal to what we are > attempting to do. Right. > > However, I have asked him to only apply that policy if the aperture is > > indeed smaller, and if not, create struct pages that directly cover the > > PCIe BAR of the GPU instead, which will work better on systems or > > architecture that don't have a "pinhole" window limitation. > > However he was under the impression that this was going to collide with > > what you guys are doing, so I'm trying to understand how. > > I'm not sure I understand how either. However, I suspect if you collide > with these patches then you'd also be breaking dax too. Possibly but as I said, I don't see why so I'll let Jerome chime in since he was under the impression that there was a conflict here :-) Cheers, Ben.
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Thu, 2017-03-30 at 16:12 -0600, Logan Gunthorpe wrote: > Hello, > > As discussed at LSF/MM we'd like to present our work to enable > copy offload support in NVMe fabrics RDMA targets. We'd appreciate > some review and feedback from the community on our direction. > This series is not intended to go upstream at this point. > > The concept here is to use memory that's exposed on a PCI BAR as > data buffers in the NVME target code such that data can be transferred > from an RDMA NIC to the special memory and then directly to an NVMe > device avoiding system memory entirely. The upside of this is better > QoS for applications running on the CPU utilizing memory and lower > PCI bandwidth required to the CPU (such that systems could be designed > with fewer lanes connected to the CPU). However, presently, the trade-off > is currently a reduction in overall throughput. (Largely due to hardware > issues that would certainly improve in the future). Another issue of course is that not all systems support P2P between host bridges :-) (Though almost all switches can enable it). > Due to these trade-offs we've designed the system to only enable using > the PCI memory in cases where the NIC, NVMe devices and memory are all > behind the same PCI switch. Ok. I suppose that's a reasonable starting point. Do I haven't looked at the patches in detail yet but it would be nice if that policy was in a well isolated component so it can potentially be affected by arch/platform code. Do you handle funky address translation too ? IE. the fact that the PCI addresses aren't the same as the CPU physical addresses for a BAR ? > This will mean many setups that could likely > work well will not be supported so that we can be more confident it > will work and not place any responsibility on the user to understand > their topology. (We've chosen to go this route based on feedback we > received at LSF). > > In order to enable this functionality we introduce a new p2pmem device > which can be instantiated by PCI drivers. The device will register some > PCI memory as ZONE_DEVICE and provide an genalloc based allocator for > users of these devices to get buffers. I don't completely understand this. This is actual memory on the PCI bus ? Where does it come from ? Or are you just trying to create struct pages that cover your PCIe DMA target ? > We give an example of enabling > p2p memory with the cxgb4 driver, however currently these devices have > some hardware issues that prevent their use so we will likely be > dropping this patch in the future. Ideally, we'd want to enable this > functionality with NVME CMB buffers, however we don't have any hardware > with this feature at this time. So correct me if I'm wrong, you are trying to create struct page's that map a PCIe BAR right ? I'm trying to understand how that interacts with what Jerome is doing for HMM. The reason is that the HMM currently creates the struct pages with "fake" PFNs pointing to a hole in the address space rather than covering the actual PCIe memory of the GPU. He does that to deal with the fact that some GPUs have a smaller aperture on PCIe than their total memory. However, I have asked him to only apply that policy if the aperture is indeed smaller, and if not, create struct pages that directly cover the PCIe BAR of the GPU instead, which will work better on systems or architecture that don't have a "pinhole" window limitation. However he was under the impression that this was going to collide with what you guys are doing, so I'm trying to understand how. > In nvmet-rdma, we attempt to get an appropriate p2pmem device at > queue creation time and if a suitable one is found we will use it for > all the (non-inlined) memory in the queue. An 'allow_p2pmem' configfs > attribute is also created which is required to be set before any p2pmem > is attempted. > > This patchset also includes a more controversial patch which provides an > interface for userspace to obtain p2pmem buffers through an mmap call on > a cdev. This enables userspace to fairly easily use p2pmem with RDMA and > O_DIRECT interfaces. However, the user would be entirely responsible for > knowing what their doing and inspecting sysfs to understand the pci > topology and only using it in sane situations. > > Thanks, > > Logan > > > Logan Gunthorpe (6): > Introduce Peer-to-Peer memory (p2pmem) device > nvmet: Use p2pmem in nvme target > scatterlist: Modify SG copy functions to support io memory. > nvmet: Be careful about using iomem accesses when dealing with p2pmem > p2pmem: Support device removal > p2pmem: Added char device user interface > > Steve Wise (2): > cxgb4: setup pcie memory window 4 and create p2pmem region > p2pmem: Add debugfs "stats" file > > drivers/memory/Kconfig | 5 + > drivers/memory/Makefile | 2 + > drivers/memory/p2pmem.c | 697 > >
Re: [PATCH] scsi: lpfc: Add shutdown method for kexec
On Mon, 2017-03-06 at 22:46 -0500, Martin K. Petersen wrote: > > > > > > "Mauricio" == Mauricio Faria de Oliveira> > > > > et.ibm.com> writes: > > Mauricio> On 02/12/2017 07:49 PM, Anton Blanchard wrote: > > > We see lpfc devices regularly fail during kexec. Fix this by > > > adding a > > > shutdown method which mirrors the remove method. > > Mauricio> @mkp, @jejb: may you please flag this patch for > stable? Thank > Mauricio> you. > > I don't recall a consensus being reached on this patch. What would be the opposition ? Without it kexec breaks. With it, it works ... Now we all seem to agree that kexec should be overhauled to not use shutdown but instead unbind drivers, but that's a more long term project that nobody yet had a chance to tackle. In the meantime, some systems need a functioning kexec to boot. Cheers, Ben.
Re: [PATCH] scsi: lpfc: Add shutdown method for kexec
On Tue, 2017-02-14 at 15:45 +1300, Eric W. Biederman wrote: > The only difference ever that should exist between shutdown and remove > is do you clean up kernel data structures. The shutdown method is > allowed to skip the cleanup up kernel data structures that the remove > method needs to make. > > Assuming the kernel does not have corrupted data structures I don't see > any practical reason for distinguishing between the two. There may be > some real world gotchas we have to deal with but semantically I don't > see any concerns. We had historical additions to shutdown actually shutting things down, like spinning platters down to park disk heads, switching devices into D3 on some systems, etc... that remove never had (and we don't want for kexec). Cheers, Ben.
Re: [PATCH] scsi: lpfc: Add shutdown method for kexec
On Mon, 2017-02-13 at 15:57 -0600, Brian King wrote: > If we do transition to use remove rather than shutdown, I think we > want > some way for a device driver to know whether we are doing kexec or > not. > A RAID adapter with a write cache is going to want to flush its write > cache on a PCI hotplug remove, but for a kexec, its going to want to > skip > that so the kexec is faster. Today, since kexec looks like a reboot, > rather than a shutdown, we can skip the flush on a reboot, since its > technically not needed there either. What happens if a non-flushed adapter gets a PERST ? I wouldn't trust that 'don't have to flush' magic ... Cheers, Ben.
Re: [PATCH] scsi: lpfc: Add shutdown method for kexec
On Mon, 2017-02-13 at 13:21 +1300, Eric W. Biederman wrote: > > Good point, at the very least we should call remove if shutdown doesn't > > exist. Eric: could we make the changes Ben suggests? > > Definitely. That was the original design of the kexec interface > but people were worried about calling remove during reboot might > introduce regressions on the reboot functionality. So shutdown was > added to be remove without the cleaning up the kernel data structures. Right. Part of the problem was also that remove was dependent on CONFIG_HOTPLUG though that's no longer the case anymore. The problem is that a bunch of drivers either don't have a shutdown or worse, have one that actually shuts the HW down rather than "idle" it which puts it in a state that the new kernel can't deal with. > I am all for changing the core to call remove. That seems to be a more > tested code path (because remove tends to be part of the development > path for modules) and thus much more likely to work in practice. > > The challenge with changes in this area is that when the infrastructure > is changed for everyone someone needs to baby sit it until all of the > unexpected issues are resolved. I was hoping a couple of years ago that > Ben could be that person. Correct. And I lack time. But since we are a platform that uses kexec daily as our main booting mechanism we should probably be the ones driving that. I had a patch at least to fallback to remove in absence of shutdown which I can try to dig out. We can add a config option to make it do the other way around that we should start testing internally. I'm sure we will find 1 or 2 regressions as drivers do weird things. Cheers, Ben.
Re: [PATCH] scsi: lpfc: Add shutdown method for kexec
On Mon, 2017-02-13 at 08:49 +1100, Anton Blanchard wrote: > From: Anton Blanchard> > We see lpfc devices regularly fail during kexec. Fix this by adding > a shutdown method which mirrors the remove method. Or instead finally do what I've been advocating for years (and even sent patches for) which is to have kexec call remove instead of shutdown. Shutdown is and has *always* been the wrong thing to do. > Signed-off-by: Anton Blanchard > --- > drivers/scsi/lpfc/lpfc_init.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/scsi/lpfc/lpfc_init.c > b/drivers/scsi/lpfc/lpfc_init.c > index 4776fd8..10f75ad 100644 > --- a/drivers/scsi/lpfc/lpfc_init.c > +++ b/drivers/scsi/lpfc/lpfc_init.c > @@ -11447,6 +11447,7 @@ static struct pci_driver lpfc_driver = { > .id_table = lpfc_id_table, > .probe = lpfc_pci_probe_one, > .remove = lpfc_pci_remove_one, > + .shutdown = lpfc_pci_remove_one, > .suspend= lpfc_pci_suspend_one, > .resume = lpfc_pci_resume_one, > .err_handler= _err_handler,
Re: [PATCH] ibmvscsi: add write memory barrier to CRQ processing
On Wed, 2016-12-07 at 17:31 -0600, Tyrel Datwyler wrote: > The first byte of each CRQ entry is used to indicate whether an entry is > a valid response or free for the VIOS to use. After processing a > response the driver sets the valid byte to zero to indicate the entry is > now free to be reused. Add a memory barrier after this write to ensure > no other stores are reordered when updating the valid byte. Which "other stores" specifically ? This smells fishy without that precision. It's important to always understand what exactly barriers order with. Cheers, Ben. > Signed-off-by: Tyrel Datwyler> --- > drivers/scsi/ibmvscsi/ibmvscsi.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c > b/drivers/scsi/ibmvscsi/ibmvscsi.c > index d9534ee..2f5b07e 100644 > --- a/drivers/scsi/ibmvscsi/ibmvscsi.c > +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c > @@ -232,6 +232,7 @@ static void ibmvscsi_task(void *data) > > while ((crq = crq_queue_next_crq(>queue)) != NULL) { > > ibmvscsi_handle_crq(crq, hostdata); > > crq->valid = VIOSRP_CRQ_FREE; > > + wmb(); > > } > > > vio_enable_interrupts(vdev); > @@ -240,6 +241,7 @@ static void ibmvscsi_task(void *data) > > vio_disable_interrupts(vdev); > > ibmvscsi_handle_crq(crq, hostdata); > > crq->valid = VIOSRP_CRQ_FREE; > > + wmb(); > > } else { > > done = 1; > > } -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] scsi/ipr: Fix runaway IRQs when falling back from MSI to LSI
LSIs must be ack'ed with an MMIO otherwise they remain asserted forever. This is controlled by the "clear_isr" flag. While we set that flag properly when deciding initially whether to use LSIs or MSIs, we fail to set it if we first chose MSIs, the test fails, then fallback to LSIs. Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org> --- drivers/scsi/ipr.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/scsi/ipr.c b/drivers/scsi/ipr.c index 5324741..5dd3194 100644 --- a/drivers/scsi/ipr.c +++ b/drivers/scsi/ipr.c @@ -10213,6 +10213,7 @@ static int ipr_probe_ioa(struct pci_dev *pdev, } ioa_cfg->intr_flag = IPR_USE_LSI; + ioa_cfg->clear_isr = 1; ioa_cfg->nvectors = 1; } else if (rc) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 2/3] cxlflash: Superpipe support
On Mon, 2015-08-10 at 12:09 -0500, Matthew R. Ochs wrote: Add superpipe supporting infrastructure to device driver for the IBM CXL Flash adapter. This patch allows userspace applications to take advantage of the accelerated I/O features that this adapter provides and bypass the traditional filesystem stack. So in a similar vein to the previous review, I am missing a lot of context here but a few things did spring to me eyes: +/** + * lookup_local() - find a local LUN information structure by WWID + * @cfg: Internal structure associated with the host. + * @wwid:WWID associated with LUN. + * + * Return: Found local lun_info structure on success, NULL on failure + */ +static struct llun_info *lookup_local(struct cxlflash_cfg *cfg, u8 *wwid) +{ + struct llun_info *lli, *temp; + ulong lock_flags; + + spin_lock_irqsave(cfg-slock, lock_flags); + + list_for_each_entry_safe(lli, temp, cfg-lluns, list) + if (!memcmp(lli-wwid, wwid, DK_CXLFLASH_MANAGE_LUN_WWID_LEN)) { + lli-newly_created = false; This is weird ... a lookup effectively changes the state of the object looked up... what for ? There is something oddball here. It might be legit but in that case, you should really add a comment explaining the logic around that 'newly_created' field. Also you drop the lock right below but you have no refcounting, are these objects ever disposed of ? In general, can you provide a clearer explanation of what are global vs local LUNs ? + spin_unlock_irqrestore(cfg-slock, lock_flags); + return lli; + } + + spin_unlock_irqrestore(cfg-slock, lock_flags); + return NULL; +} + +/** + * lookup_global() - find a global LUN information structure by WWID + * @wwid:WWID associated with LUN. + * + * Return: Found global lun_info structure on success, NULL on failure + */ +static struct glun_info *lookup_global(u8 *wwid) +{ + struct glun_info *gli, *temp; + ulong lock_flags; + + spin_lock_irqsave(global.slock, lock_flags); + + list_for_each_entry_safe(gli, temp, global.gluns, list) + if (!memcmp(gli-wwid, wwid, DK_CXLFLASH_MANAGE_LUN_WWID_LEN)) { + spin_unlock_irqrestore(global.slock, lock_flags); + return gli; + } + + spin_unlock_irqrestore(global.slock, lock_flags); + return NULL; +} Same ... +/** + * lookup_lun() - find or create a local LUN information structure + * @sdev:SCSI device associated with LUN. + * @wwid:WWID associated with LUN. + * + * When a local LUN is not found and a global LUN is also not found, both + * a global LUN and local LUN are created. The global LUN is added to the + * global list and the local LUN is returned. Make the function name more explicit: find_or_create_lun() for example. I very much dislike a function called lookup that has side effects. + * Return: Found/Allocated local lun_info structure on success, NULL on failure + */ +static struct llun_info *lookup_lun(struct scsi_device *sdev, u8 *wwid) +{ + struct llun_info *lli = NULL; + struct glun_info *gli = NULL; + struct Scsi_Host *shost = sdev-host; + struct cxlflash_cfg *cfg = shost_priv(shost); + ulong lock_flags; + + if (unlikely(!wwid)) + goto out; + + lli = lookup_local(cfg, wwid); + if (lli) + goto out; + + lli = create_local(sdev, wwid); + if (unlikely(!lli)) + goto out; Similar question to earlier, you have no refcounting, should I assume these things never get removed ? + gli = lookup_global(wwid); + if (gli) { + lli-parent = gli; + spin_lock_irqsave(cfg-slock, lock_flags); + list_add(lli-list, cfg-lluns); + spin_unlock_irqrestore(cfg-slock, lock_flags); + goto out; + } + + gli = create_global(sdev, wwid); + if (unlikely(!gli)) { + kfree(lli); + lli = NULL; + goto out; + } + + lli-parent = gli; + spin_lock_irqsave(cfg-slock, lock_flags); + list_add(lli-list, cfg-lluns); + spin_unlock_irqrestore(cfg-slock, lock_flags); + + spin_lock_irqsave(global.slock, lock_flags); + list_add(gli-list, global.gluns); + spin_unlock_irqrestore(global.slock, lock_flags); Your locks are extremely fine grained... too much ? Any reason why you don't have a simple/single lock handling all these ? IE, do you expect frequent accesses ? Also, a function called lookup_something that has the side effect of adding that something to two lists doesn't look great to me. You may want to review the function naming a bit. Finally, what happens if two processes call this trying to effectively create the same global LUN simultaneously ? IE, can't you have a case where both lookup fail, then they both hit the create_global()
Re: [PATCH v4 1/3] cxlflash: Base error recovery support
On Mon, 2015-08-10 at 12:09 -0500, Matthew R. Ochs wrote: Introduce support for enhanced I/O error handling. Signed-off-by: Matthew R. Ochs mro...@linux.vnet.ibm.com Signed-off-by: Manoj N. Kumar ma...@linux.vnet.ibm.com --- So I'm not necessarily very qualified to review SCSI bits as I haven't done anything close to the Linux SCSI code for almost a decade but I have a couple of questions and nits... wait_queue_head_t tmf_waitq; bool tmf_active; - u8 err_recovery_active:1; + wait_queue_head_t limbo_waitq; + enum cxlflash_state state; }; struct afu_cmd { diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c index 76a7286..18359d4 100644 --- a/drivers/scsi/cxlflash/main.c +++ b/drivers/scsi/cxlflash/main.c @@ -380,6 +380,18 @@ static int cxlflash_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scp) } spin_unlock_irqrestore(cfg-tmf_waitq.lock, lock_flags); + switch (cfg-state) { + case STATE_LIMBO: + pr_debug_ratelimited(%s: device in limbo!\n, __func__); + rc = SCSI_MLQUEUE_HOST_BUSY; + goto out; + case STATE_FAILTERM: + pr_debug_ratelimited(%s: device has failed!\n, __func__); + goto error; + default: + break; + } I noticed that you mostly read and write that new state outside of your tmf_waitq.lock. Is there any other lock or mutex protecting you ? In this specific case ? I know in the old day queuecommand was called with a host lock, is that still the case ? Or you just don't care about an occasional spurrious SCSI_MLQUEUE_HOST_BUSY ? cmd = cxlflash_cmd_checkout(afu); if (unlikely(!cmd)) { pr_err(%s: could not get a free command\n, __func__); @@ -428,6 +440,10 @@ static int cxlflash_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scp) out: return rc; +error: + scp-result = (DID_NO_CONNECT 16); + scp-scsi_done(scp); + return 0; } /** @@ -455,9 +471,21 @@ static int cxlflash_eh_device_reset_handler(struct scsi_cmnd *scp) get_unaligned_be32(((u32 *)scp-cmnd)[2]), get_unaligned_be32(((u32 *)scp-cmnd)[3])); - rcr = send_tmf(afu, scp, TMF_LUN_RESET); - if (unlikely(rcr)) + switch (cfg-state) { + case STATE_NORMAL: + rcr = send_tmf(afu, scp, TMF_LUN_RESET); + if (unlikely(rcr)) + rc = FAILED; + break; + case STATE_LIMBO: + wait_event(cfg-limbo_waitq, cfg-state != STATE_LIMBO); + if (cfg-state == STATE_NORMAL) + break; + /* fall through */ + default: rc = FAILED; + break; + } Same here, since you are doing wait_event, I assume no lock is held (unless it's a mutex) and in subsequent places I am not listing. As I said, it could well be that it all works fine but it would be worth mentioning in this case because it's not obvious from simply reading the code. Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Concerns about mpt2sas: Added Reply Descriptor Post Queue (RDPQ) Array support
On Thu, 2015-02-19 at 21:45 -0800, James Bottomley wrote: Ben, this is legal by design. It was specifically designed for the aic79xx SCSI card, but can be used for a variety of other reasons. The aic79xx hardware problem was that the DMA engine could address the whole of memory (it had two address modes, a 39 bit one and a 64 bit one) but the script engine that runs the mailboxes only had a 32 bit activation register (the activating write points at the physical address of the script to begin executing). This meant that the scripts that run in memory had to be in the first 4GB of physical memory, hence the split mask. The DMA mask specifies that the card can transfer from anywhere in physical memory, but the consistent_dma_mask says that the consistent allocation used to get scripts memory must come from the lower 4GB. So looking at that again... This is interesting ... basically any driver using a different mask has been broken on powerpc for basically ever. The whole concept was poorly designed, for example, the set_consistent_mask isn't a dma_map_ops unlike everything else. In some cases, what we want is convey a base+offset information to drivers but we can't do that. This stuff cannot work with setups like a lot of our iommus where we have a remapped region at the bottom of the DMA address space and a bypass (direct map) region high up. Basically, we can fix it, at least for most platforms, but it will be hard, invasive, *and* will need to go to stable. Grmbl. We'll have to replace our direct DMA ops (which just apply an offset) which we use for devices that set a 64-bit mask on platform that support a bypass window, with some smart-ass hybrid variant that selectively shoot stuff up to the bypass window or down via the iommu remapped based on the applicable mask for a given operation. It would be nice if we could come up with a way to inform the driver that we support that sort of bypass region with an offset. That would allow drivers that have that 64-bit base + 32-bit offset scheme to work much more efficiently for us. The driver could configure the base to be our bypass window offset, and we could use ZONE_DMA32 for consistent DMAs. It would also help us with things like some GPUs that can only do 40-bit DMA (which won't allow them to reach our bypass region normally) but do have a way to configure the generated top bits of all DMA addresses in some fixed register. Any idea what such an interface might look like ? Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Concerns about mpt2sas: Added Reply Descriptor Post Queue (RDPQ) Array support
On Fri, 2015-02-20 at 16:22 +1100, Benjamin Herrenschmidt wrote: Looking a bit more closely, you basically do - set_dma_mask(64-bit) - set_consistent_dma_mask(32-bit) Now, I don't know how x86 will react to the conflicting masks, but on ppc64, I'm pretty sure the second one will barf. IE, the first one will establish a set of direct mapping ops which give you a bypass of the iommu to all of memory. The second one will then do a dma_supported(mask) call which will hit the direct ops, and they will fail since a 32-bit mask cannot address the bypass completely. Are architectures really required to support such mismatching dma_mask and consistent_dma_mask ? what a bloody trainwreck ... :-( Oh well, looks like x86 supports it and it won't be too hard to support it on ppc64 as well. We even had some code along those lines for FSL platforms with an ifdef due to the amount of drivers that used to fail setting the consistent mask properly but that seems to be mostly fixed nowadays. Ben. -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Concerns about mpt2sas: Added Reply Descriptor Post Queue (RDPQ) Array support
On Thu, 2015-02-19 at 21:45 -0800, James Bottomley wrote: Ben, this is legal by design. It was specifically designed for the aic79xx SCSI card, but can be used for a variety of other reasons. The aic79xx hardware problem was that the DMA engine could address the whole of memory (it had two address modes, a 39 bit one and a 64 bit one) but the script engine that runs the mailboxes only had a 32 bit activation register (the activating write points at the physical address of the script to begin executing). This meant that the scripts that run in memory had to be in the first 4GB of physical memory, hence the split mask. The DMA mask specifies that the card can transfer from anywhere in physical memory, but the consistent_dma_mask says that the consistent allocation used to get scripts memory must come from the lower 4GB. Right, ok, it looks like it's easy enough to support with ZONE_DMA32, I'm testing patches to create it unconditionally on ppc64 (it used to depend on us using swiotlb on embedded platforms) and I'll shoot that upstream if it passes. Ben. -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Concerns about mpt2sas: Added Reply Descriptor Post Queue (RDPQ) Array support
Hi Sreekanth ! While looking at some (unrelated) issue where mtp2sas seems to be using 32-bit DMA instead of 64-bit DMA on some POWER platforms, I noticed this patch which was merged as 5fb1bf8aaa832e1e9ca3198de7bbecb8eff7db9c. Can you confirm my understanding that you are: - Setting the DMA mask to 32-bit - Mapping pages for DMA - Changing the DMA mask to 64-bit ? If yes, then I don't think this is a legal thing to do and definitely not something supported by all architectures. It might work by accident, but there is no telling that any translation/DMA mapping provided before a call to set_dma_mask() is still valid after that call. The architecture might have to completely reconfigure the iommu, for example on some PowerPC platforms, we switch from a remapped mapping to a direct linear map of all memory, all translations established before the switch might be lost (it depends on the specific implementation). How does it work on x86 with DMAR ? Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Concerns about mpt2sas: Added Reply Descriptor Post Queue (RDPQ) Array support
On Fri, 2015-02-20 at 16:06 +1100, Benjamin Herrenschmidt wrote: Note that even on powerpc platforms where it would work because we maintain both 32-bit and 64-bit bypass windows in the device address space simultaneously, you will leak iommu entries unless you also switch back to 32-bit when freeing the 32-bit mappings... (and you would probably crash if you tried to free a 64-bit mapping while in 32-bit mode). The iommu APIs weren't designed with that switching mask facility in mind... Looking a bit more closely, you basically do - set_dma_mask(64-bit) - set_consistent_dma_mask(32-bit) Now, I don't know how x86 will react to the conflicting masks, but on ppc64, I'm pretty sure the second one will barf. IE, the first one will establish a set of direct mapping ops which give you a bypass of the iommu to all of memory. The second one will then do a dma_supported(mask) call which will hit the direct ops, and they will fail since a 32-bit mask cannot address the bypass completely. Are architectures really required to support such mismatching dma_mask and consistent_dma_mask ? what a bloody trainwreck ... :-( Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RESEND][PATCH 1/2] lib/scatterlist: Make ARCH_HAS_SG_CHAIN an actual Kconfig
On Sun, 2014-03-23 at 00:03 -0700, Christoph Hellwig wrote: On Sun, Mar 23, 2014 at 02:04:46PM +1100, Benjamin Herrenschmidt wrote: diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 1594945..8122294 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -82,6 +82,7 @@ config ARM http://www.arm.linux.org.uk/. config ARM_HAS_SG_CHAIN + select ARCH_HAS_SG_CHAIN bool Heh, a self-selecting config option... I didn't know that trick ! ARM vs ARCH. However the arm variant of the variable should probably be consolidated into the ARCH one as a follow up. Oh right... sorry for the noise. Ben. -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RESEND][PATCH 1/2] lib/scatterlist: Make ARCH_HAS_SG_CHAIN an actual Kconfig
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 1594945..8122294 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -82,6 +82,7 @@ config ARM http://www.arm.linux.org.uk/. config ARM_HAS_SG_CHAIN + select ARCH_HAS_SG_CHAIN bool Heh, a self-selecting config option... I didn't know that trick ! Ben. -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RESEND][PATCH 1/2] lib/scatterlist: Make ARCH_HAS_SG_CHAIN an actual Kconfig
On Sat, 2014-03-22 at 11:13 -0700, Laura Abbott wrote: Rather than have architectures #define ARCH_HAS_SG_CHAIN in an architecture specific scatterlist.h, make it a proper Kconfig option and use that instead. At same time, remove the header files are are now mostly useless and just include asm-generic/scatterlist.h. Cc: Russell King li...@arm.linux.org.uk Cc: Tony Luck tony.l...@intel.com Cc: Fenghua Yu fenghua...@intel.com For powerpc Acked-by: Benjamin Herrenschmidt b...@kernel.crashing.org Cc: Paul Mackerras pau...@samba.org Cc: Ingo Molnar mi...@redhat.com Cc: H. Peter Anvin h...@zytor.com Cc: James E.J. Bottomley jbottom...@parallels.com Cc: Fenghua Yu fenghua...@intel.com Cc: Tony Luck tony.l...@intel.com Cc: Benjamin Herrenschmidt b...@kernel.crashing.org Cc: Paul Mackerras pau...@samba.org Cc: Martin Schwidefsky schwidef...@de.ibm.com Cc: Heiko Carstens heiko.carst...@de.ibm.com Cc: Andrew Morton a...@linux-foundation.org Signed-off-by: Laura Abbott lau...@codeaurora.org -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
On Tue, 2013-10-08 at 20:55 -0700, H. Peter Anvin wrote: Why not add a minimum number to pci_enable_msix(), i.e.: pci_enable_msix(pdev, msix_entries, nvec, minvec) ... which means nvec is the number of interrupts *requested*, and minvec is the minimum acceptable number (otherwise fail). Which is exactly what Ben (the other Ben :-) suggested and that I supports... Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
On Mon, 2013-10-07 at 14:01 -0400, Tejun Heo wrote: I don't think the same race condition would happen with the loop. The problem case is where multiple msi(x) allocation fails completely because the global limit went down before inquiry and allocation. In the loop based interface, it'd retry with the lower number. As long as the number of drivers which need this sort of adaptive allocation isn't too high and the common cases can be made simple, I don't think the complex part of interface is all that important. Maybe we can have reserve / cancel type interface or just keep the loop with more explicit function names (ie. try_enable or something like that). I'm thinking a better API overall might just have been to request individual MSI-X one by one :-) We want to be able to request an MSI-X at runtime anyway ... if I want to dynamically add a queue to my network interface, I want it to be able to pop a new arbitrary MSI-X. And we don't want to lock drivers into contiguous MSI-X sets either. And for the cleanup ... well that's what the pcim functions are for, we can just make MSI-X variants. Ben. -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
On Sun, 2013-10-06 at 08:02 +0200, Alexander Gordeev wrote: On Sun, Oct 06, 2013 at 08:46:26AM +1100, Benjamin Herrenschmidt wrote: On Sat, 2013-10-05 at 16:20 +0200, Alexander Gordeev wrote: So my point is - drivers should first obtain a number of MSIs they *can* get, then *derive* a number of MSIs the device is fine with and only then request that number. Not terribly different from memory or any other type of resource allocation ;) What if the limit is for a group of devices ? Your interface is racy in that case, another driver could have eaten into the limit in between the calls. Well, the another driver has had a better karma ;) But seriously, the current scheme with a loop is not race-safe wrt to any other type of resource which might exhaust. What makes the quota so special so we should care about it and should not care i.e. about lack of msi_desc's? I'm not saying the current scheme is better but I prefer the option of passing a min,max to the request function. Yeah, I know the quota might hit more likely. But why it is not addressed right now then? Not a single function in chains... rtas_msi_check_device() - msi_quota_for_device() - traverse_pci_devices() rtas_setup_msi_irqs() - msi_quota_for_device() - traverse_pci_devices() ...is race-safe. So if it has not been bothering anyone until now then no reason to start worrying now :) In fact, in the current design to address the quota race decently the drivers would have to protect the *loop* to prevent the quota change between a pci_enable_msix() returned a positive number and the the next call to pci_enable_msix() with that number. Is it doable? I am not advocating for the current design, simply saying that your proposal doesn't address this issue while Ben's does. Cheers, Ben. Ben. -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
On Sat, 2013-10-05 at 16:20 +0200, Alexander Gordeev wrote: So my point is - drivers should first obtain a number of MSIs they *can* get, then *derive* a number of MSIs the device is fine with and only then request that number. Not terribly different from memory or any other type of resource allocation ;) What if the limit is for a group of devices ? Your interface is racy in that case, another driver could have eaten into the limit in between the calls. Ben. -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2, part 1 3/9] PCI: Convert alloc_pci_dev(void) to pci_alloc_dev(bus) instead
On Wed, 2013-05-15 at 22:46 +0800, Liu Jiang wrote: I don't know any OF exports, could you please help to CC some OF experts? I wrote that code I think. Sorry, I've missed the beginning of the thread, what is the problem ? Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi/ibmvscsi: add module alias for ibmvscsic
On Mon, 2012-07-30 at 21:06 +0200, Olaf Hering wrote: So while this would work, I do wonder however whether we could instead fix it by simplifying the whole thing as follow since iSeries is now gone and so we don't need split backends anymore: scsi/ibmvscsi: Remove backend abstraction I cant that these things myself anymore. Brian, can somebody from your side own these ? Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi/ibmvscsi: add module alias for ibmvscsic
On Wed, 2012-07-18 at 18:49 +0200, o...@aepfle.de wrote: From: Olaf Hering o...@aepfle.de The driver is named ibmvscsic, at runtime it its name is advertised as ibmvscsi. For this reason mkinitrd wont pickup the driver properly. Reported by IBM during SLES11 beta testing: https://bugzilla.novell.com/show_bug.cgi?id=459933 LTC50724 So while this would work, I do wonder however whether we could instead fix it by simplifying the whole thing as follow since iSeries is now gone and so we don't need split backends anymore: scsi/ibmvscsi: Remove backend abstraction Now that the iSeries code is gone the backend abstraction in this driver is no longer necessary, which allows us to consolidate the driver in one file. The side effect is that the module name is now ibmvscsi.ko which matches the driver hotplug name and fixes auto-load issues. Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org --- --- drivers/scsi/ibmvscsi/Makefile|6 +- drivers/scsi/ibmvscsi/ibmvscsi.c | 348 +-- drivers/scsi/ibmvscsi/ibmvscsi.h | 22 --- drivers/scsi/ibmvscsi/rpa_vscsi.c | 368 - 4 files changed, 330 insertions(+), 414 deletions(-) delete mode 100644 drivers/scsi/ibmvscsi/rpa_vscsi.c diff --git a/drivers/scsi/ibmvscsi/Makefile b/drivers/scsi/ibmvscsi/Makefile index ff5b5c5..cb150d1 100644 --- a/drivers/scsi/ibmvscsi/Makefile +++ b/drivers/scsi/ibmvscsi/Makefile @@ -1,7 +1,3 @@ -obj-$(CONFIG_SCSI_IBMVSCSI)+= ibmvscsic.o - -ibmvscsic-y+= ibmvscsi.o -ibmvscsic-$(CONFIG_PPC_PSERIES)+= rpa_vscsi.o - +obj-$(CONFIG_SCSI_IBMVSCSI)+= ibmvscsi.o obj-$(CONFIG_SCSI_IBMVSCSIS) += ibmvstgt.o obj-$(CONFIG_SCSI_IBMVFC) += ibmvfc.o diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c index 3a6c474..d2bd2c0 100644 --- a/drivers/scsi/ibmvscsi/ibmvscsi.c +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c @@ -93,13 +93,13 @@ static int max_requests = IBMVSCSI_MAX_REQUESTS_DEFAULT; static int max_events = IBMVSCSI_MAX_REQUESTS_DEFAULT + 2; static int fast_fail = 1; static int client_reserve = 1; +static char partition_name[97] = UNKNOWN; +static unsigned int partition_number = -1; static struct scsi_transport_template *ibmvscsi_transport_template; #define IBMVSCSI_VERSION 1.5.9 -static struct ibmvscsi_ops *ibmvscsi_ops; - MODULE_DESCRIPTION(IBM Virtual SCSI); MODULE_AUTHOR(Dave Boutcher); MODULE_LICENSE(GPL); @@ -118,6 +118,315 @@ MODULE_PARM_DESC(fast_fail, Enable fast fail. [Default=1]); module_param_named(client_reserve, client_reserve, int, S_IRUGO ); MODULE_PARM_DESC(client_reserve, Attempt client managed reserve/release); +static void ibmvscsi_handle_crq(struct viosrp_crq *crq, + struct ibmvscsi_host_data *hostdata); + +/* + * Routines for managing the command/response queue + */ +/** + * ibmvscsi_handle_event: - Interrupt handler for crq events + * @irq: number of irq to handle, not used + * @dev_instance: ibmvscsi_host_data of host that received interrupt + * + * Disables interrupts and schedules srp_task + * Always returns IRQ_HANDLED + */ +static irqreturn_t ibmvscsi_handle_event(int irq, void *dev_instance) +{ + struct ibmvscsi_host_data *hostdata = + (struct ibmvscsi_host_data *)dev_instance; + vio_disable_interrupts(to_vio_dev(hostdata-dev)); + tasklet_schedule(hostdata-srp_task); + return IRQ_HANDLED; +} + +/** + * release_crq_queue: - Deallocates data and unregisters CRQ + * @queue: crq_queue to initialize and register + * @host_data: ibmvscsi_host_data of host + * + * Frees irq, deallocates a page for messages, unmaps dma, and unregisters + * the crq with the hypervisor. + */ +static void ibmvscsi_release_crq_queue(struct crq_queue *queue, + struct ibmvscsi_host_data *hostdata, + int max_requests) +{ + long rc = 0; + struct vio_dev *vdev = to_vio_dev(hostdata-dev); + free_irq(vdev-irq, (void *)hostdata); + tasklet_kill(hostdata-srp_task); + do { + if (rc) + msleep(100); + rc = plpar_hcall_norets(H_FREE_CRQ, vdev-unit_address); + } while ((rc == H_BUSY) || (H_IS_LONG_BUSY(rc))); + dma_unmap_single(hostdata-dev, +queue-msg_token, +queue-size * sizeof(*queue-msgs), DMA_BIDIRECTIONAL); + free_page((unsigned long)queue-msgs); +} + +/** + * crq_queue_next_crq: - Returns the next entry in message queue + * @queue: crq_queue to use + * + * Returns pointer to next entry in queue, or NULL if there are no new + * entried in the CRQ. + */ +static struct viosrp_crq *crq_queue_next_crq(struct crq_queue *queue) +{ + struct viosrp_crq *crq; + unsigned long flags; + + spin_lock_irqsave(queue-lock
Re: [PATCH] scsi/ibmvscsi: /sys/class/scsi_host/hostX/config doesn't show any information
n Wed, 2012-07-18 at 18:49 +0200, o...@aepfle.de wrote: From: Linda Xie lx...@us.ibm.com Expected result: It should show something like this: x1521p4:~ # cat /sys/class/scsi_host/host1/config PARTITIONNAME='x1521p4' NWSDNAME='X1521P4' HOSTNAME='X1521P4' DOMAINNAME='RCHLAND.IBM.COM' NAMESERVERS='9.10.244.100 9.10.244.200' Actual result: x1521p4:~ # cat /sys/class/scsi_host/host0/config x1521p4:~ # This patch changes the size of the buffer used for transfering config data to 4K. It was tested against 2.6.19-rc2 tree. Reported by IBM during SLES11 beta testing: So this patch just seems to blindly replace all occurrences of PAGE_SIZE with HOST_PAGE_SIZE which is utterly wrong. Only one of those needs to be changed, the one passed to ibmvscsi_do_host_config() which is what's visible to the server, all the rest is just sysfs attributes and should remain as-is. Additionally (not even mentioning that there is no explanation as to what the real problem is anywhere in the changeset) I don't like the fix. The root of the problem is that the MAD header has a 16-bit length field, so writing 0x1 (64K PAGE_SIZE) into it doesn't quite work. So in addition to a better comment, I would suggest a fix more like this: scsi/ibmvscsi: Fix host config length field overflow The length field in the host config packet is only 16-bit long, so passing it 0x1 (64K which is our standard PAGE_SIZE) doesn't work and result in an empty config from the server. Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org CC: sta...@vger.kernel.org --- diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c index 3a6c474..337e8b3 100644 --- a/drivers/scsi/ibmvscsi/ibmvscsi.c +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c @@ -1541,6 +1541,9 @@ static int ibmvscsi_do_host_config(struct ibmvscsi_host_data *hostdata, host_config = evt_struct-iu.mad.host_config; + /* The transport length field is only 16-bit */ + length = min(0x, length); + /* Set up a lun reset SRP command */ memset(host_config, 0x00, sizeof(*host_config)); host_config-common.type = VIOSRP_HOST_CONFIG_TYPE; -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi/ibmvscsi: /sys/class/scsi_host/hostX/config doesn't show any information
On Wed, 2012-07-18 at 18:49 +0200, o...@aepfle.de wrote: From: Linda Xie lx...@us.ibm.com James, can I assume you're picking up those two ? Cheers, Ben. Expected result: It should show something like this: x1521p4:~ # cat /sys/class/scsi_host/host1/config PARTITIONNAME='x1521p4' NWSDNAME='X1521P4' HOSTNAME='X1521P4' DOMAINNAME='RCHLAND.IBM.COM' NAMESERVERS='9.10.244.100 9.10.244.200' Actual result: x1521p4:~ # cat /sys/class/scsi_host/host0/config x1521p4:~ # This patch changes the size of the buffer used for transfering config data to 4K. It was tested against 2.6.19-rc2 tree. Reported by IBM during SLES11 beta testing: https://bugzilla.novell.com/show_bug.cgi?id=439970 LTC49349 Signed-off-by: Olaf Hering o...@aepfle.de diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c index e580aa4..1513ca8 100644 --- a/drivers/scsi/ibmvscsi/ibmvscsi.c +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c @@ -93,6 +93,8 @@ static int max_requests = IBMVSCSI_MAX_REQUESTS_DEFAULT; static int max_events = IBMVSCSI_MAX_REQUESTS_DEFAULT + 2; static int fast_fail = 1; static int client_reserve = 1; +/* host data buffer size */ +#define HOST_BUFFER_SIZE 4096 static struct scsi_transport_template *ibmvscsi_transport_template; @@ -1666,7 +1668,7 @@ static ssize_t show_host_srp_version(struct device *dev, struct ibmvscsi_host_data *hostdata = shost_priv(shost); int len; - len = snprintf(buf, PAGE_SIZE, %s\n, + len = snprintf(buf, HOST_BUFFER_SIZE, %s\n, hostdata-madapter_info.srp_version); return len; } @@ -1687,7 +1689,7 @@ static ssize_t show_host_partition_name(struct device *dev, struct ibmvscsi_host_data *hostdata = shost_priv(shost); int len; - len = snprintf(buf, PAGE_SIZE, %s\n, + len = snprintf(buf, HOST_BUFFER_SIZE, %s\n, hostdata-madapter_info.partition_name); return len; } @@ -1708,7 +1710,7 @@ static ssize_t show_host_partition_number(struct device *dev, struct ibmvscsi_host_data *hostdata = shost_priv(shost); int len; - len = snprintf(buf, PAGE_SIZE, %d\n, + len = snprintf(buf, HOST_BUFFER_SIZE, %d\n, hostdata-madapter_info.partition_number); return len; } @@ -1728,7 +1730,7 @@ static ssize_t show_host_mad_version(struct device *dev, struct ibmvscsi_host_data *hostdata = shost_priv(shost); int len; - len = snprintf(buf, PAGE_SIZE, %d\n, + len = snprintf(buf, HOST_BUFFER_SIZE, %d\n, hostdata-madapter_info.mad_version); return len; } @@ -1748,7 +1750,7 @@ static ssize_t show_host_os_type(struct device *dev, struct ibmvscsi_host_data *hostdata = shost_priv(shost); int len; - len = snprintf(buf, PAGE_SIZE, %d\n, hostdata-madapter_info.os_type); + len = snprintf(buf, HOST_BUFFER_SIZE, %d\n, hostdata-madapter_info.os_type); return len; } @@ -1767,7 +1769,7 @@ static ssize_t show_host_config(struct device *dev, struct ibmvscsi_host_data *hostdata = shost_priv(shost); /* returns null-terminated host config data */ - if (ibmvscsi_do_host_config(hostdata, buf, PAGE_SIZE) == 0) + if (ibmvscsi_do_host_config(hostdata, buf, HOST_BUFFER_SIZE) == 0) return strlen(buf); else return 0; -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] DMA buffer alignment annotations
On Fri, 2007-12-21 at 09:39 +, Russell King wrote: +#ifndef ARCH_MIN_DMA_ALIGNMENT +#define __dma_aligned +#define __dma_buffer +#else +#define __dma_aligned __attribute__((aligned(ARCH_MIN_DMA_ALIGNMENT))) +#define __dma_buffer __dma_buffer_line(__LINE__) +#define __dma_buffer_line(line)__dma_aligned;\ + char __dma_pad_##line[0] __dma_aligned You introduce __dma_buffer_line() if ARCH_MIN_DMA_ALIGNMENT is set but not if it isn't... Yup, it's not meant to be used outside of __dma_buffer... .../... +then dev-buffer will be safe for DMA on all architectures. On a +cache-coherent architecture the members of dev will be aligned exactly +as they would have been without __dma_buffer; on a non-cache-coherent +architecture buffer and field2 will be aligned so that buffer does not +share a cache line with any other data. + ... but it's not described. What's the purpose of it, and why would it only be used on CPUs with ARCH_MIN_DMA_ALIGNMENT set? Hrm, I'm not the best at writing exlanations, care to send a patch ? :-) Cheers, Ben - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] scsi: Use new __dma_buffer to align sense buffer in scsi_cmnd
On Fri, 2007-12-21 at 10:33 +, Alan Cox wrote: On Fri, 21 Dec 2007 13:30:08 +1100 Benjamin Herrenschmidt [EMAIL PROTECTED] wrote: The sense buffer ins scsi_cmnd can nowadays be DMA'ed into directly by some low level drivers (that typically happens with USB mass storage). Should that not be fixed in USB storage by using pci_alloc_coherent on the PCI device of the hub not peeing directly into kernel space ? All dumb SCSI drivers have this problem, USB storage is just one of them. This would also allow to remove bounce buffering that some non-dumb ones are doing in fact. There's another solution jejb was talking about involving reworking the allocation of the sense buffer to make it always under driver control etc... but that's the kind of SCSI surgery that I'm not prepared to do especially not for .25 and without much HW to test with. It's also incomplete as a fix because I don't see what guarantees the buffer size will always exceed cache line size How is that a problem ? The annotation will make sure the buffer doesn't share cache lines. It forces alignmenet before and pads after. Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] scsi: Use new __dma_buffer to align sense buffer in scsi_cmnd
On Fri, 2007-12-21 at 06:16 -0700, Matthew Wilcox wrote: On Fri, Dec 21, 2007 at 10:33:26AM +, Alan Cox wrote: On Fri, 21 Dec 2007 13:30:08 +1100 Benjamin Herrenschmidt [EMAIL PROTECTED] wrote: The sense buffer ins scsi_cmnd can nowadays be DMA'ed into directly by some low level drivers (that typically happens with USB mass storage). Should that not be fixed in USB storage by using pci_alloc_coherent on the PCI device of the hub not peeing directly into kernel space ? That's what I said, but Ben seems fixated on this particular fix. That means more understanding of the SCSI code than I have right now and a _lot_ more time than I have right now. It's not only USB storage, it's anything that does DMA and uses the generic error handling. Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] scsi: Use new __dma_buffer to align sense buffer in scsi_cmnd
The sense buffer ins scsi_cmnd can nowadays be DMA'ed into directly by some low level drivers (that typically happens with USB mass storage). This is a problem on non cache coherent architectures such as embedded PowerPCs where the sense buffer can share cache lines with other structure members, which leads to various forms of corruption. This uses the newly defined __dma_buffer annotation to enforce that on such platforms, the sense_buffer is contained within its own cache line. This has no effect on cache coherent architectures. Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED] --- include/scsi/scsi_cmnd.h |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- linux-merge.orig/include/scsi/scsi_cmnd.h 2007-12-21 13:07:14.0 +1100 +++ linux-merge/include/scsi/scsi_cmnd.h2007-12-21 13:07:29.0 +1100 @@ -88,7 +88,7 @@ struct scsi_cmnd { working on */ #define SCSI_SENSE_BUFFERSIZE 96 - unsigned char sense_buffer[SCSI_SENSE_BUFFERSIZE]; + unsigned char sense_buffer[SCSI_SENSE_BUFFERSIZE] __dma_buffer; /* obtained by REQUEST SENSE when * CHECK CONDITION is received on original * command (auto-sense) */ - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] DMA buffer alignment annotations
This patch based on some earlier work by Roland Dreier introduces a pair of annotations that can be used to enforce alignment of objects that can be DMA'ed into, and to enforce that an DMA'able object within a structure isn't sharing a cache line with some other object. Such sharing of a data structure between DMA and non-DMA objects isn't a recommended practice, but it does happen and in some case might even make sense, so we now have a way to make it work propertly. The current patch only enables such alignment for some PowerPC platforms that do not have coherent caches. Other platforms such as ARM, MIPS, etc... can define ARCH_MIN_DMA_ALIGNMENT if they want to benefit from this, I don't know them well enough to do it myself. The initial issue I'm fixing (in a second patch) by using these is the SCSI sense buffer which is currently part of the scsi command structure and can be DMA'ed to. On non-coherent platforms, this causes various corruptions as this cache line is shared with various other fields of the scsi_cmnd data structure. Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED] --- Documentation/DMA-mapping.txt | 32 include/asm-generic/page.h| 10 ++ include/asm-powerpc/page.h|8 3 files changed, 50 insertions(+) --- linux-merge.orig/include/asm-generic/page.h 2007-07-27 13:44:45.0 +1000 +++ linux-merge/include/asm-generic/page.h 2007-12-21 13:07:28.0 +1100 @@ -20,6 +20,16 @@ static __inline__ __attribute_const__ in return order; } +#ifndef ARCH_MIN_DMA_ALIGNMENT +#define __dma_aligned +#define __dma_buffer +#else +#define __dma_aligned __attribute__((aligned(ARCH_MIN_DMA_ALIGNMENT))) +#define __dma_buffer __dma_buffer_line(__LINE__) +#define __dma_buffer_line(line)__dma_aligned;\ + char __dma_pad_##line[0] __dma_aligned +#endif + #endif /* __ASSEMBLY__ */ #endif /* __KERNEL__ */ Index: linux-merge/include/asm-powerpc/page.h === --- linux-merge.orig/include/asm-powerpc/page.h 2007-09-28 11:42:10.0 +1000 +++ linux-merge/include/asm-powerpc/page.h 2007-12-21 13:15:02.0 +1100 @@ -77,6 +77,14 @@ #define VM_DATA_DEFAULT_FLAGS64(VM_READ | VM_WRITE | \ VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) +/* + * On non cache coherent platforms, we enforce cache aligned DMA + * buffers inside of structures + */ +#ifdef CONFIG_NOT_COHERENT_CACHE +#define ARCH_MIN_DMA_ALIGNMENT L1_CACHE_BYTES +#endif + #ifdef __powerpc64__ #include asm/page_64.h #else Index: linux-merge/Documentation/DMA-mapping.txt === --- linux-merge.orig/Documentation/DMA-mapping.txt 2007-12-21 13:17:14.0 +1100 +++ linux-merge/Documentation/DMA-mapping.txt 2007-12-21 13:20:00.0 +1100 @@ -75,6 +75,38 @@ What about block I/O and networking buff networking subsystems make sure that the buffers they use are valid for you to DMA from/to. +Note that on non-cache-coherent architectures, having a DMA buffer +that shares a cache line with other data can lead to memory +corruption. + +The __dma_buffer macro exists to allow safe DMA buffers to be declared +easily and portably as part of larger structures without causing bloat +on cache-coherent architectures. To get this macro, architectures have +to define ARCH_MIN_DMA_ALIGNMENT to the requested alignment value in +their asm/page.h before including asm-generic/page.h + +Of course these structures must be contained in memory that can be +used for DMA as described above. + +To use __dma_buffer, just declare a struct like: + + struct mydevice { + int field1; + char buffer[BUFFER_SIZE] __dma_buffer; + int field2; + }; + +If this is used in code like: + + struct mydevice *dev; + dev = kmalloc(sizeof *dev, GFP_KERNEL); + +then dev-buffer will be safe for DMA on all architectures. On a +cache-coherent architecture the members of dev will be aligned exactly +as they would have been without __dma_buffer; on a non-cache-coherent +architecture buffer and field2 will be aligned so that buffer does not +share a cache line with any other data. + DMA addressing limitations Does your device have any DMA addressing limitations? For example, is - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SCSI breakage on non-cache coherent architectures
On Tue, 2007-11-20 at 15:10 -0600, James Bottomley wrote: We're talking about trying to fix this for 2.4; which is already at -rc3 ... Is an entire arch change for dma alignment really a merge candidate at this stage? Well, as I said before... it's a matter of what seems to be the less likely to break something right ? On one side, I'm doing surgery on code I barely know, the scsi error handling, and now it seems I also have to fixup a handful of drivers that aren't the most obvious pieces of code around. On the other side, Roland proposal is basically just adding a macro that can be empty for everybody but a handful of archs, and stick it onto one field in one structure... The later has about 0 chances to actually break something or cause a regression. I wouldn't say that about the former. Now, I will see if I manage to fixup the NCR drivers to pass a pre-allocated buffer (USB storage I think can pass NULL as it's not calling prep in atomic context). But then, it complicates the matter because that means restore will have to know whether prep allocated the buffer or not, thus more fields to add to the save struct, it's getting messy, unless we decide -all- callers are responsible for the buffer allocation (hrm... maybe the best approach). Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SCSI breakage on non-cache coherent architectures
On Mon, 2007-11-19 at 00:38 -0800, David Miller wrote: From: Benjamin Herrenschmidt [EMAIL PROTECTED] Date: Mon, 19 Nov 2007 16:35:23 +1100 I'm not sure what is the best way to fix that. Internally, I've done some test whacking some cacheline_aligned in the scsi_cmnd data structure to verify I no longer get random SLAB corruption when using my USB but that significantly bloats the size of the structure on archs such as ppc64 that don't need it and have a large cache line size. Unfortunately, I don't think there's any existing Kconfig symbol or arch provided #define to tell us that we are on a non-coherent arch afaik that could be used to make that conditional. Another option would be to kmalloc the buffer (wasn't it the case before btw ?) but I suppose some people will scream at the idea due to how the command pools are done... You could make a dma_cacheline_aligned and use that. It seems pretty reasonable. I was thinking about that. What archs would need it ? arm, mips, what else ? Cheers, Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SCSI breakage on non-cache coherent architectures
On Mon, 2007-11-19 at 05:32 -0700, Matthew Wilcox wrote: On Mon, Nov 19, 2007 at 04:35:23PM +1100, Benjamin Herrenschmidt wrote: The other one I'm hitting now is that the SCSI layer nowadays embeds the 'nowadays'? It has always been so. Wasn't it kmalloc'ed at one point ? sense_buffer inside the scsi_cmnd structure without any kind of alignment whatsoever. I've been hitting irregulary is a crash on SCSI command completion that seems to be related to corruption of the request pointer in struct scsi_cmnd and I think it might be the cause. I'm now trying to setup a proper repro-case. What other drivers do is DMA to their own allocation and then memcpy to the sense buffer. What other drivers ? Those architectures use the same drivers as everything else. There is a movement to allocate the sense data as its own sg list, but I don't think that patch has even been posted yet. I've seen code creating an sglist from the scsi_cmnd-sense_buffer and passing that to drivers. That breaks. Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SCSI breakage on non-cache coherent architectures
On Mon, 2007-11-19 at 09:09 -0600, James Bottomley wrote: What other drivers do is DMA to their own allocation and then memcpy to the sense buffer. There is a movement to allocate the sense data as its own sg list, but I don't think that patch has even been posted yet. I'd like to be rid of it inside the command for various reasons: every command has one of these, and they're expensive in the allocation (at 96 bytes). There's no reason we have to allocate and free that amount of space with every command. In theory, the number of these is bounded at the queue depth, in practice, there's usually only one, and this DMA alignment issue does requires most drivers to double copy. And most drivers don't and break. Take USB storage, I -think- (code path aren't trivial to follow) it just gets the sglist cooked up by the code in scsi_error.c no ? That just points to the buffer in scsi_cmnd. It then pass that for DMA to the USB stack. Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SCSI breakage on non-cache coherent architectures
I'd like to be rid of it inside the command for various reasons: every command has one of these, and they're expensive in the allocation (at 96 bytes). There's no reason we have to allocate and free that amount of space with every command. In theory, the number of these is bounded at the queue depth, in practice, there's usually only one, and this DMA alignment issue does requires most drivers to double copy. Do you have a plan for short term here ? I'd like something fixed for .25, so I may have to introduce a __dma_aligned macro of some sort to deal with that in the meantime... Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SCSI breakage on non-cache coherent architectures
On Mon, 2007-11-19 at 13:43 -0800, Roland Dreier wrote: I've been debugging various issues on the PowerPC 44x embedded architecture which happens to have non-coherent PCI DMA. One of the problem I'm hitting is that one really need to enforce kmalloc alignement to cache lines or bad things will happen (among others with USB), for some reasons, powerpc failed to do so, I fixed it. Heh... I hit the same problem literally 5 years ago: http://lwn.net/Articles/1783/ I implemented the __dma_buffer annotation: http://lwn.net/Articles/2269/ But DaveM said we should just use the PCI pool code instead: http://lwn.net/Articles/2270/ Heh, well... In this case, DaveM just proposed something akin to your __dma_buffer :-) On the other hand, after discussing with James, it looks like we'll just be reverting the patch that removed the kmalloc of the sense buffer since non cache-coherent archs are supposed to enforce kmalloc alignment to cache lines. __dma_buffer still seems like a good thing to have if too many other drivers are hitting that but for this specific problem, it's not the approach that we'll be taking. Cheers, Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SCSI breakage on non-cache coherent architectures
On Mon, 2007-11-19 at 14:31 -0800, David Miller wrote: From: Benjamin Herrenschmidt [EMAIL PROTECTED] Date: Tue, 20 Nov 2007 06:51:14 +1100 On Mon, 2007-11-19 at 00:38 -0800, David Miller wrote: From: Benjamin Herrenschmidt [EMAIL PROTECTED] Date: Mon, 19 Nov 2007 16:35:23 +1100 You could make a dma_cacheline_aligned and use that. It seems pretty reasonable. I was thinking about that. What archs would need it ? arm, mips, what else ? The sparc32 port would need it too. James preference seem to go for a revert of the patch that removed the kmalloc of the buffer instead. Sounds definitely like an easier plan for .24 (and maybe even backport to stable). I'll produce a patch for that later today or tomorrow. Do you still think we should introduce this __dma_cacheline_aligned ? Do you see other cases of drivers where it would be useful ? It tend to agree with your earlier statement that drivers doing that are broken and should be using a separate allocator for DMA'ble objects (in fact, on non-cache coherent archs, kmalloc is just fine). Cheers, Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SCSI breakage on non-cache coherent architectures
On Mon, 2007-11-19 at 16:46 -0800, David Miller wrote: 1) Require that entire buffers are commited by call sites, and thus embedding DMA'd within non-DMA stuff isn't allowed 2) Add the __dma_cacheline_aligned tag. But note that with #2 it could get quite ugly because the alignment and size both have a minimum that needs to be enforced, not just the alignment alone. So either: Yup. struct foo { unsigned int other_unrelated_stuff; struct object dma_thing __dma_cacheline_aligned; unsigned int more_nondma_stuff __dma_cacheline_aligned; }; In my tests, I had used a fuckton_t object defined to be an empty thing with alignment constraint, seemed to work :-) But I'd rather require #1. BTW. What is the status nowadays with skb's ? Cheers, Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SCSI breakage on non-cache coherent architectures
On Mon, 2007-11-19 at 18:10 -0800, Roland Dreier wrote: I wrapped this ugliness up inside the macro back in what I posted in 2002 (http://lkml.org/lkml/2002/6/12/234): #define __dma_buffer __dma_buffer_line(__LINE__) #define __dma_buffer_line(line) __dma_buffer_expand_line(line) #define __dma_buffer_expand_line(line) \ __attribute__ ((aligned(L1_CACHE_BYTES))); \ char __dma_pad_ ## line [0] __attribute__ ((aligned(L1_CACHE_BYTES))) then you just need to tag the actual member like: char foo[3] __dma_buffer; That's actually not too bad ... I'm having a problem with reverting SCSI to use an allocation for the sense buffer, because it can fail and the new scso_eh_prep_cmnd() isn't supposed to return a failure. I've changed that but now I get into trying to fix the various drivers that use it without checking the result code and it's becoming much more complicated than initially thought. So I may do the above instead and revive your patch. Any objection ? James ? David ? Cheers, Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SCSI breakage on non-cache coherent architectures
FYI, Here's what I have for the SCSI change. I haven't updated drivers to care for the new return code though, help appreciated with that as I don't know much about these drivers. Index: linux-work/drivers/scsi/scsi_error.c === --- linux-work.orig/drivers/scsi/scsi_error.c 2007-11-20 13:26:18.0 +1100 +++ linux-work/drivers/scsi/scsi_error.c2007-11-20 13:43:05.0 +1100 @@ -602,8 +602,9 @@ static void scsi_abort_eh_cmnd(struct sc * @cmnd is ignored and this functions sets up a REQUEST_SENSE command * and cmnd buffers to read @sense_bytes into @scmd-sense_buffer. **/ -void scsi_eh_prep_cmnd(struct scsi_cmnd *scmd, struct scsi_eh_save *ses, - unsigned char *cmnd, int cmnd_size, unsigned sense_bytes) +int scsi_eh_prep_cmnd(struct scsi_cmnd *scmd, struct scsi_eh_save *ses, + unsigned char *cmnd, int cmnd_size, unsigned sense_bytes, + gfp_t gfp_mask) { struct scsi_device *sdev = scmd-device; @@ -622,12 +623,20 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd ses-use_sg = scmd-use_sg; ses-resid = scmd-resid; ses-result = scmd-result; + sg_init_table(ses-sense_sgl, 1); if (sense_bytes) { + struct page *pg; + + if (sdev-host-hostt-unchecked_isa_dma) + gfp_mask |= __GFP_DMA; scmd-request_bufflen = min_t(unsigned, sizeof(scmd-sense_buffer), sense_bytes); - sg_init_one(ses-sense_sgl, scmd-sense_buffer, - scmd-request_bufflen); + pg = alloc_page(gfp_mask); + if (!pg) + return FAILED; + memset(page_address(pg), 0, scmd-request_bufflen); + sg_set_page(ses-sense_sgl, pg, scmd-request_bufflen, 0); scmd-request_buffer = ses-sense_sgl; scmd-sc_data_direction = DMA_FROM_DEVICE; scmd-use_sg = 1; @@ -658,6 +667,8 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd * untransferred sense data should be interpreted as being zero. */ memset(scmd-sense_buffer, 0, sizeof(scmd-sense_buffer)); + + return SUCCESS; } EXPORT_SYMBOL(scsi_eh_prep_cmnd); @@ -670,9 +681,17 @@ EXPORT_SYMBOL(scsi_eh_prep_cmnd); **/ void scsi_eh_restore_cmnd(struct scsi_cmnd* scmd, struct scsi_eh_save *ses) { + struct page *pg; + /* * Restore original data */ + pg = sg_page(ses-sense_sgl); + if (pg) { + memcpy(scmd-sense_buffer, page_address(pg), + scmd-request_bufflen); + __free_page(pg); + } scmd-cmd_len = ses-cmd_len; memcpy(scmd-cmnd, ses-cmnd, sizeof(scmd-cmnd)); scmd-sc_data_direction = ses-data_direction; @@ -709,7 +728,10 @@ static int scsi_send_eh_cmnd(struct scsi struct scsi_eh_save ses; int rtn; - scsi_eh_prep_cmnd(scmd, ses, cmnd, cmnd_size, sense_bytes); + if (scsi_eh_prep_cmnd(scmd, ses, cmnd, cmnd_size, sense_bytes, + GFP_KERNEL) != SUCCESS) + return FAILED; + shost-eh_action = done; spin_lock_irqsave(shost-host_lock, flags); Index: linux-work/include/scsi/scsi_eh.h === --- linux-work.orig/include/scsi/scsi_eh.h 2007-11-20 13:36:44.0 +1100 +++ linux-work/include/scsi/scsi_eh.h 2007-11-20 13:42:49.0 +1100 @@ -81,9 +81,10 @@ struct scsi_eh_save { struct scatterlist sense_sgl; }; -extern void scsi_eh_prep_cmnd(struct scsi_cmnd *scmd, +extern int scsi_eh_prep_cmnd(struct scsi_cmnd *scmd, struct scsi_eh_save *ses, unsigned char *cmnd, - int cmnd_size, unsigned sense_bytes); + int cmnd_size, unsigned sense_bytes, + gfp_t gfp_mask); extern void scsi_eh_restore_cmnd(struct scsi_cmnd* scmd, struct scsi_eh_save *ses); Index: linux-work/drivers/scsi/NCR5380.c === --- linux-work.orig/drivers/scsi/NCR5380.c 2007-11-20 13:46:41.0 +1100 +++ linux-work/drivers/scsi/NCR5380.c 2007-11-20 13:46:47.0 +1100 @@ -2283,7 +2283,7 @@ static void NCR5380_information_transfer } if ((cmd-cmnd[0] != REQUEST_SENSE) (status_byte(cmd-SCp.Status) == CHECK_CONDITION)) { - scsi_eh_prep_cmnd(cmd, hostdata-ses, NULL, 0, ~0); + scsi_eh_prep_cmnd(cmd, hostdata-ses, NULL, 0, ~0, GFP_ATOMIC); dprintk(NDEBUG_AUTOSENSE, (scsi%d : performing request sense\n,
SCSI breakage on non-cache coherent architectures
Hi James ! (Please CC me on replies as I'm not subscribed to linux-scsi) I've been debugging various issues on the PowerPC 44x embedded architecture which happens to have non-coherent PCI DMA. One of the problem I'm hitting is that one really need to enforce kmalloc alignement to cache lines or bad things will happen (among others with USB), for some reasons, powerpc failed to do so, I fixed it. The other one I'm hitting now is that the SCSI layer nowadays embeds the sense_buffer inside the scsi_cmnd structure without any kind of alignment whatsoever. I've been hitting irregulary is a crash on SCSI command completion that seems to be related to corruption of the request pointer in struct scsi_cmnd and I think it might be the cause. I'm now trying to setup a proper repro-case. The sense buffer is something that is written to by the device, thus it gets cache invalidated when the DMA happens, potentially losing whatever was sharing the cache line, which includes, among other things, that request pointer field. There are other issues as well if any of the fields sharing the cache line happens to be read while the DMA is in progress, it would populate the cache with memory content prior to the DMA being completed. It's fairly important on such architectures not to share cache lines between objects being DMA'd to/from and the rest of the system. If the DMA is strictly outgoing, it's generally ok, but not if the DMA is bidirectional or the device is writing to memory. I'm not sure what is the best way to fix that. Internally, I've done some test whacking some cacheline_aligned in the scsi_cmnd data structure to verify I no longer get random SLAB corruption when using my USB but that significantly bloats the size of the structure on archs such as ppc64 that don't need it and have a large cache line size. Unfortunately, I don't think there's any existing Kconfig symbol or arch provided #define to tell us that we are on a non-coherent arch afaik that could be used to make that conditional. Another option would be to kmalloc the buffer (wasn't it the case before btw ?) but I suppose some people will scream at the idea due to how the command pools are done... What do you suggest ? Cheers, Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 5/6] ps3: BD/DVD/CD-ROM Storage Driver
Upon closer look, while flush_kernel_dcache_page() is a no-op on ppc64, flush_dcache_page() isn't. So I'd prefer to not call it if not really needed. And according to James, flush_kernel_dcache_page() should be sufficient... So I'm getting puzzled again... flush_dcache_page() handles icache vs. dcache coherency by clearing the PG_arch_1 bit in the struct page that indicates that the page is cache clean. You -must- call it if you're going to use any form of CPU access to write to the page (basically dirtying the data cache) and that page can be ever mapped into user space and executed from. I don't know what flush_kernel_dcache_page() does and if it needs a similar treatement, it's a new interface, so maybe Jens and or James can tell me more about it.. Cheers, Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 5/6] ps3: BD/DVD/CD-ROM Storage Driver
On Mon, 2007-07-16 at 17:03 -0500, James Bottomley wrote: On Tue, 2007-07-17 at 07:49 +1000, Benjamin Herrenschmidt wrote: No ... that was the point of flush_kernel_dcache_page(). The page in question is page cache backed and contains user mappings. However, the block layer has already done a flush_dcache_page() in get_user_pages() and the user shouldn't be touching memory under I/O (unless they want self induced aliasing problems) so we're free to assume all the user cachelines are purged, hence all we have to do is flush the kernel alias to bring the page up to date and make the users see it correctly. The block layer will have done that even in the swap-out path ? (Just asking... I'm not very familiar with the block layer) Er ... not really, this is the I/O path for user initiated I/O. The page out path, by definition, can't have any extant user mappings. For page out, the relevant page is flushed before its mapping is detached, and then it can be paged to the backing store (or for anonymous pages to the swap device) when no mappings remain. Ok, thanks. Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 5/6] ps3: BD/DVD/CD-ROM Storage Driver
On Fri, 2007-07-13 at 09:02 -0400, James Bottomley wrote: On Wed, 2007-07-04 at 15:22 +0200, Geert Uytterhoeven wrote: + kaddr = kmap_atomic(sgpnt-page, KM_USER0); + if (!kaddr) + return -1; + len = sgpnt-length; + if ((req_len + len) buflen) { + active = 0; + len = buflen - req_len; + } + memcpy(kaddr + sgpnt-offset, buf + req_len, len); + kunmap_atomic(kaddr, KM_USER0); This isn't a SCSI objection, but this sequence appears several times in this driver. It's wrong for a non-PIPT architecture (and I believe the PS3 is VIPT) because you copy into the kernel alias for the page, which dirties the line in the cache of that alias (the user alias cache line was already invalidated). However, unless you flush the kernel alias to main memory, the user could read stale data. The way this is supposed to be done is to do a Nah, we have no cache aliasing on ppc, at least not that matter here and not on cell. Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 5/6] ps3: BD/DVD/CD-ROM Storage Driver
On Fri, 2007-07-13 at 16:19 +0200, Arnd Bergmann wrote: I'm pretty sure that no ppc64 machine needs alias resolution in the kernel, although some are VIPT. Last time we discussed this, Segher explained it to me, but I don't remember which way Cell does it. IIRC, it automatically flushes cache lines that are accessed through aliases. Ah yes, I remember reading about this automatic flushing thing. I don't know how the caches actually work on most of our PPC's, but the fact is, we don't have aliasing issues, so I can safely ignore it for a bit longer :-) There are some aliasing issues with the instruction cache specifically on some 4xx models but that's irrelevant to this discussion (and I think we handle them elsewhere anyway). Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/6] ps3: Preallocate bootmem memory for the PS3 FLASH ROM storage driver
On Fri, 2007-06-15 at 13:39 +0200, Geert Uytterhoeven wrote: plain text document attachment (ps3-stable) Preallocate 256 KiB of bootmem memory for the PS3 FLASH ROM storage driver. I still very much dislike the #ifdef xxx_MODULE in main kernel code. At the end of the day, is it realistic to ever use a PS3 without the storage driver ? I would suggest just allocating those unconditionally. Ben. Signed-off-by: Geert Uytterhoeven [EMAIL PROTECTED] Signed-off-by: Geoff Levand [EMAIL PROTECTED] --- arch/powerpc/platforms/ps3/setup.c | 19 ++- include/asm-powerpc/ps3.h |1 + 2 files changed, 19 insertions(+), 1 deletion(-) --- a/arch/powerpc/platforms/ps3/setup.c +++ b/arch/powerpc/platforms/ps3/setup.c @@ -107,7 +107,8 @@ static void ps3_panic(char *str) while(1); } -#if defined(CONFIG_FB_PS3) || defined(CONFIG_FB_PS3_MODULE) +#if defined(CONFIG_FB_PS3) || defined(CONFIG_FB_PS3_MODULE) || \ +defined(CONFIG_PS3_FLASH) || defined(CONFIG_PS3_FLASH_MODULE) static void prealloc(struct ps3_prealloc *p) { if (!p-size) @@ -123,7 +124,9 @@ static void prealloc(struct ps3_prealloc printk(KERN_INFO %s: %lu bytes at %p\n, p-name, p-size, p-address); } +#endif +#if defined(CONFIG_FB_PS3) || defined(CONFIG_FB_PS3_MODULE) struct ps3_prealloc ps3fb_videomemory = { .name = ps3fb videomemory, .size = CONFIG_FB_PS3_DEFAULT_SIZE_M*1024*1024, @@ -146,6 +149,18 @@ early_param(ps3fb, early_parse_ps3fb); #define prealloc_ps3fb_videomemory() do { } while (0) #endif +#if defined(CONFIG_PS3_FLASH) || defined(CONFIG_PS3_FLASH_MODULE) +struct ps3_prealloc ps3flash_bounce_buffer = { + .name = ps3flash bounce buffer, + .size = 256*1024, + .align = 256*1024 +}; +EXPORT_SYMBOL_GPL(ps3flash_bounce_buffer); +#define prealloc_ps3flash_bounce_buffer() prealloc(ps3flash_bounce_buffer) +#else +#define prealloc_ps3flash_bounce_buffer()do { } while (0) +#endif + static int ps3_set_dabr(u64 dabr) { enum {DABR_USER = 1, DABR_KERNEL = 2,}; @@ -175,6 +190,8 @@ static void __init ps3_setup_arch(void) #endif prealloc_ps3fb_videomemory(); + prealloc_ps3flash_bounce_buffer(); + ppc_md.power_save = ps3_power_save; DBG( - %s:%d\n, __func__, __LINE__); --- a/include/asm-powerpc/ps3.h +++ b/include/asm-powerpc/ps3.h @@ -427,6 +427,7 @@ struct ps3_prealloc { }; extern struct ps3_prealloc ps3fb_videomemory; +extern struct ps3_prealloc ps3flash_bounce_buffer; #endif ___ Linuxppc-dev mailing list [EMAIL PROTECTED] https://ozlabs.org/mailman/listinfo/linuxppc-dev - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 6/7] ps3: ROM Storage Driver
On Wed, 2007-05-30 at 12:13 +0200, Christoph Hellwig wrote: For any sane hypervisor or hardware the copy should be worth than that. Then again a sane hardware or hypervisor would support SG requests.. Agreed... Sony should fix that, it's a bit ridiculous. Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 6/7] ps3: ROM Storage Driver
On Tue, 2007-05-29 at 13:11 +0200, Geert Uytterhoeven wrote: This looks very inefficient. Just set sg_tablesize of your driver to 1 to avoid getting mutiple segments. The disadvantage of setting sg_tablesize = 1 is that the driver will get small requests (PAGE_SIZE) most of the time, which is very bad for performance. And the joke is that not only the HW can do scatter gather but you also have an iommu ... Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qla_wxyz pci_set_mwi question
On Thu, 2007-04-12 at 14:04 -0600, Matthew Wilcox wrote: On Thu, Apr 12, 2007 at 12:37:13PM -0700, Andrew Vasquez wrote: On Thu, 12 Apr 2007, Matthew Wilcox wrote: Why should it fail? If there's a platform which can't support a cacheline size that the qla2xyz card can handle, it should be able to happily fall back to doing plain writes instead of MWIs. IMO, it should just call pci_set_mwi() and ignore the result. I believe there were some erratas on some ISP2xxx chips where MWI needed to be set for proper operation. I'll go back, verify and update the patch accordingly. Hmm. The thing is that pci_set_mwi() returns success on machines where MWI is disabled (currently only PPC64). Perhaps it needs to fail instead. MWI isn't diabled on ppc64... or did I miss something ? Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qla_wxyz pci_set_mwi question
Willy was referring to this from include/asm-powerpc/pci.h: #ifdef CONFIG_PPC64 /* * We want to avoid touching the cacheline size or MWI bit. * pSeries firmware sets the cacheline size (which is not the cpu cacheline * size in all cases) and hardware treats MWI the same as memory write. */ #define PCI_DISABLE_MWI which makes pci_set_mwi() do nothing other than return 0; Interesting... I think I missed that we had that bit for some time :-) Well, I suppose that on pSeries and probably pmac too, the firmware will set the MWI bit for us anyway, but that's a bit dodgy to apply that to all ppc64... they aren't all pSeries. I'll have to look into that again one of these days. Cheers, Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 18/59] sysctl: ipmi remove unnecessary insert_at_head flag
On Tue, 2007-01-16 at 09:39 -0700, Eric W. Biederman wrote: From: Eric W. Biederman [EMAIL PROTECTED] - unquoted With unique sysctl binary numbers setting insert_at_head is pointless. Signed-off-by: Eric W. Biederman [EMAIL PROTECTED] Acked-by: Benjamin Herrenschmidt [EMAIL PROTECTED] --- drivers/char/ipmi/ipmi_poweroff.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/char/ipmi/ipmi_poweroff.c b/drivers/char/ipmi/ipmi_poweroff.c index 9d23136..b3ae65e 100644 --- a/drivers/char/ipmi/ipmi_poweroff.c +++ b/drivers/char/ipmi/ipmi_poweroff.c @@ -686,7 +686,7 @@ static int ipmi_poweroff_init (void) printk(KERN_INFO PFX Power cycle is enabled.\n); #ifdef CONFIG_PROC_FS - ipmi_table_header = register_sysctl_table(ipmi_root_table, 1); + ipmi_table_header = register_sysctl_table(ipmi_root_table, 0); if (!ipmi_table_header) { printk(KERN_ERR PFX Unable to register powercycle sysctl\n); rv = -ENOMEM; - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 36/59] sysctl: C99 convert ctl_tables entries in arch/ppc/kernel/ppc_htab.c
On Tue, 2007-01-16 at 09:39 -0700, Eric W. Biederman wrote: From: Eric W. Biederman [EMAIL PROTECTED] - unquoted And make the mode of the kernel directory 0555 no one is allowed to write to sysctl directories. Signed-off-by: Eric W. Biederman [EMAIL PROTECTED] Acked-by: Benjamin Herrenschmidt [EMAIL PROTECTED] --- arch/ppc/kernel/ppc_htab.c | 11 --- 1 files changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/ppc/kernel/ppc_htab.c b/arch/ppc/kernel/ppc_htab.c index bd129d3..77b20ff 100644 --- a/arch/ppc/kernel/ppc_htab.c +++ b/arch/ppc/kernel/ppc_htab.c @@ -442,11 +442,16 @@ static ctl_table htab_ctl_table[]={ .mode = 0644, .proc_handler = proc_dol2crvec, }, - { 0, }, + {} }; static ctl_table htab_sysctl_root[] = { - { 1, kernel, NULL, 0, 0755, htab_ctl_table, }, - { 0,}, + { + .ctl_name = CTL_KERN, + .procname = kernel, + .mode = 0555, + .child = htab_ctl_table, + }, + {} }; static int __init - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 35/59] sysctl: C99 convert ctl_tables in arch/powerpc/kernel/idle.c
On Tue, 2007-01-16 at 09:39 -0700, Eric W. Biederman wrote: From: Eric W. Biederman [EMAIL PROTECTED] - unquoted This was partially done already and there was no ABI breakage what a relief. Signed-off-by: Eric W. Biederman [EMAIL PROTECTED] Acked-by: Benjamin Herrenschmidt [EMAIL PROTECTED] --- arch/powerpc/kernel/idle.c | 11 --- 1 files changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/idle.c b/arch/powerpc/kernel/idle.c index 8994af3..8b27bb1 100644 --- a/arch/powerpc/kernel/idle.c +++ b/arch/powerpc/kernel/idle.c @@ -110,11 +110,16 @@ static ctl_table powersave_nap_ctl_table[]={ .mode = 0644, .proc_handler = proc_dointvec, }, - { 0, }, + {} }; static ctl_table powersave_nap_sysctl_root[] = { - { 1, kernel, NULL, 0, 0755, powersave_nap_ctl_table, }, - { 0,}, + { + .ctl_name = CTL_KERN, + .procname = kernel, + .mode = 0755, + .child = powersave_nap_ctl_table, + }, + {} }; static int __init - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2.6.15.4 rel.2 1/1] libata: add hotswap to sata_svw
On Tue, 2006-11-28 at 23:22 +, David Woodhouse wrote: On Thu, 2006-02-16 at 16:09 +0100, Martin Devera wrote: From: Martin Devera [EMAIL PROTECTED] Add hotswap capability to Serverworks/BroadCom SATA controlers. The controler has SIM register and it selects which bits in SATA_ERROR register fires interrupt. The solution hooks on COMWAKE (plug), PHYRDY change and 10B8B decode error (unplug) and calls into Lukasz's hotswap framework. The code got one day testing on dual core Athlon64 H8SSL Supermicro MoBo with HT-1000 SATA, SMP kernel and two CaviarRE SATA HDDs in hotswap bays. Signed-off-by: Martin Devera [EMAIL PROTECTED] What became of this? I might be to blame for not testing it... The Xserve I had on my desk was too noisy for most of my co-workers so I kept delaying and forgot about it Also the Xserve I have only has one disk, which makes hotplug testing a bit harder :-) Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: iomapping a big endian area
On Sat, 2005-04-02 at 22:27 -0600, James Bottomley wrote: On Sat, 2005-04-02 at 20:08 -0800, David S. Miller wrote: Did anyone have a preference for the API? I was thinking ioread32_native, but ioread32be is fine too. I think doing foo{be,le}{8,16,32}() would be consistent with our byteorder.h interface names. Thinking about this some more, I know of no case of a BE bus connected to a LE system, nor do I think anyone would ever create such a beast, so our only missing interface is for a BE bus on a BE system. It's more a matter of the device than the bus imho... Thus, I think io{read,write}{16,32}_native are better interfaces ... I disagree. The driver will never know ... they basically mean pass memory operations without byte swaps, so they're well defined on both BE and LE systems and correspond exactly to our existing _raw_{read,write}{w,l} calls (principle of least surprise). I don't think it's sane. You know that your device is BE or LE and use the appropriate interface. native doesn't make sense to me in this context. Ben. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html