Re: [PATCH 1/3] mpt2sas: remove the use of writeq, since writeq is not atomic
On Wed, May 18, 2011 at 11:31 AM, Milton Miller milt...@bga.com wrote: So the real question should be why is x86-32 supplying a broken writeq instead of letting drivers work out what to do it when needed? Sounds a lot like what I was asking a couple of years ago :) http://lkml.org/lkml/2009/4/19/164 But Ingo insisted that non-atomic writeq would be fine: http://lkml.org/lkml/2009/4/19/167 - R. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 25/37] drivers/infiniband: use .dev.of_node instead of .node in struct of_device
Seems fine... adding EHCA guys just to make sure. .node is being removed Signed-off-by: Grant Likely grant.lik...@secretlab.ca --- drivers/infiniband/hw/ehca/ehca_main.c | 11 ++- 1 files changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 129a6be..2639185 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -291,8 +291,9 @@ static int ehca_sense_attributes(struct ehca_shca *shca) }; ehca_gen_dbg(Probing adapter %s..., - shca-ofdev-node-full_name); -loc_code = of_get_property(shca-ofdev-node, ibm,loc-code, NULL); + shca-ofdev-dev.of_node-full_name); +loc_code = of_get_property(shca-ofdev-dev.of_node, ibm,loc-code, + NULL); if (loc_code) ehca_gen_dbg( ... location lode=%s, loc_code); @@ -720,16 +721,16 @@ static int __devinit ehca_probe(struct of_device *dev, int ret, i, eq_size; unsigned long flags; -handle = of_get_property(dev-node, ibm,hca-handle, NULL); +handle = of_get_property(dev-dev.of_node, ibm,hca-handle, NULL); if (!handle) { ehca_gen_err(Cannot get eHCA handle for adapter: %s., - dev-node-full_name); + dev-dev.of_node-full_name); return -ENODEV; } if (!(*handle)) { ehca_gen_err(Wrong eHCA handle for adapter: %s., - dev-node-full_name); + dev-dev.of_node-full_name); return -ENODEV; } -- Roland Dreier rola...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 04/10] 8xx: Always pin kernel instruction TLB
How to make better use of the remaining ITLB slots is tricky. Somehow one would want to map at lest one to modules but I cannot see how. No. If you use modules, you pay the price. Sane embedded solutions running in tight environments don't use modules :-) No point pinning TLB entries on the vmalloc space, really. Long ago (2.4 days I think) when using modules on ppc 4xx we hacked the module_alloc function (or whatever it was called back then) to allocate space in the kernel pinned TLB instead of using vmalloc. Gave something like a 2x speedup for module code, since the 4xx TLB is so small and the miss handling is so expensive. I assume it should still be possible to do a similar hack with current kernels. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 4/6 v5] Memory DLPAR Handling
This isn't a review of this patch -- more a question out of curiousity about how you actually can do memory remove in practice. Do you have any coordination between the platform/hypervisor and the kernel to make sure that a memory region you might want to remove later gets put into zone_movable so that there's a chance for the kernel to vacate it? Or do you have some other way to coordinate or at least expose which regions of memory the kernel will have a chance at releasing? Thanks, Roland ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: Fix CQE flags reporting
applied, thanks ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: Construct MAD redirect replies from request MAD
this seems reasonable to me, applied, thanks. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [ewg] [PATCH] IB/ehca: Construct MAD redirect replies from request MAD
Given that you seem to like the rest of the code and Jason hasn't spoken up yet, I think we can have Roland merge this patch. Roland, what do you think? I don't see any problem with the idea and this does sound like a step forward, so I am planning on merging this (pending review). ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] ehca: use port autodetect mode as default
looks good, applied ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [ewg] Re: [PATCH 2.6.31 try 2] ehca: Tolerate dynamic memory operations and huge pages
applied... ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [ewg] Re: [PATCH 2.6.31 try 2] ehca: Tolerate dynamic memory operations and huge pages
thanks, applied. -#define HCAD_VERSION 0026 +#define HCAD_VERSION 0027 the driver version was already 0027 (since bde2cfaf), so I dropped this chunk. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [ewg] Re: [PATCH 2.6.31] ehca: Tolerate dynamic memory operations and huge pages
Yeah, the notifier code remains untouched as we still do not allow dynamic memory operations _while_ our module is loaded. The patch allows the driver to cope with DMEM operations that happened before the module was loaded, which might result in a non-contiguous memory layout. When the driver registers its global memory region in the system, the memory layout must be considered. We chose the term toleration instead of support to illustrate this. I see. So things just silently broke in some cases when the driver was loaded after operations you didn't tolerate? Anyway, thanks for the explanation. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/2] lib: Provide generic atomic64_t implementation
The new Nehalems provide 8 logical threads in a single socket. All those threads share a cache, and they have cmpxchg8b anyway, so this won't matter. FWIW, Nehalem EX actually goes to 8 cores/16 threads per socket. But worrying about 32-bit performance on Nehalem is a little silly -- this simplest solution is simply to run a 64-bit kernel. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2.6.31] ehca: Tolerate dynamic memory operations and huge pages
OK, one major issue with this patch and a few minor nits. First, the major issue is that I don't see anything in the patch that changes the code in ehca_mem_notifier() in ehca_main.c: case MEM_GOING_ONLINE: case MEM_GOING_OFFLINE: /* only ok if no hca is attached to the lpar */ spin_lock_irqsave(shca_list_lock, flags); if (list_empty(shca_list)) { spin_unlock_irqrestore(shca_list_lock, flags); return NOTIFY_OK; } else { spin_unlock_irqrestore(shca_list_lock, flags); if (printk_timed_ratelimit(ehca_dmem_warn_time, 30 * 1000)) ehca_gen_err(DMEM operations are not allowed in conjunction with eHCA); return NOTIFY_BAD; } But your patch description says: This patch implements toleration of dynamic memory operations But it seems you're still going to hit the same NOTIFY_BAD case above after your patch. So something doesn't compute for me. Could you explain more? Second, a nit: +#define EHCA_REG_MR 0 +#define EHCA_REG_BUSMAP_MR (~0) and you pass these as the reg_busmap parm in: int ehca_reg_mr(struct ehca_shca *shca, struct ehca_mr *e_mr, u64 *iova_start, @@ -991,7 +1031,8 @@ struct ehca_pd *e_pd, struct ehca_mr_pginfo *pginfo, u32 *lkey, /*OUT*/ -u32 *rkey) /*OUT*/ +u32 *rkey, /*OUT*/ +int reg_busmap) and test it as: +if (reg_busmap) +ret = ehca_reg_bmap_mr_rpages(shca, e_mr, pginfo); +else +ret = ehca_reg_mr_rpages(shca, e_mr, pginfo); So the ~0 for true looks a bit odd. One option would be to make reg_busmap a bool, since that's how you're using it, but then you lose the nice self-documenting macro where you call things. So I think it would be cleaner to do something like enum ehca_reg_type { EHCA_REG_MR, EHCA_REG_BUSMAP_MR }; and make the int reg_busmap parameter into enum ehca_reg_type reg_type and have the code become + if (reg_type == EHCA_REG_BUSMAP_MR) + ret = ehca_reg_bmap_mr_rpages(shca, e_mr, pginfo); + else if (reg_type == EHCA_REG_MR) + ret = ehca_reg_mr_rpages(shca, e_mr, pginfo); + else + ret = -EINVAL or something like that. +struct ib_dma_mapping_ops ehca_dma_mapping_ops = { +.mapping_error = ehca_dma_mapping_error, +.map_single = ehca_dma_map_single, +.unmap_single = ehca_dma_unmap_single, +.map_page = ehca_dma_map_page, +.unmap_page = ehca_dma_unmap_page, +.map_sg = ehca_dma_map_sg, +.unmap_sg = ehca_dma_unmap_sg, +.dma_address = ehca_dma_address, +.dma_len = ehca_dma_len, +.sync_single_for_cpu = ehca_dma_sync_single_for_cpu, +.sync_single_for_device = ehca_dma_sync_single_for_device, +.alloc_coherent = ehca_dma_alloc_coherent, +.free_coherent = ehca_dma_free_coherent, +}; I always think structures like this are easier to read if you align the '=' signs. But no big deal. +ret = ehca_create_busmap(); +if (ret) { +ehca_gen_err(Cannot create busmap.); +goto module_init2; +} + ret = ibmebus_register_driver(ehca_driver); if (ret) { ehca_gen_err(Cannot register eHCA device driver); ret = -EINVAL; -goto module_init2; +goto module_init3; } ret = register_memory_notifier(ehca_mem_nb); if (ret) { ehca_gen_err(Failed registering memory add/remove notifier); -goto module_init3; +goto module_init4; Having to renumber unrelated things is when something changes is why I don't like this style of error path labels. But I think it's well and truly too late to fix that in ehca. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: Remove superfluous bitmasks from QP control block
looks fine, applied for 2.6.31 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [ewg] Re: [PATCH 1/3] IB/ehca: Replace vmalloc with kmalloc
did you have a chance to take a look at the patchset and will you apply it, or are there any outstanding issues we need to address? I guess it's OK, but definitely 2.6.31 material. I guess I'll stick it linux-next soon. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/3] IB/ehca: Replace vmalloc with kmalloc
thanks, applied. From: Anton Blanchard antonb at au1.ibm.com Signed-off-by: Stefan Roscher stefan.roscher at de.ibm.com please use '@' signs so these are real email addresses. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 3/3] IB/ehca: Increment version number
thanks, applied 2 3. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 0/3] IB/ehca: Perfomance improvment for creation of queue pairs
Because of this fundamental code change we will also increment the version number. So this is 2.6.31-targeted stuff I assume? (Too late in the 2.6.30 cycle for fundamental code change of course) - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/3] IB/ehca: Replace vmalloc with kmalloc
+queue-queue_pages = kmalloc(nr_of_pages * sizeof(void *), GFP_KERNEL); How big might this buffer be? Any chance of allocation failure due to memory fragmentation? - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH v5] introduce macro spin_event_timeout()
Alan did have one valid point though. Determining how long to loop for is architecture-specific. Using jiffies is bad, because even one jiffy is too long. Adding a udelay() inside the loop means that it only checks he condition every microsecond. So the real solution is to use keep looping until a certain amount of time has passed. This means using an architecture-specific timebase register. Are there really cases where spinning for 1 jiffy is too long of a timeout? It might make sense for the parameter passed in to be in terms of microseconds but I have a hard time coming up with a case where having the real timeout be 40 msecs or whatever 1 jiffy ends up being is a real problem -- after all, this helper is intended for the case where we expect the condition to become true much sooner than the worst case. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: FW: [PATCH] powerpc/mm: Export HPAGE_SHIFT
huge_page_size(page_hstate(page)) That would suit. I assume the intention is for that to be usable by driver modules on any architecture? erm, you overestimate the amount of planning and forethought which goes into these things ;) The lack of any EXPORT_SYMBOL(size_to_hstate) is a broadish hint. Heh. Looking into the implementation, it seems that I could actually do PAGE_SIZE compound_order(page) directly (since there's no reason to go from size to hstate and back to size. I don't know all the details of these VM internals, but that seems to only work on the first (small) page of a giant page? Which causes problems for what we're trying to do here... To summarize the goal, we are mapping user memory to a device that has its own page tables, where the device's page tables can also use multiple page sizes. Using big pages on the device leads to similar efficiencies as hugetlb pages do on the CPU, and in fact if a user has used hugetlb pages for the memory they're giving to the device, that's a very strong hint that the device should use big pages too. But one valid situation we have to handle in the driver is if, say, userspace has a hugetlb mapped at virtual address 0x20, and wants to map 0x8 bytes at 0x28 to the device. In that case, we're going to do essentially get_user_pages(..., 0x28, 0x8 / PAGE_SIZE, ...) and get_user_pages() is going to give us a bunch of normal PAGE_SIZE pages starting at offset 0x80 within the compound page that makes up the huge page mapped at 0x20. get_user_pages() also gives us the vma back, and we can see from is_vm_hugetlb_page() (-- BTW can I just say that a function is_xxx_page() that operates on vmas is horribly misnamed --) that these pages all come from a hugetlb mapping, but figuring out the size of that mapping is I guess a challenge. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: FW: [PATCH] powerpc/mm: Export HPAGE_SHIFT
Right, but then you need to set that in the VMA's, and thus gone is your nice fast g_u_p() that doesn't touch VMAs :-) Registering memory is a slow path thing in the RDMA world. Speeding it up is nice, so we make userspace do the madvise(VM_DONTCOPY) if it cares but if it doesn't it can leave it out. Yes, but unfortunately MPI says apps can allocate memory however they damn well please... in any case these issues are all-too-well-known in the RDMA world for quite a while. Yup. What do you think of the idea of pre-COWing pages with an elevated count at fork time ? Super-duper sucks if the first thing the child does is exec() :) Also if the parent has registered half the memory in the system then it's instant OOM. So not that useful for the RDMA case :) The one thing that might make sense is to pre-COW any partial pages that the parent has registered -- ie if half a page can be used by the child, at least pre-COW that, but leave all the full pages with VM_DONTCOPY. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
FW: [PATCH] powerpc/mm: Export HPAGE_SHIFT
Forwarding Eli's patch below, since PowerPC guys may have missed it. I guess the question for Ben et al is whether there is any issue with exporting HPAGE_SHIFT for modules (can be EXPORT_SYMBOL_GPL if you feel it's an internal detail). It would probably make sense to roll this change into the mlx4 change that Eli alludes to below and merge through my tree (with ppc maintainer acks of course), rather than splitting this patch out and introducing cross-tree dependencies (and also separating the rationale for the change from the change itself). Thanks, Roland Drivers may want to take advantage of the large pages used for memory obtained from hugetlbfs. One example is mlx4_ib which can use much less MTT entries (in the order of HPAGE_SIZE / PAGE_SIZE) when registering such memory, thus scale significantly better when registering larger memory regions. Other drivers could also benefit from this. Signed-off-by: Eli Cohen e...@mellanox.co.il --- arch/powerpc/mm/hash_utils_64.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index 8d5b475..6cff8c7 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -104,6 +104,7 @@ int mmu_highuser_ssize = MMU_SEGSIZE_256M; u16 mmu_slb_size = 64; #ifdef CONFIG_HUGETLB_PAGE unsigned int HPAGE_SHIFT; +EXPORT_SYMBOL(HPAGE_SHIFT); #endif #ifdef CONFIG_PPC_64K_PAGES int mmu_ci_restrictions; -- 1.6.1 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] infiniband/ehca: use consistent type
I'm going to apply this for 2.6.29, since the change to the u64 type went upstream, so ehca spews a bunch of warnings now. I'll also add the following patch to fix all the printk format warnings: IB/ehca: Fix printk format warnings from u64 type change Commit fe21 (powerpc: Change u64/s64 to a long long integer type) changed u64 from unsigned long to unsigned long long, which means that printk formats for printing u64 values should use ll instead of l to avoid warnings. Fix all the places affected by this in ehca. Signed-off-by: Roland Dreier rola...@cisco.com --- drivers/infiniband/hw/ehca/ehca_cq.c | 16 ++-- drivers/infiniband/hw/ehca/ehca_hca.c|2 +- drivers/infiniband/hw/ehca/ehca_irq.c| 18 ++-- drivers/infiniband/hw/ehca/ehca_main.c |6 +- drivers/infiniband/hw/ehca/ehca_mcast.c |4 +- drivers/infiniband/hw/ehca/ehca_mrmw.c | 144 +++--- drivers/infiniband/hw/ehca/ehca_qp.c | 32 drivers/infiniband/hw/ehca/ehca_reqs.c |2 +- drivers/infiniband/hw/ehca/ehca_sqp.c|2 +- drivers/infiniband/hw/ehca/ehca_tools.h |2 +- drivers/infiniband/hw/ehca/ehca_uverbs.c |2 +- drivers/infiniband/hw/ehca/hcp_if.c | 30 +++--- 12 files changed, 130 insertions(+), 130 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c index 2f4c28a..97e4b23 100644 --- a/drivers/infiniband/hw/ehca/ehca_cq.c +++ b/drivers/infiniband/hw/ehca/ehca_cq.c @@ -196,7 +196,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, if (h_ret != H_SUCCESS) { ehca_err(device, hipz_h_alloc_resource_cq() failed -h_ret=%li device=%p, h_ret, device); +h_ret=%lli device=%p, h_ret, device); cq = ERR_PTR(ehca2ib_return_code(h_ret)); goto create_cq_exit2; } @@ -232,7 +232,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, if (h_ret H_SUCCESS) { ehca_err(device, hipz_h_register_rpage_cq() failed -ehca_cq=%p cq_num=%x h_ret=%li counter=%i +ehca_cq=%p cq_num=%x h_ret=%lli counter=%i act_pages=%i, my_cq, my_cq-cq_number, h_ret, counter, param.act_pages); cq = ERR_PTR(-EINVAL); @@ -244,7 +244,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, if ((h_ret != H_SUCCESS) || vpage) { ehca_err(device, Registration of pages not complete ehca_cq=%p cq_num=%x -h_ret=%li, my_cq, my_cq-cq_number, +h_ret=%lli, my_cq, my_cq-cq_number, h_ret); cq = ERR_PTR(-EAGAIN); goto create_cq_exit4; @@ -252,7 +252,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, } else { if (h_ret != H_PAGE_REGISTERED) { ehca_err(device, Registration of page failed -ehca_cq=%p cq_num=%x h_ret=%li +ehca_cq=%p cq_num=%x h_ret=%lli counter=%i act_pages=%i, my_cq, my_cq-cq_number, h_ret, counter, param.act_pages); @@ -266,7 +266,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, gal = my_cq-galpas.kernel; cqx_fec = hipz_galpa_load(gal, CQTEMM_OFFSET(cqx_fec)); - ehca_dbg(device, ehca_cq=%p cq_num=%x CQX_FEC=%lx, + ehca_dbg(device, ehca_cq=%p cq_num=%x CQX_FEC=%llx, my_cq, my_cq-cq_number, cqx_fec); my_cq-ib_cq.cqe = my_cq-nr_of_entries = @@ -307,7 +307,7 @@ create_cq_exit3: h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 1); if (h_ret != H_SUCCESS) ehca_err(device, hipz_h_destroy_cq() failed ehca_cq=%p -cq_num=%x h_ret=%li, my_cq, my_cq-cq_number, h_ret); +cq_num=%x h_ret=%lli, my_cq, my_cq-cq_number, h_ret); create_cq_exit2: write_lock_irqsave(ehca_cq_idr_lock, flags); @@ -355,7 +355,7 @@ int ehca_destroy_cq(struct ib_cq *cq) h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 0); if (h_ret == H_R_STATE) { /* cq in err: read err data and destroy it forcibly */ - ehca_dbg(device, ehca_cq=%p cq_num=%x ressource=%lx in err + ehca_dbg(device, ehca_cq=%p cq_num=%x resource=%llx in err
Re: [PATCH] infiniband/ehca: use consistent type
Sorry to appear picky, but how is that different from the patch powerpc: cleanup from powerpc l64 to ll64 change: drivers/infiniband that I sent to you on Jan 7? Well, I didn't lose my version and forget all about it ;) Seriously, I must have lost that patch, sorry about that. I'll dig it out and replace mine so you get credit. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: mal_probe crash
Can you double check that the e1000 isn't copying the PCI resources into a unsigned long before ioremap'ing the result, thus cropping the top bits ? as far as I can see, e1000 is using pci_ioremap_bar(), which should do the right thing as long as resource_size_t is the right type (which it looks like it is on PowerPC 44x). - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] infiniband/ehca: spin_lock_irqsave takes an unsigned long
thanks, applied. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] infiniband/ehca: use consistent type
If we're going to clean this code up, does it make sense to take it further? More precisely, your patch does: @@ -226,7 +226,7 @@ u64 hipz_h_alloc_resource_eq(const struct ipz_adapter_handle adapter_handle, u32 *eq_ist) { u64 ret; - u64 outs[PLPAR_HCALL9_BUFSIZE]; + unsigned long outs[PLPAR_HCALL9_BUFSIZE]; u64 allocate_controls; but every parameter of ehca_plpar_hcall9() is unsigned long, and the return value is a signed long. So should we change ret to long and all these other declarations to unsigned long while we're touching the code here? - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] infiniband/ehca: spin_lock_irqsave takes an unsigned long
are you trying to land the 'typedef unsigned long u64' change for 2.6.28, or can these patches wait for 2.6.29? - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: replace modulus operations in flush error completion path
thanks, applied ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: Fix locking for shca_list_lock
Looks good... I'll add this for 2.6.29, since as far as I can tell this bug has been there approximately forever already. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: Fix suppression of port activation events
A previous fix introduced a regression where port activation events were dropped unconditionally if port autodetection was not enabled. Fixed. Is this a fix to IB/ehca: Remove reference to special QP in case of port activation failure? Because if so I can roll it into that patch, since Linus hasn't pulled it yet. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: remove reference to the QP in case of port activation failure
thanks, applied ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: Fix problem with max number of QPs and CQs in systems with different adapters
thanks, applied ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH]IB/ehca:reject dynamic memory add/remove
thanks, applied with a slightly expanded changelog. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/1] IB/ehca: Disallow creating UC QP with SRQ
thanks, applied -- it didn't apply to the latest tree, because of the flush CQE changes, so I merged it manually as below -- let me know if this is wrong: commit 0540bbbe455e123a1692d26205ad1a29983883b0 Author: Hoang-Nam Nguyen [EMAIL PROTECTED] Date: Fri Oct 10 14:40:39 2008 -0700 IB/ehca: Don't allow creating UC QP with SRQ This patch prevents a UC QP to be created attached to an SRQ, since current firmware does not support this feature. Signed-off-by: Michael Faath [EMAIL PROTECTED] Signed-off-by: Roland Dreier [EMAIL PROTECTED] diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index 4dbe287..40b578d 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -502,6 +502,12 @@ static struct ehca_qp *internal_create_qp( if (init_attr-srq) { my_srq = container_of(init_attr-srq, struct ehca_qp, ib_srq); + if (qp_type == IB_QPT_UC) { + ehca_err(pd-device, UC with SRQ not supported); + atomic_dec(shca-num_qps); + return ERR_PTR(-EINVAL); + } + has_srq = 1; parms.ext_type = EQPT_SRQBASE; parms.srq_qpn = my_srq-real_qp_num; ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] ib/ehca: add flush CQE generation
thanks, queued for 2.6.28. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH] powerpc: Avoid integer overflow in page_is_ram()
Commit 8b150478 (ppc: make phys_mem_access_prot() work with pfns instead of addresses) fixed page_is_ram() in arch/ppc to avoid overflow for addresses above 4G on 32-bit kernels. However arch/powerpc's page_is_ram() is missing the same fix -- it computes a physical address by doing pfn PAGE_SHIFT, which overflows if pfn corresponds to a page above 4G. In particular this causes pages above 4G to be mapped with the wrong caching attribute; for example many ppc440-based SoCs have PCI space above 4G, and mmap()ing MMIO space may end up with a mapping that has caching enabled. Fix this by working with the pfn and avoiding the conversion to physical address that causes the overflow. This patch compares the pfn to max_pfn, which is a semantic change from the old code -- that code compared the physical address to high_memory, which corresponds to max_low_pfn. However, I think that was is another bug, since highmem pages are still RAM. Reported-by: vb [EMAIL PROTECTED] Signed-off-by: Roland Dreier [EMAIL PROTECTED] Acked-by: Benjamin Herrenschmidt [EMAIL PROTECTED] --- Paul, didn't see this in your list... please add for 2.6.28. arch/powerpc/mm/mem.c |5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c index 1c93c25..98d7bf9 100644 --- a/arch/powerpc/mm/mem.c +++ b/arch/powerpc/mm/mem.c @@ -75,11 +75,10 @@ static inline pte_t *virt_to_kpte(unsigned long vaddr) int page_is_ram(unsigned long pfn) { - unsigned long paddr = (pfn PAGE_SHIFT); - #ifndef CONFIG_PPC64 /* XXX for now */ - return paddr __pa(high_memory); + return pfn max_pfn; #else + unsigned long paddr = (pfn PAGE_SHIFT); int i; for (i=0; i lmb.memory.cnt; i++) { unsigned long base; ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] powerpc: Avoid integer overflow in page_is_ram()
#ifndef CONFIG_PPC64 /* XXX for now */ - return paddr __pa(high_memory); + return pfn max_pfn; #else + unsigned long paddr = (pfn PAGE_SHIFT); seems like this could be a phys_addr_t Yes, it could I guess, but that would be an unrelated change, and I'm not sure there's much point given this is in 64-bit-only code. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: demuxing irqs
Can someone point me at a simple example of how to demux irqs using the powerpc irq functions? I have eight devices on a single irq and I want to turn them into virtual irqs. Sorry about the previous reply. Anyway, what are you going to demux based on? Do you have some other signal you can read in the interrupt dispatch code that tells you which device raised the interrupt? What happens if two devices raise an interrupt at the same time? If you just have 8 interrupt lines wire-ORed together then you probably just need to register your interrupt handlers with IRQF_SHARED and run 8 interrupt handlers on an interrupt. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: demuxing irqs
The muxed interrupts are inside a SOC CPU. For example eight GPIOs can each individually be enabled to trigger hardware interrupt 7. When I get hw interrupt 7 i want to demux it into 8 virtual interrupts. There are eight bit registers for individually acking, enabling, etc each of the eight multiplexed interrupts. With eight virutal interrupts each user can register a different handler and isn't aware the muxing is going on. I see... well, assuming all the issues around handling multiple simultaneous (or overlapping) interrupts are OK, you could look at how arch/powerpc/sysdev/uic.c handles the cascaded interrupt controllers for 4xx SoCs. The idea of that code is that you get an interrupt, and look look at interrupt cause register 0, and see interrupt X, and to handle interrupt X you read interrupt cause register 1 to see which the real interrupt is. And you might see interrupt Y, which means to go onto interrupt cause register 2. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/9] powerpc/44x: Add PowerPC 44x simple platform support
You should fix 52xx with the same for loop change then, since I blatantly most of this file from you ;) Heh, then factor out mpc5200_simple_probe() into a helper and use it instead of copying it as ppc44x_probe? ;) If you stick to the NULL-terminated array version, then it becomes easy to convert some other platform probing code to use your new code too, eg tqm85xx_probe() could use an array too. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/9] powerpc/44x: Add PowerPC 44x simple platform support
Except that logically doesn't make much sense. Why would you have a list of mpc52xx and 44x boards together? They require completely different kernels because the MMU and drive set is entirely different. Or am I totally missing what you are saying? Yeah, I wasn't clear -- I meant to add a new helper like of_flat_dt_is_compatible_list() (not sure of the name) that takes a node and a NULL-terminated array of strings, and then mpc5200_simple_probe() can become a one-liner, along with mpc5121_generic_probe(), tqm85xx_probe(), ppc44x_probe(), etc. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/9] powerpc/44x: Add PowerPC 44x simple platform support
Yeah, I wasn't clear -- I meant to add a new helper like of_flat_dt_is_compatible_list() (not sure of the name) that takes a node and a NULL-terminated array of strings, and then mpc5200_simple_probe() can become a one-liner, along with mpc5121_generic_probe(), tqm85xx_probe(), ppc44x_probe(), etc. I worry about doing that though. I don't want to give people the impression that these boards are of compatible. I want it to be more this file supports these explicit platforms and makes no claim on compatibility between them. It's no big deal really -- just my knee-jerk reaction to seeing a comment like this should also be fixed in some other place because I copied the code from there. My reaction to that is always, why not use one common copy of the code then? You could call it something like of_flat_dt_match_list() if that makes more sense to you. Or just copy the for loop around, it's not much code. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 5/5 try2] ib/ehca: discard double CQE for one WR
thanks, applied all 5. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] isdn: mISDN HFC PCI support depends on virt_to_bus()
IMHO, this driver was really rushed in and that was a huge mistake. If it had gone through linux-next we could have tidied all of this stuff up in a less rushed manner. ?? This thread *is* about a build failure in linux-next. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 2/2] IB/ehca: Default value for Local CA ACK Delay
thanks, applied 1 and 2 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [patch 9/9] powerpc/cell: Add DMA_ATTR_STRONG_ORDERING dma attribute and use in IOMMU code
Sorry for the late comments, I missed this when it went by before. +DMA_ATTR_STRONG_ORDERING +-- + +DMA_ATTR_STRONG_ORDERING specifies that previous reads and writes are +performed in the order in which they're received by the IOMMU; thus +reads and writes may not pass each other. I don't understand what this is trying to say. What is previous referring to? What does received by the IOMMU mean -- do you mean issued onto the bus by the CPU? When you say reads and writes may not pass each other, do you mean just that reads may not pass writes and writes may not pass reads, or do you mean that reads also can't pass reads and writes can't pass writes? Since I don't know exactly what this attribute does, I can't be sure, but it seems that making weak ordering the default is dangerous in that it breaks drivers that expect usual memory ordering semantics. Would it be safer/better to make strong ordering the default and then add a WEAK_ORDERING attribute that drivers can use as an optimization? - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [patch 9/9] powerpc/cell: Add DMA_ATTR_STRONG_ORDERING dma attribute and use in IOMMU code
This is all about inbound transfers, i.e. DMAs coming from the I/O bridge into the CPU, both DMA read and DMA write. OK -- at a minimum I think the documentation should make that clear. As it stands the description of what this does is pretty much impossible to parse and understand. Strong ordering is only active when both the bridge and the IOMMU enable it, but for correctly written drivers, this only results in a slowdown. So when would someone use this dma attribute? As a hack to fix drivers where the real fix is too complicated? - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: Make device table externally visible
thanks, applied ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH REPOST #2] IB/ehca: In case of lost interrupts, trigger EOI to reenable interrupts
ok, I queued this for 2.6.27. thanks everyone ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: Reject recv WRs if QP is in RESET state
thanks, applied. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH REPOST #2] IB/ehca: In case of lost interrupts, trigger EOI to reenable interrupts
During corner case testing, we noticed that some versions of ehca do not properly transition to interrupt done in special load situations. This can be resolved by periodically triggering EOI through H_EOI, if eqes are pending. Signed-off-by: Stefan Roscher [EMAIL PROTECTED] --- As firmware team suggested I moved the call of the EOI h_call into the handler function, this ensures that we will call EOI only when we find a valid eqe on the event queue. Additionally I changed the calculation of the xirr value as Roland suggested. paulus / benh -- does this version still get your ack? Seems that fw team is OK with it according to Stefan... If so I will add this to my tree for 2.6.27. diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c index ce1ab05..0792d93 100644 --- a/drivers/infiniband/hw/ehca/ehca_irq.c +++ b/drivers/infiniband/hw/ehca/ehca_irq.c @@ -531,7 +531,7 @@ void ehca_process_eq(struct ehca_shca *shca, int is_irq) { struct ehca_eq *eq = shca-eq; struct ehca_eqe_cache_entry *eqe_cache = eq-eqe_cache; -u64 eqe_value; +u64 eqe_value, ret; unsigned long flags; int eqe_cnt, i; int eq_empty = 0; @@ -583,8 +583,13 @@ void ehca_process_eq(struct ehca_shca *shca, int is_irq) ehca_dbg(shca-ib_device, No eqe found for irq event); goto unlock_irq_spinlock; -} else if (!is_irq) +} else if (!is_irq) { +ret = hipz_h_eoi(eq-ist); +if (ret != H_SUCCESS) +ehca_err(shca-ib_device, + bad return code EOI -rc = %ld\n, ret); ehca_dbg(shca-ib_device, deadman found %x eqe, eqe_cnt); +} if (unlikely(eqe_cnt == EHCA_EQE_CACHE_SIZE)) ehca_dbg(shca-ib_device, too many eqes for one irq event); /* enable irq for new packets */ diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c index 5245e13..415d3a4 100644 --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -933,3 +933,13 @@ u64 hipz_h_error_data(const struct ipz_adapter_handle adapter_handle, r_cb, 0, 0, 0, 0); } + +u64 hipz_h_eoi(int irq) +{ +unsigned long xirr; + +iosync(); +xirr = (0xffULL 24) | irq; + +return plpar_hcall_norets(H_EOI, xirr); +} diff --git a/drivers/infiniband/hw/ehca/hcp_if.h b/drivers/infiniband/hw/ehca/hcp_if.h index 60ce02b..2c3c6e0 100644 --- a/drivers/infiniband/hw/ehca/hcp_if.h +++ b/drivers/infiniband/hw/ehca/hcp_if.h @@ -260,5 +260,6 @@ u64 hipz_h_error_data(const struct ipz_adapter_handle adapter_handle, const u64 ressource_handle, void *rblock, unsigned long *byte_count); +u64 hipz_h_eoi(int irq); #endif /* __HCP_IF_H__ */ -- 1.5.5 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH REPOST] IB/ehca: In case of lost interrupts, trigger EOI to reenable interrupts
During corner case testing, we noticed that some versions of ehca do not properly transition to interrupt done in special load situations. This can be resolved by periodically triggering EOI through H_EOI, if eqes are pending. So just to be clear: this is a workaround for a hardware/firmware bug? - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH REPOST] IB/ehca: In case of lost interrupts, trigger EOI to reenable interrupts
So just to be clear: this is a workaround for a hardware/firmware bug? Yes it is. OK, so paulus et al... does it seem like a good approach to call H_EOI from driver code (given that this driver makes tons of other hcalls)? How critical is this? Since you said corner case testing I suspect we can defer this to 2.6.27 and maybe get it into -stable later? Also, out of curiousity: +u64 hipz_h_eoi(int irq) +{ +int value; +unsigned long xirr; + +iosync(); what is the iosync() required for here? +value = (0xff 24) | irq; +xirr = value 0x; given that irq and value are ints, is there any possible way value could have bits outside of the low 32 set? If you're worried about sign extension isn't it simpler to just make value unsigned? +return plpar_hcall_norets(H_EOI, xirr); +} ie why not: u64 hipz_h_eoi(int irq) { unsigned xirr; iosync(); xirr = (0xff 24) | irq; return plpar_hcall_norets(H_EOI, xirr); } ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
me too. That's the whole basis for readX_relaxed() and its cohorts: we make our weirdest machines (like altix) conform to the x86 norm. Then where it really kills us we introduce additional semantics to selected drivers that enable us to recover I/O speed on the abnormal platforms. Except as I pointed out before, Altix doesn't conform to the norm and many (most?) drivers are missing mmiowb()s that are needed for Altix. Just no one has plugged most devices into an Altix (or haven't stressed the driver in a way that exposes problems of IO ordering between CPUs). It would be a great thing to use the powerpc trick of setting a flag that is tested by spin_unlock()/mutex_unlock() and automatically doing the mmiowb() if needed, and then killing off mmiowb() entirely. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 2/2] IB/ehca: In case of lost interrupts, trigger EOI to reenable interrupts
+if (desc-chip desc-chip-eoi) +desc-chip-eoi(irq); This doesn't seem like the type of thing a device driver should be doing. Both patches are rather short on changelog text -- what is the issue here and how does this fix it? - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: IB/ehca: Reject send WRs only for RESET, INIT and RTR state
thanks, applied. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
The problem is that your two writel's, despite being both issued on cpu X, due to the spin lock, in your example, can end up with the first one going through NR 1 and the second one going through NR 2. If there's contention on NR 1, the write going via NR 2 may hit the PCI bridge prior to the one going via NR 1. Really?? I can't see how you can expect any drivers to work reliably if simple code like writel(reg1); writel(reg2); might end up with the write to reg2 hitting the device before the write to reg1. Almost all MMIO stuff has ordering requirements and will totally break if you allow, say, the start IO write to be reordered before the set IO address write. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Actually, this specifically should not be. The need for mmiowb on altix is because it explicitly violates some of the PCI rules that would otherwise impede performance. The compromise is that readX on altix contains the needed dma flush but there's a variant operator, readX_relaxed that doesn't (for drivers that know what they're doing). The altix critical drivers have all been converted to use the relaxed form for performance, and the unconverted ones should all operate just fine (albeit potentially more slowly). Is this a recent change? Because as of October 2007, 76d7cc03 (IB/mthca: Use mmiowb() to avoid firmware commands getting jumbled up) was needed. But this was involving writel() (__raw_writel() actually, looking at the code), not readl(). But writel_relaxed() doesn't exist (and doesn't make sense). - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Um, OK, you've said write twice now ... I was assuming you meant read. Even on an x86, writes are posted, so there's no way a spin lock could serialise a write without an intervening read to flush the posting (that's why only reads have a relaxed version on altix). Or is there something else I'm missing? Writes are posted yes, but not reordered arbitrarily. If I have code like: spin_lock(mmio_lock); writel(val1, reg1); writel(val2, reg2); spin_unlock(mmio_lock); then I have a reasonable expectation that if two CPUs run this at the same time, their writes to reg1/reg2 won't be interleaved with each other (because the whole section is inside a spinlock). And Altix violates that expectation. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Writes are posted yes, but not reordered arbitrarily. on standard x86 I mean here... ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Writes are posted yes, but not reordered arbitrarily. If I have code like: spin_lock(mmio_lock); writel(val1, reg1); writel(val2, reg2); spin_unlock(mmio_lock); then I have a reasonable expectation that if two CPUs run this at the same time, their writes to reg1/reg2 won't be interleaved with each other (because the whole section is inside a spinlock). And Altix violates that expectation. Does that necessarily follow? If you've got a large system with multiple pci bridges, could you end up with posted writes coming from different cpus taking a different amount of time to propagate to a device and thus colliding? Not on x86. And a given PCI device can only be reached from a single host bridge, so I don't see how it can happen. But on SGI altix systems, there is a routed fabric between the CPU and the PCI bus, so the reordering can happen there. Hence mmiowb() and the endless supply of driver bugs that it causes. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
This is a different issue. We deal with it on powerpc by having writel set a per-cpu flag and spin_unlock() test it, and do the barrier if needed there. Cool... I assume you do this for mutex_unlock() etc? Is there any reason why ia64 can't do this too so we can kill mmiowb and save everyone a lot of hassle? (mips, sh and frv have non-empty mmiowb() definitions too but I'd guess that these are all bugs based on misunderstandings of the mmiowb() semantics...) However, drivers such as e1000 -also- have a wmb() between filling the ring buffer and kicking the DMA with MMIO, with a comment about this being needed for ia64 relaxed ordering. I put these barriers into mthca, mlx4 etc, although it came from my possible misunderstanding of the memory ordering rules in the kernel more than any experience of problems (as opposed the the mmiowb()s, which all came from real world bugs). - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Cool... I assume you do this for mutex_unlock() etc? That's a good point... I don't think we do. Maybe we should. I think it's needed -- take a look at 76d7cc03, which came from a real bug seen on Altix boxes. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [ewg] [PATCH] IB/ehca: Protect QP against destroying until all async events for it are handled.
We are not sure if this should be fixed in the driver or in uverbs itself. Roland, what's your opinion about this? Would be nice to be able to fix it in uverbs but I don't see how. In particular a kernel consumer has to have the same guarantee that no async events will come in after destroy QP returns. And I don't see any way generic code can provide a guarantee about what low-level driver code may do internally. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [ewg] [PATCH] IB/ehca: Protect QP against destroying until all async events for it are handled.
I agree, that's why I posted the driver fix first. Glad we agree :) I thought about it a little more and really convinced myself that there is no good way for generic code to handle this race. So, will you apply it next? Yes, will apply it shortly. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [ewg] [PATCH] IB/ehca: Protect QP against destroying until all async events for it are handled.
So I applied this, but thinking about it further, do you (theoretically at least) have the same problem with CQ and SRQ async events? - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [ewg] [PATCH] IB/ehca: Change function return types to correct type.
thanks, applied ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: Allocate event queue size depending on max number of CQs and QPs
Signed-off-by: Stefan Roscher stefan.roscher at de.ibm.com Kind of an inadequate changelog ;) Is this a fix or an enhancement or what? +if (atomic_read(shca-num_cqs) = ehca_max_cq) { +if (atomic_read(shca-num_qps) = ehca_max_qp) { These are racy in the sense that multiple simultaneous calls to create_cq/create_qp might end up exceeding the ehca_max_cq limit. Is that an issue? You could close the race by using atomic_add_unless() and testing the return value (and being careful to do atomic_dec() on error paths after you bump num_cqs/num_qps). - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [REPOST][PATCH] IB/ehca: Allocate event queue size depending on max number of CQs and QPs
thanks, makes sense, applied. fast turnaround too ;) ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: IB/ehca: handle negative return value from ibmebus_request_irq() properly in ehca_create_eq()
thanks, applied ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: Prevent sending UD packets to QP0
thanks, applied ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 2/2] IB/ehca: Add PMA support
thanks, applied 1-2 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: Reminder: removal of arch/ppc
Hm. Katmai is already in-tree. Theoretically I should be able to get the HW docs and do a new DTS and have it mostly just work. Perhaps I'll give that a shot and have you give it a quick spin. Yep, should be pretty easy. A while back I built a Katmai kernel in arch/ppc based on the Yucca support and the differences were pretty trivial -- as I recall, the Katmai has compact flash support that the Yucca doesn't have, and the Yucca has a simple little FPGA with a few registers that controls power to the PCI slots. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: Reminder: removal of arch/ppc
YUCCA Yucca was what again? 440spe? Yes, it was the first 440SPe eval board; since replaced by Katmai. I have a Yucca and can at least test thing; if no one else gets to it, I may try to port it to arch/powerpc for 2.6.26. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 4/4] IB/ehca: Prevent RDMA-related connection failures
thanks, applied 1-4. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [ofa-general] [PATCH] IB/ehca: Forward event client-reregister-required to registered clients
thanks, applied. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: Fix lock flag location, bump version number
applied, thanks ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [ewg] Re: [PATCH] IB/ehca: Serialize HCA-related hCalls on POWER5
What is the fix you suggest, to add a device query that tells you for which verbs the documentation does not apply? or enhance the code of the map_phys_fmr verb within the ehca driver to return error if called from non-sleepable context? I think the right fix for iSER would be to make iSER work even for devices that don't support FMRs. For example cxgb3 doesn't implement FMRs so if anyone ever updates iSER to work on iWARP and not just IB, then this is something that has to be tackled anyway. Then ehca could just get rid of the FMR support it has. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: Return correct #SGEs for SRQ
thanks, applied. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: Serialize HCA-related hCalls if necessary
thanks, applied. With your next batch of patches for 2.6.25, could you clean up: --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -89,6 +89,7 @@ #define HCALL9_REGS_FORMAT HCALL7_REGS_FORMAT r11=%lx r12=%lx static DEFINE_SPINLOCK(hcall_lock); +extern int ehca_lock_hcalls; and move that extern declaration into an appropriate header file? Thanks... ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: Serialize HCA-related hCalls on POWER5
map_phys_fmr In fact, we do use hCalls there. Our hardware doesn't actually support FMRs, so we translate a map FMR into a reallocate PMR, which doesn't work without hCalls. What's more, the hCalls involved (e.g. H_FREE_RESOURCE) might well return H_LONG_BUSY, so the whole operation might sleep; no way around it. It's a big problem. If you cannot implement FMRs in such a way that you can handling having map_phys_fmr being called in a context that can't sleep, then I think the only option is to remove your FMR support. It's an optional device feature, so this should be OK (although the iSER driver currently seems to depend on a device supporting FMRs, which is probably going to be a problem with iWARP support in the future anyway). The fact that consumers can map FMRs from interrupt context, while holding locks, etc, is pretty fundamental to the use of FMRs so I don't see any way around the requirement that map_phys_fmr never sleep. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: Serialize HCA-related hCalls on POWER5
I think it needs some more inspection. The msleep in there is only called for hcalls that return H_IS_LONG_BUSY(). In theory, you can call ehca_plpar_hcall_norets() from inside an interrupt handler if the hcall in question never returns long busy. Fair enough... according to Documentation/infiniband/core_locking.txt, the only driver methods that cannot sleep are: create_ah modify_ah query_ah destroy_ah bind_mw post_send post_recv poll_cq req_notify_cq map_phys_fmr and I don't think ehca does an hcall from any of those. Of course there might be other driver-internal code paths that I don't know about. Maybe do a quick audit and then stick might_sleep() in the hcall functions to catch any mistakes? - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: Serialize HCA-related hCalls on POWER5
+ ehca_lock_hcalls = !(cur_cpu_spec-cpu_user_features + PPC_FEATURE_ARCH_2_05); We already talked about this yesterday, but I still feel that checking the instruction set of the CPU should not be used to determine whether a specific device driver implementation is used int hypervisor. I had the same reaction... is testing cpu_user_features really the best way to detect this issue? I'll hold off applying this for a few days so you guys can decide the best thing to do. We'll definitely get some fix into 2.6.24 but we have time to make a good decision. Regarding the performance problem, have you checked whether converting all your spin_lock_irqsave to spin_lock/spin_lock_irq improves your performance on the older machines? Maybe it's already fast enough that way. It does seem that the only places that the hcall_lock is taken also use msleep, so they must always be in process context. So you can safely just use spin_lock(), right? - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: Fix static rate if path faster than link
thanks, applied ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: Fix static rate regression
thanks, applied. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 2/2] IB/ehca: Fix static rate calculation
thanks, applied both patches. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 4/5] ibmebus: Move to of_device and of_platform_driver, match eHCA and eHEA drivers
Replace struct ibmebus_dev and struct ibmebus_driver with struct of_device and struct of_platform_driver, respectively. Match the external ibmebus interface and drivers using it. Signed-off-by: Joachim Fenkes [EMAIL PROTECTED] If not, then you need to get an Acked-by and an agreement that this change can go via the powerpc.git tree from Roland Dreier and Jeff Garzik. I don't see anything objectionable in the infiniband parts of the patch -- I don't have any way to test the changes but it all looks like a straightforward conversion to a new platform API. So: Acked-by: Roland Dreier [EMAIL PROTECTED] - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 3/3] IB/ehca: Make sure user pages are from hugetlb before using MR large pages
thanks, applied this and the umem patch... ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 2/3] IB/umem: Add hugetlb flag to struct ib_umem
This looks realy nice to me... a very clean patch. I'll add this to 2.6.24 unless someone objects soon... ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/3] IB/ehca: Fix large page HW cap defines
obviously OK...applied. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 02/12] IB/ehca: Add 1 is not longer needed because of firmware interface change
What happens if someone runs the new driver with older firmware? Or what if someone upgrades the firmware without updating the driver? - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] IB/ehca: Make sure user pages are from hugetlb before using MR large pages
-#define HCA_CAP_MR_PGSIZE_4K 1 -#define HCA_CAP_MR_PGSIZE_64K 2 -#define HCA_CAP_MR_PGSIZE_1M 4 -#define HCA_CAP_MR_PGSIZE_16M 8 +#define HCA_CAP_MR_PGSIZE_4K 0x8000 +#define HCA_CAP_MR_PGSIZE_64K 0x4000 +#define HCA_CAP_MR_PGSIZE_1M 0x2000 +#define HCA_CAP_MR_PGSIZE_16M 0x1000 Not sure I understand what this has to do with things... is this an unrelated fix? +static int ehca_is_mem_hugetlb(unsigned long addr, unsigned long size) This is rather awful -- another call to get_user_pages() to iterate over all the vmas... I would suggest extending ib_umem_get() to check the vmas and adding a member to struct ib_umem to say whether the memory is entirely covered by hugetlb pages or not. +ret = ehca_is_mem_hugetlb(virt, length); +switch (ret) { +case 0: /* mem is not from hugetlb */ +hwpage_size = PAGE_SIZE; +break; +case 1: +if (length = EHCA_MR_PGSIZE4K + PAGE_SIZE == EHCA_MR_PGSIZE4K) +hwpage_size = EHCA_MR_PGSIZE4K; +else if (length = EHCA_MR_PGSIZE64K) +hwpage_size = EHCA_MR_PGSIZE64K; +else if (length = EHCA_MR_PGSIZE1M) +hwpage_size = EHCA_MR_PGSIZE1M; +else +hwpage_size = EHCA_MR_PGSIZE16M; +break; +default: /* out of mem */ +ib_mr = ERR_PTR(-ENOMEM); +goto reg_user_mr_exit1; It seems like it would be better to just assume the memory is not from a hugetlb is ehca_is_mem_hugetlb() fails its memory allocation and fall back to the PAGE_SIZE case rather than failing entirely. Also if someone runs a kernel with 64K pages on a machine where they end up being simulated from 4K pages, do you have the same issue with the hypervisor ganging together non-contiguous pages? - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [ofa-general] [PATCH 7/7] IB/ehca: Prevent overwriting QP init attributes given by caller
I don't understand this patch. rdma/ib_verbs.h says this about ib_create_qp(): * @qp_init_attr: A list of initial attributes required to create the * QP. If QP creation succeeds, then the attributes are updated to * the actual capabilities of the created QP. So it seems the current code is actually correct and your patch breaks it?? ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [ofa-general] [PATCH 1/2] ehca: remove checkpatch.pl's warnings externs should be avoided in .c files
Was going to recreate this patch, but then I saw that you probably have incorporated it (manually) in your latest git. Just want to make sure I'm seeing it right. Yes, I ended up doing it by hand. Thanks. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 2/5] ehca: Generate event when SRQ limit reached
thanks, applied. BTW, does your SRQ-capable hardware support generating the last WQE reached event? There's not any reliable way to avoid problems when destroying QPs attached to an SRQ without it, and the IB spec requires CAs that support SRQs to generate it (o11-5.2.5 in chapter 11 of vol 1). I don't see any code in ehca to generate the event, and IPoIB CM at least will be very unhappy when using SRQs if the event is not generated. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 3/5] ehca: Make ehca2ib_return_code() non-inline
thanks, applied ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [ofa-general] [PATCH 5/5] ehca: Support small QP queues
thanks, applied. I fixed this up myself to work with commit 20c2df83, which got rid of the destructor argument to kmem_cache_create() -- you probably want to check my tree to make sure it's OK. Also the same as I said before about checkpatch.pl's warning: WARNING: externs should be avoided in .c files #337: FILE: drivers/infiniband/hw/ehca/ehca_pd.c:91: + extern struct kmem_cache *small_qp_cache; please fix that up when you get a chance ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 01/10] IB/ehca: Support for multiple event queues
Here's some anecdotal evidence :) http://lists.openfabrics.org/pipermail/general/2007-May/035758.html Right, but then we went on to say that we probably want to use multiple vectors to separate out multiple HCA ports rather than send/sreceive on the same port. And the current IPoIB implementation of having that second CQ seems suboptimal anyway, since it seems to leave us susceptible to the interrupt overload that NAPI was supposed to solve. At a higher level, I'm left wondering why nobody talked about multiple EQs during the last months of the 2.6.22 process and now all of a sudden it becomes urgent in the last few days of the 2.6.23 merge window. That's not really how I like to merge features - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev