MPC8315 PCI express lockup
(I apologise for this not having much to do with linux...) We have a system with an MPC8315 ppc running linux 2.6.32 that uses the PCI express interface in RC mode to interface to an Altera FPGA. This uses both PIO and the PEX DMA interfaces (locally written dma driver). Under normal circumstances this all works fine. However under some circumstances (eg DMA reads from addresses that don't have actual slaves on the fpga [1]) the dma transfer requests don't complete. There are no obvious error bits set in the hisr or csmisr registers and the csb_status shows 'dma in progress'. The dma transfer itself can be cancelled (by setting the SUS bit in the dma_ctrl register), and the relevant status bit are set to show the transfer has been aborted. Once in this state all further PCI express transfers fail. DMA requests timeout (driver gives up waiting for completion) and PIO requests fault (Oops: Machine check, sig 7 [#1]) locking the kernel solid. This looks very much like the MPC8315's errata PEX7 except that I don't see the CSMISR[RST] bit set. I'm not at all sure the recovery for that errata is actually writable! I'm certainly not going to write it just to find out if it would help. In any case it is quite likely that the driver's ISR will try to do a PIO read while a dma transfer is timing out. We can look at the fpga side and possibly find out what it is doing, but it would be useful to know more about the status on the ppc side. I presume there is a way of doing a 'probe' type memory cycle that won't panic on a fault? Although that may not help me keep linux running as the ISR needs to to a PCIe write to remove the level-sensitive interrupt. Any thought on ways to progress? David [1] PIO reads are ok and just return 0x. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: pread() and pwrite() system calls
A special pread/pwrite asm stub that just copies r7 to r0 could be used. Would it be enough to do: syscall_pread_pwrite: mov 0,7 sc blr and handle the -ve - errno in C? Huh? Won't fly, r0 is used for the system call number! I was copying that from r7! Actually I have a much better stub by copying the one used for mmap(). The system call itself is fine. Using the system call almost halved the time taken for a 4-byte read. On the other hand, I believed PPC had no problems passing up to 8 32 bit arguments in registers (r3 to r10), but I may be confusing with the standard ABI for function calls. Hmm, a quick look at kernel/entry_32.s shows that it should be able to use at least r3 to r8, which should be sufficient. I think that it is an uClibc problem. True, in that it isn't a kernel bug. OTOH the kernel is likely to get blamed for non-atomic pread. I've found the same user-space code in newlib as well. glibc may be ok, some code I've found implies it only uses the 'emulation' when the system call returns ENOSYS. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
pread() and pwrite() system calls
We have a system with linux 2.6.32 and the somewhat archaic uClibc 0.9.27 (but I'm not sure the current version is any better, and I think there are binary compatibility if we update). I've just discovered that pread() is 'implemented' by using 3 lseek() system calls and read(). (the same is true for the 64bit versions). I thought that pread() was supposed to be atomic (so usable concurrently by multiple threads) which means that this implementation is completely broken. I've not looked to see what glibc does. I can see that part of the problem is the alignment of the 64bit value on the argument list of syscall() (when the register save area is cast to a sycall argument structure). But it also looks as though the uClibc syscall() stub can only pass 5 arguments in registers, and pread() (with an aligned 64bit offset) requires 6. The ucLibc source seems to be predicated by __NR_pread, but if that were defined it would try to call __syscall_pread() and I can't find that anywhere. A special pread/pwrite asm stub that just copies r7 to r0 could be used. Would it be enough to do: syscall_pread_pwrite: mov 0,7 sc blr and handle the -ve - errno in C? I've seen other parts of linux pass structures to avoid issues with excessive number of arguments. Unlike things like NetBSD amd64 which will read additional args off the user stack (and has per-system call stubs in libc so doesn't lose one argument). (I rearranged the NetBSD amd64 kernel trap frame layout so all the args ended up adjacent...) David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH] powerpc/windfarm: don't pass const strings to snprintf
--- a/drivers/macintosh/windfarm_smu_sat.c +++ b/drivers/macintosh/windfarm_smu_sat.c @@ -287,7 +287,7 @@ static int wf_sat_probe(struct i2c_client *client, sens-sat = sat; sens-sens.ops = wf_sat_ops; sens-sens.name = (char *) (sens + 1); - snprintf(sens-sens.name, 16, %s-%d, name, cpu); + snprintf((char *)sens-sens.name, 16, %s-%d, name, cpu); if (wf_register_sensor(sens-sens)) kfree(sens); Wouldn't it be better to do: snprintf((char *)(sens + 1), 16, %s-%d, name, cpu); David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [REGRESSION][PATCH V4 1/3] bpf jit: Make the filter.c::__load_pointer helper non-static for the jits
The function is renamed to make it a little more clear what it does. It is not added to any .h because it is not for general consumption, only for bpf internal use (and so by the jits). I'd have thought it better to put in into a bfp_internal.h (or similar) with a big warning there about the asm users. Possibly even worth adding some other defs that the asm files will need (if there are any). David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH v2] powerpc/srio: Fix the compile errors when building with 64bit
diff --git a/arch/powerpc/sysdev/fsl_rmu.c b/arch/powerpc/sysdev/fsl_rmu.c index 1548578..1bba6d1 100644 --- a/arch/powerpc/sysdev/fsl_rmu.c +++ b/arch/powerpc/sysdev/fsl_rmu.c @@ -657,7 +658,8 @@ fsl_add_outb_message(struct rio_mport *mport, struct rio_dev *rdev, int mbox, int ret = 0; pr_debug(RIO: fsl_add_outb_message(): destid %4.4x mbox %d buffer \ - %8.8x len %8.8x\n, rdev-destid, mbox, (int)buffer, len); + %8.8lx len %8.8zx\n, rdev-destid, mbox, + (unsigned long)buffer, len); Should 'buffer' be printed with %p ?? David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 4/4] powerpc/mpc8548: Add workaround for erratum NMG_SRIO135
Issue: Applications using lwarx/stwcx instructions in the core to compete for a software lock or semaphore with a device on RapidIO using read atomic set, clr, inc, or dec in a similar manner may falsely result in both masters seeing the lock as available. This could result in data corruption as both masters try to modify the same piece of data protected by the lock. Workaround: Set bits 13 and 29 of CCSR offset 0x01010 (EEBPCR register of the ECM) during initialization and leave them set indefinitely. This may slightly degrade overall system performance. Might be worth actually saying what these bits do, and why/when overall performance is affected. Is the problem trying to do locked read-write cycles on a slow peripheral bus? Might be a case for just 'not doing that'. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH] powerpc/srio: Fix the compile errors when building with 64bit
For the file arch/powerpc/sysdev/fsl_rmu.c, there will be some compile errors while using the corenet64_smp_defconfig: I'm sure that replacing 'u32' with 'unsigned long' is really the best thing to do here. It looks to me as though they should be some pointer type. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 20/21] Introduce struct eeh_stats for EEH
+struct eeh_stats { + unsigned int no_device; /* PCI device not found */ ... + no device =%d\n ... Use %u (for all the stats), you really don't want negative values printed. I've NFI how long wrapping these counters might take! If it is feasable (maybe much above 100Hz) then you need 64bit counters. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 20/21] Introduce struct eeh_stats for EEH
+/* + * The struct is used to maintain the EEH global statistic + * information. Besides, the EEH global statistics will be + * exported to user space through procfs + */ +struct eeh_stats { + unsigned long no_device;/* PCI device not found */ + unsigned long no_dn;/* OF node not found */ + unsigned long no_cfg_addr; /* Config address not found */ + unsigned long ignored_check;/* EEH check skipped */ + unsigned long total_mmio_ffs; /* Total EEH checks */ + unsigned long false_positives; /* Unnecessary EEH checks */ + unsigned long slot_resets; /* PE reset */ +}; Why 'unsigned long', surely either 'unsigned int' or a fixed-width type. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: ARM + TI DSP device tree repesentation
I am trying to locate the device tree for OMAP platform which has ARM and TI DSP core. The background is that we are trying to understand how such SOCs with dissimilar cores are represented in linux. Most likely linux runs on the ARM, and the DSP processor is loaded with specific code to perform certain functions. As such the DSP is probably a 'peripheral' not a cpu. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH] powerpc: POWER7 optimised copy_to_user/copy_from_user using VMX
One idea would be to have a structure of function pointers for each CPU that gets runtime patched into the right places, similar to how we do some of the MMU fixups. Sounds good to me :-) Except the indirect jump/call is almost certainly never predicted - so will be slow. You might want to patch jump instructions instead. The same is true for in-kernel memcpy() and other similar operations. I actually wonder sometimes what the typical lengths are for these sort of functions, and whether, in fact, small lengths dominate - where the fixed costs matter. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: Re: [PATCHv5] atomic: add *_dec_not_zero
Looking at this: #ifndef atomic_inc_unless_negative static inline int atomic_inc_unless_negative(atomic_t *p) { int v, v1; for (v = 0; v = 0; v = v1) { v1 = atomic_cmpxchg(p, v, v + 1); if (likely(v1 == v)) return 1; } return 0; } #endif why is it optimised for '*p' being zero?? I'd have though the initial assignment to 'v' should be made by reading '*p' without any memory barriers (etc). David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 5/6] powerpc/boot: Add mfdcrx
+#define mfdcrx(rn) \ + ({ \ + unsigned long rval; \ + asm volatile(mfdcrx %0,%1 : =r(rval) : g(rn)); \ + rval; \ + }) g is never correct on PowerPC, you want r here. You can write this as a static inline btw, you only need the #define stuff when there is an i constraint involved. I think you can still use static inlines even when a constraint is one that requires a compile time constant. eg (not ppc, but the n become part of the instruction word): static __inline__ uint32_t custom_inic(const uint32_t op, uint32_t a, const uint32_t b) { uint32_t result; __asm__ ( custom\t%1, %0, %2, c%3 : =r (result) : n (op), r (a), n (b)); return result; } David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 2/2] powerpc/85xx: Renamed mpc85xx_common.c to common.c
Subject: [PATCH 2/2] powerpc/85xx: Renamed mpc85xx_common.c to common.c The file name is already scoped by the directory its in. ... rename from arch/powerpc/platforms/85xx/mpc85xx_common.c rename to arch/powerpc/platforms/85xx/common.c All very well until you only have a reference to the filename part somewhere and want to 'find' the file in a large directory tree. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 01/16] pmac_zilog: fix unexpected irq
On most 68k Macs the SCC IRQ is an autovector interrupt and cannot be masked. This can be a problem when pmac_zilog starts up. Wouldn't this also happen if the interrupt were shared? Hopefully nothing vaguely modern uses the borked Zilog 8530 SCC (which I presume is the part in question - brings back too many nightmares) David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 1/3] mtd/nand: fix coding style issue in drivers/mtd/nand/fsl_elbc.c
On Tue, 2011-11-15 at 11:26 +, Jenkins, Clive wrote: fix whitespaces,tabs coding style issue and ... In my opinion this code was already correct, and would display correctly at any TAB setting. This patch changes it so that it displays incorrectly at all TAB settings other than 8. Any tab setting other than 8 is incorrect and should not be used for Linux code. If I view it in a proportionally-spaced font, it's not going to line up right either. Fix your editor, if that's an issue for you. Personally I don't even attempt to line up argumants on continuation lines - it just wastes vertical space. I just indent them as any other continuation line, so double-indent if using 4-char indents and 1/2 indent for 8-char indents. Tabs have to be assumed to be 8 columns, otherwise all sorts of tools just get it wrong. (Or you ban tabs from source files.) David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Disabling TCP slow start after idle
We have some connections that suffer very badly from the TCP 'slow start' algorithm. These are connections that will always be local - they may be MAC-Switch-MAC using RGMII crossover, they might also be connected via an external switch. In either case the RTT is most likely to be almost zero, certainly below 1ms. The traffic is single packets (carrying another protocol) so we have Nagle disabled and the send and receive sides run separately. So the traffic is neither bulk, nor command/response. This means that there is very rarely any unacked data, so almost every packet is sent using 'slow start'. If the external switch drops a packet (they do!) then slow start stops more packets being sent, and nothing progresses for about 1.5 seconds by which time there is a significant amount of traffic queued and, in some cases, data has to be discarded. Similar issues happen if the receiving system decides to defer the ack until a timer tick (instead of sending one after every second packet). In this case only 4 packets are sent. (We fixed this one be sending a software ACK every 4 packets.) Quite cleary the 'slow start' algorithm just doesn't work in these cases. I found this https://lkml.org/lkml/2010/4/9/427 discussion about a socket option to disable slow start. But it seems that some people are completely against the idea. I'd have thought that the global option would be more of a problem - since that will affect ftp connections to remote hosts where slow start is alomost certainly benefitial. I'd have thought it would be sensible to allow one (or more) of the following (either as a sysctl, socket option, or code change): 1) Disable slow start for the local subnet. 2) Disable slow start for connections with very low RTT. 3) Disable slow start for a minimum period with no traffic (after the last packet is acked). I'm not sure of the resolution used by the Linux RTT calculations. I know NetBSD had a recent set of patches to fix calculation errors with low RTT because the code had been written when all RTT were much longer. David (Copied to linuxppc-dev because I'm subscribed to it.) ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH v2 1/5] [ppc] Process dynamic relocations for kernel
How about using clean_dcache_range() to flush the range runtime address range [ _stext, _end ] ? That would flush the entire instructions. Wouldn't that result in more cache flushing than the original solution? For example, my kernel is 3.5MB. Assuming a 32 byte cache line size, clean_dcache_range(_stext, _end) would result in about 115,000 dcbst's (3.5MB / 32). Oops ! You are right. We could go back to the clean_dcache_all() or the initial approach that you suggested. (dcbst). Maybe clean_dcache_range() itself should limit the range to the size of the cache? I suspect that code can easily know the cache size. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH v2 1/5] [ppc] Process dynamic relocations for kernel
On Fri, 2011-11-04 at 14:06 +0530, Suzuki Poulose wrote: On 11/03/11 05:06, Josh Poimboeuf wrote: On Tue, 2011-10-25 at 17:23 +0530, Suzuki K. Poulose wrote: @@ -137,6 +137,9 @@ get_type: lwz r0, 8(r9) /* r_addend */ add r0, r0, r3 /* final addend */ stwxr0, r4, r7 /* memory[r4+r7]) = (u32)r0 */ + dcbst r4,r7 /* flush dcache line to memory */ + sync/* wait for flush to complete */ + icbir4,r7 /* invalidate icache line */ Doing it this way has two drawbacks : 1) Placing it here in relocate would do the flushing for each and every update. I agree. My kernel had around 80,000 relocations, which means 80,000 d-cache line flushes (for a 32k d-cache) and 80,000 i-cache line invalidates (for a 32k i-cache). Which is obviously a little overkill. Although I didn't notice a performance hit during boot. The I-cache invalidates shouldn't be needed, the un-relocated code can't be in the I-cache (on the grounds that executing it would crash the system). A single sync at the end is probably enough as well. I guess it is possible for the cpu to prefetch/preload into the i-cache through the jump into the relocated code? So maybe a full i-cache invalidate right at the end? (or a jump indirect? - which is probably there anyway) The d-cache will need some kind of flush, since the modified lines have to be written out, the only time it generates additional memeory cycles are if there are two (or more) reloations in the same d-cache line. Otherwise the early write-back might help! David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH] powerpc/usb: use unsigned long to type cast an address of ioremap
Below are codes for accessing usb sysif_regs in driver: usb_sys_regs = (struct usb_sys_interface *) ((u32)dr_regs + USB_DR_SYS_OFFSET); these codes work in 32-bit, but in 64-bit, use u32 to type cast the address of ioremap is not right, and accessing members of 'usb_sys_regs' will cause call trace, so use unsigned long for both 32-bit and 64-bit. Wouldn't a (char *) cast be even better? David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH] powerpc/usb: use unsigned long to type cast an address of ioremap
On Thu, Nov 3, 2011 at 4:58 AM, Shaohui Xie shaohui@freescale.com wrote: usb_sys_regs = (struct usb_sys_interface *) - ((u32)dr_regs + USB_DR_SYS_OFFSET); + ((unsigned long)dr_regs + USB_DR_SYS_OFFSET); This makes more sense: usb_sys_regs = (void *)dr_regs + USB_DR_SYS_OFFSET; But that is invalid C. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH] powerpc/usb: use unsigned long to type cast an address of ioremap
usb_sys_regs = (void *)dr_regs + USB_DR_SYS_OFFSET; But that is invalid C. What's invalid about it? I haven't tried compiling this specific line of code, but I've done stuff like it in the past many times. Are you talking about adding an integer to a void pointer? If so, then that's something that gcc supports and that the kernel uses all over the place. Arithmetic on 'void *' should not be done. I know some versions of gcc allow it (provided some warning level/option is enabled) but that doesn't mean it is valid. My suspicions are that is was allowed due to the way 'void *' was originally bodged into gcc. A char* is incorrect because a char could be more than one byte, in theory. It is somewhat difficult to untangle the standard, but sizeof (char) is defined to be one. Of course, the C language doesn't actually require that you can converts between pointers to different types in any well-defined manner. But most of the low level device access assumes an adequately linear address space. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH] powerpc/usb: use ioremap instead of base plus offset to access regs
#ifndef CONFIG_ARCH_MXC if (pdata-have_sysif_regs) - usb_sys_regs = (struct usb_sys_interface *) - ((u32)dr_regs + USB_DR_SYS_OFFSET); + usb_sys_regs = ioremap(res-start + USB_DR_SYS_OFFSET, + sizeof(struct usb_sys_interface)/sizeof(int)); #endif That ioremap() doesn't look right. Isn't the 'size' in bytes?? (Although it will only matter if it crosses a page boundary.) Mind you, I'd have though the original ioremap() should have covered the entire structure?? David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH][v2] uio: Support 36-bit physical addresses on 32-bit systems
Kumar Gala wrote: + phys_addr_t addr; Please add a comment here saying: 1) That 'addr' can be a virtual or physical address The code and everything else makes that clear I'm sorry, but I have to strongly disagree here. It is *NOT* clear that a variable of type 'phys_addr_t' can hold something that is not a physical address. Since there is a discriminating field, could a union be used? At a guess the type of the address is constrained between produces and consumer?? David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
Then, this statement: *(u32 *) (tx_desc-ctrl.vlan_tag) |= ring-doorbell_qpn; ... instead do ... : *(u32 *) (tx_desc-ctrl.vlan_tag) |= cpu_to_be32(ring-doorbell_qpn); (Also get rid of that cast and define vlan_tag as a __be32 to start with). Agreed, casts that change the type of memory - *(foo *)xxx - are generally bad news unless you are casting a generic 'buffer' to a specific structure. I've seen far to much code that ends up being depending on the endianness and system word size. For the above I'd actually suggest making 'doorbell_qpn' have the correct endianness in order to avoid the (potential) swap every time it is set. You also need to treble-check the required endianness for the 'vlan_tag' in the tx descriptor. What would be needed is the MAC PCI slave were on an x86 (LE) system. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
What is this __iowrite64_copy... oh I see Nice, somebody _AGAIN_ added a bunch of generic IO accessors that are utterly wrong on all archs except x86 (ok, -almost-). There isn't a single bloody memory barrier in there ! Actually memory barriers shouldn't really be added to any of these 'accessor' functions. (Or, at least, ones without barriers should be provided.) The driver may want to to a series of writes, then a single barrier, before a final write of a command (etc). in_le32() from io.h is specially horrid! David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH] mlx4_en: fix transmit of packages when blue frame isenabled
static void mlx4_bf_copy(unsigned long *dst, unsigned long *src, unsigned bytecnt) { + int i; + __le32 *psrc = (__le32 *)src; + + /* + * the buffer is already in big endian. For little endian machines that's + * fine. For big endain machines we must swap since the chipset swaps again + */ + for (i = 0; i bytecnt / 4; ++i) + psrc[i] = le32_to_cpu(psrc[i]); + __iowrite64_copy(dst, src, bytecnt / 8); } That code looks horrid... 1) I'm not sure the caller expects the buffer to be corrupted. 2) It contains a lot of memory cycles. 3) It looked from the calls that this code is copying descriptors, so the transfer length is probably 1 or 2 words - so the loop is inefficient. 4) ppc doesn't have a fast byteswap instruction (very new gcc might use the byteswapping memery access for the le32_to_cpu() though), so it would be better getting the byteswap done inside __iowrite64_copy() - since that is probably requesting a byteswap anyway. OTOH I'm not at all clear about the 64bit xfers ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: scheduling while atomic:
Not entirely relevant to the error you are seeing, but your ISR is: irqreturn_t cpld_irq_handler(int irq, void * dev_id, struct pt_regs *regs) { wake_up(cpld_intr_wait); atomic_inc(cpld_intr_data); /* incrementing this will indicate the poll() that the interrupt is occured */ return 0; } You need to set 'cpld_intr_data' before clling wake_up() otherwise the sleeping process might run on another cpu before the data item changes. You also don't need to use the 'atomic' functions. Writing to a 'volatile' data item is sufficent. You don't want to count interrupts, just mark they have happened. Maybe your problem it that the ISR doesn't do anything to cause the hardware to drop its IRQ. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: build failure with gcc 4.6.0 array subscript is above array bounds
Subject: build failure with gcc 4.6.0 array subscript is above array bounds ... That corresponds to: tmp = ((unsigned long *)child-thread.fpr) [TS_FPRWIDTH * (index - PT_FPR0)]; child-thread.fpr is double fpr[32][TS_FPRWIDTH]. index has already been bounds checked so we know it is = PT_FPSCR. That code looks gross I think it is trying to index a 2D array with a single index and type-pun the lookup. I'm not sure how the array size (for the subscript error) is determined in the presence of the cast, but without the cast the index would have to be less than 32. I also suspect this is failing when gcc inlines the function from a call where 'index' is a constant. Possibly the code should read: tmp = (unsigned long *)child-thread.fpr[index - PT_FPRO]; although index may have been scaled by 'sizeof double/sizeof long'. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [RFC] Optimize __arch_swab32 and __arch_swab16
Joakim Tjernlund joakim.tjernl...@transmode.se writes: unsigned short my__arch_swab16(unsigned short value) { __asm__(rlwimi %0,%0,16,0x00ff : +r (value)); You are creating a value that does not fit in a short. Which is a problem because the compiler could schedule it be written back to real memory between the instructions. Actually the generated code would be better if the swap16() functions operated on 'unsigned int' fields - since it would save the compiler from doing a lot of shifts/masks elsewhere. For instance there is likely to be a mask with 0x prior to the call to swab16(). Since one use of these is for htons() (etc), when the host is the correct endianness these are #defines that do nothing - so out of range values aren't masked. So it seems to me that defining: unsigned int swab(unsigned int); would be fine - except it clashes with standard headers :-( David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [RFC] Optimize __arch_swab32 and __arch_swab16
Which is a problem because the compiler could schedule it be written back to real memory between the instructions. It can? There is no memory here, just registers. Even if it is written to memory, how would that affect the register? Although the function argument is passed in a register, the compiler could generate a store-load sequence before and after each __asm__() line. Assuming you are right, would rewriting it to __asm__(rlwimi %0,%0,16,0x00ff\n\t rlwinm %0,%0,24,0x : +r(value)); help? Except that now you've stopped the compiler scheduling another instruction between the two - probably forcing a execution stall. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: union/struct representations for MAS Registers
I have some and use them in some code, they represent ISA 2.06 MAVN=1 (version 2) Can I keep them? if so, should I put them somewhere useful to others? Examples: union mas1 { u32 _val; struct { u32 v:1; u32 iprot:1; u32 tid:14; u32 _reserved_1:2; u32 ind:1; u32 ts:1; u32 tsize:4; u32 _reserved_2:8; }; }; Bitfields are rather non-portable, the compiler has a lot of choice about how to align the bits in memory. Their use to map anything physical is doomed to portabilily issues. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: union/struct representations for MAS Registers
Bitfields are rather non-portable, the compiler has a lot of choice about how to align the bits in memory. I'm ok with the masking stuff. However, I'm actually surprised this is true given the maturity of our ABIs. The C standard says nothing at all about how bitfields are implemented, I think the first bit might be 0x1, 0x80, 0x100 or 0x8000 when treated as a 32bit value, regardless of the endianness. Different architectures can (and do) assign things in different ways. So code that is ok on ppc might fail on arm or x86 (etc). David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW trackingof dirty young
Got it, if the fault_in_user_writeable() is designed to catch the exact same write permission fault problem we discuss here, so your patch fixed that very nicely, we should fixup it by directly calling handle_mm_fault like what you did because we are for sure to know what just happened(permission violation), its not necessary to check what's happened by calling gup--follow_page, and further the follow_page failed to report the fault :-) One thought I've had - and I don't know enough about the data area in use to know if it is a problem - is what happens if a different cpu faults on the same user page and has already marked it 'valid' between the fault happening and the fault handler looking at the page tables to find out why. If any of the memory areas are shared, it might be that the PTE (etc) might already show the page a writable by the time the fault handler is looking at them - this might confuse it! David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core
The fault causing futex_atomic_cmpxchg_inatomic() is protected by pagefault_disable(), so the page fault handler has no chance to toggle the SW dirty/young tracking. Perhaps that is the bug! Whatever pagefault_disable() does, it shouldn't disable the SW dirty/young tracking - which should only needs bits moving in the page table itself (and TLB update??) rather than any operations on the rest of the data areas. It looks to me as though this could happen any time a page is marked inaccessible by the dirty/young tracking. Not just as a result of COW. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 4/4] powerpc: Enable lockup and hung task detectors in pseriesand ppc64 defeconfigs
As a result of changes to Kconfig files, we no longer enable the lockup and hung task detectors. Both are very light weight and provide useful information in the event of a hang, so reenable them. ... +CONFIG_LOCKUP_DETECTOR=y +CONFIG_DETECT_HUNG_TASK=y Is one of thise responsible for generating a kernel stack traceback when a process has been sleeping uninterruptably for a 'long' time? We have a kernel subsystem that has several 'worker' threads, these always sleep uninterruptable (they are shut down by explicit request) and, at times, can be idle for long periods. Perhaps it should be possible to disable the check either on a per-process of per sleep basis? David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 2/3] powerpc: POWER7 optimised memcpy using VMX
On Fri, Jun 17, 2011 at 02:54:00PM +1000, Anton Blanchard wrote: Implement a POWER7 optimised memcpy using VMX. For large aligned copies this new loop is over 10% faster and for large unaligned copies it is over 200% faster. ... BTW: do you have any statistics on the size distribution of memcpy memcpy_to_from_usr? My gut feeling is that the intermediate case is the most important, and the short case the less critical (drowned in overhead's noise) but that's the kind of things on which I've often been wrong. My thoughts are certainly that the code is too big, and that the 'cold cache' version and possibly the effects of increasing the size of the working set (ie displacing other code) may be significant in real life. For memcpy() the 'short' case will happen surprisingly often, I suspect the fixed costs for the short case may dominate some real workloads. I'm not sure the speed of misaligned copies matters enough to take the hit of the alignment test! Of course, I don't actually remember doing any instrumentation of this, but I have changed i386/amd64 memcpy (not linux/glibc) to avoid the 'rep movsb' used for the trailing bytes (copy the last 'word' first) - the setup cost for 'rep movsb' is over 40 clocks on netburst P4! (It is possible to get amd64 to copy data as fast as the 'rep movd', but the setup times are longer. And very recent Intel cpus contain hardware acceleration for aligned and misaligned 'rep movsd' - so trying anything clever isn't good.) I do realise thise doesn't directly apply to ppc :-) David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH] fs_enet: fix freescale FCC ethernet dp buffer alignment
Hello, Motioned to the memory aligned, now there is such requirement: When the driver send an packet to hardware, the skb's address passed by stack do a dma map into hardware, the skb's dma address must be 64-byte aligned. Does the hardware support buffer chaining? In which case you only need to copy the data upto the first 64 byte boundary into another buffer. Actually, given that you are likely to have to fixup every fragment of the frame being transmitted, if might be worth allocating a fixed transmnit buffer area and copying the frames into it prior to sending. Certainly you need to allow for transmits made up of a significant number of small buffers linked together. Really you should beat up the hardware designers! Copying the data to even a 4 byte boundary is almost always a misaligned copy. Typically this only applies to the receive dma - when writing a 2 byte pad before the frame data would be much better. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: Changes to of_device ?
For finding structures, the following pattern tends to work well: git grep 'struct device_node {' In general 'typedef.*\type_name\' will find types. eg: grep -r -n --include '*.h' 'typedef.*\type_name\' base_of_source_tree Finding functions (in .c files) is easy if names start in column 1 '^function_name\' will work. But I ended up with: '^[^=]*[^ =][^=]*function_name\' to find some where the type was on the same line. If you get really desparate, the definitions of functions can be found by grepping through the namelists of the .o files. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: Relocatable kernel for ppc44x
The PPC440X currently uses 256M TLB entries to pin the lowmem. When we go for a relocatable kernel we have to : 1) Restrict the kernel load address to be 256M aligned OR 2) Use 16M TLB(the next possible TLB page size supported) entries till the first 256M and then use the 256M TLB entries for the rest of lowmem. What is wrong with: 3) Use 256M TLB entries with the lowest one including addresses below the kernel base. Clearly the kernel shouldn't be accessing the addresses below its base address - but that is true of a lot of address space mapped into the kernel. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 7/7] [v5] drivers/virt: introduce Freescale hypervisormanagement driver
+enum fsl_hv_ioctl_cmd { + FSL_HV_IOCTL_PARTITION_RESTART = _IOWR(0, 1, struct fsl_hv_ioctl_restart), ... +}; Using a #define here is usually preferred because then you can use #ifdef in a user application to check if a given value has been assigned. It is also possible to add: #define FSL_HV_IOCTL_PARTITION_RESTART FSL_HV_IOCTL_PARTITION_RESTART to have much the same effect. But there are many cases where #defines are better. I only tend to use enums when the constanst are beting generated by the expansion of a #define. More importantly, the code you have chose (0) conflicts with existing drivers (frame buffer, scsi and wavefront among others). Please chose a free one and add it to Documentation/ioctl/ioctl-number.txt in the same patch. It is rather a PITA that, when 'int' went from 16 to 32 bits, the BSD people used the high 16 bits for size/flags rather than using the extra bits to help make ioctl's unique. Linux seems to have copied BSD here - rather than SYSV. One problem with clashing ioctl commands is when systems like NetBSD are running linux binaries and need to translate ioctl buffers to/from native format. If the ioctl commands are unique this can be done much more easily. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [gianfar]bandwidth management problem on mpc8313 based board
Subject: [gianfar]bandwidth management problem on mpc8313 based board ... I have mpc8313 powerpc based board with silicon revision 2.1. the processor has two ETH ports (eTsec1 and eTsec2) i.e. eth0 and eth1. eth0 is 1Gbps port and eth1 is 100Mbps port. On board there is L2 switch from TANTOS2G (psb6972) supports one port 1Gbps, and from switch there are 4 more eth ports derived which are 100Mbps ports and port based VLAN is configured for this purpose. The interface between switch and eth0 (port of processor) is RGMII. So the processor port and switch port are connected on 1Gbps Link. ... After this I started to perform bandwidth test using iperf tool. When I performed this test on one port out of 4 derived ports I am getting bandwidth in the range of 80-85Mbps but when the same test is performed on 2 ports simultaneously then the per port bandwidth is reduced to 40-45Mbps. To summerise, you have a Ge port connected by RGMII (cross over) to an on-board switch that is configured to use VLAN tagging to drive four external 100M ports? I see two likely reasons for the aggregate throughput being constant: 1) The switch has limited throughput/buffering 2) The host really is 100% busy 3) The remote system has limited throughput I'd vote for the system being busy and 'top' (or whatever you are using) lying about the cpu usage. Measuring free cpu time by counting it in a low priority process is much more accurate than relying on the 'code interrupted by timer tick' scheme. (Clearly the scheduler could use a high-res timestamp on entry/exit to the idle loop and/or process switch - but, to my knowledge, the linux kernel only uses the timer interrupt.) David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [RFC][PATCH] powerpc: Use the #address-cells information to parsememory/reg
Changed the add_usable_mem_property() to accept FILE* fp instead of int fd, as most of the other users of read_memory_region_limits() deals with FILE*. Signed-off-by: Suzuki K. Poulose suz...@in.ibm.com Could you please let me know your thoughts/comments about this patch ? Is the change to use 'FILE *' actually progress? I'd have thought that the randomly aligned read/lseek system calls that this allows to happen are not desirable for anything that isn't a true file. I'd also suggest that the sizeof's should be applied to the actual type of the variable being read/written, not arbitrarily 'long' or 'int', and this probably ought to be some fixed size type. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 1/2] ppc/85xx: create a platform node for PCI EDAC device
+int __init fsl_add_pci_err(void) static :-) and why __ ? aren't such names reserved in the C language for some purpose. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: Cannot run tcl program of PPC460EX
On Tue, May 24, 2011 at 11:16:12PM -0700, efti wrote: I am trying to run tcl of my PPC460ex board. I have download the tcl-8.3.3-sol26-sparc-local.gz file. But when I run it I get the Syntax error. That tarball seems to indicate it's for both Solaris and Sparc. PPC460ex is neither of those. Which may well mean that the 'syntax error' is from the shell trying to interpret (as a shell script) a sparc elf binary that the kernel has failed to 'exec'. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: PCI DMA to user mem on mpc83xx
we have a pretty old PCI device driver here that needs some basic rework running on 2.6.27 on several MPC83xx. It's a simple char-device with give me some data implemented using read() resulting in zero-copy DMA to user mem. There's get_user_pages() working under the hood along with SetPageDirty() and page_cache_release(). Does that dma use the userspace virtual address, or the physical address - or are you remapping the user memory into kernel address space. If the memory is remapped into the kernel address space, the cost of the mmu and tlb operations (especially on MP systems) is such that a dma to kernel memory followed by copyout/copytouser may well be faster! That may even be the case even if the dma is writing to the user virtual (or physical) addresses when it is only necessary to ensure the memory page is resident and that the caches are coherent. In any case the second copy is probably far faster than the PCI one! I've recently written driver that supports a pread/pwrite interface to the memory windows on a PCIe card. It was important to use dma for the PCIe transfers (to get a sensible transfer size). I overlapped the copyin/copyout with the next dma transfer. The dma's are fast enough that it is worth spinning waiting for completion - but slow enough to make the overlapped operation worthwhile (same speed as a single word pio transfer). David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 1/3] mpt2sas: remove the use of writeq, since writeq isnot atomic
The specific mpt2sas problem is that if we write a 64 bit register non atomically, we can't allow any interleaving writes for any other region on the chip, otherwise the HW will take the write as complete in the 64 bit register and latch the wrong value. The only way to achieve that given the semantics of writeq is a global static spinlock. That sounds like very specific and slightly dodgy hardware. You don't say what the scope of 'region on the chip' is, but it looks like you need to disable ALL writes to the memory area between the first and second writes of the 64bit value and not just those coming from writeq(). I don't see how this can possibly be done by normal mutexing around the writeq() sequence, surely you need to lock the bus between the two transfers. Even dma writes would be a problem. The only way I can think to stop other cpus doing writes is to use IPIs for force them into a busy wait loop. All rather reminds me of a PCI slave that got things horribly wrong when a read was done while a write was still 'posted', or a 2nd master did a cycle did a read while a read rerun sequence was still in progress. (required a mutex and dummy cycles). At least than one wqs confined to one driver. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: book to learn ppc assembly and architecture
On Mon, 2011-05-16 at 16:37 +1000, Michael Neuling wrote: what is the best book to learn assembly and architecture . Assuming you have a powerpc compiler available you can use the -S option to produce assembly listings. With gcc add -fverbose-asm for more info. For a general background, look at something much simpler than ppc, even if you don't write/run any code. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 1/3] mpt2sas: remove the use of writeq, since writeq isnot atomic
static inline void writeq(__u64 val, volatile void __iomem *addr) { writel(val, addr); writel(val 32, addr+4); } ... We need the 64 bit completed in one access pci memory write, else spin lock is required. Since it's going to be difficult to know which writeq was implemented in the kernel, the driver is going to have to always acquire a spin lock each time we do 64bit write. ... I'm just in the process of finding them now on IRC so I can demand an explanation: this is a really serious API problem because writeq is supposed to be atomic on 64 bit. Most 32 bit systems don't have atomic 64bit writes. I'd also have thought there would be code which wouldn't mind the write being done as two cycles. I'm not sure that some of the ppc soc systems are capable of doing a 64bit data pci/pcie cycle except by dma. So your driver is probably doomed to require a lock. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system callfiltering
... If you can be completely stateless its easier, but there's a reason that stacking security modules is hard. Serge has tried in the past and both dhowells and casey schaufler are working on it right now. Stacking is never as easy as it sounds :) For a bad example of trying to allow alternate security models look at NetBSD's kauth code :-) NetBSD also had issues where some 'system call trace' code was being used to (try to) apply security - unfortunately it worked by looking at the user-space buffers on system call entry - and a multithreaded program can easily arrange to update them after the initial check! For trace/event type activities this wouldn't really matter, for security policy it does. (I've not looked directly at these event points in linux) David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH] atomic: add *_dec_not_zero
Introduce an *_dec_not_zero operation. Make this a special case of *_add_unless because batman-adv uses atomic_dec_not_zero in different places like re-broadcast queue or aggregation queue management. There are other non-final patches which may also want to use this macro. Isn't there a place where a default definition of this can be defined? Instead of adding it separately to every architecture. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH V4] POWER: perf_event: Skip updating kernel counters if register value shrinks
I keep telling Eric that the code below is incorrect modulo arithimetic... +static u64 check_and_compute_delta(u64 prev, u64 val) +{ + u64 delta = (val - prev) 0xul; + + /* + * POWER7 can roll back counter values, if the new value is smaller + * than the previous value it will cause the delta and the counter to + * have bogus values unless we rolled a counter over. If a coutner is + * rolled back, it will be smaller, but within 256, which is the maximum + * number of events to rollback at once. If we dectect a rollback + * return 0. This can lead to a small lack of precision in the + * counters. + */ + if (prev val (prev - val) 256) + delta = 0; + + return delta; The code should detect rollback by looking at the value of 'delta' otherwise there are horrid end effects near 2^32-1. For instance: u32 delta = val - prev; return delta 0x8000 ? 0 : delta; David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH V4] POWER: perf_event: Skip updating kernel counters if register value shrinks
But it isn't, and it doesn't have trouble with 2^32 - 1. what about: prev = 0x0001 val = 0x David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH V4] POWER: perf_event: Skip updating kernel counters if register value shrinks
+ if (prev val (prev - val) 256) + delta = 0; + + return delta; Also, 'prev' is a true 64bit value, but 'val' is only ever 32bits. So once the 64bit 'prev' exceeds 2^32+256 both 'prev val' and 'prev - val' are true regardless of the value of 'val'. This will lead to jumps in the value David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH V4] POWER: perf_event: Skip updating kernel counters if register value shrinks
On Wed, 27 Apr 2011, David Laight wrote: But it isn't, and it doesn't have trouble with 2^32 - 1. what about: prev = 0x0001 val = 0x Result is 0xfffe and we are fine. 'delta' will be 0xfffe, but you need the function to return zero - since the underlying counter has decremented by 2. But 'prev val' is false so the counter will be increased by 2^32-2. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH V4] POWER: perf_event: Skip updating kernel counters if register value shrinks
prev and val are both 64 bit variables holding 32 bit numbers, we do not accumulate in either, they are both replaced by values directly from the registers. So prev val will not always be true. The code seems to be: prev = local64_read(event-hw.prev_count); val = read_pmc(event-hw.idx); delta = check_and_compute_delta(prev, val); local64_add(delta, event-count); Which looks very much like 'prev' being a 64bit counter generated from the 32bit pmc register. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: align DTL buffer to AMS boundary
From: linuxppc-dev-bounces+david.laight=aculab@lists.ozlabs.org [mailto:linuxppc-dev-bounces+david.laight=aculab@lists.ozl abs.org] On Behalf Of Nishanth Aravamudan Sent: 13 April 2011 15:53 To: Ben Herrenschmidt Cc: linuxppc-...@ozlabs.org; Paul Mackerras; Anton Blanchard Subject: [PATCH] powerpc: align DTL buffer to AMS boundary PAPR specifies that DTL buffers can not cross AMS environments (aka CMO in the PAPR) and can not cross a memory entitlement granule boundary (4k) How big is the buffer being allocated? If it is much less than 4k then it might be worth allocating a buffer of the correct size, and only if that crosses a 4k boundary allocate the larger buffer. Also, if the buffer is ever freed, the actual base address is needed for the free. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH V3] POWER: perf_event: Skip updating kernel counters ifregister value shrinks
+ u64 delta = 0; ... + if (((prev 0x8000) !(val 0x8000)) || (val prev)) + delta = (val - prev) 0xul; + + return delta; The above is incorrect modulo arithmetic. It is probably intended to do: s32 delta = val - prev; return delta 0 ? 0 : delta; which will just ignore the fact that some counts are rolled back. More accurate would be: static u64 val64; u32 val = read_perf_count(); s32 delta = val - val_64; if (delta 0) return; val64 += delta; Which will not double count for rolled back events. (The low bits of val64 should always match the actual counter.) David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH v2 5/5] xhci: Remove recursive call to xhci_handle_event
- xhci_handle_event(xhci); + while (xhci_handle_event(xhci)) {} I must admit I dislike the style with empty loop bodies... I would use an explicit 'continue;' for the body of an otherwise empty loop. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
High load average (~2.0) on an idle PowerPC 64 machine
I've this Quad G5 machine that's sitting pretty much idle with the latest Debian stable installed, and yet it's got an abnormaly high load average. The 'load average' value is rather useless since it seems to contain any process that is sleeping uninterruptibly, and IIRC any process that has run at all (for however short a period) in the current schedule epoch (or whatever period is relevant). I can easily generate a linux system that is 99.9% idle but has a 'load average' of 20 or more. It would still be 99.9% idle even if the idle time were based of the actual time outside the scheduler idle loop (as NetBSD doe) rather than where timer ticks interrupted. We also have the related fubar (on x64/amd64) of the kernel generating stack traces for processes that are sleeping uninterruptibly for long periods. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH][v2] driver/FSL SATA:Fix wrong Device Error Register usage
Case when ffs return will never arise.This scenario is already been discussed on linuxppc-dev@lists.ozlabs.org. Please see below explanation: sata_fsl_error_intr() is called during device error.The mentioned scenario will never comes. It can be observed via code:- if (cereg) { -- cereg is set on command error. Means there is at least 1 device present. This all relies on the hardware working properly, and nothing else accidentally clearing the device 'error' bit mask. Being slightly defensive when the out of range value causes an array subscript error seems sensible. It isn't as though the code size or execution time matters here. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH][v1] driver/FSL SATA:Fix wrong Device Error Register usage
The patch changes the code to: if (ap-nr_pmp_links) { + int dev_num; dereg = ioread32(hcr_base + DE); ... + dev_num = ffs(dereg)-1; + if (dev_num ap-nr_pmp_links) { + link = ap-pmp_link[dev_num]; On the off chance that no bits are set in 'DE' this code depends on nr_pmp_links being unsigned (since dev_num becomes -1. Changing 'dev_num' to 'unsigned int' would be better. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH] driver/FSL SATA:Fix wrong Device Error Register usage
+ if ((ffs(dereg)-1) ap-nr_pmp_links) { + /* array start from 0 */ + link = ap-pmp_link[ffs(dereg)-1]; I'd only call ffs() once - it could be a slow library function. Any comment should note that ffs() returns 0 when no bits are set - rather than anything about array indexes. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH V8 03/10] USB/ppc4xx: Add Synopsys DWC OTG Core InterfaceLayer
Sorry, I don't understand that. I think u32 is always 32bit 4byte on all archs. Right? Yes. Use an unsigned long if you want to hold a pointer correctly on all arches. Although that is true for many systems (and probably all ppc Linux) it isn't necessarily true (eg 64 bit Microsoft Windows). C99 inttypes.h should define uintptr_t as an unsigned integer type that is large enough to hold a (data) pointer. I'm not sure if this is defined for the Linux kernel. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 1/2] misc: add CARMA DATA-FPGA Access Driver
I was completely unaware of that feature. I hunch that many drivers are incapable of dealing with an unbind while they are still open. Hmm, maybe older drivers... Anythig hotpluggable (USB, PCI, etc) should be in a better shape because they expect to be yanked at any time. Whim and prayer ??? David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 1/2] misc: add CARMA DATA-FPGA Access Driver
This driver allows userspace to access the data processing FPGAs on the OVRO CARMA board. It has two modes of operation: 1) random access This allows users to poke any DATA-FPGA registers by using mmap to map the address region directly into their memory map. I needed something similar, but used pread() and pwrite() to request the transfers. While this does require a system call per transfer, it allows the driver to use dma (if available) to speed up the request. In my case doing single cycle transfers would be too slow. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH V8 03/10] USB/ppc4xx: Add Synopsys DWC OTG Core InterfaceLayer
[Marri] If u32 is 8bytes isn't pointer type would be 8bytes as well. If u32 is 8bytes 'char' would only be 4 bits! (IIRC the X code has a fubar that requires a type with 32 in its name to be 64 bits!) David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 4/6] ftrace syscalls: Allow arch specific syscallsymbol matching
+#define arch_syscall_match_sym_name(sym, name) !strcmp(sym + 3, name + 3) Whenever you use a #define macro arg, you should enclose it in (). About the only time you don't need to is when it is being passed as an argument to another function (ie when it's use is also ',' separated). So the above ought to be: #define arch_syscall_match_sym_name(sym, name) (!strcmp((sym) + 3, (name) + 3)) Whether an inline function is better or worse is much more subtle! For instance I've used: asm volatile ( # line STR(__LINE__) :: ) to stop gcc merging the tails of conditionals. Useful when the conditional is at the end of a loop (etc), it might increase code size slightly, but removes a branch. If I put one of those in an 'inline' function separate copies of the function end up sharing code. With a #define __LINE__ differs so they don't. (I had some code to get below 190 clocks, these changes were significant!) David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 2/7] PowerPC: add unlikely() to BUG_ON()
+#define __BUG_ON(x) do { \ if (__builtin_constant_p(x)) { \ if (x) \ BUG(); \ @@ -85,6 +86,8 @@ } \ } while (0) +#define BUG_ON(x) __BUG_ON(unlikely(x)) + From my experiments, adding an 'unlikely' at that point isn't enough for non-trivial conditions - so its presence will give a false sense the the optimisation is present! In particular 'if (unlikely(x y))' needs to be 'if (unlikely(x) unlikely(y))' in order to avoid mispredicted branches when 'x' is false. Also, as (I think) in some of the generated code quoted, use of __builtin_expect() with a boolean expression can force some versions of gcc to generate the integer value of the expression - rather than just selecting the branch instructions that statically predict the normal code path. Sometimes I've also also had to add an asm() statement that generates no code in order to actually force a forwards branch (so it has something to place at the target). (I've been counting clocks ) David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH v2] gianfar: Fall back to software tcp/udp checksum on oldercontrollers
+ if (unlikely(gfar_has_errata(priv, GFAR_ERRATA_12) + ((unsigned long)fcb % 0x20) 0x18)) { You need to check the generated code, but I think you need: if (unlikely(gfar_has_errata(priv, GFAR_ERRATA_12)) unlikely(((unsigned long)fcb % 0x20) 0x18)) ie unlikely() around both the primitive comparisons. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: FSL DMA engine transfer to PCI memory
What was the ppc you used? The 8315E PowerQUIICC II On 85xx/QorIQ-family chips such as P2020, there is no DMA controller inside the PCIe controller itself (or are you talking about bus mastering by the PCIe device[1]? interface is a bit ambiguous), though it was considered part of the PCI controller on 82xx. The DMA engine and PCIe are both on OCeaN, so the traffic does not need to pass through the e500 Coherency Module. My understanding -- for what it's worth, coming from a software person :-) -- is that you should be able to get large transfer chunks using the DMA engine. It might be possible - but the ppc's pcie would need to know the length of the dma (or at least be told that there was more data to arrive) before even starting the pcie transfer. I used 128 bytes per pcie transfer (which the altera slave can handle) but that is longer than you want a burst on the internal (CSB in my case) bus on the ppc. It is also longer than a cache line - so the dma engine's memory reads might induce a cache flush. I suggest getting things working, and then seeing whether the performance is acceptable. The only reason for using dma (instead of pio) is to get long pcie transfers - otherwise it isn't really worth the effort. Transfers are unlikely to take long enough to make it worth taking an interrupt at the end of the dma. My device driver implements read() and write() (and poll() to wait for interrupts). So I do overlap the copy_to/from_user with the next dma. The generic dma controller can't even generate 64bit cycles into the ppc's PCIe engine. Could you elaborate? The pcie is (apparantly) a 64bit interface, to a single 32bit transfer is actually a 64bit one with only 4 byte enables driven. I couldn't see anything that would allow a CSB master to generate two 32bit cycles (since it is a 32bit bus) that the pcie hardware could convert into a single 64bit pcie transfer. The fpga target is likely to have 32bit targets (it could have 64 bit ones, but if you've instantiated a NiosII cpu it wont!) so you get a bus width adapter (which carefully does the cycle with no byte enables driven) as well as the clock crossing bridge. These both make the slave even slower than it would otherwise be! IIRC We managed to get 2us for a read and 500ns for a write cycle. The per byte costs are relatively small in comparison. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH V8 03/10] USB/ppc4xx: Add Synopsys DWC OTG CoreInterface Layer
Also in_le32/out_le32/in_be32/out_be32 are architecture-specific AFAIK. Isn't the whole patch architecture-specific ? I'd suggest using readl/writel for LE ops and __be32_to_cpu(__raw_readl(addr))/__raw_writel(__cpu_to_be32(b),addr) for BE ops. Since the ppc doesn't have a byteswap instruction (and the LE memory transfers might even be slow!) you really don't want to be doing software byteswap. Not to mention the horrific synchronistion instructions that in_le32() and out_le32() actually contain. I didn't find __raw_readl() when I was looking for asm patterns that byteswapped memory transfers. I did find st_le32() and ld_le32() in arch/powerpc/include/asm/swab.h but that file is actually quite difficult to #include because there is another swab.h that gets included instead. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: FSL DMA engine transfer to PCI memory
I'm trying to use FSL DMA engine to perform DMA transfer from memory buffer obtained by kmalloc() to PCI memory. This is on custom board based on P2020 running linux-2.6.35. The PCI device is Altera FPGA, connected directly to SoC PCI-E controller. You'll need to use the dma engine that is part of the PCIe interface in order to get large PCIe transfers. I think everything else will still generate single 32bit PCIe transfers - which are (if your measurements match mine) exceptionally lethargic - the ISA bus is faster! That does work provided you remember to give the dma controller physical addresses and byteswap absolutely everything. (Oh, and I didn't get single word transfers to work - they locked the dma controller - not a problem since they are faster by PIO.) Note that the PPC Linux (Linux in general??) doesn't have a 'virtual to physical' function that works for all addresses, you'll need to remember the physical address of the PCIe slave and use malloc'ed memory for the descriptors (on which virt_to_phys() actually works). I don't think there is a standard device driver for the PCIe dma, I couldn't even find any header files that were vaugely relevent except in the uboot sources. I certainly wrote some code that just assumes it is on the right hardware! These are the relevant bits of code Global initialisation: /* Enable the read/write dma controllers */ csb_ctrl = in_le32(pex-pex_csb_ctrl); csb_ctrl |= PEX_CSB_CTRL_WDMAE | PEX_CSB_CTRL_RDMAE; out_le32(pex-pex_csb_ctrl, csb_ctrl); /* We don't rely on the dma polling the descriptor, I have NFI * whether the default of 0 means 'never poll' or 'poll very quickly'. * Set a large slow value for sanity. */ out_le32(pex-pex_dms_dstmr, ~0u); Transfer setup: /* We only support aligned writes - caller must verify */ dma_ctrl = PDMAD_CTRL_VALID; dma_ctrl |= PDMAD_CTRL_SNOOP_CSB; dma_ctrl |= PDMAD_CTRL_1ST_BYTES | PDMAD_CTRL_LAST_BYTES; dma_ctrl |= PDMAD_CTRL_NEXT_VALID; dma_ctrl |= len (PDMAD_CTRL_LEN_SHIFT - 2); /* Fill in DMA descriptor */ st_le32(desc-pdmad_ctrl, dma_ctrl); /* We MUST clear the status - otherwise the xfer will be skipped */ st_le32(desc-pdmad_stat, 0); st_le32(desc-pdmad_src_address, src_phys); st_le32(desc-pdmad_dst_address, dst_phys); st_le32(desc-pdmad_next_desc, 0); /* Clear old status */ st_le32(pex_dma-pex_dma_stat, in_le32(pex_dma-pex_dma_stat)); /* Give descriptor address to dma engine */ st_le32(pex_dma-pex_dma_addr, virt_to_phys(desc)); /* Wait for all above memory cycles, then start xfer */ iosync(); st_le32(pex_dma-pex_dma_ctrl, PEX_DMA_CTRL_START | PEX_DMA_CTRL_SNOOP); Poll for completion: /* Wait for transfer to complete/fail */ do { desc_stat = ld_le32(desc-pdmad_stat); } while (!(desc_stat PDMAD_STAT_DONE)); status = ld_le32(pex_dma-pex_dma_stat); if (status == (PEX_DMA_STAT_DSCPL | PEX_DMA_STAT_CHCPL) desc_stat == PDMAD_STAT_DONE) /* Transfer ok */ return 0; /* Transfer failed */ Oh, since I couldn't find it in the documentation, the first word of the dma descriptor is 'ctrl' and the last 'next_desc'. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
FW: FSL DMA engine transfer to PCI memory
memcpy_toio() works fine, the data is written correctly. After DMA, the correct data appears at offsets 0xC, 0x1C, 0x2C, etc. of the destination buffer. I have 12 bytes of junk, 4 bytes of correct data, then again 12 bytes of junk and so on. Does your Avalon MM slave decode the 4 byte enables? The slave always sees 2 Avalon MM writes, one of which will have no byte enables asserted. (Actually it is a 64bit Avalon transfer - and a bus width adapter gets inserted for you.) Internal M9K memory blocks get this right... David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: FSL DMA engine transfer to PCI memory
custom board based on P2020 running linux-2.6.35. The PCI device is Altera FPGA, connected directly to SoC PCI-E controller. This sounds like your FPGA doesn't handle burst mode accesses correctly. A logic analyzer will help you prove it. He is doing PCIe, not PCI. A PCIe transfers is an HDLC packet pair, one containing the request, the other the response. In order to get any significant throughput the hdlc packet(s) have to contain all the data (eg 128 bytes). On the ppc we used that means you have to use the dma controller inside the PCIe interface block. The generic dma controller can't even generate 64bit cycles into the ppc's PCIe engine. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: MPC831x (and others?) NAND erase performance improvements
An external IRQ line would let you limit interrupts to rising edges rather than all edges, though you'd lose the ability to directly read the line status. oh, one cannot read the IRQ line? didn't know that. Also I not sure all Freescale CPUs can do rising edge. I suspect that you may be able to leave the interupt masked, but still read the 'interrupt pending' register. Which would have the same effect. Our HW engineers tend to feed everything into an FPGA since it gives than a lot more flexibility over pin connections. In which case the invertor is trivial. (and the fpga interface can read the status!) David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: MPC831x (and others?) NAND erase performance improvements
The problem cropped up when there was a lot of traffic to the NAND (Samsung K9WAGU08U1B-PIB0), with the NAND being on the LBC along with a video chip that needed constant and prompt attention. What I would see is that, as the writes happened, the erases would wind up batched and issued all at once, such that frequently 400-700 erases were issued in rapid succession with a 1ms LBC BUSY cycle per erase. Are those just the reads of the status register polling to determine when the sector erase has completed ? In which case a software delay beteen the reads might work. Writes probably also have to be polled, but the individual writes happen faster. It is possible that an uncached read of another memory area will stall the cpu long enough to allow another LBC master in. One every few writes might be enough. I had to do something similar on rather different hardware ... David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: Getting the IRQ number (Was: Basic driver devel questions ?)
That's the idea. Communication between usermode and the driver is limited to simple ioctl calls and the driver receives an interrupt when certain data has been placed in memory blocks by the hardware (like a DMA). Then the user prog figures out where that latest data buffer is (mmap) and saves it. I've used pread() and pwrite() to directly copy data to/from a devices PCI memory space. The driver just verifies the offset and calls copy_to/from_user() [1]. High offsets can be used to access driver specific data. The userspace code for pread/write is less painful than using ioctl() for everything. This also lets you dump the device memory space with hexdump. I also made the device interrupt generate POLLIN (this needs a bit of care since the interrupt can arrive before poll() is called). David [1] I actually had to use the PCIe dma controller to get adequate throughput. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: Getting the IRQ number (Was: Basic driver devel questions ?)
Another question: I just spent 10 minutes trying to find where struct device was defined. Dirty trick #4: At the top of a source file (before the first include) add: struct device { int fubar; }; Then try to compile it. The compiler will the tell where it is defined! David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [MPC52xx]Latency issue with DMA on FEC
A mb() is usually used if you do a write to device and read from it. With out it, the CPU could perform the read before the write, which would give you an incorrect result. There's no other way around that. Possibly the synchronisation functions are doing significantly more work than is required. I was looking at the in_le32() and out_le32() functions for the ppc e300 (and maybe others). The out_le32() contains a 'sync' instruction - this may only be needed after a series of writes (eg just before a command). The iosync() function just adds a 'sync' and can be used as needed. The in_le32() not only contains the unwanted 'sync', but also a 'twi' (trap immediate - NFI exactly what this does) and 'isync'. The 'isync' is particularly horrid and unnecessary (aborts the instruction queue and refeches the opcode bytes). The very slow in_le32() might be there to give semi-synchronous traps on address fault - but unless the hardware is being probed that really isn't necessary. I did find st_le32() and ld_le32() in arch/powerpc/include/asm/swab.h but had difficulty #including that version of swab.h! #include ../arch/powerpc/include/asm/swab.h worked - but isn't that nice. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: application needs fast access to physical memory
On Wed, 2010-11-17 at 16:03 -0600, steven@teradyne.com wrote: My application needs a fast way to access a specific physical DDR memory region. The application runs on an MPC8548 PowerPC which has an MMU. I've tried two approaches that are typical for Linux, mmap() and using a kernel module that implements read()/write() into this region and I'm finding that performance is very slow for both. It's a couple orders of magnitude slower than, for example, copying a large buffer from one place in the application's virtual memory to another place in the application's virtual memory. The mmap() version should basically run at full speed, at least once you've faulted the address range in. This specific DDR region isn't specifically slow is it ? :) Or being mapped uncached ? David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: Corrected data type mismatch
OOPS! It is wrong here, The right one should be as following: - memcpy(mem, current-thread.evr[regno-32], + memcpy(mem, (void*)current-thread.evr[regno-32], dbg_reg_def[regno].size); The (void *) cast should be unnecessary David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: INFO: task snmpd:398 blocked for more than 120 seconds.
I can't make out what is causing this hang every now an then: INFO: task snmpd:398 blocked for more than 120 seconds. My problem with that 'error' message is that there is no way for a driver to disable it on a per-process basis. We have some processes whose 'normal state' is to sleep uninterruptibly in the kernel. Shutdown is handled by an explicit request (not by sending a signal). The processes could be kernel worker threads (except that is is ~impossible to handle them exiting from a loadble kernel module) so are actually children of a daemon sat inside an ioctl() request that never terminates! However, on the face of it, your case does look as though the mutex is fubar'ed. Might be worth (somehow) dumping the mutex state. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
PowerQUICC II PCIe dma
I'm trying to get a ppc 831x to do PCIe dma transfers, PIO transfers work ok but are somewhat lethargic (partially due to the slave). I've done the following: - All reads/writes (including the dma descriptors) are byteswapped. - Enabled the dma in PEX_CSB_CTRL (value 0x3f) - Used the kernel physical addresses for all addresses. For memory from virt_to_phys(), pci from pci_resource_start(). (virt_to_phys() gives invalid values for the pci area). - I'm setting both 'snoop' bits to 1 (as per the device errata) So the descriptor 'ctrl' has value (len 2) 12 | 0xff9, and the dma 'ctrl' 0x401. - I've put a 'sync' instruction (iosync()) just before the write to start the dma. On the first request after hard reset, the dma status register REX_RDMA_STAT gets set to 4. On subsequent requests it stays zero. Nothing else appears to happen at all (except the 'start' bit always reads back zero). The 'CSB status register' at 0x81c stays zero - so I don't think it is stuck waiting for the PCIe slave. I'm not at all sure of the word ordering for the dma descriptor! Page 14-113 doesn't say! However I've written the values so that it doesn't matter - once I get a status written back I'll know... I've probably forgotten something very silly! Anyone who has got this working any ideas ?? David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev