MPC8315 PCI express lockup

2012-05-30 Thread David Laight
(I apologise for this not having much to do with linux...)

We have a system with an MPC8315 ppc running linux 2.6.32
that uses the PCI express interface in RC mode to interface
to an Altera FPGA.
This uses both PIO and the PEX DMA interfaces (locally
written dma driver).
Under normal circumstances this all works fine.

However under some circumstances (eg DMA reads from
addresses that don't have actual slaves on the fpga [1])
the dma transfer requests don't complete.
There are no obvious error bits set in the hisr or csmisr
registers and the csb_status shows 'dma in progress'.
The dma transfer itself can be cancelled (by setting the
SUS bit in the dma_ctrl register), and the relevant
status bit are set to show the transfer has been aborted.

Once in this state all further PCI express transfers fail.
DMA requests timeout (driver gives up waiting for completion)
and PIO requests fault (Oops: Machine check, sig 7 [#1])
locking the kernel solid.

This looks very much like the MPC8315's errata PEX7
except that I don't see the CSMISR[RST] bit set.
I'm not at all sure the recovery for that errata is
actually writable! I'm certainly not going to write it
just to find out if it would help.
In any case it is quite likely that the driver's ISR will
try to do a PIO read while a dma transfer is timing out.

We can look at the fpga side and possibly find out
what it is doing, but it would be useful to know more
about the status on the ppc side.

I presume there is a way of doing a 'probe' type memory
cycle that won't panic on a fault?
Although that may not help me keep linux running as the
ISR needs to to a PCIe write to remove the level-sensitive
interrupt.

Any thought on ways to progress?

David

[1] PIO reads are ok and just return 0x.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: pread() and pwrite() system calls

2012-05-29 Thread David Laight
  A special pread/pwrite asm stub that just copies
  r7 to r0 could be used.
  
  Would it be enough to do:
  syscall_pread_pwrite:
  mov 0,7
  sc
  blr
  and handle the -ve - errno in C?
 
 Huh? Won't fly, r0 is used for the system call number!

I was copying that from r7!
Actually I have a much better stub by copying the one
used for mmap().
The system call itself is fine.
Using the system call almost halved the time taken
for a 4-byte read.

 On the other hand, I believed PPC had no problems passing
 up to 8 32 bit arguments in registers (r3 to r10), but
 I may be confusing with the standard ABI for function calls.
 
 Hmm, a quick look at kernel/entry_32.s shows that it should 
 be able to use at least r3 to r8, which should be sufficient.
 
 I think that it is an uClibc problem.

True, in that it isn't a kernel bug.
OTOH the kernel is likely to get blamed for non-atomic pread.
I've found the same user-space code in newlib as well.
glibc may be ok, some code I've found implies it only
uses the 'emulation' when the system call returns ENOSYS.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


pread() and pwrite() system calls

2012-05-25 Thread David Laight
We have a system with linux 2.6.32 and the somewhat archaic
uClibc 0.9.27 (but I'm not sure the current version is
any better, and I think there are binary compatibility
if we update).

I've just discovered that pread() is 'implemented'
by using 3 lseek() system calls and read().
(the same is true for the 64bit versions).

I thought that pread() was supposed to be atomic
(so usable concurrently by multiple threads) which
means that this implementation is completely broken.

I've not looked to see what glibc does.

I can see that part of the problem is the alignment
of the 64bit value on the argument list of syscall()
(when the register save area is cast to a sycall
argument structure).
But it also looks as though the uClibc syscall()
stub can only pass 5 arguments in registers, and
pread() (with an aligned 64bit offset) requires 6.

The ucLibc source seems to be predicated by __NR_pread,
but if that were defined it would try to call
__syscall_pread() and I can't find that anywhere.

A special pread/pwrite asm stub that just copies
r7 to r0 could be used.

Would it be enough to do:
syscall_pread_pwrite:
mov 0,7
sc
blr
and handle the -ve - errno in C?

I've seen other parts of linux pass structures to
avoid issues with excessive number of arguments.
Unlike things like NetBSD amd64 which will read
additional args off the user stack (and has per-system
call stubs in libc so doesn't lose one argument).
(I rearranged the NetBSD amd64 kernel trap frame
layout so all the args ended up adjacent...)

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH] powerpc/windfarm: don't pass const strings to snprintf

2012-05-03 Thread David Laight
 
 --- a/drivers/macintosh/windfarm_smu_sat.c
 +++ b/drivers/macintosh/windfarm_smu_sat.c
 @@ -287,7 +287,7 @@ static int wf_sat_probe(struct i2c_client *client,
   sens-sat = sat;
   sens-sens.ops = wf_sat_ops;
   sens-sens.name = (char *) (sens + 1);
 - snprintf(sens-sens.name, 16, %s-%d, name, cpu);
 + snprintf((char *)sens-sens.name, 16, %s-%d,  name,
cpu);
  
   if (wf_register_sensor(sens-sens))
   kfree(sens);

Wouldn't it be better to do:
snprintf((char *)(sens + 1), 16, %s-%d, name, cpu);

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [REGRESSION][PATCH V4 1/3] bpf jit: Make the filter.c::__load_pointer helper non-static for the jits

2012-04-02 Thread David Laight
 
 The function is renamed to make it a little more clear what it does.
 It is not added to any .h because it is not for general 
 consumption, only for bpf internal use (and so by the jits).

I'd have thought it better to put in into a bfp_internal.h
(or similar) with a big warning there about the asm users.

Possibly even worth adding some other defs that the asm
files will need (if there are any).

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH v2] powerpc/srio: Fix the compile errors when building with 64bit

2012-03-07 Thread David Laight
 diff --git a/arch/powerpc/sysdev/fsl_rmu.c
b/arch/powerpc/sysdev/fsl_rmu.c
 index 1548578..1bba6d1 100644
 --- a/arch/powerpc/sysdev/fsl_rmu.c
 +++ b/arch/powerpc/sysdev/fsl_rmu.c

 @@ -657,7 +658,8 @@ fsl_add_outb_message(struct rio_mport 
 *mport, struct rio_dev *rdev, int mbox,
   int ret = 0;
  
   pr_debug(RIO: fsl_add_outb_message(): destid %4.4x mbox %d
buffer  \
 -  %8.8x len %8.8x\n, rdev-destid, mbox, (int)buffer,
len);
 +  %8.8lx len %8.8zx\n, rdev-destid, mbox,
 + (unsigned long)buffer, len);

Should 'buffer' be printed with %p ??

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 4/4] powerpc/mpc8548: Add workaround for erratum NMG_SRIO135

2012-03-06 Thread David Laight
 
 Issue:
 Applications using lwarx/stwcx instructions in the core to
 compete for a software lock or semaphore with a device on
 RapidIO using read atomic set, clr, inc, or dec in a similar
 manner may falsely result in both masters seeing the lock
 as available. This could result in data corruption as
 both masters try to modify the same piece of data protected
 by the lock.
 
 Workaround:
 Set bits 13 and 29 of CCSR offset 0x01010 (EEBPCR register
 of the ECM) during initialization and leave them set
 indefinitely. This may slightly degrade overall system
 performance.

Might be worth actually saying what these bits do, and
why/when overall performance is affected.

Is the problem trying to do locked read-write cycles
on a slow peripheral bus?
Might be a case for just 'not doing that'.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH] powerpc/srio: Fix the compile errors when building with 64bit

2012-03-02 Thread David Laight
 
 For the file arch/powerpc/sysdev/fsl_rmu.c, there will be 
 some compile errors while using the corenet64_smp_defconfig:

I'm sure that replacing 'u32' with 'unsigned long'
is really the best thing to do here.
It looks to me as though they should be some pointer type.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 20/21] Introduce struct eeh_stats for EEH

2012-02-28 Thread David Laight
 
 +struct eeh_stats {
 + unsigned int no_device; /* PCI device not found */
...
 + no device   =%d\n
...

Use %u (for all the stats), you really don't want negative
values printed.
I've NFI how long wrapping these counters might take!
If it is feasable (maybe much above 100Hz) then you
need 64bit counters.

David



___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 20/21] Introduce struct eeh_stats for EEH

2012-02-24 Thread David Laight
 
 +/*
 + * The struct is used to maintain the EEH global statistic
 + * information. Besides, the EEH global statistics will be
 + * exported to user space through procfs
 + */
 +struct eeh_stats {
 + unsigned long no_device;/* PCI device not found
*/
 + unsigned long no_dn;/* OF node not found
*/
 + unsigned long no_cfg_addr;  /* Config address not  found
*/
 + unsigned long ignored_check;/* EEH check skipped
*/
 + unsigned long total_mmio_ffs;   /* Total EEH checks
*/
 + unsigned long false_positives;  /* Unnecessary EEH checks
*/
 + unsigned long slot_resets;  /* PE reset
*/
 +};

Why 'unsigned long', surely either 'unsigned int'
or a fixed-width type.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: ARM + TI DSP device tree repesentation

2011-12-14 Thread David Laight
 
 I am trying to locate the device tree for OMAP platform which 
 has ARM and TI DSP core.
 
 The background is that we are trying to understand how such 
 SOCs with dissimilar cores are represented in linux.

Most likely linux runs on the ARM, and the DSP processor
is loaded with specific code to perform certain functions.

As such the DSP is probably a 'peripheral' not a cpu.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH] powerpc: POWER7 optimised copy_to_user/copy_from_user using VMX

2011-12-08 Thread David Laight
 
  One idea would be to have a structure of function pointers for each
  CPU that gets runtime patched into the right places, 
  similar to how we do some of the MMU fixups.
 
 Sounds good to me :-)

Except the indirect jump/call is almost certainly
never predicted - so will be slow.

You might want to patch jump instructions instead.

The same is true for in-kernel memcpy() and other
similar operations.

I actually wonder sometimes what the typical lengths
are for these sort of functions, and whether, in fact,
small lengths dominate - where the fixed costs matter.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: Re: [PATCHv5] atomic: add *_dec_not_zero

2011-12-05 Thread David Laight
Looking at this:

 #ifndef atomic_inc_unless_negative
 static inline int atomic_inc_unless_negative(atomic_t *p)
 {
 int v, v1;
 for (v = 0; v = 0; v = v1) {
 v1 = atomic_cmpxchg(p, v, v + 1);
 if (likely(v1 == v))
 return 1;
 }
 return 0;
 }
 #endif

why is it optimised for '*p' being zero??
I'd have though the initial assignment to 'v' should
be made by reading '*p' without any memory barriers (etc).

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 5/6] powerpc/boot: Add mfdcrx

2011-11-30 Thread David Laight
 
  +#define mfdcrx(rn) \
  +   ({  \
  +   unsigned long rval; \
  +   asm volatile(mfdcrx %0,%1 : =r(rval) : g(rn)); \
  +   rval; \
  +   })
 
 g is never correct on PowerPC, you want r here.  You can write
 this as a static inline btw, you only need the #define stuff when
 there is an i constraint involved.

I think you can still use static inlines even when a
constraint is one that requires a compile time constant.
eg (not ppc, but the n become part of the instruction
word):

static __inline__ uint32_t
custom_inic(const uint32_t op, uint32_t a, const uint32_t b)
{
uint32_t result;

__asm__ ( custom\t%1, %0, %2, c%3
: =r (result) : n (op), r (a), n (b));
return result;
}

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 2/2] powerpc/85xx: Renamed mpc85xx_common.c to common.c

2011-11-24 Thread David Laight
 
 Subject: [PATCH 2/2] powerpc/85xx: Renamed mpc85xx_common.c 
 to common.c
 
 The file name is already scoped by the directory its in.
...
 rename from arch/powerpc/platforms/85xx/mpc85xx_common.c
 rename to arch/powerpc/platforms/85xx/common.c

All very well until you only have a reference to the
filename part somewhere and want to 'find' the file
in a large directory tree.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 01/16] pmac_zilog: fix unexpected irq

2011-11-24 Thread David Laight
 
 On most 68k Macs the SCC IRQ is an autovector interrupt and cannot be 
 masked. This can be a problem when pmac_zilog starts up.

Wouldn't this also happen if the interrupt were shared?
Hopefully nothing vaguely modern uses the borked Zilog 8530 SCC
(which I presume is the part in question - brings back
too many nightmares)

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 1/3] mtd/nand: fix coding style issue in drivers/mtd/nand/fsl_elbc.c

2011-11-15 Thread David Laight
 
 On Tue, 2011-11-15 at 11:26 +, Jenkins, Clive wrote:
   fix whitespaces,tabs coding style issue and ...
  
  In my opinion this code was already correct, and would display
correctly
  at any TAB setting. This patch changes it so that it displays
  incorrectly at all TAB settings other than 8.
 
 Any tab setting other than 8 is incorrect and should not be used for
 Linux code. If I view it in a proportionally-spaced font, 
 it's not going to line up right either. Fix your editor, if that's an
issue for you.

Personally I don't even attempt to line up argumants on
continuation lines - it just wastes vertical space.
I just indent them as any other continuation line, so
double-indent if using 4-char indents and 1/2 indent for
8-char indents.

Tabs have to be assumed to be 8 columns, otherwise all sorts
of tools just get it wrong.
(Or you ban tabs from source files.)

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Disabling TCP slow start after idle

2011-11-11 Thread David Laight
We have some connections that suffer very badly from the TCP
'slow start' algorithm. These are connections that will
always be local - they may be MAC-Switch-MAC using RGMII
crossover, they might also be connected via an external
switch. In either case the RTT is most likely to be almost
zero, certainly below 1ms.

The traffic is single packets (carrying another protocol)
so we have Nagle disabled and the send and receive sides
run separately. So the traffic is neither bulk, nor
command/response.

This means that there is very rarely any unacked data,
so almost every packet is sent using 'slow start'.

If the external switch drops a packet (they do!) then
slow start stops more packets being sent, and nothing
progresses for about 1.5 seconds by which time there
is a significant amount of traffic queued and, in some
cases, data has to be discarded.

Similar issues happen if the receiving system decides
to defer the ack until a timer tick (instead of
sending one after every second packet). In this case
only 4 packets are sent. (We fixed this one be sending
a software ACK every 4 packets.)

Quite cleary the 'slow start' algorithm just doesn't
work in these cases.

I found this https://lkml.org/lkml/2010/4/9/427
discussion about a socket option to disable slow start.
But it seems that some people are completely against the idea.
I'd have thought that the global option would be more of a
problem - since that will affect ftp connections to remote
hosts where slow start is alomost certainly benefitial.

I'd have thought it would be sensible to allow one (or more)
of the following (either as a sysctl, socket option, or
code change):
1) Disable slow start for the local subnet.
2) Disable slow start for connections with very low RTT.
3) Disable slow start for a minimum period with no traffic
   (after the last packet is acked).

I'm not sure of the resolution used by the Linux RTT
calculations. I know NetBSD had a recent set of patches
to fix calculation errors with low RTT because the code
had been written when all RTT were much longer.

David

(Copied to linuxppc-dev because I'm subscribed to it.)


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH v2 1/5] [ppc] Process dynamic relocations for kernel

2011-11-10 Thread David Laight
 
  How about using clean_dcache_range() to flush the range runtime
  address range [ _stext, _end ] ? That would flush the entire
  instructions.
 
  Wouldn't that result in more cache flushing than the original
solution?
 
  For example, my kernel is 3.5MB.  Assuming a 32 byte cache line
size,
  clean_dcache_range(_stext, _end) would result in about 115,000
dcbst's
  (3.5MB / 32).
 
 Oops ! You are right. We could go back to the 
 clean_dcache_all() or the
 initial approach that you suggested. (dcbst).

Maybe clean_dcache_range() itself should limit the range
to the size of the cache?
I suspect that code can easily know the cache size.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH v2 1/5] [ppc] Process dynamic relocations for kernel

2011-11-07 Thread David Laight
 
 On Fri, 2011-11-04 at 14:06 +0530, Suzuki Poulose wrote:
  On 11/03/11 05:06, Josh Poimboeuf wrote:
   On Tue, 2011-10-25 at 17:23 +0530, Suzuki K. Poulose wrote:
   @@ -137,6 +137,9 @@ get_type:
 lwz r0, 8(r9)   /* r_addend */
 add r0, r0, r3  /* final addend */
 stwxr0, r4, r7  /* memory[r4+r7]) = (u32)r0 */
   + dcbst   r4,r7   /* flush dcache line to memory */
   + sync/* wait for flush to complete */
   + icbir4,r7   /* invalidate icache line */
  
  Doing it this way has two drawbacks :
  
  1) Placing it here in relocate would do the flushing for 
 each and every update.
 
 I agree.  My kernel had around 80,000 relocations, which means 80,000
 d-cache line flushes (for a 32k d-cache) and 80,000 i-cache line
 invalidates (for a 32k i-cache).  Which is obviously a little 
 overkill.
 Although I didn't notice a performance hit during boot.

The I-cache invalidates shouldn't be needed, the un-relocated
code can't be in the I-cache (on the grounds that executing
it would crash the system).
A single sync at the end is probably enough as well.
I guess it is possible for the cpu to prefetch/preload
into the i-cache through the jump into the relocated code?
So maybe a full i-cache invalidate right at the end? (or
a jump indirect? - which is probably there anyway)

The d-cache will need some kind of flush, since the modified
lines have to be written out, the only time it generates
additional memeory cycles are if there are two (or more)
reloations in the same d-cache line. Otherwise the early
write-back might help!

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH] powerpc/usb: use unsigned long to type cast an address of ioremap

2011-11-03 Thread David Laight
 
 Below are codes for accessing usb sysif_regs in driver:
 
 usb_sys_regs = (struct usb_sys_interface *)
   ((u32)dr_regs + USB_DR_SYS_OFFSET);
 
 these codes work in 32-bit, but in 64-bit, use u32 to type cast the
address
 of ioremap is not right, and accessing members of 'usb_sys_regs' will
cause
 call trace, so use unsigned long for both 32-bit and 64-bit.

Wouldn't a (char *) cast be even better?

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH] powerpc/usb: use unsigned long to type cast an address of ioremap

2011-11-03 Thread David Laight
 
 On Thu, Nov 3, 2011 at 4:58 AM, Shaohui Xie shaohui@freescale.com wrote:
 
                 usb_sys_regs = (struct usb_sys_interface *)
  -                               ((u32)dr_regs + USB_DR_SYS_OFFSET);
  +                               ((unsigned long)dr_regs + 
  USB_DR_SYS_OFFSET);
 
 This makes more sense:
 
 usb_sys_regs = (void *)dr_regs + USB_DR_SYS_OFFSET;

But that is invalid C.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH] powerpc/usb: use unsigned long to type cast an address of ioremap

2011-11-03 Thread David Laight
usb_sys_regs = (void *)dr_regs + USB_DR_SYS_OFFSET;
 
  But that is invalid C.
 
 What's invalid about it?  I haven't tried compiling this 
 specific line of code, but I've done stuff like it in the past many
times.
 
 Are you talking about adding an integer to a void pointer?  
 If so, then that's something that gcc supports and that the kernel
uses 
 all over the place.

Arithmetic on 'void *' should not be done. I know some versions of
gcc allow it (provided some warning level/option is enabled) but
that doesn't mean it is valid.
My suspicions are that is was allowed due to the way 'void *'
was originally bodged into gcc.

 A char* is incorrect because a char could be more 
 than one byte, in theory.

It is somewhat difficult to untangle the standard, but
sizeof (char) is defined to be one.

Of course, the C language doesn't actually require that
you can converts between pointers to different types in
any well-defined manner. But most of the low level device
access assumes an adequately linear address space.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH] powerpc/usb: use ioremap instead of base plus offset to access regs

2011-11-02 Thread David Laight
 
  #ifndef CONFIG_ARCH_MXC
   if (pdata-have_sysif_regs)
 - usb_sys_regs = (struct usb_sys_interface *)
 - ((u32)dr_regs + USB_DR_SYS_OFFSET);
 + usb_sys_regs = ioremap(res-start + USB_DR_SYS_OFFSET,
 + sizeof(struct
usb_sys_interface)/sizeof(int));
  #endif

That ioremap() doesn't look right.
Isn't the 'size' in bytes??
(Although it will only matter if it crosses a page boundary.)
Mind you, I'd have though the original ioremap() should
have covered the entire structure??

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH][v2] uio: Support 36-bit physical addresses on 32-bit systems

2011-10-13 Thread David Laight
 
 Kumar Gala wrote:
   +   phys_addr_t addr;
   
   Please add a comment here saying:
   
   1) That 'addr' can be a virtual or physical address
  The code and everything else makes that clear
 
 I'm sorry, but I have to strongly disagree here.  It is *NOT* 
 clear that a variable of type 'phys_addr_t' can hold something
 that is not a physical address.

Since there is a discriminating field, could a union be used?
At a guess the type of the address is constrained between
produces and consumer??

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled

2011-10-10 Thread David Laight
 
 Then, this statement:
 
 *(u32 *) (tx_desc-ctrl.vlan_tag) |= ring-doorbell_qpn;

...
 instead do ... :

   *(u32 *) (tx_desc-ctrl.vlan_tag) |=
cpu_to_be32(ring-doorbell_qpn);
 
 (Also get rid of that cast and define vlan_tag as a __be32 to start
 with).

Agreed, casts that change the type of memory - *(foo *)xxx - are
generally bad news unless you are casting a generic 'buffer' to
a specific structure.
I've seen far to much code that ends up being depending on the
endianness and system word size.

For the above I'd actually suggest making 'doorbell_qpn' have the
correct endianness in order to avoid the (potential) swap every
time it is set.

You also need to treble-check the required endianness for the
'vlan_tag' in the tx descriptor. What would be needed is the
MAC PCI slave were on an x86 (LE) system.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled

2011-10-10 Thread David Laight
 
 What is this __iowrite64_copy... oh I see
 
 Nice, somebody _AGAIN_ added a bunch of generic IO 
 accessors that are utterly wrong on all archs except
 x86 (ok, -almost-).
 There isn't a single bloody memory barrier in there !

Actually memory barriers shouldn't really be added to
any of these 'accessor' functions.
(Or, at least, ones without barriers should be provided.)

The driver may want to to a series of writes, then a
single barrier, before a final write of a command (etc).

in_le32() from io.h is specially horrid!

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH] mlx4_en: fix transmit of packages when blue frame isenabled

2011-10-06 Thread David Laight
 
  static void mlx4_bf_copy(unsigned long *dst, unsigned long *src,
unsigned bytecnt) {
 + int i;
 + __le32 *psrc = (__le32 *)src;
 +
 + /*
 +  * the buffer is already in big endian. For little endian
machines that's
 +  * fine. For big endain machines we must swap since the chipset
swaps again
 +  */
 + for (i = 0; i  bytecnt / 4; ++i)
 + psrc[i] = le32_to_cpu(psrc[i]);
 +
   __iowrite64_copy(dst, src, bytecnt / 8);
  }

That code looks horrid...
1) I'm not sure the caller expects the buffer to be corrupted.
2) It contains a lot of memory cycles.
3) It looked from the calls that this code is copying descriptors,
   so the transfer length is probably 1 or 2 words - so the loop
   is inefficient.
4) ppc doesn't have a fast byteswap instruction (very new gcc might
   use the byteswapping memery access for the le32_to_cpu() though),
   so it would be better getting the byteswap done inside
   __iowrite64_copy() - since that is probably requesting a byteswap
   anyway.
OTOH I'm not at all clear about the 64bit xfers


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: scheduling while atomic:

2011-10-05 Thread David Laight
Not entirely relevant to the error you are seeing,
but your ISR is:

   irqreturn_t cpld_irq_handler(int irq, void * dev_id, struct
pt_regs *regs)
   {
   wake_up(cpld_intr_wait);
   atomic_inc(cpld_intr_data); /* incrementing this will
indicate the poll() that the interrupt is occured */
   return 0;
   }

You need to set 'cpld_intr_data' before clling wake_up() otherwise
the sleeping process might run on another cpu before the data item
changes.

You also don't need to use the 'atomic' functions.
Writing to a 'volatile' data item is sufficent.
You don't want to count interrupts, just mark they have happened.

Maybe your problem it that the ISR doesn't do anything to
cause the hardware to drop its IRQ.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: build failure with gcc 4.6.0 array subscript is above array bounds

2011-08-18 Thread David Laight
 
 Subject: build failure with gcc 4.6.0 array subscript is 
 above array bounds
...
 That corresponds to:
   tmp = ((unsigned long *)child-thread.fpr)
   [TS_FPRWIDTH * (index - PT_FPR0)];
 
 child-thread.fpr is double fpr[32][TS_FPRWIDTH].
 
 index has already been bounds checked so we know it is = PT_FPSCR.

That code looks gross
I think it is trying to index a 2D array with a single index
and type-pun the lookup.
I'm not sure how the array size (for the subscript error)
is determined in the presence of the cast, but without
the cast the index would have to be less than 32.
I also suspect this is failing when gcc inlines the function
from a call where 'index' is a constant.

Possibly the code should read:
   tmp = (unsigned long *)child-thread.fpr[index - PT_FPRO];
although index may have been scaled by 'sizeof double/sizeof long'.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [RFC] Optimize __arch_swab32 and __arch_swab16

2011-08-11 Thread David Laight
 Joakim Tjernlund joakim.tjernl...@transmode.se writes:
 
  unsigned short my__arch_swab16(unsigned short value)
  {
  __asm__(rlwimi %0,%0,16,0x00ff
  : +r (value));
 
 You are creating a value that does not fit in a short.

Which is a problem because the compiler could schedule
it be written back to real memory between the instructions.

Actually the generated code would be better if the swap16()
functions operated on 'unsigned int' fields - since it would
save the compiler from doing a lot of shifts/masks elsewhere.

For instance there is likely to be a mask with 0x prior
to the call to swab16().

Since one use of these is for htons() (etc), when
the host is the correct endianness these are #defines that
do nothing - so out of range values aren't masked.
So it seems to me that defining:
   unsigned int swab(unsigned int);
would be fine - except it clashes with standard headers :-(

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [RFC] Optimize __arch_swab32 and __arch_swab16

2011-08-11 Thread David Laight
 
  Which is a problem because the compiler could schedule
  it be written back to real memory between the instructions.
 
 It can? There is no memory here, just registers. Even if it
 is written to memory, how would that affect the register?

Although the function argument is passed in a register, the
compiler could generate a store-load sequence before and
after each __asm__() line.

 Assuming you are right, would rewriting it to
   __asm__(rlwimi %0,%0,16,0x00ff\n\t
   rlwinm %0,%0,24,0x
   : +r(value));
 help?

Except that now you've stopped the compiler scheduling
another instruction between the two - probably forcing a
execution stall.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: union/struct representations for MAS Registers

2011-08-10 Thread David Laight
 
 I have some and use them in some code, they represent ISA 
 2.06 MAVN=1 (version 2)
 Can I keep them?
 if so, should I put them somewhere useful to others?
 
 Examples:
 union mas1 {
   u32 _val;
   struct {
   u32 v:1;
   u32 iprot:1;
   u32 tid:14;
   u32 _reserved_1:2;
   u32 ind:1;
   u32 ts:1;
   u32 tsize:4;
   u32 _reserved_2:8;
   };
 };

Bitfields are rather non-portable, the compiler has a lot of choice
about how to align the bits in memory.
Their use to map anything physical is doomed to portabilily issues.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: union/struct representations for MAS Registers

2011-08-10 Thread David Laight
 
  Bitfields are rather non-portable, the compiler has a lot of choice
  about how to align the bits in memory.
 
 I'm ok with the masking stuff.
 However, I'm actually surprised this is true given the 
 maturity of our ABIs.

The C standard says nothing at all about how bitfields are implemented,
I think the first bit might be 0x1, 0x80, 0x100 or 0x8000
when treated as a 32bit value, regardless of the endianness.

Different architectures can (and do) assign things in different ways.
So code that is ok on ppc might fail on arm or x86 (etc).

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW trackingof dirty young

2011-07-19 Thread David Laight
 
 Got it, if the fault_in_user_writeable() is designed to catch the
 exact same write permission fault problem we discuss here, so
 your patch fixed that very nicely, we should fixup it by directly
 calling handle_mm_fault like what you did because we are for sure
 to know what just happened(permission violation), its not necessary
 to check what's happened by calling gup--follow_page, and
 further the follow_page failed to report the fault :-)

One thought I've had - and I don't know enough about the data
area in use to know if it is a problem - is what happens if
a different cpu faults on the same user page and has already
marked it 'valid' between the fault happening and the fault
handler looking at the page tables to find out why.
If any of the memory areas are shared, it might be that the
PTE (etc) might already show the page a writable by the
time the fault handler is looking at them - this might confuse it!

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread David Laight
 
 The fault causing futex_atomic_cmpxchg_inatomic() is
 protected by pagefault_disable(), so the page fault handler has
 no chance to toggle the SW dirty/young tracking.

Perhaps that is the bug!
Whatever pagefault_disable() does, it shouldn't disable the
SW dirty/young tracking - which should only needs bits moving
in the page table itself (and TLB update??) rather than any
operations on the rest of the data areas.

It looks to me as though this could happen any time a page
is marked inaccessible by the dirty/young tracking.
Not just as a result of COW.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 4/4] powerpc: Enable lockup and hung task detectors in pseriesand ppc64 defeconfigs

2011-07-07 Thread David Laight
 
 As a result of changes to Kconfig files, we no longer enable
 the lockup and hung task detectors. Both are very light weight
 and provide useful information in the event of a hang, so
 reenable them.
...
 +CONFIG_LOCKUP_DETECTOR=y
 +CONFIG_DETECT_HUNG_TASK=y

Is one of thise responsible for generating a kernel stack traceback
when a process has been sleeping uninterruptably for a 'long' time?

We have a kernel subsystem that has several 'worker' threads,
these always sleep uninterruptable (they are shut down by explicit
request) and, at times, can be idle for long periods.

Perhaps it should be possible to disable the check either on
a per-process of per sleep basis?

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 2/3] powerpc: POWER7 optimised memcpy using VMX

2011-06-17 Thread David Laight
 
 On Fri, Jun 17, 2011 at 02:54:00PM +1000, Anton Blanchard wrote:
  Implement a POWER7 optimised memcpy using VMX. For large aligned
  copies this new loop is over 10% faster and for large unaligned
  copies it is over 200% faster.
...

 BTW: do you have any statistics on the size distribution
 of memcpy memcpy_to_from_usr?
 
 My gut feeling is that the intermediate case is the most
 important, and the short case the less critical (drowned
 in overhead's noise) but that's the kind of things on which
 I've often been wrong.

My thoughts are certainly that the code is too big, and that
the 'cold cache' version and possibly the effects of increasing
the size of the working set (ie displacing other code) may
be significant in real life.

For memcpy() the 'short' case will happen surprisingly often,
I suspect the fixed costs for the short case may dominate some
real workloads.

I'm not sure the speed of misaligned copies matters enough
to take the hit of the alignment test!

Of course, I don't actually remember doing any instrumentation
of this, but I have changed i386/amd64 memcpy (not linux/glibc)
to avoid the 'rep movsb' used for the trailing bytes (copy
the last 'word' first) - the setup cost for 'rep movsb' is
over 40 clocks on netburst P4!
(It is possible to get amd64 to copy data as fast as the
'rep movd', but the setup times are longer. And very recent
Intel cpus contain hardware acceleration for aligned and
misaligned 'rep movsd' - so trying anything clever isn't good.)
I do realise thise doesn't directly apply to ppc :-)

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH] fs_enet: fix freescale FCC ethernet dp buffer alignment

2011-06-17 Thread David Laight
 
 Hello,
 Motioned to the memory aligned, now there is such requirement:
 When the driver send an packet to hardware, the skb's address passed
by
 stack do a dma map into hardware, the skb's dma address must 
 be 64-byte aligned.

Does the hardware support buffer chaining?
In which case you only need to copy the data upto the first
64 byte boundary into another buffer.

Actually, given that you are likely to have to fixup every
fragment of the frame being transmitted, if might be worth
allocating a fixed transmnit buffer area and copying the
frames into it prior to sending.
Certainly you need to allow for transmits made up of a
significant number of small buffers linked together.

Really you should beat up the hardware designers!

Copying the data to even a 4 byte boundary is almost
always a misaligned copy. Typically this only applies
to the receive dma - when writing a 2 byte pad before
the frame data would be much better.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: Changes to of_device ?

2011-06-17 Thread David Laight
 
 For finding structures, the following pattern tends to work well:
 
 git grep 'struct device_node {'

In general 'typedef.*\type_name\' will find types.
eg: grep -r -n --include '*.h' 'typedef.*\type_name\'
base_of_source_tree

Finding functions (in .c files) is easy if names start in
column 1 '^function_name\' will work.
But I ended up with: '^[^=]*[^ =][^=]*function_name\' to find
some where the type was on the same line.

If you get really desparate, the definitions of functions can
be found by grepping through the namelists of the .o files.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: Relocatable kernel for ppc44x

2011-06-15 Thread David Laight
 
 The PPC440X currently uses 256M TLB entries to pin the 
 lowmem. When we go for a relocatable kernel we have to :
 
 1) Restrict the kernel load address to be 256M aligned
 
 OR
 
 2) Use 16M TLB(the next possible TLB page size supported) 
 entries till the first
 256M and then use the 256M TLB entries for the rest of lowmem.

What is wrong with:

3) Use 256M TLB entries with the lowest one including
   addresses below the kernel base.

Clearly the kernel shouldn't be accessing the addresses
below its base address - but that is true of a lot of
address space mapped into the kernel.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 7/7] [v5] drivers/virt: introduce Freescale hypervisormanagement driver

2011-06-10 Thread David Laight
 
  +enum fsl_hv_ioctl_cmd {
  +   FSL_HV_IOCTL_PARTITION_RESTART = _IOWR(0, 1, struct
fsl_hv_ioctl_restart),
...
  +};
 
 Using a #define here is usually preferred because then you 
 can use #ifdef in a user application to check if a given
 value has been assigned.

It is also possible to add:
#define FSL_HV_IOCTL_PARTITION_RESTART FSL_HV_IOCTL_PARTITION_RESTART
to have much the same effect.
But there are many cases where #defines are better.
I only tend to use enums when the constanst are beting generated
by the expansion of a #define.

 More importantly, the code you have chose (0) conflicts with 
 existing drivers (frame buffer, scsi and wavefront among others).
 Please chose a free one and
 add it to Documentation/ioctl/ioctl-number.txt in the same patch.

It is rather a PITA that, when 'int' went from 16 to 32 bits, the
BSD people used the high 16 bits for size/flags rather than
using the extra bits to help make ioctl's unique.
Linux seems to have copied BSD here - rather than SYSV.

One problem with clashing ioctl commands is when systems like
NetBSD are running linux binaries and need to translate ioctl
buffers to/from native format. If the ioctl commands are
unique this can be done much more easily.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [gianfar]bandwidth management problem on mpc8313 based board

2011-06-08 Thread David Laight
 Subject: [gianfar]bandwidth management problem on mpc8313 based board
...
 I have mpc8313 powerpc based board with silicon revision 2.1. the
 processor has two ETH ports (eTsec1 and eTsec2) i.e. eth0 and eth1.
 eth0 is 1Gbps port and eth1 is 100Mbps port. On board there is L2
 switch from TANTOS2G (psb6972) supports one port 1Gbps,
 and from switch there are 4 more eth ports derived which are 100Mbps
 ports and port based VLAN is configured for this purpose.
 
 The interface between switch and eth0 (port of processor) is RGMII. So
 the processor port and switch port are connected on 1Gbps Link.
...
 After this I started to perform bandwidth test using iperf tool.
 When I performed this test on one port out of 4 derived ports I am
 getting bandwidth in the range of 80-85Mbps
 but when the same test is performed on 2 ports simultaneously then the
 per port bandwidth is reduced to 40-45Mbps.

To summerise, you have a Ge port connected by RGMII (cross over) to
an on-board switch that is configured to use VLAN tagging to drive
four external 100M ports?

I see two likely reasons for the aggregate throughput being constant:
1) The switch has limited throughput/buffering
2) The host really is 100% busy
3) The remote system has limited throughput

I'd vote for the system being busy and 'top' (or whatever you are
using) lying about the cpu usage. Measuring free cpu time by counting
it in a low priority process is much more accurate than relying on
the 'code interrupted by timer tick' scheme.
(Clearly the scheduler could use a high-res timestamp on entry/exit
to the idle loop and/or process switch - but, to my knowledge, the
linux kernel only uses the timer interrupt.)

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [RFC][PATCH] powerpc: Use the #address-cells information to parsememory/reg

2011-06-06 Thread David Laight
  Changed the add_usable_mem_property() to accept FILE* fp instead of
int fd,
  as most of the other users of read_memory_region_limits() deals with
FILE*.
 
  Signed-off-by: Suzuki K. Poulose suz...@in.ibm.com
 
 Could you please let me know your thoughts/comments about this patch ?

Is the change to use 'FILE *' actually progress?
I'd have thought that the randomly aligned read/lseek system calls
that this allows to happen are not desirable for anything that
isn't a true file.

I'd also suggest that the sizeof's should be applied to the
actual type of the variable being read/written, not arbitrarily
'long' or 'int', and this probably ought to be some fixed size type.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 1/2] ppc/85xx: create a platform node for PCI EDAC device

2011-06-02 Thread David Laight
 
  +int __init fsl_add_pci_err(void)
 
 static :-)

and why __ ?
aren't such names reserved in the C language for some purpose.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: Cannot run tcl program of PPC460EX

2011-05-25 Thread David Laight
 
 On Tue, May 24, 2011 at 11:16:12PM -0700, efti wrote:

 
  I am trying to run tcl of my PPC460ex board. I have download the
  tcl-8.3.3-sol26-sparc-local.gz file. But when I run it I get 
  the Syntax error.
 
 That tarball seems to indicate it's for both Solaris and Sparc.
 PPC460ex is neither of those.

Which may well mean that the 'syntax error' is from the shell
trying to interpret (as a shell script) a sparc elf binary
that the kernel has failed to 'exec'.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: PCI DMA to user mem on mpc83xx

2011-05-24 Thread David Laight
 
 we have a pretty old PCI device driver here that needs some
 basic rework running on 2.6.27 on several MPC83xx.
 It's a simple char-device with give me some data implemented
 using read() resulting in zero-copy DMA to user mem.
 
 There's get_user_pages() working under the hood along with 
 SetPageDirty() and page_cache_release().

Does that dma use the userspace virtual address, or the
physical address - or are you remapping the user memory into
kernel address space.

If the memory is remapped into the kernel address space, the
cost of the mmu and tlb operations (especially on MP systems)
is such that a dma to kernel memory followed by copyout/copytouser
may well be faster!

That may even be the case even if the dma is writing to the
user virtual (or physical) addresses when it is only
necessary to ensure the memory page is resident and that
the caches are coherent.

In any case the second copy is probably far faster than the
PCI one!

I've recently written driver that supports a pread/pwrite interface
to the memory windows on a PCIe card. It was important to use
dma for the PCIe transfers (to get a sensible transfer size).
I overlapped the copyin/copyout with the next dma transfer.
The dma's are fast enough that it is worth spinning waiting
for completion - but slow enough to make the overlapped
operation worthwhile (same speed as a single word pio transfer).

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 1/3] mpt2sas: remove the use of writeq, since writeq isnot atomic

2011-05-19 Thread David Laight
 
 The specific mpt2sas problem is that if we write a 64 bit register non
 atomically, we can't allow any interleaving writes for any other
region
 on the chip, otherwise the HW will take the write as complete in the
64
 bit register and latch the wrong value.  The only way to achieve that
 given the semantics of writeq is a global static spinlock.

That sounds like very specific and slightly dodgy hardware.
You don't say what the scope of 'region on the chip' is, but
it looks like you need to disable ALL writes to the memory
area between the first and second writes of the 64bit value
and not just those coming from writeq().
I don't see how this can possibly be done by normal mutexing
around the writeq() sequence, surely you need to lock the bus
between the two transfers.
Even dma writes would be a problem.

The only way I can think to stop other cpus doing writes
is to use IPIs for force them into a busy wait loop.

All rather reminds me of a PCI slave that got things
horribly wrong when a read was done while a write was
still 'posted', or a 2nd master did a cycle did a read
while a read rerun sequence was still in progress.
(required a mutex and dummy cycles).
At least than one wqs confined to one driver.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: book to learn ppc assembly and architecture

2011-05-18 Thread David Laight
 
  On Mon, 2011-05-16 at 16:37 +1000, Michael Neuling wrote:
   what  is  the  best  book  to  learn  assembly  and architecture .
 
 Assuming you have a powerpc compiler available you can use the -S
 option to produce assembly listings.

With gcc add -fverbose-asm for more info.

For a general background, look at something much simpler than ppc,
even if you don't write/run any code.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 1/3] mpt2sas: remove the use of writeq, since writeq isnot atomic

2011-05-18 Thread David Laight
 

static inline void writeq(__u64 val, volatile void __iomem
*addr)
{
writel(val, addr);
writel(val  32, addr+4);
}
...
We need the 64 bit completed in one access pci memory write,
else spin lock is required.
Since it's going to be difficult to know which writeq was
implemented in the kernel, 
the driver is going to have to always acquire a spin lock each
time we do 64bit write.
...
 I'm just in the process of finding them now on IRC so I can demand an
 explanation: this is a really serious API problem because writeq is
 supposed to be atomic on 64 bit.

Most 32 bit systems don't have atomic 64bit writes.
I'd also have thought there would be code which wouldn't mind the
write being done as two cycles.

I'm not sure that some of the ppc soc systems are capable of
doing a 64bit data pci/pcie cycle except by dma.
So your driver is probably doomed to require a lock.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system callfiltering

2011-05-13 Thread David Laight
 ... If you can be completely stateless its easier, but there's
 a reason that stacking security modules is hard.  Serge has tried in
the
 past and both dhowells and casey schaufler are working on it right
now.
 Stacking is never as easy as it sounds   :)

For a bad example of trying to allow alternate security models
look at NetBSD's kauth code :-)

NetBSD also had issues where some 'system call trace' code
was being used to (try to) apply security - unfortunately
it worked by looking at the user-space buffers on system
call entry - and a multithreaded program can easily arrange
to update them after the initial check!
For trace/event type activities this wouldn't really matter,
for security policy it does.
(I've not looked directly at these event points in linux)

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH] atomic: add *_dec_not_zero

2011-05-04 Thread David Laight
 
 Introduce an *_dec_not_zero operation.  Make this a special case of
 *_add_unless because batman-adv uses atomic_dec_not_zero in different
 places like re-broadcast queue or aggregation queue management. There
 are other non-final patches which may also want to use this macro.

Isn't there a place where a default definition of this can be
defined? Instead of adding it separately to every architecture.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH V4] POWER: perf_event: Skip updating kernel counters if register value shrinks

2011-04-27 Thread David Laight
I keep telling Eric that the code below is incorrect
modulo arithimetic...

 +static u64 check_and_compute_delta(u64 prev, u64 val)
 +{
 + u64 delta = (val - prev)  0xul;
 +
 + /*
 +  * POWER7 can roll back counter values, if the new value is
smaller
 +  * than the previous value it will cause the delta and the
counter to
 +  * have bogus values unless we rolled a counter over.  If a
coutner is
 +  * rolled back, it will be smaller, but within 256, which is the
maximum
 +  * number of events to rollback at once.  If we dectect a
rollback
 +  * return 0.  This can lead to a small lack of precision in the
 +  * counters.
 +  */
 + if (prev  val  (prev - val)  256)
 + delta = 0;
 +
 + return delta;

The code should detect rollback by looking at the value of 'delta'
otherwise there are horrid end effects near 2^32-1.

For instance:
u32 delta = val - prev;
return delta  0x8000 ? 0 : delta;


   David




___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH V4] POWER: perf_event: Skip updating kernel counters if register value shrinks

2011-04-27 Thread David Laight
 But it isn't, and it doesn't have trouble with 2^32 - 1.  

what about:
prev = 0x0001
val  = 0x

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH V4] POWER: perf_event: Skip updating kernel counters if register value shrinks

2011-04-27 Thread David Laight
 
 + if (prev  val  (prev - val)  256)
 + delta = 0;
 +
 + return delta;

Also, 'prev' is a true 64bit value, but 'val' is only ever 32bits.
So once the 64bit 'prev' exceeds 2^32+256 both 'prev  val'
and 'prev - val' are true regardless of the value of 'val'.
This will lead to jumps in the value 

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH V4] POWER: perf_event: Skip updating kernel counters if register value shrinks

2011-04-27 Thread David Laight
 
 On Wed, 27 Apr 2011, David Laight wrote:
 
   But it isn't, and it doesn't have trouble with 2^32 - 1.  
  
  what about:
  prev = 0x0001
  val  = 0x
 
 Result is 0xfffe and we are fine.

'delta' will be 0xfffe, but you need the function to
return zero - since the underlying counter has decremented
by 2. But 'prev  val' is false so the counter will
be increased by 2^32-2.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH V4] POWER: perf_event: Skip updating kernel counters if register value shrinks

2011-04-27 Thread David Laight
 
 prev and val are both 64 bit variables holding 32 bit numbers, we do
not
 accumulate in either, they are both replaced by values directly from
the
 registers.
 So prev  val will not always be true.

The code seems to be:
prev = local64_read(event-hw.prev_count);
val = read_pmc(event-hw.idx);
delta = check_and_compute_delta(prev, val);
local64_add(delta, event-count);
Which looks very much like 'prev' being a 64bit counter generated
from the 32bit pmc register.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] powerpc: align DTL buffer to AMS boundary

2011-04-13 Thread David Laight
 From: 
 linuxppc-dev-bounces+david.laight=aculab@lists.ozlabs.org 
 [mailto:linuxppc-dev-bounces+david.laight=aculab@lists.ozl
 abs.org] On Behalf Of Nishanth Aravamudan
 Sent: 13 April 2011 15:53
 To: Ben Herrenschmidt
 Cc: linuxppc-...@ozlabs.org; Paul Mackerras; Anton Blanchard
 Subject: [PATCH] powerpc: align DTL buffer to AMS boundary
 
 PAPR specifies that DTL buffers can not cross AMS environments (aka
CMO
 in the PAPR) and can not cross a memory entitlement granule boundary
 (4k)

How big is the buffer being allocated?
If it is much less than 4k then it might be worth allocating
a buffer of the correct size, and only if that crosses a 4k boundary
allocate the larger buffer.

Also, if the buffer is ever freed, the actual base address is needed
for the free.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH V3] POWER: perf_event: Skip updating kernel counters ifregister value shrinks

2011-04-08 Thread David Laight
 
 + u64 delta = 0;
...
 + if (((prev  0x8000)  !(val  0x8000)) || (val 
prev))
 + delta = (val - prev)  0xul;
 +
 + return delta;

The above is incorrect modulo arithmetic.

It is probably intended to do:
s32 delta = val - prev;
return delta  0 ? 0 : delta;
which will just ignore the fact that some counts are rolled back.

More accurate would be:
static u64 val64;
u32 val = read_perf_count();
s32 delta = val - val_64;
if (delta  0)
return;
val64 += delta;
Which will not double count for rolled back events.
(The low bits of val64 should always match the actual counter.)

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH v2 5/5] xhci: Remove recursive call to xhci_handle_event

2011-03-30 Thread David Laight
 
  -   xhci_handle_event(xhci);
  +   while (xhci_handle_event(xhci)) {}
  
 
 I must admit I dislike the style with empty loop bodies...

I would use an explicit 'continue;' for the body
of an otherwise empty loop.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


High load average (~2.0) on an idle PowerPC 64 machine

2011-03-25 Thread David Laight
 I've this Quad G5 machine that's sitting pretty much idle with the
 latest Debian stable installed, and yet it's got an abnormaly high
 load average.

The 'load average' value is rather useless since it seems to
contain any process that is sleeping uninterruptibly, and
IIRC any process that has run at all (for however short a
period) in the current schedule epoch (or whatever period
is relevant).

I can easily generate a linux system that is 99.9% idle
but has a 'load average' of 20 or more.
It would still be 99.9% idle even if the idle time
were based of the actual time outside the scheduler
idle loop (as NetBSD doe) rather than where timer
ticks interrupted.

We also have the related fubar (on x64/amd64) of the
kernel generating stack traces for processes that are
sleeping uninterruptibly for long periods.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH][v2] driver/FSL SATA:Fix wrong Device Error Register usage

2011-03-08 Thread David Laight
 
  Case when ffs return will never arise.This scenario is 
 already been discussed
  on linuxppc-dev@lists.ozlabs.org. Please see below explanation:
  sata_fsl_error_intr() is called during device error.The 
 mentioned scenario
  will never comes. It can be observed via code:-
 if (cereg) {   -- cereg is set on command error. 
 Means there is at least 1 device present.

This all relies on the hardware working properly, and
nothing else accidentally clearing the device 'error'
bit mask.

Being slightly defensive when the out of range value
causes an array subscript error seems sensible.

It isn't as though the code size or execution time
matters here.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH][v1] driver/FSL SATA:Fix wrong Device Error Register usage

2011-02-21 Thread David Laight
The patch changes the code to:

   if (ap-nr_pmp_links) {
 + int dev_num;
   dereg = ioread32(hcr_base + DE);
...
 + dev_num = ffs(dereg)-1;
 + if (dev_num  ap-nr_pmp_links) {
 + link = ap-pmp_link[dev_num];

On the off chance that no bits are set in 'DE' this
code depends on nr_pmp_links being unsigned (since
dev_num becomes -1.

Changing 'dev_num' to 'unsigned int' would be better.

David



___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH] driver/FSL SATA:Fix wrong Device Error Register usage

2011-02-18 Thread David Laight
 
 + if ((ffs(dereg)-1)  ap-nr_pmp_links) {
 + /* array start from 0 */
 + link = ap-pmp_link[ffs(dereg)-1];

I'd only call ffs() once - it could be a slow library function.
Any comment should note that ffs() returns 0 when no bits
are set - rather than anything about array indexes.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH V8 03/10] USB/ppc4xx: Add Synopsys DWC OTG Core InterfaceLayer

2011-02-14 Thread David Laight
 
  Sorry, I don't understand that. I think u32 is always 32bit 
  4byte on all archs. Right?
 
 Yes.
 
 Use an unsigned long if you want to hold a pointer correctly on all
 arches.

Although that is true for many systems (and probably all ppc Linux)
it isn't necessarily true (eg 64 bit Microsoft Windows).
C99 inttypes.h should define uintptr_t as an unsigned integer
type that is large enough to hold a (data) pointer.
I'm not sure if this is defined for the Linux kernel.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 1/2] misc: add CARMA DATA-FPGA Access Driver

2011-02-10 Thread David Laight
 
  I was completely unaware of that feature. I hunch that many
drivers
  are incapable of dealing with an unbind while they are still open.
 
 Hmm, maybe older drivers... Anythig hotpluggable (USB, PCI, etc)
should
 be in a better shape because they expect to be yanked at any time.

Whim and prayer ???

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 1/2] misc: add CARMA DATA-FPGA Access Driver

2011-02-09 Thread David Laight
 
 This driver allows userspace to access the data processing 
 FPGAs on the OVRO CARMA board. It has two modes of operation:
 
 1) random access
 
 This allows users to poke any DATA-FPGA registers by using mmap to map
 the address region directly into their memory map.

I needed something similar, but used pread() and pwrite()
to request the transfers.
While this does require a system call per transfer, it allows
the driver to use dma (if available) to speed up the request.
In my case doing single cycle transfers would be too slow.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH V8 03/10] USB/ppc4xx: Add Synopsys DWC OTG Core InterfaceLayer

2011-02-08 Thread David Laight
 
 [Marri] If u32 is 8bytes isn't pointer type would be 8bytes 
 as well.

If u32 is 8bytes 'char' would only be 4 bits!

(IIRC the X code has a fubar that requires a type with 32
in its name to be 64 bits!)

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 4/6] ftrace syscalls: Allow arch specific syscallsymbol matching

2011-02-02 Thread David Laight
 
 +#define arch_syscall_match_sym_name(sym, name) !strcmp(sym + 3, name
+ 3)

Whenever you use a #define macro arg, you should enclose it in ().
About the only time you don't need to is when it is being
passed as an argument to another function
(ie when it's use is also ',' separated).

So the above ought to be:
#define arch_syscall_match_sym_name(sym, name) (!strcmp((sym) + 3,
(name) + 3))

Whether an inline function is better or worse is much more subtle!
For instance I've used:
   asm volatile ( # line  STR(__LINE__) :: )
to stop gcc merging the tails of conditionals.
Useful when the conditional is at the end of a loop (etc),
it might increase code size slightly, but removes a branch.

If I put one of those in an 'inline' function separate copies
of the function end up sharing code.
With a #define __LINE__ differs so they don't.

(I had some code to get below 190 clocks, these changes
were significant!)

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 2/7] PowerPC: add unlikely() to BUG_ON()

2011-01-28 Thread David Laight
 
  +#define __BUG_ON(x) do {   \
  if (__builtin_constant_p(x)) {  \
  if (x)  \
  BUG();  \
  @@ -85,6 +86,8 @@
  }   \
} while (0)
 
  +#define BUG_ON(x) __BUG_ON(unlikely(x))
  +

From my experiments, adding an 'unlikely' at that point isn't
enough for non-trivial conditions - so its presence will
give a false sense the the optimisation is present!
In particular 'if (unlikely(x  y))' needs to be
'if (unlikely(x)  unlikely(y))' in order to avoid
mispredicted branches when 'x' is false.

Also, as (I think) in some of the generated code quoted,
use of __builtin_expect() with a boolean expression can
force some versions of gcc to generate the integer
value of the expression - rather than just selecting the
branch instructions that statically predict the
normal code path.

Sometimes I've also also had to add an asm() statement
that generates no code in order to actually force a
forwards branch (so it has something to place at the
target).

(I've been counting clocks )

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH v2] gianfar: Fall back to software tcp/udp checksum on oldercontrollers

2011-01-28 Thread David Laight
 
 + if (unlikely(gfar_has_errata(priv, GFAR_ERRATA_12)
 +   ((unsigned long)fcb % 0x20)  0x18)) {

You need to check the generated code, but I think you need:

if (unlikely(gfar_has_errata(priv, GFAR_ERRATA_12))
  unlikely(((unsigned long)fcb % 0x20)  0x18))

ie unlikely() around both the primitive comparisons.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: FSL DMA engine transfer to PCI memory

2011-01-26 Thread David Laight
 
 What was the ppc you used?

The 8315E PowerQUIICC II

 On 85xx/QorIQ-family chips such as P2020, there is no DMA controller
 inside the PCIe controller itself (or are you talking about bus
 mastering by the PCIe device[1]?  interface is a bit ambiguous),
 though it was considered part of the PCI controller on 82xx.
 
 The DMA engine and PCIe are both on OCeaN, so the traffic 
 does not need to pass through the e500 Coherency Module.
 My understanding -- for what it's worth, coming from a
 software person :-) -- is that you should
 be able to get large transfer chunks using the DMA engine.

It might be possible - but the ppc's pcie would need to know
the length of the dma (or at least be told that there was more
data to arrive) before even starting the pcie transfer.
I used 128 bytes per pcie transfer (which the altera slave
can handle) but that is longer than you want a burst on
the internal (CSB in my case) bus on the ppc.
It is also longer than a cache line - so the dma engine's
memory reads might induce a cache flush. 

 I suggest getting things working, and then seeing whether the
 performance is acceptable.

The only reason for using dma (instead of pio) is to get
long pcie transfers - otherwise it isn't really worth the
effort. Transfers are unlikely to take long enough to make
it worth taking an interrupt at the end of the dma.

My device driver implements read() and write() (and poll()
to wait for interrupts). So I do overlap the copy_to/from_user
with the next dma.

  The generic dma controller can't even generate 64bit
  cycles into the ppc's PCIe engine.
 
 Could you elaborate?

The pcie is (apparantly) a 64bit interface, to a single 32bit
transfer is actually a 64bit one with only 4 byte enables driven.

I couldn't see anything that would allow a CSB master to generate
two 32bit cycles (since it is a 32bit bus) that the pcie hardware
could convert into a single 64bit pcie transfer.
The fpga target is likely to have 32bit targets (it could have
64 bit ones, but if you've instantiated a NiosII cpu it wont!)
so you get a bus width adapter (which carefully does the cycle
with no byte enables driven) as well as the clock crossing bridge.
These both make the slave even slower than it would otherwise be!

IIRC We managed to get 2us for a read and 500ns for a write cycle.
The per byte costs are relatively small in comparison.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH V8 03/10] USB/ppc4xx: Add Synopsys DWC OTG CoreInterface Layer

2011-01-26 Thread David Laight
 
 Also in_le32/out_le32/in_be32/out_be32 are 
 architecture-specific AFAIK.

Isn't the whole patch architecture-specific ?

 I'd suggest using readl/writel for LE ops and
 __be32_to_cpu(__raw_readl(addr))/__raw_writel(__cpu_to_be32(b),addr)
 for BE ops.

Since the ppc doesn't have a byteswap instruction (and the LE
memory transfers might even be slow!) you really don't want to
be doing software byteswap.

Not to mention the horrific synchronistion instructions
that in_le32() and out_le32() actually contain.

I didn't find __raw_readl() when I was looking for asm
patterns that byteswapped memory transfers.

I did find st_le32() and ld_le32() in arch/powerpc/include/asm/swab.h
but that file is actually quite difficult to #include because
there is another swab.h that gets included instead.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: FSL DMA engine transfer to PCI memory

2011-01-25 Thread David Laight
 
 I'm trying to use FSL DMA engine to perform DMA transfer from
 memory buffer obtained by kmalloc() to PCI memory. This is on
 custom board based on P2020 running linux-2.6.35. The PCI
 device is Altera FPGA, connected directly to SoC PCI-E controller.

You'll need to use the dma engine that is part of the PCIe
interface in order to get large PCIe transfers.
I think everything else will still generate single 32bit PCIe
transfers - which are (if your measurements match mine)
exceptionally lethargic - the ISA bus is faster!

That does work provided you remember to give the dma controller
physical addresses and byteswap absolutely everything.
(Oh, and I didn't get single word transfers to work - they locked
the dma controller - not a problem since they are faster by PIO.)

Note that the PPC Linux (Linux in general??) doesn't have a
'virtual to physical' function that works for all addresses,
you'll need to remember the physical address of the PCIe slave
and use malloc'ed memory for the descriptors (on which
virt_to_phys() actually works).

I don't think there is a standard device driver for the PCIe dma,
I couldn't even find any header files that were vaugely relevent
except in the uboot sources.
I certainly wrote some code that just assumes it is on the right
hardware!

These are the relevant bits of code 

Global initialisation:
/* Enable the read/write dma controllers */
csb_ctrl =  in_le32(pex-pex_csb_ctrl);
csb_ctrl |= PEX_CSB_CTRL_WDMAE | PEX_CSB_CTRL_RDMAE;
out_le32(pex-pex_csb_ctrl, csb_ctrl);

/* We don't rely on the dma polling the descriptor, I have NFI
 * whether the default of 0 means 'never poll' or 'poll very
quickly'.
 * Set a large slow value for sanity. */
out_le32(pex-pex_dms_dstmr, ~0u);

Transfer setup:
/* We only support aligned writes - caller must verify */
dma_ctrl = PDMAD_CTRL_VALID;
dma_ctrl |= PDMAD_CTRL_SNOOP_CSB;
dma_ctrl |= PDMAD_CTRL_1ST_BYTES | PDMAD_CTRL_LAST_BYTES;
dma_ctrl |= PDMAD_CTRL_NEXT_VALID;
dma_ctrl |= len  (PDMAD_CTRL_LEN_SHIFT - 2);

/* Fill in DMA descriptor */
st_le32(desc-pdmad_ctrl, dma_ctrl);
/* We MUST clear the status - otherwise the xfer will be skipped */
st_le32(desc-pdmad_stat, 0);
st_le32(desc-pdmad_src_address, src_phys);
st_le32(desc-pdmad_dst_address, dst_phys);
st_le32(desc-pdmad_next_desc, 0);

/* Clear old status */
st_le32(pex_dma-pex_dma_stat, in_le32(pex_dma-pex_dma_stat));

/* Give descriptor address to dma engine */
st_le32(pex_dma-pex_dma_addr, virt_to_phys(desc));

/* Wait for all above memory cycles, then start xfer */
iosync();
st_le32(pex_dma-pex_dma_ctrl, PEX_DMA_CTRL_START |
PEX_DMA_CTRL_SNOOP);

Poll for completion:
/* Wait for transfer to complete/fail */
do {
desc_stat = ld_le32(desc-pdmad_stat);
} while (!(desc_stat  PDMAD_STAT_DONE));

status = ld_le32(pex_dma-pex_dma_stat);

if (status == (PEX_DMA_STAT_DSCPL | PEX_DMA_STAT_CHCPL)
 desc_stat == PDMAD_STAT_DONE)
/* Transfer ok */
return 0;
/* Transfer failed */

Oh, since I couldn't find it in the documentation, the first
word of the dma descriptor is 'ctrl' and the last 'next_desc'.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


FW: FSL DMA engine transfer to PCI memory

2011-01-25 Thread David Laight
 memcpy_toio() works fine, the data is written correctly. After
 DMA, the correct data appears at offsets 0xC, 0x1C, 0x2C, etc.
 of the destination buffer. I have 12 bytes of junk, 4 bytes of
 correct data, then again 12 bytes of junk and so on.

Does your Avalon MM slave decode the 4 byte enables?
The slave always sees 2 Avalon MM writes, one of which
will have no byte enables asserted.

(Actually it is a 64bit Avalon transfer - and a bus width adapter
gets inserted for you.)

Internal M9K memory blocks get this right...

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: FSL DMA engine transfer to PCI memory

2011-01-25 Thread David Laight
 
   custom board based on P2020 running linux-2.6.35. The PCI
   device is Altera FPGA, connected directly to SoC PCI-E
controller.

 
 This sounds like your FPGA doesn't handle burst mode accesses 
 correctly.
 A logic analyzer will help you prove it.

He is doing PCIe, not PCI.
A PCIe transfers is an HDLC packet pair, one containing the
request, the other the response.
In order to get any significant throughput the hdlc packet(s)
have to contain all the data (eg 128 bytes).
On the ppc we used that means you have to use the dma
controller inside the PCIe interface block.
The generic dma controller can't even generate 64bit
cycles into the ppc's PCIe engine.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: MPC831x (and others?) NAND erase performance improvements

2010-12-13 Thread David Laight
 
  An external IRQ line would let you limit interrupts to rising edges
  rather than all edges, though you'd lose the ability to 
  directly read the line status.
 
 oh, one cannot read the IRQ line? didn't know that. Also I not sure
 all Freescale CPUs can do rising edge.

I suspect that you may be able to leave the interupt masked, but still
read the 'interrupt pending' register. Which would have the same effect.

Our HW engineers tend to feed everything into an FPGA since it
gives than a lot more flexibility over pin connections.
In which case the invertor is trivial.
(and the fpga interface can read the status!)

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: MPC831x (and others?) NAND erase performance improvements

2010-12-07 Thread David Laight
 
 The problem cropped up when there was a lot of traffic to the NAND
 (Samsung K9WAGU08U1B-PIB0), with the NAND being on the LBC along with
 a video chip that needed constant and prompt attention.
 
 What I would see is that, as the writes happened, the erases would
 wind up batched and issued all at once, such that frequently 400-700
 erases were issued in rapid succession with a 1ms LBC BUSY cycle per
 erase.

Are those just the reads of the status register polling to
determine when the sector erase has completed ?
In which case a software delay beteen the reads might work.

Writes probably also have to be polled, but the individual
writes happen faster.

It is possible that an uncached read of another memory area
will stall the cpu long enough to allow another LBC master in.
One every few writes might be enough.
I had to do something similar on rather different hardware ...

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: Getting the IRQ number (Was: Basic driver devel questions ?)

2010-12-06 Thread David Laight
 

 That's the idea. Communication between usermode and the driver is
limited to 
 simple ioctl calls and the driver receives an interrupt when certain
data has 
 been placed in memory blocks by the hardware (like a DMA). 
 Then the user prog figures out where that latest data buffer is (mmap)
and saves it. 

I've used pread() and pwrite() to directly copy data to/from a devices
PCI memory space.
The driver just verifies the offset and calls copy_to/from_user() [1].
High offsets can be used to access driver specific data.
The userspace code for pread/write is less painful than using ioctl()
for everything.
This also lets you dump the device memory space with hexdump.

I also made the device interrupt generate POLLIN (this needs a bit of
care since the interrupt can arrive before poll() is called).

David

[1] I actually had to use the PCIe dma controller to get adequate
throughput.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: Getting the IRQ number (Was: Basic driver devel questions ?)

2010-12-06 Thread David Laight
 
 Another question: I just spent 10 minutes trying to find
 where struct device was defined.

Dirty trick #4:

At the top of a source file (before the first include) add:
struct device { int fubar; };
Then try to compile it.

The compiler will the tell where it is defined!

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [MPC52xx]Latency issue with DMA on FEC

2010-12-01 Thread David Laight
 
 A mb() is usually used if you do a write to device and read from it.
 With out it, the CPU could perform the read before the write, which
 would give you an incorrect result. There's no other way around that.

Possibly the synchronisation functions are doing significantly
more work than is required.

I was looking at the in_le32() and out_le32() functions for the
ppc e300 (and maybe others).

The out_le32() contains a 'sync' instruction - this may only
be needed after a series of writes (eg just before a command).

The iosync() function just adds a 'sync' and can be used as needed.

The in_le32() not only contains the unwanted 'sync', but also
a 'twi' (trap immediate - NFI exactly what this does) and 'isync'.
The 'isync' is particularly horrid and unnecessary (aborts
the instruction queue and refeches the opcode bytes).

The very slow in_le32() might be there to give semi-synchronous
traps on address fault - but unless the hardware is being probed
that really isn't necessary.

I did find st_le32() and ld_le32() in arch/powerpc/include/asm/swab.h
but had difficulty #including that version of swab.h!
#include ../arch/powerpc/include/asm/swab.h
worked - but isn't that nice.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: application needs fast access to physical memory

2010-11-18 Thread David Laight
 
 On Wed, 2010-11-17 at 16:03 -0600, steven@teradyne.com wrote:
  My application needs a fast way to access a specific physical DDR
  memory region. The application runs on an MPC8548 PowerPC which has
an
  MMU. I've tried two approaches that are typical for Linux, mmap()
and
  using a kernel module that implements read()/write() into this
region
  and I'm finding that performance is very slow for both. It's a
couple
  orders of magnitude slower than, for example, copying a large buffer
  from one place in the application's virtual memory to another place
in
  the application's virtual memory.
 
 The mmap() version should basically run at full speed, at least once
 you've faulted the address range in.
 
 This specific DDR region isn't specifically slow is it ? :)

Or being mapped uncached ?

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: Corrected data type mismatch

2010-11-16 Thread David Laight
 
  OOPS! It is wrong here, The right one should be as following:
  
  -   memcpy(mem, current-thread.evr[regno-32],
  +   memcpy(mem, (void*)current-thread.evr[regno-32],
dbg_reg_def[regno].size);

The (void *) cast should be unnecessary 

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: INFO: task snmpd:398 blocked for more than 120 seconds.

2010-11-08 Thread David Laight
 
 I can't make out what is causing this hang every now an then:
 
 INFO: task snmpd:398 blocked for more than 120 seconds.

My problem with that 'error' message is that there is no way
for a driver to disable it on a per-process basis.
We have some processes whose 'normal state' is to sleep
uninterruptibly in the kernel. Shutdown is handled by
an explicit request (not by sending a signal).
The processes could be kernel worker threads (except that
is is ~impossible to handle them exiting from a loadble
kernel module) so are actually children of a daemon sat
inside an ioctl() request that never terminates!

However, on the face of it, your case does look as though
the mutex is fubar'ed.

Might be worth (somehow) dumping the mutex state.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


PowerQUICC II PCIe dma

2010-11-01 Thread David Laight
I'm trying to get a ppc 831x to do PCIe dma transfers, PIO transfers
work ok but are somewhat lethargic (partially due to the slave).

I've done the following:
- All reads/writes (including the dma descriptors) are byteswapped.
- Enabled the dma in PEX_CSB_CTRL (value 0x3f)
- Used the kernel physical addresses for all addresses.
  For memory from virt_to_phys(), pci from pci_resource_start().
  (virt_to_phys() gives invalid values for the pci area).
- I'm setting both 'snoop' bits to 1 (as per the device errata)
  So the descriptor 'ctrl' has value (len  2)  12 | 0xff9,
  and the dma 'ctrl' 0x401.
- I've put a 'sync' instruction (iosync()) just before the
  write to start the dma.

On the first request after hard reset, the dma status register
REX_RDMA_STAT gets set to 4. On subsequent requests it stays zero.
Nothing else appears to happen at all (except the 'start' bit
always reads back zero).
The 'CSB status register' at 0x81c stays zero - so I don't think
it is stuck waiting for the PCIe slave.

I'm not at all sure of the word ordering for the dma descriptor!
Page 14-113 doesn't say! However I've written the values so that
it doesn't matter - once I get a status written back I'll know...

I've probably forgotten something very silly!
Anyone who has got this working any ideas ??

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


<    1   2   3   4   5   6