Re: MPC5121 CAN and USB

2009-10-19 Thread David Jander
On Friday 16 October 2009 01:10:05 am Wolfgang Denk wrote:
 Dear Paul,
 
 In message 26b052040910151603y8fc9b00g678d6a873083f...@mail.gmail.com you 
wrote:
   The ltib-mpc5121ads-20090602 branch reflects the exact state of the
   kernel contained in the LTIB with this name (dated July 2009, despite
   the name; based at 2.6.24.6, i. e. 7+ kernel versions behind).
 
  I have diff'ed this and it is very similar to the ltib in the BSP.
  The MBX patches may be missing though.  These patches can be obtained
  via the Freescale SDK for the OpenGL on the MPC5121e webpage.
 
 Who cares. This code is about 8 (!) kernel releases behind. Scrap
 it.

I would love to, but as it stands, this is still the best we can get for the 
MPC5121e :-(
I took that branch, merged the MBX patch in from the LTIB and ported the new 
NFC driver from the 'mpc512x' branch back to this one, since the original 
driver is buggy. 
The latest OpenGL-ES libraries and MBX drivers released are closed-source, 
buggy, but stable and seem to work well if your program doesn't do the kinds 
of things that make it crash. 
If you know the limitations of this and can live with it, it might be 
bearable... I don't have any other option :-(

Best regards,

-- 
David Jander
Protonic Holland.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Best hardware platform for native compiling...

2009-07-21 Thread David Jander

Hi all,

This might sound like a stupid question (and maybe slightly off-topic), but I 
have not found an (easy) answer and I suspect many on this list will have a 
good suggestion to make:

We are developing (and maintaining) different embedded linux systems based on 
different PowerPC processors. From small MPC852T with little RAM and Flash, 
up to 400MHz MPC5200- and MPC5121e based systems that resemble more a PC or 
netbook than an embedded system in terms of RAM and storage.

For smaller systems we use a customized ELDK-based OS and cross-compile almost 
everything on a PC.

For bigger systems we often run a debian-derived OS like Ubuntu, and many 
pieces are compiled natively on the target... just because it is easy and 
quick to do, and cross-compiling certain packages can be a real pain.
But, a 400 MHz e300 core is not really fast for compiling, so I have been 
considering buying some sort of PowerPC-based system with a faster processor, 
just as a build-server (a G5 would do wonders I guess).

It seems like the only real option is one of the smaller IBM Power servers, 
but that seems overkill to me. We also don't feel like buying some old 
second-hand Apple gear.

Is there any other available and affordable platform that can be used to run 
linux and compile software natively for 32-bit PowerPC?

Any suggestion is welcome!

Best regards,

P.S.: I am writing this while running dpkg-buildpackage -rfakeroot on the 
package xserver-xfbdev from Ubuntu 9.04 on an MPC5121e; it will take 40 
minutes ;-)

-- 
David Jander
Protonic Holland.


Re: Best hardware platform for native compiling...

2009-07-21 Thread David Jander
On Tuesday 21 July 2009 11:52:51 you wrote:
 On Tue, Jul 21, 2009 at 11:16:52AM +0200, David Jander wrote:
  For bigger systems we often run a debian-derived OS like Ubuntu, and many
  pieces are compiled natively on the target... just because it is easy and
  quick to do, and cross-compiling certain packages can be a real pain.
  But, a 400 MHz e300 core is not really fast for compiling, so I have been
  considering buying some sort of PowerPC-based system with a faster
  processor, just as a build-server (a G5 would do wonders I guess).
 
  It seems like the only real option is one of the smaller IBM Power
  servers, but that seems overkill to me. We also don't feel like buying
  some old second-hand Apple gear.
 
  Is there any other available and affordable platform that can be used to
  run linux and compile software natively for 32-bit PowerPC?

 Have a look at the YDL PowerStation:

 http://us.fixstars.com/products/powerstation/

 It is more or less a quad G5.

This looks great! Thanks a lot for the tip.
I still have to figure out how to get one of these delivered to Europe, but 
that shouldn't be such a big deal...

Best regards,

-- 
David Jander
Protonic Holland.


Re: Best hardware platform for native compiling...

2009-07-21 Thread David Jander
On Tuesday 21 July 2009 14:00:07 you wrote:
 On Tue, Jul 21, 2009 at 12:31:36PM +0200, David Jander wrote:
  On Tuesday 21 July 2009 11:52:51 you wrote:
   On Tue, Jul 21, 2009 at 11:16:52AM +0200, David Jander wrote:
For bigger systems we often run a debian-derived OS like Ubuntu, and
many pieces are compiled natively on the target... just because it is
easy and quick to do, and cross-compiling certain packages can be a
real pain. But, a 400 MHz e300 core is not really fast for compiling,
so I have been considering buying some sort of PowerPC-based system
with a faster processor, just as a build-server (a G5 would do
wonders I guess).
   
It seems like the only real option is one of the smaller IBM Power
servers, but that seems overkill to me. We also don't feel like
buying some old second-hand Apple gear.
   
Is there any other available and affordable platform that can be used
to run linux and compile software natively for 32-bit PowerPC?
  
   Have a look at the YDL PowerStation:
  
   http://us.fixstars.com/products/powerstation/
  
   It is more or less a quad G5.
 
  This looks great! Thanks a lot for the tip.
  I still have to figure out how to get one of these delivered to Europe,
  but that shouldn't be such a big deal...

 Well, I got one recently here in Spain. Shipping charges are fairly
 large (it's not exactly a light and compact machine). But the current
 dollar exchange rate helps ;-)

 Now I have not yet found the way to install Debian on it
 (it refuses to boot Debian's CDROM), but I have not had
 time to investigate either.

If nothing else helps, try (manually) installing debootstrap from ubuntu 
sources and start from there with debootstrap jaunty /mnt/partition ;-)

Best regards,

-- 
David Jander
Protonic Holland.


Re: [PATCH 02/12] fs_enet: Add MPC5121 FEC support.

2009-05-11 Thread David Jander
On Friday 08 May 2009 09:52:51 David Miller wrote:
 From: John Rigby jcri...@gmail.com
 Date: Thu, 7 May 2009 20:02:53 -0600

  Also don't forget that the register map is the same on 512x, mx and
  coldfire platforms but not on the other ppc platforms so if you want
  to one binary to rule them all you will need to have an offest table
  or some such.

 I would suggest using -read_reg() -write_reg() methods for abstracting
 this.  That's how we handle all of the different way ESP scsi chips
 have their registers wired up.

 I/O register reads take hundreds, if not thousands of CPU cycles so,
 relatively speaking, the indirection costs absolutely nothing.

I fear the memory-mapped I/O of the PowerPC SoC is *slightly* faster, so in 
terms of cycle count, this WILL matter, although depending on how much 
register-I/O the driver does, overall performance impact _may_ still be 
negligible. I suggest testing this (benchmarks) before and after the change.
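
For reference, the accessor indirection discussed above could look roughly like the sketch below. This is only an illustration in plain C, not code from fs_enet: the register names, the two offset tables and the helper fec_mask_all_irqs() are made up for the example, and the real register maps differ. It does show where the extra cost comes from (one indirect call plus one table lookup per register access), which is what a before/after benchmark would measure.

```c
#include <stddef.h>
#include <stdint.h>

/* Logical registers the driver cares about (illustrative names only). */
enum fec_reg { FEC_IEVENT, FEC_IMASK, FEC_ECNTRL, FEC_REG_COUNT };

/* Per-variant accessors; each variant knows its own register layout. */
struct fec_reg_ops {
	uint32_t (*read_reg)(volatile uint32_t *base, enum fec_reg reg);
	void (*write_reg)(volatile uint32_t *base, enum fec_reg reg, uint32_t val);
};

/* Hypothetical word offsets for two layouts; the real maps are different. */
static const size_t layout_a[FEC_REG_COUNT] = { 0, 1, 2 };
static const size_t layout_b[FEC_REG_COUNT] = { 1, 2, 9 };

static uint32_t read_a(volatile uint32_t *base, enum fec_reg reg)
{
	return base[layout_a[reg]];
}

static void write_a(volatile uint32_t *base, enum fec_reg reg, uint32_t val)
{
	base[layout_a[reg]] = val;
}

static uint32_t read_b(volatile uint32_t *base, enum fec_reg reg)
{
	return base[layout_b[reg]];
}

static void write_b(volatile uint32_t *base, enum fec_reg reg, uint32_t val)
{
	base[layout_b[reg]] = val;
}

const struct fec_reg_ops fec_ops_a = { read_a, write_a };
const struct fec_reg_ops fec_ops_b = { read_b, write_b };

/* Driver code is written once, against the ops table only. */
void fec_mask_all_irqs(const struct fec_reg_ops *ops, volatile uint32_t *base)
{
	ops->write_reg(base, FEC_IMASK, 0);
	/* Assumption for the example: events are cleared by writing ones. */
	ops->write_reg(base, FEC_IEVENT, 0xffffffffu);
}
```

The ops pointer would be chosen once at probe time, so a single binary can serve both layouts.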

Best regards,

-- 
David Jander
Protonic Holland.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: [PATCH 06/12] mpc5121: Added NAND Flash Controller driver.

2009-05-07 Thread David Jander
 of generic framework for 
providing special chip-select functions here but it just doesn't look 
clean like this oh well.

Best regards,

-- 
David Jander
Protonic Holland.


Re: [PATCH 02/12] fs_enet: Add MPC5121 FEC support.

2009-05-07 Thread David Jander
On Thursday 07 May 2009 00:29:59 Grant Likely wrote:
 On Wed, May 6, 2009 at 4:01 PM, Wolfgang Denk w...@denx.de wrote:
  Dear Grant,
 
  in message fa686aa40905061333q29c263c8p24856c048e30f...@mail.gmail.com
  you wrote:
 
  ...
 
   #ifdef CONFIG_FS_ENET_HAS_FEC
   +#ifdef CONFIG_FS_ENET_MPC5121_FEC
   +    {
   +        .compatible = "fsl,mpc5121-fec",
   +        .data = (void *)&fs_fec_ops,
   +    },
   +#else
      {
          .compatible = "fsl,pq1-fec-enet",
          .data = (void *)&fs_fec_ops,
      },
   #endif
   +#endif
 
  Hmmm.  A lot of these #ifdefs in here.  Does this have a multiplatform
  impact?  Not to mention the fact that it's just plain ugly.  :-)
 
  Agreed that it's ugly, but duplicating the code would have been even
  worse. I don't think that it has multiplatform - at least not as long
  as you don't ask for one image that runs on 83xx and on 512x.

 Actually, I *am* asking for one image that runs on 83xx, 52xx and
 512x.  I already can and do build and test a single image which boots
 on all my 52xx boards, on my 8349 board, and on my G4 Mac.

Cool! I also want that! We have different boards with 5200 and 5121e's and it 
would be terrific if one day we'd be able to use just one kernel for all of 
them!

(Sorry for being a me-too-er)

Best regards,

-- 
David Jander
Protonic Holland.


MPC5200 FEC driver crashing on 2.6.30-rc1....

2009-04-09 Thread David Jander

Hi all,

Seems like there are some NAPI changes not being (correctly) applied to the 
MPC5200 FEC driver...

[ 1319.265289] [ cut here ]
[ 1319.274699] kernel BUG at .../arch/powerpc/include/asm/dma-mapping.h:164!
[ 1319.297488] Oops: Exception in kernel mode, sig: 5 [#1]
[ 1319.308086] PREEMPT mpc5200-simple-platform
[ 1319.316572] Modules linked in:
[ 1319.322773] NIP: c01dbef8 LR: c01dbec8 CTR: 
[ 1319.332852] REGS: c3a9dc80 TRAP: 0700   Not tainted  
(2.6.29-07103-gd0b70e8-dirty)
[ 1319.348212] MSR: 00029032 EE,ME,CE,IR,DR  CR: 42008448  XER: 2000
[ 1319.361684] TASK = c387acb0[1876] 'ifconfig' THREAD: c3a9c000
[ 1319.372988] GPR00: 0001 c3a9dd30 c387acb0 c3a13820  c3a13820 
c3a13e11 c520
[ 1319.389982] GPR08: 0001  c5001000 0008 22008444 100adabc 
03fc1000 0001
[ 1319.406973] GPR16:   007fff00 03fbb790 c3a9de08 8914 
c388410c c386b400
[ 1319.423967] GPR24: c3884100 c394b640 c040 c386b5e8 c386b400 c5001000 
c39c3600 c39bb3c0
[ 1319.441359] NIP [c01dbef8] mpc52xx_fec_alloc_rx_buffers+0xac/0x1a8
[ 1319.453914] LR [c01dbec8] mpc52xx_fec_alloc_rx_buffers+0x7c/0x1a8
[ 1319.466276] Call Trace:
[ 1319.471250] [c3a9dd30] [c01dbeb0] mpc52xx_fec_alloc_rx_buffers+0x64/0x1a8 
(unreliable)
[ 1319.487345] [c3a9dd60] [c01dc444] mpc52xx_fec_open+0x128/0x2d0
[ 1319.499221] [c3a9dda0] [c02458f4] dev_open+0xc0/0x118
[ 1319.509493] [c3a9ddc0] [c0244678] dev_change_flags+0x84/0x1ac
[ 1319.521176] [c3a9dde0] [c028b1b8] devinet_ioctl+0x638/0x758
[ 1319.532501] [c3a9de50] [c028bca8] inet_ioctl+0xcc/0xf8
[ 1319.542956] [c3a9de60] [c02325a8] sock_ioctl+0x60/0x2e8
[ 1319.553576] [c3a9de80] [c0096e44] vfs_ioctl+0x40/0xc0
[ 1319.563840] [c3a9dea0] [c009728c] do_vfs_ioctl+0x3c8/0x748
[ 1319.574986] [c3a9df10] [c009764c] sys_ioctl+0x40/0x74
[ 1319.585284] [c3a9df40] [c0011878] ret_from_syscall+0x0/0x38
[ 1319.596606] --- Exception: c01 at 0xfe8b6a4
[ 1319.596622] LR = 0xff0573c
[ 1319.611264] Instruction dump:
[ 1319.617281] a13f0018 380005f2 817f0020 815f000c 7d2959d6 7c09512e 7fa95214 
80be0098
[ 1319.633037] 41920104 813c0264 7d200034 5400d97e 0f00 801a5218 3c854000 
7f63db78
[ 1319.649160] ---[ end trace a3057d5e6d98e2c6 ]---

Best regards,

-- 
David Jander
Protonic Holland.


Re: Issue with UART driver on MPC5121e Platform

2009-03-19 Thread David Jander
On Thursday 19 March 2009 05:23:07 Madhusudhan J wrote:
  Hi,

 this is Madhusudhan J from KPITCummins Bangalore India .
 I have issue with uart driver   in linux kernel 2.6.24
 (drivers\serial\Mpc52xx_uart.c)
 I am using platform  mpc5121e to communicating to modem through serial
 driver. i am not able to communicate with  UART
 driver(drivers\serial\Mpc52xx_uart.c).
 In this file i could not see 512x uart operations but UART opration are
 there for 52xx. could you please let me know how to make it work with
 uart driver on 5121e platform?.

Can you tell us more about which kernel you are using?

Please use the latest git head called ads5121 from denx.
I am using that one, and the UART works fine.
There are, however, a lot of tricky details in your device tree that have to
be correct in order for this driver to work.
Are you using your own hardware, or are you using the ADS5121?
What version of the MPC5121e processor do you have? M34K, 2M34K or M36P?

Regards,

-- 
David Jander
Protonic Holland.


Re: [RFC][PATCH v5] MPC5121 TLB errata workaround

2009-03-17 Thread David Jander
On Monday 16 March 2009 19:05:00 Kumar Gala wrote:
 On Mar 16, 2009, at 10:52 AM, David Jander wrote:
  Complete workaround for DTLB errata in e300c2/c3/c4 processors.
 
  Due to the bug, the hardware-implemented LRU algorithm always goes to
  way 1 of the TLB. This fix implements the proposed software workaround in
  form of a LRW table for choosing the TLB-way.
 
  Signed-off-by: Kumar Gala ga...@kernel.crashing.org
  Signed-off-by: David Jander da...@protonic.nl
 
  ---
  diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/
  head_32.S
  index 0f4fac5..3971ee4 100644
  --- a/arch/powerpc/kernel/head_32.S
  +++ b/arch/powerpc/kernel/head_32.S
  @@ -578,9 +578,21 @@ DataLoadTLBMiss:
  andcr1,r3,r1/* PP = user? (rwdirty? 2: 3): 0 */
  mtspr   SPRN_RPA,r1
  mfspr   r3,SPRN_DMISS
  +   mfspr   r2,SPRN_SRR1/* Need to restore CR0 */
  +   mtcrf   0x80,r2
  +#ifdef CONFIG_PPC_MPC512x
  +   li  r0,1
  +   mfspr   r1,SPRN_SPRG6
  +   rlwinm  r2,r3,17,27,31  /* Get Address bits 19:15 */

 Don't we want:
   rlwinm  r2,r3,20,27,31  /* Get address bits 15:19 */

Hmm, you are right, I must have accidentally counted LE bit-ordering ;-)
The only funny part is that I don't measure any performance difference after 
fixing this *sigh*.
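
To make the bit-numbering mix-up easier to see, here is a small C model of rlwinm. It is only a sketch for checking the arithmetic on a PC; the function names are mine. PowerPC numbers bits from the MSB (bit 0) to the LSB (bit 31), so address bits 15:19 sit 12 bits above the LSB and need a rotate of 20 to land in bits 27:31. A rotate of 17 picks up the five bits that would be "15 to 19" when counting from the LSB instead.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by sh bits (sh is taken modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned sh)
{
	sh &= 31;
	return sh ? (x << sh) | (x >> (32 - sh)) : x;
}

/*
 * Mask with ones from bit mb through bit me in PowerPC numbering
 * (bit 0 = MSB). When mb > me the mask wraps around.
 */
static uint32_t ppc_mask(unsigned mb, unsigned me)
{
	uint32_t from_mb = 0xffffffffu >> mb;
	uint32_t upto_me = 0xffffffffu << (31 - me);

	return (mb <= me) ? (from_mb & upto_me) : (from_mb | upto_me);
}

/* C model of: rlwinm rA,rS,sh,mb,me */
uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
	return rotl32(rs, sh) & ppc_mask(mb, me);
}

/* Address bits 15:19 (PowerPC numbering), computed directly. */
uint32_t ea_bits_15_19(uint32_t ea)
{
	return (ea >> 12) & 0x1fu;
}
```

With this model, rlwinm(ea, 20, 27, 31) matches ea_bits_15_19(ea) for any ea, while rlwinm(ea, 17, 27, 31) returns (ea >> 15) & 0x1f. The same helper also confirms that rlwinm rX,rX,0,15,13 clears only bit 14, which is SRR1[WAY].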

Regards,

-- 
David Jander
Protonic Holland.


Re: [RFC] [PATCH v2] MPC5121 TLB errata workaround

2009-03-16 Thread David Jander
On Friday 13 March 2009 16:23:15 Kumar Gala wrote:
[...]
  This errata impacts a number of cores and so we should make this a
  CPU
  feature fixup rather than #ifdef code.
 
  It should impact only MPC5121e and probably MPC5123, but according to
  Freescale no other processors that use this core...

 Not sure about that.. But the errata impacts all e300c2/c3/c4 parts.

Can someone please check if this is true? There should then be errata for all 
other parts that use one of these cores.

  Anyway, I'll try to investigate about how to write a CPU feature
  fixup,
  I've never done that before (If you could give me a hint?)

 I've posted a patch that should add the CPU feature support.  This is
 only compile tested.  You'll need to try it out on real HW :)

That's a problem right now: The only useable kernel for the MPC5121e is the 
one on the 'ads5121' head from denx, and that is version 2.6.24.6. Your patch 
(v3) does not apply to that kernel, so I would have to change a few things 
before I can actually try it out:

 #define CPU_FTR_NEED_DTLB_SW_LRU   ASM_CONST(0x0001)

In 2.6.24.6 this constant is used for something else. Would it be possible to 
pick another one, in order to make dual-kernel patches easier to maintain for 
now?

  +   mfspr   r3,SPRN_DMISS
  +   rlwinm  r3,r3,19,25,29 /* Get Address bits 19:15 */
  +   lis r2,l...@ha   /* Search index in lrw[] */
  +   addir2,r2,l...@l
  +   tophys(r2,r2)
  +   lwzxr1,r3,r2   /* Get item from lrw[] */
  +   cmpwi   0,r1,0 /* Was it way 0 last time? */
 
  Why not use a bit vector since we only need one bit of information.
  Additionally we can use a single SPRG at that point instead to keep
  track of the LRU information.
 
  Sounds interesting. I am just learning my first steps in powerpc-
  assembly, so
  please forgive if this is a little inefficient still. I'll try again
  next week.

 Not at all.  This has been on my todo list just not high priority so
 I'm happy to get someone to work on it and have setup already that can
 show perf differences.

On your Todo list? Does that mean you know about another processor that has 
the same problem?

 I might work up a newer version w/the SPRG idea if I'm feeling up to it.

Do you mean it is possible to just pick an SPRG that will be used only by this 
handler and make sure no other piece of software will touch it? Would be 
great. Is there a way of knowing which SPRG's are used by linux? I read in 
the e300 core-RM, that SPRG4...7 are unique to this iteration of the G2 
anyway, so one of those might be a good candidate?
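
To check that I understand the bit vector idea, here it is modelled in C. This is a sketch under my own assumptions: the struct stands in for whichever SPRG ends up being used, the set index is passed in already extracted from DMISS, and the names are invented. One bit per TLB set remembers whether way 1 was written last time. If it was, the handler clears SRR1[WAY] so the next reload goes to way 0. Otherwise it leaves the hardware's choice alone (way 1, given the erratum) and records that.

```c
#include <stdint.h>

#define SRR1_WAY 0x00020000u	/* SRR1 bit 14 in PowerPC bit numbering */

/* Stand-in for the SPRG: one "way 1 was written last" bit per TLB set. */
struct dtlb_lrw {
	uint32_t bits;
};

/*
 * Returns the SRR1 value to use for the TLB reload of the given set
 * and updates the table.
 */
uint32_t lrw_pick_way(struct dtlb_lrw *lrw, uint32_t set, uint32_t srr1)
{
	uint32_t bit = 1u << (set & 31);

	if (lrw->bits & bit) {
		/* Way 1 was written last time: force way 0 now. */
		srr1 &= ~SRR1_WAY;
		lrw->bits &= ~bit;
	} else {
		/* Keep the hardware's choice and remember it for next time. */
		lrw->bits |= bit;
	}
	return srr1;
}
```

Since there are only 32 sets to track, the whole table fits in one 32-bit register and no memory access is needed in the miss handler.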

I have quite a lot of work pressure right now, and unfortunately very little 
time I can dedicate to this. Given that I also cannot test patches for 
mainline, because MPC5121e support is just not complete enough yet, would you 
agree if I modify my own patch (with ifdefs instead of CPU_FTR...) to give 
you feedback on the performance impact, while you implement it as CPU_FTR 
afterwards for mainline? That way I can avoid doing double work and spend 
more time actually testing it.
If you agree, I'll start hacking away on the SPRG version immediately :-)

Best regards,

-- 
David Jander
Protonic Holland.


[RFC][PATCH v4] MPC5121 TLB errata workaround

2009-03-16 Thread David Jander
Complete workaround for DTLB errata in MPC5121e processors of die M36P and 
older (all currently existing versions).

Due to the bug, the hardware-implemented LRU algorithm always goes to way 1 of 
the TLB. This fix implements the proposed software workaround in form of a LRW 
table encoded in 32 bits of SPRG6 for choosing the TLB-way.

Signed-off-by: David Jander da...@protonic.nl

---
diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index 0f4fac5..6cc0cd3 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -540,9 +540,13 @@ DataLoadTLBMiss:
  * r2: ptr to linux-style pte
  * r3: scratch
  */
+   mfspr   r3,SPRN_DMISS
+#ifdef CONFIG_PPC_MPC512x
+   b  e300_read_tlb_fix /* Code for TLB-errata workaround doesn't fit here */
+e300_read_tlb_fix_ret:
+#endif
mfctr   r0
/* Get PTE (linux-style) and check access */
-   mfspr   r3,SPRN_DMISS
lis r1,page_off...@h/* check if kernel address */
cmplw   0,r1,r3
mfspr   r2,SPRN_SPRG3
@@ -612,9 +616,32 @@ DataStoreTLBMiss:
  * r2: ptr to linux-style pte
  * r3: scratch
  */
+   mfspr   r3,SPRN_DMISS
+#ifdef CONFIG_PPC_MPC512x
+/* MPC512x: workaround for errata in die M36P and earlier:
+ * Implement LRW for TLB way.
+ */
+   rlwinm  r0,r3,17,27,31 /* Get Address bits 19:15 */
+   li  r1,1
+   slw r0,r1,r0   /* Make bitmask */
+   mfspr   r2,SPRN_SPRG6  /* Get lrw table */
+   and.r1,r2,r0   /* Check entry in lrw */
+   beq-0,113f /* 0? Then goto 113: */
+   
+   mfspr   r1,SPRN_SRR1
+   rlwinm  r1,r1,0,15,13  /* Mask out SRR1[WAY] */
+   mtspr   SPRN_SRR1,r1
+   
+   andcr2,r2,r0
+   mtspr   SPRN_SPRG6,r2
+   b   114f
+113:
+   or  r2,r2,r0
+   mtspr   SPRN_SPRG6,r2
+114:
+#endif
mfctr   r0
/* Get PTE (linux-style) and check access */
-   mfspr   r3,SPRN_DMISS
lis r1,page_off...@h/* check if kernel address */
cmplw   0,r1,r3
mfspr   r2,SPRN_SPRG3
@@ -688,6 +715,29 @@ DataStoreTLBMiss:
.globl mol_trampoline
.set mol_trampoline, i0x2f00
 
+#ifdef CONFIG_PPC_MPC512x
+e300_read_tlb_fix:
+   rlwinm  r0,r3,17,27,31 /* Get Address bits 19:15 */
+   li  r1,1
+   slw r0,r1,r0   /* Make bitmask */
+   mfspr   r2,SPRN_SPRG6  /* Get lrw table */
+   and.r1,r2,r0   /* Check entry in lrw */
+   beq-0,113f /* 0? Then goto 113: */
+   
+   mfspr   r1,SPRN_SRR1
+   rlwinm  r1,r1,0,15,13  /* Mask out SRR1[WAY] */
+   mtspr   SPRN_SRR1,r1
+   
+   andcr2,r2,r0
+   mtspr   SPRN_SPRG6,r2
+   b   114f
+113:
+   or  r2,r2,r0
+   mtspr   SPRN_SPRG6,r2
+114:
+   b   e300_read_tlb_fix_ret
+#endif
+
. = 0x3000
 
 AltiVecUnavailable:


Re: [RFC][PATCH v4] MPC5121 TLB errata workaround

2009-03-16 Thread David Jander

Ooops, ok I think I just missed your proposal, Kumar.

Anyway, I'll post my benchmark results for this here:

1.- mplayer -nosound -benchmark testfile.mpeg (a DVD-mpeg2 file):

No fix at all:
VC: 30.5s VO: 53.4s Sys:1.95s Total: 85.8s

First fix (force writes to way 0):
VC: 24.3s VO: 40.6s Sys:1.95s Total: 66.9s

Second fix (implementing lrw):
VC: 23.1s VO: 31.5s Sys:1.03s Total: 55.6s

Third fix (patch v4, lrw in SPRG6):
VC: 21.055s VO: 28.289s Sys:0.972s Total: 50.316s



2.- prboom -timedemo doombench1 (where doombench1.lmp is prerecorded demo):

No fix at all: 14.1 fps
First fix (force writes to way 0): 16.7 fps
Second fix (implementing lrw): 18.1 fps
Third fix (patch v4, lrw in SPRG6): 19.9 fps



3.- Synthetic and pathological memcpy() benchmark:
No fix at all: 26 MByte/s
First fix (force writes to way 0): 160 MByte/s
Second fix (implementing lrw): 163 MByte/s
Third fix (patch v4, lrw in SPRG6): 180 MByte/s
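
To put the numbers above in relation to each other, the overall gain from "no fix" to patch v4 can be worked out with a few lines of C. The figures are copied from the results above; the struct and function names are only for this example. Time-based results count as before/after, rate-based results as after/before.

```c
#include <stddef.h>

struct bench {
	const char *name;
	double before;		/* no fix at all */
	double after;		/* patch v4, lrw in SPRG6 */
	int higher_is_better;
};

const struct bench results[] = {
	{ "mplayer total time (s)", 85.8, 50.316, 0 },
	{ "prboom timedemo (fps)",  14.1, 19.9,   1 },
	{ "memcpy (MByte/s)",       26.0, 180.0,  1 },
};
const size_t n_results = sizeof results / sizeof results[0];

/* Speedup factor, always expressed so that larger means faster. */
double speedup(const struct bench *b)
{
	return b->higher_is_better ? b->after / b->before
				   : b->before / b->after;
}
```

That works out to roughly 1.7x for mplayer, 1.4x for prboom and 6.9x for the memcpy test, which fits the memcpy case being the pathological one.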


Best regards,

-- 
David Jander
Protonic Holland.


Re: [RFC][PATCH v4] powerpc/mm: e300c2/c3/c4 TLB errata workaround

2009-03-16 Thread David Jander

Ok, I was analysing your code (which seems much more compact than mine):

On Monday 16 March 2009 14:02:18 Kumar Gala wrote:
[...]
 --- a/arch/powerpc/kernel/head_32.S
 +++ b/arch/powerpc/kernel/head_32.S
 @@ -587,9 +587,19 @@ DataLoadTLBMiss:
   ori r1,r1,0xe04 /* clear out reserved bits */
   andcr1,r0,r1/* PP = user? (rwdirty? 2: 3): 0 */
   mtspr   SPRN_RPA,r1
 + mfspr   r2,SPRN_SRR1/* Need to restore CR0 */
 + mtcrf   0x80,r2
 +BEGIN_MMU_FTR_SECTION
 + li  r0,1
 + lwz r1,sw_way_...@l(0)
 + rlwinm  r3,r3,19,25,29  /* Get Address bits 19:15 */

This should be 'rlwinm  r3,r3,17,27,31' now, since you address bits, not ints.
Note that you are trashing r3 (SPRN_DMISS) here!

 + slw r0,r0,r3
 + xor r1,r0,r1
 + srw r0,r1,r3
 + stw r1,sw_way_...@l(0)
 + rlwimi  r2,r0,31-14,14,14
 +END_MMU_FTR_SECTION_IFSET(MMU_FTR_NEED_DTLB_SW_LRU)
   tlbld   r3

And now you load r3 into the TLB; is this right? It doesn't seem right to me.

 - mfspr   r3,SPRN_SRR1/* Need to restore CR0 */
 - mtcrf   0x80,r3
   rfi
  DataAddressInvalid:
   mfspr   r3,SPRN_SRR1
 @@ -652,11 +662,25 @@ DataStoreTLBMiss:
   li  r1,0xe05/* clear out reserved bits  PP lsb */
   andcr1,r0,r1/* PP = user? 2: 0 */
   mtspr   SPRN_RPA,r1
 + mfspr   r2,SPRN_SRR1/* Need to restore CR0 */
 + mtcrf   0x80,r2
 +BEGIN_MMU_FTR_SECTION
 + li  r0,1
 + lwz r1,sw_way_...@l(0)
 + rlwinm  r3,r3,19,25,29  /* Get Address bits 19:15 */
 + slw r0,r0,r3
 + xor r1,r0,r1
 + srw r0,r1,r3
 + stw r1,sw_way_...@l(0)
 + rlwimi  r2,r0,31-14,14,14
 +END_MMU_FTR_SECTION_IFSET(MMU_FTR_NEED_DTLB_SW_LRU)
   tlbld   r3

Same thing here, r3 is trashed.

 - mfspr   r3,SPRN_SRR1/* Need to restore CR0 */
 - mtcrf   0x80,r3
   rfi

 + .balign L1_CACHE_BYTES
 +sw_way_lru:
 + .long 0
 +

Ok, I'll try to do it like this, but with the lru stored in SPRG6.

Best regards,

-- 
David Jander
Protonic Holland.


[RFC][PATCH v5] MPC5121 TLB errata workaround

2009-03-16 Thread David Jander
Complete workaround for DTLB errata in e300c2/c3/c4 processors.

Due to the bug, the hardware-implemented LRU algorithm always goes to way
1 of the TLB. This fix implements the proposed software workaround in
form of a LRW table for choosing the TLB-way.

Signed-off-by: Kumar Gala ga...@kernel.crashing.org
Signed-off-by: David Jander da...@protonic.nl

---
diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index 0f4fac5..3971ee4 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -578,9 +578,21 @@ DataLoadTLBMiss:
andcr1,r3,r1/* PP = user? (rwdirty? 2: 3): 0 */
mtspr   SPRN_RPA,r1
mfspr   r3,SPRN_DMISS
+   mfspr   r2,SPRN_SRR1/* Need to restore CR0 */
+   mtcrf   0x80,r2
+#ifdef CONFIG_PPC_MPC512x
+   li  r0,1
+   mfspr   r1,SPRN_SPRG6
+   rlwinm  r2,r3,17,27,31  /* Get Address bits 19:15 */
+   slw r0,r0,r2
+   xor r1,r0,r1
+   srw r0,r1,r2
+   mtspr   SPRN_SPRG6,r1
+   mfspr   r2,SPRN_SRR1
+   rlwimi  r2,r0,31-14,14,14
+   mtspr   SPRN_SRR1,r2
+#endif
tlbld   r3
-   mfspr   r3,SPRN_SRR1/* Need to restore CR0 */
-   mtcrf   0x80,r3
rfi
 DataAddressInvalid:
mfspr   r3,SPRN_SRR1
@@ -646,9 +658,21 @@ DataStoreTLBMiss:
andcr1,r3,r1/* PP = user? 2: 0 */
mtspr   SPRN_RPA,r1
mfspr   r3,SPRN_DMISS
+   mfspr   r2,SPRN_SRR1/* Need to restore CR0 */
+   mtcrf   0x80,r2
+#ifdef CONFIG_PPC_MPC512x
+   li  r0,1
+   mfspr   r1,SPRN_SPRG6
+   rlwinm  r2,r3,17,27,31  /* Get Address bits 19:15 */
+   slw r0,r0,r2
+   xor r1,r0,r1
+   srw r0,r1,r2
+   mtspr   SPRN_SPRG6,r1
+   mfspr   r2,SPRN_SRR1
+   rlwimi  r2,r0,31-14,14,14
+   mtspr   SPRN_SRR1,r2
+#endif
tlbld   r3
-   mfspr   r3,SPRN_SRR1/* Need to restore CR0 */
-   mtcrf   0x80,r3
rfi
 
 #ifndef CONFIG_ALTIVEC


Re: [RFC][PATCH v5] MPC5121 TLB errata workaround

2009-03-16 Thread David Jander

In this patch, I placed the LRW table in SPRG6 like before, but Kumar's code 
seems a little more compact, so I decided to use that one and fix it ;-)

It's a pity we seem to be one register short in the handler, so we need to 
load SPRN_SRR1 twice :-(

Although the code path now has 1 instruction less than my previous version 
most of the time (and 2 instructions more when the way is not adjusted), 
benchmark results are barely affected by this:

1.- mplayer: Total time: 50.392s (50.316s previous patch)

2.- prboom timedemo: 20.1 fps (19.9 fps previous patch)

3.- memcpy speed: 179 MByte/s (180 MByte/s previous patch)

Conclusion: difference not measurable between v4 and v5.
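
For anyone who prefers C to assembly, the branch-free variant in v5 boils down to the sketch below. It is a model for illustration only: the lrw word stands in for SPRG6, the set index is assumed to be already extracted from DMISS, and the names are mine. The set's bit is flipped on every miss and the new value is copied straight into SRR1[WAY], so the replaced way simply alternates.

```c
#include <stdint.h>

#define SRR1_WAY_SHIFT 17u	/* 31 - 14: SRR1[WAY] is bit 14, PowerPC numbering */

/*
 * Flip the bit for this TLB set and return SRR1 with WAY set to the
 * new bit value. All other SRR1 bits are left unchanged.
 */
uint32_t lrw_toggle_way(uint32_t *lrw, uint32_t set, uint32_t srr1)
{
	uint32_t idx = set & 31;
	uint32_t way;

	*lrw ^= 1u << idx;		/* slw + xor: flip this set's bit */
	way = (*lrw >> idx) & 1u;	/* srw: the new bit is the way to use */

	/* rlwimi: insert the way bit into SRR1[WAY] only. */
	return (srr1 & ~(1u << SRR1_WAY_SHIFT)) | (way << SRR1_WAY_SHIFT);
}
```

As long as the hardware always proposes way 1, this picks the same ways as the compare-and-branch version in v4, which would explain why the benchmarks do not move.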

Best regards,

-- 
David Jander
Protonic Holland.


[RFC] [PATCH v2] MPC5121 TLB errata workaround

2009-03-13 Thread David Jander
Complete workaround for DTLB errata in MPC5121e processors of die M36P and 
older (all currently existing versions).

Due to the bug, the hardware-implemented LRU algorithm always goes to way 1 of 
the TLB. This fix implements the proposed software workaround in form of a LRW 
table for choosing the TLB-way.

Signed-off-by: David Jander da...@protonic.nl

---
 arch/powerpc/kernel/head_32.S |   65 
 1 files changed, 65 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index 0f4fac5..a88b3aa 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -540,6 +540,10 @@ DataLoadTLBMiss:
  * r2: ptr to linux-style pte
  * r3: scratch
  */
+#ifdef CONFIG_PPC_MPC512x
+   b  TlbWo/* Code for TLB-errata workaround doesn't fit here */
+RFTlbWo:
+#endif
mfctr   r0
/* Get PTE (linux-style) and check access */
mfspr   r3,SPRN_DMISS
@@ -612,6 +616,31 @@ DataStoreTLBMiss:
  * r2: ptr to linux-style pte
  * r3: scratch
  */
+#ifdef CONFIG_PPC_MPC512x
+/* MPC512x: workaround for errata in die M36P and earlier:
+ * Implement LRW for TLB way.
+ */
+   mfspr   r3,SPRN_DMISS
+   rlwinm  r3,r3,19,25,29 /* Get Address bits 19:15 */
+   lis r2,l...@ha   /* Search index in lrw[] */
+   addir2,r2,l...@l
+   tophys(r2,r2)
+   lwzxr1,r3,r2   /* Get item from lrw[] */
+   cmpwi   0,r1,0 /* Was it way 0 last time? */
+   beq-0,113f /* Then goto 113: */
+
+   mfspr   r1,SPRN_SRR1
+   rlwinm  r1,r1,0,15,13  /* Mask out SRR1[WAY] */
+   mtspr   SPRN_SRR1,r1
+
+   li  r0,0
+   stwxr0,r3,r2   /* Make lrw[] entry 0 */
+   b   114f
+113:
+   li  r0,1
+   stwxr0,r3,r2   /* Make lrw[] entry 1 */
+114:
+#endif
mfctr   r0
/* Get PTE (linux-style) and check access */
mfspr   r3,SPRN_DMISS
@@ -688,6 +717,34 @@ DataStoreTLBMiss:
.globl mol_trampoline
.set mol_trampoline, i0x2f00

+#ifdef CONFIG_PPC_MPC512x
+TlbWo:
+/* MPC512x: workaround for errata in die M36P and earlier:
+ * Implement LRW for TLB way.
+ */
+   mfspr   r3,SPRN_DMISS
+   rlwinm  r3,r3,19,25,29 /* Get Address bits 19:15 */
+   lis r2,l...@ha   /* Search index in lrw[] */
+   addir2,r2,l...@l
+   tophys(r2,r2)
+   lwzxr1,r3,r2   /* Get item from lrw[] */
+   cmpwi   0,r1,0 /* Was it way 0 last time? */
+   beq-0,113f /* Then goto 113: */
+
+   mfspr   r1,SPRN_SRR1
+   rlwinm  r1,r1,0,15,13  /* Mask out SRR1[WAY] */
+   mtspr   SPRN_SRR1,r1
+
+   li  r0,0
+   stwxr0,r3,r2   /* Make lrw[] entry 0 */
+   b   114f
+113:
+   li  r0,1
+   stwxr0,r3,r2   /* Make lrw[] entry 1 */
+114:
+   b   RFTlbWo
+#endif
+
. = 0x3000

 AltiVecUnavailable:
@@ -1321,6 +1378,14 @@ intercept_table:
.long 0, 0, 0, 0, 0, 0, 0, 0
.long 0, 0, 0, 0, 0, 0, 0, 0
.long 0, 0, 0, 0, 0, 0, 0, 0
+
+#ifdef CONFIG_PPC_MPC512x
+lrw:
+   .long 0, 0, 0, 0, 0, 0, 0, 0
+   .long 0, 0, 0, 0, 0, 0, 0, 0
+   .long 0, 0, 0, 0, 0, 0, 0, 0
+   .long 0, 0, 0, 0, 0, 0, 0, 0
+#endif

 /* Room for two PTE pointers, usually the kernel and current user pointers
  * to their respective root page table.


Re: [RFC] [PATCH v2] MPC5121 TLB errata workaround

2009-03-13 Thread David Jander

Forgot to mention: The patch is based on denx git tree head 'ads5121', but 
it should apply without problem (some offset at most) to mainline.

Best regards,

-- 
David Jander
Protonic Holland.


Re: [RFC] [PATCH v2] MPC5121 TLB errata workaround

2009-03-13 Thread David Jander
On Friday 13 March 2009 14:22:22 Kumar Gala wrote:
 
 On Mar 13, 2009, at 5:26 AM, David Jander wrote:
 
 
  Forgot to mention: The patch is based on denx git tree head  
  'ads5121', but
  it should apply without problem (some offset at most) to mainline.
 
  Best regards,
 
 
 Out of interest did this version produce better performance on the  
 benchmarks than your v1 version?

Some examples:

1.- mplayer -nosound -benchmark testfile.mpeg (a DVD-mpeg2 file):

No fix at all:
VC: 30.5s VO: 53.4s Sys:1.95s Total: 85.8s

First fix (force writes to way 0):
VC: 24.3s VO: 40.6s Sys:1.95s Total: 66.9s

Complete fix (implementing lrw):
VC: 23.1s VO: 31.5s Sys:1.03s Total: 55.6s


2.- prboom -timedemo doombench1 (where doombench1.lmp is prerecorded demo):

No fix at all: 14.1 fps
First fix (force writes to way 0): 16.7 fps
Complete fix (implementing lrw): 18.1 fps


3.- Synthetic and pathological memcpy() benchmark:
No fix at all: 26 MByte/s
First fix (force writes to way 0): 160 MByte/s
Complete fix (implementing lrw): 163 MByte/s

Note that this benchmark shouldn't really show any difference between v1 
and v2, since v1 is almost the best possible fix for copies only.
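
The reason can be seen with a toy model of a single 2-way TLB set, written in C below. It is a deliberately simplified sketch and not a description of the real hardware: it assumes the source and destination pages map to the same set, that every miss under the erratum replaces way 1, and the page numbers and names are arbitrary. It counts misses for a copy loop that alternates one load and one store.

```c
#include <stdint.h>

enum policy { POLICY_ERRATUM, POLICY_STORE_WAY0, POLICY_LRW };

struct tlb_set {
	uint32_t page[2];
	int valid[2];
	int last_way;		/* only used by POLICY_LRW */
};

static int lookup(const struct tlb_set *s, uint32_t page)
{
	return (s->valid[0] && s->page[0] == page) ||
	       (s->valid[1] && s->page[1] == page);
}

static int pick_way(struct tlb_set *s, enum policy p, int is_store)
{
	switch (p) {
	case POLICY_STORE_WAY0:		/* v1: stores go to way 0 */
		return is_store ? 0 : 1;
	case POLICY_LRW:		/* v2 and later: alternate the ways */
		s->last_way ^= 1;
		return s->last_way;
	default:			/* erratum: always way 1 */
		return 1;
	}
}

/* Misses for n iterations of "load from src page, store to dst page". */
unsigned copy_misses(enum policy p, unsigned n)
{
	struct tlb_set s = { { 0, 0 }, { 0, 0 }, 0 };
	const uint32_t src = 0x100, dst = 0x200;	/* arbitrary pages */
	unsigned misses = 0, i;

	for (i = 0; i < 2 * n; i++) {
		int is_store = i & 1;
		uint32_t page = is_store ? dst : src;

		if (!lookup(&s, page)) {
			int way = pick_way(&s, p, is_store);

			s.page[way] = page;
			s.valid[way] = 1;
			misses++;
		}
	}
	return misses;
}
```

In this model the unfixed case misses on every access, while both v1 and the LRW scheme miss only twice, so a pure copy cannot tell them apart. The difference only shows up in workloads that are not a plain load/store alternation.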

Tell me if you know of some other interesting benchmarks to try.

Best regards,

-- 
David Jander
Protonic Holland.


Re: [RFC] [PATCH v2] MPC5121 TLB errata workaround

2009-03-13 Thread David Jander
On Friday 13 March 2009 14:21:57 Kumar Gala wrote:
 
 
 What does cat /proc/cpuinfo show on this board?

# cat /proc/cpuinfo
processor   : 0
cpu : e300c4
clock   : 400.00MHz
revision: 1.0 (pvr 8086 2010)
bogomips: 99.84
timebase: 5000
platform: MPC5121 Generic

  +#ifdef CONFIG_PPC_MPC512x
  +/* MPC512x: workaround for errata in die M36P and earlier:
  + * Implement LRW for TLB way.
  + */
 
 This errata impacts a number of cores and so we should make this a CPU  
 feature fixup rather than #ifdef code.

It should impact only MPC5121e and probably MPC5123, but according to
Freescale no other processors that use this core...
Anyway, I'll try to investigate how to write a CPU feature fixup;
I've never done that before (could you give me a hint?)

  +   mfspr   r3,SPRN_DMISS
  +   rlwinm  r3,r3,19,25,29 /* Get Address bits 19:15 */
  +   lis r2,l...@ha   /* Search index in lrw[] */
  +   addi    r2,r2,l...@l
  +   tophys(r2,r2)
  +   lwzx    r1,r3,r2   /* Get item from lrw[] */
  +   cmpwi   0,r1,0 /* Was it way 0 last time? */
 
 Why not use a bit vector since we only need one bit of information.   
 Additionally we can use a single SPRG at that point instead to keep  
 track of the LRU information.

Sounds interesting. I am just taking my first steps in powerpc-assembly, so
please forgive me if this is still a little inefficient. I'll try again next week.
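
For reference, the bit-vector idea could look roughly like this in C. This 
is only a sketch under assumptions: a 2-way TLB with 32 sets, and a set 
index taken from effective-address bits 15:19 in PowerPC (MSB-first) 
numbering, i.e. the five bits just above the 4K page offset. The names are 
hypothetical. In the real miss handler the vector would live in an SPRG, 
as suggested, rather than in a global variable.

```c
#include <assert.h>
#include <stdint.h>

#define TLB_SETS 32u

/* One bit per TLB set: the way that was written most recently. */
uint32_t lrw_bits;

/* ASSUMPTION: set index = EA bits 15:19 (MSB-first numbering). */
unsigned tlb_set_index(uint32_t ea)
{
    return (ea >> 12) & (TLB_SETS - 1u);
}

/* Pick the way to replace on this miss and record it as last written. */
unsigned lrw_next_way(uint32_t ea)
{
    unsigned set = tlb_set_index(ea);

    assert(set < TLB_SETS);            /* keeps the shift below in range */
    lrw_bits ^= (uint32_t)1 << set;    /* flip: the other way is used now */
    return (lrw_bits >> set) & 1u;
}
```

Starting from an all-zero vector, consecutive misses in the same set 
alternate between way 1 and way 0, which is all the information the 
handler needs, at a cost of 32 bits instead of a 32-word table.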

Greetings,

-- 
David Jander
Protonic Holland.


Proposal: [PATCH] Workaround for MPC5121 DTLB errata

2009-03-12 Thread David Jander
Partial workaround for DTLB errata in MPC5121e processors of die M36P and 
older (all currently existing versions).

Due to the bug, the hardware-implemented LRU algorithm always goes to way 1 of 
the TLB. This fix forces writes to go to way 0, which would speed up 
memory-copy operations where bits 15...19 of source and destination address 
are the same.

Signed-off-by: David Jander da...@protonic.nl

---
 arch/powerpc/kernel/head_32.S |8 
 1 files changed, 8 insertions(+), 0 deletions(-)

--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -614,6 +614,14 @@ DataStoreTLBMiss:
  */
mfctr   r0
/* Get PTE (linux-style) and check access */
+#ifdef CONFIG_PPC_MPC512x
+/* MPC512x: (partial) workaround for errata in die M36P and earlier:
+ * Force writes to Way 0 (reads are always way 1)
+ */
+   mfspr   r3,SPRN_SRR1
+   rlwinm  r3,r3,0,15,13  /* Mask out SRR1[WAY] */
+   mtspr   SPRN_SRR1,r3
+#endif
mfspr   r3,SPRN_DMISS
lis r1,page_off...@h/* check if kernel address */
cmplw   0,r1,r3
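
For readers less used to rlwinm, the masking step in the patch can be 
modelled in C as below. This is an illustrative sketch only; the helper 
names are made up, and bit numbers follow the PowerPC convention (bit 0 is 
the most significant bit).

```c
#include <assert.h>
#include <stdint.h>

/* Mask with bits mb..me set, in PowerPC numbering (bit 0 = MSB).
 * When mb > me the mask wraps around, as it does for rlwinm. */
uint32_t ppc_mask(unsigned mb, unsigned me)
{
    assert(mb < 32 && me < 32);
    uint32_t from_mb = 0xFFFFFFFFu >> mb;         /* bits mb..31 */
    uint32_t upto_me = 0xFFFFFFFFu << (31 - me);  /* bits 0..me  */
    return (mb <= me) ? (from_mb & upto_me) : (from_mb | upto_me);
}

/* Rotate a 32-bit word left by n bits. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* C model of: rlwinm ra,rs,sh,mb,me */
uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    return rotl32(rs, sh) & ppc_mask(mb, me);
}
```

With sh=0, mb=15, me=13 the mask wraps and covers every bit except bit 14, 
so the instruction is an AND with 0xFFFDFFFF. That is the "mask out 
SRR1[WAY]" step in the patch.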


Re: Proposal: [PATCH] Workaround for MPC5121 DTLB errata

2009-03-12 Thread David Jander

Please note: the proposed patch is actually incomplete, someone with better 
knowledge of PowerPC assembly than me should complete it.
According to the errata from Freescale, the proposed workaround should be a 
complete LRW (Least-Recently Written) implementation. AFAIK that would 
mean keeping an extra table in RAM with LRW information for each entry 
in the TLB.
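
To illustrate why the way selection matters so much for copies, here is a 
toy model in C. It is a sketch under stated assumptions, not a model of the 
real silicon: a 2-way, 32-set TLB indexed by the five address bits above 
the 4K page offset, and three replacement policies (the buggy "always way 
1", the partial fix "stores go to way 0", and a per-set LRW table). All 
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SETS 32u
#define WAYS 2u

enum policy { ALWAYS_WAY1, WRITES_WAY0, LRW_TABLE };

struct tlb {
    uint32_t page[SETS][WAYS];   /* cached page number per set and way */
    uint8_t  valid[SETS][WAYS];
    uint8_t  lrw[SETS];          /* way written last, per set */
};

/* Look up one access; on a miss, refill according to the policy.
 * Returns 1 on a miss and 0 on a hit. */
int tlb_access(struct tlb *t, enum policy p, uint32_t ea, int is_store)
{
    uint32_t page = ea >> 12;
    unsigned set = page & (SETS - 1u);
    unsigned way;

    assert(t != NULL);
    for (way = 0; way < WAYS; way++)
        if (t->valid[set][way] && t->page[set][way] == page)
            return 0;

    switch (p) {
    case ALWAYS_WAY1: way = 1; break;
    case WRITES_WAY0: way = is_store ? 0 : 1; break;
    default:          way = t->lrw[set] ^ 1u; break;
    }
    t->page[set][way] = page;
    t->valid[set][way] = 1;
    t->lrw[set] = (uint8_t)way;
    return 1;
}

/* Count misses for a copy loop that alternates one load and one store. */
unsigned copy_misses(enum policy p, uint32_t src, uint32_t dst, unsigned n)
{
    struct tlb t;
    unsigned misses = 0, i;

    memset(&t, 0, sizeof t);
    for (i = 0; i < n; i++) {
        misses += (unsigned)tlb_access(&t, p, src + 4u * i, 0);
        misses += (unsigned)tlb_access(&t, p, dst + 4u * i, 1);
    }
    return misses;
}
```

In this model, copying 100 words between two pages that fall into the same 
set misses on every single access with "always way 1", but only twice with 
either of the other two policies. The LRW table additionally covers 
patterns that are not a plain load/store alternation.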

Anyway, with this patch I am experiencing enormous speed-up overall. Some 
example tests I have done so far:

- 'mplayer -nosound -benchmark' shows a speedup of roughly 22 %

- 'prboom -timedemo test' (where 'test.lmp' is a prerecorded demo) shows an 
increase from 14.1 to 16.7 fps.

Synthetic memcpy() benchmarks may show a more drastic improvement (if they 
are hit by this bug):

Using 'minibench' from Gunnar Von Boehn, memcpy() speed goes up from 27Mbyte/s 
to 173Mbyte/s for memory-2-memory cases.

Greetings,

-- 
David Jander
Protonic Holland.


Re: MPC5121e, MBX driver, pvr.ko ...

2009-02-16 Thread David Jander
On Friday 06 February 2009 20:17:10 Wolfgang Denk wrote:
 Dear David,

 In message 200902060824.06851.david.jan...@protonic.nl you wrote:
  I decided to try out Application Note AN3793 from Freescale (3D Graphics
  on the ADS512101 Board Using OpenGL ES).
 
  I started trying to load the provided (binary!) kernel modules into our
  kernel, but I am getting errors inserting the modules using

 The binary kernel modules are a mess. Not  only  they  are  a  pretty
 clear  GPL license violation (and I wonder what Freescale is going to
 do to sort this out), but it effectively always locks you down to the

Sorry if this starts to get a little off-topic to this list...
IANAL, so I won't argue about a binary-driver being by definition a 
GPL-violation or not, or if those gray areas that Linus mentioned in the 
past, apply in this case.
Besides that, do you have another reason why this is a clear GPL-violation?

 specific LTIB kernel version (and probably even to  a  specific  DTS)
 they were built against. Open source? Forget it.

I never expected this driver to be Open-Source. I always supposed that we'd 
never be able to use the MBX because of this, and use the AXE instead (One of 
our applications needs some form of hardware accelerated image-scaling). But 
since I saw that application note, I couldn't resist trying it out, just to 
see how hard it is to actually use it. The point is made: Leaving aside the 
possible legal implications of the driver's existence, it is still an 
impossible job to get this working in a maintainable fashion.
What could Freescale possibly do about this?... beats me. I don't know about 
the legal implications (again, IANAL), but what if there was a driver like 
NVidia's video drivers (i.e. binary object with a re-compileable shell around 
it to adapt it to other kernels)? Otherwise, I guess Freescale can just as 
well stop making the MPC5121e and just make MPC5123's instead (which 
continues to be an awesome chip nevertheless) :-)

Best regards,

-- 
David Jander
Protonic Holland.


Re: MPC5121e, MBX driver, pvr.ko ...

2009-02-15 Thread David Jander
On Friday 06 February 2009 10:03:55 Klaus Pedersen wrote:
 Hi David

 I also run on a custom board, and am using the MBX.

 You need to get the device tree file right. You will see the MBX reserved
 the irq 66 in the boot printout.
 Instead of using insmod use modprobe. There are 2 versions of rc.pvr.

Thanks. I'll try that.
Two versions of rc.pvr? Why's that, and where is the other one?

 In an earlier thread about memcpy for G2/G3 cores, you mentioned that you will
 have a look at the init. of the dram controller and the prio-manager, did
 that give you anything??

Yes it did, and it's a long story that I haven't had permission to talk about, 
sorry. I believe I can now say that there should be a new erratum from 
Freescale explaining it all. As of now it's not on their web-site, but it 
should have been there since last week at least *grrr*.
Btw, it has nothing to do with the prio-manager or the DRAM controller 
whatsoever.

Best regards,

-- 
David Jander
Protonic Holland.


MPC5121e, MBX driver, pvr.ko ...

2009-02-05 Thread David Jander

Hi all,

I have a custom board with a MPC5121e (rev 1.5) on it. It runs the latest git 
kernel from the denx ads5121 head with our BSP mixed in.

I decided to try out Application Note AN3793 from Freescale (3D Graphics on 
the ADS512101 Board Using OpenGL ES).

I started trying to load the provided (binary!) kernel modules into our 
kernel, but I am getting errors inserting the modules using 
insmod: 'clcdc.ko' complains about not being able to register the device major 
number, and 'dbgdrv.ko' oopses with a BUG() in percpu_modfree()! This 
function should never be called in a non-SMP kernel, so I suspect there are 
some important differences between the kernel I have and the one the 
binary-only drivers were built against :-(

In another approach I managed to load the provided kernel binary (which is 
built for the ADS512101 evaluation board) on our platform, by tweaking our 
device-tree until it booted without crashing. In the end I was able to load 
all the modules and run the OpenGL-ES demo programs.
I can't believe this is the intended way of doing this, so I'd like to know if 
there is someone else who has managed to get the MBX running OpenGL-ES on a 
custom board with a custom build of the kernel.

Note: the kernel version number is still the same: 2.6.24.6, only difference 
AFAIK is some minor unrelated patches to drivers for other MPC5121 SoC 
devices, and probably some different configuration options. Apparently this 
is enough to break binary compatibility for the drivers :-(

Any hint is appreciated...

Best regards,

-- 
David Jander
Protonic Holland.


Re: Efficient memcpy()/memmove() for G2/G3 cores...

2008-09-04 Thread David Jander
On Thursday 04 September 2008 04:04:58 Paul Mackerras wrote:
 prodyut hazarika writes:
  glibc memxxx for powerpc are horribly inefficient. For optimal
  performance, we should use the dcbt instruction to establish the source
  address in cache, and dcbz to establish the destination address in cache.
  We should do dcbt and dcbz such that the touches happen a line ahead of
  the actual copy.
 
  The problem which I see is that dcbt and dcbz instructions don't work on
  non-cacheable memory (obviously!). But memxxx function are used for both
  cached and non-cached memory. Thus this optimized memcpy should be smart
  enough to figure out that both source and destination address fall in
  cacheable space, and only then
  use the optimized dcbt/dcbz instructions.

 I would be careful about adding overhead to memcpy.  I found that in
 the kernel, almost all calls to memcpy are for less than 128 bytes (1
 cache line on most 64-bit machines).  So, adding a lot of code to
 detect cacheability and do prefetching is just going to slow down the
 common case, which is short copies.  I don't have statistics for glibc
 but I wouldn't be surprised if most copies were short there also.

Then please explain the following. This is a memcpy() speed test for different 
sized blocks on a MPC5121e (DIU is turned on). The first case is glibc code 
without optimizations, and the second case is 16-register strides with 
dcbt/dcbz instructions, written in assembly language (see attachment)

$ ./memcpyspeed
Fully aligned:
10 chunks of 5 bytes   :  3.48 Mbyte/s ( throughput:  6.96 Mbytes/s)
5 chunks of 16 bytes   :  14.3 Mbyte/s ( throughput:  28.6 Mbytes/s)
1 chunks of 100 bytes  :  14.4 Mbyte/s ( throughput:  28.8 Mbytes/s)
5000 chunks of 256 bytes   :  14.4 Mbyte/s ( throughput:  28.7 Mbytes/s)
1000 chunks of 1000 bytes  :  14.4 Mbyte/s ( throughput:  28.7 Mbytes/s)
50 chunks of 16384 bytes   :  14.2 Mbyte/s ( throughput:  28.4 Mbytes/s)
1 chunks of 1048576 bytes  :  14.4 Mbyte/s ( throughput:  28.8 Mbytes/s)

$ LD_PRELOAD=./libmemcpye300dj.so ./memcpyspeed
Fully aligned:
10 chunks of 5 bytes   :  7.44 Mbyte/s ( throughput:  14.9 Mbytes/s)
5 chunks of 16 bytes   :  13.1 Mbyte/s ( throughput:  26.2 Mbytes/s)
1 chunks of 100 bytes  :  29.4 Mbyte/s ( throughput:  58.8 Mbytes/s)
5000 chunks of 256 bytes   :  90.2 Mbyte/s ( throughput:   180 Mbytes/s)
1000 chunks of 1000 bytes  :    77 Mbyte/s ( throughput:   154 Mbytes/s)
50 chunks of 16384 bytes   :  96.8 Mbyte/s ( throughput:   194 Mbytes/s)
1 chunks of 1048576 bytes  :  97.6 Mbyte/s ( throughput:   195 Mbytes/s)

(I have edited the output of this tool to fit into an e-mail without wrapping 
lines for readability).
Please tell me how on earth there can be such a big difference???
Note that on a MPC5200B this is TOTALLY different, and both processors have an 
e300 core (different versions of it though).

 The other thing that I have found is that code that is optimal for
 cache-cold copies is usually significantly slower than optimal for
 cache-hot copies, because the cache management instructions consume
 cycles and don't help in the cache-hot case.

 In other words, I don't think we should be tuning the glibc memcpy
 based on tests of how fast it copies multiple megabytes.

I don't just copy multiple megabytes! See above example. Also I do constant 
performance testing of different applications using LD_PRELOAD, to see the 
impact. Recently I even tried prboom (a free doom port), to remember the 
good old days of PC benchmarking ;-)
I have yet to come across a test that has lower performance with this 
optimization (on an MPC5121e that is).

 Still, for 6xx/e300 cores, we probably do want to use dcbt/dcbz for
 larger copies.  We don't want to use dcbt/dcbz on the larger 64-bit

At least for MPC5121e you really, really need it!!

 processors (POWER4/5/6) because the hardware prefetching and
 write-combining mean that dcbt/dcbz don't help and just slow things
 down.

That's explainable.
What's not explainable, are the results I am getting on the MPC5121e.
Please, could someone tell me what I am doing wrong? (I must be doing 
something wrong, I'm almost sure).
One thing that I realize is not quite right with memcpyspeed.c is the fact 
that it copies consecutive blocks of memory. That should have an impact on the 
5-byte and 16-byte copy results, I guess (a cache line for the following block 
may already be fetched), but no longer for blocks of 100 bytes and bigger (with 
32-byte cache lines). In fact, 16 bytes seems to be the only size where the 
additional overhead has some impact (which is negligible).
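
For clarity, this is roughly how the two columns of the output relate. It 
is a simplified sketch, not the actual memcpyspeed.c; the function names 
are made up, and it assumes 1 Mbyte = 2^20 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* ASSUMPTION: 1 Mbyte = 2^20 bytes. */
double mbytes_per_s(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / (1024.0 * 1024.0) / seconds;
}

/* A copy reads and writes every byte, so memory traffic is twice the
 * copy rate. */
double throughput_mbytes_per_s(double copy_rate)
{
    return 2.0 * copy_rate;
}

/* Copy `chunks` consecutive blocks of `size` bytes from src to dst and
 * return the elapsed CPU time in seconds. Both buffers must hold at
 * least chunks * size bytes. */
double time_copy(unsigned char *dst, const unsigned char *src,
                 size_t chunks, size_t size)
{
    clock_t t0 = clock();
    size_t i;

    for (i = 0; i < chunks; i++)
        memcpy(dst + i * size, src + i * size, size);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

The "consecutive blocks" caveat is visible in time_copy(): block i+1 starts 
right after block i, so for sizes below a cache line part of the next 
block has usually been fetched already.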

Another thing is that performance probably matters most to the end-user when 
applications need to copy big amounts of data (e.g. video frames or bitmap 
data), which is most probably done using big blocks of memcpy(), so 
eventually hurting performance for small copies probably has less weight on 
overall experience.

Best regards,

-- 
David Jander
/* Optimized

Re: Efficient memcpy()/memmove() for G2/G3 cores...

2008-09-04 Thread David Jander
On Thursday 04 September 2008 14:19:26 Josh Boyer wrote:
[...]
 $ ./memcpyspeed
 Fully aligned:
 10 chunks of 5 bytes   :  3.48 Mbyte/s ( throughput:  6.96 Mbytes/s)
 5 chunks of 16 bytes   :  14.3 Mbyte/s ( throughput:  28.6 Mbytes/s)
 1 chunks of 100 bytes  :  14.4 Mbyte/s ( throughput:  28.8 Mbytes/s)
 5000 chunks of 256 bytes   :  14.4 Mbyte/s ( throughput:  28.7 Mbytes/s)
 1000 chunks of 1000 bytes  :  14.4 Mbyte/s ( throughput:  28.7 Mbytes/s)
 50 chunks of 16384 bytes   :  14.2 Mbyte/s ( throughput:  28.4 Mbytes/s)
 1 chunks of 1048576 bytes  :  14.4 Mbyte/s ( throughput:  28.8 Mbytes/s)
 
 $ LD_PRELOAD=./libmemcpye300dj.so ./memcpyspeed
 Fully aligned:
 10 chunks of 5 bytes   :  7.44 Mbyte/s ( throughput:  14.9 Mbytes/s)
 5 chunks of 16 bytes   :  13.1 Mbyte/s ( throughput:  26.2 Mbytes/s)
 1 chunks of 100 bytes  :  29.4 Mbyte/s ( throughput:  58.8 Mbytes/s)
 5000 chunks of 256 bytes   :  90.2 Mbyte/s ( throughput:   180 Mbytes/s)
 1000 chunks of 1000 bytes  :    77 Mbyte/s ( throughput:   154 Mbytes/s)
 50 chunks of 16384 bytes   :  96.8 Mbyte/s ( throughput:   194 Mbytes/s)
 1 chunks of 1048576 bytes  :  97.6 Mbyte/s ( throughput:   195 Mbytes/s)
 
 (I have edited the output of this tool to fit into an e-mail without
  wrapping lines for readability).
 Please tell me how on earth there can be such a big difference???
 Note that on a MPC5200B this is TOTALLY different, and both processors
  have an e300 core (different versions of it though).

 How can there be such a big difference in throughput?  Well, your algorithm
 seems better optimized than the glibc one for your testcase :).

Yes, I admit my testcase is focussing on optimizing memcpy() of uncached data, 
and that interest stems from the fact that I was testing X11 performance 
(using xorg kdrive and xorg-server), and wondering why this processor wasn't 
able to get more FPS when moving frames on screen or scrolling, when in 
theory the on-board RAM should have bandwidth enough to get a smooth image.
What I mean is that I have a hard time believing that this processor core is 
so dependent on tweaks in order to get some decent memory throughput. The 
MPC5200B does get higher throughput with much less effort, and the two cores 
should be fairly identical (besides the MPC5200B having less cache memory and 
some other details).

[...]
 I don't think you're doing anything wrong exactly.  But it seems that
 your testcase sits there and just copies data with memcpy in varying
 sizes and amounts.  That's not exactly a real-world usecase is it?

No, of course it's not. I made this program to test the performance difference 
of different tweaks quickly. Once I found something that worked, I started 
LD_PRELOADing it to different other programs (among others the kdrive 
Xserver, mplayer, and x11perf) to see its impact on performance of some 
real-life apps. There the difference in performance is not so impressive of 
course, but it is still there (almost always either noticeably in favor of 
the tweaked version of memcpy(), or with a negligible or no difference).

I have not studied the different application's uses of memcpy(), and only done 
empirical tests so far.

 I think what Paul was saying is that during the course of runtime for a
 normal program (the kernel or userspace), most memcpy operations will be of
 a small order of magnitude.  They will also be scattered among code that
 does _other_ stuff than just memcpy.  So he's concerned about the overhead
 of an implementation that sets up the cache to do a single 32 byte memcpy.

I understand. I also have this concern, especially for other processors, such as 
the MPC5200B, where there doesn't seem to be so much to gain anyway.

 Of course, I could be totally wrong.  I haven't had my coffee yet this
 morning after all.

You're doing quite good regardless of your lack of caffeine ;-)

Greetings,

-- 
David Jander


Re: Efficient memcpy()/memmove() for G2/G3 cores...

2008-09-04 Thread David Jander

Hi Steven,

On Thursday 04 September 2008 16:31:13 Steven Munroe wrote:
[...]
  Yes, I admit my testcase is focussing on optimizing memcpy() of uncached
  data, and that interest stems from the fact that I was testing X11
  performance (using xorg kdrive and xorg-server), and wondering why this
  processor wasn't able to get more FPS when moving frames on screen or
  scrolling, when in theory the on-board RAM should have bandwidth enough
  to get a smooth image. What I mean is that I have a hard time believing
  that this processor core is so dependent on tweaks in order to get some
  decent memory throughput. The MPC5200B does get higher throughput with
  much less effort, and the two cores should be fairly identical (besides
  the MPC5200B having less cache memory and some other details).

 I have personally optimized memcpy for power4/5/6 and they are all
 different. There are dozens of different PPC implementations from
 different manufacturers and design, every one is different! With painful
 negotiation I was able to get the --with-cpu= framework added to glibc
 but not all distro use it. You can thank me later ...

Well, thank you ;-)

 MPC5200B? never heard of it, don't care. I am busy with power7.

Ok, keep up your work with power7, it's great you care about that one ;-)

 So don't assume we are stupid because we have not dropped everything to
 optimize memcpy for YOUR processor and YOUR specific case.

Ho! I never, ever assumed that anyone (on this list) is stupid. I think you 
got me totally wrong (and _that_ may be my fault). I was asking for other 
users' experience. You make it appear as if I was complaining about your 
optimizations for Power4/5/6/970/Cell, but in fact, if you read correctly I 
haven't even touched them... they are useless to me, since this is an e300 
core. My comparisons are all against vanilla glibc _without_ any optimized 
code... that is (most probably) simple loops with char copy, or at most 
32-bit word copies. What I want to know is why this processor (MPC5121e, not 
the MPC5200B) is so terribly inefficient at this without optimizations and if 
someone has done something about it before me (I am doing it right now). I 
have never stated that specifically _you_ did a bad job or something, so why 
are you reacting like that??
In fact, your framework for specific optimizations in glibc will most probably 
come in VERY handy, once I have sorted out the root of the problem with my 
specific case so thanks a lot for your valuable work... yes, I mean it.

 You care, you are a programmer? Write code! If you care about the
 community then fit your optimization into the framework provided for CPU
 specific optimization and submit it so others can benefit.

I _am_ writing code, and Gunnar is helping me find an explanation for the 
bizarre behaviour of this particular chip. If the result is usable by 
others, I _will_ fit it into your framework for optimizations.

  [...]
   I don't think you're doing anything wrong exactly.  But it seems that
   your testcase sits there and just copies data with memcpy in varying
   sizes and amounts.  That's not exactly a real-world usecase is it?
 
  No, of course it's not. I made this program to test the performance
  difference of different tweaks quickly. Once I found something that
  worked, I started LD_PRELOADing it to different other programs (among
  others the kdrive Xserver, mplayer, and x11perf) to see its impact on
  performance of some real-life apps. There the difference in performance
  is not so impressive of course, but it is still there (almost always
  either noticeably in favor of the tweaked version of memcpy(), or with a
  negligible or no difference).

 The trick is that the code built into glibc has to be optimal for the
 average case (4-256, average 12 bytes). Actually most memcpy
 implementations are a series of special cases for length and alignment.
 
 You can always do better if you know exactly what processor you are on
 and what specific sizes and alignment your application uses.

Yes, I know that's a problem. Thanks for the information for average size, I 
don't know where it comes from, but I'll take your word.

I am trying to be as polite and friendly as I can, so if you think I am not, 
please tell me where and when... I'll try to improve my social skills for the 
next time ;-)

Greetings,

-- 
David Jander


Re: Efficient memcpy()/memmove() for G2/G3 cores...

2008-09-04 Thread David Jander
On Thursday 04 September 2008 17:01:21 Gunnar Von Boehn wrote:
[...]
 Regarding the 5121.
 David, you did create a very special memcopy for the 5121e CPU.
 Your test showed us that the normal glibc memcopy is about 10 times
 slower than expected on the 5121.

 I really wonder why this is the case.
 I would have expected the 5121 to perform just like the 5200B.
 What we saw is that switching from READ to WRITE and back is very
 costly on 5121.

 There seems to be a huge difference between the 5200 and its successor the
 5121. Is this performance difference caused by the CPU or by the board
 /memory?

I have some new insight now, and I will look more closely at the working of  
the DRAM controller... there has to be something wrong somewhere, and I am 
going to find it... whether it is some strange bug in my u-boot code 
(initializing the DRAM controller and prio-manager for example) or a 
silicon-errata (John?)

Thanks a lot for your help so far.

-- 
David Jander


Re: Efficient memcpy()/memmove() for G2/G3 cores...

2008-09-02 Thread David Jander
On Monday 01 September 2008 11:36:15 Joakim Tjernlund wrote:
[...]
  Then I started my test program with LD_PRELOAD=...
 
  My test program only copies big chunks of aligned memory, so it will only
  test for maximum throughput (such as copying video frames). I will make a
  better one, to measure throughput on different sized blocks of aligned
  and unaligned memory, but first I want to find out why I can't seem to
  get even close to the expected RAM bandwidth (bursts occur at 1.6
  Gbyte/s, sustained transfers might be able to reach 400 Mbyte/s in
  theory, taking into account the video controller eating almost half of
  it, I'd like to get somewhere close to 200).
 
  The result is quite a bit better than that of glibc-2.7 (13.2 Mbyte/s --
  22 Mbyte/s), but still far from the 71.5 Mbyte/s achieved when using
  bigger strides of 16 registers load/store at a time.
  Note that this is copy performance; one-way throughput should be double
  these figures.

 Yeah, the code is trying to do a reasonable job without knowing what
 micro arch it is running on. These could probably go to glibc
 as new general purpose memxxx() routines. You will probably see
 a big increase once dcbz is added to the copy/memset functions.

 Fire away :)

Ok here I go:

I have made some astonishing discoveries, and I'd like to post the used 
source-code somewhere in the meantime, any suggestions? To this list?

There seem to be some substantial differences between the e300 core used in 
the MPC5200B and in the MPC5121e (besides the MPC5121 having double the 
amount of cache). Memcpy()-performance wise, these differences amount to the 
following. The tests done are with vanilla glibc (version 2.6.1 and 2.7 
without any powerpc specific memcpy() optimizations), Gunnar von Boehns 
memcpy_e300 and my tweaked version, memcpy_e300_dj which basically uses 
16-register strides instead of 4-register strides in Gunnar's example.

memcpy() peak-performance (RAM memory throughput) on:

MPC5200B, glibc-2.6, no optimizations: 136 Mbyte/s
MPC5121e, glibc-2.7, no optimizations:  30 Mbyte/s

MPC5200B, memcpy_e300: 225 Mbyte/s
MPC5121e, memcpy_e300: 130 Mbyte/s

MPC5200B, memcpy_e300_dj: 200 Mbyte/s
MPC5121e, memcpy_e300_dj: 202 Mbyte/s

For the MPC5121e, 16-register strides seem to be optimal, whereas for the 
MPC5200B, 4-register strides give the best performance. Also, plain C memcpy() 
performance on MPC5121e is terribly poor! Does anyone know why? I don't quite 
seem to understand those results.

Some information on the test hardware:

MPC5200B-based board has 64 Mbyte DDR-SDRAM, 32-bit wide (two x16 chips), 
running ubuntu 7.10 with kernel 2.6.19.2.

MPC5121e-based board has 256 Mbyte DDR2-SDRAM, 32-bit wide (two x16 chips), 
running ubuntu 8.04.1 with kernel 2.6.24.5 from Freescale LTIB with the DIU 
turned OFF. When the DIU is turned on, maximum throughput drops from 202 to 
196 Mbyte/s.

memcpy_e300 variants basically use 4 or 16-register load/store strides, cache 
alignment and dcbz/dcbt cache-manipulation instructions to tweak performance.
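
The general shape of such a routine, written in C for readability, is 
sketched below. It is not the actual memcpy_e300 code (that is hand-written 
assembly). It assumes a 32-byte cache line, and the E300_LINE_OPS macro is a 
made-up opt-in switch: unless it is defined on a PowerPC build, the cache 
operations compile to nothing and the function is a plain line-by-line copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE 32u   /* ASSUMPTION: 32-byte cache lines, as on the e300 */

/* Hint: start fetching the source line at p into the cache (dcbt). */
static inline void touch_line(const void *p)
{
#if defined(__powerpc__) && defined(E300_LINE_OPS)
    __asm__ volatile("dcbt 0,%0" : : "r"(p));
#else
    (void)p;
#endif
}

/* Claim the destination line at p without reading it from RAM (dcbz).
 * Only valid on cacheable memory; p must be line-aligned and the whole
 * line must be overwritten afterwards. */
static inline void zero_line(void *p)
{
#if defined(__powerpc__) && defined(E300_LINE_OPS)
    __asm__ volatile("dcbz 0,%0" : : "r"(p) : "memory");
#else
    (void)p;
#endif
}

/* Copy n bytes; src and dst must not overlap. */
void *copy_by_lines(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    /* bytes needed to bring the destination up to a line boundary */
    size_t head = (size_t)(-(uintptr_t)d & (LINE - 1u));

    assert(dst != NULL && src != NULL);
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head; s += head; n -= head;

    while (n >= LINE) {
        if (n >= 2u * LINE)
            touch_line(s + LINE);   /* one line ahead of the copy */
        zero_line(d);
        memcpy(d, s, LINE);         /* the register-stride part in asm */
        d += LINE; s += LINE; n -= LINE;
    }
    memcpy(d, s, n);                /* tail shorter than one line */
    return dst;
}
```

The inner memcpy() of one line is where the assembly versions differ: they 
load 4 or 16 registers and then store them, which C cannot express directly. 
Since this sketch calls memcpy() itself, it is not a drop-in replacement 
that could be preloaded under that name.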

I have not tried interleaving integer and fpu instructions.

Does anybody have any suggestion about where to start searching for an 
explanation of these results? I have the impression that there is something 
wrong with my setup, or with the e300c4-core, or both, but what?

Greetings,

-- 
David Jander


Re: Efficient memcpy()/memmove() for G2/G3 cores...

2008-09-01 Thread David Jander
On Sunday 31 August 2008 10:28:43 Benjamin Herrenschmidt wrote:
    It would be useful if somebody interested in getting things things
    into glibc did the necessary FSF copyright assignment stuff and
    worked toward integrating them.
  
   Ben makes a very good point!
 
  Sounds reasonable... but I am still wondering about what you mean
  with things?

 Typo. I meant these things, that is, variants of various libc
 functions optimized for a given processor type.

Ok, we'd have to _make_ those things first then ;-)

  AFAICS there is almost nothing there (besides the memcpy() routine from
  Gunnar von Boehn, which is apparently still far from optimal). And I was
  asking for someone to correct me here ;-)

 No idea, as we said, it's mostly up to users of the processors (or to a
 certain extent, manufacturers, hint hint hint) to do that work.

Ok, I get the point.

   There is also a framework for adding and maintaining optimizations of
   this type:
  
   http://penguinppc.org/dev/glibc/glibc-powerpc-cpu-addon.html
 
  I had already stumbled across this one, but it seems to focus on G3 or
  newer processors (power4). There is no optimal memcpy() for
  G2/PPC603/e300.

 It focuses on what the people doing it have access to, are paid to work
 on, or other material constraints. It's up to others from the community
 to fill the gaps.

That's all I need to know ;-)

  [...]
   So it does no good to complain here. If you have code you want to
   contribute, get your FSF CR assignment and join #glibc on freenode IRC.
 
  I am not complaining. I was only wondering if it is just me or there
  really is very little that has been done (for either uClibc, glibc, or
  whatever for powerpc) to improve performance of (linux-) applications on
  lower-power platforms (G2 core), AFAICS there is a LOT that can be
  gained by simple tweaks.

 Well, possibly, then you are welcome to work on those tweaks and if they
 indeed improve things, submit patches to glibc :-) I'm sure Steve and
 Ryan will be happy to help with the submission process.

Sounds encouraging. I'll try my best (in the limited amount of time I have).

[...]
 You don't have to do it all at once. A  simple tweak of one function
 such as memcpy, if it's measurably improving performances without
 notable regressions could be a first step, and then tweak after tweak...

 It's a common mistake to try to do too much out of tree and then
 struggle and give up when it's time to merge that stuff because there
 are too many areas that won't necessarily be acceptable as is.

 One little bit at a time is generally a better approach.

Ok, I take your advice.

  OTOH, maybe it is easier and simpler to start with a collection of
  functions in a shared-library, that may be suited for preloading via
  LD_PRELOAD or /etc/ld_preload...
 
  Maybe once this collection is more stable (in terms of that heavy
  tweaking has stopped) one could try the pilgrimage towards glibc
  inclusion

 I believe that's the wrong approach as it leads to never-merged out-of
 tree code.

Hmm... you mean, it'll be easier to keep patching (improving) things once they 
are already in glibc? Interesting.

Thanks a lot for your comments.

Best regards,

-- 
David Jander


Re: Efficient memcpy()/memmove() for G2/G3 cores...

2008-09-01 Thread David Jander
On Friday 29 August 2008 14:20:33 Joakim Tjernlund wrote:
[...]
  The problem is: I have very little experience with powerpc assembly and
  only very limited time to dedicate to this and I am looking for others
  who have

 I improved the PowerPC memcpy and friends in uClibc a while ago. It does
 basically the same as the kernel memcpy but without any cache
 instructions. It is written in C, but in such a way that
 optimal assembly is generated.

Hmm, isn't that going to break on a different version of gcc?
I just copied the latest version of trunk/uClibc/libc/string/powerpc/memcpy.c 
from subversion as uclibc-memcpy.c, removed the last line and did this:

$ gcc -shared -O2 -Wall -o libucmemcpy.so uclibc-memcpy.c

(should I use other compiler options?)

Then I started my test program with LD_PRELOAD=...

My test program only copies big chunks of aligned memory, so it only tests 
maximum throughput (such as when copying video frames). I will write a better 
one that measures throughput for blocks of different sizes, aligned and 
unaligned, but first I want to find out why I can't get even close to the 
expected RAM bandwidth. Bursts occur at 1.6 Gbyte/s and sustained transfers 
might reach 400 Mbyte/s in theory; since the video controller eats almost half 
of that, I'd like to get somewhere close to 200 Mbyte/s.
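
A benchmark of the kind described above could look roughly like the following sketch. Everything in it is an assumption made for illustration: the names copy_fn and copy_rate_mib are made up, the buffers come from malloc() rather than from a framebuffer mapping, and clock() measures CPU time rather than the wall-clock time that gettimeofday() would give.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Any routine with the memcpy() signature can be measured. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/*
 * Time `iters` copies of `len` bytes and return the rate in MiB/s.
 * src_off/dst_off shift the buffers to exercise unaligned copies.
 * Returns a negative value on allocation failure, or if the run was
 * too short for clock() to resolve.
 */
double copy_rate_mib(copy_fn fn, size_t len, size_t src_off,
		     size_t dst_off, int iters)
{
	/* volatile keeps the compiler from inlining or dropping the calls */
	copy_fn volatile copy = fn;
	unsigned char *src = malloc(len + src_off);
	unsigned char *dst = malloc(len + dst_off);
	clock_t t0, t1;
	double secs;
	int i;

	if (!src || !dst) {
		free(src);
		free(dst);
		return -1.0;
	}

	/* Touch both buffers so page faults are not part of the timing. */
	memset(src, 0x5a, len + src_off);
	memset(dst, 0, len + dst_off);

	t0 = clock();
	for (i = 0; i < iters; i++)
		copy(dst + dst_off, src + src_off, len);
	t1 = clock();

	free(src);
	free(dst);

	secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
	if (secs <= 0.0)
		return -1.0;
	return ((double)len * iters) / secs / (1024.0 * 1024.0);
}
```

Calling it as copy_rate_mib(memcpy, 1 << 20, 0, 0, 10) and again as copy_rate_mib(memcpy, 1 << 20, 1, 3, 10) would compare an aligned run with an unaligned one. Because the call goes through a function pointer, a memcpy() preloaded via LD_PRELOAD should be the one that gets measured.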

The result is quite a bit better than that of glibc-2.7 (13.2 Mbyte/s -> 22 
Mbyte/s), but still far from the 71.5 Mbyte/s achieved when using bigger 
strides of 16 registers loaded/stored at a time.
Note that this is copy performance; one-way throughput should be double these 
figures.

I'll try to learn how the cache-manipulating instructions work, to see if I can 
gain some more bandwidth using them.

Regards,

-- 
David Jander


Re: Efficient memcpy()/memmove() for G2/G3 cores...

2008-09-01 Thread David Jander
On Friday 29 August 2008 22:34:21 Steven Munroe wrote:
  I am not complaining. I was only wondering if it is just me or there
  really is very little that has been done (for either uClibc, glibc, or
  whatever for powerpc) to improve performance of (linux-) applications on
  lower-power platforms (G2 core). AFAICS there is a LOT that can be
  gained by simple tweaks.

 This is a self-help group (free as in freedom). We help each other, and
 you can help yourself. There is no free lunch.

I never expected to be served a free dish of any kind on a mailing-list ;-)
I was just asking around, to avoid reinventing wheels, since I intend to dig 
into this problem, that's all. My intention never was to pick up work from 
others and then run.

  The problem is: I have very little experience with powerpc assembly and
  only very limited time to dedicate to this and I am looking for others
  who have

 Well this will be a good learning experience for you. We will try to
 answer questions.

Excellent. I love learning new stuff ;-)

Thanks a lot for the guidance so far...

Regards,

-- 
David Jander


Re: Efficient memcpy()/memmove() for G2/G3 cores...

2008-08-29 Thread David Jander
On Wednesday 27 August 2008 23:04:39 Steven Munroe wrote:
 On Tue, 2008-08-26 at 08:28 +1000, Benjamin Herrenschmidt wrote:
  On Mon, 2008-08-25 at 15:06 +0200, David Jander wrote:
   Hi Matt,
  
   On Monday 25 August 2008 13:00:10 Matt Sealey wrote:
    The focus has definitely been on VMX but that's not to say lower
    power processors were forgotten :)
  
   lower-power (pun intended) is coming strong these days, as
   energy-efficiency is getting more important every day. And the MPC5121
   is a brand-new embedded processor, that will most probably pop up in
   quite a lot of devices around you ;-)
 
  It would be useful if somebody interested in getting things
  into glibc did the necessary FSF copyright assignment stuff and worked
  toward integrating them.

 Ben makes a very good point!

Sounds reasonable... but I am still wondering what you mean by "things"?
AFAICS there is almost nothing there (besides the memcpy() routine from Gunnar 
von Boehn, which is apparently still far from optimal). And I was asking for 
someone to correct me here ;-)

 There is also a framework for adding and maintaining optimizations of
 this type:
 
 http://penguinppc.org/dev/glibc/glibc-powerpc-cpu-addon.html

I had already stumbled across this one, but it seems to focus on G3 or newer 
processors (power4). There is no optimal memcpy() for G2/PPC603/e300.

[...]
 So it does no good to complain here. If you have code you want to
 contribute, get your FSF CR assignment and join #glibc on freenode IRC.

I am not complaining. I was only wondering if it is just me or there really is 
very little that has been done (for either uClibc, glibc, or whatever for 
powerpc) to improve performance of (linux-) applications on lower-power 
platforms (G2 core). AFAICS there is a LOT that can be gained by simple 
tweaks.

 And we will help you.

Thanks, now that I know the correct way to contribute, I only need 
to come up with a good set of optimizations, worthy of inclusion in glibc.
OTOH, maybe it is easier and simpler to start with a collection of functions 
in a shared library, suitable for preloading via LD_PRELOAD 
or /etc/ld.so.preload...

Maybe once this collection is more stable (i.e. once the heavy tweaking has 
stopped) one could try the pilgrimage towards glibc inclusion.

The problem is: I have very little experience with powerpc assembly and only 
very limited time to dedicate to this and I am looking for others who have 

Greetings,

-- 
David Jander


Efficient memcpy()/memmove() for G2/G3 cores...

2008-08-25 Thread David Jander

Hello,

I was wondering if there is a good replacement for the glibc memcpy() 
functions that doesn't have the horrendous performance glibc's versions show 
on embedded PowerPC processors.

I did some simple benchmarks with this implementation on our custom MPC5121 
based board (Freescale e300 core, something like a PPC603e, G2, without VMX):

...
unsigned long int a,b,c,d;
unsigned long int a1,b1,c1,d1;
...
while (len >= 32)
{
	a =  plSrc[0];
	b =  plSrc[1];
	c =  plSrc[2];
	d =  plSrc[3];
	a1 = plSrc[4];
	b1 = plSrc[5];
	c1 = plSrc[6];
	d1 = plSrc[7];
	plSrc += 8;
	plDst[0] = a;
	plDst[1] = b;
	plDst[2] = c;
	plDst[3] = d;
	plDst[4] = a1;
	plDst[5] = b1;
	plDst[6] = c1;
	plDst[7] = d1;
	plDst += 8;
	len -= 32;
}
...

And the results are more than telling: when this is preloaded via LD_PRELOAD, 
some programs get an enormous performance boost.
For example a small test program that copies frames into video memory (just 
RAM) improved throughput from 13.2 MiB/s to 69.5 MiB/s.
I have googled for this issue, but most optimized versions of memcpy() and 
friends seem to focus on AltiVec/VMX, which this processor does not have.
Now I am certain that most of the G2/G3 users on this list _must_ have a 
better solution for this. Any suggestions?
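
For reference, the loop fragment above can be fleshed out into a complete routine along the following lines. This is only a sketch under assumptions: the name unrolled_memcpy, the word_t typedef (which relies on the GCC/Clang may_alias attribute) and the alignment check are illustrative additions, not part of the code that was benchmarked. Unaligned pointers and the tail are handled by a plain byte loop.

```c
#include <stddef.h>
#include <stdint.h>

/* may_alias (a GCC/Clang extension) permits reading any object as words. */
typedef unsigned long __attribute__((may_alias)) word_t;

#define WORDS_PER_ITER 8	/* 32 bytes per iteration with 4-byte longs */

void *unrolled_memcpy(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	const size_t chunk = WORDS_PER_ITER * sizeof(word_t);

	/* Take the word path only if both pointers are word-aligned. */
	if ((((uintptr_t)d | (uintptr_t)s) & (sizeof(word_t) - 1)) == 0) {
		word_t *wd = (word_t *)d;
		const word_t *ws = (const word_t *)s;

		while (len >= chunk) {
			/* Issue all eight loads before the first store. */
			word_t a = ws[0], b = ws[1], c = ws[2], e = ws[3];
			word_t a1 = ws[4], b1 = ws[5], c1 = ws[6], e1 = ws[7];

			wd[0] = a;  wd[1] = b;  wd[2] = c;  wd[3] = e;
			wd[4] = a1; wd[5] = b1; wd[6] = c1; wd[7] = e1;
			ws += WORDS_PER_ITER;
			wd += WORDS_PER_ITER;
			len -= chunk;
		}
		d = (unsigned char *)wd;
		s = (const unsigned char *)ws;
	}

	/* Tail, or the whole buffer if a pointer was misaligned. */
	while (len > 0) {
		*d++ = *s++;
		len--;
	}
	return dst;
}
```

Renaming the function to memcpy and building it with gcc -shared would give a preloadable library. In that case it is probably wise to check that the compiler does not turn the loops back into a call to memcpy itself; GCC has -fno-tree-loop-distribute-patterns for this.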

Btw, the tests are done on Ubuntu/PowerPC 7.10, don't know if that matters 
though...

Best regards,

-- 
David Jander


Re: Efficient memcpy()/memmove() for G2/G3 cores...

2008-08-25 Thread David Jander

Hi Matt,

On Monday 25 August 2008 13:00:10 Matt Sealey wrote:
 The focus has definitely been on VMX but that's not to say lower power
 processors were forgotten :)

lower-power (pun intended) is coming strong these days, as energy-efficiency 
is getting more important every day. And the MPC5121 is a brand-new embedded 
processor, that will most probably pop up in quite a lot of devices around 
you ;-)

 Gunnar von Boehn did some benchmarking with an assembly optimized routine,
 for Cell, 603e and so on (basically the whole gamut from embedded up to
 server class IBM chips) and got some pretty good results;

 http://www.powerdeveloper.org/forums/viewtopic.php?t=1426

 It is definitely something that needs fixing. The generic routine in glibc
 just copies words with no benefit of knowing the cache line size or any
 cache block buffers in the chip, and certainly no use of cache control or
 data streaming on higher end chips.

 With knowledge of the right way to unroll the loops, how many copies to
 do at once to try and get a burst, reducing cache usage etc. you can get
 very impressive performance (as you can see, 50MB up to 78MB at the
 smallest size, the basic improvement is 2x performance).
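
To illustrate the point about knowing the cache line size, here is a minimal sketch of a copy that first aligns the destination to a line boundary and then moves whole lines. It is not Gunnar's routine and uses no cache-control instructions. The name line_copy and the u32_alias typedef (GCC/Clang may_alias attribute) are assumptions for the example, and the 32-byte line size is the value e300/603e-class cores use.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u	/* assumed data cache line size in bytes */

/* may_alias (a GCC/Clang extension) permits reading any object as words. */
typedef uint32_t __attribute__((may_alias)) u32_alias;

void *line_copy(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* Head: byte-copy until the destination sits on a line boundary. */
	while (len > 0 && ((uintptr_t)d & (CACHE_LINE - 1)) != 0) {
		*d++ = *s++;
		len--;
	}

	/* Body: one full line per iteration, if the source is word-aligned. */
	if (((uintptr_t)s & (sizeof(u32_alias) - 1)) == 0) {
		while (len >= CACHE_LINE) {
			u32_alias *wd = (u32_alias *)d;
			const u32_alias *ws = (const u32_alias *)s;
			u32_alias w0 = ws[0], w1 = ws[1], w2 = ws[2], w3 = ws[3];
			u32_alias w4 = ws[4], w5 = ws[5], w6 = ws[6], w7 = ws[7];

			wd[0] = w0; wd[1] = w1; wd[2] = w2; wd[3] = w3;
			wd[4] = w4; wd[5] = w5; wd[6] = w6; wd[7] = w7;
			d += CACHE_LINE;
			s += CACHE_LINE;
			len -= CACHE_LINE;
		}
	}

	/* Tail, or everything that is left if the source was misaligned. */
	while (len > 0) {
		*d++ = *s++;
		len--;
	}
	return dst;
}
```

Aligning the destination rather than the source is a design choice made here so that every store group covers exactly one line; whether that is the better trade-off on a given core would have to be measured.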

 I hope that helps you a little bit. Gunnar posted code to this list not
 long after. I have a copy of the e300 optimized routine but I thought
 best he should post it here, than myself.

Ok, I think I found it on the thread. The only problem is, that AFAICS it can 
be much better... at least on my platform (e300 core), and I don't know why! 
Can you explain this?

I did this:

I took Gunnar's code (copy-paste from the forum), renamed the function from 
memcpy_e300 to memcpy and put it in a file called memcpy_e300.S. Then I 
did:

$ gcc -O2 -Wall -shared -o libmemcpye300.so memcpy_e300.S

I tried the performance with the small program in the attachment:

$ gcc -O2 -Wall -o pruvmem pruvmem.c
$ LD_PRELOAD=/libmemcpye300.so ./pruvmem

Data rate:  45.9 MiB/s

Now I did the same thing with my own memcpy written in C (see attached file 
mymemcpy.c):

$ LD_PRELOAD=/libmymemcpy.so ./pruvmem

Data rate:  72.9 MiB/s

Now, can someone please explain this?

As a reference, here's glibc's performance:

$ ./pruvmem

Data rate:  14.8 MiB/s

 There is a lot of scope I think for optimizing several points (glibc,
 kernel, some applications) for embedded processors which nobody is
 really taking on. But, not many people want to do this kind of work..

They should! It makes a HUGE difference. I surely will of course.

Greetings,

-- 
David Jander

/* Attachment: pruvmem.c */
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(void)
{
	int f;
	unsigned long int *mem, *src, *dst;
	int t;
	long int usecs;
	unsigned long int secs, count;
	double rate;
	struct timeval tv, tv0, tv1;

	printf("Opening fb0\n");
	f = open("/dev/fb0", O_RDWR);
	if (f < 0) {
		perror("opening fb0");
		return 1;
	}
	printf("mmapping fb0\n");

	/* Map 3 MiB of the framebuffer. */
	mem = mmap(NULL, 0x00300000, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, f, 0);

	printf("mmap returned: %p\n", (void *)mem);
	if (mem == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Fill the mapping with pseudo-random data (0xc0000 4-byte longs). */
	gettimeofday(&tv, NULL);
	for (t = 0; t < 0x000c0000; t++)
		mem[t] = (tv.tv_usec ^ tv.tv_sec) ^ t;
	count = 0;
	gettimeofday(&tv0, NULL);
	/* Copy 1 MiB ten times; dst starts 0x40000 longs past src. */
	for (t = 0; t < 10; t++) {
		src = mem;
		dst = mem + 0x00040000;
		memcpy(dst, src, 0x00100000);
		count += 0x00100000;
	}
	gettimeofday(&tv1, NULL);
	secs = tv1.tv_sec - tv0.tv_sec;
	usecs = tv1.tv_usec - tv0.tv_usec;
	if (usecs < 0) {
		/* Borrow one second when the microsecond difference is negative. */
		usecs += 1000000;
		secs -= 1;
	}
	printf("Time elapsed: %lu secs, %ld usecs data transferred: %lu bytes\n", secs, usecs, count);
	rate = (double)count / ((double)secs + (double)usecs / 1000000.0);
	printf("Data rate: %5.3g MiB/s\n", rate / (1024.0 * 1024.0));

	return 0;
}

/* Attachment: mymemcpy.c */
#include <stdlib.h>
void * memcpy(void * dst, void const * src, size_t len)
{
	unsigned long int a,b,c,d;
	unsigned long int a1,b1,c1,d1;
	unsigned long int a2,b2,c2,d2;
	unsigned long int a3,b3,c3,d3;
	long * plDst = (long *) dst;
	long const * plSrc = (long const *) src;
	//if (!((unsigned long)src & 0xFFFC) && !((unsigned long)dst & 0xFFFC))
	//{
	while (len >= 64)
	{
		a =  plSrc[0];
		b =  plSrc[1];
		c =  plSrc[2];
		d =  plSrc[3];
		a1 = plSrc[4];
		b1 = plSrc[5];
		c1 = plSrc[6];
		d1 = plSrc[7];
		a2 = plSrc[8];
		b2 = plSrc[9

Re: [PATCH 2/8][Version 2] MPC5121 clock driver

2008-06-25 Thread David Jander
On Tuesday 24 June 2008 23:24:26 John Rigby wrote:
 --- /dev/null
 +++ b/arch/powerpc/platforms/512x/clock.c

[...]

 +static void ref_clk_calc(struct clk *clk)
 +{
 + unsigned long rate;
 +
 + rate = devtree_getfreq("bus-frequency");
 + if (rate == 0) {
 + printk(KERN_WARNING
 + "No bus-frequency in dev tree using 66MHz\n");

Just nit-picking, but there should be a comma after "tree".

Greetings,

-- 
David Jander
Protonic Holland.


Re: [PATCH] Added support for PRTLVT based boards (MPC5121)

2008-06-23 Thread David Jander
On Friday 20 June 2008 16:36:20 you wrote:
 I have a set of patches that I will be submitting later today that
 adds the generic board support without removing ADS.  So I would
 prefer for you to just submit a device tree file for your board.

Ok, thanks. I'll check your patches, fix our DT and resubmit that one.

Greetings,

-- 
David Jander
Protonic Holland.


Re: [PATCH] Added support for PRTLVT based boards (MPC5121)

2008-06-20 Thread David Jander

Hi John,

On Wednesday 18 June 2008 17:33:48 John Rigby wrote:
 Hi David,

 Looks like your device tree is based on the beta ltib bsp.  There were
 some changes in release 1 that you may want to incorporate:

 First as a convention I changed all the interrupt numbers in the
 tuples to be decimal.  I like this better because the interrupts are
 decimal in the reference manual.

 Second, the new clock driver that is in the release 1 bsp, and that I will
 be posting here shortly, no longer uses the device tree, so you can remove
 all the clk-name, clk-parent, clk-ctrl properties.

Thanks, I'll incorporate these changes and submit again.
Btw, do you agree with the following part of the patch?

  diff --git a/arch/powerpc/platforms/512x/Kconfig b/arch/powerpc/platforms/512x/Kconfig
  index 4c0da0c..57b3912 100644
  --- a/arch/powerpc/platforms/512x/Kconfig
  +++ b/arch/powerpc/platforms/512x/Kconfig
  @@ -2,18 +2,20 @@ config PPC_MPC512x
 bool
 select FSL_SOC
 select IPIC
  -   default n
 
   config PPC_MPC5121
 bool
 select PPC_MPC512x
  -   default n
 
  -config MPC5121_ADS
  -   bool "Freescale MPC5121E ADS"
  +config MPC5121_GENERIC
  +   bool "Generic support for simple MPC5121 based boards"
 depends on PPC_MULTIPLATFORM && PPC32
 select DEFAULT_UIMAGE
 select PPC_MPC5121
 help
  - This option enables support for the MPC5121E ADS board.
  -   default n
  + This option enables support for a simple MPC5121 based boards which
  + do not need a custom platform specific setup.
  +
  + Boards that are compatible with this generic platform support
  + are: Freescale MPC5121 ADS and Protonic LVT based boards (ZANMCU
  + and VICVT2).
  diff --git a/arch/powerpc/platforms/512x/Makefile b/arch/powerpc/platforms/512x/Makefile
  index 232c89f..9d40a2e 100644
  --- a/arch/powerpc/platforms/512x/Makefile
  +++ b/arch/powerpc/platforms/512x/Makefile
  @@ -1,4 +1,4 @@
   #
   # Makefile for the Freescale PowerPC 512x linux kernel.
   #
  -obj-$(CONFIG_MPC5121_ADS)  += mpc5121_ads.o
  +obj-$(CONFIG_MPC5121_GENERIC)  += mpc5121_generic.o
  diff --git a/arch/powerpc/platforms/512x/mpc5121_ads.c b/arch/powerpc/platforms/512x/mpc5121_generic.c
  similarity index 73%
  rename from arch/powerpc/platforms/512x/mpc5121_ads.c
  rename to arch/powerpc/platforms/512x/mpc5121_generic.c
  index 50bd3a3..824ddbb 100644
  --- a/arch/powerpc/platforms/512x/mpc5121_ads.c
  +++ b/arch/powerpc/platforms/512x/mpc5121_generic.c
[...]

The idea is to make it as simple as possible to add new platforms that are 
basically just derivatives of the same.

Greetings,

-- 
David Jander
Protonic Holland.


[PATCH] Added support for PRTLVT based boards (MPC5121)

2008-06-13 Thread David Jander
 Made MPC5121_ADS board support generic:
 Renamed arch/powerpc/platforms/512x/mpc5121_ads.c and added list of supported
 boards.
 For both MPC5121 ADS or PRTLVT support, just select MPC5121_GENERIC and use
 the corresponding device-tree.

Signed-off-by: David Jander [EMAIL PROTECTED]
---
 arch/powerpc/boot/dts/prtlvt.dts   |  255 
 arch/powerpc/platforms/512x/Kconfig|   14 +-
 arch/powerpc/platforms/512x/Makefile   |2 +-
 .../512x/{mpc5121_ads.c => mpc5121_generic.c}  |   38 ++-
 4 files changed, 290 insertions(+), 19 deletions(-)
 create mode 100644 arch/powerpc/boot/dts/prtlvt.dts
 rename arch/powerpc/platforms/512x/{mpc5121_ads.c => mpc5121_generic.c} (73%)

diff --git a/arch/powerpc/boot/dts/prtlvt.dts b/arch/powerpc/boot/dts/prtlvt.dts
new file mode 100644
index 0000000..aeb663b
--- /dev/null
+++ b/arch/powerpc/boot/dts/prtlvt.dts
@@ -0,0 +1,255 @@
+/*
+ * Device tree source for PRTLVT based boards, based on:
+ * MPC5121E MDS Device Tree Source
+ *
+ * Copyright 2007 Freescale Semiconductor Inc.
+ * Copyright 2008 Protonic Holland
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ */
+
+ /* compile with: ./dtc -p 10240 -R 20 -I dts -o prtlvt.dtb -O dtb -b 0 dts/prtlvt.dts */
+
+/dts-v1/;
+
+/ {
+	model = "prtlvt";
+	compatible = "prt,prtlvt";
+	#address-cells = <1>;
+	#size-cells = <1>;
+
+	cpus {
+		#address-cells = <1>;
+		#size-cells = <0>;
+
+		PowerPC,[EMAIL PROTECTED] {
+			device_type = "cpu";
+			reg = <0>;
+			d-cache-line-size = <0x20>;	// 32 bytes
+			i-cache-line-size = <0x20>;	// 32 bytes
+			d-cache-size = <0x8000>;	// L1, 32K
+			i-cache-size = <0x8000>;	// L1, 32K
+			timebase-frequency = <50000000>;// 50 MHz (csb/4)
+			bus-frequency = <200000000>;	// 200 MHz csb bus
+			clock-frequency = <400000000>;	// 400 MHz ppc core
+		};
+	};
+
+	memory {
+		device_type = "memory";
+		reg = <0x00000000 0x10000000>;	// 256MB at 0
+	};
+
+	[EMAIL PROTECTED] {
+		compatible = "prt,prtlvt-localbus", "simple-bus";
+		#address-cells = <2>;
+		#size-cells = <1>;
+		reg = <0x80000020 0x40>;
+		ranges = <0x0 0x0 0xfe000000 0x02000000>;
+		[EMAIL PROTECTED],0 {
+			compatible = "amd,s29gl256n", "cfi-flash";
+			#address-cells = <1>;
+			#size-cells = <1>;
+			reg = <0 0x0 0x02000000>;
+			bank-width = <2>;
+		};
+	};
+
+	[EMAIL PROTECTED] {
+		compatible = "fsl,mpc5121-immr", "simple-bus";
+		#address-cells = <1>;
+		#size-cells = <1>;
+		#interrupt-cells = <2>;
+		ranges = <0x0 0x80000000 0x400000>;
+		reg = <0x80000000 0x400000>;
+		bus-frequency = <66000000>;	// 66 MHz ips bus
+
+
+		// IPIC
+		// interrupts cell = <intr #, sense>
+		// sense values match linux IORESOURCE_IRQ_* defines:
+		// sense == 8: Level, low assertion
+		// sense == 2: Edge, high-to-low change
+		//
+		ipic: [EMAIL PROTECTED] {
+			compatible = "fsl,mpc5121-ipic", "fsl,ipic";
+			interrupt-controller;
+			#address-cells = <0>;
+			#interrupt-cells = <2>;
+			reg = <0xc00 0x100>;
+		};
+
+		// 512x PSCs are not 52xx PSCs compatible
+		// PSC0 serial port aka ttyPSC0
+		[EMAIL PROTECTED] {
+			device_type = "serial";
+			compatible = "fsl,mpc5121-psc-uart";
+			port-number = <0>;
+			cell-index = <0>;
+			reg = <0x11000 0x100>;
+			interrupts = <0x28 0x8>; // actually the fifo irq
+			interrupt-parent = < &ipic >;
+		};
+
+		// PSC1 serial port aka ttyPSC1
+		[EMAIL PROTECTED] {
+			device_type = "serial";
+			compatible = "fsl,mpc5121-psc-uart";
+			port-number = <1>;
+			cell-index = <1>;
+			reg = <0x11100 0x100>;
+			interrupts = <0x28 0x8>; // actually the fifo irq
+			interrupt-parent = < &ipic >;
+		};
+
+		// PSC2 serial port aka ttyPSC2

Re: [PATCH 2/2] Re-added support for FEC on MPC5121 from Freescale LTIB to current head

2008-06-13 Thread David Jander
On Thursday 12 June 2008 14:12:15 you wrote:
 On Jun 12, 2008, at 6:45 AM, David Jander wrote:

 Your commit message isn't exactly helpful as most people don't know
 what LTIB is and it's not terribly relevant.  It just seems like you
 are adding support for the FEC on MPC5121 at this point.

[...]
  --- a/drivers/net/fec.h
  +++ b/drivers/net/fec.h
  @@ -59,6 +59,7 @@ typedef struct fec {
  } fec_t;
 
  #else
  +#if !defined(CONFIG_FS_ENET_MPC5121_FEC)
 
  /*
   *  Define device register set address map.
  @@ -97,6 +98,48 @@ typedef struct fec {
  unsigned long   fec_fifo_ram[112];  /* FIFO RAM buffer */
  } fec_t;
 
  +#else /* CONFIG_FS_ENET_MPC5121_FEC */
  +
  +typedef struct fec {
  [...]
  +} fec_t;
  +
  +#endif /* CONFIG_FS_ENET_MPC5121_FEC */
  #endif /* CONFIG_M5272 */

 I'm not exactly clear as to why this was done this way but this not
 acceptable as it means we can't build a multiplatform kernel that
 needs this driver.

Well, in this particular case that wouldn't be possible anyway, since 
CONFIG_M5272 is a ColdFire processor and CONFIG_FS_ENET_MPC5121_FEC is for a 
PowerPC processor.
Otherwise you are right: the driver breaks MPC83xx/MPC5121 multiplatform 
builds.

 It's also not clear to me if the MPC5121 FEC is really the same device,
 or close enough to it, that it should be sharing this driver or have its own.

I am coming to the conclusion that it should have its own driver. 
Although a lot of code could be shared, there are enough differences that 
writing just ONE driver without #ifdefs (which break multiplatform builds) 
would instead require a much larger number of run-time ifs, making it 
unreadable, unmaintainable and inefficient.
Here's why: the struct fec_t above, for instance, is mapped onto a set of 
registers in the FEC. For processors with a CPM1, with a CPM2 or without a CPM 
(i.e. the MPC5121), the register mapping seems to be significantly different, 
yet the structs are all called fec_t. How can one fix this at run time 
without renaming the structs and then using a lot of ifs, or a combination of 
macros and ifs, everywhere a register of the FEC is accessed? I fear it will 
be a mess.
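
One conceivable way to handle differing register layouts at run time is a per-variant table of register offsets plus a pair of accessor helpers, so that the driver core never touches a fixed struct layout. The following is only a sketch in plain C: all names (fec_regmap, fec_dev, fec_read, fec_write) are hypothetical, the MPC5121 offsets are derived from the u32 field order of the struct in the patch, and the second table contains placeholder numbers that are not taken from any reference manual.

```c
#include <stddef.h>
#include <stdint.h>

/* Logical registers the driver core cares about (a subset, for illustration). */
enum fec_reg { FEC_IEVENT, FEC_IMASK, FEC_ECNTRL, FEC_REG_COUNT };

/* Per-variant table of byte offsets into the register block. */
struct fec_regmap {
	const char *name;
	uint16_t offset[FEC_REG_COUNT];
};

/* Offsets follow from the MPC5121 struct layout in the patch. */
static const struct fec_regmap fec_map_mpc5121 = {
	"mpc5121", { 0x04, 0x08, 0x24 }
};

/* Placeholder offsets only, to show a second layout being selected. */
static const struct fec_regmap fec_map_other = {
	"hypothetical-variant", { 0x10, 0x14, 0x40 }
};

struct fec_dev {
	volatile uint32_t *base;	/* mapped register block */
	const struct fec_regmap *map;	/* chosen once at probe time */
};

/* A real driver would use the kernel's MMIO accessors instead. */
static inline uint32_t fec_read(const struct fec_dev *dev, enum fec_reg r)
{
	return dev->base[dev->map->offset[r] / sizeof(uint32_t)];
}

static inline void fec_write(const struct fec_dev *dev, enum fec_reg r,
			     uint32_t val)
{
	dev->base[dev->map->offset[r] / sizeof(uint32_t)] = val;
}
```

This costs one table lookup per register access. Whether that overhead and the extra indirection are preferable to a separate driver is exactly the trade-off in question.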

So I think it's either a separate driver, or break multiplatform builds.

Since I am learning from you that breaking multiplatform builds is a no-go, 
I'll settle for splitting up the driver.

Any suggestion on where to put that split-off? How to name it?
I would suggest drivers/net/fec_mpc512x/*

I just resubmitted PATCH 1/2 without part 2 (which doesn't have much to do 
with it anyway), so Grant may have a final look at it (hopefully I did it 
right this time). Part 2 (MPC5121_FEC) will have to wait until Monday or so, 
since it will take me a while, and I have to do other things in between.

Any suggestions on how to solve the puzzle are of course welcome...

Thanks a lot for reviewing.

Best regards,

-- 
David Jander
Protonic Holland.


[PATCH 1/2] Added support for PRTLVT based boards (MPC5121)

2008-06-12 Thread David Jander
 Made MPC5121_ADS board support generic:
 Renamed arch/powerpc/platforms/512x/mpc5121_ads.c and added list of supported
 boards.
 For both MPC5121 ADS or PRTLVT support, just select MPC5121_GENERIC and use
 the corresponding device-tree.

Signed-off-by: David Jander [EMAIL PROTECTED]
---
 arch/powerpc/boot/dts/prtlvt.dts   |  272 
 arch/powerpc/platforms/512x/Kconfig|   14 +-
 arch/powerpc/platforms/512x/Makefile   |2 +-
 .../512x/{mpc5121_ads.c => mpc5121_generic.c}  |   38 ++-
 4 files changed, 307 insertions(+), 19 deletions(-)
 create mode 100644 arch/powerpc/boot/dts/prtlvt.dts
 rename arch/powerpc/platforms/512x/{mpc5121_ads.c => mpc5121_generic.c} (73%)

diff --git a/arch/powerpc/boot/dts/prtlvt.dts b/arch/powerpc/boot/dts/prtlvt.dts
new file mode 100644
index 0000000..a011c8c
--- /dev/null
+++ b/arch/powerpc/boot/dts/prtlvt.dts
@@ -0,0 +1,272 @@
+/*
+ * Device tree source for PRTLVT based boards, base on:
+ * MPC5121E MDS Device Tree Source
+ *
+ * Copyright 2007 Freescale Semiconductor Inc.
+ * Copyright 2008 Protonic Holland
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ */
+
+ /* compile with: ./dtc -p 10240 -R 20 -I dts -o prtlvt.dtb -O dtb -b 0 dts/prtlvt.dts */
+
+/dts-v1/;
+
+/ {
+	model = "prtlvt";
+	compatible = "prt,prtlvt";
+	#address-cells = <1>;
+	#size-cells = <1>;
+
+	cpus {
+		#address-cells = <1>;
+		#size-cells = <0>;
+
+		PowerPC,[EMAIL PROTECTED] {
+			device_type = "cpu";
+			reg = <0>;
+			d-cache-line-size = <0x20>;	// 32 bytes
+			i-cache-line-size = <0x20>;	// 32 bytes
+			d-cache-size = <0x8000>;	// L1, 32K
+			i-cache-size = <0x8000>;	// L1, 32K
+			timebase-frequency = <50000000>;// 50 MHz (csb/4)
+			bus-frequency = <200000000>;	// 200 MHz csb bus
+			clock-frequency = <400000000>;	// 400 MHz ppc core
+		};
+	};
+
+	memory {
+		device_type = "memory";
+		reg = <0x00000000 0x10000000>;	// 256MB at 0
+	};
+
+	[EMAIL PROTECTED] {
+		compatible = "amd,s29gl256n", "cfi-flash";
+		reg = <0xfe000000 0x02000000>;
+		bank-width = <2>;
+		#address-cells = <1>;
+		#size-cells = <1>;
+		[EMAIL PROTECTED] {
+			label = "rootfs";
+			reg = <0x00000000 0x01800000>;
+		};
+		[EMAIL PROTECTED] {
+			label ="config0";
+			reg = <0x01800000 0x00200000>;
+		};
+		[EMAIL PROTECTED] {
+			label ="config1";
+			reg = <0x01a00000 0x00200000>;
+		};
+		[EMAIL PROTECTED] {
+			label ="kernel";
+			reg = <0x01c00000 0x002e0000>;
+		};
+		[EMAIL PROTECTED] {
+			label ="devicetree";
+			reg = <0x01ee0000 0x00020000>;
+		};
+		[EMAIL PROTECTED] {
+			label ="uboot";
+			reg = <0x01f00000 0x00100000>;
+		};
+	};
+
+	[EMAIL PROTECTED] {
+		compatible = "fsl,mpc5121-immr", "simple-bus";
+		#address-cells = <1>;
+		#size-cells = <1>;
+		#interrupt-cells = <2>;
+		ranges = <0x0 0x80000000 0x400000>;
+		reg = <0x80000000 0x400000>;
+		bus-frequency = <66000000>;	// 66 MHz ips bus
+
+
+		// IPIC
+		// interrupts cell = <intr #, sense>
+		// sense values match linux IORESOURCE_IRQ_* defines:
+		// sense == 8: Level, low assertion
+		// sense == 2: Edge, high-to-low change
+		//
+		ipic: [EMAIL PROTECTED] {
+			compatible = "fsl,mpc5121-ipic", "fsl,ipic";
+			interrupt-controller;
+			#address-cells = <0>;
+			#interrupt-cells = <2>;
+			reg = <0xc00 0x100>;
+		};
+
+		// 512x PSCs are not 52xx PSCs compatible
+		// PSC0 serial port aka ttyPSC0
+		[EMAIL PROTECTED] {
+			device_type = "serial";
+			compatible = "fsl,mpc5121-psc-uart";
+			port-number = <0>;
+			cell-index = <0>;
+			reg = <0x11000 0x100>;
+			interrupts = <0x28 0x8>; // actually the fifo irq

[PATCH 2/2] Re-added support for FEC on MPC5121 from Freescale LTIB to current head

2008-06-12 Thread David Jander

Signed-off-by: David Jander [EMAIL PROTECTED]
---
 arch/powerpc/platforms/Kconfig |2 +-
 drivers/net/fec.h  |   43 
 drivers/net/fs_enet/Kconfig|   22 +-
 drivers/net/fs_enet/fs_enet-main.c |   76 ++--
 drivers/net/fs_enet/fs_enet.h  |   17 +---
 drivers/net/fs_enet/mac-fec.c  |   22 +-
 drivers/net/fs_enet/mii-fec.c  |   10 -
 7 files changed, 167 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index 87454c5..a96937f 100644
--- a/arch/powerpc/platforms/Kconfig
+++ b/arch/powerpc/platforms/Kconfig
@@ -288,7 +288,7 @@ config CPM2
 
 config PPC_CPM_NEW_BINDING
bool
-   depends on CPM1 || CPM2
+   depends on CPM1 || CPM2 || FS_ENET_MPC5121_FEC
default y
 
 config AXON_RAM
diff --git a/drivers/net/fec.h b/drivers/net/fec.h
index 292719d..5c9fe34 100644
--- a/drivers/net/fec.h
+++ b/drivers/net/fec.h
@@ -59,6 +59,7 @@ typedef struct fec {
 } fec_t;
 
 #else
+#if !defined(CONFIG_FS_ENET_MPC5121_FEC)
 
 /*
  * Define device register set address map.
@@ -97,6 +98,48 @@ typedef struct fec {
unsigned long   fec_fifo_ram[112];  /* FIFO RAM buffer */
 } fec_t;
 
+#else /* CONFIG_FS_ENET_MPC5121_FEC */
+
+typedef struct fec {
+   u32 fec_reserved0;
+   u32 fec_ievent; /* Interrupt event reg */
+   u32 fec_imask;  /* Interrupt mask reg */
+   u32 fec_reserved1;
+   u32 fec_r_des_active;   /* Receive descriptor reg */
+   u32 fec_x_des_active;   /* Transmit descriptor reg */
+   u32 fec_reserved2[3];
+   u32 fec_ecntrl; /* Ethernet control reg */
+   u32 fec_reserved3[6];
+   u32 fec_mii_data;   /* MII manage frame reg */
+   u32 fec_mii_speed;  /* MII speed control reg */
+   u32 fec_reserved4[7];
+   u32 fec_mib_ctrlstat;   /* MIB control/status reg */
+   u32 fec_reserved5[7];
+   u32 fec_r_cntrl;/* Receive control reg */
+   u32 fec_reserved6[15];
+   u32 fec_x_cntrl;/* Transmit Control reg */
+   u32 fec_reserved7[7];
+   u32 fec_addr_low;   /* Low 32bits MAC address */
+   u32 fec_addr_high;  /* High 16bits MAC address */
+   u32 fec_opd;/* Opcode + Pause duration */
+   u32 fec_reserved8[10];
+   u32 fec_hash_table_high;/* High 32bits hash table */
+   u32 fec_hash_table_low; /* Low 32bits hash table */
+   u32 fec_grp_hash_table_high;/* High 32bits hash table */
+   u32 fec_grp_hash_table_low; /* Low 32bits hash table */
+   u32 fec_reserved9[7];
+   u32 fec_x_wmrk; /* FIFO transmit water mark */
+   u32 fec_reserved10;
+   u32 fec_r_bound;/* FIFO receive bound reg */
+   u32 fec_r_fstart;   /* FIFO receive start reg */
+   u32 fec_reserved11[11];
+   u32 fec_r_des_start;/* Receive descriptor ring */
+   u32 fec_x_des_start;/* Transmit descriptor ring */
+   u32 fec_r_buff_size;/* Maximum receive buff size */
+   u32 fec_dma_control;/* DMA Endian and other ctrl */
+} fec_t;
+
+#endif /* CONFIG_FS_ENET_MPC5121_FEC */
 #endif /* CONFIG_M5272 */
 
 
diff --git a/drivers/net/fs_enet/Kconfig b/drivers/net/fs_enet/Kconfig
index 562ea68..5e2520b 100644
--- a/drivers/net/fs_enet/Kconfig
+++ b/drivers/net/fs_enet/Kconfig
@@ -1,9 +1,23 @@
 config FS_ENET
 	tristate "Freescale Ethernet Driver"
-	depends on CPM1 || CPM2
+	depends on CPM1 || CPM2 || FS_ENET_MPC5121_FEC
 	select MII
 	select PHYLIB
 
+config FS_ENET_MPC5121_FEC
+	bool "Freescale MPC512x FEC driver"
+	depends on PPC_MPC512x
+	select FS_ENET
+	select FS_ENET_HAS_FEC
+
+config FS_ENET_TX_ALIGN_WORKAROUND
+	bool "MPC5121 FEC driver TX alignment workaround"
+	depends on FS_ENET_MPC5121_FEC
+	help
+	  Workaround for a problem with early Freescale MPC5121 chips.
+	  If unsure say 'y'
+	default y
+
 config FS_ENET_HAS_SCC
 	bool "Chip has an SCC usable for ethernet"
 	depends on FS_ENET && (CPM1 || CPM2)
@@ -16,13 +30,15 @@ config FS_ENET_HAS_FCC
 
 config FS_ENET_HAS_FEC
 	bool "Chip has an FEC usable for ethernet"
-	depends on FS_ENET && CPM1
+	depends on FS_ENET && (CPM1 || FS_ENET_MPC5121_FEC)
 	select FS_ENET_MDIO_FEC
 	default y
 
+
 config FS_ENET_MDIO_FEC
 	tristate "MDIO driver for FEC"
-	depends on FS_ENET && CPM1
+	depends on FS_ENET && (CPM1 || FS_ENET_MPC5121_FEC)
+
 
 config FS_ENET_MDIO_FCC
 	tristate "MDIO driver for FCC"
diff --git a/drivers/net/fs_enet/fs_enet-main.c b/drivers/net/fs_enet/fs_enet-main.c
index 31c9693..54f0079 100644
--- a/drivers/net/fs_enet/fs_enet-main.c
+++ b/drivers/net/fs_enet/fs_enet-main.c
@@ -592,6 +592,31 @@ void fs_cleanup_bds(struct net_device *dev