Re: [Gimp-developer] Solaris 64bit compile

2001-09-06 Thread Mattias Engdegård

Sven Neumann [EMAIL PROTECTED] wrote:
> The code is combining the multiplications done on 2 channels of the
> same pixel into one. It is also meant as an example of what can
> be done without using CPU-specific instructions.

here's another example (4 x 8bit saturated addition):

uint32 padd_sat_4x8(uint32 a, uint32 b)
{
    uint32 ta, tb, tm, q, u, m;
    /* save overflow-causing bits in ta, tb */
    ta = a & 0x80808080;
    tb = b & 0x80808080;
    q = a + b - (ta + tb);
    /* determine overflow conditions */
    tm = ta | tb;
    u = (ta & tb) | (q & tm);
    /* u now contains overflow bits, propagate them over fields */
    m = (u << 1) - (u >> 7);
    return (q + tm - u) | m;
}

This is completely portable, and should be a good deal faster than
conditionally adding each component separately, at least on modern
superscalar machines with expensive unpredicted branches. And benchmarks
confirm this.

Extending the above to 8 x 8bit (using 64-bit integers) is trivial, of course.
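
The 8 x 8bit extension mentioned above might look like the sketch below.
It is not code from the thread: the name padd_sat_8x8 and the use of
uint64_t from <stdint.h> are assumptions. Only the mask constant and the
word width change.

```c
#include <stdint.h>

/* Sketch: the 4 x 8 bit routine widened to eight byte lanes.
 * padd_sat_8x8 and uint64_t are assumptions, not names from the thread. */
uint64_t padd_sat_8x8(uint64_t a, uint64_t b)
{
    const uint64_t H = UINT64_C(0x8080808080808080); /* top bit of each byte */
    uint64_t ta, tb, tm, q, u, m;

    /* save overflow-causing bits in ta, tb */
    ta = a & H;
    tb = b & H;
    q = a + b - (ta + tb);      /* per-byte sum of the low 7 bits */
    /* determine overflow conditions */
    tm = ta | tb;
    u = (ta & tb) | (q & tm);
    /* spread each overflow bit over its whole byte */
    m = (u << 1) - (u >> 7);
    return (q + tm - u) | m;
}
```

Each byte lane is added independently and clamps at 0xff, so a carry
never reaches the neighbouring lane.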

___
Gimp-developer mailing list
[EMAIL PROTECTED]
http://lists.xcf.berkeley.edu/mailman/listinfo/gimp-developer



Re: [Gimp-developer] Solaris 64bit compile

2001-09-06 Thread Daniel Egger

On 06 Sep 2001 10:19:28 +0200, Mattias Engdegård wrote:

> This is completely portable, and should be a good deal faster than
> conditionally adding each component separately, at least on modern
> superscalar machines with expensive unpredicted branches. And benchmarks
> confirm this

I like it. I did some benchmarking with a few routines with different
compilers on ppc and i686 and here are the results:

egger@sonja:~/test  time ./testmat 
Time needed for padd_sat_4x8 in clocks: 555
Time needed for padd_sat_4x8_and in clocks: 664
Time needed for padd_sat_4x8_norm in clocks: 709

real    0m21.046s
user    0m18.680s
sys     0m0.650s

Options to compile:
/opt/experimental/bin/gcc -O3  -fssa  -save-temps test.c -o testmat 

egger@sonja:~/test  time ./testmat 
Time needed for padd_sat_4x8 in clocks: 555
Time needed for padd_sat_4x8_and in clocks: 584
Time needed for padd_sat_4x8_norm in clocks: 784
Time needed for padd_sat_4x8_vec in clocks: 178

real    0m21.477s
user    0m20.520s
sys     0m0.530s

Same machine but gcc-2.95.3 with Altivec support.
Options to compile:
/opt/gcc-altivec/bin/gcc -O3  -mcpu=7400 -fvec -save-temps test.c -o testmat

egger@alex:~  time ./testmat 
Time needed for padd_sat_4x8 in clocks: 883
Time needed for padd_sat_4x8_and in clocks: 1073
Time needed for padd_sat_4x8_norm in clocks: 1101

real    0m30.614s
user    0m30.370s
sys     0m0.210s

This machine is a Duron-800 with 1GB RAM. I've no idea why it performs
so poorly compared to the G4.

The compiler was gcc 2.95.3 with -march=i686 and -mcpu=i686; however, it
didn't use the conditional move instructions available on the newer
Pentium CPUs, which should have sped up the _norm case considerably, as it
is possible to do the same without branches.
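
A per-channel clamp with no branch in the source could look like this
sketch. sat_add_u8 is a hypothetical helper, not part of the attached
test.c. Whether a given compiler turns the ternary in the _norm routine
into a conditional move is not something this sketch settles.

```c
#include <stdint.h>

/* Saturating 8-bit add written without a conditional.
 * sat_add_u8 is a hypothetical helper name. */
static inline uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned) a + b;     /* 0 .. 510 */
    unsigned mask = 0u - (sum >> 8);     /* all ones if the sum exceeded 255 */
    return (uint8_t) (sum | mask);       /* low byte becomes 0xff on overflow */
}
```

The mask trick is the same idea the _and variant in the attachment uses,
applied to one channel at a time.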

The source is attached; feel free to study it and provide faster code.
At the moment it is pretty clear that Mattias' code is pretty efficient
and compiles equally well with several compilers on several
architectures.

Servus,
   Daniel


#include <stdio.h>
#include <time.h>
#include <glib-1.2/glib.h>

static guint32 dest[2000] __attribute__ ((aligned (16)));
static guint32 source1[2000] __attribute__ ((aligned (16)));
static guint32 source2[2000] __attribute__ ((aligned (16)));

static inline void
padd_sat_4x8(guint32 *dest, guint32 *pa, guint32 *pb)
{
  guint32 a = *pa, b = *pb;
  guint32 ta, tb, tm, q, u, m;
  /* save overflow-causing bits in ta, tb */
  ta = a & 0x80808080;
  tb = b & 0x80808080;
  q = a + b - (ta + tb);
  /* determine overflow conditions */
  tm = ta | tb;
  u = (ta & tb) | (q & tm);
  /* u now contains overflow bits, propagate them over fields */
  m = (u << 1) - (u >> 7);
  *dest = ((q + tm - u) | m);
}

static inline void
padd_sat_4x8_norm (guint32 *dest, guint32 *pa, guint32 *pb)
{
  guint8 *newdest = (guint8 *) dest;
  guint16 dr, dg, db, da;

  guint8 r1 = *((guint8 *) (pa) + 0);
  guint8 g1 = *((guint8 *) (pa) + 1);
  guint8 b1 = *((guint8 *) (pa) + 2);
  guint8 a1 = *((guint8 *) (pa) + 3);

  guint8 r2 = *((guint8 *) (pb) + 0);
  guint8 g2 = *((guint8 *) (pb) + 1);
  guint8 b2 = *((guint8 *) (pb) + 2);
  guint8 a2 = *((guint8 *) (pb) + 3);

  dr = r1 + r2;
  dg = g1 + g2;
  db = b1 + b2;
  da = a1 + a2;

  /* clamp each channel at 255 */
  newdest[0] = dr > 255 ? 255 : dr;
  newdest[1] = dg > 255 ? 255 : dg;
  newdest[2] = db > 255 ? 255 : db;
  newdest[3] = da > 255 ? 255 : da;
}

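static inline void
padd_sat_4x8_and (guint32 *dest, guint32 *pa, guint32 *pb)
{
  guint32 s1 = *pa, s2 = *pb;
  guint16 dr, dg, db, da;
  guint8 *newdest = (guint8 *) dest;

  /* Each masked term is parenthesized because + binds tighter than &.
     Writing the top byte to newdest[0] assumes big-endian byte order. */
  dr = ((s1 >> 24) & 0xff) + ((s2 >> 24) & 0xff);
  dg = ((s1 >> 16) & 0xff) + ((s2 >> 16) & 0xff);
  db = ((s1 >> 8) & 0xff) + ((s2 >> 8) & 0xff);
  da = (s1 & 0xff) + (s2 & 0xff);

  /* (d >> 8) is 1 on overflow, which turns the mask into 0xff */
  newdest[0] = (guint8) (~((dr >> 8) - 1)) | dr;
  newdest[1] = (guint8) (~((dg >> 8) - 1)) | dg;
  newdest[2] = (guint8) (~((db >> 8) - 1)) | db;
  newdest[3] = (guint8) (~((da >> 8) - 1)) | da;
}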

#ifdef __VEC__
static inline void
padd_sat_4x8_vec (guint32 *dest, guint32 *pa, guint32 *pb)
{
  vector unsigned char vdest, source1, source2;
  source1 = vec_ld (0, (unsigned char *) pa);
  source2 = vec_ld (0, (unsigned char *) pb);
  vdest = vec_adds (source1, source2);
  vec_st (vdest, 0, (unsigned char *) dest);
}
#endif

int
main (void)
{
  int i, current, iter;

  current = clock ();
  for (iter = 0; iter < 10; iter++)
  {
    for (i = 0; i < 2000; i++)
    {
      padd_sat_4x8 (dest + i, source1 + i, source2 + i);
    }
  }

  current = clock () - current;
  printf("Time needed for padd_sat_4x8 in clocks: %i\n", current);

  current = clock ();
  for (iter = 0; iter < 10; iter++)
  {
    for (i = 0; i < 2000; i++)
    {
      padd_sat_4x8_and (dest + i, source1 + i, source2 + i);
    }
  }

  current = clock () - current;
  printf("Time needed for padd_sat_4x8_and in clocks: %i\n", current);

  current = clock ();
  for (iter = 0; iter < 10; iter++)
  {
    for (i = 0; i < 2000; i++)
    {
      padd_sat_4x8_norm (dest + i, source1 + i, source2 + 

Re: [Gimp-developer] Solaris 64bit compile

2001-09-05 Thread Mattias Engdegård

Sven Neumann [EMAIL PROTECTED] wrote:
> __u32 __rb = (((color.r)<<16) | (color.b));
> __u32 __g  =  ((color.g)<<8);
>
> switch (a) {\
>      case 0xff: *(d) = (0xff000000 | __rb | __g); \
>      case 0: break; \
>      default: {\
>           __u32 pixel = *(d);\
>           __u16  s = (a)+1;\
>           register __u32 t1,t2; \
>           t1 = (pixel&0x00ff00ff); t2 = (pixel&0x0000ff00); \
>           pixel = ((((__rb-t1)*s+(t1<<8)) & 0xff00ff00) + \
>                    ((( __g-t2)*s+(t2<<8)) & 0x00ff0000)) >> 8; \
>           *(d) = pixel;\
>      }\
> }
>
> if you think this looks ugly, you should have a look at the same
> function for RGB16 and RGB15 ;-)

I don't think they are that bad --- the readability of the above code
merely suffers from a pollution of backslashes and underscores. But the
general principle is useful and it's not hard to do parallel saturating
additions and subtractions without any branches at all, just using bit
fiddling.

Many modern architectures can do better with vector instructions, but
generic fallback code is of course always needed.
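
As an illustration of the subtraction case, here is one possible 4 x 8bit
saturating subtract in the same branch-free style. It is a sketch written
for this note, not code posted in the thread, and the name psub_sat_4x8 is
made up.

```c
#include <stdint.h>

/* Sketch: 4 x 8 bit saturating subtraction (a - b, clamped at 0 per byte).
 * psub_sat_4x8 is a hypothetical name. */
uint32_t psub_sat_4x8(uint32_t a, uint32_t b)
{
    const uint32_t H = 0x80808080u;   /* top bit of each byte */
    uint32_t q, d, u, m;

    /* Set the top bit of every byte of a and clear it in b, so no byte
     * can borrow from its neighbour. Bit 7 of each byte of q is then set
     * exactly when the low 7 bits needed no borrow. */
    q = (a | H) - (b & ~H);
    /* Per-byte difference modulo 256: patch up bit 7. */
    d = q ^ ((a ^ ~b) & H);
    /* Underflow (a < b) in a byte: top bits are 0 and 1, or the top bits
     * are equal and the low 7 bits needed a borrow. */
    u = ((~a & b) | (~(a ^ b) & ~q)) & H;
    /* Spread each underflow bit over its byte and clear those bytes. */
    m = (u << 1) - (u >> 7);
    return d & ~m;
}
```

The mask propagation in the last two lines is the same step the
saturating addition uses; only the clamp value differs (0 instead of 0xff).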




Re: [Gimp-developer] Solaris 64bit compile

2001-09-05 Thread Sven Neumann

Hi,

Mattias Engdegård [EMAIL PROTECTED] writes:

> I don't think they are that bad --- the readability of the above code
> merely suffers from a pollution of backslashes and underscores.

I took this code out of a macro and forgot to remove the backslashes.
Also, we'd use different types since glib defines guint8, guint16 and
guint32. I chose not to show the code for RGB16 and RGB15 since
the masks and shift values used there make the code hard to
understand, while the RGB32 and ARGB cases are more obvious. Also,
fortunately, we don't do RGB16 in gimp.
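
Without the macro dressing, the blend quoted in this thread can be
sketched as a plain function. The function name, the parameter list and
the <stdint.h> types are assumptions made for this sketch; the arithmetic
follows the quoted code, where each channel ends up as
(src * s + dst * (256 - s)) >> 8 with s = a + 1.

```c
#include <stdint.h>

/* Sketch: the quoted macro body as a function. blend_argb and its
 * parameters are assumptions; the masks and shifts follow the quote. */
uint32_t blend_argb(uint32_t dst, uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    uint32_t rb = ((uint32_t) r << 16) | b;  /* red and blue share one word */
    uint32_t gg = (uint32_t) g << 8;
    uint32_t s, t1, t2;

    if (a == 0xff)
        return 0xff000000u | rb | gg;        /* opaque: overwrite */
    if (a == 0)
        return dst;                          /* transparent: keep destination */

    s  = (uint32_t) a + 1;                   /* 2 .. 255 */
    t1 = dst & 0x00ff00ffu;                  /* destination red and blue */
    t2 = dst & 0x0000ff00u;                  /* destination green */
    /* Two multiplications cover three channels; unsigned wrap-around
     * makes (src - dst) * s + (dst << 8) equal src*s + dst*(256 - s). */
    return ((((rb - t1) * s + (t1 << 8)) & 0xff00ff00u) +
            (((gg - t2) * s + (t2 << 8)) & 0x00ff0000u)) >> 8;
}
```

As in the quoted code, the blended result leaves the top byte at zero;
only the fully opaque case writes 0xff there.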


Salut, Sven



Re: [Gimp-developer] Solaris 64bit compile

2001-09-04 Thread Daniel Egger

On 03 Sep 2001 17:45:13 -0700, Brian Weber wrote:

> So if I understand you correctly, since the gimps main purpose is to
> perform bitwise manipulation that 64 bit won't give much of a benefit.

No, it's doing bytewise operations on 8, 16, 24 or 32bit data, and it has
to, because many processors can't operate on several 8bit channels at
once: they lack the instructions to do so. It would be a nice speedup
to work on 4x8bit at once on processors that support it, and with 64bit
we could even do 2 RGBA pixels at once if the processor supports it.

> If someone were to go in and change the memory management scheme of the
> gimp to make use of the OS instead of gimps tiles than we would probably
> see a benefit in the load time and save time.

Actually we considered using a chunked memory region instead of tiles.
This also has several drawbacks, but it would probably speed up
imaging a lot if enough memory is available (which is probably not an
issue anymore with today's memory prices). In theory a tile-based system
has huge advantages when working with pictures that are bigger than
the available memory, because only the tiles being worked on
have to be in physical memory.
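
To make the tile idea concrete, here is a minimal sketch of how a pixel
coordinate maps to a tile and an offset inside it. This is not GIMP's
tile manager: the 64 x 64 tile size, the type tile_pos and the function
locate_pixel are assumptions for illustration, and edge tiles are treated
as full-size for simplicity.

```c
#include <stddef.h>

/* Hypothetical tile geometry; 64 x 64 is an assumed size. */
enum { TILE_W = 64, TILE_H = 64 };

typedef struct {
    size_t tile;    /* index of the tile that holds the pixel */
    size_t offset;  /* pixel offset inside that tile */
} tile_pos;

/* Map image coordinates to (tile, offset). Every tile is assumed to be
 * stored at full TILE_W x TILE_H size, including tiles at the edges. */
tile_pos locate_pixel(size_t x, size_t y, size_t image_width)
{
    size_t tiles_per_row = (image_width + TILE_W - 1) / TILE_W;
    tile_pos p;

    p.tile = (y / TILE_H) * tiles_per_row + (x / TILE_W);
    p.offset = (y % TILE_H) * TILE_W + (x % TILE_W);
    return p;
}
```

An operation on a small region only needs the tiles whose indices it
touches, which is what lets the rest stay out of physical memory.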

> I don't know why but for some reason I got it in my head that 64 bit
> would be twice as fast for any application that was processor intensive.

It is (at least) twice as fast if you have data that needs 64bit
precision, because emulated 64bit operation is really heavyweight. With
imaging software one normally operates on byte or double-byte data,
which is why no big speedups are possible at the moment.
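
For a sense of what emulation costs, here is a sketch of a 64bit addition
done with 32bit operations only. The type u64_pair and the function
add64_emulated are made up for this example. Addition is the cheap case;
multiplication and division need considerably more work.

```c
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves, as 32-bit code would see it.
 * The type and function names are made up for this sketch. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_pair;

u64_pair add64_emulated(u64_pair a, u64_pair b)
{
    u64_pair r;

    r.lo = a.lo + b.lo;
    /* Unsigned wrap-around: a carry occurred iff the sum got smaller. */
    r.hi = a.hi + b.hi + (r.lo < a.lo);
    return r;
}
```

A native 64bit machine does the same work in a single add, which is
where the "at least twice as fast" comes from for data of that width.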

> Thinking that you can have, I don't know how many more, many more
> instructions and the system buss would be twice the width allowing twice the
> amount of data to go between ram and the CPUs.  Maybe when I design my new
> processor I will include that in :)

If you do that, take a PPC-like design and add lots of units and a better
dispatcher. :)

Servus,
   Daniel




Re: [Gimp-developer] Solaris 64bit compile

2001-09-04 Thread Daniel Egger

On 04 Sep 2001 15:51:34 +0200, Sven Neumann wrote:

> you certainly can process several 8 bit channels in one operation without
> special support from the processor and I would like to contribute such
> code to The GIMP

Uii, now I'm curious. If you group an RGBA pixel together in a word and
want to add another one what would the code look like so it does proper
saturation instead of smearing between the channels?
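
The smearing in question is easy to show. In the sketch below, both
function names are hypothetical. The first function is an ordinary word
addition, where a carry out of one channel lands in the next. The second
keeps the channels apart but still wraps around instead of saturating;
proper saturation is what the padd_sat_4x8 routine elsewhere in this
thread adds on top.

```c
#include <stdint.h>

/* Plain 32-bit addition: a carry out of one channel lands in the next. */
uint32_t add_naive(uint32_t a, uint32_t b)
{
    return a + b;
}

/* Per-channel wrap-around addition (no saturation): add the low 7 bits
 * of every byte, then patch in the top bits with XOR so that nothing
 * carries across a byte boundary. Both function names are hypothetical. */
uint32_t add_wrap_4x8(uint32_t a, uint32_t b)
{
    const uint32_t H = 0x80808080u;   /* top bit of each byte */
    return ((a & ~H) + (b & ~H)) ^ ((a ^ b) & H);
}
```

With a = 0x00ff0000 and b = 0x00010000, add_naive yields 0x01000000 (the
overflow leaks into the top channel), while add_wrap_4x8 yields
0x00000000 (the channel wraps but its neighbour is untouched).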

> but I'm waiting until the new paint_funcs code is in place.

Ayiie, now I'm really starting to feel guilty and I still cannot
compile HEAD gimp :(

Servus,
   Daniel




Re: [Gimp-developer] Solaris 64bit compile

2001-09-03 Thread Brian Weber

So if I understand you correctly, since the gimp's main purpose is to
perform bitwise manipulation, 64 bit won't give much of a benefit.  If
someone were to go in and change the memory management scheme of the gimp to
make use of the OS instead of gimp's tiles, then we would probably see a
benefit in the load time and save time.  I could also see that breaking
other OSes unless it was done to work across platforms.
I don't know why, but for some reason I got it in my head that 64 bit
would be twice as fast for any application that was processor intensive.
I was thinking that you can have, I don't know how many more, many more
instructions, and the system bus would be twice the width, allowing twice the
amount of data to go between RAM and the CPUs.  Maybe when I design my new
processor I will include that in :)



- Original Message -
From: Daniel Egger [EMAIL PROTECTED]
To: Brian Weber [EMAIL PROTECTED]
Cc: GIMP developer [EMAIL PROTECTED]
Sent: Monday, September 03, 2001 8:21 AM
Subject: Re: [Gimp-developer] Solaris 64bit compile


> On 25 Aug 2001 22:21:58 -0700, Brian Weber wrote:
>
> > I did get a successful compile of 64 bit gimp.  I ran it and used some
> > of the limited functionality that I usually use with no problems at
> > all.
>
> I wouldn't expect them either. I ran GIMP before on 64bit USparc, PPC
> and Itanium without any quirks.
>
> > The other thing I didn't see is any performance increase.
>
> Bad luck. :)
>
> > My machine is an Ultra 2 with dual 168 MHz and 192 Meg of ram.  I
> > grabbed a relatively small bitmap and used the globe plugin for 10
> > iterations to make sure that I had enough time to get some statistics.
> > I wasn't using any swap space and one of the processors was pegged
> > about 3/4s of the time.  The other process got pinged a couple times
> > but that was probably more os and vmstat.
>
> Ok, two reasons for that: GIMP is not requesting a single big chunk of
> memory from the system but rather does its own memory management using
> a tile based system. So if you do not excessively utilize your system or
> lie about your memory size then GIMP will probably never hit SWAP at
> all.
> The second CPU will be only used for tile rendering if GIMP is compiled
> with the MP option, however it doesn't give much of a benefit, is rather
> untested and will definitely result in a deterioration on a single CPU
> system because of the additional checks (which is also why it is not
> activated by default).
>
> > My question is, am I expecting too much from 64 bit?
>
> Yes you are.
>
> > Does the C code actually have to change to get the benefit from 64
> > bit?
>
> Yes. I don't exactly know how your machine is organized and how USPARC
> addresses memory but it may be that differently aligned memory might
> show benefits here. Also I have some ideas how to optimize the GIMP
> to give bigger benefits on 64bit architectures but cannot benchmark
> it due to unavailability of those systems; the trick here would be to
> reorganize memory utilisation to use the full 64bit capabilities of your
> machine. Those architectures normally shine while handling big chunks
> of data and suffer when addressing small amounts of data (say bytewise)
> with a lot of instructions because they are clocked so slowly.
>
> > Is this the right list to be asking this question?
>
> Sure it is.
>
> Servus,
>    Daniel






Re: [Gimp-developer] Solaris 64bit compile

2001-08-27 Thread Jens Ch. Restemeier

Hi,

> My question is, am I expecting too much from 64 bit?  Does the C
> code actually have to change to get the benefit from 64 bit?  Is this

Sorry, you're expecting too much for an application like The GIMP to
benefit from 64 bit. I'd say look for pure CPU speed and memory
bandwidth, or SIMD/vector instructions if the code is optimized for it.
64 bit gives most for applications that handle giant amounts of data or
very high precision.

Jens



[Gimp-developer] Solaris 64bit compile

2001-08-25 Thread Brian Weber



All,

I hope this question hasn't been asked too many times before, but I
couldn't find it in the archives. I am using an Ultra 2 sparc with a 64
bit kernel. I was told by Sun that the only way to see real benefit of
64 bit is to have the kernel and the application compiled for 64 bit.
That was the start of my adventure.

glib and gtk+ version 1.2.8 compiled without any problems using egcs
with the following options: "-m64 -mcmodel=medlow -g -O2". I then went
to compile the gimp version 1.2.2 with the same options. The flarefx.c
had some problems, but I figured if I got 64 bit then I could do without
one component. It compiled with a few warnings, but I don't believe they
are any different from the 32 bit warnings.

I did get a successful compile of 64 bit gimp. I ran it and used some of
the limited functionality that I usually use with no problems at all.
Now for the question. The other thing I didn't see is any performance
increase. My machine is an Ultra 2 with dual 168 MHz and 192 Meg of ram.
I grabbed a relatively small bitmap and used the globe plugin for 10
iterations to make sure that I had enough time to get some statistics.
I wasn't using any swap space and one of the processors was pegged about
3/4s of the time. The other process got pinged a couple times but that
was probably more os and vmstat. I also have a windows machine with gimp
installed for windows. That machine has 64 Meg of ram and a single
266 MHz processor. I did a side by side test of 10 iterations with the
globe plugin. The windows machine finished first with the sparc still
having 3 more iterations to go.

My question is, am I expecting too much from 64 bit? Does the C code
actually have to change to get the benefit from 64 bit? Is this the
right list to be asking this question?

I know this was a long email. I hope it was an appropriate question for
this list and I appreciate any comments. If this is not a question for
this list and you still have comments for me, just send them directly to
me so as not to disturb the list any more than I already have. Thanks

Brian