Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-03-03 Thread Bernd Mueller

Bernd Mueller wrote:
(very unexpected) result of this benchmark is, that a version with 
leaving the TStroke-Record packed, is about 13 % faster than the 
original patch. I am going to send a new patch soon.


Unfortunately, this one is about 10 % slower on x86. So, I am going to 
leave this to the experts here.


Regards, Bernd.

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-03-02 Thread Florian Klaempfl
Daniël Mantione wrote:

 On Fri, 29 Feb 2008, Christian Iversen wrote:

 Daniël Mantione wrote:

 On Fri, 29 Feb 2008, Christian Iversen wrote:

 Instead unaligned will simulate an unaligned load with two loads
 and some rotation etc. On the ARM, where every mnemonic can rotate
 operands, this isn't that bad of a penalty.

 Therefore, I wouldn't be surprised that even on ARM, arrays with
 packed structures are faster than arrays with unpacked structures.

 That's possible. Why would it be faster, btw? Better cache coherency?

 Like I mentioned, unlike modern x86 processors, ARM processors cannot
 detect an array traversal and preload the array into the cache. If
 the array is not in cache, you get cache miss after cache miss.

 Unlike modern x86 processors?

 Granted, I haven't timed it, but most processors since early P4 models
 are supposed to have Streaming access detection, which is a fancy
 way of saying array detection.

 Are you sure your information is current?
 
 Please read again. I said modern X86 processors have, ARM processors
 don't have.

And that's why we have the prefetch inline procedure, and it is also one reason why
our move is ~10 times faster than gcc's :)


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-29 Thread Marco van de Voort
 Are enumeration types 1 or 4 bytes in Delphi? If they are one byte, it 
 looks quite different (and I'm not sure about all the types used here, 
 some seem to be sets, some enumerations).

Can be configured:

http://lists.freepascal.org/docs-html/prog/progsu50.html

Delphi has the minenumsize one, not the packset one.


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-29 Thread Michael Schnell


The VirtualTreeView tries to make the fields of the (packed) record 
aligned at dword boundary by grouping together smaller (one or two 
byte fields) or adding dummy fields. Does this trick override the 
unaligned memory access?

Of course it is always a good idea to sort the members of a record 
according to their size (if the size is 2**n: order from big to small). As 
there is no definition of how the compiler is to place the 
elements in a record, IMHO the compiler should do this automatically, 
unless the record is defined as (bit-)packed.
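
The effect of ordering fields from big to small can be checked with a small layout calculation. What follows is a minimal sketch in Python (not Free Pascal, and not a description of how FPC actually lays out records). It assumes each field is aligned to its own size and that the record is padded to a multiple of its largest field; the field names and sizes are hypothetical.

```python
def padded_size(fields):
    """Size of a record whose fields are each aligned to their own size.

    `fields` is a list of (name, size_in_bytes) with power-of-two sizes.
    Assumption: natural alignment equals the field size, and the record
    is padded at the end to a multiple of its largest field.
    """
    offset = 0
    for _name, size in fields:
        offset = -(-offset // size) * size  # round up to a multiple of size
        offset += size
    largest = max(size for _name, size in fields)
    return -(-offset // largest) * largest


# Hypothetical record: small and large fields interleaved.
declared = [("Flag", 1), ("Count", 4), ("Height", 2), ("Kind", 1)]
by_size = sorted(declared, key=lambda field: field[1], reverse=True)

print(padded_size(declared))  # size in declaration order
print(padded_size(by_size))   # size after ordering from big to small
```

Under these assumptions the reordered record needs no padding at all, which is the point of the suggestion: the unpacked record becomes as small as the packed one while every field stays aligned.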


-Michael


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-29 Thread Jonas Maebe


On 29 Feb 2008, at 01:55, Luiz Americo Pereira Camara wrote:


One more question:

The VirtualTreeView tries to make the fields of the (packed) record  
aligned at dword boundary by grouping together smaller (one or two  
byte fields) or adding dummy fields. Does this trick override the  
unaligned memory access?


Not at this time.


Jonas


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-29 Thread Christian Iversen

Daniël Mantione wrote:



On Tue, 26 Feb 2008, Luiz Americo Pereira Camara wrote:


Yury Sidorov wrote:

The patch removes packed record for some platforms.
IMO packed can be removed for all platforms. It will gain some speed.


I'd like to understand more this issue.
Why are non packed records faster?


Cache thrashing. One of the most underestimated performance killers in 
modern software.



The difference occurs at memory allocation or at memory access?


Memory access. What happens is that the non-packed version causes more 
cache misses. A cache miss costs many cycles on a modern cpu, a 
misaligned read just costs an extra memory access (which is fast if 
cached) on x86, and an extra load instruction on ARM. This is much cheaper 
than a cache miss.


It's much worse than that. Some architectures simply _can't_ do 
unaligned access, and they will trigger an exception.


This exception will in many configurations be caught by the OS, that 
then might simulate the read by doing 2 reads, putting the result 
together, writing into the application memory, and doing a task switch.


This, in total, is several _orders of magnitude_ worse than unaligned 
access on a supported platform.


Of course, unaligned access in itself is pretty bad.

--
Kind regards
Christian Iversen


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-29 Thread Christian Iversen

Daniël Mantione wrote:



On Fri, 29 Feb 2008, Christian Iversen wrote:

Memory access. What happens is that the non-packed version causes 
more cache misses. A cache miss costs many cycles on a modern cpu, a 
misaligned read just costs an extra memory access (which is fast if 
cached) on x86, and an extra load instruction on ARM. This is much cheaper 
than a cache miss.


It's much worse than that. Some architectures simply _can't_ do 
unaligned access, and they will trigger an exception.


This exception will in many configurations be caught by the OS, that 
then might simulate the read by doing 2 reads, putting the result 
together, writing into the application memory, and doing a task switch.


This, in total, is several _orders of magnitude_ worse than unaligned 
access on a supported platform.


Of course, unaligned access in itself is pretty bad.


True, but irrelevant, because the discussion was under the assumption 
that an unaligned read is done using the unaligned pseudo function. 
Unless there is a bug in the compiler, the use of unaligned will never 
cause an exception.


Oh, you're right of course. I didn't catch that part of the argument.

Instead unaligned will simulate an unaligned load with two loads and 
some rotation etc. On the ARM, where every mnemonic can rotate operands, 
this isn't that bad of a penalty.


Therefore, I wouldn't be surprised that even on ARM, arrays with packed 
structures are faster than arrays with unpacked structures.


That's possible. Why would it be faster, btw? Better cache coherency?

--
Kind regards
Christian Iversen


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-29 Thread Daniël Mantione



On Fri, 29 Feb 2008, Christian Iversen wrote:

Memory access. What happens is that the non-packed version causes more 
cache misses. A cache miss costs many cycles on a modern cpu, a misaligned 
read just costs an extra memory access (which is fast if cached) on x86, 
and an extra load instruction on ARM. This is much cheaper than a cache miss.


It's much worse than that. Some architectures simply _can't_ do unaligned 
access, and they will trigger an exception.


This exception will in many configurations be caught by the OS, that then 
might simulate the read by doing 2 reads, putting the result together, 
writing into the application memory, and doing a task switch.


This, in total, is several _orders of magnitude_ worse than unaligned access 
on a supported platform.


Of course, unaligned access in itself is pretty bad.


True, but irrelevant, because the discussion was under the assumption that 
an unaligned read is done using the unaligned pseudo function. Unless 
there is a bug in the compiler, the use of unaligned will never cause an 
exception.


Instead unaligned will simulate an unaligned load with two loads and 
some rotation etc. On the ARM, where every mnemonic can rotate operands, 
this isn't that bad of a penalty.


Therefore, I wouldn't be surprised that even on ARM, arrays with packed 
structures are faster than arrays with unpacked structures.


Daniël


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-29 Thread Daniël Mantione



On Fri, 29 Feb 2008, Christian Iversen wrote:

Instead unaligned will simulate an unaligned load with two loads and some 
rotation etc. On the ARM, where every mnemonic can rotate operands, this 
isn't that bad of a penalty.


Therefore, I wouldn't be surprised that even on ARM, arrays with packed 
structures are faster than arrays with unpacked structures.


That's possible. Why would it be faster, btw? Better cache coherency?


Like I mentioned, unlike modern x86 processors, ARM processors cannot 
detect an array traversal and preload the array into the cache. If the 
array is not in cache, you get cache miss after cache miss.


A cache miss is very expensive with the latencies of modern memory. A smaller 
array results in fewer cache misses.


Daniël


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-29 Thread Daniël Mantione



On Fri, 29 Feb 2008, Christian Iversen wrote:


Daniël Mantione wrote:



On Fri, 29 Feb 2008, Christian Iversen wrote:

Instead unaligned will simulate an unaligned load with two loads and 
some rotation etc. On the ARM, where every mnemonic can rotate operands, 
this isn't that bad of a penalty.


Therefore, I wouldn't be surprised that even on ARM, arrays with packed 
structures are faster than arrays with unpacked structures.


That's possible. Why would it be faster, btw? Better cache coherency?


Like I mentioned, unlike modern x86 processors, ARM processors cannot 
detect an array traversal and preload the array into the cache. If the 
array is not in cache, you get cache miss after cache miss.


Unlike modern x86 processors?

Granted, I haven't timed it, but most processors since early P4 models are 
supposed to have Streaming access detection, which is a fancy way of saying 
array detection.


Are you sure your information is current?


Please read again. I said modern X86 processors have, ARM processors 
don't have.


Daniël


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-29 Thread Christian Iversen

Daniël Mantione wrote:



On Fri, 29 Feb 2008, Christian Iversen wrote:

Instead unaligned will simulate an unaligned load with two loads 
and some rotation etc. On the ARM, where every mnemonic can rotate 
operands, this isn't that bad of a penalty.


Therefore, I wouldn't be surprised that even on ARM, arrays with 
packed structures are faster than arrays with unpacked structures.


That's possible. Why would it be faster, btw? Better cache coherency?


Like I mentioned, unlike modern x86 processors, ARM processors cannot 
detect an array traversal and preload the array into the cache. If the 
array is not in cache, you get cache miss after cache miss.


Unlike modern x86 processors?

Granted, I haven't timed it, but most processors since early P4 models 
are supposed to have Streaming access detection, which is a fancy way 
of saying array detection.


Are you sure your information is current?

(I could be wrong too, of course)

--
Kind regards
Christian Iversen


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-29 Thread Yury Sidorov

From: Daniël Mantione [EMAIL PROTECTED]
Instead unaligned will simulate an unaligned load with two loads and some
rotation etc. On the ARM, where every mnemonic can rotate operands, this
isn't that bad of a penalty.

Therefore, I wouldn't be surprised that even on ARM, arrays with packed
structures are faster than arrays with unpacked structures.

That's possible. Why would it be faster, btw? Better cache coherency?

Like I mentioned, unlike modern x86 processors, ARM processors cannot
detect an array traversal and preload the array into the cache. If the
array is not in cache, you get cache miss after cache miss.

A cache miss is very expensive with latencies of modern memory. A smaller
array results in fewer cache misses.


I ran my benchmark on an ARM mobile device and got the following results:
2080ms - for non-packed
4450ms - for packed

It clearly shows that unaligned access kills performance on ARM...
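
To put the two sets of figures side by side, the ratios can be worked out directly. The snippet below is only arithmetic on the timings quoted in this thread (the ARM numbers above and the Core Duo numbers from the 2008-02-28 message); the function name is made up for illustration, and nothing is known here about what the benchmark itself measures.

```python
def packed_relative_to_unpacked(unpacked_ms, packed_ms):
    """Return packed time as a multiple of non-packed time (1.0 means equal)."""
    return packed_ms / unpacked_ms


arm = packed_relative_to_unpacked(2080, 4450)       # ARM mobile figures
core_duo = packed_relative_to_unpacked(2070, 1910)  # Core Duo figures

print(f"ARM:      packed takes {arm:.2f}x the non-packed time")
print(f"Core Duo: packed takes {core_duo:.2f}x the non-packed time")
```

So the packed version takes roughly twice as long on the ARM device, while on the Core Duo it is about 8 % faster.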

Yury. 


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-29 Thread Christian Iversen
Daniël Mantione wrote:
 
 
 On Fri, 29 Feb 2008, Christian Iversen wrote:
 
 Daniël Mantione wrote:


 On Fri, 29 Feb 2008, Christian Iversen wrote:

 Instead unaligned will simulate an unaligned load with two loads
 and some rotation etc. On the ARM, where every mnemonic can rotate
 operands, this isn't that bad of a penalty.

 Therefore, I wouldn't be surprised that even on ARM, arrays with
 packed structures are faster than arrays with unpacked structures.

 That's possible. Why would it be faster, btw? Better cache coherency?

 Like I mentioned, unlike modern x86 processors, ARM processors cannot
 detect an array traversal and preload the array into the cache. If
 the array is not in cache, you get cache miss after cache miss.

 Unlike modern x86 processors?

 Granted, I haven't timed it, but most processors since early P4 models
 are supposed to have Streaming access detection, which is a fancy
 way of saying array detection.

 Are you sure your information is current?
 
 Please read again. I said modern X86 processors have, ARM processors
 don't have.

Sorry, it's just really not my day today.. ;-)

I'll now go join another discussion without reading it properly, in turn
annoying even more people.. :-)

-- 
Kind regards
Christian Iversen


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-29 Thread Luiz Americo Pereira Camara

Vinzent Hoefler wrote:
Are enumeration types 1 or 4 bytes in Delphi? If they are one byte, it 
looks quite different (and I'm not sure about all the types used here, 
some seem to be sets, some enumerations). But at the first glance it 
seems, they used both packed records to either ensure minimum size or 
known record layout (maybe they even used the structure in some 
assembly module?), and also aligned them manually to avoid unaligned 
access issues.


  


Yes. VirtualTreeView/Delphi uses asm instructions.

The size of sets differs between Delphi and fpc, which makes the record 
structure different.


This is the record structure (size and offsets) in both compilers:

fpc:

Index Size: 4 Offset: 0
ChildCount Size: 4 Offset: 4
NodeHeight Size: 2 Offset: 8
States Size: 4 Offset: 10
Align Size: 1 Offset: 14
CheckState Size: 1 Offset: 15
CheckType Size: 1 Offset: 16
Dummy Size: 1 Offset: 17

Delphi:

Index Size: 4 Offset: 0
ChildCount Size: 4 Offset: 4
NodeHeight Size: 2 Offset: 8
States Size: 2 Offset: 10
Align Size: 1 Offset: 12
CheckState Size: 1 Offset: 13
CheckType Size: 1 Offset: 14
Dummy Size: 1 Offset: 15
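
The two listings can be checked mechanically for fields that do not sit on a multiple of their own size. This is a minimal Python sketch (not Free Pascal); the sizes are copied from the listings above, and "misaligned" here only means that the offset is not a multiple of the field size, which is an assumption about natural alignment and says nothing about the code the compiler generates for a packed record.

```python
def packed_offsets(fields):
    """Offsets of fields laid out back to back, as in a packed record."""
    offsets, offset = {}, 0
    for name, size in fields:
        offsets[name] = offset
        offset += size
    return offsets


def misaligned_fields(fields):
    """Fields whose packed offset is not a multiple of their own size."""
    offsets = packed_offsets(fields)
    return [name for name, size in fields if offsets[name] % size != 0]


# (name, size) pairs copied from the two listings above.
FPC = [("Index", 4), ("ChildCount", 4), ("NodeHeight", 2), ("States", 4),
       ("Align", 1), ("CheckState", 1), ("CheckType", 1), ("Dummy", 1)]
DELPHI = [("Index", 4), ("ChildCount", 4), ("NodeHeight", 2), ("States", 2),
          ("Align", 1), ("CheckState", 1), ("CheckType", 1), ("Dummy", 1)]

print(packed_offsets(FPC))
print(misaligned_fields(FPC))
print(misaligned_fields(DELPHI))
```

With the 4-byte set, States lands on offset 10 and is the only field in this excerpt that is not naturally aligned; with the 2-byte set every listed field is.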

Luiz


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-29 Thread Luiz Americo Pereira Camara

Jonas Maebe wrote:


On 29 Feb 2008, at 01:55, Luiz Americo Pereira Camara wrote:


One more question:

The VirtualTreeView tries to make the fields of the (packed) record 
aligned at dword boundary by grouping together smaller (one or two 
byte fields) or adding dummy fields. Does this trick override the 
unaligned memory access?


Not at this time.



Due to differences in set size, the layout is different between fpc and 
Delphi.


Using packed records I save 4 bytes per record.

Compiled under Delphi, the structure is as shown below.

The question is: using the layout below with packed (I can force the set 
size to be equal to Delphi), do I still have unaligned memory access?


Index Size: 4 Offset: 0
ChildCount Size: 4 Offset: 4
NodeHeight Size: 2 Offset: 8
States Size: 2 Offset: 10
Align Size: 1 Offset: 12
CheckState Size: 1 Offset: 13
CheckType Size: 1 Offset: 14
Dummy Size: 1 Offset: 15

Luiz




Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-29 Thread Jonas Maebe


On 01 Mar 2008, at 02:00, Luiz Americo Pereira Camara wrote:

The question is: using the layout below with packed (I can force the  
set size to be equal to Delphi), do I still have unaligned memory access?


As long as your record is declared as packed, all memory accesses are  
handled as if they are to unaligned memory locations.



Jonas


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-28 Thread Daniël Mantione



On Tue, 26 Feb 2008, Luiz Americo Pereira Camara wrote:


Yury Sidorov wrote:

The patch removes packed record for some platforms.
IMO packed can be removed for all platforms. It will gain some speed.


I'd like to understand more this issue.
Why are non packed records faster?


Cache thrashing. One of the most underestimated performance killers in 
modern software.



The difference occurs at memory allocation or at memory access?


Memory access. What happens is that the non-packed version causes more 
cache misses. A cache miss costs many cycles on a modern cpu, a misaligned 
read just costs an extra memory access (which is fast if cached) on x86, 
and an extra load instruction on ARM. This is much cheaper than a cache miss.


Daniël


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-28 Thread Vinzent Hoefler
On Tuesday 26 February 2008 17:27, Luiz Americo Pereira Camara wrote:
 Yury Sidorov wrote:
  The patch removes packed record for some platforms.
  IMO packed can be removed for all platforms. It will gain some
  speed.

 I'd like to understand more this issue.
 Why are non packed records faster?
 The difference occurs at memory allocation or at memory access?

At memory access.

On x86 processors it's usually only a speed penalty (or has anyone ever 
seen the AC flag turned on?), on other processors you may even have to 
work around exceptions (i.e. bus errors), because the processor simply 
refuses to read or write unaligned data. And then the only way to 
circumvent the processor's refusal is to read/write the data byte by 
byte or mask it out, which is slower than just reading or writing it.

Consider writing a 16-bit value spanning across 32-bit values where the 
processor can only access a single 32-bit value at an aligned address:

*_ _ _ _*_ _ _ _
|0|1|2|3|4|5|6|7|
|___|

Now the data you need is spanning across bytes [2:5], but the processor 
can only read full 32 bits either at position 0 (reading bytes [0:3]), 
or position 4 (reading bytes [4:7]). You'd need to read both processor 
words, mask the data in the lower and upper half of each and write back 
both words with the new data patched in between them.
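
The read-both-words, mask, and write-back sequence described above can be sketched as follows. This is a Python model, not compiler output: it assumes little-endian byte order and a memory that only permits aligned 4-byte word accesses, and all function names are invented for the illustration.

```python
WORD = 4  # the modelled CPU can only load/store whole, aligned 4-byte words


def load_word(mem, addr):
    assert addr % WORD == 0, "hardware would fault on this address"
    return int.from_bytes(mem[addr:addr + WORD], "little")


def store_word(mem, addr, value):
    assert addr % WORD == 0, "hardware would fault on this address"
    mem[addr:addr + WORD] = value.to_bytes(WORD, "little")


def load_unaligned(mem, addr, size):
    """Read `size` (1..4) bytes at any address using two aligned word loads."""
    base = addr - addr % WORD
    shift = 8 * (addr % WORD)
    pair = load_word(mem, base) | (load_word(mem, base + WORD) << 32)
    return (pair >> shift) & ((1 << (8 * size)) - 1)


def store_unaligned(mem, addr, value, size):
    """Write `size` (1..4) bytes at any address: two loads, mask, two stores."""
    base = addr - addr % WORD
    shift = 8 * (addr % WORD)
    mask = ((1 << (8 * size)) - 1) << shift
    pair = load_word(mem, base) | (load_word(mem, base + WORD) << 32)
    pair = (pair & ~mask) | ((value << shift) & mask)  # patch the new bytes in
    store_word(mem, base, pair & 0xFFFFFFFF)
    store_word(mem, base + WORD, pair >> 32)


mem = bytearray(range(8))               # bytes 0..7 as in the diagram
print(hex(load_unaligned(mem, 2, 4)))   # value occupying bytes [2:5]
store_unaligned(mem, 2, 0xAABBCCDD, 4)
print(mem.hex())                        # bytes 0, 1, 6 and 7 are untouched
```

For simplicity the sketch always touches both words, even when the value happens to fit in one; a real code generator would avoid that. Even so, one logical store turns into two loads, some masking and shifting, and two stores, which is the cost being described.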

So by now, no matter if the processor handles it for you or if the 
compiler would insert the necessary code to do it, even a simple 
increment is insanely expensive in terms of processor cycles.


Vinzent.


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-28 Thread Micha Nelissen

Luiz Americo Pereira Camara wrote:

Why are non packed records faster?
The difference occurs at memory allocation or at memory access?


In addition to what the others said, think of it like your 32-bit 
processor suddenly being an 8-bit processor: it has to manually load 4 
times 8 bits, arrange them into a 32-bit value, and only then use it. 
With non-packed, it can use the value directly.
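
As a rough illustration of that mental model, here is a Python sketch (not what any particular compiler emits) that builds a 32-bit value from four single-byte loads. Little-endian byte order and the function name are assumptions made for the example.

```python
def load32_bytewise(mem, addr):
    """Assemble a little-endian 32-bit value from four single-byte loads."""
    value = 0
    for i in range(4):
        value |= mem[addr + i] << (8 * i)  # one 8-bit load, shifted into place
    return value


# A 32-bit field stored at the odd offset 1, as can happen in a packed record.
record = bytes([0xFF, 0x78, 0x56, 0x34, 0x12])
print(hex(load32_bytewise(record, 1)))
```

Four loads plus shifts and ORs replace what would be a single load instruction for an aligned field.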


Micha


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-28 Thread Michael Schnell


On x86 processors it's usually only a speed penalty (or has anyone ever 
seen the AC flag turned on?), on other processors you may even have to 
work around exceptions (i.e. bus errors), because the processor simply 
refuses to read or write unaligned data. 

It is not even guaranteed (or even common) that a misaligned access on 
a processor that can only do aligned memory accesses can be cured by an 
exception.


That is why the compiler needs to create complex code for the 
potentially misaligned elements of a packed record. All C compilers do 
this and I am positive that FP does it, too. So no problem here (beyond 
the additional cycles needed when working with packed records).


A real problem comes up if you manipulate a pointer to a (supposedly 
aligned) multi-byte variable to make it point to an odd address. This 
will make the program crash on certain processors (not PCs, not big 
68Ks, but small 68Ks).


-Michael


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-28 Thread Micha Nelissen

Michael Schnell wrote:
If it accesses a misaligned 32 bit value it does two accesses (not 4): 
e.g. once 8 bit and once 24 bit (when reading each of the accesses is 
the same 32 bit, anyway).


Logically you should think about it how I explained. That Intel did an 
optimization to make the speed impact less is a different issue: 
internally the processor still has to have separate 8 bit data paths 
and do shifting to reorder the bytes.


Perhaps this behaviour is specified in their optimization documents, or 
maybe you have the VHDL source? :-)


Transferring data from/to the 1st level cache imposes a lot more delay 
than the misaligned access. Thus if there are many instances of a record 
variable that are used for calculation, it might be much faster to use 
the packed version. If there are only a few, usually the unpacked 
version should be faster.


Show me the benchmark results ;-)

Micha


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-28 Thread Michael Schnell


internally the processor still has to have separate 8 bit data paths 
and do shifting to reorder the bytes.

This is a barrel shifter in the data path that is integrated in the 
queue and does not take an additional execution cycle.


-Michael


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-28 Thread Daniël Mantione



On Thu, 28 Feb 2008, Vinzent Hoefler wrote:


On Thursday 28 February 2008 09:16, Daniël Mantione wrote:


Memory access. What happens is that the non-packed version causes
more cache misses.


Please elaborate. If the (unaligned) data is crossing a cache-line, thus
causing two full cache-line reads, I'd understand that, but once it's
in the cache, it wouldn't matter anymore?


Yes, but if you have an array of them (as we have in this case), 
considerably more of these records will fit in the cache. Therefore you 
will have considerably fewer cache misses. This becomes even more serious 
when the processor in question does not have prefetching; in such a case, 
traversing the array will cause cache miss after cache miss, and a smaller 
array will then have fewer of these misses.
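
The trade-off can be made concrete by counting cache lines. The following Python sketch rests on stated assumptions only: a 32-byte cache line, an array that starts on a line boundary, and hypothetical record sizes of 32 bytes (non-packed) and 28 bytes (packed). None of these numbers come from the actual code under discussion.

```python
def cache_lines_touched(count, record_size, line_size=32):
    """Cache lines covered by a contiguous array starting on a line boundary."""
    total_bytes = count * record_size
    return -(-total_bytes // line_size)  # ceiling division


def records_straddling_lines(count, record_size, line_size=32):
    """Number of records whose first and last byte fall in different lines."""
    straddling = 0
    for i in range(count):
        first = (i * record_size) // line_size
        last = ((i + 1) * record_size - 1) // line_size
        if first != last:
            straddling += 1
    return straddling


for label, size in (("non-packed", 32), ("packed", 28)):
    print(label,
          cache_lines_touched(1000, size),
          records_straddling_lines(1000, size))
```

With these made-up sizes a full traversal of the packed array touches fewer lines, but many of its records straddle two lines, so a random access to a single record can cost two misses. Which effect wins depends on the access pattern and the hardware, as the benchmarks in this thread show.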


Daniël


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-28 Thread Vinzent Hoefler
On Thursday 28 February 2008 11:25, Daniël Mantione wrote:
 On Thu, 28 Feb 2008, Vinzent Hoefler wrote:
  On Thursday 28 February 2008 09:16, Daniël Mantione wrote:
  Memory access. What happens is that the non-packed version causes
  more cache misses.
 

OMG. I'm so confused. ;) I read that the packed version causes more 
cache misses here. That was the part where I didn't understand why.

  Please elaborate. If the (unaligned) data is crossing a cache-line,
  thus causing two full cache-line reads, I'd understand that, but
  once it's in the cache, it wouldn't matter anymore?

 Yes, but if you have an array of them (as we have in this case),
 considerably more of these records will fit in the cache.

Yes, that's what I figured, so I'm on the same path as you here, it 
seems, but tracing back the discussion it read:

-- 8< --
 I'd like to understand more this issue.
 Why are non packed records faster?

Cache thrashing. One of the most underestimated performance killers in 
modern software.

 The difference occurs at memory allocation or at memory access?

Memory access. What happens is that the non-packed version causes more 
cache misses.

-- 8< --

The first part tells me non-packed records are faster, but the second 
line tells me that the non-packed version also causes more cache 
misses, thus is slower. That got me confused, I think.

Of course, the net result only depends on the benchmark you're using. ;)


Vinzent.


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-28 Thread Yury Sidorov

From: Daniël Mantione [EMAIL PROTECTED]

 On Thursday 28 February 2008 09:16, Daniël Mantione wrote:

 Memory access. What happens is that the non-packed version causes
 more cache misses.

 Please elaborate. If the (unaligned) data is crossing a cache-line,
 thus causing two full cache-line reads, I'd understand that, but
 once it's in the cache, it wouldn't matter anymore?

Yes, but if you have an array of them (as we have in this case),
considerably more of these records will fit in the cache. Therefore you
will have considerably fewer cache misses. This becomes even more serious
when the processor in question does not have prefetching; in such a case,
traversing the array will cause cache miss after cache miss, and a smaller
array will then have fewer of these misses.


You are right. An array of packed records is a bit more efficient than an 
array of non-packed records, at least on modern x86 CPUs.


I ran some benchmarks and got the following on a Core Duo:
2070ms - for non-packed
1910ms - for packed

But for CPUs which do not support misaligned data access, packed 
records are speed killers and should be used only as a last resort.


Also, if a record is not an element of a large array, it is better to declare 
it as non-packed for all CPUs.


Yury. 


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-28 Thread Daniël Mantione



On Thu, 28 Feb 2008, Yury Sidorov wrote:


Yes, but if you have an array of them (as we have in this case),
considerably more of these records will fit in the cache. Therefore you
will have considerably less cache misses. This becomes even more serious
when the processor in question does not have prefetching; in such case,
traversing the array will cause cache miss after cache miss, a smaller
array will then have less of these misses.


You are right. Array of packed records is a bit more effective than array of 
non-packed records, at least on modern x86 CPUs.


I do some benchmarks and got on Core Duo:
2070ms - for non-packed
1910ms - for packed

But for CPUs which do not support misaligned data access - packed records are 
speed killers and need to be used as the last resort.


I am not 100% sure about this. Your Core Duo has an array traversal detector 
which activates prefetching. An ARM does not have such logic and will 
suffer cache miss after cache miss.


However, it is for certain that a manual unaligned load is more expensive 
on ARM than a hardware unaligned load on x86.


Also if record is not element of large array it is better to declare it as 
non-packed for all CPUs.


Yes.

Daniël


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-28 Thread Daniël Mantione



On Thu, 28 Feb 2008, Michael Schnell wrote:





An ARM does not have such logic and will suffer cache miss after cache 
miss.

Nonetheless the count of word transfers between memory and the cache would 
be smaller with packed records, which might result in much faster execution 
(of course depending on the layout of the record, speed of the memory, speed 
of the processor, type of operations done with the records, ...)


That is exactly what I wanted to explain: even on ARM the lower number of 
cache misses might pay for the (higher) cost of an unaligned load.


Daniël


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-28 Thread Luiz Americo Pereira Camara

Daniël Mantione wrote:



On Tue, 26 Feb 2008, Luiz Americo Pereira Camara wrote:


Yury Sidorov wrote:

The patch removes packed record for some platforms.
IMO packed can be removed for all platforms. It will gain some speed.


I'd like to understand more this issue.
Why are non packed records faster?


Cache thrashing. One of the most underestimated performance killers in 
modern software.



The difference occurs at memory allocation or at memory access?


Memory access. What happens is that the non-packed version causes more 
cache misses. A cache miss costs many cycles on a modern cpu, a 
misaligned read just costs an extra memory access (which is fast if 
cached) on x86, and an extra load instruction on ARM. This is much cheaper 
than a cache miss.


Thanks for all the explanations. I'm sure that the change is worth it.

One more question:

The VirtualTreeView tries to make the fields of the (packed) record 
aligned at dword boundary by grouping together smaller (one or two byte 
fields) or adding dummy fields. Does this trick override the unaligned 
memory access?


The real beast:

TVirtualNodePacked = packed record
   Index,                        //Offset 0
   ChildCount: Cardinal;         //Offset 4

   NodeHeight: Word;             //Offset 8
   States: TVirtualNodeStates;   //Offset 10 *
   Align: Byte;                  //Offset 14 **
   CheckState: TCheckState;      //Offset 15 **

   CheckType: TCheckType;        //Offset 16
   Dummy: Byte;                  //Offset 17
   TotalCount: Cardinal;         //Offset 18 *

  [...]

From what I understand, the fields marked with * cause an unaligned
access because they are not on a dword boundary. Right?
The fields marked with ** are not dword aligned either, but since they
are one-byte fields there is no unaligned access. Right?


And what about 64-bit systems? Should the fields be qword aligned, or is
dword still sufficient?
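
One way to check which fields are misaligned is to let the compiler report the offsets. This is a minimal sketch, not VirtualTreeView code: TVirtualNodeStates, TCheckState and TCheckType are not defined in this thread, so plain integer types of the sizes implied by the offset comments are assumed (4 bytes for States, 1 byte for the two check types).

```pascal
program FieldOffsets;
{$mode objfpc}

type
  // Hypothetical stand-in for the record quoted above; the field
  // types marked "assumed" are guesses based on the offset comments.
  TNodePacked = packed record
    Index,
    ChildCount: Cardinal;
    NodeHeight: Word;
    States: Cardinal;     // assumed 4 bytes
    Align: Byte;
    CheckState: Byte;     // assumed 1 byte
    CheckType: Byte;      // assumed 1 byte
    Dummy: Byte;
    TotalCount: Cardinal;
  end;

var
  N: TNodePacked;

// Prints the offset of a field inside N and whether that offset is a
// multiple of the field's own size (its natural alignment).
procedure Report(const Name: string; FieldAddr: Pointer; Size: PtrUInt);
var
  Offset: PtrUInt;
begin
  Offset := PtrUInt(FieldAddr) - PtrUInt(@N);
  Write(Name:12, '  offset ', Offset:2, '  size ', Size);
  if Offset mod Size = 0 then
    WriteLn('  naturally aligned')
  else
    WriteLn('  NOT naturally aligned');
end;

begin
  Report('Index',      @N.Index,      SizeOf(N.Index));
  Report('ChildCount', @N.ChildCount, SizeOf(N.ChildCount));
  Report('NodeHeight', @N.NodeHeight, SizeOf(N.NodeHeight));
  Report('States',     @N.States,     SizeOf(N.States));
  Report('Align',      @N.Align,      SizeOf(N.Align));
  Report('CheckState', @N.CheckState, SizeOf(N.CheckState));
  Report('CheckType',  @N.CheckType,  SizeOf(N.CheckType));
  Report('Dummy',      @N.Dummy,      SizeOf(N.Dummy));
  Report('TotalCount', @N.TotalCount, SizeOf(N.TotalCount));
end.
```

Under these assumed sizes only States (offset 10) and TotalCount (offset 18) are flagged, which matches the fields marked with *. The check compares each offset with the field's own size, so by this rule only 8-byte fields would need qword alignment on a 64-bit target.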


Luiz



Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-28 Thread Luiz Americo Pereira Camara

Luiz Americo Pereira Camara wrote:
> TVirtualNodePacked = packed record
>    Index,//Offset 0   ChildCount: Cardinal; //Offset 4
>    NodeHeight: Word;  //Offset 8
>    States: TVirtualNodeStates;  //Offset 10 *
>    Align: Byte;  //Offset 14 **   CheckState: TCheckState;
> //Offset 15 **
>
>    CheckType: TCheckType; //Offset 16
>    Dummy: Byte;  //Offset 17TotalCount: Cardinal; //Offset
> 18 *
>
>   [...]



TVirtualNodePacked = packed record
  Index,//Offset 0 
 ChildCount: Cardinal; //Offset 4

  NodeHeight: Word;  //Offset 8
  States: TVirtualNodeStates;  //Offset 10 *
  Align: Byte;  //Offset 14 **
  CheckState: TCheckState; //Offset 15 **

  CheckType: TCheckType; //Offset 16
  Dummy: Byte;  //Offset 17 
  TotalCount: Cardinal; //Offset 18 *

 [...]


The mail editor scrambled the record structure. I hope it is clearer
this time.


Luiz


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-28 Thread Vinzent Hoefler
Are enumeration types 1 or 4 bytes in Delphi? If they are one byte, it
looks quite different (and I'm not sure about all the types used here;
some seem to be sets, some enumerations). But at first glance it seems
they did both: they used packed records to ensure either minimum size or
a known record layout (maybe they even used the structure in some
assembly module?), and they also aligned the fields manually to avoid
unaligned access issues.


Vinzent.


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-26 Thread Bernd Mueller

Vincent Snijders wrote:
> Instead of testing for arm cpu, you could use
> FPC_REQUIRES_PROPER_ALIGNMENT too. So it is fixed for sparc as well.

Yes, the changed patch is attached.

Regards, Bernd.
Index: packages/graph/src/inc/gtext.inc
===
--- packages/graph/src/inc/gtext.inc(Revision 10376)
+++ packages/graph/src/inc/gtext.inc(Arbeitskopie)
@@ -68,7 +68,12 @@
 
 
 {  pStroke = ^TStroke;}
+
+{$ifdef FPC_REQUIRES_PROPER_ALIGNMENT}
+  TStroke = record { avoid misaligned data access }
+{$else}
   TStroke = packed record
+{$endif FPC_REQUIRES_PROPER_ALIGNMENT}
 opcode: byte;
 x: smallint;  { relative x offset character }
 y: smallint;  { relative y offset character }


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-26 Thread Vincent Snijders

Bernd Mueller wrote:
> Hello,
>
> the attached patch avoids misaligned data access (bus errors), during
> font rendering (with the graph unit) on Arm-Linux devices.

Instead of testing for arm cpu, you could use FPC_REQUIRES_PROPER_ALIGNMENT
too. So it is fixed for sparc as well.


Vincent


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-26 Thread Yury Sidorov

From: Daniël Mantione [EMAIL PROTECTED]
>> Bernd Mueller wrote:
>>> Hello,
>>>
>>> the attached patch avoids misaligned data access (bus errors),
>>> during font rendering (with the graph unit) on Arm-Linux devices.
>>
>> Instead of testing for arm cpu, you could use
>> FPC_REQUIRES_PROPER_ALIGNMENT too. So it is fixed for sparc as well.
>
> Well, packed records are usually used when speed is unimportant. If
> the code is speed critical, packed should not be used on any platform.
> Therefore I would like Bernd to consider the use of the 'unaligned'
> pseudo-function; ifdefs make code less readable.

The patch removes packed record for some platforms.
IMO packed can be removed for all platforms. It will gain some speed.

Yury. 


[fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-26 Thread Bernd Mueller

Hello,

the attached patch avoids misaligned data access (bus errors), during 
font rendering (with the graph unit) on Arm-Linux devices.


Regards, Bernd.
Index: packages/graph/src/inc/gtext.inc
===
--- packages/graph/src/inc/gtext.inc(Revision 10376)
+++ packages/graph/src/inc/gtext.inc(Arbeitskopie)
@@ -68,7 +68,12 @@
 
 
 {  pStroke = ^TStroke;}
+
+{$ifdef cpuarm}
+  TStroke = record { avoid misaligned data access }
+{$else}
   TStroke = packed record
+{$endif cpuarm}
 opcode: byte;
 x: smallint;  { relative x offset character }
 y: smallint;  { relative y offset character }


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-26 Thread Daniël Mantione



On Tue, 26 Feb 2008, Vincent Snijders wrote:
> Bernd Mueller wrote:
>> Hello,
>>
>> the attached patch avoids misaligned data access (bus errors), during font
>> rendering (with the graph unit) on Arm-Linux devices.
>
> Instead of testing for arm cpu, you could use FPC_REQUIRES_PROPER_ALIGNMENT
> too. So it is fixed for sparc as well.

Well, packed records are usually used when speed is unimportant. If the
code is speed critical, packed should not be used on any platform.
Therefore I would like Bernd to consider the use of the 'unaligned'
pseudo-function; ifdefs make code less readable.
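
For readers who have not used it, a minimal sketch of the 'unaligned' pseudo-function follows. It is not taken from gtext.inc: the 5-byte buffer and its layout (opcode, then x, then y) are assumptions modelled on the TStroke fields in the patch.

```pascal
program UnalignedDemo;
{$mode objfpc}

var
  // Hypothetical 5-byte stroke as it might sit in a font file buffer:
  // opcode at offset 0, x at offsets 1..2, y at offsets 3..4.
  Buf: array[0..4] of Byte;
  Value, X: SmallInt;
  P: PSmallInt;
begin
  FillChar(Buf, SizeOf(Buf), 0);
  Value := 1234;
  Move(Value, Buf[1], SizeOf(Value)); // store x at offset 1

  P := PSmallInt(@Buf[1]);
  // Unaligned tells the compiler not to assume that P^ is 2-byte
  // aligned, so targets that need it get a safe load sequence.
  X := Unaligned(P^);
  WriteLn(X); // prints 1234
end.
```

The record declaration stays packed on every platform, and only the accesses through a typed pointer are marked, which is why no ifdef is needed.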


Daniël


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-26 Thread Florian Klaempfl
Daniël Mantione wrote:
> On Tue, 26 Feb 2008, Vincent Snijders wrote:
>> Bernd Mueller wrote:
>>> Hello,
>>>
>>> the attached patch avoids misaligned data access (bus errors), during
>>> font rendering (with the graph unit) on Arm-Linux devices.
>>
>> Instead of testing for arm cpu, you could use
>> FPC_REQUIRES_PROPER_ALIGNMENT too. So it is fixed for sparc as well.
>
> Well, packed records are usually used when speed is unimportant. If the

Isn't this used to read a font file?


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-26 Thread Daniël Mantione



On Tue, 26 Feb 2008, Florian Klaempfl wrote:
> Daniël Mantione wrote:
>> On Tue, 26 Feb 2008, Vincent Snijders wrote:
>>> Bernd Mueller wrote:
>>>> Hello,
>>>>
>>>> the attached patch avoids misaligned data access (bus errors), during
>>>> font rendering (with the graph unit) on Arm-Linux devices.
>>>
>>> Instead of testing for arm cpu, you could use
>>> FPC_REQUIRES_PROPER_ALIGNMENT too. So it is fixed for sparc as well.
>>
>> Well, packed records are usually used when speed is unimportant. If the
>
> Isn't this used to read a font file?

You are right. Therefore, the unaligned pseudo-function is the proper
solution.


Daniël


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-26 Thread Bernd Mueller

Daniël Mantione wrote:
> On Tue, 26 Feb 2008, Florian Klaempfl wrote:
>> Daniël Mantione wrote:
>>> On Tue, 26 Feb 2008, Vincent Snijders wrote:
>>>> Bernd Mueller wrote:
>>>>> Hello,
>>>>>
>>>>> the attached patch avoids misaligned data access (bus errors), during
>>>>> font rendering (with the graph unit) on Arm-Linux devices.
>>>>
>>>> Instead of testing for arm cpu, you could use
>>>> FPC_REQUIRES_PROPER_ALIGNMENT too. So it is fixed for sparc as well.
>>>
>>> Well, packed records are usually used when speed is unimportant. If the
>>
>> Isn't this used to read a font file?
>
> You are right. Therefore, the unaligned pseudo function is the proper
> solution.

The main affected routines are unpack and decode. Both routines are
called for every single character (only for a stroked font) via
OutTextXYDefault. So speed is not unimportant ;-)


Regards, Bernd.


Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.

2008-02-26 Thread Micha Nelissen

Bernd Mueller wrote:
> The main affected routines are unpack and decode. Both routines are
> called for every single character (only for a stroked font) via
> OutTextXYDefault. So speed is not unimportant ;-)

Perhaps you can separate I/O and processing? Read into an unpacked
structure and process from there?
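
A rough sketch of that idea, under stated assumptions: the packed record uses the three TStroke fields shown in the patch, while TStrokeInMem, UnpackStrokes and the sample values are hypothetical names and data invented for illustration. The real unpack and decode routines in gtext.inc are not shown in this thread and may be organised differently.

```pascal
program SplitIoAndProcessing;
{$mode objfpc}

type
  // Layout as stored in the font file (fields as shown in the patch).
  TStrokeOnDisk = packed record
    opcode: Byte;
    x: SmallInt;
    y: SmallInt;
  end;

  // Hypothetical in-memory copy with the compiler's default alignment.
  TStrokeInMem = record
    opcode: Byte;
    x: SmallInt;
    y: SmallInt;
  end;

// Copies field by field, so later processing never touches the packed
// layout. Dst must have at least as many elements as Src.
procedure UnpackStrokes(const Src: array of TStrokeOnDisk;
                        var Dst: array of TStrokeInMem);
var
  i: Integer;
begin
  for i := 0 to High(Src) do
  begin
    Dst[i].opcode := Src[i].opcode;
    Dst[i].x := Src[i].x;
    Dst[i].y := Src[i].y;
  end;
end;

var
  Disk: array[0..1] of TStrokeOnDisk;
  Mem: array[0..1] of TStrokeInMem;
begin
  // Stand-in for data that would be read from the font file.
  Disk[0].opcode := 1; Disk[0].x := 10; Disk[0].y := -3;
  Disk[1].opcode := 2; Disk[1].x := 7;  Disk[1].y := 12;

  UnpackStrokes(Disk, Mem);
  WriteLn(Mem[0].x, ' ', Mem[0].y, ' ', Mem[1].x, ' ', Mem[1].y);
end.
```

The conversion is paid once when the font is loaded, and the per-character drawing code then works only on aligned records, at the cost of a second copy of the stroke data in memory.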


Micha