Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Bernd Mueller wrote:
> (very unexpected) result of this benchmark is that a version that leaves the TStroke record packed is about 13 % faster than the original patch. I am going to send a new patch soon.

Unfortunately, this one is about 10 % slower on x86. So I am going to leave this to the experts here.

Regards, Bernd.
___
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Daniël Mantione wrote:
> On Fri, 29 Feb 2008, Christian Iversen wrote:
>> Daniël Mantione wrote:
>>> On Fri, 29 Feb 2008, Christian Iversen wrote:
>>>>> Instead, unaligned will simulate an unaligned load with two loads and some rotation etc. On the ARM, where every mnemonic can rotate operands, this isn't that bad of a penalty. Therefore, I wouldn't be surprised if even on ARM, arrays with packed structures are faster than arrays with unpacked structures.
>>>> That's possible. Why would it be faster, btw? Better cache coherency?
>>> Like I mentioned, unlike modern x86 processors, ARM processors cannot detect an array traversal and preload the array into the cache. If the array is not in cache, you get cache miss after cache miss.
>> Unlike modern x86 processors? Granted, I haven't timed it, but most processors since early P4 models are supposed to have streaming access detection, which is a fancy way of saying array detection. Are you sure your information is current?
> Please read again. I said modern x86 processors have it, ARM processors don't.

And that's why we have the prefetch inline procedure, and it is also a reason why our move is ~10x faster than gcc's :)
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
> Are enumeration types 1 or 4 bytes in Delphi? If they are one byte, it looks quite different (and I'm not sure about all the types used here, some seem to be sets, some enumerations).

Can be configured: http://lists.freepascal.org/docs-html/prog/progsu50.html
Delphi has the minenumsize one, not the packset one.
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
> The VirtualTreeView tries to make the fields of the (packed) record aligned at dword boundary by grouping together smaller (one or two byte) fields or adding dummy fields. Does this trick avoid the unaligned memory access?

Of course it is always a good idea to sort the members of a record according to their size (if the size is 2**n: order from big to small). As there is no definition of how the compiler has to place the elements in a record, IMHO the compiler should do this automatically, unless the record is defined as (bit-)packed.

-Michael
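The effect of ordering fields from big to small can be sketched with a small layout calculator. This is only a model under stated assumptions: every field is aligned to its own size (sizes are powers of two), the record is padded to its largest alignment, and the field names are hypothetical. It does not claim to reproduce the layout rules of FPC or Delphi.

```python
def natural_layout(fields):
    """Lay out (name, size) fields, aligning each field to its own size.

    Assumption: alignment == size, and sizes are powers of two.
    Returns (offsets by name, total record size including tail padding).
    """
    offset = 0
    offsets = {}
    max_align = 1
    for name, size in fields:
        align = size
        offset = (offset + align - 1) // align * align  # round up to alignment
        offsets[name] = offset
        offset += size
        max_align = max(max_align, align)
    total = (offset + max_align - 1) // max_align * max_align
    return offsets, total


# Hypothetical record: two byte fields, one 32-bit field, one 16-bit field.
fields = [("flag", 1), ("count", 4), ("kind", 1), ("height", 2)]

_, declared_size = natural_layout(fields)
_, sorted_size = natural_layout(sorted(fields, key=lambda f: f[1], reverse=True))

print(declared_size, sorted_size)
```

In this model the declaration order costs four bytes of padding that the big-to-small order avoids, without any field becoming misaligned.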
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
On 29 Feb 2008, at 01:55, Luiz Americo Pereira Camara wrote:
> One more question: The VirtualTreeView tries to make the fields of the (packed) record aligned at dword boundary by grouping together smaller (one or two byte) fields or adding dummy fields. Does this trick avoid the unaligned memory access?

Not at this time.

Jonas
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Daniël Mantione wrote:
> On Tue, 26 Feb 2008, Luiz Americo Pereira Camara wrote:
>> Yury Sidorov wrote:
>>> The patch removes packed record for some platforms. IMO packed can be removed for all platforms. It will gain some speed.
>> I'd like to understand more this issue. Why are non packed records faster?
> Cache thrashing. One of the most underestimated performance killers in modern software.
>> The difference occurs at memory allocation or at memory access?
> Memory access. What happens is that the non-packed version causes more cache misses. A cache miss costs many cycles on a modern CPU; a misaligned read just costs an extra memory access (which is fast if cached) on x86, and an extra load instruction on ARM. This is much cheaper than a cache miss.

It's much worse than that. Some architectures simply _can't_ do unaligned access, and they will trigger an exception. This exception will in many configurations be caught by the OS, which then might simulate the read by doing 2 reads, putting the result together, writing into the application memory, and doing a task switch. This, in total, is several _orders of magnitude_ worse than unaligned access on a supported platform.

Of course, unaligned access in itself is pretty bad.

--
Best regards,
Christian Iversen
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Daniël Mantione wrote:
> On Fri, 29 Feb 2008, Christian Iversen wrote:
>>> Memory access. What happens is that the non-packed version causes more cache misses. A cache miss costs many cycles on a modern CPU; a misaligned read just costs an extra memory access (which is fast if cached) on x86, and an extra load instruction on ARM. This is much cheaper than a cache miss.
>> It's much worse than that. Some architectures simply _can't_ do unaligned access, and they will trigger an exception. This exception will in many configurations be caught by the OS, which then might simulate the read by doing 2 reads, putting the result together, writing into the application memory, and doing a task switch. This, in total, is several _orders of magnitude_ worse than unaligned access on a supported platform. Of course, unaligned access in itself is pretty bad.
> True, but irrelevant, because the discussion was under the assumption that an unaligned read is done using the unaligned pseudo-function. Unless there is a bug in the compiler, the use of unaligned will never cause an exception.

Oh, you're right of course. I didn't catch that part of the argument.

> Instead, unaligned will simulate an unaligned load with two loads and some rotation etc. On the ARM, where every mnemonic can rotate operands, this isn't that bad of a penalty. Therefore, I wouldn't be surprised if even on ARM, arrays with packed structures are faster than arrays with unpacked structures.

That's possible. Why would it be faster, btw? Better cache coherency?

--
Best regards,
Christian Iversen
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
On Fri, 29 Feb 2008, Christian Iversen wrote:
>> Memory access. What happens is that the non-packed version causes more cache misses. A cache miss costs many cycles on a modern CPU; a misaligned read just costs an extra memory access (which is fast if cached) on x86, and an extra load instruction on ARM. This is much cheaper than a cache miss.
> It's much worse than that. Some architectures simply _can't_ do unaligned access, and they will trigger an exception. This exception will in many configurations be caught by the OS, which then might simulate the read by doing 2 reads, putting the result together, writing into the application memory, and doing a task switch. This, in total, is several _orders of magnitude_ worse than unaligned access on a supported platform. Of course, unaligned access in itself is pretty bad.

True, but irrelevant, because the discussion was under the assumption that an unaligned read is done using the unaligned pseudo-function. Unless there is a bug in the compiler, the use of unaligned will never cause an exception.

Instead, unaligned will simulate an unaligned load with two loads and some rotation etc. On the ARM, where every mnemonic can rotate operands, this isn't that bad of a penalty. Therefore, I wouldn't be surprised if even on ARM, arrays with packed structures are faster than arrays with unpacked structures.

Daniël
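The "two loads and some rotation" idea can be modelled in a few lines. This is a sketch, not the code FPC emits for unaligned: memory is a little-endian byte string, `aligned_load32` is a hypothetical stand-in for a load that only works at addresses divisible by four, and the shifts stand in for the rotate step.

```python
def aligned_load32(mem, addr):
    """Model of a load that only works on 4-byte-aligned addresses."""
    assert addr % 4 == 0, "this model faults on misaligned addresses"
    return int.from_bytes(mem[addr:addr + 4], "little")


def unaligned_load32(mem, addr):
    """Read a 32-bit little-endian value at any address using aligned loads."""
    base = addr & ~3            # aligned word containing the first byte
    shift = (addr & 3) * 8      # bit position of the first byte in that word
    lo = aligned_load32(mem, base)
    if shift == 0:
        return lo               # already aligned: one load is enough
    hi = aligned_load32(mem, base + 4)
    # Drop the unwanted low bytes of the first word, append the low bytes
    # of the second word, and keep 32 bits.
    return ((lo >> shift) | (hi << (32 - shift))) & 0xFFFFFFFF


mem = bytes(range(16))
value = unaligned_load32(mem, 2)   # bytes 2, 3, 4, 5
print(hex(value))
```

The misaligned case costs two loads plus a shift and combine step, which matches the cost argument made above; how cheap that combine step is on real hardware is not something this model can show.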
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
On Fri, 29 Feb 2008, Christian Iversen wrote:
>> Instead, unaligned will simulate an unaligned load with two loads and some rotation etc. On the ARM, where every mnemonic can rotate operands, this isn't that bad of a penalty. Therefore, I wouldn't be surprised if even on ARM, arrays with packed structures are faster than arrays with unpacked structures.
> That's possible. Why would it be faster, btw? Better cache coherency?

Like I mentioned, unlike modern x86 processors, ARM processors cannot detect an array traversal and preload the array into the cache. If the array is not in cache, you get cache miss after cache miss. A cache miss is very expensive given the latencies of modern memory. A smaller array results in fewer cache misses.

Daniël
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
On Fri, 29 Feb 2008, Christian Iversen wrote:
> Daniël Mantione wrote:
>> On Fri, 29 Feb 2008, Christian Iversen wrote:
>>>> Instead, unaligned will simulate an unaligned load with two loads and some rotation etc. On the ARM, where every mnemonic can rotate operands, this isn't that bad of a penalty. Therefore, I wouldn't be surprised if even on ARM, arrays with packed structures are faster than arrays with unpacked structures.
>>> That's possible. Why would it be faster, btw? Better cache coherency?
>> Like I mentioned, unlike modern x86 processors, ARM processors cannot detect an array traversal and preload the array into the cache. If the array is not in cache, you get cache miss after cache miss.
> Unlike modern x86 processors? Granted, I haven't timed it, but most processors since early P4 models are supposed to have streaming access detection, which is a fancy way of saying array detection. Are you sure your information is current?

Please read again. I said modern x86 processors have it, ARM processors don't.

Daniël
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Daniël Mantione wrote:
> On Fri, 29 Feb 2008, Christian Iversen wrote:
>>> Instead, unaligned will simulate an unaligned load with two loads and some rotation etc. On the ARM, where every mnemonic can rotate operands, this isn't that bad of a penalty. Therefore, I wouldn't be surprised if even on ARM, arrays with packed structures are faster than arrays with unpacked structures.
>> That's possible. Why would it be faster, btw? Better cache coherency?
> Like I mentioned, unlike modern x86 processors, ARM processors cannot detect an array traversal and preload the array into the cache. If the array is not in cache, you get cache miss after cache miss.

Unlike modern x86 processors? Granted, I haven't timed it, but most processors since early P4 models are supposed to have streaming access detection, which is a fancy way of saying array detection. Are you sure your information is current?

(I could be wrong too, of course)

--
Best regards,
Christian Iversen
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Daniël Mantione wrote:
>>> Instead, unaligned will simulate an unaligned load with two loads and some rotation etc. On the ARM, where every mnemonic can rotate operands, this isn't that bad of a penalty. Therefore, I wouldn't be surprised if even on ARM, arrays with packed structures are faster than arrays with unpacked structures.
>> That's possible. Why would it be faster, btw? Better cache coherency?
> Like I mentioned, unlike modern x86 processors, ARM processors cannot detect an array traversal and preload the array into the cache. If the array is not in cache, you get cache miss after cache miss. A cache miss is very expensive given the latencies of modern memory. A smaller array results in fewer cache misses.

I ran my benchmark on an ARM mobile device and got the following results:

  2080 ms - non-packed
  4450 ms - packed

It clearly shows that unaligned access kills performance on ARM...

Yury.
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Daniël Mantione wrote:
> On Fri, 29 Feb 2008, Christian Iversen wrote:
>> Daniël Mantione wrote:
>>> On Fri, 29 Feb 2008, Christian Iversen wrote:
>>>>> Instead, unaligned will simulate an unaligned load with two loads and some rotation etc. On the ARM, where every mnemonic can rotate operands, this isn't that bad of a penalty. Therefore, I wouldn't be surprised if even on ARM, arrays with packed structures are faster than arrays with unpacked structures.
>>>> That's possible. Why would it be faster, btw? Better cache coherency?
>>> Like I mentioned, unlike modern x86 processors, ARM processors cannot detect an array traversal and preload the array into the cache. If the array is not in cache, you get cache miss after cache miss.
>> Unlike modern x86 processors? Granted, I haven't timed it, but most processors since early P4 models are supposed to have streaming access detection, which is a fancy way of saying array detection. Are you sure your information is current?
> Please read again. I said modern x86 processors have it, ARM processors don't.

Sorry, it's just really not my day today.. ;-)

I'll now go join another discussion without reading it properly, in turn annoying even more people.. :-)

--
Best regards,
Christian Iversen
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Vinzent Hoefler wrote:
> Are enumeration types 1 or 4 bytes in Delphi? If they are one byte, it looks quite different (and I'm not sure about all the types used here, some seem to be sets, some enumerations).
>
> But at first glance it seems they used packed records both to ensure minimum size or a known record layout (maybe they even used the structure in some assembly module?), and also aligned them manually to avoid unaligned access issues.

Yes. VirtualTreeView/Delphi uses asm instructions.

The size of sets differs between Delphi and fpc, which makes the record structure different. This is the record structure (sizes and offsets) in both compilers:

fpc:
  Index       Size: 4  Offset: 0
  ChildCount  Size: 4  Offset: 4
  NodeHeight  Size: 2  Offset: 8
  States      Size: 4  Offset: 10
  Align       Size: 1  Offset: 14
  CheckState  Size: 1  Offset: 15
  CheckType   Size: 1  Offset: 16
  Dummy       Size: 1  Offset: 17

Delphi:
  Index       Size: 4  Offset: 0
  ChildCount  Size: 4  Offset: 4
  NodeHeight  Size: 2  Offset: 8
  States      Size: 2  Offset: 10
  Align       Size: 1  Offset: 12
  CheckState  Size: 1  Offset: 13
  CheckType   Size: 1  Offset: 14
  Dummy       Size: 1  Offset: 15

Luiz
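The two offset tables follow directly from the field sizes once the record is packed, since each field then starts where the previous one ends. A minimal sketch that recomputes them from the sizes listed above (the helper name `packed_offsets` is made up for illustration):

```python
def packed_offsets(fields):
    """Offsets of (name, size) fields in a packed record: no padding at all."""
    offsets = {}
    offset = 0
    for name, size in fields:
        offsets[name] = offset
        offset += size
    return offsets, offset  # offset after the last field


names = ["Index", "ChildCount", "NodeHeight", "States",
         "Align", "CheckState", "CheckType", "Dummy"]
fpc_sizes = [4, 4, 2, 4, 1, 1, 1, 1]     # States (a set) takes 4 bytes
delphi_sizes = [4, 4, 2, 2, 1, 1, 1, 1]  # States (a set) takes 2 bytes

fpc, fpc_end = packed_offsets(list(zip(names, fpc_sizes)))
delphi, delphi_end = packed_offsets(list(zip(names, delphi_sizes)))

for name in names:
    print(f"{name:<11} fpc: {fpc[name]:>2}  Delphi: {delphi[name]:>2}")
print("next field starts at", fpc_end, "(fpc) and", delphi_end, "(Delphi)")
```

The only size that differs is the set field States, and every field after it shifts by two bytes, which is why the field following Dummy lands on a dword boundary in the Delphi layout but not in the fpc one.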
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Jonas Maebe wrote:
> On 29 Feb 2008, at 01:55, Luiz Americo Pereira Camara wrote:
>> One more question: The VirtualTreeView tries to make the fields of the (packed) record aligned at dword boundary by grouping together smaller (one or two byte) fields or adding dummy fields. Does this trick avoid the unaligned memory access?
> Not at this time.

Due to differences in set size, the layout is different between fpc and Delphi. Using packed records I save 4 bytes per record. Compiled under Delphi, the structure is as shown below.

The question is: using the layout below with packed (I can force the set size to be equal to Delphi's), do I still have unaligned memory access?

  Index       Size: 4  Offset: 0
  ChildCount  Size: 4  Offset: 4
  NodeHeight  Size: 2  Offset: 8
  States      Size: 2  Offset: 10
  Align       Size: 1  Offset: 12
  CheckState  Size: 1  Offset: 13
  CheckType   Size: 1  Offset: 14
  Dummy       Size: 1  Offset: 15

Luiz
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
On 01 Mar 2008, at 02:00, Luiz Americo Pereira Camara wrote:
> The question is: using the layout below with packed (I can force the set size to be equal to Delphi's), do I still have unaligned memory access?

As long as your record is declared as packed, all memory accesses are handled as if they are to unaligned memory locations.

Jonas
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
On Tue, 26 Feb 2008, Luiz Americo Pereira Camara wrote:
> Yury Sidorov wrote:
>> The patch removes packed record for some platforms. IMO packed can be removed for all platforms. It will gain some speed.
> I'd like to understand more this issue. Why are non packed records faster?

Cache thrashing. One of the most underestimated performance killers in modern software.

> The difference occurs at memory allocation or at memory access?

Memory access. What happens is that the non-packed version causes more cache misses. A cache miss costs many cycles on a modern CPU; a misaligned read just costs an extra memory access (which is fast if cached) on x86, and an extra load instruction on ARM. This is much cheaper than a cache miss.

Daniël
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
On Tuesday 26 February 2008 17:27, Luiz Americo Pereira Camara wrote:
> Yury Sidorov wrote:
>> The patch removes packed record for some platforms. IMO packed can be removed for all platforms. It will gain some speed.
> I'd like to understand more this issue. Why are non packed records faster? The difference occurs at memory allocation or at memory access?

At memory access. On x86 processors it's usually only a speed penalty (or has anyone ever seen the AC flag turned on?); on other processors you may even have to work around exceptions (i.e. bus errors), because the processor simply refuses to read or write unaligned data. And then the only way to circumvent the processor's refusal is to read/write the data byte by byte or mask it out, which is slower than just reading or writing it.

Consider writing a 16-bit value spanning across 32-bit values where the processor can only access a single 32-bit value at an aligned address:

*_ _ _ _*_ _ _ _
|0|1|2|3|4|5|6|7|
      |___|

Now the data you need is spanning across bytes [2:5], but the processor can only read a full 32 bits either at position 0 (reading bytes [0:3]) or at position 4 (reading bytes [4:7]). You'd need to read both processor words, mask the data in the lower and upper half of each, and write back both words with the new data patched in between them.

So by now, no matter if the processor handles it for you or if the compiler inserts the necessary code to do it, even a simple increment is insanely expensive in terms of processor cycles.

Vinzent.
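The read, mask, and write-back sequence described above can be sketched as follows. It is a model with assumptions: memory is a little-endian bytearray, the two `aligned_*` helpers are hypothetical stand-ins for accesses that only work at addresses divisible by four, and the example stores a value into bytes [2:5].

```python
def aligned_load32(mem, addr):
    assert addr % 4 == 0, "this model faults on misaligned addresses"
    return int.from_bytes(mem[addr:addr + 4], "little")


def aligned_store32(mem, addr, value):
    assert addr % 4 == 0, "this model faults on misaligned addresses"
    mem[addr:addr + 4] = value.to_bytes(4, "little")


def unaligned_store(mem, addr, value, nbytes):
    """Store an nbytes-wide value at any address using only aligned accesses."""
    base = addr & ~3
    shift = (addr & 3) * 8
    # Mask and data are positioned inside a 64-bit window of two aligned words.
    mask = ((1 << (nbytes * 8)) - 1) << shift
    bits = (value << shift) & mask
    for i in range(2):
        word_mask = (mask >> (32 * i)) & 0xFFFFFFFF
        if word_mask == 0:
            continue  # this word is not touched by the store
        word = aligned_load32(mem, base + 4 * i)            # read
        word &= ~word_mask & 0xFFFFFFFF                     # mask out old bytes
        word |= (bits >> (32 * i)) & 0xFFFFFFFF             # patch in new bytes
        aligned_store32(mem, base + 4 * i, word)            # write back


mem = bytearray(range(8))
unaligned_store(mem, 2, 0xAABBCCDD, 4)   # overwrite bytes 2..5
print(mem.hex(" "))
```

One logical store turns into two loads, several mask operations, and two stores, which is the overhead the message refers to.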
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Luiz Americo Pereira Camara wrote:
> Why are non packed records faster? The difference occurs at memory allocation or at memory access?

In addition to what the others said, think of it like your 32-bit processor suddenly being an 8-bit processor: it has to manually load 8 bits four times, arrange them into a 32-bit value, and only then use it. With non-packed, it can use the value directly.

Micha
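The "8-bit processor" picture corresponds to assembling the value from four separate byte loads. A minimal sketch, assuming little-endian byte order and using a made-up helper name:

```python
def load32_bytewise(mem, addr):
    """Build a 32-bit little-endian value from four single-byte loads."""
    value = 0
    for i in range(4):
        value |= mem[addr + i] << (8 * i)  # one byte load plus shift and OR
    return value


mem = bytes([0x11, 0x22, 0x33, 0x44, 0x55])
print(hex(load32_bytewise(mem, 1)))  # bytes 1..4, at a misaligned address
```

Byte loads are always aligned, so this works at any address, at the price of four loads and the combining steps instead of a single load.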
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
> On x86 processors it's usually only a speed penalty (or has anyone ever seen the AC flag turned on?); on other processors you may even have to work around exceptions (i.e. bus errors), because the processor simply refuses to read or write unaligned data.

It is not even guaranteed (or even common) that a misaligned access on a processor that can only do aligned memory accesses can be cured by an exception. That is why the compiler needs to create complex code for the potentially misaligned elements of a packed record. All C compilers do this and I am positive that FPC does it, too. So no problem here (beyond the additional cycles needed when working with packed records).

A real problem comes up if you manipulate a pointer to a (supposedly aligned) multi-byte variable to make it point to an odd address. This will make the program crash on certain processors (not PCs, not big 68Ks, but small 68Ks).

-Michael
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Michael Schnell wrote:
> If it accesses a misaligned 32 bit value it does two accesses (not 4): e.g. once 8 bit and once 24 bit (when reading each of the accesses is the same 32 bit, anyway).

Logically you should think about it how I explained. That Intel did an optimization to make the speed impact less is a different issue: internally the processor still has to have separate 8 bit data paths and do shifting to reorder the bytes.

Perhaps this behaviour is specified in their optimization documents, or maybe you have the VHDL source? :-)

> Transferring data from/to the 1st level cache imposes a lot more delay than the misaligned access. Thus if there are many instances of a record variable that are used for calculation, it might be much faster to use the packed version. If there are only a few, usually the unpacked version should be faster.

Show me the benchmark results ;-)

Micha
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
> internally the processor still has to have separate 8 bit data paths and do shifting to reorder the bytes.

This is a barrel shifter in the data path that is integrated in the queue and does not take an additional execution cycle.

-Michael
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
On Thu, 28 Feb 2008, Vinzent Hoefler wrote:
> On Thursday 28 February 2008 09:16, Daniël Mantione wrote:
>> Memory access. What happens is that the non-packed version causes more cache misses.
> Please elaborate. If the (unaligned) data is crossing a cache-line, thus causing two full cache-line reads, I'd understand that, but once it's in the cache, it wouldn't matter anymore?

Yes, but if you have an array of them (as we have in this case), considerably more of these records will fit in the cache. Therefore you will have considerably fewer cache misses.

This becomes even more serious when the processor in question does not have prefetching; in that case, traversing the array will cause cache miss after cache miss, and a smaller array will then have fewer of these misses.

Daniël
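A back-of-the-envelope count shows why a smaller element size means fewer misses during a traversal. The numbers are illustrative assumptions, not measurements: a 32-byte cache line, an array that starts on a line boundary, no prefetching (so each line touched costs one miss), and a hypothetical record that takes 5 bytes packed and 6 bytes when its 16-bit fields are aligned.

```python
def cache_lines_touched(count, record_size, line_size=32):
    """Cache lines covered by `count` consecutive records.

    Assumes the array starts on a cache-line boundary and is traversed once.
    """
    total_bytes = count * record_size
    return (total_bytes + line_size - 1) // line_size  # round up


records = 1000
packed_lines = cache_lines_touched(records, 5)
unpacked_lines = cache_lines_touched(records, 6)

print(packed_lines, unpacked_lines)
```

Under these assumptions the packed array causes roughly one sixth fewer line fills. Whether that saving outweighs the extra instructions for each unaligned field access depends on the processor and the workload, which is the open question in this discussion.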
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
On Thursday 28 February 2008 11:25, Daniël Mantione wrote:
> On Thu, 28 Feb 2008, Vinzent Hoefler wrote:
>> On Thursday 28 February 2008 09:16, Daniël Mantione wrote:
>>> Memory access. What happens is that the non-packed version causes more cache misses.

OMG. I'm so confused. ;) I read that the packed version causes more cache misses here. That was the part where I didn't understand why.

>> Please elaborate. If the (unaligned) data is crossing a cache-line, thus causing two full cache-line reads, I'd understand that, but once it's in the cache, it wouldn't matter anymore?
> Yes, but if you have an array of them (as we have in this case), considerably more of these records will fit in the cache.

Yes, that's what I figured, so I'm on the same path as you here, it seems, but tracing back the discussion it read:

-- 8< --
>> I'd like to understand more this issue. Why are non packed records faster?
> Cache thrashing. One of the most underestimated performance killers in modern software.
>> The difference occurs at memory allocation or at memory access?
> Memory access. What happens is that the non-packed version causes more cache misses.
-- 8< --

The first part tells me non-packed records are faster, but the second line tells me that the non-packed version also causes more cache misses, thus is slower. That got me confused, I think.

Of course, the net result only depends on the benchmark you're using. ;)

Vinzent.
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Daniël Mantione wrote:
>> On Thursday 28 February 2008 09:16, Daniël Mantione wrote:
>>> Memory access. What happens is that the non-packed version causes more cache misses.
>> Please elaborate. If the (unaligned) data is crossing a cache-line, thus causing two full cache-line reads, I'd understand that, but once it's in the cache, it wouldn't matter anymore?
> Yes, but if you have an array of them (as we have in this case), considerably more of these records will fit in the cache. Therefore you will have considerably fewer cache misses.
>
> This becomes even more serious when the processor in question does not have prefetching; in that case, traversing the array will cause cache miss after cache miss, and a smaller array will then have fewer of these misses.

You are right. An array of packed records is a bit more efficient than an array of non-packed records, at least on modern x86 CPUs. I ran some benchmarks and got, on a Core Duo:

  2070 ms - non-packed
  1910 ms - packed

But for CPUs which do not support misaligned data access, packed records are speed killers and should be used only as a last resort.

Also, if a record is not an element of a large array, it is better to declare it as non-packed for all CPUs.

Yury.
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
On Thu, 28 Feb 2008, Yury Sidorov wrote:
>> Yes, but if you have an array of them (as we have in this case), considerably more of these records will fit in the cache. Therefore you will have considerably fewer cache misses.
>>
>> This becomes even more serious when the processor in question does not have prefetching; in that case, traversing the array will cause cache miss after cache miss, and a smaller array will then have fewer of these misses.
> You are right. An array of packed records is a bit more efficient than an array of non-packed records, at least on modern x86 CPUs. I ran some benchmarks and got, on a Core Duo:
>
>   2070 ms - non-packed
>   1910 ms - packed
>
> But for CPUs which do not support misaligned data access, packed records are speed killers and should be used only as a last resort.

I'm not 100% sure about this. Your Core Duo has an array traversal detector which activates prefetching. An ARM does not have such logic and will suffer cache miss after cache miss.

However, it is certain that a manual unaligned load is more expensive on ARM than a hardware unaligned load on x86.

> Also, if a record is not an element of a large array, it is better to declare it as non-packed for all CPUs.

Yes.

Daniël
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
On Thu, 28 Feb 2008, Michael Schnell wrote:
>> An ARM does not have such logic and will suffer cache miss after cache miss.
> Nonetheless the count of word transfers between memory and the cache would be smaller with packed records, which might result in a lot faster execution (of course depending on the layout of the record, speed of the memory, speed of the processor, type of operations done with the records, ...)

That is exactly what I wanted to explain: even on ARM the lower number of cache misses might pay for the (higher) cost of an unaligned load.

Daniël
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Daniël Mantione wrote:
> On Tue, 26 Feb 2008, Luiz Americo Pereira Camara wrote:
>> Yury Sidorov wrote:
>>> The patch removes packed record for some platforms. IMO packed can be removed for all platforms. It will gain some speed.
>> I'd like to understand more this issue. Why are non packed records faster?
> Cache thrashing. One of the most underestimated performance killers in modern software.
>> The difference occurs at memory allocation or at memory access?
> Memory access. What happens is that the non-packed version causes more cache misses. A cache miss costs many cycles on a modern CPU; a misaligned read just costs an extra memory access (which is fast if cached) on x86, and an extra load instruction on ARM. This is much cheaper than a cache miss.

Thanks for all the explanations. I'm sure that the change is worth it.

One more question: The VirtualTreeView tries to make the fields of the (packed) record aligned at dword boundary by grouping together smaller (one or two byte) fields or adding dummy fields. Does this trick avoid the unaligned memory access?

The real beast:

  TVirtualNodePacked = packed record
    Index,                       // Offset 0
    ChildCount: Cardinal;        // Offset 4
    NodeHeight: Word;            // Offset 8
    States: TVirtualNodeStates;  // Offset 10 *
    Align: Byte;                 // Offset 14 **
    CheckState: TCheckState;     // Offset 15 **
    CheckType: TCheckType;       // Offset 16
    Dummy: Byte;                 // Offset 17
    TotalCount: Cardinal;        // Offset 18 *
    [...]

From what I understand, the fields marked with * cause an unaligned access because they are not on a dword boundary. Right?

Fields with ** are also not dword-boundary aligned, but since they are one-byte fields there is no unaligned access. Right?

And what about 64-bit systems? Should the fields be qword aligned, or is dword still sufficient?

Luiz
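The starred fields can be picked out mechanically by checking whether each offset is a multiple of the field size. This sketch uses the offsets from the record comments above; the field sizes (4 bytes for TVirtualNodeStates, 1 byte for TCheckState and TCheckType) are taken from figures reported elsewhere in this thread for the fpc build and should be treated as assumptions. It only tests natural alignment relative to the start of the record and says nothing about the code a compiler generates for a packed record.

```python
# (name, offset, size) as listed for the fpc layout in this thread.
fields = [
    ("Index", 0, 4),
    ("ChildCount", 4, 4),
    ("NodeHeight", 8, 2),
    ("States", 10, 4),
    ("Align", 14, 1),
    ("CheckState", 15, 1),
    ("CheckType", 16, 1),
    ("Dummy", 17, 1),
    ("TotalCount", 18, 4),
]


def misaligned_fields(fields):
    """Names of fields whose offset is not a multiple of their own size."""
    return [name for name, offset, size in fields if offset % size != 0]


print(misaligned_fields(fields))
```

One-byte fields can never fail this check, which fits the reasoning about the ** fields, while the two 4-byte fields at offsets 10 and 18 do fail it.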
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Luiz Americo Pereira Camara wrote:
> TVirtualNodePacked = packed record Index,//Offset 0 ChildCount: Cardinal; //Offset 4 NodeHeight: Word; //Offset 8 States: TVirtualNodeStates; //Offset 10 * Align: Byte; //Offset 14 ** CheckState: TCheckState; //Offset 15 ** CheckType: TCheckType; //Offset 16 Dummy: Byte; //Offset 17TotalCount: Cardinal; //Offset 18 * [...]

  TVirtualNodePacked = packed record
    Index,                       // Offset 0
    ChildCount: Cardinal;        // Offset 4
    NodeHeight: Word;            // Offset 8
    States: TVirtualNodeStates;  // Offset 10 *
    Align: Byte;                 // Offset 14 **
    CheckState: TCheckState;     // Offset 15 **
    CheckType: TCheckType;       // Offset 16
    Dummy: Byte;                 // Offset 17
    TotalCount: Cardinal;        // Offset 18 *
    [...]

The mail editor scrambled the record structure. I hope it is clearer this time.

Luiz
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Are enumeration types 1 or 4 bytes in Delphi? If they are one byte, it looks quite different (and I'm not sure about all the types used here, some seem to be sets, some enumerations).

But at first glance it seems they used packed records both to ensure minimum size or a known record layout (maybe they even used the structure in some assembly module?), and also aligned them manually to avoid unaligned access issues.

Vinzent.
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Vincent Snijders wrote:
> Instead of testing for arm cpu, you could use
> FPC_REQUIRES_PROPER_ALIGNMENT too. So it is fixed for sparc as well.

Yes, the changed patch is attached.

Regards, Bernd.

Index: packages/graph/src/inc/gtext.inc
===
--- packages/graph/src/inc/gtext.inc    (Revision 10376)
+++ packages/graph/src/inc/gtext.inc    (Arbeitskopie)
@@ -68,7 +68,12 @@
        { pStroke = ^TStroke;}
+
+{$ifdef FPC_REQUIRES_PROPER_ALIGNMENT}
+       TStroke = record { avoid misaligned data access }
+{$else}
        TStroke = packed record
+{$endif FPC_REQUIRES_PROPER_ALIGNMENT}
          opcode: byte;
          x: smallint;  { relative x offset character }
          y: smallint;  { relative y offset character }
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Bernd Mueller wrote:
> Hello,
>
> the attached patch avoids misaligned data access (bus errors) during
> font rendering (with the graph unit) on Arm-Linux devices.

Instead of testing for arm cpu, you could use
FPC_REQUIRES_PROPER_ALIGNMENT too. So it is fixed for sparc as well.

Vincent
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Daniël Mantione wrote:
>> Bernd Mueller wrote:
>>> Hello, the attached patch avoids misaligned data access (bus errors)
>>> during font rendering (with the graph unit) on Arm-Linux devices.
>> Instead of testing for arm cpu, you could use
>> FPC_REQUIRES_PROPER_ALIGNMENT too. So it is fixed for sparc as well.
> Well, packed records are usually used when speed is unimportant. If the
> code is speed critical, packed should not be used for any platform.
> Therefore I would like Bernd to consider the use of the 'unaligned'
> pseudo-function; ifdefs make code less readable.

The patch removes the packed record for some platforms. IMO packed can be
removed for all platforms. That will gain some speed.

Yury.
[fpc-devel] Patch, font rendering on Arm-Linux devices.
Hello,

the attached patch avoids misaligned data access (bus errors) during font
rendering (with the graph unit) on Arm-Linux devices.

Regards, Bernd.

Index: packages/graph/src/inc/gtext.inc
===
--- packages/graph/src/inc/gtext.inc    (Revision 10376)
+++ packages/graph/src/inc/gtext.inc    (Arbeitskopie)
@@ -68,7 +68,12 @@
        { pStroke = ^TStroke;}
+
+{$ifdef cpuarm}
+       TStroke = record { avoid misaligned data access }
+{$else}
        TStroke = packed record
+{$endif cpuarm}
          opcode: byte;
          x: smallint;  { relative x offset character }
          y: smallint;  { relative y offset character }
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
On Tue, 26 Feb 2008, Vincent Snijders wrote:
> Bernd Mueller wrote:
>> Hello, the attached patch avoids misaligned data access (bus errors)
>> during font rendering (with the graph unit) on Arm-Linux devices.
> Instead of testing for arm cpu, you could use
> FPC_REQUIRES_PROPER_ALIGNMENT too. So it is fixed for sparc as well.

Well, packed records are usually used when speed is unimportant. If the
code is speed critical, packed should not be used for any platform.
Therefore I would like Bernd to consider the use of the 'unaligned'
pseudo-function; ifdefs make code less readable.

Daniël
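For readers unfamiliar with the 'unaligned' pseudo-function mentioned above, here is a minimal sketch of how it is typically applied to a pointer dereference. It is not the code from gtext.inc: the buffer, the pointer type PSmallIntAt and the test value are made up for illustration, and the buffer merely stands in for raw font data in which a smallint starts at byte offset 1, as the x field of the packed TStroke does.

```pascal
program UnalignedRead;
{$mode objfpc}

type
  PSmallIntAt = ^SmallInt;  { hypothetical local pointer type }

var
  Buf: array[0..4] of Byte;  { stands in for one 5-byte stroke of raw data }
  Expected, Got: SmallInt;
  P: PSmallIntAt;
begin
  FillChar(Buf, SizeOf(Buf), 0);
  Expected := -1234;

  { Store the value at offset 1, where a packed x field would live. }
  Move(Expected, Buf[1], SizeOf(Expected));

  P := PSmallIntAt(@Buf[1]);

  { unaligned() marks the location as possibly misaligned, so the compiler
    must not assume an aligned 16-bit load is safe here. }
  Got := unaligned(P^);

  if Got = Expected then
    WriteLn('ok')
  else
    WriteLn('mismatch');
end.
```

The appeal of this approach in the thread's context is that the record declaration stays packed, so the file layout is unchanged on every platform, and only the access sites are annotated.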
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Daniël Mantione wrote:
> On Tue, 26 Feb 2008, Vincent Snijders wrote:
>> Bernd Mueller wrote:
>>> Hello, the attached patch avoids misaligned data access (bus errors)
>>> during font rendering (with the graph unit) on Arm-Linux devices.
>> Instead of testing for arm cpu, you could use
>> FPC_REQUIRES_PROPER_ALIGNMENT too. So it is fixed for sparc as well.
> Well, packed records are usually used when speed is unimportant. If the
> [...]

Isn't this used to read a font file?
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
On Tue, 26 Feb 2008, Florian Klaempfl wrote:
> Daniël Mantione wrote:
>> On Tue, 26 Feb 2008, Vincent Snijders wrote:
>>> Bernd Mueller wrote:
>>>> Hello, the attached patch avoids misaligned data access (bus
>>>> errors) during font rendering (with the graph unit) on Arm-Linux
>>>> devices.
>>> Instead of testing for arm cpu, you could use
>>> FPC_REQUIRES_PROPER_ALIGNMENT too. So it is fixed for sparc as well.
>> Well, packed records are usually used when speed is unimportant. If
>> the [...]
> Isn't this used to read a font file?

You are right. Therefore, the unaligned pseudo-function is the proper
solution.

Daniël
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Daniël Mantione wrote:
> On Tue, 26 Feb 2008, Florian Klaempfl wrote:
>> Daniël Mantione wrote:
>>> On Tue, 26 Feb 2008, Vincent Snijders wrote:
>>>> Bernd Mueller wrote:
>>>>> Hello, the attached patch avoids misaligned data access (bus
>>>>> errors) during font rendering (with the graph unit) on Arm-Linux
>>>>> devices.
>>>> Instead of testing for arm cpu, you could use
>>>> FPC_REQUIRES_PROPER_ALIGNMENT too. So it is fixed for sparc as
>>>> well.
>>> Well, packed records are usually used when speed is unimportant. If
>>> the [...]
>> Isn't this used to read a font file?
> You are right. Therefore, the unaligned pseudo-function is the proper
> solution.

The main affected routines are unpack and decode. Both routines are
called for every single character (only for a stroked font) via
OutTextXYDefault. So speed is not unimportant ;-)

Regards, Bernd.
Re: [fpc-devel] Patch, font rendering on Arm-Linux devices.
Bernd Mueller wrote:
> The main affected routines are unpack and decode. Both routines are
> called for every single character (only for a stroked font) via
> OutTextXYDefault. So speed is not unimportant ;-)

Perhaps you can separate I/O and processing? Read into an unpacked
structure and process from there?

Micha
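Micha's suggestion can be sketched roughly as follows. This is an illustration under stated assumptions, not the graph unit's code: the type names TStrokeOnDisk and TStrokeInMem and the routine ExpandStrokes are hypothetical, only the three fields visible in the patch context are used, and the raw data is filled in by hand where real code would read it from the font file.

```pascal
program SplitIoAndProcessing;
{$mode objfpc}

type
  { File layout: the packed form shown in the patch, 5 bytes per stroke. }
  TStrokeOnDisk = packed record
    opcode: byte;
    x: smallint;
    y: smallint;
  end;

  { In-memory layout: same fields, alignment left to the compiler. }
  TStrokeInMem = record
    opcode: byte;
    x: smallint;
    y: smallint;
  end;

{ Copies field by field, so packed fields are read once per stroke at load
  time and not on every drawing call. Dst must be at least as long as Src. }
procedure ExpandStrokes(const Src: array of TStrokeOnDisk;
                        var Dst: array of TStrokeInMem);
var
  I: Integer;
begin
  for I := 0 to High(Src) do
  begin
    Dst[I].opcode := Src[I].opcode;
    Dst[I].x := Src[I].x;
    Dst[I].y := Src[I].y;
  end;
end;

var
  Raw: array[0..1] of TStrokeOnDisk;
  Strokes: array[0..1] of TStrokeInMem;
  I: Integer;
begin
  { In real code Raw would be filled by a block read from the font file. }
  Raw[0].opcode := 1; Raw[0].x := 10; Raw[0].y := -3;
  Raw[1].opcode := 2; Raw[1].x := -7; Raw[1].y := 25;

  ExpandStrokes(Raw, Strokes);

  for I := Low(Strokes) to High(Strokes) do
    WriteLn(Strokes[I].opcode, ' ', Strokes[I].x, ' ', Strokes[I].y);
  WriteLn('on disk: ', SizeOf(TStrokeOnDisk), ' bytes, in memory: ',
          SizeOf(TStrokeInMem), ' bytes');
end.
```

The trade-off is the one debated earlier in the thread: the in-memory copy avoids misaligned accesses in the per-character routines, at the price of a conversion pass and a possibly larger memory footprint, so whether it is faster on a given target would have to be measured.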