RE: [PATCH v4 0/7] lib/lzo: performance improvements
Markus,

> Request 2 - add COPY16; *NOT* acked by me
>
> [PATCH 2/8] lib/lzo: clean-up by introducing COPY16
>
> is still not correct because of possible overlapping copies. I'll
> address this on the weekend.

Can you give a syndrome as to why

    { COPY8(op, ip); COPY8(op+8, ip+8); ip += 16; op += 16; }

or

    { COPY8(op, ip); ip += 8; op += 8; COPY8(op, ip); ip += 8; op += 8; }

vs.

    #define COPY16(dst,src) COPY8(dst,src); COPY8(dst+8,src+8)

    { COPY16(op, ip); ip += 16; op += 16; }

.. causes "overlapping copies"?

COPY8 was only ever used in pairs as above, and the second method broke compiler optimizers since it adds an artificial barrier between the two groups. The only difference was that decompress and compress had the pointer increments spread out. If we need to fix that, then that's a good reason, but your reasoning continues to elude me.

I can refactor the patch to align the second method with the first, so that compress and decompress get the same codegen - which is functionally identical to the COPY16 patch, but in your opinion that would seem to be the whole problem..

I'll see what you've got after the weekend ;D

Ta,
Matt Sealey
RE: [PATCH 2/6] lib/lzo: enable 64-bit CTZ on Arm
Hi Christoph,

> > #define LZO_USE_CTZ32 1
> > #elif defined(__i386__) || defined(__powerpc__)
> > #define LZO_USE_CTZ32 1
> > -#elif defined(__arm__) && (__LINUX_ARM_ARCH__ >= 5)
> > +#elif defined(__arm__)
> > +#if (__LINUX_ARM_ARCH__ >= 5)
> > #define LZO_USE_CTZ32 1
> > #endif
> > +#if (__LINUX_ARM_ARCH__ >= 6) && (CONFIG_THUMB2_KERNEL)
> > +#define LZO_USE_CTZ64 1
>
> All of this really needs to be driven by Kconfig symbols that
> the architecture selects instead of magic arch ifdefs here.

TL;DR summary: low-hanging fruit.

TL explanation of the thought process:

Rather than using arch ifdefs, it seems we can do __builtin_ctzll() and __builtin_clzll() on all architectures with no measurable impact on codegen - libgcc implements them as __builtin_ctz/clz everywhere I've looked, so gcc at least is going to use that code for the builtins. On 32-bit Arm it introduces a tiny bit of extra register pressure in the instruction sequence, but it's not measurable.

That is to say, the difference between architectures or architecture versions isn't the availability of a count-leading-zeros optimization, or the builtin, or an efficient algorithm, but the code preceding it that collects a 64-bit quantity to count. That is a whole other cleanup.

Longer:

Just to further the point, though: if we rationalize the use of builtins to use the 64-bit versions, then we may as well use the kernel version -- __ffs64 is __builtin_clzll in asm-generic with the right magic arch Kconfig, and if not, arches provide exactly the same code as LZO's fallback when LZO_USE_CxZnn isn't defined.

Even the distinction between little and big endian is moot, since __builtin_clzll(cpu_to_be32(n)) amounts to precisely the same disassembly on both arm and arm64 as __builtin_ctzll(n), so we can just go the whole hog, replace the 32-bit loads and xor with 64-bit ones, and do:

    __ffs64(cpu_to_be32(n))

which greatly reduces code complexity in the main LZO code by removing the distinction.
It turns out the compiler knows what it's doing, and where it doesn't, the arch maintainers do. Therein lies the rub: the source gets smaller and we lose the arch/endian distinction, but the efficiency of the preceding 64-bit load and exclusive-or needs investigation. We still have to figure out, per architecture, what the most efficient element size is, or whether there's a hideous xor((u64) n) implementation lurking somewhere.

I simply don't have confidence in anything except that the magic arch ifdefs work to improve performance without changing too much, and the mess they introduce here is small next to the performance improvement they provide. Versus the true scope of the cleanup needed to do it right, this is a very concise, reliable, within-style improvement. Would we agree to let it in for now, on the basis that we've thought about the alternatives but I side with MFXJO's urgency?

Ta!
Matt Sealey
Arm Partner Enablement Group
RE: [PATCH 0/6] lib/lzo: performance improvements
Hi Markus,

> Hi Dave,
>
> thanks for your patch set. Just some initial comments:
>
> I think the three patches
>
> [PATCH 2/6] lib/lzo: enable 64-bit CTZ on Arm
> [PATCH 3/6] lib/lzo: 64-bit CTZ on Arm aarch64
> [PATCH 4/6] lib/lzo: fast 8-byte copy on arm64
>
> should be applied in any case - could you please make an extra
> pull request out of these and try to get them merged as fast
> as possible. Thanks.
>
> The first patch
>
> [PATCH 1/6] lib/lzo: clean-up by introducing COPY16
>
> does not really affect the resulting code at the moment, but please
> note that in one case the actual copy unit is not allowed to
> be greater 8 bytes (which might be implied by the name "COPY16").
> So this needs more work like an extra COPY16_BY_8() macro.

I agree in principle but not in practice ;)

The original intent of those patches was to avail ourselves of known architectural features, but also to give the compiler a helping hand where LZO isn't supposed to be that knowledgeable about the underlying processor implementation. The end result, I think, is that giving an architecture a discrete 16-byte movement is more beneficial than dictating how things are copied underneath it.

What came out is that there is a difference in codegen between compress and decompress literally because of a difference in coding style:

decompress:

    COPY8(op, ip);
    op += 8;
    ip += 8;
    COPY8(op, ip);
    op += 8;
    ip += 8;

compress:

    COPY8(op, ip);
    COPY8(op+8, ip+8);
    ip += 16;
    op += 16;

Compilers do very well with the latter, and terribly with the former. Once I aligned them to the same sequence, I noticed that the only time LZO ever does 8-byte copies is in 16-byte chunks, and this matched my expectation that on arm64 LDP/STP with writeback gets it done in two instructions. I want to go to there!

COPY16 defined as COPY8,COPY8 is low-hanging fruit that improves codegen on arm64 slightly (because it cleans the decompression code up) and performance even more slightly as a result.
Just for elucidation, it goes from this (decompress):

    add x0, x0, 16    // op, op,
    ldr w2, [x1]      //, MEM[base: ii_17, offset: 0B]
    add x1, x1, 16    // ii, ii,
    cmp x0, x3        // op, _490
    str w2, [x0, -16] // _89, MEM[base: op_8, offset: 0B]
    ldr w2, [x1, -12] //, MEM[base: ii_17, offset: 4B]
    str w2, [x0, -12] // _105, MEM[base: op_8, offset: 4B]
    ldr w2, [x1, -8]  //, MEM[base: ii_17, offset: 8B]
    str w2, [x0, -8]  // _145, MEM[base: op_8, offset: 8B]
    ldr w2, [x1, -4]  //, MEM[base: ii_17, offset: 12B]
    str w2, [x0, -4]  // _156, MEM[base: op_8, offset: 12B]

to this (same code as compress):

    add x0, x0, 16    // op, op,
    ldr x2, [x1]      // _153, MEM[base: ii_17, offset: 0B]
    add x1, x1, 16    // ii, ii,
    cmp x0, x3        // op, _418
    str x2, [x0, -16] // _153, MEM[base: op_8, offset: 0B]
    ldr x2, [x1, -8]  // _181, MEM[base: ii_17, offset: 8B]
    str x2, [x0, -8]  // _181, MEM[base: op_8, offset: 8B]

But it's going to want to be this way:

    ldp x2, x3, [x1], 16 // _182, MEM[base: ii_17, offset: 0B]
    stp x2, x3, [x0], 16 // _182, MEM[base: op_8, offset: 0B]

which is a COPY16_BY_8, but only because of the architecture rules on element sizes.. COPY16_BY_8 doesn't fit the idiom we're trying to create.

For arm64, the easiest way to get LDP/STP with writeback just happens to be letting the compiler lift memcpy(o,i,16), which is always going to be alignment-safe - but this doesn't hold true on arm32 or elsewhere, and we've not done enough testing on other arches for that concept to bear fruit yet.

I'm confused by the "in one case the actual copy unit is not allowed to be greater than 8 bytes" statement: this is not evidenced by the loop structure. COPY8 is only ever used in pairs in the kernel - can you point out where this might not be the case in the future (and why they couldn't still use COPY8 in that case)?

Ta!
Matt Sealey
Arm Partner Enablement Group
Re: framebuffer corruption due to overlapping stp instructions on arm64
On 3 August 2018 at 13:25, Mikulas Patocka wrote:
>
> On Fri, 3 Aug 2018, Ard Biesheuvel wrote:
>
>> Are we still talking about overlapping unaligned accesses here? Or do
>> you see other failures as well?
>
> Yes - it is caused by overlapping unaligned accesses inside memcpy. When I
> put "dmb sy" between the overlapping accesses in
> glibc/sysdeps/aarch64/memcpy.S, this program doesn't detect any memory
> corruption.

It is a symptom of generating reorderable accesses inside memcpy. It has nothing to do with alignment per se (see below). A "dmb sy" just hides the symptoms.

What we're talking about here - yes, Ard, within certain amounts of reason - is that you cannot use PCI BAR memory as 'Normal' - certainly never cacheable memory, but Normal NC isn't good either. That is, your CPU cannot post writes or reads towards PCI memory spaces unless it is dealing with them as Device memory, or under very strictly controlled use of Normal Non-Cacheable. I understand why the rest of the world likes to mark stuff as 'writecombine', but that's an x86-ism, not an Arm memory type.

There is potential for accesses to the same slave from different masters (or just different AXI IDs - most cores rotate over 8 or 16 or so of them for Normal memory) to be reordered. PCIe has no idea what the source was; it will just accept transactions in the order it receives them, and it is also strictly defined to manage incoming AXI or ACE transactions (and barriers..) in a way that does not violate the PCIe memory model - the worst case is deadlocks, the best case is that you see some very strange behavior.

In any case, the original ordering of two Normal-NC transactions may not make it to the PCIe bridge in the first place, which is probably why a DMB resolves it - it forces the core to issue them in order, and unless there is some hyper-complex multi-pathing going on, they'll likely stay ordered. If you MUST preserve the order between two Normal memory accesses, a barrier is required.
The same is true of any reorderable device access.

>> > I tried to run it on system RAM mapped with the NC attribute and I didn't
>> > get any corruption - that suggests the bug may be in the PCIE
>> > subsystem.

Pure fluke. I'll give a simple explanation.

The Arm Architecture defines single-copy and multi-copy atomic transactions. You can treat 'single-copy' to mean that the transaction cannot be made partial or reordered within itself, i.e. it must modify memory (if it is a store) in a single swift effort, and any future reads from that memory must return the FULL result of that write. Multi-copy means it can be resized and reordered a bit.

Will Deacon is going to crucify me for simplifying it, but.. let's proceed with a poor example:

    STR X0, [X1]

on a 32-bit bus cannot ever be single-copy atomic, because you cannot write 64 bits of data on a 32-bit bus in a single, unbreakable transaction. From one bus cycle to the next, one half of the transaction will be in a different place: your interconnect will have latched and buffered 32 bits while the CPU is holding the other.

    STP X0, X1, [X2]

on a 64-bit bus can be single-copy atomic with respect to the element size, but it is on the whole multi-copy atomic - that is to say, it can provide a single transaction with multiple elements, and those elements could be messed with on the way down the pipe. On a 128-bit bus, you might expect it to be single-copy atomic because the entire transaction fits in one data beat, but *it is most definitely not* according to the architecture. The data from X0 and X1 may be required to end up at *X2 and *(X2+8), but the architecture doesn't care which one is written first. Neither does AMBA. STP is only ever guaranteed to be single-copy atomic with regard to the element size (which is the X register in question).
If you swap the data around and do

    STP X1, X0, [X2]

you may see a different result, depending on how the processor decides to pull data from the register file and in what order. Users of the old 32-bit ARM STM instruction will recall that it writes the register list in incrementing order, lowest register number to lowest address - so what is the rule for STP? Do you expect X0 to be emitted on the bus first, or the data at *X2 to be stored first? It's neither!

That means you can do an STP on one processor and an LDR of one of the 64-bit words on another processor, and you may see:

a) none of the STP transaction
b) *X2 holds the value in X0, but *(X2+8) does not yet hold the value in X1
c) b, only reversed
d) what you expect

And this can change depending on the resizers, bridges, QoS and paths between a master interface and a slave interface. Although a truly single-copy atomic transaction going through a downsizer to a bus narrower than the transaction size is a broken system design, it may be allowable if the downsizer hazards addresses at the granularity of the larger bus.
Re: framebuffer corruption due to overlapping stp instructions on arm64
The easiest explanation for this would be that the memory isn't mapped correctly. You can't use PCIe memory spaces with anything other than Device-nGnRE or stricter mappings. That's just the difference between the AMBA and PCIe (posted/unposted) memory models. Normal memory (cacheable or uncacheable, which Linux tends to call "memory" and "writecombine" respectively) is not a good idea.

There are two options. One: make sure Links maps its framebuffer as Device memory - or the driver does, or both - and make sure that only aligned accesses happen (otherwise you'll just get a synchronous exception) and that there isn't a Normal memory alias.

Alternatively, tell the PCIe driver that the framebuffer is in system memory - you can map that however you like. There'll be a performance hit if you start to use GPU acceleration, but a significant performance boost from the PoV of the CPU. Only memory accessed from the PCIe master interface (i.e. reads and writes generated by the card itself - telling the GPU to pull from system memory, or other DMA) can be in Normal memory, and this allows PCIe to be cache coherent with the right interconnect.

The slave port on a PCIe root complex (i.e. CPU writes) can't be used with Normal, or otherwise reorderable, memory, and therefore your 2GB of graphics memory is going to be slow from the point of view of the CPU. To find the correct mapping, you'll need to know just how cache coherent the PCIe RC is...

Ta,
Matt

On Thu, Aug 2, 2018 at 14:31 Mikulas Patocka wrote:
> Hi
>
> I tried to use a PCIe graphics card on the MacchiatoBIN board and I hit a
> strange problem.
>
> When I use the links browser in graphics mode on the framebuffer, I get
> occasional pixel corruption. Links does memcpy, memset and 4-byte writes
> on the framebuffer - nothing else.
>
> I found out that the pixel corruption is caused by overlapping unaligned
> stp instructions inside memcpy.
> In order to avoid branching, the arm64
> memcpy implementation may write the same destination twice with different
> alignment. If I put "dmb sy" between the overlapping stp instructions, the
> pixel corruption goes away.
>
> This seems like a hardware bug. Is it a known errata? Do you have any
> workarounds for it?
>
> I tried AMD card (HD 6350) and NVidia (NVS 285) and both exhibit the same
> corruption. OpenGL doesn't work (it results in artifacts on the AMD card
> and lock-up on the NVidia card), but it's quite expected if even simple
> writing to the framebuffer doesn't work.
>
> Mikulas
RE: [RFC PATCH 6/8] dts: coresight: Clean up the device tree graph bindings
Hi Rob,

To take this in a somewhat different direction...

> > No, the above comment is about using "unit" (if it is a standard
> > property for specifying something specific to hardware) instead of
> > "coresight,hwid". I would prefer to stick to the DT graph bindings,
> > because:
>
> "unit" is not a standard property and I don't like it either.

Fair enough.

> If you have "in-ports" and "out-ports" nodes, then that gives you
> direction and you can use reg for the "hardware port".
>
> in-ports {
>     port@0 {
>         reg = <0>;
>         endpoint {...};
>     };
>     port@1 {
>         reg = <1>;
>         endpoint {...};
>     };
> };
> out-ports {
>     port@0 {
>         reg = <0>;
>         endpoint {...};
>     };
> };
>
> I'll need to check, but dtc may need an update to not warn about this.

If it's the requirement that the unit-address match the low #address-cells of reg, now being enforced by a validation tool, that is screwing us over here, we can't really change the tool; that's an IEEE 1275-1994 requirement we'd be hacking around.

Here's a thought - why don't we define the ports of each component in a separate graph? Describing the ATB topology as a bus with a graph structure, and having the components simply reference the port nodes, would make it much easier to design. Part of the problem is trying to describe an ATB bus via a component on an APB bus: you lose all the niceties of describing a bus with bus subcomponents.

Let's take the ATB bus outside.. by describing only the ingress and egress of a component, you have no idea where in the graph you are at the point of entry (you could start parsing at a funnel, which means travelling around looking for other units that match other compatibles). If the CoreSight subsystem could have a pre-made graph of the full topology, then the components could point at the parts of that topology they wrap. You could also do away with ridiculous non-programmable funnel and replicator nodes, since the trace graph would encapsulate that information without having to instantiate a stub driver for each of them.
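To make the separate-graph idea concrete, a hypothetical sketch - every node and property name below is invented for illustration and is not a proposed binding, just the shape of the split:

```dts
/* The ATB topology described once, as its own graph... */
atb {
	compatible = "arm,atb-graph";	/* invented compatible */

	etm0_out: port@0 {
		reg = <0>;
		endpoint { remote-endpoint = <&funnel_in0>; };
	};
	funnel_in0: port@1 {
		reg = <1>;
		endpoint { remote-endpoint = <&etm0_out>; };
	};
};

/* ...and the APB component merely pointing at the ports it owns. */
etm@22040000 {
	compatible = "arm,coresight-etm4x";
	reg = <0x22040000 0x1000>;
	atb-ports = <&etm0_out>;	/* invented property */
};
```

The point of the shape: the trace graph parses as one unit regardless of which component probes first, and a non-programmable funnel is just a node here, not a driver.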
That doesn't solve 'unit' or 'unit-address' or 'reg' or anything, but it would keep us in graph bindings and abstract the ATB topology from the components. Other fun stuff: describing bridge components and bus widths within the graph could give us the ability to describe bandwidth over each trace path.

In any case, when you have to decide what your funnel port priorities are, you'll have to define a new property for that, and it won't be 'reg'. Trying not to add anything to the graph bindings is one thing, but shoehorning it in either by doing that or by inventing wholly device-specific properties is just as bad.

Perhaps we can use the path@unit-address:arguments <-- parameters here to add extra information about the port. 'my-args' would pull it out in Forth; I don't know how FDT code deals with it (but there should be something, since we use it for console definition in /chosen). The arguments are entirely specific to the node in question, but not used in path matching. Why can't we decide on a graph-binding update that allows specifying something as simple as a demultiplexer output in a way the driver can understand and map to the hardware?

> If DT bindings can be reused for ACPI, that's fine

However, in my opinion, not the preferred way to do it..

> any DT bindings to be accepted simply because they match ACPI
> bindings.

I'm sure ACPI has the capability of describing things in a much classier way, but I have some fear that it's just OF bindings stuffed in a _DSD..

DT and ACPI have the responsibility of shouldering the burden of describing hardware. Any attempt to make some commonality to promote brevity of Linux code is totally out of scope.. You can always abstract whatever definition into whatever the Linux drivers need at the time. That's kind of the point. Locking in to 'how it's currently coded' is stifling when you realize the first attempt didn't actually do what was intended.
Ta, Matt
RE: [RFC PATCH 6/8] dts: coresight: Clean up the device tree graph bindings
> -----Original Message-----
> From: Mathieu Poirier
>
> > So, if the suggestion is to use an existing property "unit", I am fine
> > with it, if people agree to it.
>
> If we're going to have something sharply different than ACPI I prefer
> Rob's idea.

What are you trying to say about being sharply different than ACPI?

Ta, Matt
RE: [RFC PATCH 6/8] dts: coresight: Clean up the device tree graph bindings
Hi Suzuki,

> > Why not use "unit"?
> >
> > I believe we had this discussion years ago about numbering serial ports
> > and sdhci (i.e. how do you know it's UART0 or UART1 from just the address?
> > Some SoCs don't address sequentially *or* in a forward direction) - I
> > believe it's not exactly codified in ePAPR, nor am I sure where it may be
> > otherwise, but it exists.
>
> We have different situation here. We need to know *the port number* as
> understood by the hardware, so that we can enable *the specific* port for
> a given path.

For the purposes of abstraction, each port will have the property of being a node which is pointed to by other nodes and, in the case of a true ATB endpoint, of having no other nodes behind it. It doesn't matter what the HW numbers it as, as long as the driver can derive it from whatever you put in the DT. So for a funnel (which is ~8 ports muxed into one output):

f1p0: port { unit = <0>; endpoint = <>; };
f1p1: port { unit = <4>; endpoint = <>; };
f1out: port { endpoint = <>; };

"unit" here is specific to the driver's understanding of ports within its own cycle of the graph. For a replicator you can invert the logic - input ports don't need a unit, but the two outputs are filtered in CoreSight not by leg but by transiting ATB ID, in groups of 16 IDs. In that case maybe you would want to describe all 8 possible units on each leg with the first ID it would filter? Or just list tuples of filter IDs. Who cares, really, as long as the driver knows what it means. You don't need to namespace every property.

> As I mentioned above, we need the hardware numbers to enable the
> "specific" port.

Okay, and how is this not able to be prescribed in a binding for "arm,coresight-funnel" that says: "input ports are numbered from 0 to N where N is the maximum input port number.
This number is identified with the "unit" property, which directly corresponds to the bit position in the funnel Ctrl_Reg register, and to the bit position multiplied by 3 for each 3-bit priority in the funnel Priority_Ctrl_Reg, with N having a maximum of the defined register bitfield DEVID[PORTCOUNT], minus one, for that component."

Or a replicator: "output ports are numbered per the CoreSight ATB Replicator specification, unit corresponding to the IDFILTERn register controlling ID filters for that leg, with a maximum of the defined register bitfield DEVID[PORTNUM], minus one."

One could even clarify it with labels for readability ("label" is definitely a well-defined, if also completely arbitrary, property).

..

> static void funnel_enable_hw(struct funnel_drvdata *drvdata, int port)
> {
>     u32 functl;
>
>     CS_UNLOCK(drvdata->base);
>
>     functl = readl_relaxed(drvdata->base + FUNNEL_FUNCTL);
>     functl &= ~FUNNEL_HOLDTIME_MASK;
>     functl |= FUNNEL_HOLDTIME;
>     functl |= (1 << port);
>     writel_relaxed(functl, drvdata->base + FUNNEL_FUNCTL);
>     writel_relaxed(drvdata->priority, drvdata->base + FUNNEL_PRICTL);
>
>     CS_LOCK(drvdata->base);
> }
>
> No we don't need to parse it in both ways, up and down. Btw, the trace
> paths are not statically created. They are done at runtime, as configured
> by the user.

You do realize this isn't how the hardware works, correct? Trace paths are fixed; they may diverge with different configurations, but the full CoreSight topology (all funnels, replicators and intermediary components) is entirely unchangeable. The DT should provide the information to build a reference directed acyclic graph of the entire topology (or of the entire reasonably programmable topology where at all possible) - if a user wants to trace from ETM_0 then they only have particular paths to particular sinks; for instance, ETM_0 and ETF_0 may be on their own path, so you cannot just "configure as a user" a path from ETM_1 to ETF_0 since there isn't one.
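Since the DT yields a fixed, loop-free directed graph, ordering and validating it really is textbook. A minimal standalone sketch (the node numbering and names are hypothetical, purely for illustration - this is not the CoreSight driver's data structure):

```c
#include <assert.h>

/* Hypothetical fixed trace topology as a DAG:
 * 0 = ETM_0, 1 = ETM_1, 2 = FUNNEL, 3 = ETF_0. */
enum { NNODES = 4 };

/* adj[i][j] = 1 means an ATB link from node i to node j. */
static const int adj[NNODES][NNODES] = {
    [0] = { [2] = 1 },  /* ETM_0  -> FUNNEL */
    [1] = { [2] = 1 },  /* ETM_1  -> FUNNEL */
    [2] = { [3] = 1 },  /* FUNNEL -> ETF_0  */
};

/* Kahn's topological sort: emits nodes into order[] and returns how
 * many it emitted. Anything less than NNODES would mean a cycle,
 * which CoreSight forbids in ATB topology, so this doubles as a
 * one-shot validity check when the graph is first built. */
static int topo_sort(int order[NNODES])
{
    int indeg[NNODES] = { 0 }, queue[NNODES], qh = 0, qt = 0, n = 0;

    for (int i = 0; i < NNODES; i++)
        for (int j = 0; j < NNODES; j++)
            indeg[j] += adj[i][j];

    for (int i = 0; i < NNODES; i++)
        if (indeg[i] == 0)
            queue[qt++] = i;    /* sources (the ETMs) seed the queue */

    while (qh < qt) {
        int u = queue[qh++];
        order[n++] = u;
        for (int v = 0; v < NNODES; v++)
            if (adj[u][v] && --indeg[v] == 0)
                queue[qt++] = v;
    }
    return n;
}
```

Build the graph once at probe time, run this once, and every later path lookup is a walk over known-good data rather than a re-derivation.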
Walking said graphs, with the knowledge that CoreSight specifically disallows loopbacks in ATB topology, is a basic computer science problem - literally a matter of topological sorting. But let's build the graph once and traverse it - don't build the graph partially each time, or try to rebuild it to cross-check every time. The paths are wires in the design; let's not pretend to the user that there is any configurability in that, or try to encode that in the DT.

> Coming back to your suggestion of "unit", what does it imply ?

Whatever the driver likes. For uart and mmc, it was just a spurious number, but it could be applied as the end of, say, ttyS or mmcblkp3, or used in any other driver-specific manner. The number you put in is up to you, but the valid numbers would be in the binding for that particular device.

> Its too generic a term for something as concrete as a port number.

Is it? Why would you need a whole other property type to encode a u32 that describes an arbitrary number specific to that hardware device?

Ta, Matt
Re: [RFC PATCH 6/8] dts: coresight: Clean up the device tree graph bindings
Suzuki,

Why not use "unit"?

I believe we had this discussion years ago about numbering serial ports and sdhci (i.e. how do you know it's UART0 or UART1 from just the address? Some SoCs don't address sequentially *or* in a forward direction) - I believe it's not exactly codified in ePAPR, nor am I sure where it may be otherwise, but it exists.

I agree with Rob on the slave-mode nonsense; this is an SPI controller concept weirdly stuffed into a directed graph which implicitly tells you the data direction - it's a rooted tree (just like DT!). For the case of a funnel, each device supplying trace should end up in an input node - numbered with a unit - and all those nodes should point to the output node as endpoints. Describing the hardware as a black box is probably less of a good idea than showing that it's a funnel or replicator by showing the internal paths. You wouldn't need to "number" ports with a unit except where the HW needs to differentiate between them, and you don't need reg or a node address to do it.

If you really need to parse full graphs in both directions (find root, find leaf) then could we simply introduce properties which list the phandles of all uplink sources, the way linked lists point to the list head? This gives a way to validate that the graph starts and ends the way we expect, and also allows every port to be associated with being a required path between any two devices without parsing the *whole* graph (although you still need to do that to find the route to sinks).

Ta, Matt

Sent from my iPhone

> On Jun 13, 2018, at 04:45, Suzuki K Poulose wrote:
>
> Hi Rob,
>
>> On 12/06/18 21:48, Rob Herring wrote:
>>> On Fri, Jun 01, 2018 at 02:16:05PM +0100, Suzuki K Poulose wrote:
>>> The coresight drivers relied on default bindings for graph
>>> in DT, while reusing the "reg" field of the "ports" to indicate
>>> the actual hardware port number for the connections.
>>> However, with the rules getting stricter w.r.t to the address mismatch
>>> with the label, it is no longer possible to use the port address
>>> field for the hardware port number. Hence, we add an explicit
>>> property to denote the hardware port number, "coresight,hwid"
>>> which must be specified for each "endpoint".
>>>
>>> Cc: Mathieu Poirier
>>> Cc: Sudeep Holla
>>> Cc: Rob Herring
>>> Signed-off-by: Suzuki K Poulose
>>> ---
>>>  .../devicetree/bindings/arm/coresight.txt  | 26 +---
>>>  drivers/hwtracing/coresight/of_coresight.c | 46 --
>>>  2 files changed, 54 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/Documentation/devicetree/bindings/arm/coresight.txt b/Documentation/devicetree/bindings/arm/coresight.txt
>>> index bd36e40..385581a 100644
>>> --- a/Documentation/devicetree/bindings/arm/coresight.txt
>>> +++ b/Documentation/devicetree/bindings/arm/coresight.txt
>>> @@ -104,7 +104,11 @@ properties to uniquely identify the connection details.
>>>  "slave-mode"
>>>  * Hardware Port number at the component:
>>> - - The hardware port number is assumed to be the address of the "port" component.
>>> + - (Obsolete) The hardware port number is assumed to be the address of the "port" component.
>>> + - Each "endpoint" must define the hardware port of the local end of the
>>> +   connection using the following property:
>>> +   "coresight,hwid" - 32bit integer, hardware port number at the local end.
>>
>> "coresight" is not a vendor and properties are in the form
>> [<vendor>,]<property>.
>
> OK. The issue here is that a coresight component could be an Arm IP or
> a custom partner IP. So, the vendor could be either arm or the partner id.
> However, this property is kind of a generic one for the Coresight family,
> which is why we opted for "coresight". What is the guideline for such
> cases ?
>
> Or in other words I see the following possible options :
>
> 1) coresight,hwid - coresight generic
> 2) arm,coresight-hwid - arm vendor; however the device could be from any vendor.
> 3) hwid - generic
> 4) none of the above, something completely different.
>
> What do you recommend from the above ?
>
>>> +
>>>  Example:
>>> @@ -120,6 +124,7 @@ Example:
>>>  etb_in_port: endpoint@0 {
>>
>> There shouldn't be a unit address here because there is no reg property.
>>
>>>  slave-mode;
>>>  remote-endpoint = <_out_port0>;
>>> +coresight,hwid = <0>;
>>
>> It doesn't make sense for these to be in the endpoint. If you had
>> multiple endpoints, then you would have to duplicate it. "ports" are
>> a single data stream. "endpoints" are connections to that stream. So if
>> you have a muxed (input) or fanout/1-to-many (output) connection, then
>> you have multiple endpoints.
>
> We do have many-to-1 inputs (e.g. funnels) and 1-to-many outputs
> (e.g. replicators). However, we have (so far) used only one endpoint per
> port.
>
> Also we could potentially have multiple data streams
RE: [PATCH] drivers/perf: arm-ccn: stop spamming dmesg in event_init
Hi all,

> > I don't have any dog in this, but maybe if providing information to the
> > users is so essential to having a pleasant user experience, then
> > rethinking the whole way these messages are funneled is necessary
> > because the kernel log + dmesg is by no means appropriate. Take a look
> > at what the networking maintainers recently did with netlink extended
> > ack. You used to just get an "unsupported" error, and now you know
> > exactly what is wrong when extack is available. It seems to me like
> > something like this is what you might want here since you want to have
> > perf be as user friendly as possible.
>
> Thanks, Florian.

Florian, I'd love to know if you mean "implement netlink extended ack in perf" or "do something idiomatic for perf"? Let us assume we are semantically challenged over here. I'm going to proceed from the latter.

> Acme & other perf people, do you foresee a problem adding netlink
> extended ack support to the perf subsystem for extended error message
> communication?
>
> If not, is struct perf_event_attr amenable to having error reporting
> bit(s) added?

I did have a think about this when Kim mentioned it in passing, and my reasoning was that a serialized error record similar to ACPI APEI BERT/ERST tables (without the NVRAM abstraction) would fit the need. As soon as I wrote down my thoughts I realized it scratches more itches than just something useful for perf.

It would conceptually be a buffer passed from userspace with the syscall, which would be populated in the kernel with an identifying header, each record denoting its own length.
The syscall fills the buffer with records of a particular format for that syscall, and the errno returned to the application is then the state of the error recording buffer - for instance, whether the kernel ran out of space in the buffer before reporting all the errors it could (-ENOBUFS or -EAGAIN), or whether it stopped recording errors on something that requires being contained and returned at that point (-EFAULT), and so on (anyone who's got RAS on the brain will see where I'm coming from).

I have a distinct dislike of filling the kernel with const strings, so the records would be strictly machine-readable and contain information about the error for the record and the source. If userspace needs to print a string then it can look up unique identifiers (UUIDv1, give or take, to remove the need for any authority on numbering) in a database - be that plaintext, gettext .po, bdb, sqlite, xml, json, or a C structure embedded in the tool - one for each error source. That keeps strings, translations and string formatting entirely outside the kernel, and keeps records from being freeform, typo-laden strings.

That'd give some generic Producer code in the kernel, and imply a companion Consumer library in userspace (with said database backend), which could also be responsible for logging the binary records somewhere for future reference (perhaps bounded by capabilities or container privileges).

Pretty much every syscall that has problems returning 'just an errno' could benefit from such a system. The only impediment I can see is that it adds a new subsystem to the kernel to produce these records, and any syscall that needs it would have to gain either an extra parameter, an attribute-setting addition (like setsockopt), or a way to shoehorn the buffer pointer into an existing parameter structure like perf_event_attr.
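To sketch what "header plus self-sized records" might look like in practice: everything below is invented for illustration (the structures, the signature, and err_buf_append() exist nowhere in the kernel) - it just shows the BERT/ERST-style layout the proposal describes, with machine-readable codes and a UUID lookup key instead of strings. The caller is assumed to hand in a zeroed buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-buffer layout: a header identifying a populated
 * error buffer, followed by variable-length records, each carrying
 * its own length and a UUID identifying the error source (the key
 * userspace resolves against its own message database). */
struct err_buf_hdr {
    uint32_t signature;      /* marks the buffer as initialized */
    uint32_t total_len;      /* bytes used, header included */
    uint32_t record_count;
};

struct err_record {
    uint16_t length;         /* length of this record, payload included */
    uint8_t  source_id[16];  /* UUID of the error source */
    uint32_t code;           /* machine-readable error code, no strings */
    /* followed by (length - sizeof(struct err_record)) payload bytes */
};

#define ERR_BUF_SIG 0x45525242u /* "ERRB", made up */

/* Append one record; returns 0 on success, or -1 if the buffer is
 * full - the case the proposal maps onto -ENOBUFS/-EAGAIN. */
static int err_buf_append(uint8_t *buf, size_t cap,
                          const uint8_t uuid[16], uint32_t code,
                          const void *payload, uint16_t payload_len)
{
    struct err_buf_hdr *hdr = (struct err_buf_hdr *)buf;
    uint16_t rec_len = (uint16_t)(sizeof(struct err_record) + payload_len);

    if (hdr->signature != ERR_BUF_SIG) {    /* first use: init header */
        hdr->signature = ERR_BUF_SIG;
        hdr->total_len = sizeof(*hdr);
        hdr->record_count = 0;
    }
    if (hdr->total_len + rec_len > cap)
        return -1;

    struct err_record rec = { .length = rec_len, .code = code };
    memcpy(rec.source_id, uuid, 16);
    memcpy(buf + hdr->total_len, &rec, sizeof(rec));
    memcpy(buf + hdr->total_len + sizeof(rec), payload, payload_len);
    hdr->total_len += rec_len;
    hdr->record_count++;
    return 0;
}
```

A consumer walks the buffer record by record using each record's length field, so new record types or payloads can be skipped by old tooling.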
That would have to be locked down to one consistent, recommended method (like adding a new syscall interface; not everyone has an attribute interface or parameter structure), although there's nothing stopping kernel devs from adding a way for legacy applications to easily receive the same information through the existing ABI. In the case of perf_event_attr there is space enough to mark a bit to say that errors could be reported in a buffer, but not enough in the existing reserved space to store a pointer on 64-bit systems (or 128-bit ones..) without increasing its size.

But, besides that, it would have the benefit of simply being serialized with the syscall, rather than a supplemental, potentially non-thread-safe errno/strerror-like kernel-side implementation, or extra syscalls to retrieve information or arbitrary formatted or unformatted strings. Netlink extended ack could benefit from it simply by being passed an nlattr pointing to the buffer and recording extended error information in it - the extended ack structure can report the status of the buffer and use the cookie field to reproduce some information if necessary (which extends the RAS error record concept further when needed).

Thoughts? Does anyone have any objections to a RAS-like error reporting system for system calls, or any ideas on things that would benefit from it above and beyond perf? We could always audit all the system calls and their behaviors so we could do some worked examples, but anyone who's got a good candidate outside perf is welcome to suggest it.

Ta,
RE: [PATCH] drivers/perf: arm-ccn: stop spamming dmesg in event_init
Hi all,

> > I don't have any dog in this, but maybe if providing information to the
> > users is so essential to having a pleasant user experience, then
> > rethinking the whole way these messages are funneled is necessary
> > because the kernel log + dmesg is by no means appropriate. Take a look
> > at what the networking maintainers recently did with netlink extended
> > ack. You used to just get an "unsupported" error, and now you know
> > exactly what is wrong when extack is available. It seems to me like
> > something like this is what you might want here since you want to have
> > perf be as user friendly as possible.
>
> Thanks, Florian.

Florian, I'd love to know if you mean "implement netlink extended ack in perf" or "do something idiomatic for perf"? Let us assume we are semantically challenged over here. I'm going to proceed from the latter.

> Acme & other perf people, do you foresee a problem adding netlink
> extended ack support to the perf subsystem for extended error message
> communication?
>
> If not, is struct perf_event_attr amenable to having error reporting
> bit(s) added?

I did have a think about this when Kim mentioned it in passing, and my reasoning was that a serialized error record similar to ACPI APEI BERT/ERST tables (without the NVRAM abstraction) would fit the need. As soon as I wrote down my thoughts I realized it scratched more itches than just something useful for perf. It would conceptually be a buffer passed from userspace with the syscall, which would be populated in the kernel with an identifying header, each record denoting its own length.
The syscall fills the buffer with records of a particular format for the syscall, and the errno returned to the application then reflects the state of the error recording buffer - for instance, if the kernel ran out of space in the buffer before reporting all the errors it could (-ENOBUFS or -EAGAIN), or if it stopped recording errors on something that requires being contained and returned at that point (-EFAULT), and so on (anyone who's got RAS on the brain will see where I'm coming from).

I have a distinct dislike of filling the kernel with const strings, so the records would be strictly machine-readable and contain information about the error for the record and the source. If userspace needs to print a string then it can look up unique identifiers (UUIDv1, give or take, to remove the need for any authority on numbering) in a database - be that plaintext, gettext .po, bdb, sqlite, xml, json, or a C structure embedded in the tool - one for each error source. That keeps strings, translations and string formatting entirely outside the kernel, and keeps records from being freeform typo-laden strings.

That'd give some generic Producer code in the kernel, and imply a companion Consumer library in userspace (with said database backend), which could also be responsible for logging the binary records somewhere for future reference (perhaps bounded by capabilities or container privileges). Pretty much every syscall that has problems returning 'just an errno' could benefit from such a system. The only impediment I can see is that it adds a new subsystem to the kernel to produce these records, and any syscall that needs it would have to gain either an extra parameter, an attribute-setting addition (like setsockopt), or to shoehorn the buffer pointer into an existing parameter structure like perf_event_attr.
That would have to be locked down to one consistent, recommended method (like adding a new syscall interface - not everyone has an attribute interface or parameter structure), although there's nothing stopping kernel devs adding a way for legacy applications to easily receive the same information through the existing ABI. In the case of perf_event_attr there is space enough to mark a bit to say that errors could be reported in a buffer, but not enough in the existing reserved space to store a pointer for 64-bit systems (or 128-bit ones..) without increasing its size.

But, besides that, it would have the benefit of simply being serialized with the syscall, and not a supplemental, potentially non-thread-safe errno/strerror-like kernel-side implementation, nor extra syscalls to retrieve information or arbitrary formatted or unformatted strings. Netlink extended ack could benefit from it simply by being passed an nlattr pointing to the buffer and recording extended error information in it - the extended ack structure can report the status of the buffer and use the cookie field to reproduce some information if necessary (which extends the RAS error record concept further when needed).

Thoughts? Does anyone have any objections to a RAS-like error reporting system for system calls, or any ideas on things that would benefit from it above and beyond perf? We could always audit all the system calls and their behaviors so we could do some worked examples, but anyone who's got a good candidate outside perf is welcome to suggest it. Ta,
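To make the shape of the proposal concrete, here's a rough sketch of what such a record buffer might look like on the wire - every structure and field name here is hypothetical, invented for illustration, and not an existing kernel ABI:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical wire format for the proposed syscall error-record
 * buffer: a fixed header describing the buffer state, followed by
 * variable-length records, each denoting its own length. */
struct err_rec_hdr {
	uint8_t  source_uuid[16]; /* identifies the error "source" database */
	uint32_t rec_len;         /* total length of this record, in bytes */
	uint32_t code;            /* machine-readable error identifier */
	/* variable-length payload follows */
};

struct err_buf_hdr {
	uint32_t buf_len;   /* size userspace provided */
	uint32_t used;      /* bytes consumed by records */
	uint32_t nrecords;  /* number of records emitted */
	uint32_t flags;     /* e.g. "truncated, retry with a bigger buffer" */
};

/* Consumer-side walk: count the records in a filled buffer,
 * stopping at the first malformed length. */
static int count_records(const unsigned char *buf)
{
	const struct err_buf_hdr *h = (const void *)buf;
	const unsigned char *p = buf + sizeof(*h);
	const unsigned char *end = p + h->used;
	int n = 0;

	while (p + sizeof(struct err_rec_hdr) <= end) {
		struct err_rec_hdr rec;
		memcpy(&rec, p, sizeof(rec)); /* records need not be aligned */
		if (rec.rec_len < sizeof(rec))
			break; /* malformed record */
		p += rec.rec_len;
		n++;
	}
	return n;
}
```

The userspace Consumer library would walk the records the same way, keying `source_uuid` into its string database to render human-readable messages.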
Re: [PATCH 1/2] dt-bindings: Documentation for qcom,llcc
Hiya,

On 25 January 2018 at 17:55, Channagoud Kadabi wrote:
> Documentation for last level cache controller device tree bindings,
> client bindings usage examples.

[snippety snip]

> +- llcc-bank-off:
> +	Usage: required
> +	Value Type:
> +	Definition: Offsets of llcc banks from llcc base address starting from
> +		LLCC bank0.
> +
> +- llcc-broadcast-off:
> +	Usage: required
> +	Value Type:
> +	Definition: Offset of broadcast register from LLCC bank0 address.

What's with the redundant namespacing? Have we not, as a community, realised that we do not need to namespace properties which are only present under a single binding or node, or even those that aren't? This mess started with the regulator bindings and it's never stopped. What are we at now, 25 years of device trees, 10 years of FDT on Arm?

Notwithstanding the complete waste of rodata in the kernel image for matching (& increased time to compare), why would "bank-offset" for your node be any different from a common property for any other node? And if you need to describe register offsets... why aren't you able to use the reg property?

Ta, Matt
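For what it's worth, the reg-based version being asked for might look something like this - purely an illustrative sketch, since the actual compatible string, addresses and sizes are Qualcomm's to define:

```dts
llcc: system-cache-controller@1100000 {
	compatible = "qcom,llcc";
	/* one reg entry per region instead of private
	 * "llcc-bank-off"/"llcc-broadcast-off" offset properties */
	reg = <0x1100000 0x200000>, <0x1300000 0x50000>;
	reg-names = "llcc-base", "llcc-broadcast-base";
};
```

The banks then fall out of `reg` (interpreted via the parent's #address-cells/#size-cells) with no new namespaced properties to match on.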
Re: [RFC PATCH 0/5] Add smp booting support for Qualcomm ARMv8 SoCs
Hi Rob,

On Thu, Apr 16, 2015 at 12:17 PM, Rob Clark wrote:
> On Thu, Apr 16, 2015 at 11:21 AM, Catalin Marinas wrote:
>
>> But I'm definitely going to discourage companies like Qualcomm
>> deliberately ignoring the existing booting protocols while trying to get
>> their code upstream. This patch series is posted by Qualcomm without
>> providing any technical reason on why they don't want to/couldn't use
>> PSCI (well, I guess there is no technical reason but they may not care
>> much about mainline either).
>
> Sure.. just trying to make sure the wrong people don't end up being
> the ones that suffer. I would assume/expect that it is at least
> possible for qcom to change firmware/bootloader for their dev boards
> and future devices and whatnot, whether they grumble about it or not.
> But I guess most of what the general public has are devices w/ signed
> fw, which is why "go fix your firmware" is an option that sets off
> alarm bells for me.
>
> I guess the first device where this will matter to me and a large
> group of community folks would be the dragonboard 410c.. *hopefully*
> it does not require signed firmware or at least qcom could make
> available signed firmware which supports psci..

For development boards, one would hope there is a way to sign your own firmware. You can't expect - even for a phone SoC - that the development boards require entering some kind of Faustian contract for development of low-level software. What if someone wants to develop a platform that doesn't require signing?

That said most of these dev boards have completely mangled JTAG anyway, and I know Inforce (and Hardkernel, and so on) love their barely-ever-updated custom firmware binaries, so.. The best thing would be to pick up one of those boards and port a PSCI firmware to it (ATF or your own..)
and just embarrass the SoC vendor by having better mainline power management support (implemented by 10 lines in a device tree) with the complicated code hidden away behind the scenes there, like it should have been done in the first place..

Ta.

Matt Sealey
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
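Those "10 lines in a device tree" are barely an exaggeration; with a PSCI 0.2 implementation resident in firmware, the kernel-facing description reduces to roughly this (illustrative - the conduit depends on which exception level the firmware actually traps in):

```dts
psci {
	compatible = "arm,psci-0.2";
	method = "smc";
};

cpus {
	cpu@0 {
		enable-method = "psci";
		/* ... */
	};
};
```

Everything else - CPU bring-up, hotplug, suspend entry points - lives behind the standard PSCI function IDs in the firmware, which is the whole point.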
Re: [PATCH v3 1/3] Documentation: arm: add UEFI support documentation
On Wed, Dec 4, 2013 at 4:44 PM, Matthew Garrett wrote:
> On Wed, Dec 04, 2013 at 03:06:47PM -0600, Matt Sealey wrote:
>
>> there's no guarantee that the kernel hasn't been decompressed over
>> some important UEFI feature or some memory hasn't been trashed. You
>> can't make that guarantee because by entering the plain zImage, you
>> forfeited that information.
>
> The stub is responsible for ensuring that the compressed kernel is
> loaded at a suitable address. Take a look at efi_relocate_kernel().

My objection is that the suitable address is based on a restriction that booting from UEFI doesn't have, and information UEFI provides makes the kernel features from head.S (both of them) easier to get around. The kernel doesn't need to be within a particular range of the start of memory, nor does the device tree or ramdisk require being in a particular place.

What the code before efi_relocate_kernel does is allocate a maximum-sized buffer to safely decompress in, which is just a gross way to do it, then crosses its fingers based on the way it has historically worked - while you might want to assume that the decompression process is quite well defined and reliable, I keep seeing patches come in that stop it from doing weird unsavory behavior - for example decompressing over its own page table.

The decompressor - and the kernel head it jumps to after decompression - *guess* all the information UEFI could have provided and completely regenerate the environment for the decompressor itself (stacks, hacky memory allocations, cache on, off, on, off, on... fudging locations of page tables, zreladdr fixup, low level debug message output, and in the context of UEFI - reimplementation of memcpy, memset). It forfeits a more controlled and lean boot process to capitulate to a historical legacy.

Since you're taking over the decompressor head.S anyway, why not take control of the decompression process? It sets up a page table location the hard way (as above..
also patched recently not to decompress over its own page table). It doesn't need to relocate itself past the end of the decompressed image. It doesn't need to set up the C environment - UEFI did that for it. It makes assumptions about the stack and hacks memory allocations for the decompression.. it turns the cache on, decompresses, then turns it off again... you can just walk through the code under the EFI stub in compressed/head.S and see all this can just fall away.

There's one immediate advantage too, if it's actually implemented and working, which is that for kernel images compressed using the standard UEFI compression method no actual decompression code needs to be added to the stub, and the stub gets the exact length of the required decompression buffer.. that doesn't reduce flexibility in kernel compression as long as there is still the possibility of adding additional compression code to the stub. The second immediate advantage is that the EFI stub/decompressor can actually verify that the *decompressed* image meets Secure Boot requirements.

Once you get past the decompressor and into the kernel proper head.S, creating the page tables (again) and turning the MMU on, pv table patching.. if you still had the information around, that gets simpler too.

Grant suggested I should propose some patches; sure, if I'm not otherwise busy. Maybe the Linaro guys can recommend a platform (real or emulated) that would be best to test it on with the available UEFI?

>> Most of the guessing is ideally not required to be a guess at all, the
>> restrictions are purely to deal with the lack of trust for the
>> bootloader environment. Why can't we trust UEFI? Or at least hold it
>> to a higher standard. If someone ships a broken UEFI, they screw a
>> feature or have a horrible bug and ship it, laud the fact Linux
>> doesn't boot on it and the fact that it's their fault - over their
>> head.
>> It actually works these days, Linux actually has "market share,"
>> companies really go out of their way to rescue their "image" and
>> resolve the situation when someone blogs about a serious UEFI bug on
>> their $1300 laptops, or even $300 tablets.
>
> Yeah, that hasn't actually worked out too well for us. Aside from

Teething problems caused by a rush to market ;)

For the "ARM server market" rather than the "get the cheapest tablet/ultrabook out of the door that runs Windows 8/RT" market, I am sure this is going to get to be VERY important for vendors to take into account. Imagine if Dell shipped a *server* where Linux would brick it out of the box just for setting a variable.. however, if it works the day they ship the server, and Linux gains better support for UEFI booting which breaks the server in question, that's our fault for not doing it in the right way in the first place, and Dell can be just as angry at us as we would be at them.

Vendors won't test code that doesn't exist for obvious reasons. This is what I was trying to get at about them not updating their firmware support for the more firmware
Re: [PATCH v3 1/3] Documentation: arm: add UEFI support documentation
On Mon, Dec 2, 2013 at 3:07 PM, Leif Lindholm wrote:
> On Mon, Dec 02, 2013 at 01:51:22PM -0600, Matt Sealey wrote:
>
>> Here's where I think this whole thing falls down as being the weirdest
>> possible implementation of this. It defies logic to put this
>> information in the device tree /chosen node while also attempting to
>> boot the kernel using an EFI stub; the stub is going to have this
>> information because it is going to have the pointer to the System
>> Table (since it was called by StartImage()). Why not stash the
>> System Table pointer somewhere safe in the stub?
>
> We do. In the DT.

Hang on... see way below about "reinventing the wheel"

>> The information in the device tree is all accessible from Boot
>> Services and as long as the System Table isn't being thrown away (my
>> suggestion would be.. stuff it in r2, and set r1 = "EFI\0" then work
>> with arch/arm/kernel/head{-common,}.S code to do the right thing)
>
> You left out the bit of redefining the kernel boot protocol to permit
> calling it with caches, MMU and interrupts enabled - also known as
> before ExitBootServices().

And that's a horrible idea because of what?
What's evident here is there could be two major ways to generate an image that boots from a UEFI implementation;

* one whereby UEFI is jostled or coerced by a second stage bootloader to load a plain zImage and you lose all information about UEFI except in the event that that information is preserved in the device tree by the firmware
* one whereby a 'stock' UEFI is used and it boots only on UEFI because it is in a format very reasonably only capable of being booted by UEFI

and, subordinately,

- one where that plain zImage got glued to an EFI stub just like the decompressor is glued to the Image
- one where the kernel needs to be built with support for UEFI and that somewhat changes the boot path

By the time we get half-way through arm/kernel/head.S the cache and MMU have been turned off and on and off again by the decompressor, and after a large amount of guesswork and arbitrary restriction-based implementation, there's no guarantee that the kernel hasn't been decompressed over some important UEFI feature or some memory hasn't been trashed. You can't make that guarantee because by entering the plain zImage, you forfeited that information. This is at worst case going to be lots of blank screens and blinking serial console prompts and little more than frustration..

Most of the guessing is ideally not required to be a guess at all; the restrictions are purely to deal with the lack of trust for the bootloader environment. Why can't we trust UEFI? Or at least hold it to a higher standard. If someone ships a broken UEFI, they screw a feature or have a horrible bug and ship it, laud the fact Linux doesn't boot on it and the fact that it's their fault - over their head. It actually works these days, Linux actually has "market share," companies really go out of their way to rescue their "image" and resolve the situation when someone blogs about a serious UEFI bug on their $1300 laptops, or even $300 tablets.
> Which is what we are going to implement anyway in order to permit
> firmware to supply DT hardware description in the same way as with
> ACPI. Yes, we could pass the system table pointer directly - but that
> doesn't get us the memory map.

Boot Services gives you the ability to get the memory map.. and the kinds of things that live in those spots in the memory map. It's at least a better guess than "I am located at a specific place and can infer from linker data and masking off the bottom bits that there's probably this amount of RAM that starts at this location or thereabouts". It at least gives the ability to 'allocate' memory to put the page table in, instead of having a firmware call walk all over it, or having the kernel walk over some parts of firmware, or even not have to do anything except link in a decompressor (eh, sure, it means duplicating decompressor code in some cases, but I also don't think it's a sane requirement to include the entire decompression suite in the kernel proper if it only gets used once at early boot).

> I prefer to see it as a way to not reinvent things that do not need
> reinventing, while not adding more special-case code to the kernel.

Isn't putting the System Table pointer in the DT specifically reinventing the UEFI boot process? Booting from UEFI is a special case in itself.. the EFI stub here is putting a round block in a square hole. There are two much, much better solutions: put the round block in a round hole. Put a square block in that square hole. We could do so much better than gluing the round block into the square hole.

>> What that meant is nobody bothered to implement working, re-entrant,
>> re-locatable firmware to a great degree. This ended up being a
>> self-fulfilling prophecy of "don't trust the bootloader" and "get rid
>> of it as soon as we can," which essentially meant Linux never took
>> advantage of the resources available.
Re: [PATCH v3 1/3] Documentation: arm: add UEFI support documentation
> +UEFI kernel support on ARM
> +==========================
> +UEFI kernel support on the ARM architectures (arm and arm64) is only available
> +when boot is performed through the stub.
> +
> +The stub populates the FDT /chosen node with (and the kernel scans for) the
> +following parameters:
> +
> +Name                      | Size   | Description
> +
> +linux,uefi-system-table   | 64-bit | Physical address of the UEFI System Table.
> +
> +linux,uefi-mmap-start     | 64-bit | Physical address of the UEFI memory map,
> +                          |        | populated by the UEFI GetMemoryMap() call.
> +
> +linux,uefi-mmap-size      | 32-bit | Size in bytes of the UEFI memory map
> +                          |        | pointed to in previous entry.
> +
> +linux,uefi-mmap-desc-size | 32-bit | Size in bytes of each entry in the UEFI
> +                          |        | memory map.
> +
> +linux,uefi-mmap-desc-ver  | 32-bit | Version of the mmap descriptor format.
> +
> +linux,uefi-stub-kern-ver  | string | Copy of linux_banner from build.
> +

This flies in the face of actually using #address-cells and #size-cells to define how big addresses and sizes are. You're limited here by the root node definitions... that's the spec.

Here's where I think this whole thing falls down as being the weirdest possible implementation of this. It defies logic to put this information in the device tree /chosen node while also attempting to boot the kernel using an EFI stub; the stub is going to have this information because it is going to have the pointer to the System Table (since it was called by StartImage()). Why not stash the System Table pointer somewhere safe in the stub? The information in the device tree is all accessible from Boot Services, and as long as the System Table isn't being thrown away (my suggestion would be..
stuff it in r2, and set r1 = "EFI\0", then work with the arch/arm/kernel/head{-common,}.S code to do the right thing).

It seems like the advantages of booting from UEFI and having all this information and API around are being thrown away very early, and picked up when it's no longer relevant, to gain access to the very minimal runtime services. What's missing is a UUID for a "Device Tree Blob" in the Configuration Table, so you can very easily go grab that information from the firmware.

As implemented, these patches employ a very long-winded and complex method of recovering UEFI after throwing the System Table pointer away early in boot, and then implement an EFI calling convention which isn't strictly necessary according to the UEFI spec. The question is: is this a workaround for SetVirtualAddressMap() not actually doing the right thing on ARM UEFI implementations? If you can't guarantee that most of the calls from Boot Services or Runtime Services are going to allow this, then having any UEFI support in the kernel at all seems rather weird. What I'm worried about is that this is basically a hack to try and shoehorn an existing UEFI implementation into an existing Linux boot method - and implements it in a way nobody is ever going to be able to justify improving.

Part of the reason the OpenFirmware CIF got thrown away early in SPARC/PowerPC boot (after "flattening" the device tree using the CIF calls to parse it out) was because you had to disable the MMU, caches, interrupts etc., which caused all kinds of slow firmware code to be all kinds of extra-slow. What that meant is nobody bothered to implement working, re-entrant, re-locatable firmware to any great degree. This ended up being a self-fulfilling prophecy of "don't trust the bootloader" and "get rid of it as soon as we can," which essentially meant Linux never took advantage of the resources available. In OF's case, the CIF sucked by specification.
In UEFI's case here, it's been implemented in Linux in such a way that guarantees poor-performing firmware code with huge penalties to call it, which isn't even required by UEFI if the earlier boot code did the right things in the first place.

Ta,
Matt Sealey
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] ARM: Use udiv/sdiv for __aeabi_{u}idiv library functions
On Fri, Nov 8, 2013 at 5:00 PM, Stephen Boyd wrote:
> If we're running on a v7 ARM CPU, detect if the CPU supports the
> sdiv/udiv instructions and replace the signed and unsigned
> division library functions with an sdiv/udiv instruction.
>
> Running the perf messaging benchmark in pipe mode
>
>   $ perf bench sched messaging -p
>
> shows a modest improvement on my v7 CPU.
>
> before:
> (5.060 + 5.960 + 5.971 + 5.643 + 6.029 + 5.665 + 6.050 + 5.870 + 6.117 + 5.683) / 10 = 5.805
>
> after:
> (4.884 + 5.549 + 5.749 + 6.001 + 5.460 + 5.103 + 5.956 + 6.112 + 5.468 + 5.093) / 10 = 5.538
>
> (5.805 - 5.538) / 5.805 = 4.6%

Even with the change to the output constraint suggested by Mans, do you get absolutely identical benchmark results? There's a lot of variance in any case..

BTW, has there been any evaluation of the penalty for the extra branching, or of the performance hit for the ARMv7-without-division cases?

Ta,
Matt Sealey
Re: ARM audit, seccomp, etc are broken wrt OABI syscalls
On Tue, Nov 5, 2013 at 6:14 PM, Kees Cook wrote:
>
> Alternatively, CONFIG_SECCOMP_FILTER could depend on
> !CONFIG_OABI_COMPAT. That seems like the least work, given the desire
> to kill OABI in the real world. (Though I would note that at least
> Ubuntu's ARM kernels build with CONFIG_OABI_COMPAT; Chrome OS does
> not.)

I think CONFIG_OABI_COMPAT probably leaked in from the original configurations of the kernel taken from Debian. There were several big decisions they made (build for ARMv5 soft float, then switch to ARMv7 softfp, then switch to ARMv7 hardfp, then switch to using THUMB2 kernels) which would have just broken OABI binaries at every step of the way, since they had some subtle implications in kernel configuration (note: Ubuntu have never, ever enabled FPA emulation in the kernel, and all Debian's OABI userspace is built for FPA, for example. A good chunk of the original Debian arm port would probably just pitch a SIGILL if you ran it under an Ubuntu kernel).

I would ignore anyone who enables it in a distribution, since they probably weren't intending to enable it in the first place, and never noticed the.. what.. 3-4KiB it adds to the kernel?

Matt Sealey
Re: [PATCH 3/4] ARM: dts: i.MX53: dts for Voipac x53-dmm-668 module
On Thu, Oct 24, 2013 at 2:12 PM, Rostislav Lisovy wrote:
> Dear Shawn;
> Thank you for your comments.
> Should I also add Voipac to
> Documentation/devicetree/bindings/vendor-prefixes.txt?

I would agree with that, but why is your chosen prefix "vp" instead of "voipac" anyway?

--
Matt Sealey
Re: [PATCH 1/3] ARM: Introduce atomic MMIO clear/set
On Mon, Aug 19, 2013 at 11:59 AM, Ezequiel Garcia wrote:
> On Mon, Aug 12, 2013 at 07:29:42PM +0100, Will Deacon wrote:
>
>> This means that you don't have ordering guarantees between the two accesses
>> outside of the CPU, potentially giving you:
>>
>>   spin_lock(&__io_lock);
>>   spin_unlock(&__io_lock);
>>   writel((readl(reg) & ~clear) | set, reg);
>>
>> which is probably not what you want.
>>
>> I suggest adding an iowmb after the writel if you really need this ordering
>> to be enforced (but this may have a significant performance impact,
>> depending on your SoC).
>
> I don't want to argue with you, given I have zero knowledge about this
> ordering issue. However let me ask you a question.
>
> In arch/arm/include/asm/spinlock.h I'm seeing this comment:
>
> "ARMv6 ticket-based spin-locking.
>  A memory barrier is required after we get a lock, and before we
>  release it, because V6 CPUs are assumed to have weakly ordered
>  memory."
>
> and also:
>
> static inline void arch_spin_unlock(arch_spinlock_t *lock)
> {
>         smp_mb();
>         lock->tickets.owner++;
>         dsb_sev();
> }
>
> So, knowing this atomic API should work for every ARMv{N}, and not being very
> sure what the call to dsb_sev() does: would you care to explain how the above
> is *not* enough to guarantee a memory barrier before the spin unlocking?

arch_spin_[un]lock as an API is not guaranteed to use a barrier before or after doing anything, even if this particular implementation does. dsb_sev() is an SMP helper which does a synchronization barrier and then sends events to other CPUs, informing them of the unlock. If the other CPUs were in WFE state waiting on that spinlock, they can now thunder in and attempt to lock it themselves. It's not really relevant to the discussion, as arch_spin_unlock is not guaranteed to perform a barrier before returning.
On some other architecture there may be ISA additions which make locking barrier-less, or on a specific implementation of an ARM architecture SoC there may be a bank of "hardware spinlocks" available. So, in this sense, you shouldn't rely on implementation-specific behaviors of a function. If you need to be sure C follows B follows A, insert a barrier yourself. Don't expect A to barrier for you just because you saw some source code that does it today, as tomorrow it may be different. Relying on it isn't an optimization, just a potential source of new bugs if the implementation changes underneath you later.

--
Matt
Re: [Ksummit-2013-discuss] [ATTEND] [ARM ATTEND] kernel data bloat and how to avoid it
On Fri, Aug 2, 2013 at 3:13 AM, Tony Lindgren wrote:
> * Mel Gorman [130731 08:28]:
>> On Wed, Jul 31, 2013 at 12:38:03AM -0700, Tony Lindgren wrote:
>> > Hi all,
>> >
>> > Probably the biggest kernel data bloat issue is in the ARM land, but
>> > it also seems that it's becoming a Linux generic issue too, so I
>> > guess it could be discussed in either context.
>> >
>>
>> Would scripts/bloat-o-meter highlight where the growth problems are?
>
> Well to some extent yes, the board/SoC/driver specific options are
> often behind Kconfig options. So if you want to limit the set of
> supported SoCs and drivers for the kernel you can optimize it out.
>
> The bloat-o-meter won't help for things like checking that a device
> tree binding really describes the hardware, and is not just pointing
> to a table of defined registers in the device driver.

Specifically naming and shaming: the arch/arm/mach-imx/clk-*.c kind of bloat, where names of clocks, their bit definitions etc. are extremely specific to the chip but can also change per-board, and should be in the device tree? It would be possible to list and parse all these clocks in about a third of the code of one of those files and have it work on every one of the SoCs they cover, if they're adequately described in the device tree (which gives an opportunity to stop changing the clock binding for each SoC to add new "numbers" for "clocks we forgot", and also to allow registration of ONLY the board-specific clocks, on a per-board basis - if you don't use the TVE or SATA, why define a clock for it just to turn it off?).

There's not a lot else going on with regards to "too much data in the kernel that shouldn't be in the kernel" to my mind that won't eventually get taken out once bindings are in the DT. Most other platforms are pretty sane in this regard - what is left is legacy stuff for boards that don't have DTs, and methods are in place to boot similar boards with DT-only.
As above, Mike Turquette and the OMAP guys are doing something very similar here. What you'd suggest/need is a review by the maintainers of each platform they support for any static data they are maintaining that can either be excised by moving it to the device tree as boot configuration/description data, if that data is not basically constant anyway, or just by getting rid of rare and unused boards. The PowerPC guys had a great cull when they moved from arch/ppc to arch/powerpc and decided no non-DT boards were allowed. Anything that didn't get ported got deleted when arch/ppc went away. The arm arch kernel tree doesn't have the luxury of a hard expiration in place..

~

BTW Mike, the solution is big device trees with clock data. The kernel *should* be pure logic, not description and data tables (data tables are exactly why Device Trees exist!) except where they are a fundamental constant. Clock trees are a *PER BOARD WORLD* even if they use the same SoC. I would - and have since the dawn of time - advocate doing it the way you're doing it (250 clocks in the tree!!! GANBATTE!!!), simply because the "logic" behind mapping clocks to numbers is obtuse and needs to be reproduced in very limited ways - adding a clock that was previously undefined but of a common, already-used structure (like a gate or mux that had no previous driver to use it) means modifying the kernel AND the device tree (and updating the binding!).

In this case, they are doing it wrong: I keep seeing patches to add 2 or 3 new clocks per merge window, and absolutely NONE of them are the JTAG clock; I keep having to add that in to every kernel I test. It shouldn't be a compile-time thing to boot, it should be a "which DT do I pick, the one with JTAG or the one without?" issue. We should ONLY have to change the device tree - individual clock type bindings would be fixed, and boards would have to define their clocks.
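As a sketch of what "clocks defined in the tree" could look like - the compatible strings and property names below are invented for illustration and are not an existing binding - the forgotten JTAG gate becomes a dts edit rather than a kernel patch:

```dts
clks: clock-controller@53f80000 {
        compatible = "fsl,imx53-ccm";
        reg = <0x53f80000 0x4000>;
        #address-cells = <1>;
        #size-cells = <0>;

        /* hypothetical self-describing gate: register offset, bit and
         * parent all live in the tree, so no per-SoC clock table in C */
        clk_jtag: clock-jtag {
                compatible = "gate-clock";      /* invented for illustration */
                #clock-cells = <0>;
                clocks = <&osc>;                /* fixed parent              */
                gate-offset = <0x70>;           /* hypothetical CCGR offset  */
                gate-bit = <6>;
                status = "okay";                /* this board wants JTAG     */
        };
};
```

A board that doesn't use JTAG would simply omit the node (or mark it "disabled"), rather than carrying a kernel-side definition just to turn the clock off.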
Since any external clocks and peripherals have to be validated by board designers anyway (for stability), everyone will know exactly what every board needs per peripheral, and be able to define their own clock trees with fixed parents if need be. Vendors can provide suitable reference trees that match the full, clumsy configuration of an entire SoC for board bringup, and we should be enabling (as in the DT status="okay" property) the clocks that make sense on that board. There's a major advantage as above - if you do not use the clock on your board, you do not have to specify it. It may be left on by the bootloader and drain power and cause noise and screw up parenting for muxes, but then that's an issue for the bootloader dev guys to resolve.

The reason consensus doesn't flow that way is because "it is a lot of work to do this in device tree" - but someone wrote out those 5 or 6 board files already, and so far the clock subsystem has been rewritten like 4 times in the meantime. None of Freescale's BSPs (for example) use the same clock subsystem as they update to new kernels. I *will* change the i.MX stuff one day if I ever find the time. I started already but I
Re: [PATCH v4 2/6] misc: sram: add ability to mark sram sections as reserved
The IEEE 1275 (Open Firmware) specification has details on it in the MMU (section 3.6.5) and Memory (section 3.7.6) node specifications, and the glossary defines how the property should look. How you find that document is.. up to you (I have a PDF of it I got hold of around 2004; it used to be linked at www.openfirmware.org but I've not checked). I think it's missing from the devicetree.org spec since it's already very well defined and there's no reason to reinvent (or redefine) the wheel.

Essentially it follows the same format as "reg" - phys.hi phys.lo length - defining the regions which are ACTUALLY available to the system. The intent is that "reg" defines the memory as going from an address for a certain size, while "available" marks the sections of it the device tree user may use (the further intent being that you would want the OS to know that there is, for example, 4GiB of memory there and set up any mappings to suit, but the last 256MiB is where you've put some important data and you would rather it didn't just get allocated to something else..).

Linux doesn't actually parse and use the information in the "available" property (or care about the "existing" property in an MMU node, since it replaces page tables super early and has different ideas about mapping than most bootloaders), but other operating systems possibly do. It would not be a significant hardship for Linux to parse the property if present and use that information after a bit of sanity checking rather than just trusting "reg". In that sense, it would not be a significant hardship to use existing specifications to define nodes that describe memory, since if there IS parsing code available, it will be re-used, which is good all round.

-- Matt

On Thu, Aug 1, 2013 at 11:35 AM, Heiko Stübner wrote:
> Am Montag, 29. Juli 2013, 23:39:45 schrieb Matt Sealey:
>> On Mon, Jul 29, 2013 at 9:02 AM, Philipp Zabel wrote:
>> > Hi Heiko,
>> >
>> > Am Montag, den 29.07.2013, 15:12 +0200 schrieb Heiko Stübner:
>> >> Some SoCs need parts of their sram for special purposes. So while being
>> >> part of the peripheral, it should not be part of the genpool
>> >> controlling the sram.
>> >>
>> >> Therefore add an option mmio-sram-reserved to keep arbitrary portions of
>> >> the sram from being part of the pool.
>> >>
>> >> Suggested-by: Rob Herring
>> >> Signed-off-by: Heiko Stuebner
>> >> Tested-by: Ulrich Prinz
>> >> ---
>> >> Philipp: I didn't carry the ack, because the loop changed significantly
>> >> again. So if it looks ok, could you re-ack it please?
>> >
>> > I'd prefer the first loop to contain the magic and produce a list of
>> > useable chunks, instead of a list of reserved blocks. The second loop
>> > could then iterate over the array and just call gen_pool_add_virt
>> > repeatedly.
>> >
>> > regards
>> > Philipp
>>
>> Agreed, however specifying chunks of memory should probably match the
>> format of the standard memory@ node "available" property - mostly
>> because it would be the same syntax and definition as defining any
>> other chunk of memory, as OpenFirmware and device trees have been
>> doing since the dark ages. In this case, why not re-use the
>> "available" property name instead of creating a new one? Standard OF
>> memory parsing code is then free for you to use to pull the chunks
>> out.
>
> Sound interesting, but could you give me a pointer to some sort of docs about
> this "available" property in memory nodes?
>
> Documentation/devicetree/booting-without-of.txt seems to be the only file
> describing the memory node at all but only required properties, and not any
> "available" property.
>
> I've found a document on power.org describing the memory node, but also not
> mentioning any "available" property.
>
> And devicetree.org does not seem to handle the memory node at all.
>
> Thanks for any hints
> Heiko

-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
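As a hedged sketch of the format described above (the node contents are invented for illustration, and cell sizes assume #address-cells = <1> and #size-cells = <1>), an IEEE 1275-style "available" property following the "reg" layout might look like:

```dts
/* Illustration only: 1 GiB of DRAM physically present, but only the
 * first 768 MiB offered to the OS; the top 256 MiB holds data the
 * OS must not reallocate. "available" uses the same address/length
 * cells as "reg". */
memory@0 {
    device_type = "memory";
    reg = <0x00000000 0x40000000>;       /* full 1 GiB */
    available = <0x00000000 0x30000000>; /* usable 768 MiB */
};
```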
Re: [PATCH v4 2/6] misc: sram: add ability to mark sram sections as reserved
On Mon, Jul 29, 2013 at 9:02 AM, Philipp Zabel wrote:
> Hi Heiko,
>
> Am Montag, den 29.07.2013, 15:12 +0200 schrieb Heiko Stübner:
>> Some SoCs need parts of their sram for special purposes. So while being
>> part of the peripheral, it should not be part of the genpool controlling
>> the sram.
>>
>> Therefore add an option mmio-sram-reserved to keep arbitrary portions of
>> the sram from being part of the pool.
>>
>> Suggested-by: Rob Herring
>> Signed-off-by: Heiko Stuebner
>> Tested-by: Ulrich Prinz
>> ---
>> Philipp: I didn't carry the ack, because the loop changed significantly
>> again. So if it looks ok, could you re-ack it please?
>
> I'd prefer the first loop to contain the magic and produce a list of
> useable chunks, instead of a list of reserved blocks. The second loop
> could then iterate over the array and just call gen_pool_add_virt
> repeatedly.
>
> regards
> Philipp

Agreed, however specifying chunks of memory should probably match the format of the standard memory@ node "available" property - mostly because it would be the same syntax and definition as defining any other chunk of memory, as OpenFirmware and device trees have been doing since the dark ages. In this case, why not re-use the "available" property name instead of creating a new one? Standard OF memory parsing code is then free for you to use to pull the chunks out.

-- Matt Sealey
Re: [PATCH V3 1/3] dts: change Marvell prefix to 'marvell'
On Tue, Jul 9, 2013 at 7:49 AM, Jason Cooper wrote:
> Neil,
>
> I agree with the need to change, however, this has been in the binding
> documentation since v3.5. I wish we had caught this when we decided
> against using stock ticker symbols (not all stock markets use
> alphabetical abbreviated names, not all companies are listed on any
> stock exchange).

Who decided that? You can't just "stop using stock ticker symbols" - FDT is inherently based on the original OpenFirmware device tree, so breaking existing bindings that were done on real OpenFirmware solutions - where using stock ticker symbols was entirely appropriate (although, these days, not useful) - is counter-productive. Marvell originally had mrvl as their ticker and used it in OF DTs (and it is there..), so mrvl it stays. Adding new devices with a marvell prefix is, in this case, wrong: you should keep using the old one.

A good example of this: Freescale. Nobody is saying everyone should move to "freescale,imx-this" or "freescale,vybrid-that" - it's fsl, and it stays fsl for backwards/forwards compatibility because things exist already. Any new companies can have a long, descriptive name; a privilege of being late to the party, you might say :)

Having an odd mix of mrvl and marvell, or moving to marvell, is just completely obtuse. Whether or not they had that stock ticker, or will have it in the future, it is how they're defined, and you can't in good conscience change the binding after devices ship with it.

> To do this properly, the drivers are going to have to be compatible with
> the old and the new names, and the binding docs updated to reflect the
> legacy name and the preferred name.

Properly would be as above. You can stop using stock tickers for new company names, but anything that has been defined in a device tree before has to stay that way, and all the manufacturer prefixes to device names should be the same. What you're proposing is purely driver bloat, increasing the size of the kernel.

-- Matt Sealey
Re: [PATCH RFC 3/3] clk: dt: binding for basic divider clock
On Tue, Jun 4, 2013 at 2:22 PM, Mike Turquette wrote:
> Quoting Matt Sealey (2013-06-04 10:39:53)
>> On Tue, Jun 4, 2013 at 12:11 PM, Stephen Boyd wrote:
>> > On 06/03/13 10:53, Mike Turquette wrote:
>> >> +Required properties:
>> >> +- compatible : shall be "divider-clock".
>> >> +- #clock-cells : from common clock binding; shall be set to 0.
>> >> +- clocks : link to phandle of parent clock
>> >> +- reg : base address for register controlling adjustable divider
>> >> +- mask : arbitrary bitmask for programming the adjustable divider
>> >> +
>> >> +Optional properties:
>> >> +- clock-output-names : From common clock binding.
>> >> +- table : array of integer pairs defining divisors & bitfield values
>> >> +- shift : number of bits to shift the mask, defaults to 0 if not present
>> >> +- index_one : valid divisor programming starts at 1, not zero
>> >> +- index_power_of_two : valid divisor programming must be a power of two
>> >> +- allow_zero : implies index_one, and programming zero results in
>> >> +  divide-by-one
>> >
>> > It's preferred that property names have dashes instead of underscores.
>>
>> I think I have a suggestion or two here..
>>
>> I mailed one to Mike earlier, I have had some mailing list problems
>> which I think are solved now..
>>
>> If you provide a mask for programming the divider, surely the "shift"
>> property is therefore redundant. Unless some device has a completely
>> strange way of doing things and uses two separated bitfields for
>> programming in the same register, the shift value is simply a function
>> of ffs() on the mask, isn't it? It is nice to put the information
>> specifically in the device tree, but where a shift is not specified
>> it'd probably be a good idea to derive that value that way anyway,
>> right, and if we're doing that.. why specify it?
>
> Shift is optional. Also I think the integer values in a DTS are
> 32-bits, so if some day someone had a clock driver that needed to write
> to a 64-bit register then shift might be useful there, combined with a
> 32-bit mask. Or perhaps dtc will have a 64-bit value for that case? I
> don't know.
>
> I certainly do see your point about only using the mask, and the
> original clock basic types did this a while back before the shift+width
> style came into vogue.
>
> If other reviewers also want to use a pure 32-bit mask and no shift then
> I'll drop it in the next round.

Hmm.. that's a very good point... You can't have a 64-bit OF property unless you're on a 64-bit platform or do some really weird workarounds for it (multiple cells etc.). That's how the OF spec was laid out; it sucks.. but it's the standard, and the DT seems to stay compatible with the general concept. A supplemental binding to support clock dividers living in 64-bit registers may be a better way. You'd need a shift for the mask if you could only do a 32-bit property (a 32-bit platform with a 4-byte cell size), and a shift for the bits that would go into the mask, as well as a width property for the register size to bound the shift.. Too complicated for now. I would say stick with 32-bit until this magical 64-bit divider register appears, and then do something slightly different :)

>> Also it may be much simpler to simply define the default clock setup
>> as being an integer divider that follows the form 2^n - I am not sure
>> I can think off the top of my head of any clock dividers doing the
>> rounds that have divide-by-non-power-of-two, and if they exist out
>> there (please correct me) they would be the rare ones. I think
>> properties that modify behavior should err on the side of being rarely
>> used and of defining unusual behavior, not confirming the status quo.
>
> Please review the current use of CLK_DIVIDER_POWER_OF_TWO in the kernel.

You're right, I think I got it backwards, or was actually thinking of allow_zero (as in, a 0x0 divider = divide by 1)? i.MX definitely has more 0x0=div-by-1 than any other kind..

>> In any case, once you figure that out, rather than specifying a full
>> table of <0 1>, <1 2>, <2 4> and creating a big list, you could just
>> give the lower and upper limits and let the driver figure out what is
>> in between with two values. If there were gaps, a full table would be
>> necessary. The allow_zero property might be better served by the very
>> same range property.
>
> You have described how it already works. The table is only there for
> when a common formula fails to capture the possible divisor
> combinations. For the defa
Re: [PATCH RFC 3/3] clk: dt: binding for basic divider clock
On Tue, Jun 4, 2013 at 12:11 PM, Stephen Boyd wrote:
> On 06/03/13 10:53, Mike Turquette wrote:
>> +Required properties:
>> +- compatible : shall be "divider-clock".
>> +- #clock-cells : from common clock binding; shall be set to 0.
>> +- clocks : link to phandle of parent clock
>> +- reg : base address for register controlling adjustable divider
>> +- mask : arbitrary bitmask for programming the adjustable divider
>> +
>> +Optional properties:
>> +- clock-output-names : From common clock binding.
>> +- table : array of integer pairs defining divisors & bitfield values
>> +- shift : number of bits to shift the mask, defaults to 0 if not present
>> +- index_one : valid divisor programming starts at 1, not zero
>> +- index_power_of_two : valid divisor programming must be a power of two
>> +- allow_zero : implies index_one, and programming zero results in
>> +  divide-by-one
>
> It's preferred that property names have dashes instead of underscores.

I think I have a suggestion or two here..

I mailed one to Mike earlier, I have had some mailing list problems which I think are solved now..

If you provide a mask for programming the divider, surely the "shift" property is therefore redundant. Unless some device has a completely strange way of doing things and uses two separated bitfields for programming in the same register, the shift value is simply a function of ffs() on the mask, isn't it? It is nice to put the information specifically in the device tree, but where a shift is not specified it'd probably be a good idea to derive that value that way anyway, right, and if we're doing that.. why specify it?

Also it may be much simpler to simply define the default clock setup as being an integer divider that follows the form 2^n - I am not sure I can think off the top of my head of any clock dividers doing the rounds that have divide-by-non-power-of-two, and if they exist out there (please correct me) they would be the rare ones. I think properties that modify behavior should err on the side of being rarely used and of defining unusual behavior, not confirming the status quo.

Just to be sure, though, someone would need to go visit a bunch of the clock handlers already implemented, or read a lot of docs, to see how they're expecting their dividers. Maybe some of the Samsung, Qualcomm, TI, Freescale guys can comment on what is the most common case inside their SoCs.

In any case, once you figure that out, rather than specifying a full table of <0 1>, <1 2>, <2 4> and creating a big list, you could just give the lower and upper limits and let the driver figure out what is in between with two values. If there were gaps, a full table would be necessary. The allow_zero property might be better served by the very same range property.

I would also say.. bit-mask rather than mask, divider-table rather than table, divider-range (modifying the idea above). While we're inside a particular namespace in the binding, I think naming properties with their intent (i.e. what table is it?) is good behavior.

-- Matt Sealey
Re: [RFC PATCH v3 1/2] ARM: kernel: update cpuinfo to print SoC model name
On Wed, Jan 30, 2013 at 1:07 PM, Nicolas Pitre wrote:
> On Wed, 30 Jan 2013, Ruslan Bilovol wrote:
>
>> Currently, reading /proc/cpuinfo provides userspace with the CPU ID of
>> the CPU carrying out the read from the file.
>> Userspace using this information may decide what module
>> to load or how to configure some specific (and processor-dependent)
>> settings or so.
>> However, since really different SoCs can share the same ARM core,
>> this information currently is not so useful.
>> For example, the TI OMAP4460 and OMAP4470 SoCs show the same
>> information in /proc/cpuinfo whereas they are different.
>> Since in most cases an ARM CPU is part of some system on a chip (SoC),
>> the "cpuinfo" file looks like exactly the place where this
>> information has to be displayed.
>>
>> So add a new line "SoC name" in the "cpuinfo" output for the system
>> on a chip name. It is placed between the CPU information and machine
>> information, so the file structure looks graceful (CPU-SoC-Hardware).
>>
>> Example:
>>
>> / # cat /proc/cpuinfo
>> [...]
>> CPU variant : 0x2
>> CPU part : 0xc09
>> CPU revision : 10
>>
>> SoC name : OMAP4470
>>
>> Hardware : OMAP4 Blaze Tablet
>
> Please remove that extra blank line between "SoC name" and "Hardware".
> The blank line after "CPU revision" is fine.
>
> Also, please rename this to "System name". Not all systems are "on
> chip". By using "System name" this is more universally useful.

I can't agree with "System name"; it is confusing in common terminology since it's about the same definition as the current "Hardware" line. If we're just printing out the name of the device surrounding the CPU - be it a Northbridge/Southbridge combination or SoC packaging - "Chipset" might be a better name for it.

-- Matt Sealey
Product Development Analyst, Genesi USA, Inc.
Re: [PATCH] crypto: fix FTBFS with ARM SHA1-asm and THUMB2_KERNEL
On Tue, Jan 22, 2013 at 1:50 AM, Jussi Kivilinna wrote: > >> There is 'tcrypt' module in crypto/ for quick benchmarking. 'modprobe >> tcrypt mode=500 sec=1' tests AES in various cipher-modes, using different >> buffer sizes and outputs results to kernel log. >> > > Actually mode=200 might be better, as mode=500 is for asynchronous > implementations and might use hardware crypto if such device/module is > available. Okeydokey I'll start running some tests.. -- Matt Sealey Product Development Analyst, Genesi USA, Inc.
Re: One of these things (CONFIG_HZ) is not like the others..
Matt Sealey
Product Development Analyst, Genesi USA, Inc.

On Mon, Jan 21, 2013 at 7:31 PM, John Stultz wrote:
> On 01/21/2013 05:06 PM, Matt Sealey wrote:
>>
>> On Mon, Jan 21, 2013 at 6:51 PM, John Stultz
>> wrote:
>>>
>>> On 01/21/2013 02:54 PM, Matt Sealey wrote:
>>>>
>>>> On Mon, Jan 21, 2013 at 4:36 PM, John Stultz
>>>> wrote:
>>>>>
>>>>> On 01/21/2013 01:14 PM, Matt Sealey wrote:
>>>
>>> As far as jiffies rating, from jiffies.c:
>>> .rating= 1, /* lowest valid rating*/
>>>
>>> So I'm not sure what you mean by "the debug on the kernel log is telling
>>> me it has a higher resolution".
>>
>> Oh, it is just if I actually don't run setup_sched_clock on my
>> platform, it gives a little message (with #define DEBUG 1 in
>> sched_clock.c) about who setup the last sched_clock. Since you only
>> get one chance, and I was fiddling with setup_sched_clock being probed
>> from multiple possible timers from device tree (i.MX3 has a crapload
>> of valid timers, which one you use right now is basically forced by
>> the not-quite-fully-DT-only code and some funky iomap tricks).
>>
>> And what I got was, if I use the real hardware timer, it runs at 66MHz
>> and says it has 15ns resolution and wraps every 500 seconds or so. The
>> jiffies timer says it's 750MHz, with a 2ns resolution.. you get the
>> drift. The generic reporting of how "good" the sched_clock source is
>> kind of glosses over the quality rating of the clock source and at
>> first glance (if you're not paying that much attention), it is a
>> little bit misleading..
>
> I've got no clue on this. sched_clock is arch specific, and while ARM does
> use clocksources for sched_clock, what you're seeing is a detail of the ARM
> implementation and not the clocksource code (one complication is that
> clocksources rating values are for the requirements of timekeeping, which
> are different then the requirements for sched_clock - so the confusion is
> understandable).
>>> Yes, in the case I was remembering, the 60HZ was driven by the electrical
>>> line.
>>
>> While I have your attention, what would be the minimum "good" speed to
>> run the sched_clock or delay timer implementation from? My rudimentary
>> scribblings in my notebook give me a value of "don't bother" with less
>> than 10KHz based on HZ=100, so I'm wondering if a direct 32.768KHz
>> clock would do (i.MX osc clock input if I can supply it to one of the
>> above myriad timers) since this would be low-power compared to a 66MHz
>> one (by a couple mA anyway). I also have a bunch of questions about
>> the delay timer requirements.. I might mail you personally.. or would
>> you prefer on-list?
>
> So there are probably other folks who could better comment on sched_clock()
> or the delay timer (I'm guessing the delay() implementation is what you mean
> by that) design trade-offs.

I'm specifically talking about if I do

static struct delay_timer imx_gpt_delay_timer = {
	.read_current_timer = imx_gpt_read_current_timer,
};

and then something like:

imx_gpt_delay_timer.freq = clk_get_rate(clk_per);
register_current_timer_delay(&imx_gpt_delay_timer);

In the sense that now (as of kernel 3.7 iirc), I have an ability to have
the delay implementation use this awesome fast accessor (which is
nothing to do with a 'clocksource' as in the subsystem..) to get to my
(here at least) 66.5MHz counter (up or down, i.MX has both, but I dunno
if you can use a down counter for delay_timer, or if that's preferred,
or what.. there are no examples of it.. but it seems to work.. that said
I can't imagine what would be an immediately visible and not totally
random effect of doing it "wrong", maybe that delays are instantly
returned, that could be very hard or impossible to ever notice compared
to not being able to browse the internet on the target device.. it might
pop up on some randomly-not-resetting platform device or so, though..)

And I can also put sched_clock on a completely different timer.
Does that make any sense at all? I wouldn't know, it's not documented. And if I wanted to I could register 8 more timers. That seems rather excessive, but the ability to use those extra 8 as clock outputs from the SoC or otherwise directly use comparators is useful to some people, does Linux in general really give a damn about having 8 timers of the same quality being available when most systems barely have two clocksources anyway (on x86, tsc and hpet - on ARM I guess twd and some SoC-specific timer). I dunno how many people might actually want to define in a d
Re: One of these things (CONFIG_HZ) is not like the others..
On Mon, Jan 21, 2013 at 7:18 PM, Russell King - ARM Linux wrote: > On Mon, Jan 21, 2013 at 07:06:59PM -0600, Matt Sealey wrote: >> On Mon, Jan 21, 2013 at 6:51 PM, John Stultz wrote: >> > On 01/21/2013 02:54 PM, Matt Sealey wrote: > > sched_clock() has nothing to do with time keeping, and that > HZ/NO_HZ/HRTIMERS don't affect it (when it isn't being derived from > jiffies). > > Now, sched_clock() is there to give the scheduler a _fast_ to access, > higher resolution clock than is available from other sources, so that > there's ways of accurately measuring the amount of time processes run > for, That depends on what you meant by timekeeping, right? I'm really not concerned about the wallclock time, more about the accuracy of the scheduler clock (tick?), preemption, accurate delays (i.e. if I msleep(10) does it delay for 10ms or for 40ms because my delay timer is inaccurate? I'd rather it was better but closer to 10ms), and whether the scheduler (the thing that tells my userspace whether firefox is running now, or totem, or any other task) is using the correct high resolution periodic, oneshot, repeatable (however it repeats) timers *properly* given that this magic config item is missing on ARM. That magic config item being CONFIG_SCHED_HRTICK which is referenced a bunch in kernel/sched/*.[ch] but *ONLY* defined as a Kconfig item in kernel/Kconfig.hz. Do we need to copy that Kconfig item out to arch/arm/Kconfig, that's the question? > and other such measurements - and it uses that to determine how > to schedule a particular task and when to preempt it. > > Not providing it means you get those measurements at HZ-based resolution, > which is suboptimal for tasks which run often for sub-HZ periods (which > can end up accumulating zero run time.) 
Okay, and John said earlier: John Stultz: > So I'm actually not super familiar with SCHED_HRTICK details, but from my > brief skim of it it looks like its useful for turning off the periodic timer > tick, and allowing the scheduler tick to be triggered by an hrtimer itself > (There's a number of these interesting inversions that go on in switching to > HRT mode - for instance, standard timer ticks are switched to being hrtimer > events themselves). > > This likely has the benefit of time-accurate preemption (well, long term, as > if the timer granularity isn't matching you could be delayed up to a tick - > but it wouldn't drift). > > I'm guessing Thomas would probably know best what the potential issues would > be from running ((CONFIG_HRTIMER || CONFIG_NO_HZ) && !CONFIG_SCHED_HRTICK). If SCHED_HRTICK isn't enabled but setup_sched_clock has been given an accessor for a real, hardware, fast, high resolution counter that meets all the needs of sched_clock, what's going on? If it's enabled, what extra is it doing that, say, my_plat_read_sched_clock doesn't? -- Matt Sealey Product Development Analyst, Genesi USA, Inc.
Re: One of these things (CONFIG_HZ) is not like the others..
On Mon, Jan 21, 2013 at 6:51 PM, John Stultz wrote:
> On 01/21/2013 02:54 PM, Matt Sealey wrote:
>>
>> On Mon, Jan 21, 2013 at 4:36 PM, John Stultz
>> wrote:
>>>
>>> On 01/21/2013 01:14 PM, Matt Sealey wrote:
>
> As far as jiffies rating, from jiffies.c:
> .rating= 1, /* lowest valid rating*/
>
> So I'm not sure what you mean by "the debug on the kernel log is telling me
> it has a higher resolution".

Oh, it is just if I actually don't run setup_sched_clock on my
platform, it gives a little message (with #define DEBUG 1 in
sched_clock.c) about who setup the last sched_clock. Since you only
get one chance, and I was fiddling with setup_sched_clock being probed
from multiple possible timers from device tree (i.MX3 has a crapload
of valid timers, which one you use right now is basically forced by
the not-quite-fully-DT-only code and some funky iomap tricks).

And what I got was, if I use the real hardware timer, it runs at 66MHz
and says it has 15ns resolution and wraps every 500 seconds or so. The
jiffies timer says it's 750MHz, with a 2ns resolution.. you get the
drift. The generic reporting of how "good" the sched_clock source is
kind of glosses over the quality rating of the clock source and at
first glance (if you're not paying that much attention), it is a
little bit misleading..

> Yes, in the case I was remembering, the 60HZ was driven by the electrical
> line.

While I have your attention, what would be the minimum "good" speed to
run the sched_clock or delay timer implementation from? My rudimentary
scribblings in my notebook give me a value of "don't bother" with less
than 10KHz based on HZ=100, so I'm wondering if a direct 32.768KHz
clock would do (i.MX osc clock input if I can supply it to one of the
above myriad timers) since this would be low-power compared to a 66MHz
one (by a couple mA anyway). I also have a bunch of questions about
the delay timer requirements.. I might mail you personally.. or would
you prefer on-list?
-- Matt Sealey Product Development Analyst, Genesi USA, Inc.
Re: [PATCH] crypto: fix FTBFS with ARM SHA1-asm and THUMB2_KERNEL
On Mon, Jan 21, 2013 at 4:46 PM, Nicolas Pitre wrote:
> On Mon, 21 Jan 2013, Matt Sealey wrote:
>
>> The optimized assembler SHA1 code for ARM does not conform to Thumb2
>> register usage requirements, so it cannot be built when the kernel is
>> configured with THUMB2_KERNEL.
>>
>> Fix the FTBFS for now by preventing misconfigurations of the kernel.
>>
>> Signed-off-by: Matt Sealey
>
> A .arm directive at the top of the assembly code would be a better
> "fix", as that wouldn't reduce functionality.

If I recall, doing that last time for ssi-fiq.S was the wrong solution
and it was suggested proper configuration (on top of possibly rewriting
the assembly) was better than hacking around in the assembly..

> Yet, I'd invite you to have a look at commit 638591cd7b in linux-next.

I took a peek, and invite you to ignore my patch. I only tracked the top
of Linus' tree..

That said, it seems nobody benchmarked this on something different than
IXP425 or KS8695 to see if it's markedly faster than the (moderately
recently updated) C-code implementation outside of the numbers mentioned
in the logs for the initial commit? It seems like rather a specific
optimization for a rather specific use case for rather specific
processors (and therefore a small test base) probably meant for a very
specific product line somewhere. Whether you get any benefit in enabling
this config item or not for any other ARM platform is up for debate,
isn't it? If it *is* in fact much faster everywhere, and it works in any
ARM or THUMB2 configuration, there's a case to be built for it being the
default ARM implementation for AES and SHA1..

This question is to the implementor/committer (Dave McCullough), how
exactly did you measure the benchmark and can we reproduce it on some
other ARM box? If it's long and laborious and not so important to test
the IPsec tunnel use-case, what would be the simplest possible benchmark
to see if the C vs. assembly version is faster for a particular ARM
device?
I can get hold of pretty much any Cortex-A8 or Cortex-A9 that matters, I have access to a Chromebook for A15, and maybe an i.MX27 or i.MX35 and a couple Marvell boards (ARMv6) if I set my mind to it... that much testing implies we find a pretty concise benchmark though with a fairly common kernel version we can spread around (i.MX, OMAP and the Chromebook, I can handle, the rest I'm a little wary of bothering to spend too much time on). I think that could cover a good swath of not-ARMv5 use cases from lower speeds to quad core monsters.. but I might stick to i.MX to start with.. -- Matt Sealey Product Development Analyst, Genesi USA, Inc.
Re: One of these things (CONFIG_HZ) is not like the others..
On Mon, Jan 21, 2013 at 6:09 PM, Matt Sealey wrote:
[LAKML: about lack of SCHED_HRTICK because we don't use
kernel/Kconfig.hz on ARM]

> kind of leaves ARM in the doghouse.. who knows what weirdo scheduler
> reactions are related to it not being enabled. Maybe when it is, HZ
> *would* need to be allowed to be bumped when using this code path?

Or conversely maybe this is exactly why the Samsung maintainers decided
they need HZ=200, because SCHED_HRTICK isn't being enabled and they're
experiencing some multimedia lag because of it?

--
Matt Sealey
Product Development Analyst, Genesi USA, Inc.
Re: One of these things (CONFIG_HZ) is not like the others..
Okay so the final resolution of this is;

1) That the arch/arm/Kconfig HZ block is suffering from some cruft

I think we could all be fairly confident that Exynos4 or S5P does not
require HZ=200 - in theory, it has no such timer restrictions like
EBSA110 (the docs I have show a perfectly capable 32-bit timer with a
double-digits MHz input clock, since these are multimedia-class SoCs
it'd be seriously f**ked up if they didn't). But while some of the
entries on this line may be cargo-cult programming, the original
addition on top of EBSA110 *may* be one of your "unreported"
responsiveness issues. We could just let some Samsung employees complain
when Android 6.x starts to get laggy with a 3.8 kernel because we forced
their HZ=100.

What I would do is predicate a fixed, obvious default on
ARCH_MULTIPLATFORM so that it would get the benefit of a default HZ that
you agree with. It wouldn't CHANGE anything, but it makes it look less
funky, since the non-multiplatform settings would be somewhere else (it
either needs more comments or an if - either way - otherwise it's
potentially confusing);

if ARCH_MULTIPLATFORM
config HZ
	int
	default 100
else
# old config HZ block here
endif

2) We need to add config SCHED_HRTICK as a copy and paste from
kernel/Kconfig.hz since.. well, I still don't understand exactly what
the true effect would be, but I assume since Arnd is concerned and
John's explanation rings true that it really should be enabled on ARM
systems with the exact same dependencies as kernel/Kconfig.hz. Or not..
I see it as an oddity until I understand if we really care about it, but
the code seems to be fairly important to the scheduler and also enabled
by default almost everywhere else, which means only people with really
freakish SMP architectures with no ability to use GENERIC_SMP_HELPERS
have ever run these code paths besides ARM. That kind of leaves ARM in
the doghouse.. who knows what weirdo scheduler reactions are related to
it not being enabled.
Maybe when it is, HZ *would* need to be allowed to be bumped when using
this code path?

Matt Sealey
Product Development Analyst, Genesi USA, Inc.

On Mon, Jan 21, 2013 at 5:49 PM, Russell King - ARM Linux wrote:
> On Mon, Jan 21, 2013 at 05:23:33PM -0600, Matt Sealey wrote:
>> On Mon, Jan 21, 2013 at 4:42 PM, Russell King - ARM Linux
>> wrote:
>> > On Mon, Jan 21, 2013 at 04:20:14PM -0600, Matt Sealey wrote:
>> >> I am sorry it sounded if I was being high and mighty about not being
>> >> able to select my own HZ (or being forced by Exynos to be 200 or by
>> >> not being able to test an Exynos board, forced to default to 100). My
>> >> real "grievance" here is we got a configuration item for the scheduler
>> >> which is being left out of ARM configurations which *can* use high
>> >> resolution timers, but I don't know if this is a real problem or not,
>> >> hence asking about it, and that HZ=100 is the ARM default whether we
>> >> might be able to select that or not.. which seems low.
>> >
>> > Well, I have a versatile platform here. It's the intelligence behind
>> > the power control system for booting the boards on the nightly tests
>> > (currently disabled because I'm waiting for my main server to lock up
>> > again, and I need to use one of the serial ports for that.)
>> >
>> > The point is, it talks via I2C to a load of power monitors to read
>> > samples out. It does this at sub-100Hz intervals. Yet the kernel is
>> > built with HZ=100. NO_HZ=y and highres timers are enabled... works
>> > fine.
>> >
>> > So, no, HZ=100 is not a limit in that scenario. With NO_HZ=y and
>> > highres timers, it all works with epoll() - you get the interval that
>> > you're after. I've verified this with calls to gettimeofday() and
>> > the POSIX clocks.
>>
>> Okay.
>>
>> So, can you read this (it's short):
>>
>> http://ck.kolivas.org/patches/bfs/bfs-configuration-faq.txt
>>
>> And please tell me if he's batshit crazy and I should completely
>> ignore any scheduler discussion that isn't ARM-specific, or maybe..
>> and I can almost guarantee this, he doesn't have an ARM platform so
>> he's just delightfully ill-informed about anything but his quad-core
>> x86?
>
> Well... my x86 laptop is... HZ=1000, NO_HZ, HIGH_RES enabled, ondemand...
> doesn't really fit into any of those categories given there. I'd suggest
> that what's given there is a suggestion/opinion based on behaviours
> observed on x86 platforms.
>
> Whether it's appropriate for other architectures is not really a proven
> point - is it worth running ARM at 1000Hz when the load from running at
> 100Hz is measurable as a defini
Re: One of these things (CONFIG_HZ) is not like the others..
On Mon, Jan 21, 2013 at 5:13 PM, Russell King - ARM Linux wrote:
> On Mon, Jan 21, 2013 at 04:54:31PM -0600, Matt Sealey wrote:
>> Hmm, I think it might be appreciated for people looking at this stuff
>> (same as I stumbled into it) for a little comment on WHY the default
>> is 200. That way you don't wonder even if you know why EBSA110 has a
>> HZ=200 default, why Exynos is lumped in there too (to reduce the
>> number of interrupts firing?
>
> Err, _reduce_ ?
>
> Can you please explain why changing HZ from 100 to 200 is a reduction?

We were talking about HZ=1000 at the time, sorry...

>> Maybe the Exynos timer interrupt is kind
>> of a horrid core NMI kind of thing and it's desirable for it not to be
>> every millisecond,
>
> Huh? HZ=100 is centisecond intervals...

See above..

> I think your understanding is just way off. That default is
> there because that is the _architecture_ _default_ and there _has_ to
> be a default. No, including kernel/Kconfig.hz won't give us any kind
> of non-specified default because, as I've already said in one of my
> other mails, you can't supplement Kconfig symbol definitions by
> declaring it multiple times.

Okay, so the real

>> I know where the 60Hz clocksource might come from, the old Amiga
>> platforms have one based on the PSU frequency (50Hz in Europe, 60Hz
>> US/Japan). Even a 60Hz clocksource is useful though (on the Amiga, at
>> least, it is precisely the vsync clock for synchronizing your display
>> output on TV-out, which makes it completely useful for the framebuffer
>> driver), but.. you just won't expect to assign it as sched_clock or
>> your delay timer. And if anyone does I'd expect they'd know full well
>> it'd not run so well.
>
> Except in the UK where it'd be 50Hz for the TV out. (Lengthy irrelevant
> explanation why this is so for UK cut.)

Read again: "50Hz in Europe". Australia too. I'm British and I used to
have more EU-manufactured Amigas than I knew what to do with.. so..
just like your NTP story, I definitely know this already.

>> From that description, we are booting with standard HZ on ARM, and the
>> core sched_clock (as in we can call setup_sched_clock)
>> and/or/both/optionally using a real delay_timer switch to HRT mode if
>> we have the right equipment available in the kernel and at runtime on
>> the SoC.. but the process scheduler isn't compiled with the means to
>> actually take advantage of us being in HRT mode?
>
> Don't mix sched_clock() into this; it has nothing to do with HZ at all.
> You're confusing your apples with your oranges.

Okay..

>> A simple BUILD_BUG_ON and a BUG_ON right after each other in the
>> appropriate clocksource driver solves that.. if there's an insistence
>> on having at least some rope, we can put them in a field and tell them
>> they have to use the moon to actually hang themselves...
>
> No it doesn't - it introduces a whole load of new ways to make the
> kernel build or boot fail for pointless reasons - more failures, more
> regressions.
>
> No thank you.

But it would effectively stop users drinking kool-aid.. if you set your
HZ to something stupid, you don't even get a kernel to build, and
certainly don't get to boot past the first 40 lines of boot messages..
I think most people would rather a build error, or a runtime
unmistakable, unmissable warning than a subtle and almost imperceptible
skew in NTP synchronization, to use your example.

--
Matt Sealey
Product Development Analyst, Genesi USA, Inc.
Re: One of these things (CONFIG_HZ) is not like the others..
On Mon, Jan 21, 2013 at 4:42 PM, Russell King - ARM Linux wrote:
> On Mon, Jan 21, 2013 at 04:20:14PM -0600, Matt Sealey wrote:
>> I am sorry it sounded if I was being high and mighty about not being
>> able to select my own HZ (or being forced by Exynos to be 200 or by
>> not being able to test an Exynos board, forced to default to 100). My
>> real "grievance" here is we got a configuration item for the scheduler
>> which is being left out of ARM configurations which *can* use high
>> resolution timers, but I don't know if this is a real problem or not,
>> hence asking about it, and that HZ=100 is the ARM default whether we
>> might be able to select that or not.. which seems low.
>
> Well, I have a versatile platform here. It's the intelligence behind
> the power control system for booting the boards on the nightly tests
> (currently disabled because I'm waiting for my main server to lock up
> again, and I need to use one of the serial ports for that.)
>
> The point is, it talks via I2C to a load of power monitors to read
> samples out. It does this at sub-100Hz intervals. Yet the kernel is
> built with HZ=100. NO_HZ=y and highres timers are enabled... works
> fine.
>
> So, no, HZ=100 is not a limit in that scenario. With NO_HZ=y and
> highres timers, it all works with epoll() - you get the interval that
> you're after. I've verified this with calls to gettimeofday() and
> the POSIX clocks.

Okay. So, can you read this (it's short):

http://ck.kolivas.org/patches/bfs/bfs-configuration-faq.txt

And please tell me if he's batshit crazy and I should completely ignore
any scheduler discussion that isn't ARM-specific, or maybe.. and I can
almost guarantee this, he doesn't have an ARM platform so he's just
delightfully ill-informed about anything but his quad-core x86?

>> HZ=250 is the "current" kernel default if you don't touch anything, it
>> seems, apologies for thinking it was HZ=100.
>
> Actually, it always used to be 100Hz on everything, including x86.
> It got upped when there were interactivity issues... which haven't
> been reported on ARM - so why change something that we know works and
> everyone is happy with?

I don't know. I guess this is why I included Ingo and Peter as they seem
to be responsible for core HZ-related things; why have HZ=250 on x86
when CONFIG_NO_HZ and HZ=100 would work just as effectively? Isn't
CONFIG_NO_HZ the default on x86 and PPC and.. pretty much everything
else?

I know Con K. has been accused many times of peddling snake-oil... but
he has pretty graphs and benchmarks that kind of bear him out on most
things even if the results do not get his work upstream. I can't fault
the statistical significance of his results.. but even a placebo effect
can be graphed, correlation is not causation, etc, etc. - I don't know
if anything real filters down into the documentation though.

>> And that is too high for
>> EBSA110 and a couple of other boards, especially where HZ must equal
>> some exact divisor being pumped right into some timer unit.
>
> EBSA110 can do 250Hz, but it'll mean manually recalculating the timer
> arithmetic - because it's not a "reloading" counter - software has to
> manually reload it, and you have to take account of how far it's
> rolled over to get anything close to a regular interrupt rate which
> NTP is happy with. And believe me, it used to be one of two main NTP
> broadcasting servers on my network, so I know it works.

A-ha...

>> Anyway, a patch for ARM could perhaps end up like this:
>>
>> ~~
>> if ARCH_MULTIPLATFORM
>> source kernel/Kconfig.hz
>> else
>> HZ
>> default 100
>> endif
>>
>> HZ
>> default 200 if ARCH_EBSA110 || ARCH_ETC_ETC || ARCH_UND_SO_WEITER
>> # any previous platform definitions where *really* required here.
>> # but not default 100 since it would override kernel/Kconfig.hz every
>> time
>
> That doesn't work - if you define the same symbol twice, one definition
> takes priority over the other (I don't remember which way it works).
> They don't accumulate.

Well I did some testing.. a couple days of poking around, and they don't
need to accumulate.

> Because... it simply doesn't work like that. Try it and check to see
> what Kconfig produces.

I did test it.. whatever you define last, sticks, and it's down to the
order they're parsed in the tree - luckily, arch/arm/Kconfig is sourced
first, which sources the mach/plat stuff way down at the bottom. As long
as you have your "default" set somewhere, any further default just has
to be sourced or added later in *one* of the Kconfigs, same as building
any C file with "gcc -E" and spitting it out. Someone, at the end of it
all, has to set some
Re: One of these things (CONFIG_HZ) is not like the others..
On Mon, Jan 21, 2013 at 4:45 PM, Russell King - ARM Linux wrote: > On Mon, Jan 21, 2013 at 10:30:07PM +, Arnd Bergmann wrote: >> On Monday 21 January 2013, Matt Sealey wrote: >> > So is that a bug in that it is not available to ARM right now, a bug >> > in that it would be impossible for anyone on ARM to have ever tested >> > this code, or a bug in that it should NEVER be enabled for ARM for >> > some reason? John? Ingo? :) >> > >> >> I think it's a bug that it's not available. That does not look intentional. > > What's a bug? kernel/Kconfig.hz not being available? No, it's > intentional. (See my replies). The bug I saw as real is that CONFIG_SCHED_HRTICK is defined only in kernel/Kconfig.hz (and used in kernel/sched only) - so if we want that functionality enabled we will also have to opencode it in arch/arm/Kconfig. Everyone else, by virtue of using kernel/Kconfig.hz, gets this config item enabled for free if they have hrtimers or generic smp helpers.. if I understood what John just said, this means on ARM, since we don't use kernel/Kconfig.hz and we don't also define an item for CONFIG_SCHED_HRTICK, the process scheduler is completely oblivious that we're running in HRT mode? The thing I don't know is real is if that really matters one bit.. -- Matt Sealey Product Development Analyst, Genesi USA, Inc.
Re: One of these things (CONFIG_HZ) is not like the others..
On Mon, Jan 21, 2013 at 4:36 PM, John Stultz wrote: > On 01/21/2013 01:14 PM, Matt Sealey wrote: >> >> On Mon, Jan 21, 2013 at 3:00 PM, John Stultz >> wrote: >>> >>> On 01/21/2013 12:41 PM, Arnd Bergmann wrote: >>>> >>>> Right. It's pretty clear that the above logic does not work >>>> with multiplatform. Maybe we should just make ARCH_MULTIPLATFORM >>>> select NO_HZ to make the question much less interesting. >>> >>> Although, even with NO_HZ, we still have some sense of HZ. >> >> In this case, no matter whether CONFIG_HZ=1000 or CONFIG_HZ=250 (for >> example) combined with CONFIG_NO_HZ and less than e.g. 250 things >> happening per second will wake up "exactly" the same number of times? > > Ideally, if both systems are completely idle, they may see a similar number of > actual interrupts. > > But when the cpus are running processes, the HZ=1000 system will see more > frequent interrupts, since the timer/scheduler interrupt will jump in 4 > times more frequently. Understood.. >> CONFIG_HZ=1000 with CONFIG_NO_HZ would be an effective, all-round >> solution here, then, and CONFIG_HZ=100 should be a reasonable default >> (as it is anyway with an otherwise-unconfigured kernel on any other >> platform) for !CONFIG_NO_HZ. > > Eeehhh... I'm not sure this follows. Okay, I'm happy to be wrong on this... >> As above, or "not select anything at all" since HZ=100 if you don't >> touch anything, right? > > Well, Russell brought up a case that doesn't handle this. If a system > *can't* do HZ=100, but can do HZ=200. > > Though there are hacks, of course, that might get around this (skip every > other interrupt at 200HZ). Hmm, I think people looking at this stuff (same as I stumbled into it) might appreciate a little comment on WHY the default is 200. That way you don't wonder even if you know why EBSA110 has a HZ=200 default, why Exynos is lumped in there too (to reduce the number of interrupts firing? 
Maybe the Exynos timer interrupt is kind of a horrid core NMI kind of thing and it's desirable for it not to be every millisecond, or maybe it has the same restrictions as EBSA110, but where would anyone go to find out this information?) >> If someone picks HZ=1000 and their platform can't support it, then >> that's their own damn problem (don't touch things you don't >> understand, right? ;) > > Well, ideally with kconfig we try to add proper dependencies so impossible > options aren't left to the user. > HZ is a common enough knob to turn on most systems, I don't know if leaving > the user rope to hang himself is a great idea. I think then the default 100 at the end of the arch/arm/Kconfig is saying "you are not allowed to know that such a thing as rope even exists," when in fact what we should be doing is just making sure they can't swing it over the rafters.. am I taking the analogy too far? :) >> My question really has to be is CONFIG_SCHED_HRTICK useful, what >> exactly is it going to do on ARM here since nobody can ever have >> enabled it? Is it going to keel over and explode if nobody registers a >> non-jiffies sched_clock (since the jiffies clock is technically >> reporting itself as a ridiculously high resolution clocksource..)? > > ??? Not following this at all. jiffies is the *MOST* coarse resolution > clocksource there is (at least that I'm aware of.. I recall someone wanting > to do a 60Hz clocksource, but I don't think that ever happened). Is that based on its clocksource rating (probably worse than a real hrtimer) or its reported resolution? Because on i.MX51 if I force it to use the jiffies clock the debug on the kernel log is telling me it has a higher resolution (it TELLS me that it ticks "as fast" as the CPU frequency and wraps less than my real timer). I know where the 60Hz clocksource might come from, the old Amiga platforms have one based on the PSU frequency (50Hz in Europe, 60Hz US/Japan). 
Even a 60Hz clocksource is useful though (on the Amiga, at least, it is precisely the vsync clock for synchronizing your display output on TV-out, which makes it completely useful for the framebuffer driver), but.. you just won't expect to assign it as sched_clock or your delay timer. And if anyone does I'd expect they'd know full well it'd not run so well. >> Or is this one of those things that if your platform doesn't have a >> real high resolution timer, you shouldn't enable HRTIMERS and >> therefore not enable SCHED_HRTICK as a result? That affects >> ARCH_MULTIPLATFORM here. Is the solution as simple as >> ARCH_MULTIPLATFORM compliant platforms kind of have to have a high >> resolution timer? Documentation to that effect?
Re: One of these things (CONFIG_HZ) is not like the others..
to 100 on its own, then that "default 100" is overly restrictive and we could remove it, allowing each {mach,plat}-*/Kconfig owner to investigate and find the correct HZ value and implement an override or selection, or just allow free configuration? As far as I can tell AT91 and SHMOBILE only supply defaults because HZ *must* meet some exact timer divisor (OMAP says "Kernel internal timer frequency should be a divisor of 32768") in which case their timer drivers should not be so stupid and instead round down to the nearest acceptable timer divisor or WARN_ON if the compile-time values are unacceptable at runtime before anyone sees any freakish behavior. Is it a hard requirement for the ARM architecture that a woefully mis-configured kernel MUST boot completely to userspace? -- Matt Sealey Product Development Analyst, Genesi USA, Inc.
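To make the "round down to the nearest acceptable timer divisor" idea concrete, here is a minimal userspace sketch for a 32.768 kHz timer input; the function name is made up, and a real fix would of course live in the platform timer driver:

```c
#include <assert.h>

/*
 * Hypothetical helper: given a requested HZ, return the largest value
 * not exceeding it that evenly divides the 32768 Hz timer input, i.e.
 * a tick rate the hardware can actually generate. Since 32768 is 2^15,
 * the result is always a power of two.
 */
unsigned int timer_clamp_hz(unsigned int requested)
{
	unsigned int hz;

	for (hz = requested; hz > 0; hz--)
		if (32768 % hz == 0)
			return hz;
	return 0;	/* requested == 0: nothing sensible to return */
}
```

With this, a mis-configured CONFIG_HZ=100 would quietly become 64, and CONFIG_HZ=1000 would become 512, rather than producing freakish runtime behavior.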
Re: One of these things (CONFIG_HZ) is not like the others..
On Mon, Jan 21, 2013 at 3:00 PM, John Stultz wrote: > On 01/21/2013 12:41 PM, Arnd Bergmann wrote: >> >> Right. It's pretty clear that the above logic does not work >> with multiplatform. Maybe we should just make ARCH_MULTIPLATFORM >> select NO_HZ to make the question much less interesting. > > Although, even with NO_HZ, we still have some sense of HZ. I wonder if you can confirm my understanding of this by the way? The way I think this works is; CONFIG_HZ on its own defines the rate at which the kernel wakes up from sleeping on the job, and checks for current or expired timer events such that it can do things like schedule_work (as in workqueues) or perform scheduler (as in processes/tasks) operations. CONFIG_NO_HZ turns on logic which effectively only wakes up at a *maximum* of CONFIG_HZ times per second, but otherwise will go to sleep and stay that way if no events actually happened (so, we rely on a timer interrupt popping up). In this case, no matter whether CONFIG_HZ=1000 or CONFIG_HZ=250 (for example) combined with CONFIG_NO_HZ and less than e.g. 250 things happening per second will wake up "exactly" the same number of times? CONFIG_HZ=1000 with CONFIG_NO_HZ would be an effective, all-round solution here, then, and CONFIG_HZ=100 should be a reasonable default (as it is anyway with an otherwise-unconfigured kernel on any other platform) for !CONFIG_NO_HZ. I have to admit, the only reason I noticed the above is because I was reading one of CK's BFS logs and reading it makes it seem like the above is the case, but I have no idea if he thinks BFS makes that the case or if the current CFS scheduler makes that the case, or if this is simply.. the case.. (can you see this is kind of confusing to me as this is basically not written anywhere except maybe an LWN article from 2008 I read up on? :) >> Regarding the defaults, I would suggest putting all the >> defaults into the defconfig files and removing the other hardcoding >> otherwise. 
Ben Dooks and Russell are probably the best to know >> what triggered the 200 HZ for s3c24xx and for ebsa110. My guess >> is that the other samsung ones are the result of cargo cult >> programming. >> >> at91 and omap set the HZ value to something that is derived >> from their hardware timer, but we have also forever had logic >> to calculate the exact time when that does not match. This code >> has very recently been moved into the new register_refined_jiffies() >> function. John can probably tell us if this solves all the problems >> for these platforms. > > > Yea, as far as timekeeping is concerned, we shouldn't be HZ dependent (and > the register_refined_jiffies is really only necessary if you're not > expecting a proper clocksource to eventually be registered), assuming the > hardware can do something close to the HZ value requested. > > So I'd probably want to hear about what history caused the specific 200 HZ > selections, as I suspect there's actual hardware limitations there. So if > you can not get actual timer ticks any faster than 200 HZ on that hardware, > setting HZ higher could cause some jiffies related timer trouble (ie: if the > kernel thinks HZ is 1000 but the hardware can only do 200, that's a > different problem than if the hardware actually can only do 999.8 HZ). So > things like timer-wheel timeouts may not happen when they should. > > I suspect the best approach for multi-arch in those cases may be to select > HZ=100 As above, or "not select anything at all" since HZ=100 if you don't touch anything, right? If someone picks HZ=1000 and their platform can't support it, then that's their own damn problem (don't touch things you don't understand, right? ;) > and use HRT to allow more modern systems to have finer-grained > timers. My question really has to be is CONFIG_SCHED_HRTICK useful, what exactly is it going to do on ARM here since nobody can ever have enabled it? 
Is it going to keel over and explode if nobody registers a non-jiffies sched_clock (since the jiffies clock is technically reporting itself as a ridiculously high resolution clocksource..)? Or is this one of those things that if your platform doesn't have a real high resolution timer, you shouldn't enable HRTIMERS and therefore not enable SCHED_HRTICK as a result? That affects ARCH_MULTIPLATFORM here. Is the solution as simple as ARCH_MULTIPLATFORM compliant platforms kind of have to have a high resolution timer? Documentation to that effect? -- Matt Sealey Product Development Analyst, Genesi USA, Inc.
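The HZ vs. NO_HZ understanding being checked earlier in this thread can be boiled down to a toy model; this is an illustration of the question being asked, not how the kernel's tick code is actually structured:

```c
#include <assert.h>

/*
 * Toy model only: approximate timer wakeups per second for a periodic
 * tick (CONFIG_HZ alone) vs. a dynamic tick (CONFIG_NO_HZ), given how
 * many distinct "things" actually need servicing each second.
 */
unsigned int wakeups_per_sec(unsigned int hz, int no_hz, unsigned int events)
{
	if (!no_hz)
		return hz;			/* periodic: wake HZ times regardless */
	return events < hz ? events : hz;	/* dynamic: sleep through idle ticks */
}
```

Under this model, HZ=1000 and HZ=250 with NO_HZ converge to the same wakeup count whenever the event rate is at or below both, which is exactly the "wake up 'exactly' the same number of times" question; without NO_HZ they differ by 4x no matter how idle the system is.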
Re: One of these things (CONFIG_HZ) is not like the others..
On Mon, Jan 21, 2013 at 2:41 PM, Arnd Bergmann wrote: > On Monday 21 January 2013, Matt Sealey wrote: >> >> ARM seems to be the only "major" platform not using the >> kernel/Kconfig.hz definitions, instead rolling its own and setting >> what could be described as both reasonable and unreasonable defaults >> for platforms. If we're going wholesale for multiplatform on ARM then >> having CONFIG_HZ be selected dependent on platform options seems >> rather curious since building a kernel for Exynos, OMAP or so will >> force the default to a value which is not truly desired by the >> maintainers. > > Agreed 100%. > > (adding John Stultz to Cc, he's the local time expert) Hi, John! Welcome to the fray :) >> config HZ >> int >> default 200 if ARCH_EBSA110 || ARCH_S3C24XX || ARCH_S5P64X0 || \ >> ARCH_S5PV210 || ARCH_EXYNOS4 >> default OMAP_32K_TIMER_HZ if ARCH_OMAP && OMAP_32K_TIMER >> default AT91_TIMER_HZ if ARCH_AT91 >> default SHMOBILE_TIMER_HZ if ARCH_SHMOBILE >> default 100 >> [snip] >> Either way, if I boot a kernel on i.MX6, CONFIG_HZ depends on the >> other ARM platforms I also want to boot on it.. this is not exactly >> multiplatform compliant, right? > > Right. It's pretty clear that the above logic does not work > with multiplatform. Maybe we should just make ARCH_MULTIPLATFORM > select NO_HZ to make the question much less interesting. > > Regarding the defaults, I would suggest putting all the > defaults into the defconfig files and removing the other hardcoding > otherwise. Ben Dooks and Russell are probably the best to know > what triggered the 200 HZ for s3c24xx and for ebsa110. My guess > is that the other samsung ones are the result of cargo cult > programming. > > at91 and omap set the HZ value to something that is derived > from their hardware timer, but we have also forever had logic > to calculate the exact time when that does not match. This code > has very recently been moved into the new register_refined_jiffies() > function. 
John can probably tell us if this solves all the problems > for these platforms. I would be very interested. My plan would be then (providing John responds in the affirmative) to basically submit a patch to remove the 8 lines pasted above and source kernel/Kconfig.hz instead. I'm doing this now on a local kernel tree and I can't see any real problem with it. It would then be up to the above-mentioned maintainers to decide if they are part of the cargo cult and don't need it or refine their board files to match the New World Order of using Kconfig.hz. The unconfigured kernel default is 100 anyway which is lower than all the above default settings, so I would technically be causing a regression on those platforms... do I want to be responsible for that? Probably not, but as I said, it's not affecting (in fact, it may be *improving*) the platforms I care about. >> Additionally, using kernel/Kconfig.hz is a predicate for enabling >> (forced enabling, even) CONFIG_SCHED_HRTICK which is defined nowhere >> else. I don't know how many ARM systems here benefit from this, if >> there is a benefit, or what this really means.. if you really have a >> high resolution timer (and hrtimers enabled) that would assist the >> scheduler this way, is it supposed to make a big difference to the way >> the scheduler works for the better or worse? Is this actually >> overridden by ARM sched_clock handling or so? Shouldn't there be a >> help entry or some documentation for what this option does? I have >> CC'd the scheduler maintainers because I'd really like to know what I >> am doing here before I venture into putting patches out which could >> potentially rip open spacetime and have us all sucked in.. > > Yes, that sounds like yet another bug. So is that a bug in that it is not available to ARM right now, a bug in that it would be impossible for anyone on ARM to have ever tested this code, or a bug in that it should NEVER be enabled for ARM for some reason? John? Ingo? 
:) -- Matt Sealey Product Development Analyst, Genesi USA, Inc.
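For clarity, the plan stated above (remove the 8-line HZ block and source kernel/Kconfig.hz) would amount to roughly this against arch/arm/Kconfig; this is a sketch of the intent, not a submitted patch:

```
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@
-config HZ
-	int
-	default 200 if ARCH_EBSA110 || ARCH_S3C24XX || ARCH_S5P64X0 || \
-		ARCH_S5PV210 || ARCH_EXYNOS4
-	default OMAP_32K_TIMER_HZ if ARCH_OMAP && OMAP_32K_TIMER
-	default AT91_TIMER_HZ if ARCH_AT91
-	default SHMOBILE_TIMER_HZ if ARCH_SHMOBILE
-	default 100
+source "kernel/Kconfig.hz"
```

Platforms with genuine hardware HZ restrictions (EBSA110, the OMAP 32k timer, etc.) would then need their own override or a runtime check, which is the open question in this thread.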
One of these things (CONFIG_HZ) is not like the others..
y the scheduler works for the better or worse? Is this actually overridden by ARM sched_clock handling or so? Shouldn't there be a help entry or some documentation for what this option does? I have CC'd the scheduler maintainers because I'd really like to know what I am doing here before I venture into putting patches out which could potentially rip open spacetime and have us all sucked in.. And I guess I have one more question before I do attempt to open that tear, what really is the effect of CONFIG_HZ vs. CONFIG_NO_HZ vs. ARM sched_clock, and usage of the new hooks to register a real timer as ARM delay_timer? I have patches I can modify for upstream that add both device tree implementation and probing of i.MX highres clocksources (GPT and EPIT) and registration of sched_clock and delay timer implementations based on these clocks, but while the code compiles and seems to work, the ACTUAL effect of these (and the fundamental requirements for the clocks being used) seems to be information only in the minds of the people who wrote the code. It's not that obvious to me what the true effect of using a non-architected ARM core timer for at least the delay_timer is, and I have some really odd lpj values and very strange re-calibrations popping out (with constant rate for the timer, lpj goes down.. when using the delay_timer implementation, shouldn't lpj be still relative to the timer rate and NOT cpu frequency?) when using cpufreq on i.MX5 when I do it, and whether CONFIG_SCHED_HRTICK is a good or bad idea.. Apologies for the insane number of questions here, but fully appreciative of any answers, -- Matt Sealey Product Development Analyst, Genesi USA, Inc.
[PATCH] mc13892: sanity check num_regulators parsed vs. registered
Imagine a situation where a device tree has a few regulators in an appropriate node: regulators { sw1 { .. }; vvideo { .. }; : vfake { .. }; vtypo { .. }; }; In the above example, the node name "vfake" is an attempt to match a regulator name inside the driver which just so happens to not exist. The node name "vtypo" represents an accidental typographical error in a regulator name which may have been introduced to a device tree. In these cases, the number of regulators the mc13892 driver thinks it has does not match the number of regulators it parsed and registered. Since it will go over this array based on this number, it will actually re-register regulator "0" (which happens to be SW1) over and over again until it reaches the number, resulting in messages on the kernel log such as these: SW1: at 1100 mV VVIDEO: at 2775mV : SW1: at 1100 mV SW1: at 1100 mV .. up to that number of "mismatched" regulators. Nobody using DT can/will consume these regulators, so it should not be possible for it to cause any real regulator problems or driver breakages, but it is an easy thing to miss in a kernel log and is an immediate indication of a problem with the device tree authoring. This patch effectively sanity checks the number of counted children of the regulators node vs. the number that actually matched driver names, and sets the appropriate num_regulators value. It also gives a little warning for device tree authors that they MAY have screwed something up, such that this patch does not hide the device tree authoring problem. 
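The counting mismatch described above can be sketched outside the kernel like so; the node and regulator names are illustrative (the real driver walks of_node children rather than string arrays, and `count_matched` is a made-up helper, not a driver function):

```c
#include <assert.h>
#include <string.h>

/*
 * Sketch: count how many device-tree child-node names actually match a
 * driver's known regulator names. With "vfake"/"vtypo" present, this
 * comes up short of the raw child count -- the condition the patch
 * warns about and corrects num_regulators for.
 */
int count_matched(const char *const *nodes, int n_nodes,
		  const char *const *known, int n_known)
{
	int i, j, matched = 0;

	for (i = 0; i < n_nodes; i++)
		for (j = 0; j < n_known; j++)
			if (strcmp(nodes[i], known[j]) == 0) {
				matched++;
				break;
			}
	return matched;
}

/* Example data mirroring the scenario in the commit message. */
const char *const dt_nodes[]     = { "sw1", "vvideo", "vfake", "vtypo" };
const char *const driver_names[] = { "sw1", "vvideo", "viohi", "vdig" };
```

Here 4 children are counted but only 2 parse, so without the fix the probe loop registers the first regulator (SW1) twice more to fill the gap.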
Signed-off-by: Matt Sealey Tested-by: Steev Klimaszewski Cc: Shawn Guo Cc: Fabio Estevam --- drivers/regulator/mc13892-regulator.c | 39 ++-- drivers/regulator/mc13xxx-regulator-core.c | 10 +-- drivers/regulator/mc13xxx.h | 4 +-- 3 files changed, 47 insertions(+), 6 deletions(-) diff --git a/drivers/regulator/mc13892-regulator.c b/drivers/regulator/mc13892-regulator.c index e4f2197..7c86040 100644 --- a/drivers/regulator/mc13892-regulator.c +++ b/drivers/regulator/mc13892-regulator.c @@ -535,15 +535,18 @@ static int mc13892_regulator_probe(struct platform_device *pdev) struct mc13xxx_regulator_init_data *mc13xxx_data; struct regulator_config config = { }; int i, ret; - int num_regulators = 0; + int num_regulators = 0, num_parsed; u32 val; num_regulators = mc13xxx_get_num_regulators_dt(pdev); + if (num_regulators <= 0 && pdata) num_regulators = pdata->num_regulators; if (num_regulators <= 0) return -EINVAL; + num_parsed = num_regulators; + priv = devm_kzalloc(&pdev->dev, sizeof(*priv) + num_regulators * sizeof(priv->regulators[0]), GFP_KERNEL); @@ -586,7 +589,39 @@ static int mc13892_regulator_probe(struct platform_device *pdev) = mc13892_vcam_get_mode; mc13xxx_data = mc13xxx_parse_regulators_dt(pdev, mc13892_regulators, - ARRAY_SIZE(mc13892_regulators)); + ARRAY_SIZE(mc13892_regulators), + &num_parsed); + + /* +* Perform a little sanity check on the regulator tree - if we found +* a number of regulators from mc13xxx_get_num_regulators_dt and +* then parsed a smaller number in mc13xxx_parse_regulators_dt then +* there is a regulator defined in the regulators node which has +* not matched any usable regulator in the driver. In this case, +* there is one missing and what will happen is the first regulator +* will get registered again. +* +* Fix this by basically making our number of registerable regulators +* equal to the number of regulators we parsed. We are allocating +* too much memory for priv, but this is unavoidable at this point. 
+* +* As an example of how this can happen, try making a typo in your +* regulators node (vviohi {} instead of viohi {}) so that the name +* does not match.. +* +* The check will basically pass for platform data (non-DT) because +* mc13xxx_parse_regulators_dt for !CONFIG_OF will not touch num_parsed. +* +*/ + if (num_parsed != num_regulators) { + dev_warn(&pdev->dev, + "parsed %d != regulators %d - check your device tree!\n", + num_parsed, num_regulators); + + num_regulators = num_parsed; + priv->num_regulators = num_regulators; + } + for (i = 0; i < num_regulators; i++) { struct regulator_init_data *init_data; st
[PATCH] mc13892-regulator: correct/refine handling of the SWxHI bit
MC13892 PMIC supports a "HI" bit for 3 of its 4 buck switcher outputs, which enables a higher set of voltage ranges. Despite a comment in the code ('sw regulators need special care due to the "hi" bit'), it actually does not take special care since it does not modify its use of the selector table index when this bit is set, giving us very odd behavior when setting a high voltage on supported switchers or listing current voltages. Net effect is in best case the kernel and sysfs report lower voltages than are actually set in hardware (1300mV instead of 1800mV for example) and in the worst case setting a voltage (e.g. 1800mV) will cause an undervoltage condition (e.g. 1300mV). Correct the behavior, taking into account SW1 doesn't support the HI bit, and as such we need to ignore it. While we are modifying these functions, fix and optimize the following; * set_voltage_sel callback was using .reg instead of .vsel_reg - since they were set to the same value it actually didn't break anything but it would be semantically incorrect to use .reg in this case. We now use .vsel_reg and be consistent. * vsel_shift is always 0 for every SWx regulator, and constantly shifting and masking off the bottom few bits is time consuming and makes the code very hard to read - optimize this out. * get_voltage_sel uses the variable "val" and set_voltage_sel uses the variable "selector" (and reg_value). Introduce the variable "selector" to get_voltage_sel such that it makes more sense and allow some leaner code in light of the modifications in this patch. Add better exposure to the debug print so the register value AND the selector are printed as this will adequately show the HI bit in the register. * correct a comment in probe which is doing a version check. Magic values are awful but for one instance, a comment does just as good a job as something symbolic. 
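The selector arithmetic the fix implements can be illustrated standalone. The mask and bit position below are stand-in assumptions for the driver's vsel_mask and MC13892_SWITCHERS0_SWxHI; only the offset-by-20 behaviour (and SW1 ignoring HI) is the point:

```c
#include <assert.h>

#define SWX_VSEL_MASK	  0x1f	    /* stand-in for mc13892_regulators[id].vsel_mask */
#define SWX_HI_BIT	  (1u << 23) /* stand-in for MC13892_SWITCHERS0_SWxHI */
#define SWX_HI_SEL_OFFSET 20	    /* MC13892_SWxHI_SEL_OFFSET from the patch */

/*
 * Sketch of the corrected get_voltage_sel logic: when HI is set (and
 * the regulator is not SW1, which has no HI range), the returned
 * selector indexes the upper part of the voltage table instead of
 * silently aliasing a low-range entry.
 */
int sw_selector(unsigned int reg_val, int is_sw1)
{
	int selector = reg_val & SWX_VSEL_MASK;

	if (!is_sw1 && (reg_val & SWX_HI_BIT))
		selector += SWX_HI_SEL_OFFSET;
	return selector;
}
```

So a register reading selector 5 with HI set reports table entry 25 (the high range) on SW2-SW4, while the unfixed code would report entry 5 and under-read the voltage.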
Signed-off-by: Matt Sealey Tested-by: Steev Klimaszewski Cc: Shawn Guo Cc: Fabio Estevam --- drivers/regulator/mc13892-regulator.c | 72 + 1 file changed, 56 insertions(+), 16 deletions(-) diff --git a/drivers/regulator/mc13892-regulator.c b/drivers/regulator/mc13892-regulator.c index 0d84b1f..169457b 100644 --- a/drivers/regulator/mc13892-regulator.c +++ b/drivers/regulator/mc13892-regulator.c @@ -164,6 +164,14 @@ static const unsigned int mc13892_sw1[] = { 1350000, 1375000 }; +/* + * Note: this table is used to derive SWxVSEL by index into + * the array. Offset the values by the index of 1100000 uV + * to get the actual register value for that voltage selector + * if the HI bit is to be set as well. + */ +#define MC13892_SWxHI_SEL_OFFSET 20 + static const unsigned int mc13892_sw[] = { 600000, 625000, 650000, 675000, 700000, 725000, 750000, 775000, 800000, 825000, 850000, 875000, @@ -239,7 +247,6 @@ static const unsigned int mc13892_pwgtdrv[] = { }; static struct regulator_ops mc13892_gpo_regulator_ops; -/* sw regulators need special care due to the "hi bit" */ static struct regulator_ops mc13892_sw_regulator_ops; @@ -396,7 +403,7 @@ static int mc13892_sw_regulator_get_voltage_sel(struct regulator_dev *rdev) { struct mc13xxx_regulator_priv *priv = rdev_get_drvdata(rdev); int ret, id = rdev_get_id(rdev); - unsigned int val; + unsigned int val, selector; dev_dbg(rdev_get_dev(rdev), "%s id: %d\n", __func__, id); @@ -407,12 +414,28 @@ static int mc13892_sw_regulator_get_voltage_sel(struct regulator_dev *rdev) if (ret) return ret; - val = (val & mc13892_regulators[id].vsel_mask) - >> mc13892_regulators[id].vsel_shift; + /* +* Figure out if the HI bit is set inside the switcher mode register +* since this means the selector value we return is at a different +* offset into the selector table. +* +* According to the MC13892 documentation note 59 (Table 47) the SW1 +* buck switcher does not support output range programming therefore +* the HI bit must always remain 0. 
So do not do anything strange if +* our register is MC13892_SWITCHERS0. +*/ + + selector = val & mc13892_regulators[id].vsel_mask; + + if ((mc13892_regulators[id].vsel_reg != MC13892_SWITCHERS0) && + (val & MC13892_SWITCHERS0_SWxHI)) { + selector += MC13892_SWxHI_SEL_OFFSET; + } - dev_dbg(rdev_get_dev(rdev), "%s id: %d val: %d\n", __func__, id, val); + dev_dbg(rdev_get_dev(rdev), "%s id: %d val: 0x%08x selector: %d\n", + __func__, id, val, selector); - return val; + return selector; } static int mc13892_sw_regulator_set_voltage_sel(struct regulator_dev *rdev, @@ -425,18 +448,35 @@ static int mc13892_sw_re
[PATCH] crypto: fix FTBFS with ARM SHA1-asm and THUMB2_KERNEL
The optimized assembler SHA1 code for ARM does not conform to Thumb2 register usage requirements, so it cannot be built when the kernel is configured with THUMB2_KERNEL. Fix the FTBFS for now by preventing misconfigurations of the kernel. Signed-off-by: Matt Sealey --- crypto/Kconfig |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/crypto/Kconfig b/crypto/Kconfig index 4641d95..304d60b 100644 --- a/crypto/Kconfig +++ b/crypto/Kconfig @@ -472,7 +472,7 @@ config CRYPTO_SHA1_SPARC64 config CRYPTO_SHA1_ARM tristate "SHA1 digest algorithm (ARM-asm)" - depends on ARM + depends on ARM && !THUMB2_KERNEL select CRYPTO_SHA1 select CRYPTO_HASH help -- 1.7.10.4
Re: Compilation problem with drivers/staging/zsmalloc when !SMP on ARM
On Fri, Jan 18, 2013 at 10:46 PM, Konrad Rzeszutek Wilk wrote: > On Fri, Jan 18, 2013 at 07:11:32PM -0600, Matt Sealey wrote: >> On Fri, Jan 18, 2013 at 3:08 PM, Russell King - ARM Linux >> >> I'm gonna put this out to the maintainers (Konrad, and Seth since he >> committed it) that if this code is buggy it gets taken back out, even >> if it makes zsmalloc "slow" on ARM, for the following reasons: > > Just to make sure I understand, you mean don't use page table > mapping but instead use copying? Yes, just back out the USE_PGTABLE_MAPPING code. But as I just replied with Minchan, maybe there is a better way. The only real problem here apart from the non-per-cpu usage Russell describes (which does not affect UP systems anyway) is that without CONFIG_SMP we have a FTBFS. However I am sure you agree on the odd fix of enabling pagetable mapping optimization only on "high end" systems and leaving the low end using the slow path. It *is* odd. Also, my rudimentary patch for disabling the code on !CONFIG_SMP is just *hiding* a misuse of a non-exported mm function... The FTBFS showed the problem, I don't want the fix to be to hide it, which is why I brought it up. >> * It's buggy on SMP as Russell describes above >> * It might not be buggy on UP (opposite to Russell's description above >> as the restrictions he states do not exist), but that would imply an >> export for a really core internal MM function nobody should be using >> anyway >> * By that assessment, using that core internal MM function on SMP is >> also bad voodoo that zsmalloc should not be doing > > 'local_tlb_flush' is bad voodoo? See previous mail to Minchan; local_tlb_flush_kernel_range calls cpu_tlb.flush_kernel_range on SMP, but a direct function call ("glue(_TLB, flush_kernel_range)" which resolves to v7wbi_flush_kernel_range etc. etc.) without CONFIG_SMP. That direct function call it resolves to is not an export and Russell just said he won't accept a patch to export it. 
> If you have an ARM server that you would be willing to part with I would > be thrilled to look at it. >> diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c >> b/drivers/staging/zsmalloc/zsmalloc-main.c >> index 09a9d35..ecf75fb 100644 > > --- a/drivers/staging/zsmalloc/zsmalloc-main.c >> +++ b/drivers/staging/zsmalloc/zsmalloc-main.c >> @@ -228,7 +228,7 @@ struct zs_pool { >> * mapping rather than copying >> * for object mapping. >> */ >> -#if defined(CONFIG_ARM) >> +#if defined(CONFIG_ARM) && defined(CONFIG_SMP) >> #define USE_PGTABLE_MAPPING >> #endif >> >> .. such that it even compiles in both "guess" configurations, the >> slower Cortex-A8 600MHz single core system gets to use the slow copy >> path and the dual-core 1GHz+ Cortex-A9 (with twice the RAM..) gets to >> use the fast mapping path. Essentially all the patch does is "improve >> performance" on the fastest, best-configured, large-amounts-of-RAM, >> lots-of-CPU-performance ARM systems (OMAP4+, Snapdragon, Exynos4+, >> marvell armada, i.MX6..) while introducing the problems Russell >> describes, and leave performance exactly the same and potentially far >> more stable on the slower, memory-limited ARM machines. > > Any ideas on how to detect that? Nope, sorry. It would rely on knowing the precise configuration *and* the user intent of using zram which is far beyond the scope of zram or zsmalloc itself. It could be a VM, a phone with only 256MB RAM and slow storage, it could be a device with 2TB RAM but no fast local storage.. whether using it as a compressed block device or a swap device that is compressed, you can never know inside the driver what the real use it except the kernel build time config. All we can know in zram, at build time, is whether we're configuring for SMP, UP-on-SMP, or UP (!SMP) and which code path makes more sense to build (ideally, SMP and UP would run the pagetable mapping code alike). 
If we can make the pagetable mapping code compile on UP without the above patch being required, and it has the same effect, then this would be the best solution. Then the code needs to be fixed for proper operation on SMP anyway. If there are still a bunch of export problems with an alternate method of flushing the tlb for a range of kernel memory exposed by trying a different way around, this is just proving an issue here that the ARM guys disagree that things that can be built as modules should be doing such things, or that the cpu_tlb.flush_kernel_range vs. v7wbi_tlb_flush_kernel_range export thing is confusing as crap at the very least in that the CPU topology model the kernel lives by and is compiled for at build time causes build breakages if you don't want (or have) your
Re: Compilation problem with drivers/staging/zsmalloc when !SMP on ARM
Hi Minchan, On Sun, Jan 20, 2013 at 11:55 PM, Minchan Kim wrote: > On Fri, Jan 18, 2013 at 11:46:02PM -0500, Konrad Rzeszutek Wilk wrote: >> On Fri, Jan 18, 2013 at 07:11:32PM -0600, Matt Sealey wrote: >> > >> > diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c >> > b/drivers/staging/zsmalloc/zsmalloc-main.c >> > index 09a9d35..ecf75fb 100644 >> > --- a/drivers/staging/zsmalloc/zsmalloc-main.c >> > +++ b/drivers/staging/zsmalloc/zsmalloc-main.c >> > @@ -228,7 +228,7 @@ struct zs_pool { >> > * mapping rather than copying >> > * for object mapping. >> > */ >> > -#if defined(CONFIG_ARM) >> > +#if defined(CONFIG_ARM) && defined(CONFIG_SMP) >> > #define USE_PGTABLE_MAPPING > > I don't get it. How to prevent the problem Russell described? > The problem is that other CPU can prefetch _speculatively_ under us. It prevents no problems, but if that isn't there, kernels built without SMP support (i.e. specifically uniprocessor kernels) will fail at the linker stage. That's not desirable. We have 3 problems here, this solves the first of them, and creates the third. The second is constant regardless.. 1) zsmalloc will not build on ARM without CONFIG_SMP because on UP local_tlb_flush_kern_range uses a function which uses an export which isn't required on SMP Basically, with CONFIG_SMP (and CONFIG_UP_ON_SMP), local_tlb_flush_kern_range is calling functions by dereference from the per-cpu global cpu_tlb structure. On UP (!CONFIG_SMP), it is calling functions directly (in my case, v7wbi_local_tlb_flush_kern_range or whatever, but on v6, v5, v4 ARM processor kernel builds it may be different) which need to be exported outside of the MM core. If this difference is going to stick around - Russell is refusing here to export that/those direct functions - then the optimized vm mapping code simply should never be allowed to run on non-SMP systems to keep it building for everyone. 
The patch above is simply a build fix for !CONFIG_SMP in this case, to force it to use the slow path for systems where the above missing export problem will cause the linker failure. 2) the optimized vm mapping isn't per-cpu aware, as per Russell's arguments. I'll let you guys discuss that as I have no idea what the real implications are for SMP systems (my current testing is only on a non-SMP CPU; I will have to go grab a couple of boards from the lab for SMP) 3) it somewhat defeats the purpose of the optimization if UP systems (which tend to have less memory and might benefit from things like zsmalloc/zram more) cannot use it, but SMP systems - which tend to have more memory - can (unless we're talking about a frugal configuration of a virtual machine, perhaps). Given the myriad use cases of zram that is not a pervasive or persuasive argument, I know, but it does seem slightly backwards. > If I don't miss something, we could have 2 choices. > > 1) use flush_tlb_kernel_range instead of local_flush_tlb_kernel_range > Or > 2) use only memory copy > > I guess everybody wants 2 because it makes the code very simple and maintainable. > Even, recently Joonsoo sent an optimize patch. > Look at https://lkml.org/lkml/2013/1/16/68 so zram/zcache effect caused by 2 > would be minimized. > > But please give me the time and I will borrow a quad-core embedded target board > and test 1 on the phone with Seth's benchmark. In the meantime please take into account building a non-SMP kernel for this board and testing that; if there is a way to do the flush without using the particular function which uses the particular export that Russell will not export, then that would be better. Maybe for !CONFIG_SMP flush_tlb_kernel_range is doing the exact same job, and the real patch is not to disable the optimization with !CONFIG_SMP, but to additionally #if defined(CONFIG_SMP) around local_flush_tlb_kernel_range and alternatively for UP use flush_tlb_kernel_range which.. probably..
doesn't use that contentious export? This is far beyond the level I want to be digging around in the Linux kernel so I am not comfortable even trying that to find out. -- Matt Sealey Product Development Analyst, Genesi USA, Inc. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Compilation problem with drivers/staging/zsmalloc when !SMP on ARM
On Fri, Jan 18, 2013 at 10:46 PM, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote: On Fri, Jan 18, 2013 at 07:11:32PM -0600, Matt Sealey wrote: On Fri, Jan 18, 2013 at 3:08 PM, Russell King - ARM Linux I'm gonna put this out to the maintainers (Konrad, and Seth since he committed it) that if this code is buggy it gets taken back out, even if it makes zsmalloc slow on ARM, for the following reasons: Just to make sure I understand, you mean don't use page table mapping but instead use copying? Yes, just back out the USE_PGTABLE_MAPPING code. But as I just replied to Minchan, maybe there is a better way. The only real problem here, apart from the non-per-cpu usage Russell describes (which does not affect UP systems anyway), is that without CONFIG_SMP we have a FTBFS. However I am sure you agree it is an odd fix to enable the pagetable mapping optimization only on high-end systems and leave the low end using the slow path. It *is* odd. Also, my rudimentary patch for disabling the code on !CONFIG_SMP is just *hiding* a misuse of a non-exported mm function... The FTBFS showed the problem; I don't want the fix to be to hide it, which is why I brought it up. * It's buggy on SMP as Russell describes above * It might not be buggy on UP (opposite to Russell's description above, as the restrictions he states do not exist there), but that would imply an export for a really core internal MM function nobody should be using anyway * By that assessment, using that core internal MM function on SMP is also bad voodoo that zsmalloc should not be doing 'local_tlb_flush' is bad voodoo? See previous mail to Minchan; local_flush_tlb_kernel_range calls cpu_tlb.flush_kernel_range on SMP, but makes a direct function call (glue(_TLB, flush_kernel_range), which resolves to v7wbi_flush_kernel_range etc. etc.) without CONFIG_SMP. The direct function call it resolves to is not exported, and Russell just said he won't accept a patch to export it.
If you have an ARM server that you would be willing to part with I would be thrilled to look at it.

diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c b/drivers/staging/zsmalloc/zsmalloc-main.c
index 09a9d35..ecf75fb 100644
--- a/drivers/staging/zsmalloc/zsmalloc-main.c
+++ b/drivers/staging/zsmalloc/zsmalloc-main.c
@@ -228,7 +228,7 @@ struct zs_pool {
 * mapping rather than copying
 * for object mapping.
 */
-#if defined(CONFIG_ARM)
+#if defined(CONFIG_ARM) && defined(CONFIG_SMP)
 #define USE_PGTABLE_MAPPING
 #endif

.. such that it even compiles in both configurations: the slower Cortex-A8 600MHz single-core system gets to use the slow copy path and the dual-core 1GHz+ Cortex-A9 (with twice the RAM..) gets to use the fast mapping path. Essentially all the patch does is improve performance on the fastest, best-configured, large-amounts-of-RAM, lots-of-CPU-performance ARM systems (OMAP4+, Snapdragon, Exynos4+, Marvell Armada, i.MX6..) while introducing the problems Russell describes, and leave performance exactly the same (and potentially far more stable) on the slower, memory-limited ARM machines. Any ideas on how to detect that? Nope, sorry. It would rely on knowing the precise configuration *and* the user's intent in using zram, which is far beyond the scope of zram or zsmalloc itself. It could be a VM, a phone with only 256MB RAM and slow storage, or a device with 2TB RAM but no fast local storage.. whether it is used as a compressed block device or as compressed swap, you can never know inside the driver what the real use is except via the kernel build-time config. All we can know in zram, at build time, is whether we're configuring for SMP, UP-on-SMP, or UP (!SMP) and which code path makes more sense to build (ideally, SMP and UP would run the pagetable mapping code alike). If we can make the pagetable mapping code compile on UP without the above patch being required, and it has the same effect, then this would be the best solution.
Then the code needs to be fixed for proper operation on SMP anyway. If there are still a bunch of export problems with an alternate method of flushing the tlb for a range of kernel memory, exposed by trying a different way around, this just proves an issue here: the ARM guys disagree that things that can be built as modules should be doing such things, or at the very least the cpu_tlb.flush_kernel_range vs. v7wbi_tlb_flush_kernel_range export situation is confusing as crap, in that the CPU topology model the kernel lives by and is compiled for at build time causes build breakages if you don't want (or have) your driver to be knowledgeable of the differences :) Ideally we could build for SMP and UP and get the same code.. with fewer bugs. I get that you want to have this fixed right now. Somewhat. It is not urgent, since I fixed it for now in my tree; I am just worried that if that fix goes upstream it hides a real, more important issue here. I am mostly only concerned about whether zram
[PATCH] crypto: fix FTBFS with ARM SHA1-asm and THUMB2_KERNEL
The optimized assembler SHA1 code for ARM does not conform to Thumb2 register usage requirements, so it cannot be built when the kernel is configured with THUMB2_KERNEL. Fix the FTBFS for now by preventing this misconfiguration of the kernel.

Signed-off-by: Matt Sealey m...@genesi-usa.com
---
 crypto/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 4641d95..304d60b 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -472,7 +472,7 @@ config CRYPTO_SHA1_SPARC64
 config CRYPTO_SHA1_ARM
 	tristate "SHA1 digest algorithm (ARM-asm)"
-	depends on ARM
+	depends on ARM && !THUMB2_KERNEL
 	select CRYPTO_SHA1
 	select CRYPTO_HASH
 	help
--
1.7.10.4
[PATCH] mc13892-regulator: correct/refine handling of the SWxHI bit
MC13892 PMIC supports a HI bit for 3 of its 4 buck switcher outputs, which enables a higher set of voltage ranges. Despite a comment in the code ('sw regulators need special care due to the hi bit'), it actually does not take special care, since it does not modify its use of the selector table index when this bit is set, giving us very odd behavior when setting a high voltage on supported switchers or listing current voltages. The net effect is that, in the best case, the kernel and sysfs report lower voltages than are actually set in hardware (1300mV instead of 1800mV, for example) and, in the worst case, setting a voltage (e.g. 1800mV) will cause an undervoltage condition (e.g. 1300mV). Correct the behavior, taking into account that SW1 doesn't support the HI bit, and as such we need to ignore it. While we are modifying these functions, fix and optimize the following: * the set_voltage_sel callback was using .reg instead of .vsel_reg - since they were set to the same value it actually didn't break anything, but it would be semantically incorrect to use .reg in this case. We now use .vsel_reg and are consistent. * vsel_shift is always 0 for every SWx regulator, and constantly shifting and masking off the bottom few bits is time consuming and makes the code very hard to read - optimize this out. * get_voltage_sel uses the variable val and set_voltage_sel uses the variable selector (and reg_value). Introduce the variable selector to get_voltage_sel such that it makes more sense, and allow some leaner code in light of the modifications in this patch. Add better exposure to the debug print so the register value AND the selector are printed, as this will adequately show the HI bit in the register. * correct a comment in probe which is doing a version check. Magic values are awful, but for this one instance a comment does just as good a job as something symbolic.
Signed-off-by: Matt Sealey m...@genesi-usa.com
Tested-by: Steev Klimaszewski st...@genesi-usa.com
Cc: Shawn Guo shawn@linaro.org
Cc: Fabio Estevam fabio.este...@freescale.com
---
 drivers/regulator/mc13892-regulator.c | 72 +
 1 file changed, 56 insertions(+), 16 deletions(-)

diff --git a/drivers/regulator/mc13892-regulator.c b/drivers/regulator/mc13892-regulator.c
index 0d84b1f..169457b 100644
--- a/drivers/regulator/mc13892-regulator.c
+++ b/drivers/regulator/mc13892-regulator.c
@@ -164,6 +164,14 @@ static const unsigned int mc13892_sw1[] = {
 	1350000, 1375000
 };
 
+/*
+ * Note: this table is used to derive SWxVSEL by index into
+ * the array. Offset the values by the index of 1100000uV
+ * to get the actual register value for that voltage selector
+ * if the HI bit is to be set as well.
+ */
+#define MC13892_SWxHI_SEL_OFFSET 20
+
 static const unsigned int mc13892_sw[] = {
 	600000, 625000, 650000, 675000,
 	700000, 725000, 750000, 775000,
 	800000, 825000, 850000, 875000,
@@ -239,7 +247,6 @@ static const unsigned int mc13892_pwgtdrv[] = {
 };
 
 static struct regulator_ops mc13892_gpo_regulator_ops;
-/* sw regulators need special care due to the hi bit */
 static struct regulator_ops mc13892_sw_regulator_ops;
 
@@ -396,7 +403,7 @@ static int mc13892_sw_regulator_get_voltage_sel(struct regulator_dev *rdev)
 {
 	struct mc13xxx_regulator_priv *priv = rdev_get_drvdata(rdev);
 	int ret, id = rdev_get_id(rdev);
-	unsigned int val;
+	unsigned int val, selector;
 
 	dev_dbg(rdev_get_dev(rdev), "%s id: %d\n", __func__, id);
 
@@ -407,12 +414,28 @@ static int mc13892_sw_regulator_get_voltage_sel(struct regulator_dev *rdev)
 	if (ret)
 		return ret;
 
-	val = (val & mc13892_regulators[id].vsel_mask) >>
-			mc13892_regulators[id].vsel_shift;
+	/*
+	 * Figure out if the HI bit is set inside the switcher mode register
+	 * since this means the selector value we return is at a different
+	 * offset into the selector table.
+	 *
+	 * According to the MC13892 documentation note 59 (Table 47) the SW1
+	 * buck switcher does not support output range programming, therefore
+	 * the HI bit must always remain 0. So do not do anything strange if
+	 * our register is MC13892_SWITCHERS0.
+	 */
+
+	selector = val & mc13892_regulators[id].vsel_mask;
+
+	if ((mc13892_regulators[id].vsel_reg != MC13892_SWITCHERS0) &&
+	    (val & MC13892_SWITCHERS0_SWxHI)) {
+		selector += MC13892_SWxHI_SEL_OFFSET;
+	}
 
-	dev_dbg(rdev_get_dev(rdev), "%s id: %d val: %d\n", __func__, id, val);
+	dev_dbg(rdev_get_dev(rdev), "%s id: %d val: 0x%08x selector: %d\n",
+		__func__, id, val, selector);
 
-	return val;
+	return selector;
 }
 
 static int mc13892_sw_regulator_set_voltage_sel(struct regulator_dev *rdev,
@@ -425,18 +448,35 @@ static int mc13892_sw_regulator_set_voltage_sel(struct regulator_dev *rdev,
	volt
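The selector arithmetic the patch adds can be checked standalone. Below is a minimal userspace model; the field mask (0x1f) and the HI bit position (bit 23) are assumptions for illustration and not verified against the driver headers, but the offset of 20 and the SW1 exception come from the patch itself.

```c
#include <assert.h>

#define MC13892_SWxHI_SEL_OFFSET 20
#define SWITCHERS0_SWxHI (1u << 23)  /* assumed bit position of the HI bit */
#define VSEL_MASK 0x1fu              /* assumed selector field mask */

/* Model of the fixed get_voltage_sel(): when the HI bit is set, the selector
 * moves up by the index of the 1100000uV table entry (20); SW1 (is_sw1) does
 * not support the HI bit, so its selector is never offset. */
static unsigned int selector_from_reg(unsigned int val, int is_sw1)
{
    unsigned int selector = val & VSEL_MASK;

    if (!is_sw1 && (val & SWITCHERS0_SWxHI))
        selector += MC13892_SWxHI_SEL_OFFSET;

    return selector;
}
```

With the 25000uV-step table starting at 600000uV, selector 5 corresponds to 725000uV and selector 25 (the same register value with HI set) to 1225000uV, which is exactly the higher-range behavior the commit message describes.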
[PATCH] mc13892: sanity check num_regulators parsed vs. registered
Imagine a situation where a device tree has a few regulators in an appropriate node:

regulators {
	sw1 { .. };
	vvideo { .. };
	:
	vfake { .. };
	vtypo { .. };
};

In the above example, the node name "vfake" is an attempt to match a regulator name inside the driver which just so happens not to exist. The node name "vtypo" represents an accidental typographical error in a regulator name which may have been introduced to a device tree. In these cases, the number of regulators the mc13892 driver thinks it has does not match the number of regulators it parsed and registered. Since it will walk this array based on this number, it will actually re-register regulator 0 (which happens to be SW1) over and over again until it reaches the number, resulting in messages in the kernel log such as these:

SW1: at 1100 mV
VVIDEO: at 2775 mV
:
SW1: at 1100 mV
SW1: at 1100 mV

.. up to that number of mismatched regulators. Nobody using DT can/will consume these regulators, so it should not be possible for this to cause any real regulator problems or driver breakages, but it is an easy thing to miss in a kernel log and is an immediate indication of a problem with the device tree authoring. This patch effectively sanity checks the number of counted children of the regulators node vs. the number that actually matched driver names, and sets the appropriate num_regulators value. It also gives a little warning for device tree authors that they MAY have screwed something up, such that this patch does not hide the device tree authoring problem.
Signed-off-by: Matt Sealey m...@genesi-usa.com
Tested-by: Steev Klimaszewski st...@genesi-usa.com
Cc: Shawn Guo shawn@linaro.org
Cc: Fabio Estevam fabio.este...@freescale.com
---
 drivers/regulator/mc13892-regulator.c      | 39 ++--
 drivers/regulator/mc13xxx-regulator-core.c | 10 +--
 drivers/regulator/mc13xxx.h                |  4 +--
 3 files changed, 47 insertions(+), 6 deletions(-)

diff --git a/drivers/regulator/mc13892-regulator.c b/drivers/regulator/mc13892-regulator.c
index e4f2197..7c86040 100644
--- a/drivers/regulator/mc13892-regulator.c
+++ b/drivers/regulator/mc13892-regulator.c
@@ -535,15 +535,18 @@ static int mc13892_regulator_probe(struct platform_device *pdev)
 	struct mc13xxx_regulator_init_data *mc13xxx_data;
 	struct regulator_config config = { };
 	int i, ret;
-	int num_regulators = 0;
+	int num_regulators = 0, num_parsed;
 	u32 val;
 
 	num_regulators = mc13xxx_get_num_regulators_dt(pdev);
+
 	if (num_regulators <= 0 && pdata)
 		num_regulators = pdata->num_regulators;
 	if (num_regulators <= 0)
 		return -EINVAL;
 
+	num_parsed = num_regulators;
+
 	priv = devm_kzalloc(&pdev->dev,
 		sizeof(*priv) + num_regulators * sizeof(priv->regulators[0]),
 		GFP_KERNEL);
@@ -586,7 +589,39 @@ static int mc13892_regulator_probe(struct platform_device *pdev)
 		= mc13892_vcam_get_mode;
 
 	mc13xxx_data = mc13xxx_parse_regulators_dt(pdev, mc13892_regulators,
-					ARRAY_SIZE(mc13892_regulators));
+					ARRAY_SIZE(mc13892_regulators),
+					&num_parsed);
+
+	/*
+	 * Perform a little sanity check on the regulator tree - if we found
+	 * a number of regulators from mc13xxx_get_num_regulators_dt and
+	 * then parsed a smaller number in mc13xxx_parse_regulators_dt then
+	 * there is a regulator defined in the regulators node which has
+	 * not matched any usable regulator in the driver. In this case,
+	 * there is one missing and what will happen is the first regulator
+	 * will get registered again.
+	 *
+	 * Fix this by basically making our number of registerable regulators
+	 * equal to the number of regulators we parsed.
+	 * We are allocating
+	 * too much memory for priv, but this is unavoidable at this point.
+	 *
+	 * As an example of how this can happen, try making a typo in your
+	 * regulators node (vviohi {} instead of viohi {}) so that the name
+	 * does not match..
+	 *
+	 * The check will basically pass for platform data (non-DT) because
+	 * mc13xxx_parse_regulators_dt for !CONFIG_OF will not touch num_parsed.
+	 */
+	if (num_parsed != num_regulators) {
+		dev_warn(&pdev->dev,
+			"parsed %d != regulators %d - check your device tree!\n",
+			num_parsed, num_regulators);
+
+		num_regulators = num_parsed;
+		priv->num_regulators = num_regulators;
+	}
+
 	for (i = 0; i < num_regulators; i++) {
 		struct regulator_init_data *init_data;
 		struct
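The failure mode the patch guards against is easy to model outside the kernel: count the children of the regulators node, then count how many of them matched a known driver name; the shortfall is exactly the num_parsed < num_regulators case that produced the repeated "SW1: at 1100 mV" log lines. A sketch with hypothetical helper names, not the driver's real API:

```c
#include <assert.h>
#include <string.h>

/* Regulator names the driver knows about (abbreviated, illustrative list). */
static const char *known[] = { "sw1", "vvideo", "viohi" };

/* Count how many device-tree child node names match a known regulator.
 * Nodes like "vfake" or "vtypo" are counted by the DT child walk but never
 * match a driver name, so matched < n signals a device tree authoring bug. */
static int count_matched(const char *const nodes[], int n)
{
    int matched = 0;
    for (int i = 0; i < n; i++)
        for (unsigned j = 0; j < sizeof(known) / sizeof(known[0]); j++)
            if (!strcmp(nodes[i], known[j]))
                matched++;
    return matched;
}
```

In the patch, the analogous comparison is num_parsed vs. num_regulators, and the fix is to shrink the registration loop to the matched count instead of silently re-registering entry 0.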
One of these things (CONFIG_HZ) is not like the others..
Shouldn't there be a help entry or some documentation for what this option does? I have CC'd the scheduler maintainers because I'd really like to know what I am doing here before I venture into putting patches out which could potentially rip open spacetime and have us all sucked in.. And I guess I have one more question before I do attempt to open that tear: what really is the effect of CONFIG_HZ vs. CONFIG_NO_HZ vs. ARM sched_clock, and usage of the new hooks to register a real timer as the ARM delay_timer? I have patches I can modify for upstream that add both device tree implementation and probing of i.MX highres clocksources (GPT and EPIT) and registration of sched_clock and delay timer implementations based on these clocks, but while the code compiles and seems to work, the ACTUAL effect of these (and the fundamental requirements for the clocks being used) seems to be information only in the minds of the people who wrote the code. It's not that obvious to me what the true effect of using a non-architected ARM core timer for at least the delay_timer is, and I get some really odd lpj values and very strange re-calibrations popping out when using cpufreq on i.MX5 (with a constant rate for the timer, lpj goes down.. when using the delay_timer implementation, shouldn't lpj be relative to the timer rate and NOT the cpu frequency?), nor whether CONFIG_SCHED_HRTICK is a good or bad idea.. Apologies for the insane number of questions here, but fully appreciative of any answers, -- Matt Sealey m...@genesi-usa.com Product Development Analyst, Genesi USA, Inc.
Re: One of these things (CONFIG_HZ) is not like the others..
On Mon, Jan 21, 2013 at 2:41 PM, Arnd Bergmann a...@arndb.de wrote: On Monday 21 January 2013, Matt Sealey wrote: ARM seems to be the only major platform not using the kernel/Kconfig.hz definitions, instead rolling its own and setting what could be described as both reasonable and unreasonable defaults for platforms. If we're going wholesale for multiplatform on ARM then having CONFIG_HZ be selected dependent on platform options seems rather curious, since building a kernel for Exynos, OMAP or so will force the default to a value which is not truly desired by the maintainers. Agreed 100%. (adding John Stultz to Cc, he's the local time expert) Hi, John! Welcome to the fray :)

config HZ
	int
	default 200 if ARCH_EBSA110 || ARCH_S3C24XX || ARCH_S5P64X0 || \
		ARCH_S5PV210 || ARCH_EXYNOS4
	default OMAP_32K_TIMER_HZ if ARCH_OMAP && OMAP_32K_TIMER
	default AT91_TIMER_HZ if ARCH_AT91
	default SHMOBILE_TIMER_HZ if ARCH_SHMOBILE
	default 100

[snip] Either way, if I boot a kernel on i.MX6, CONFIG_HZ depends on the other ARM platforms I also want to boot on it.. this is not exactly multiplatform compliant, right? Right. It's pretty clear that the above logic does not work with multiplatform. Maybe we should just make ARCH_MULTIPLATFORM select NO_HZ to make the question much less interesting. Regarding the defaults, I would suggest putting all the defaults into the defconfig files and removing the other hardcoding otherwise. Ben Dooks and Russell are probably the best to know what triggered the 200 HZ for s3c24xx and for ebsa110. My guess is that the other Samsung ones are the result of cargo cult programming. at91 and omap set the HZ value to something that is derived from their hardware timer, but we have also forever had logic to calculate the exact time when that does not match. This code has very recently been moved into the new register_refined_jiffies() function. John can probably tell us if this solves all the problems for these platforms. I would be very interested.
My plan would be then (providing John responds in the affirmative) to basically submit a patch to remove the 8 lines pasted above and source kernel/Kconfig.hz instead. I'm doing this now on a local kernel tree and I can't see any real problem with it. It would then be up to the above-mentioned maintainers to decide if they are part of the cargo cult and don't need it, or to refine their board files to match the New World Order of using Kconfig.hz. The unconfigured kernel default is 100 anyway, which is lower than all the above default settings, so I would technically be causing a regression on those platforms... do I want to be responsible for that? Probably not, but as I said, it's not affecting (in fact, it may be *improving*) the platforms I care about. Additionally, using kernel/Kconfig.hz is a predicate for enabling (forced enabling, even) CONFIG_SCHED_HRTICK, which is defined nowhere else. I don't know how many ARM systems here benefit from this, if there is a benefit, or what this really means.. if you really have a high resolution timer (and hrtimers enabled) that would assist the scheduler this way, is it supposed to make a big difference to the way the scheduler works, for better or worse? Is this actually overridden by ARM sched_clock handling or so? Shouldn't there be a help entry or some documentation for what this option does? I have CC'd the scheduler maintainers because I'd really like to know what I am doing here before I venture into putting patches out which could potentially rip open spacetime and have us all sucked in.. Yes, that sounds like yet another bug. So is that a bug in that it is not available to ARM right now, a bug in that it would be impossible for anyone on ARM to have ever tested this code, or a bug in that it should NEVER be enabled for ARM for some reason? John? Ingo? :) -- Matt Sealey m...@genesi-usa.com Product Development Analyst, Genesi USA, Inc.
Re: One of these things (CONFIG_HZ) is not like the others..
On Mon, Jan 21, 2013 at 3:00 PM, John Stultz john.stu...@linaro.org wrote: On 01/21/2013 12:41 PM, Arnd Bergmann wrote: Right. It's pretty clear that the above logic does not work with multiplatform. Maybe we should just make ARCH_MULTIPLATFORM select NO_HZ to make the question much less interesting. Although, even with NO_HZ, we still have some sense of HZ. I wonder if you can confirm my understanding of this, by the way? The way I think this works is: CONFIG_HZ on its own defines the rate at which the kernel wakes up from sleeping on the job and checks for current or expired timer events, such that it can do things like schedule_work (as in workqueues) or perform scheduler (as in processes/tasks) operations. CONFIG_NO_HZ turns on logic which effectively only wakes up at a *maximum* of CONFIG_HZ times per second, but otherwise will go to sleep and stay that way if no events actually happened (so, we rely on a timer interrupt popping up). In this case, no matter whether CONFIG_HZ=1000 or CONFIG_HZ=250 (for example), combined with CONFIG_NO_HZ and less than e.g. 250 things happening per second, it will wake up exactly the same number of times? CONFIG_HZ=1000 with CONFIG_NO_HZ would be an effective, all-round solution here, then, and CONFIG_HZ=100 should be a reasonable default (as it is anyway with an otherwise-unconfigured kernel on any other platform) for !CONFIG_NO_HZ. I have to admit, the only reason I noticed the above is because I was reading one of CK's BFS logs, and reading it makes it seem like the above is the case, but I have no idea if he thinks BFS makes that the case, or if the current CFS scheduler makes that the case, or if this is simply.. the case.. (can you see this is kind of confusing to me, as this is basically not written anywhere except maybe an LWN article from 2008 I read up on? :) Regarding the defaults, I would suggest putting all the defaults into the defconfig files and removing the other hardcoding otherwise.
Ben Dooks and Russell are probably the best to know what triggered the 200 HZ for s3c24xx and for ebsa110. My guess is that the other Samsung ones are the result of cargo cult programming. at91 and omap set the HZ value to something that is derived from their hardware timer, but we have also forever had logic to calculate the exact time when that does not match. This code has very recently been moved into the new register_refined_jiffies() function. John can probably tell us if this solves all the problems for these platforms. Yea, as far as timekeeping is concerned, we shouldn't be HZ dependent (and register_refined_jiffies is really only necessary if you're not expecting a proper clocksource to eventually be registered), assuming the hardware can do something close to the HZ value requested. So I'd probably want to hear about what history caused the specific 200 HZ selections, as I suspect there are actual hardware limitations there. So if you cannot get actual timer ticks any faster than 200 HZ on that hardware, setting HZ higher could cause some jiffies-related timer trouble (ie: if the kernel thinks HZ is 1000 but the hardware can only do 200, that's a different problem than if the hardware actually can only do 999.8 HZ). So things like timer-wheel timeouts may not happen when they should. I suspect the best approach for multi-arch in those cases may be to select HZ=100 As above, or not select anything at all since HZ=100 if you don't touch anything, right? If someone picks HZ=1000 and their platform can't support it, then that's their own damn problem (don't touch things you don't understand, right? ;) and use HRT to allow more modern systems to have finer-grained timers. My question really has to be is CONFIG_SCHED_HRTICK useful, and what exactly is it going to do on ARM here since nobody can ever have enabled it?
Is it going to keel over and explode if nobody registers a non-jiffies sched_clock (since the jiffies clock is technically reporting itself as a ridiculously high resolution clocksource..)? Or is this one of those things where, if your platform doesn't have a real high resolution timer, you shouldn't enable HRTIMERS and therefore not enable SCHED_HRTICK as a result? That affects ARCH_MULTIPLATFORM here. Is the solution as simple as saying ARCH_MULTIPLATFORM compliant platforms kind of have to have a high resolution timer? Documentation to that effect? -- Matt Sealey m...@genesi-usa.com Product Development Analyst, Genesi USA, Inc.
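John's HZ-mismatch point above can be made concrete with a little arithmetic. This is a model for illustration, not kernel code: a jiffies-based timeout is programmed in units of the configured HZ, but elapses in units of whatever tick rate the hardware actually delivers.

```c
#include <assert.h>

/* If the kernel believes HZ=config_hz but the tick really fires real_hz
 * times per second, a requested want_ms timeout stretches (or shrinks)
 * by a factor of config_hz / real_hz. */
static int actual_timeout_ms(int want_ms, int config_hz, int real_hz)
{
    int jiffies = want_ms * config_hz / 1000; /* ticks the kernel programs */
    return jiffies * 1000 / real_hz;          /* real time those ticks take */
}
```

So a kernel built with HZ=1000 on hardware that can only tick at 200 Hz turns a 10ms timeout into 50ms, which is exactly the timer-wheel misbehavior being warned about; when config_hz matches real_hz the timeout comes out as requested.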
Re: One of these things (CONFIG_HZ) is not like the others..
and find the correct HZ value and implement an override or selection, or just allow free configuration? As far as I can tell AT91 and SHMOBILE only supply defaults because HZ *must* meet some exact timer divisor (OMAP says Kernel internal timer frequency should be a divisor of 32768) in which case their timer drivers should not be so stupid and instead round down to the nearest acceptable timer divisor or WARN_ON if the compile-time values are unacceptable at runtime before anyone sees any freakish behavior. Is it a hard requirement for the ARM architecture that a woefully mis-configured kernel MUST boot completely to userspace?
Re: One of these things (CONFIG_HZ) is not like the others..
On Mon, Jan 21, 2013 at 4:36 PM, John Stultz john.stu...@linaro.org wrote: On 01/21/2013 01:14 PM, Matt Sealey wrote: On Mon, Jan 21, 2013 at 3:00 PM, John Stultz john.stu...@linaro.org wrote: On 01/21/2013 12:41 PM, Arnd Bergmann wrote: Right. It's pretty clear that the above logic does not work with multiplatform. Maybe we should just make ARCH_MULTIPLATFORM select NO_HZ to make the question much less interesting. Although, even with NO_HZ, we still have some sense of HZ. In this case, no matter whether CONFIG_HZ=1000 or CONFIG_HZ=250 (for example) combined with CONFIG_NO_HZ and less than e.g. 250 things happening per second will wake up exactly the same number of times? Ideally, if both systems are completely idle, they may see a similar number of actual interrupts. But when the cpus are running processes, the HZ=1000 system will see more frequent interrupts, since the timer/scheduler interrupt will jump in 4 times more frequently. Understood.. CONFIG_HZ=1000 with CONFIG_NO_HZ would be an effective, all-round solution here, then, and CONFIG_HZ=100 should be a reasonable default (as it is anyway with an otherwise-unconfigured kernel on any other platform) for !CONFIG_NO_HZ. Eeehhh... I'm not sure this follows. Okay, I'm happy to be wrong on this... As above, or not select anything at all since HZ=100 if you don't touch anything, right? Well, Russell brought up a case that doesn't handle this. If a system *can't* do HZ=100, but can do HZ=200. Though there are hacks, of course, that might get around this (skip every other interrupt at 200HZ). Hmm, I think it might be appreciated for people looking at this stuff (same as I stumbled into it) for a little comment on WHY the default is 200. That way you don't wonder even if you know why EBSA110 has a HZ=200 default, why Exynos is lumped in there too (to reduce the number of interrupts firing? 
Maybe the Exynos timer interrupt is kind of a horrid core NMI kind of thing and it's desirable for it not to be every millisecond, or maybe it has the same restrictions as EBSA110, but where would anyone go to find out this information?) If someone picks HZ=1000 and their platform can't support it, then that's their own damn problem (don't touch things you don't understand, right? ;) Well, ideally with kconfig we try to add proper dependencies so impossible options aren't left to the user. HZ is a common enough knob to turn on most systems, I don't know if leaving the user rope to hang himself is a great idea. I think then the default 100 at the end of the arch/arm/Kconfig is saying you are not allowed to know that such a thing as rope even exists, when in fact what we should be doing is just making sure they can't swing it over the rafters.. am I taking the analogy too far? :) My question really has to be is CONFIG_SCHED_HRTICK useful, what exactly is it going to do on ARM here since nobody can ever have enabled it? Is it going to keel over and explode if nobody registers a non-jiffies sched_clock (since the jiffies clock is technically reporting itself as a ridiculously high resolution clocksource..)? ??? Not following this at all. jiffies is the *MOST* coarse resolution clocksource there is (at least that I'm aware of.. I recall someone wanting to do a 60Hz clocksource, but I don't think that ever happened). Is that based on its clocksource rating (probably worse than a real hrtimer) or its reported resolution? Because on i.MX51 if I force it to use the jiffies clock the debug on the kernel log is telling me it has a higher resolution (it TELLS me that it ticks as fast as the CPU frequency and wraps less than my real timer). I know where the 60Hz clocksource might come from, the old Amiga platforms have one based on the PSU frequency (50Hz in Europe, 60Hz US/Japan). 
Even a 60Hz clocksource is useful though (on the Amiga, at least, it is precisely the vsync clock for synchronizing your display output on TV-out, which makes it completely useful for the framebuffer driver), but.. you just won't expect to assign it as sched_clock or your delay timer. And if anyone does I'd expect they'd know full well it'd not run so well. Or is this one of those things that if your platform doesn't have a real high resolution timer, you shouldn't enable HRTIMERS and therefore not enable SCHED_HRTICK as a result? That affects ARCH_MULTIPLATFORM here. Is the solution as simple as ARCH_MULTIPLATFORM compliant platforms kind of have to have a high resolution timer? Documentation to that effect? So HRTIMERS was designed to be build-time enabled, while still giving you a functioning system if it was booted on a system that didn't support clockevents. We boot with standard HZ, and only switch over to HRT mode if we have a proper clocksource and clockevent driver. Okay. I'm still a little confused as to what SCHED_HRTICK actually makes a difference to, though. From that description, we are booting
Re: One of these things (CONFIG_HZ) is not like the others..
On Mon, Jan 21, 2013 at 4:45 PM, Russell King - ARM Linux li...@arm.linux.org.uk wrote: On Mon, Jan 21, 2013 at 10:30:07PM +, Arnd Bergmann wrote: On Monday 21 January 2013, Matt Sealey wrote: So is that a bug in that it is not available to ARM right now, a bug in that it would be impossible for anyone on ARM to have ever tested this code, or a bug in that it should NEVER be enabled for ARM for some reason? John? Ingo? :) I think it's a bug that it's not available. That does not look intentional. What's a bug? kernel/Kconfig.hz not being available? No, it's intentional. (See my replies). The bug I saw as real is that CONFIG_SCHED_HRTICK is defined only in kernel/Kconfig.hz (and used in kernel/sched only) - so if we want that functionality enabled we will also have to opencode it in arch/arm/Kconfig. Everyone else, by virtue of using kernel/Kconfig.hz, gets this config item enabled for free if they have hrtimers or generic smp helpers.. if I understood what John just said, this means on ARM, since we don't use kernel/Kconfig.hz and we don't also define an item for CONFIG_SCHED_HRTICK, the process scheduler is completely oblivious that we're running in HRT mode? The thing I don't know is real is if that really matters one bit..
Re: One of these things (CONFIG_HZ) is not like the others..
On Mon, Jan 21, 2013 at 4:42 PM, Russell King - ARM Linux li...@arm.linux.org.uk wrote: On Mon, Jan 21, 2013 at 04:20:14PM -0600, Matt Sealey wrote: I am sorry it sounded as if I was being high and mighty about not being able to select my own HZ (or being forced by Exynos to be 200 or by not being able to test an Exynos board, forced to default to 100). My real grievance here is we got a configuration item for the scheduler which is being left out of ARM configurations which *can* use high resolution timers, but I don't know if this is a real problem or not, hence asking about it, and that HZ=100 is the ARM default whether we might be able to select that or not.. which seems low. Well, I have a versatile platform here. It's the intelligence behind the power control system for booting the boards on the nightly tests (currently disabled because I'm waiting for my main server to lock up again, and I need to use one of the serial ports for that.) The point is, it talks via I2C to a load of power monitors to read samples out. It does this at sub-100Hz intervals. Yet the kernel is built with HZ=100. NO_HZ=y and highres timers are enabled... works fine. So, no, HZ=100 is not a limit in that scenario. With NO_HZ=y and highres timers, it all works with epoll() - you get the interval that you're after. I've verified this with calls to gettimeofday() and the POSIX clocks. Okay. So, can you read this (it's short): http://ck.kolivas.org/patches/bfs/bfs-configuration-faq.txt And please tell me if he's batshit crazy and I should completely ignore any scheduler discussion that isn't ARM-specific, or maybe.. and I can almost guarantee this, he doesn't have an ARM platform so he's just delightfully ill-informed about anything but his quad-core x86? HZ=250 is the current kernel default if you don't touch anything, it seems, apologies for thinking it was HZ=100. Actually, it always used to be 100Hz on everything, including x86. It got upped when there were interactivity issues... 
which haven't been reported on ARM - so why change something that we know works and everyone is happy with? I don't know. I guess this is why I included Ingo and Peter as they seem to be responsible for core HZ-related things; why have HZ=250 on x86 when CONFIG_NO_HZ and HZ=100 would work just as effectively? Isn't CONFIG_NO_HZ the default on x86 and PPC and.. pretty much everything else? I know Con K. has been accused many times of peddling snake-oil... but he has pretty graphs and benchmarks that kind of bear him out on most things even if the results do not get his work upstream. I can't fault the statistical significance of his results.. but even a placebo effect can be graphed, correlation is not causation, etc, etc. - I don't know if anything real filters down into the documentation though. And that is too high for EBSA110 and a couple of other boards, especially where HZ must equal some exact divisor being pumped right into some timer unit. EBSA110 can do 250Hz, but it'll mean manually recalculating the timer arithmetic - because it's not a reloading counter - software has to manually reload it, and you have to take account of how far it's rolled over to get anything close to a regular interrupt rate which NTP is happy with. And believe me, it used to be one of two main NTP broadcasting servers on my network, so I know it works. A-ha... Anyway, a patch for ARM could perhaps end up like this:

~~
if ARCH_MULTIPLATFORM
	source kernel/Kconfig.hz
else
	HZ default 100
endif
HZ default 200 if ARCH_EBSA110 || ARCH_ETC_ETC || ARCH_UND_SO_WEITER
# any previous platform definitions where *really* required here.
# but not default 100 since it would override kernel/Kconfig.hz every time
~~

That doesn't work - if you define the same symbol twice, one definition takes priority over the other (I don't remember which way it works). They don't accumulate. Well I did some testing.. a couple days of poking around, and they don't need to accumulate. Because... 
it simply doesn't work like that. Try it and check to see what Kconfig produces. I did test it.. whatever you define last, sticks, and it's down to the order they're parsed in the tree - luckily, arch/arm/Kconfig is sourced first, which sources the mach/plat stuff way down at the bottom. As long as you have your default set somewhere, any further default just has to be sourced or added later in *one* of the Kconfigs, same as building any C file with gcc -E and spitting it out. Someone, at the end of it all, has to set some default, and as long as the one you want is the last one, everything is shiny. We know this, because our FRAME_POINTER config overrides the generic one - not partially, but totally and utterly in every way. But for something as simple as CONFIG_HZ getting a value.. it works okay. If Kconfig.hz sets CONFIG_HZ=250 because CONFIG_HZ_250 is default yes, and it CONFIG_HZ defaults to 250 if it's set
Re: One of these things (CONFIG_HZ) is not like the others..
On Mon, Jan 21, 2013 at 5:13 PM, Russell King - ARM Linux li...@arm.linux.org.uk wrote: On Mon, Jan 21, 2013 at 04:54:31PM -0600, Matt Sealey wrote: Hmm, I think it might be appreciated for people looking at this stuff (same as I stumbled into it) for a little comment on WHY the default is 200. That way you don't wonder even if you know why EBSA110 has a HZ=200 default, why Exynos is lumped in there too (to reduce the number of interrupts firing? Err, _reduce_ ? Can you please explain why changing HZ from 100 to 200 is a reduction? We were talking about HZ=1000 at the time, sorry... Maybe the Exynos timer interrupt is kind of a horrid core NMI kind of thing and it's desirable for it not to be every millisecond, Huh? HZ=100 is centisecond intervals... See above.. I think your understanding is just way off. That default is there because that is the _architecture_ _default_ and there _has_ to be a default. No, including kernel/Kconfig.hz won't give us any kind of non-specified default because, as I've already said in one of my other mails, you can't supplement Kconfig symbol definitions by declaring it multiple times. Okay, so the real I know where the 60Hz clocksource might come from, the old Amiga platforms have one based on the PSU frequency (50Hz in Europe, 60Hz US/Japan). Even a 60Hz clocksource is useful though (on the Amiga, at least, it is precisely the vsync clock for synchronizing your display output on TV-out, which makes it completely useful for the framebuffer driver), but.. you just won't expect to assign it as sched_clock or your delay timer. And if anyone does I'd expect they'd know full well it'd not run so well. Except in the UK where it'd be 50Hz for the TV out. (Lengthy irrelevant explanation why this is so for UK cut.) Read again: 50Hz in Europe. Australia too. I'm British and I used to have more EU-manufactured Amigas than I knew what to do with.. so.. just like your NTP story, I definitely know this already. 
From that description, we are booting with standard HZ on ARM, and the core sched_clock (as in we can call setup_sched_clock) and/or/both/optionally using a real delay_timer switch to HRT mode if we have the right equipment available in the kernel and at runtime on the SoC.. but the process scheduler isn't compiled with the means to actually take advantage of us being in HRT mode? Don't mix sched_clock() into this; it has nothing to do with HZ at all. You're confusing your apples with your oranges. Okay.. A simple BUILD_BUG_ON and a BUG_ON right after each other in the appropriate clocksource driver solves that.. if there's an insistence on having at least some rope, we can put them in a field and tell them they have to use the moon to actually hang themselves... No it doesn't - it introduces a whole load of new ways to make the kernel build or boot fail for pointless reasons - more failures, more regressions. No thank you. But it would effectively stop users drinking kool-aid.. if you set your HZ to something stupid, you don't even get a kernel to build, and certainly don't get to boot past the first 40 lines of boot messages.. I think most people would rather a build error, or a runtime unmistakable, unmissable warning than a subtle and almost imperceptible skew in NTP synchronization, to use your example.
Re: One of these things (CONFIG_HZ) is not like the others..
Okay so the final resolution of this is; 1) That the arch/arm/Kconfig HZ block is suffering from some cruft I think we could all be fairly confident that Exynos4 or S5P does not require HZ=200 - in theory, it has no such timer restrictions like EBSA110 (the docs I have show a perfectly capable 32-bit timer with a double-digits MHz input clock, since these are multimedia-class SoCs it'd be seriously f**ked up if they didn't). But while some of the entries on this line may be cargo-cult programming, the original addition on top of EBSA110 *may* be one of your unreported responsiveness issues. We could just let some Samsung employees complain when Android 6.x starts to get laggy with a 3.8 kernel because we forced their HZ=100. What I would do is predicate a fixed, obvious default on ARCH_MULTIPLATFORM so that it would get the benefit of a default HZ that you agree with. It wouldn't CHANGE anything, but it makes it look less funky, since the non-multiplatform settings would be somewhere else (it either needs more comments or an if - either way - otherwise it's potentially confusing);

if ARCH_MULTIPLATFORM
	config HZ
		int
		default 100
else
	# old config HZ block here
endif

2) We need to add config SCHED_HRTICK as a copy and paste from kernel/Kconfig.hz since.. well, I still don't understand exactly what the true effect would be, but I assume since Arnd is concerned and John's explanation rings true that it really should be enabled on ARM systems with the exact same dependencies as kernel/Kconfig.hz. Or not.. I see it as an oddity until I understand if we really care about it, but the code seems to be fairly important to the scheduler and also enabled by default almost everywhere else, which means only people with really freakish SMP architectures with no ability to use GENERIC_SMP_HELPERS have ever run these code paths besides ARM. That kind of leaves ARM in the doghouse.. who knows what weirdo scheduler reactions are related to it not being enabled. 
Maybe when it is, HZ *would* need to be allowed to be bumped when using this code path? On Mon, Jan 21, 2013 at 5:49 PM, Russell King - ARM Linux li...@arm.linux.org.uk wrote: On Mon, Jan 21, 2013 at 05:23:33PM -0600, Matt Sealey wrote: On Mon, Jan 21, 2013 at 4:42 PM, Russell King - ARM Linux li...@arm.linux.org.uk wrote: On Mon, Jan 21, 2013 at 04:20:14PM -0600, Matt Sealey wrote: I am sorry it sounded as if I was being high and mighty about not being able to select my own HZ (or being forced by Exynos to be 200 or by not being able to test an Exynos board, forced to default to 100). My real grievance here is we got a configuration item for the scheduler which is being left out of ARM configurations which *can* use high resolution timers, but I don't know if this is a real problem or not, hence asking about it, and that HZ=100 is the ARM default whether we might be able to select that or not.. which seems low. Well, I have a versatile platform here. It's the intelligence behind the power control system for booting the boards on the nightly tests (currently disabled because I'm waiting for my main server to lock up again, and I need to use one of the serial ports for that.) The point is, it talks via I2C to a load of power monitors to read samples out. It does this at sub-100Hz intervals. Yet the kernel is built with HZ=100. NO_HZ=y and highres timers are enabled... works fine. So, no, HZ=100 is not a limit in that scenario. With NO_HZ=y and highres timers, it all works with epoll() - you get the interval that you're after. I've verified this with calls to gettimeofday() and the POSIX clocks. Okay. So, can you read this (it's short): http://ck.kolivas.org/patches/bfs/bfs-configuration-faq.txt And please tell me if he's batshit crazy and I should completely ignore any scheduler discussion that isn't ARM-specific, or maybe.. 
and I can almost guarantee this, he doesn't have an ARM platform so he's just delightfully ill-informed about anything but his quad-core x86? Well... my x86 laptop is... HZ=1000, NO_HZ, HIGH_RES enabled, ondemand... doesn't really fit into any of those categories given there. I'd suggest that what's given there is a suggestion/opinion based on behaviours observed on x86 platforms. Whether it's appropriate for other architectures is not really a proven point - is it worth running ARM at 1000Hz when the load from running at 100Hz is measurable as a definite error in loops_per_jiffy calibration? Remember - the load from the interrupt handler at 1000Hz is 10x the load at 100Hz. Do you want to spend more cycles per second on the possibly multi-layer IRQ servicing and timer servicing? And what about the interrupt latency issue that we've hit several times already with devices taking longer than 10ms to service their peripherals because the driver doesn't make use of delayed
Re: One of these things (CONFIG_HZ) is not like the others..
On Mon, Jan 21, 2013 at 6:09 PM, Matt Sealey m...@genesi-usa.com wrote: [LAKML: about lack of SCHED_HRTICK because we don't use kernel/Kconfig.hz on ARM] kind of leaves ARM in the doghouse.. who knows what weirdo scheduler reactions are related to it not being enabled. Maybe when it is, HZ *would* need to be allowed to be bumped when using this code path? Or conversely maybe this is exactly why the Samsung maintainers decided they need HZ=200, because SCHED_HRTICK isn't being enabled and they're experiencing some multimedia lag because of it?
Re: [PATCH] crypto: fix FTBFS with ARM SHA1-asm and THUMB2_KERNEL
On Mon, Jan 21, 2013 at 4:46 PM, Nicolas Pitre n...@fluxnic.net wrote: On Mon, 21 Jan 2013, Matt Sealey wrote: The optimized assembler SHA1 code for ARM does not conform to Thumb2 register usage requirements, so it cannot be built when the kernel is configured with THUMB2_KERNEL. Fix the FTBFS for now by preventing misconfigurations of the kernel. Signed-off-by: Matt Sealey m...@genesi-usa.com A .arm directive at the top of the assembly code would be a better fix, as that wouldn't reduce functionality. If I recall, doing that last time for ssi-fiq.S was the wrong solution and it was suggested proper configuration (on top of possibly rewriting the assembly) was better than hacking around in the assembly.. Yet, I'd invite you to have a look at commit 638591cd7b in linux-next. I took a peek, and invite you to ignore my patch. I only tracked the top of Linus' tree.. That said, it seems nobody benchmarked this on something different than IXP425 or KS8695 to see if it's markedly faster than the (moderately recently updated) C-code implementation outside of the platforms mentioned in the logs for the initial commit? It seems like rather a specific optimization for a rather specific use case for rather specific processors (and therefore a small test base) probably meant for a very specific product line somewhere. Whether you get any benefit in enabling this config item or not for any other ARM platform is up for debate, isn't it? If it *is* in fact much faster everywhere, and it works in any ARM or THUMB2 configuration, there's a case to be built for it being the default ARM implementation for AES and SHA1.. This question is to the implementor/committer (Dave McCullough), how exactly did you measure the benchmark and can we reproduce it on some other ARM box? If it's long and laborious and not so important to test the IPsec tunnel use-case, what would be the simplest possible benchmark to see if the C vs. assembly version is faster for a particular ARM device? 
I can get hold of pretty much any Cortex-A8 or Cortex-A9 that matters, I have access to a Chromebook for A15, and maybe an i.MX27 or i.MX35 and a couple Marvell boards (ARMv6) if I set my mind to it... that much testing implies we find a pretty concise benchmark though with a fairly common kernel version we can spread around (i.MX, OMAP and the Chromebook, I can handle, the rest I'm a little wary of bothering to spend too much time on). I think that could cover a good swath of not-ARMv5 use cases from lower speeds to quad core monsters.. but I might stick to i.MX to start with..
Re: One of these things (CONFIG_HZ) is not like the others..
On Mon, Jan 21, 2013 at 6:51 PM, John Stultz john.stu...@linaro.org wrote: On 01/21/2013 02:54 PM, Matt Sealey wrote: On Mon, Jan 21, 2013 at 4:36 PM, John Stultz john.stu...@linaro.org wrote: On 01/21/2013 01:14 PM, Matt Sealey wrote: As far as jiffies rating, from jiffies.c: .rating= 1, /* lowest valid rating*/ So I'm not sure what you mean by the debug on the kernel log is telling me it has a higher resolution. Oh, it is just if I actually don't run setup_sched_clock on my platform, it gives a little message (with #define DEBUG 1 in sched_clock.c) about who setup the last sched_clock. Since you only get one chance, and I was fiddling with setup_sched_clock being probed from multiple possible timers from device tree (i.MX3 has a crapload of valid timers, which one you use right now is basically forced by the not-quite-fully-DT-only code and some funky iomap tricks). And what I got was, if I use the real hardware timer, it runs at 66MHz and says it has 15ns resolution and wraps every 500 seconds or so. The jiffies timer says it's 750MHz, with a 2ns resolution.. you get the drift. The generic reporting of how good the sched_clock source is kind of glosses over the quality rating of the clock source and at first glance (if you're not paying that much attention), it is a little bit misleading.. Yes, in the case I was remembering, the 60HZ was driven by the electrical line. While I have your attention, what would be the minimum good speed to run the sched_clock or delay timer implementation from? My rudimentary scribblings in my notebook give me a value of don't bother with less than 10KHz based on HZ=100, so I'm wondering if a direct 32.768KHz clock would do (i.MX osc clock input if I can supply it to one of the above myriad timers) since this would be low-power compared to a 66MHz one (by a couple mA anyway). I also have a bunch of questions about the delay timer requirements.. I might mail you personally.. or would you prefer on-list? 
Re: One of these things (CONFIG_HZ) is not like the others..
On Mon, Jan 21, 2013 at 7:18 PM, Russell King - ARM Linux li...@arm.linux.org.uk wrote: On Mon, Jan 21, 2013 at 07:06:59PM -0600, Matt Sealey wrote: On Mon, Jan 21, 2013 at 6:51 PM, John Stultz john.stu...@linaro.org wrote: On 01/21/2013 02:54 PM, Matt Sealey wrote: sched_clock() has nothing to do with time keeping, and that HZ/NO_HZ/HRTIMERS don't affect it (when it isn't being derived from jiffies). Now, sched_clock() is there to give the scheduler a _fast_ to access, higher resolution clock than is available from other sources, so that there's ways of accurately measuring the amount of time processes run for, That depends on what you meant by timekeeping, right? I'm really not concerned about the wallclock time, more about the accuracy of the scheduler clock (tick?), preemption, accurate delays (i.e. if I msleep(10) does it delay for 10ms or for 40ms because my delay timer is inaccurate? I'd rather it was better but closer to 10ms), and whether the scheduler (the thing that tells my userspace whether firefox is running now, or totem, or any other task) is using the correct high resolution periodic, oneshot, repeatable (however it repeats) timers *properly* given that this magic config item is missing on ARM. That magic config item being CONFIG_SCHED_HRTICK which is referenced a bunch in kernel/sched/*.[ch] but *ONLY* defined as a Kconfig item in kernel/Kconfig.hz. Do we need to copy that Kconfig item out to arch/arm/Kconfig, that's the question? and other such measurements - and it uses that to determine how to schedule a particular task and when to preempt it. Not providing it means you get those measurements at HZ-based resolution, which is suboptimal for tasks which run often for sub-HZ periods (which can end up accumulating zero run time.) 
Okay, and John said earlier: John Stultz: So I'm actually not super familiar with SCHED_HRTICK details, but from my brief skim of it it looks like it's useful for turning off the periodic timer tick, and allowing the scheduler tick to be triggered by an hrtimer itself (There's a number of these interesting inversions that go on in switching to HRT mode - for instance, standard timer ticks are switched to being hrtimer events themselves). This likely has the benefit of time-accurate preemption (well, long term, as if the timer granularity isn't matching you could be delayed up to a tick - but it wouldn't drift). I'm guessing Thomas would probably know best what the potential issues would be from running ((CONFIG_HRTIMER || CONFIG_NO_HZ) && !CONFIG_SCHED_HRTICK). If SCHED_HRTICK isn't enabled but setup_sched_clock has been given an accessor for a real, hardware, fast, high resolution counter that meets all the needs of sched_clock, what's going on? If it's enabled, what extra is it doing that, say, my_plat_read_sched_clock doesn't?
Re: One of these things (CONFIG_HZ) is not like the others..
On Mon, Jan 21, 2013 at 7:31 PM, John Stultz john.stu...@linaro.org wrote: On 01/21/2013 05:06 PM, Matt Sealey wrote: On Mon, Jan 21, 2013 at 6:51 PM, John Stultz john.stu...@linaro.org wrote: On 01/21/2013 02:54 PM, Matt Sealey wrote: On Mon, Jan 21, 2013 at 4:36 PM, John Stultz john.stu...@linaro.org wrote: On 01/21/2013 01:14 PM, Matt Sealey wrote: As far as jiffies rating, from jiffies.c: .rating= 1, /* lowest valid rating*/ So I'm not sure what you mean by the debug on the kernel log is telling me it has a higher resolution. Oh, it is just if I actually don't run setup_sched_clock on my platform, it gives a little message (with #define DEBUG 1 in sched_clock.c) about who setup the last sched_clock. Since you only get one chance, and I was fiddling with setup_sched_clock being probed from multiple possible timers from device tree (i.MX3 has a crapload of valid timers, which one you use right now is basically forced by the not-quite-fully-DT-only code and some funky iomap tricks). And what I got was, if I use the real hardware timer, it runs at 66MHz and says it has 15ns resolution and wraps every 500 seconds or so. The jiffies timer says it's 750MHz, with a 2ns resolution.. you get the drift. The generic reporting of how good the sched_clock source is kind of glosses over the quality rating of the clock source and at first glance (if you're not paying that much attention), it is a little bit misleading.. I've got no clue on this. sched_clock is arch specific, and while ARM does use clocksources for sched_clock, what you're seeing is a detail of the ARM implementation and not the clocksource code (one complication is that clocksource rating values are for the requirements of timekeeping, which are different than the requirements for sched_clock - so the confusion is understandable). Yes, in the case I was remembering, the 60HZ was driven by the electrical line. 
While I have your attention, what would be the minimum good speed to run the sched_clock or delay timer implementation from? My rudimentary scribblings in my notebook give me a value of "don't bother with less than 10KHz" based on HZ=100, so I'm wondering if a direct 32.768KHz clock would do (i.MX osc clock input, if I can supply it to one of the above myriad timers) since this would be low-power compared to a 66MHz one (by a couple mA anyway). I also have a bunch of questions about the delay timer requirements.. I might mail you personally.. or would you prefer on-list?

So there are probably other folks who could better comment on sched_clock() or the delay timer (I'm guessing the delay() implementation is what you mean by that) design trade-offs.

I'm specifically talking about if I do:

    static struct delay_timer imx_gpt_delay_timer = {
        .read_current_timer = imx_gpt_read_current_timer,
    };

and then something like:

    imx_gpt_delay_timer.freq = clk_get_rate(clk_per);
    register_current_timer_delay(&imx_gpt_delay_timer);

In the sense that now (as of kernel 3.7, iirc) I have the ability to have the delay implementation use this awesome fast accessor (which is nothing to do with a 'clocksource' as in the subsystem..) to get to my (here at least) 66.5MHz counter. It counts up or down - i.MX has both - but I dunno if you can use a down counter for delay_timer, or if that's preferred, or what.. there are no examples of it.. but it seems to work. That said, I can't imagine what would be an immediately visible and not totally random effect of doing it wrong - maybe that delays return instantly, which could be very hard or impossible to ever notice compared to not being able to browse the internet on the target device.. it might pop up on some randomly-not-resetting platform device or so, though.

And I can also put sched_clock on a completely different timer. Does that make any sense at all? I wouldn't know, it's not documented. And if I wanted to I could register 8 more timers.
That seems rather excessive, but the ability to use those extra 8 as clock outputs from the SoC, or to otherwise directly use comparators, is useful to some people. Does Linux in general really give a damn about having 8 timers of the same quality available when most systems barely have two clocksources anyway (on x86, tsc and hpet - on ARM I guess twd and some SoC-specific timer)? I dunno how many timers people might actually want to define in a device tree, but I figure every single one is not a bad thing, and which ones end up as sched_clock, delay_timer or just plain registered clocksources - or not registered as a clocksource at all and accessed as some kind of comparator through some kooky ioctl API - is something you would also configure...

But again, someone more familiar with the scheduler and driver requirements would probably be more informational. Okay. I assume that's a combination of Russell and Thomas..

--
Matt Sealey
Re: Compilation problem with drivers/staging/zsmalloc when !SMP on ARM
On Fri, Jan 18, 2013 at 3:08 PM, Russell King - ARM Linux wrote:
> On Fri, Jan 18, 2013 at 02:24:15PM -0600, Matt Sealey wrote:
>> Hello all,
>>
>> I wonder if anyone can shed some light on this linking problem I have
>> right now. If I configure my kernel without SMP support (it is a very
>> lean config for i.MX51 with device tree support only) I hit this error
>> on linking:
>
> Yes, I looked at this, and I've decided that I will _not_ fix this export,
> neither will I accept a patch to add an export.

Understood..

> As far as I can see, this code is buggy in an SMP environment. There's
> apparently no guarantee that:
>
> 1. the mapping will be created on a particular CPU.
> 2. the mapping will then be used only on this specific CPU.
> 3. no guarantee that another CPU won't speculatively prefetch from this
>    region.
> 4. when the mapping is torn down, no guarantee that it's the same CPU that
>    used the mapping.
>
> So, the use of the local TLB flush leaves all the other CPUs potentially
> containing TLB entries for this mapping.

I'm gonna put this out to the maintainers (Konrad, and Seth since he committed it) that if this code is buggy it gets taken back out, even if it makes zsmalloc "slow" on ARM, for the following reasons:

* It's buggy on SMP as Russell describes above
* It might not be buggy on UP (opposite to Russell's description above, since the restrictions he states do not exist there), but that would imply an export for a really core internal MM function nobody should be using anyway
* By that assessment, using that core internal MM function on SMP is also bad voodoo that zsmalloc should not be doing

It also either smacks of a lack of comprehensive testing, or defies logic, that nobody ever built the code without CONFIG_SMP - which means it was only tested on a bunch of SMP ARM systems (I'm guessing.. Pandaboard? :) or UP systems with SMP/SMP_ON_UP enabled (to expand on that guess, maybe Beagleboard in some multiplatform Beagle/Panda hybrid kernel).
I am sure I was reading the mailing lists when that patch was discussed, coded and committed, and my guess is correct. In this case, what we have here anyway is code which, when PROPERLY configured as so..

diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c b/drivers/staging/zsmalloc/zsmalloc-main.c
index 09a9d35..ecf75fb 100644
--- a/drivers/staging/zsmalloc/zsmalloc-main.c
+++ b/drivers/staging/zsmalloc/zsmalloc-main.c
@@ -228,7 +228,7 @@ struct zs_pool {
  * mapping rather than copying
  * for object mapping.
  */
-#if defined(CONFIG_ARM)
+#if defined(CONFIG_ARM) && defined(CONFIG_SMP)
 #define USE_PGTABLE_MAPPING
 #endif

.. such that it even compiles in both "guess" configurations, the slower Cortex-A8 600MHz single-core system gets to use the slow copy path and the dual-core 1GHz+ Cortex-A9 (with twice the RAM..) gets to use the fast mapping path.

Essentially all the patch does is "improve performance" on the fastest, best-configured, large-amounts-of-RAM, lots-of-CPU-performance ARM systems (OMAP4+, Snapdragon, Exynos4+, Marvell Armada, i.MX6..) while introducing the problems Russell describes, and leave performance exactly the same - and potentially far more stable - on the slower, memory-limited ARM machines. Given the purpose of zsmalloc, zram, zcache etc. this somewhat defies logic. If it's not making the memory-limited, slow ARM systems run better, what's the point?

So in summary I suggest "we" (Greg? or is it Seth's responsibility?) should just back out that whole USE_PGTABLE_MAPPING chunk of code introduced with f553646. Then Russell can carry on randconfiging and I can build for SMP and UP and get the same code.. with less bugs. I am sure zsmalloc/zram/zcache2 are not so awful at the end of the day despite the churn in staging.. but for the amount of time I just spent today with my brain on fire, cross-referencing mm code for a linker error when all I wanted was a non-SMP kernel, I feel Greg's hurt a little bit.
--
Matt Sealey
Product Development Analyst, Genesi USA, Inc.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Compilation problem with drivers/staging/zsmalloc when !SMP on ARM
Hello all,

I wonder if anyone can shed some light on this linking problem I have right now. If I configure my kernel without SMP support (it is a very lean config for i.MX51 with device tree support only) I hit this error on linking:

  MODPOST 216 modules
  LZMA    arch/arm/boot/compressed/piggy.lzma
  AS      arch/arm/boot/compressed/lib1funcs.o
ERROR: "v7wbi_flush_kern_tlb_range" [drivers/staging/zsmalloc/zsmalloc.ko] undefined!
make[1]: *** [__modpost] Error 1
make: *** [modules] Error 2
make: *** Waiting for unfinished jobs

I am wondering if someone who is far better versed in the mm code can tell me exactly what would end up calling this function, and why it would only be in the kernel at all if SMP was enabled. All I can see right now is that it is PROBABLY something resulting from this code and comment inside the zsmalloc driver:

/*
 * By default, zsmalloc uses a copy-based object mapping method to access
 * allocations that span two pages. However, if a particular architecture
 * 1) Implements local_flush_tlb_kernel_range() and 2) Performs VM mapping
 * faster than copying, then it should be added here so that
 * USE_PGTABLE_MAPPING is defined. This causes zsmalloc to use page table
 * mapping rather than copying for object mapping.
 */
#if defined(CONFIG_ARM)
#define USE_PGTABLE_MAPPING
#endif

And of course, once USE_PGTABLE_MAPPING is enabled, the use of local_flush_tlb_kernel_range is the actual culprit here. But the question is, for the ARM guys: shouldn't local_flush_tlb_kernel_range actually be defined in the kernel build even without SMP, even if it would be architecturally a no-op on UP systems, with CONFIG_SMP_ON_UP catching this case? If not, then the fix is obvious: the check inside zsmalloc for CONFIG_ARM should also check specifically for a local_flush_tlb_kernel_range definition, or for CONFIG_SMP, or for the non-presence of CONFIG_SMP_ON_UP, or something?
The build cases I have tested are basically CONFIG_SMP + CONFIG_SMP_ON_UP (since it's a single-core Cortex-A8) and without CONFIG_SMP (since it's a Cortex-A8..), using imx_v7_defconfig and imx_v6_v7_defconfig alike and making the appropriate adjustments. Trying to compile with CONFIG_SMP but without CONFIG_SMP_ON_UP makes no sense on my target system - but essentially it only links with CONFIG_SMP.

I got no clue what I am looking at in arch/arm/mm and related, so I am unsure precisely how I should proceed in patching it with the intent that it goes upstream.. or what the real implication of this kind of memory management actually is on SMP vs. UP systems, but the intended functionality of local_flush_tlb_kernel_range seems like it should exist even on UP, to me.

--
Matt Sealey
Product Development Analyst, Genesi USA, Inc.
[PATCH] ehci-mxc: remove Efika MX-specific CHRGVBUS hack
Since Efika MX platform support (pre-devicetree) was removed from the tree this code no longer has any possibility of running and clutters up the driver which is being replaced by the chipidea host in the future anyway.

Signed-off-by: Matt Sealey
Tested-by: Steev Klimazewski
CC: Sascha Hauer
CC: Alan Stern
---
 drivers/usb/host/ehci-mxc.c | 20 --------------------
 1 file changed, 20 deletions(-)

diff --git a/drivers/usb/host/ehci-mxc.c b/drivers/usb/host/ehci-mxc.c
index 8e58a5f..b62c3a8 100644
--- a/drivers/usb/host/ehci-mxc.c
+++ b/drivers/usb/host/ehci-mxc.c
@@ -102,7 +102,6 @@ static int ehci_mxc_drv_probe(struct platform_device *pdev)
 	struct usb_hcd *hcd;
 	struct resource *res;
 	int irq, ret;
-	unsigned int flags;
 	struct ehci_mxc_priv *priv;
 	struct device *dev = &pdev->dev;
 	struct ehci_hcd *ehci;
@@ -213,25 +212,6 @@ static int ehci_mxc_drv_probe(struct platform_device *pdev)
 	if (ret)
 		goto err_add;
 
-	if (pdata->otg) {
-		/*
-		 * efikamx and efikasb have some hardware bug which is
-		 * preventing usb to work unless CHRGVBUS is set.
-		 * It's in violation of USB specs
-		 */
-		if (machine_is_mx51_efikamx() || machine_is_mx51_efikasb()) {
-			flags = usb_phy_io_read(pdata->otg,
-						ULPI_OTG_CTRL);
-			flags |= ULPI_OTG_CTRL_CHRGVBUS;
-			ret = usb_phy_io_write(pdata->otg, flags,
-					       ULPI_OTG_CTRL);
-			if (ret) {
-				dev_err(dev, "unable to set CHRVBUS\n");
-				goto err_add;
-			}
-		}
-	}
-
 	return 0;
 
 err_add:
--
1.7.10.4