Re: State of PREEMPT_RT in PPC arch
On Thu, Feb 18, 2010 at 09:34:28PM -0800, Ryan wrote:
> On Thu, Feb 18, 2010 at 5:15 AM, Wolfram Sang <w.s...@pengutronix.de> wrote:
> > > I'm soliciting comments from the community. Thanks in advance for sharing your thoughts.
> >
> > Mainline your driver and you are free from such problems :)
>
> Shouldn't it be the opposite way? That is, get drivers working first, then mainline them? I don't even have a working driver with the RT patch.

If you have problems getting your driver into mainline, the lists are there to help. Of course, this implies working with a recent kernel (which also answers your initial question a bit). Most people don't want to support old kernels like 2.6.2x in their free time. You can get commercial support for such tasks (be it from my employer or from other consultants on this list).

> ... and discussions about them, kinda dedicated to active developers. Is there a place that users or not-so-active developers can post technical questions and get some relatively quick answers or comments?

The list was appropriate. As all people are busy by default, getting no response is not that exceptional. If you need a fast response, you should consider commercial support.

Regards,

   Wolfram

--
Pengutronix e.K. | Wolfram Sang | Industrial Linux Solutions | http://www.pengutronix.de/

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7
On Fri, 2010-02-19 at 17:05 +1100, Michael Neuling wrote:

>  include/linux/sched.h |    2 +-
>  kernel/sched_fair.c   |   61 ++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 58 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 0eef87b..42fa5c6 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -849,7 +849,7 @@ enum cpu_idle_type {
>  #define SD_POWERSAVINGS_BALANCE	0x0100	/* Balance for power savings */
>  #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
>  #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
>  -
>  +#define SD_ASYM_PACKING		0x0800
>
> Would we eventually add this to SD_SIBLING_INIT in an arch specific hook, or is this ok to add it generically?

I'd think we'd want to keep that limited to architectures that actually need it.

> +static int update_sd_pick_busiest(struct sched_domain *sd,
> +				  struct sd_lb_stats *sds,
> +				  struct sched_group *sg,
> +				  struct sg_lb_stats *sgs)
> +{
> +	if (sgs->sum_nr_running > sgs->group_capacity)
> +		return 1;
> +
> +	if (sgs->group_imb)
> +		return 1;
> +
> +	if ((sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running) {
> +		if (!sds->busiest)
> +			return 1;
> +
> +		if (group_first_cpu(sds->busiest) > group_first_cpu(group))
>
> group = sg here? (I get a compile error otherwise)

Oh, quite ;-)

> +static int check_asym_packing(struct sched_domain *sd,
> +			      struct sd_lb_stats *sds,
> +			      int cpu, unsigned long *imbalance)
> +{
> +	int i, cpu, busiest_cpu;
>
> Redefining cpu here. Looks like the cpu parameter is not really needed?
Seems that way indeed. I went back and forth a few times on the actual implementation of this function (which started out life as a copy of check_power_save_busiest_group); it's amazing there were only these two compile glitches ;-)

> +	if (!(sd->flags & SD_ASYM_PACKING))
> +		return 0;
> +
> +	if (!sds->busiest)
> +		return 0;
> +
> +	i = 0;
> +	busiest_cpu = group_first_cpu(sds->busiest);
> +	for_each_cpu(cpu, sched_domain_span(sd)) {
> +		i++;
> +		if (cpu == busiest_cpu)
> +			break;
> +	}
> +
> +	if (sds->total_nr_running > i)
> +		return 0;
> +
> +	*imbalance = sds->max_load;
> +	return 1;
> +}
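For readers following the logic of check_asym_packing() above: the loop counts the 1-based position of the busiest group's first CPU within the domain span, and the function only reports an imbalance when every running task would fit on lower-numbered threads. Here is a toy userspace sketch of that decision (the names busiest_cpu_index and want_asym_pack are illustrative, not kernel code):

```c
#include <assert.h>

/* Toy model: given the CPUs of a sched domain in ascending order,
 * return the 1-based position of the busiest group's first CPU. */
static int busiest_cpu_index(const int *span, int nr, int busiest_cpu)
{
    int i = 0;
    for (int c = 0; c < nr; c++) {
        i++;
        if (span[c] == busiest_cpu)
            break;
    }
    return i;
}

/* Pack only if the total number of running tasks fits on threads
 * numbered at or below the busiest group's first CPU, mirroring the
 * "if (sds->total_nr_running > i) return 0;" check in the patch. */
static int want_asym_pack(const int *span, int nr, int busiest_cpu,
                          int total_nr_running)
{
    return total_nr_running <= busiest_cpu_index(span, nr, busiest_cpu);
}
```

With a 4-thread SMT span {0,1,2,3} and the busiest group starting at CPU 2, two running tasks can be packed onto CPUs 0-1, so packing is worthwhile; five tasks cannot.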
Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7
In message <1266573672.1806.70.ca...@laptop> you wrote:
> On Fri, 2010-02-19 at 17:05 +1100, Michael Neuling wrote:
> >  include/linux/sched.h |    2 +-
> >  kernel/sched_fair.c   |   61 ++++++++++++++++++++++++++++++++++++++-----
> >  2 files changed, 58 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 0eef87b..42fa5c6 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -849,7 +849,7 @@ enum cpu_idle_type {
> >  #define SD_POWERSAVINGS_BALANCE	0x0100	/* Balance for power savings */
> >  #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
> >  #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
> >  -
> >  +#define SD_ASYM_PACKING		0x0800
> >
> > Would we eventually add this to SD_SIBLING_INIT in an arch specific hook, or is this ok to add it generically?
>
> I'd think we'd want to keep that limited to architectures that actually need it.

OK

> > +static int update_sd_pick_busiest(struct sched_domain *sd,
> > +				  struct sd_lb_stats *sds,
> > +				  struct sched_group *sg,
> > +				  struct sg_lb_stats *sgs)
> > +{
> > +	if (sgs->sum_nr_running > sgs->group_capacity)
> > +		return 1;
> > +
> > +	if (sgs->group_imb)
> > +		return 1;
> > +
> > +	if ((sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running) {
> > +		if (!sds->busiest)
> > +			return 1;
> > +
> > +		if (group_first_cpu(sds->busiest) > group_first_cpu(group))
> >
> > group = sg here? (I get a compile error otherwise)
>
> Oh, quite ;-)
>
> > +static int check_asym_packing(struct sched_domain *sd,
> > +			      struct sd_lb_stats *sds,
> > +			      int cpu, unsigned long *imbalance)
> > +{
> > +	int i, cpu, busiest_cpu;
> >
> > Redefining cpu here. Looks like the cpu parameter is not really needed?
>
> Seems that way indeed. I went back and forth a few times on the actual implementation of this function (which started out life as a copy of check_power_save_busiest_group); it's amazing there were only these two compile glitches ;-)

:-)

Below are the cleanups + the arch specific bits. It doesn't change your logic at all. Obviously the PPC arch bits would need to be split into a separate patch. Compiles and boots against linux-next.
Mikey

 arch/powerpc/include/asm/cputable.h |    3 +
 arch/powerpc/kernel/process.c       |    7 +++
 include/linux/sched.h               |    4 +-
 include/linux/topology.h            |    1 
 kernel/sched_fair.c                 |   64 ++++++++++++++++++++++++++++++--
 5 files changed, 74 insertions(+), 5 deletions(-)

Index: linux-next/arch/powerpc/include/asm/cputable.h
===================================================================
--- linux-next.orig/arch/powerpc/include/asm/cputable.h
+++ linux-next/arch/powerpc/include/asm/cputable.h
@@ -195,6 +195,7 @@ extern const char *powerpc_base_platform
 #define CPU_FTR_SAO			LONG_ASM_CONST(0x0020)
 #define CPU_FTR_CP_USE_DCBTZ		LONG_ASM_CONST(0x0040)
 #define CPU_FTR_UNALIGNED_LD_STD	LONG_ASM_CONST(0x0080)
+#define CPU_FTR_ASYM_SMT4		LONG_ASM_CONST(0x0100)
 
 #ifndef __ASSEMBLY__
 
@@ -409,7 +410,7 @@ extern const char *powerpc_base_platform
 	    CPU_FTR_MMCRA | CPU_FTR_SMT | \
 	    CPU_FTR_COHERENT_ICACHE | CPU_FTR_LOCKLESS_TLBIE | \
 	    CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \
-	    CPU_FTR_DSCR | CPU_FTR_SAO)
+	    CPU_FTR_DSCR | CPU_FTR_SAO | CPU_FTR_ASYM_SMT4)
 #define CPU_FTRS_CELL	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
 	    CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \

Index: linux-next/arch/powerpc/kernel/process.c
===================================================================
--- linux-next.orig/arch/powerpc/kernel/process.c
+++ linux-next/arch/powerpc/kernel/process.c
@@ -1265,3 +1265,10 @@ unsigned long randomize_et_dyn(unsigned
 
 	return ret;
 }
+
+int arch_sd_asym_packing(void)
+{
+	if (cpu_has_feature(CPU_FTR_ASYM_SMT4))
+		return SD_ASYM_PACKING;
+	return 0;
+}

Index: linux-next/include/linux/sched.h
===================================================================
--- linux-next.orig/include/linux/sched.h
+++ linux-next/include/linux/sched.h
@@ -849,7 +849,7 @@ enum cpu_idle_type {
 #define SD_POWERSAVINGS_BALANCE	0x0100	/* Balance for power savings */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
-
+#define SD_ASYM_PACKING	0x0800	/* Asymmetric SMT packing */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 
 enum powersavings_balance_level { @@
Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
On Fri, Feb 19, 2010 at 11:07:30AM +1100, Anton Blanchard wrote:
> Hi,
>
> The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables zone reclaim.

I've no problem with the patch anyway.

> FYI even with this enabled I could trip it up pretty easily with a multi-threaded application. I tried running stream across all threads in node 0. The machine looks like:
>
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 0 free: 30254 MB
> node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
> node 1 free: 31832 MB
>
> Now create some clean pagecache on node 0:
>
> # taskset -c 0 dd if=/dev/zero of=/tmp/bigfile bs=1G count=16
> # sync
>
> node 0 free: 12880 MB
> node 1 free: 31830 MB
>
> I built stream to use about 25GB of memory. I then ran stream across all threads in node 0:
>
> # OMP_NUM_THREADS=16 taskset -c 0-15 ./stream
>
> We exhaust all memory on node 0, and start using memory on node 1:
>
> node 0 free: 0 MB
> node 1 free: 20795 MB
>
> ie about 10GB of node 1. Now if we run the same test with one thread:
>
> # OMP_NUM_THREADS=1 taskset -c 0 ./stream
>
> things are much better:
>
> node 0 free: 11 MB
> node 1 free: 31552 MB
>
> Interestingly enough it takes two goes to get completely onto node 0, even with one thread. The second run looks like:
>
> node 0 free: 14 MB
> node 1 free: 31811 MB
>
> I had a quick look at the page allocation logic and I think I understand why we would have issues with multiple threads all trying to allocate at once.
>
> - The ZONE_RECLAIM_LOCKED flag allows only one thread into zone reclaim at a time, and whatever thread is in zone reclaim probably only frees a small amount of memory. Certainly not enough to satisfy all 16 threads.
>
> - We seem to end up racing between zone_watermark_ok, zone_reclaim and buffered_rmqueue. Since everyone is in here, the memory one thread reclaims may be stolen by another thread.

You're pretty much on the button here. Only one thread at a time enters zone_reclaim. The others back off and try the next zone in the zonelist instead.
I'm not sure what the original intention was, but most likely it was to prevent too many parallel reclaimers in the same zone potentially dumping out way more data than necessary.

I'm not sure if there is an easy way to fix this without penalising other workloads, though. You could experiment with waiting on the bit if the GFP flags allow it? The expectation would be that the reclaim operation does not take long. Wait on the bit and, if you are making forward progress, recheck the watermarks before continuing.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
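The behaviour under discussion (skip the zone while another thread holds ZONE_RECLAIM_LOCKED, versus waiting and rechecking the watermark) can be modelled in a toy single-threaded sketch. All names here are illustrative, not the kernel implementation:

```c
#include <assert.h>

/* Toy model of the allocator's choice when a zone is below its
 * watermark.  Current behaviour: if another thread holds the zone's
 * reclaim bit, give up and fall through to the next zone.  Suggested
 * change: wait for the bit and then recheck the watermark, since the
 * other reclaimer may have freed enough pages for us too. */
enum pick { USE_THIS_ZONE, TRY_NEXT_ZONE };

static enum pick pick_zone(int reclaim_locked_by_other,
                           long free_pages, long watermark,
                           long reclaimable, int wait_on_bit)
{
    if (free_pages >= watermark)
        return USE_THIS_ZONE;

    if (reclaim_locked_by_other && !wait_on_bit)
        return TRY_NEXT_ZONE;   /* current behaviour: skip the zone */

    /* We either took the bit ourselves or waited for the holder;
     * model a reclaim pass freeing 'reclaimable' pages, then recheck
     * the watermark before deciding. */
    free_pages += reclaimable;
    return free_pages >= watermark ? USE_THIS_ZONE : TRY_NEXT_ZONE;
}
```

In the model, 16 threads contending on one zone stop spilling to the remote node as soon as waiting is allowed and a single reclaim pass frees enough pages to clear the watermark.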
Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
On Fri, 19 Feb 2010, Mel Gorman wrote:
> > The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables zone reclaim.
>
> I've no problem with the patch anyway.

Nor do I.

> > - We seem to end up racing between zone_watermark_ok, zone_reclaim and buffered_rmqueue. Since everyone is in here, the memory one thread reclaims may be stolen by another thread.
>
> You're pretty much on the button here. Only one thread at a time enters zone_reclaim. The others back off and try the next zone in the zonelist instead. I'm not sure what the original intention was, but most likely it was to prevent too many parallel reclaimers in the same zone potentially dumping out way more data than necessary.

Yes, it was to prevent concurrency slowing down reclaim. At that time the number of processors per NUMA node was 2 or so. The number of pages that are reclaimed is limited to avoid tossing too many page cache pages.

> You could experiment with waiting on the bit if the GFP flags allow it? The expectation would be that the reclaim operation does not take long. Wait on the bit and, if you are making forward progress, recheck the watermarks before continuing.

You could reclaim more pages during a zone reclaim pass? Increase nr_to_reclaim in __zone_reclaim() and see if that helps. One zone reclaim pass should reclaim enough local pages to keep the processors on a node happy for a reasonable interval. Maybe do a fraction of a zone? 1/16th?
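For a sense of scale on the 1/16th suggestion, here is some toy arithmetic. The 16 GB zone and 4 KB page size are assumptions for illustration, not values from the thread; SWAP_CLUSTER_MAX (32 pages) is the kernel's usual small reclaim batch:

```c
#include <assert.h>

/* Toy arithmetic: compare a small fixed per-pass reclaim batch with a
 * "fraction of a zone" target for an assumed 16 GB zone of 4 KB pages. */
#define SWAP_CLUSTER_MAX 32UL

static unsigned long fraction_of_zone(unsigned long zone_pages,
                                      unsigned long denom)
{
    return zone_pages / denom;
}
```

A 16 GB zone holds 4,194,304 4 KB pages, so a 1/16th pass would target 262,144 pages (1 GB) per pass, four orders of magnitude more than the 32-page batch, which is why one pass could then satisfy many allocating threads at once.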
Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
On Fri, Feb 19, 2010 at 8:42 PM, Christoph Lameter <c...@linux-foundation.org> wrote:
> On Fri, 19 Feb 2010, Mel Gorman wrote:
> > > The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables zone reclaim.
> >
> > I've no problem with the patch anyway.
>
> Nor do I.
>
> > > - We seem to end up racing between zone_watermark_ok, zone_reclaim and buffered_rmqueue. Since everyone is in here, the memory one thread reclaims may be stolen by another thread.
> >
> > You're pretty much on the button here. Only one thread at a time enters zone_reclaim. The others back off and try the next zone in the zonelist instead. I'm not sure what the original intention was, but most likely it was to prevent too many parallel reclaimers in the same zone potentially dumping out way more data than necessary.
>
> Yes, it was to prevent concurrency slowing down reclaim. At that time the number of processors per NUMA node was 2 or so. The number of pages that are reclaimed is limited to avoid tossing too many page cache pages.

That is interesting; I always thought it was to try and free page cache first. For example, with zone->min_unmapped_pages: if zone_pagecache_reclaimable is greater than the unmapped pages, we start reclaiming the cached pages first. min_unmapped_pages almost sounds like a higher-level watermark, or am I misreading the code?

Balbir Singh
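The heuristic Balbir is describing can be stated as a one-line predicate. This is a toy restatement (illustrative names, not the kernel source): zone reclaim only considers a zone worth reclaiming from when its reclaimable page cache exceeds the protected minimum, so ongoing file I/O keeps a floor of cached pages:

```c
#include <assert.h>

/* Toy model of the zone reclaim trigger being discussed: reclaim the
 * zone's page cache only while the reclaimable amount stays above the
 * protected minimum (zone->min_unmapped_pages in the kernel). */
static int pagecache_reclaimable_enough(long pagecache_reclaimable,
                                        long min_unmapped_pages)
{
    return pagecache_reclaimable > min_unmapped_pages;
}
```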
Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
On Fri, Feb 19, 2010 at 3:59 AM, Anton Blanchard <an...@samba.org> wrote:
> I noticed /proc/sys/vm/zone_reclaim_mode was 0 on a ppc64 NUMA box. It gets enabled via this:
>
> 	/*
> 	 * If another node is sufficiently far away then it is better
> 	 * to reclaim pages in a zone before going off node.
> 	 */
> 	if (distance > RECLAIM_DISTANCE)
> 		zone_reclaim_mode = 1;
>
> Since we use the default value of 20 for REMOTE_DISTANCE and 20 for RECLAIM_DISTANCE, it never kicks in. The local-to-remote bandwidth ratios can be quite large on System p machines, so it makes sense for us to reclaim clean pagecache locally before going off node. The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables zone reclaim.

A reclaim distance of 10 implies a ratio of 1; does that mean we'll always do zone_reclaim() to free page cache and slab cache before moving on to another node?

Balbir Singh.
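The check quoted above is simple enough to verify with the numbers from the thread. A small sketch (the 2010-era defaults REMOTE_DISTANCE == 20 and RECLAIM_DISTANCE == 20 are taken from Anton's description):

```c
#include <assert.h>

/* Sketch of the quoted check: zone_reclaim_mode is enabled for a node
 * whose distance exceeds RECLAIM_DISTANCE.  With both defaults at 20,
 * a remote node at distance 20 never enables it; lowering
 * RECLAIM_DISTANCE to 10, as the patch does, enables it. */
#define REMOTE_DISTANCE 20

static int zone_reclaim_enabled(int distance, int reclaim_distance)
{
    return distance > reclaim_distance;
}
```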
Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
On Fri, 19 Feb 2010, Balbir Singh wrote:
> > > ... zone_reclaim. The others back off and try the next zone in the zonelist instead. I'm not sure what the original intention was, but most likely it was to prevent too many parallel reclaimers in the same zone potentially dumping out way more data than necessary.
> >
> > Yes, it was to prevent concurrency slowing down reclaim. At that time the number of processors per NUMA node was 2 or so. The number of pages that are reclaimed is limited to avoid tossing too many page cache pages.
>
> That is interesting; I always thought it was to try and free page cache first. For example, with zone->min_unmapped_pages: if zone_pagecache_reclaimable is greater than the unmapped pages, we start reclaiming the cached pages first. min_unmapped_pages almost sounds like a higher-level watermark, or am I misreading the code?

Indeed, the purpose is to free *old* page cache pages. min_unmapped_pages is there to protect a minimum of the page cache pages / fs metadata from zone reclaim so that ongoing file I/O is not impacted.
anyone used the Marvell mv64560?
Is anyone familiar with the mv64560? I'm curious how much difference there might be from the older mv64360 as far as setting up the PCI bus, CPU bus, i2c, memory, etc. I don't see any mention of this chip in current Linux sources, but there's some mention of people trying it, and it's referenced in one of the text files in Documentation. Is it supported? Are there any caveats? We'll be going through proper channels eventually, but I thought I'd check here just in case. Any information appreciated.

Thanks,
Chris
Re: [PATCH] eeh: Fixing a bug when pci structure is null
Hi Ben,

I'd like to ask about this patch. Should I re-submit?

Thanks,

Breno Leitao wrote:
> During an EEH recovery, the pci_dev structure can be null, mainly if an EEH event is detected during a PCI config operation. In this case, the pci_dev will not be known (and will be null) and the kernel will crash with the following message:
Re: [PATCH] eeh: Fixing a bug when pci structure is null
Hi Paul, Breno,

Some confusion -- I've been out of the loop for a while -- I assume it's still Paul who is pushing these patches upstream, and not Ben? So Breno, maybe you should resend the patch to Paul?

--linas

On 19 February 2010 10:43, Breno Leitao <lei...@linux.vnet.ibm.com> wrote:
> Hi Ben,
>
> I'd like to ask about this patch. Should I re-submit?
>
> Thanks,
>
> Breno Leitao wrote:
> > During an EEH recovery, the pci_dev structure can be null, mainly if an EEH event is detected during a PCI config operation. In this case, the pci_dev will not be known (and will be null) and the kernel will crash with the following message:
Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
* Christoph Lameter <c...@linux-foundation.org> [2010-02-19 09:51:12]:
> On Fri, 19 Feb 2010, Balbir Singh wrote:
> > > > ... zone_reclaim. The others back off and try the next zone in the zonelist instead. I'm not sure what the original intention was, but most likely it was to prevent too many parallel reclaimers in the same zone potentially dumping out way more data than necessary.
> > >
> > > Yes, it was to prevent concurrency slowing down reclaim. At that time the number of processors per NUMA node was 2 or so. The number of pages that are reclaimed is limited to avoid tossing too many page cache pages.
> >
> > That is interesting; I always thought it was to try and free page cache first. For example, with zone->min_unmapped_pages: if zone_pagecache_reclaimable is greater than the unmapped pages, we start reclaiming the cached pages first. min_unmapped_pages almost sounds like a higher-level watermark, or am I misreading the code?
>
> Indeed, the purpose is to free *old* page cache pages. min_unmapped_pages is there to protect a minimum of the page cache pages / fs metadata from zone reclaim so that ongoing file I/O is not impacted.

Thanks for the explanation!

--
Three Cheers,
Balbir
Re: MPC5200B XLB Configuration Issues, FEC RFIFO Events, ATA Crashes
Hi Roman:

Sorry for the long delay; I had to fix some other stuff first before I could launch the test. Here is just a short intermediate result.

On 04.02.10 20:35, Albrecht Dreß wrote:
> Actually, I forgot that I have to explicitly enable libata dma on the 5200b, due to the known silicon bugs... I will repeat my tests with the proper configuration, stay tuned.

> > ... a signal processor attached to the localbus, using bestcomm and the fifo for the bulk transfer
>
> Are you using your own driver, or are you using Grant's SCLPC+SDMA driver? BD task?

Basically Grant's driver, but with a slightly modified variant of the gen_bd task. The signal processor is a LE, and I managed to insert the LE/BE conversion into the bestcomm task (see also http://patchwork.ozlabs.org/patch/35038/). Unfortunately, there is no good documentation of the engine; I would also like to shift crc calculation into bestcomm, which seems to be possible in principle, but I never got it running.

The best thing is to run very ugly tests with very high load for at least 24h. Today I launched my test application on kernel 2.6.32 with a few minor tweaks. It runs 4 threads in parallel, all first writing a number of data blocks, then doing a sync() when appropriate, and then reading them all back and checking the contents (md5 hash):

- one writes/reads back 256 files of 256k each to an nfs3 share on a Xeon server, using a 100 MBit line;
- one writes/reads back one 1 MByte block using BestComm to a localbus device (see quote above);
- two write/read back 128 files of 64k each to two CF cards w/ vfat, both attached to the ata (master/slave).

Booting with 'libata.force=mwdma2', this test reproducibly freezes the system *within a few minutes*, in one case leaving the vfat fs on one card completely broken. The system didn't throw a panic; it was always simply stuck - no response on the serial console, nothing. Booting *without* this option (i.e. using pio for the cf cards), the system seems to run flawlessly.

I will continue the test over the weekend (now active for ~5 hours), but it looks as if I can reproduce your problem. Next week, I'll try your fix (hope I don't wear out the cf cards...) and re-run the test.

Best, Albrecht.
Re: State of PREEMPT_RT in PPC arch
Hi Wolfram,

On Fri, Feb 19, 2010 at 12:54 AM, Wolfram Sang <w.s...@pengutronix.de> wrote:
> The list was appropriate. As all people are busy by default, getting no response is not that exceptional. If you need a fast response, you should consider commercial support.

I appreciate your comments.

-Ryan.
Re: register long sp asm("r1") incorrect
On Mon 2010-02-15 14:15:17, H. Peter Anvin wrote:
> On 02/15/2010 01:04 PM, Benjamin Herrenschmidt wrote:
> > It's true that most other uses of it we have are global scope (local_paca in r13, glibc use of r2/r13, etc...) afaik, but since r1 itself is the stack pointer always, I think they pretty much guarantee it works.
>
> It should work, because r1, being the stack pointer, is already marked a reserved register in gcc. The reference Pavel is citing basically states that gcc won't globally reserve the register, which is true, but it is already reserved anyway.

Ok, thanks for the clarification.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
hrtimers in powerpc arch?
Hi,

Are hrtimers supported in the powerpc arch and used in embedded powerpc drivers? I grepped for ktime_t and hrtimer_start() under arch/powerpc and found not too many calls. Does this indicate that the powerpc world doesn't use hrtimers?

Thanks,
-Ryan.