Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
> > What's the point of this indirection other than another way of avoiding > > empty node 0? > > Honestly, I do not have any idea. I've traced it down to > Author: Andi Kleen > Date: Tue Jan 11 15:35:48 2005 -0800 I don't remember all the details, and I can't even find the commit (is it in linux-historic?). But AFAIK there's no guarantee PXMs are small and contiguous, so it seemed better to have a clean zero-based space. Back then we had a lot of problems with buggy SRAT tables in BIOS, so we really tried to avoid trusting the BIOS as much as possible. -Andi
Re: [PATCH v8 6/7] tools/perf: Enable Hz/hz prinitg for --metric-only option
> > diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c > > index 9e757d18d713..679aaa655824 100644 > > --- a/tools/perf/util/stat-display.c > > +++ b/tools/perf/util/stat-display.c > > @@ -237,8 +237,6 @@ static bool valid_only_metric(const char *unit) > > if (!unit) > > return false; > > if (strstr(unit, "/sec") || > > - strstr(unit, "hz") || > > - strstr(unit, "Hz") || > > will this change output of --metric-only for some setups then? > > Andi, are you ok with this? Yes should be ok $ grep -i ScaleUnit.*hz arch/x86/*/* $ Probably was for some metric we later dropped. -Andi
Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information
On Mon, Mar 02, 2020 at 11:13:32AM +0100, Peter Zijlstra wrote: > On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote: > > Modern processors export such hazard data in Performance > > Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event > > Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on > > AMD[3] provides similar information. > > > > Implementation detail: > > > > A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced. > > If it's set, kernel converts arch specific hazard information > > into generic format: > > > > struct perf_pipeline_haz_data { > > /* Instruction/Opcode type: Load, Store, Branch */ > > __u8 itype; > > /* Instruction Cache source */ > > __u8 icache; > > /* Instruction suffered hazard in pipeline stage */ > > __u8 hazard_stage; > > /* Hazard reason */ > > __u8 hazard_reason; > > /* Instruction suffered stall in pipeline stage */ > > __u8 stall_stage; > > /* Stall reason */ > > __u8 stall_reason; > > __u16 pad; > > }; > > Kim, does this format indeed work for AMD IBS? Intel PEBS has a similar concept for annotation of memory accesses, which is already exported through perf_mem_data_src. This is essentially an extension. It would be better to have something unified here. Right now it seems to duplicate at least part of the PEBS facility. -Andi
Re: [RFC 5/6] perf/tools: Enhance JSON/metric infrastructure to handle "?"
> Here, '?' will be replaced with a runtime value and metric expression will > be replicated. Okay seems reasonable to me. Thanks, -Andi
Re: [RFC 5/6] perf/tools: Enhance JSON/metric infrastructure to handle "?"
On Fri, Jan 17, 2020 at 06:16:19PM +0530, Kajol Jain wrote: > Patch enhances current metric infrastructure to handle "?" in the metric > expression. The "?" can be used for parameters whose value is not known while > creating metric events and which can be replaced later at runtime with > the proper value. It also adds flexibility to create multiple events out > of a single metric event added in the json file. Please add a proper specification for how this ? thing is supposed to work, what the exact semantics are, how it is different from the existing # mechanism etc. The standard way to do similar things before was to define an explicit # name and let the expr code take care of it. -Andi
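The substitution being asked about can be sketched in a few lines: a runtime value replaces each "?" placeholder, and the metric expression is replicated once per value. A minimal Python sketch (the expression string and parameter values are illustrative, not the actual perf implementation):

```python
def expand_metric(expr, param_values):
    # Replicate the metric expression once per runtime value,
    # substituting every "?" placeholder with that value.
    return [expr.replace("?", str(v)) for v in param_values]

# e.g. one event per chip for an hv_24x7-style metric with a chip parameter
expand_metric("hv_24x7@PM_PB_CYC@chip=?@", [0, 1])
# -> ['hv_24x7@PM_PB_CYC@chip=0@', 'hv_24x7@PM_PB_CYC@chip=1@']
```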
Re: [PATCH v9 21/24] perf tools: Add support for the SPF perf event
On Mon, Mar 26, 2018 at 02:44:48PM -0700, David Rientjes wrote: > On Tue, 13 Mar 2018, Laurent Dufour wrote: > > > Add support for the new speculative faults event. > > > > Signed-off-by: Laurent Dufour> > Acked-by: David Rientjes > > Aside: should there be a new spec_flt field for struct task_struct that > complements maj_flt and min_flt and be exported through /proc/pid/stat? No. task_struct is already too bloated. If you need per process tracking you can always get it through trace points. -Andi
Re: [PATCH v6 03/24] mm: Dont assume page-table invariance during faults
Laurent Dufour writes: > From: Peter Zijlstra > > One of the side effects of speculating on faults (without holding > mmap_sem) is that we can race with free_pgtables() and therefore we > cannot assume the page-tables will stick around. > > Remove the reliance on the pte pointer. This needs a lot more explanation. So why is this code not needed with SPF only? -Andi
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
> That makes me extremely nervous... there could be all sort of > assumptions esp. in arch code about the fact that we never populate the > tree without the mm sem. > > We'd have to audit archs closely. Things like the page walk cache > flushing on power etc... Yes the whole thing is quite risky. Probably will need some kind of per architecture opt-in scheme? -Andi
Re: [PATCH v2 2/5] perf/x86/intel: Record branch type
> > It's a somewhat common situation with partially JITed code, if you > > don't have an agent. You can still do a lot of useful things. > > Like what? How can you say anything about code you don't have? For example, if you combine the PMU topdown measurement and see that it's frontend bound, and then you see it has lots of forward conditionals, then dynamic basic block reordering will help. If you have lots of cross page jumps then function reordering will help. etc. > > We found it useful to have this extra information during workload > > analysis. Forward conditionals and page crossing jumps > > are indications of frontend problems. > > But you already have the exact same information in {to,from}, why would > you need to repackage information already contained? Without this patch, we don't know if it's conditional or something else. And the kernel already knows this for its filtering, so it can as well report it. Right, the CROSS_* and forward/backward information could be computed later. -Andi
Re: [PATCH v2 2/5] perf/x86/intel: Record branch type
On Fri, Apr 07, 2017 at 05:20:31PM +0200, Peter Zijlstra wrote: > On Fri, Apr 07, 2017 at 06:47:43PM +0800, Jin Yao wrote: > > Perf already has support for disassembling the branch instruction > > and using the branch type for filtering. The patch just records > > the branch type in perf_branch_entry. > > > > Before recording, the patch converts the x86 branch classification > > to common branch classification and compute for checking if the > > branches cross 4K or 2MB areas. It's an approximate computing for > > crossing 4K page or 2MB page. > > The changelog is completely empty of rationale. Why do we care? > > Not having the binary is a very bad reason; you can't do much of > anything if that's missing. It's a somewhat common situation with partially JITed code, if you don't have an agent. You can still do a lot of useful things. We found it useful to have this extra information during workload analysis. Forward conditionals and page crossing jumps are indications of frontend problems. -Andi
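The "approximate computing" for page crossing mentioned in the changelog boils down to comparing the page frame numbers of the branch source and target. A hedged Python sketch (shift values correspond to 4K and 2MB x86 pages; this is an illustration, not the kernel's actual code):

```python
def crosses_page(from_addr, to_addr, page_shift=12):
    # A branch crosses a page when its source and target lie in
    # different pages: shift by 12 for 4K pages, 21 for 2MB pages.
    return (from_addr >> page_shift) != (to_addr >> page_shift)

crosses_page(0x1fff, 0x2000)      # True: crosses a 4K boundary
crosses_page(0x1fff, 0x2000, 21)  # False: same 2MB page
```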
Re: [PATCH v21 00/20] perf, tools: Add support for PMU events in JSON format
On Mon, Sep 26, 2016 at 12:03:43PM -0300, Arnaldo Carvalho de Melo wrote: > Em Mon, Sep 26, 2016 at 10:35:33AM +0200, Jiri Olsa escreveu: > > ping.. is that working for you? IMO we can include this > > as additional patch to the set.. > > No, it doesn't; it fails to build on the first cross env I tried, fixing it > now, resulting patch: Yes, it shouldn't be difficult to fix cross building. I don't think there are any fundamental problems. -Andi
Re: [PATCH v20 00/20] perf, tools: Add support for PMU events in JSON format
> > > > > > > > I've already made some changes in pmu-events/* to support > > > this hierarchy to see how bad the change would be.. and > > > it's not that bad ;-) > > > > Everything has to be automated, please no manual changes. > > sure > > so, if you're ok with the layout, how do you want to proceed further? If the split version is acceptable it's fine for me to merge it. I'll add split-json to my scripting, so the next update would be split too. -Andi
Re: [PATCH v20 00/20] perf, tools: Add support for PMU events in JSON format
> hi, > I had discussion with Ingo about the state of this patchset > and there's one more requirement from his side - to split > event files into per topic files Thanks Jiri. > > I made some initial changes over latest Sukadev's branch > and came up with something like this: Did you just split it by the "Topic" fields? > > $ find pmu-events/arch/x86/ > pmu-events/arch/x86/ > pmu-events/arch/x86/NehalemEX_core > pmu-events/arch/x86/NehalemEX_core/Memory.json > pmu-events/arch/x86/NehalemEX_core/Virtual-Memory.json > pmu-events/arch/x86/NehalemEX_core/Cache.json > pmu-events/arch/x86/NehalemEX_core/Pipeline.json > pmu-events/arch/x86/NehalemEX_core/Floating-point.json > pmu-events/arch/x86/NehalemEX_core/Other.json > pmu-events/arch/x86/mapfile.csv > pmu-events/arch/x86/Broadwell_core > pmu-events/arch/x86/Broadwell_core/Memory.json > pmu-events/arch/x86/Broadwell_core/Virtual-Memory.json > pmu-events/arch/x86/Broadwell_core/Cache.json > pmu-events/arch/x86/Broadwell_core/Pipeline.json > pmu-events/arch/x86/Broadwell_core/Floating-point.json > pmu-events/arch/x86/Broadwell_core/Other.json > pmu-events/arch/x86/Broadwell_core/Frontend.json > > so let's have a discussion if this is acceptable for you guys Splitting is fine for me, as long as it's scriptable. I already have some scripts to generate the perf json files, can update them to split. > > I've already made some changes in pmu-events/* to support > this hierarchy to see how bad the change would be.. and > it's not that bad ;-) Everything has to be automated, please no manual changes. -Andi
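Splitting by the "Topic" field is mechanical; a rough sketch of the kind of script described here (the file layout and field names are assumed from the listing above, not perf's actual tooling):

```python
import json
import os
from collections import defaultdict

def split_by_topic(event_file, out_dir):
    # Group events by their "Topic" field and write one JSON file per
    # topic, mirroring the pmu-events/arch/x86/<model>/ layout above.
    with open(event_file) as f:
        events = json.load(f)
    by_topic = defaultdict(list)
    for ev in events:
        by_topic[ev.get("Topic", "Other")].append(ev)
    os.makedirs(out_dir, exist_ok=True)
    for topic, evs in by_topic.items():
        path = os.path.join(out_dir, topic.replace(" ", "-") + ".json")
        with open(path, "w") as f:
            json.dump(evs, f, indent=4)
```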
Re: [PATCH 2/5] kbuild: allow archs to select build for link dead code/data elimination
On Wed, Aug 10, 2016 at 12:29:29AM +0200, Arnd Bergmann wrote: > On Monday, August 8, 2016 8:16:05 PM CEST Andi Kleen wrote: > > > I don't understand what led Andi Kleen to also move .text.hot and > > > .text.unlikely together with .text [2], but this may have > > > been a related issue. > > > > > > [2] https://lkml.org/lkml/2015/7/19/377 > > > > The goal was just to move .hot and .unlikely all together, so that > > they are clustered and use the minimum amount of cache. On x86 it doesn't > > matter where they are exactly, as long as each is together. > > If they are not explicitly listed then the linker interleaves > > them with the normal text, which defeats the purpose. > > I still don't see it, my reading of your patch is that you did > the opposite, by changing the description that puts all .text.hot > in front of .text, and all .text.unlikely after exit.text into > one that mixes them with .text. What am I missing here? No, it doesn't mix .unlikely with .text; .unlikely is all in one place. -Andi -- a...@linux.intel.com -- Speaking for myself only
Re: [PATCH 2/5] kbuild: allow archs to select build for link dead code/data elimination
On Wed, Aug 10, 2016 at 12:29:29AM +0200, Arnd Bergmann wrote: > On Monday, August 8, 2016 8:16:05 PM CEST Andi Kleen wrote: > > > I don't understand what led Andi Kleen to also move .text.hot and > > > .text.unlikely together with .text [2], but this may have > > > been a related issue. > > > > > > [2] https://lkml.org/lkml/2015/7/19/377 > > > > The goal was just to move .hot and .unlikely all together, so that > > they are clustered and use the minimum amount of cache. On x86 it doesn't > > matter where they are exactly, as long as each is together. > > If they are not explicitly listed then the linker interleaves > > them with the normal text, which defeats the purpose. > > I still don't see it, my reading of your patch is that you did > the opposite, by changing the description that puts all .text.hot > in front of .text, and all .text.unlikely after exit.text into > one that mixes them with .text. What am I missing here? .text.hot is actually not used; the critical part is .text.unlikely, which was not listed and was interleaved before the patch. -Andi
Re: [PATCH 2/5] kbuild: allow archs to select build for link dead code/data elimination
> I don't understand what led Andi Kleen to also move .text.hot and > .text.unlikely together with .text [2], but this may have > been a related issue. The goal was just to move .hot and .unlikely all together, so that they are clustered and use the minimum amount of cache. On x86 it doesn't matter where they are exactly, as long as each is together. If they are not explicitly listed then the linker interleaves them with the normal text, which defeats the purpose. -Andi
Re: [PATCH v19 00/19] perf, tools: Add support for PMU events in JSON format
On Mon, Nov 30, 2015 at 06:56:55PM -0800, Sukadev Bhattiprolu wrote: > CPUs support a large number of performance monitoring events (PMU events) > and often these events are very specific to an architecture/model of the > CPU. To use most of these PMU events with perf, we currently have to identify > them by their raw codes: Could this patch be merged please? I'm not aware of any open objection to it, and it has been already extensively reviewed in the past. Full event lists are very useful for people. Also there are additional changes (uncore event lists) which are blocked on these patches getting merged. Thanks, -Andi ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v14 15/19] perf, tools: Support long descriptions with perf list
On Fri, Jun 05, 2015 at 12:21:38PM +0200, Jiri Olsa wrote: On Thu, Jun 04, 2015 at 11:27:23PM -0700, Sukadev Bhattiprolu wrote: SNIP --- tools/perf/builtin-list.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/tools/perf/builtin-list.c b/tools/perf/builtin-list.c index 3f058f7..d0f7a18 100644 --- a/tools/perf/builtin-list.c +++ b/tools/perf/builtin-list.c @@ -22,10 +22,13 @@ int cmd_list(int argc, const char **argv, const char *prefix __maybe_unused) { int i; bool raw_dump = false; + bool long_desc_flag = false; struct option list_options[] = { OPT_BOOLEAN(0, "raw-dump", &raw_dump, "Dump raw events"), OPT_BOOLEAN('d', "desc", &desc_flag, "Print extra event descriptions. --no-desc to not print."), + OPT_BOOLEAN('d', "long-desc", &long_desc_flag, + "Print longer event descriptions."), hum, it should be 'v', right? Yes, that's right. Also BTW need to add the new option to the usage line a few lines below. -Andi
Re: [PATCH v13 12/14] perf, tools: Add support for event list topics
please split at least the jevents Topic parsing from the rest, ideally also the alias update and the display change What's the point of all these splits? It's already one logical unit, not too large, and is bisectable. -andi
Re: [PATCH 2/4] perf: jevents: Program to convert JSON file to C style file
Ok I did some scripting to add these topics you requested to the Intel JSON files, and changed perf list to group events by them. I'll redirect any questions on their value to you. And I certainly hope this is the last of your improvements for now. The updated event lists are available in git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/intel-json-files-3 The updated patches are available in git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/builtin-json-6 Also posted separately. The output looks like this % perf list ... Cache: l1d.replacement [L1D data line replacements] l1d_pend_miss.pending [L1D miss oustandings duration in cycles] l1d_pend_miss.pending_cycles [Cycles with L1D load Misses outstanding] ... Floating point: fp_assist.any [Cycles with any input/output SSE or FP assist] fp_assist.simd_input [Number of SIMD FP assists due to input values] fp_assist.simd_output [Number of SIMD FP assists due to Output values] ... Memory: machine_clears.memory_ordering [Counts the number of machine clears due to memory order conflicts] mem_trans_retired.load_latency_gt_128 [Loads with latency value being above 128 (Must be precise)] mem_trans_retired.load_latency_gt_16 [Loads with latency value being above 16 (Must be precise)] ... Pipeline: arith.fpu_div [Divide operations executed] arith.fpu_div_active [Cycles when divider is busy executing divide operations] baclears.any [Counts the total number when the front end is resteered, mainly when the BPU cannot provide a correct prediction and this is corrected by other branch handling mechanisms at the front end] -Andi P.S.: You may want to look up the definition of logical fallacy in wikipedia.
Re: [PATCH 04/10] perf, tools: Handle header line in mapfile
On Fri, May 29, 2015 at 11:13:15AM +0200, Jiri Olsa wrote: On Thu, May 28, 2015 at 10:45:06PM -0700, Sukadev Bhattiprolu wrote: Jiri Olsa [jo...@redhat.com] wrote: | if (line[0] == '#' || line[0] == '\n') | continue; | + if (!strncmp(line, "Family", 6)) | + continue; | | I think we should fix mapfiles to put the 'Family' starting | line as a comment.. the way powerpc mapfile is done You mean add something like this to the Intel mapfile: # Power8 entries 004d0100,1,power8.json,core and drop this patch? right But it's a CSV file. CSV files are supposed to have column headers. There are lots of tools that work better with them if they have headers. Please keep the header. -Andi
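The point about headers is easy to demonstrate: with a header row, generic CSV tooling gets self-describing records for free. A small Python illustration (the column names are an assumed layout, not the exact mapfile schema):

```python
import csv
import io

mapfile = """Family-model,Version,Filename,EventType
GenuineIntel-6-3E,V16,IvyTown_core.json,core
"""

# DictReader consumes the header line and keys each row by column name.
rows = list(csv.DictReader(io.StringIO(mapfile)))
rows[0]["Filename"]  # -> 'IvyTown_core.json'
```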
Re: [PATCH 09/10] perf, tools: Add a --no-desc flag to perf list
On Thu, May 28, 2015 at 02:39:14PM +0200, Jiri Olsa wrote: On Wed, May 27, 2015 at 02:23:28PM -0700, Sukadev Bhattiprolu wrote: From: Andi Kleen a...@linux.intel.com Add a --no-desc flag to perf list to not print the event descriptions that were earlier added for JSON events. This may be useful to get a less crowded listing. It still defaults to printing descriptions, as that is the more useful default for most users. I might not be typical user, but the first thing I tried to explore was 'perf list -v' ;-) would it be better to have just the list with event names for: $ perf list and with descriptions for: $ perf list -v not sure we already discussed this.. It was discussed last time. I think it's better to have descriptions by default. Far more user friendly. One thing we could in theory do with -v is to switch between Brief and Public Description (the latter is often more verbose). Would need some changes to the alias code to have two descriptions though. -Andi
Re: [PATCH 2/4] perf: jevents: Program to convert JSON file to C style file
So instead of this flat structure, there should at minimum be broad categorization of the various parts of the hardware they relate to: whether they relate to the branch predictor, memory caches, TLB caches, memory ops, offcore, decoders, execution units, FPU ops, etc., etc. - so that they can be queried via 'perf list'. The categorization is generally on the stem name, which already works fine with the existing perf list wildcard support. So for example, if you only want branches: perf list br* ... br_inst_exec.all_branches [Speculative and retired branches] br_inst_exec.all_conditional [Speculative and retired macro-conditional branches] br_inst_exec.all_direct_jmp [Speculative and retired macro-unconditional branches excluding calls and indirects] br_inst_exec.all_direct_near_call [Speculative and retired direct near calls] br_inst_exec.all_indirect_jump_non_call_ret [Speculative and retired indirect branches excluding calls and returns] br_inst_exec.all_indirect_near_return [Speculative and retired indirect return branches] ... Or mid level cache events: perf list l2* ... l2_l1d_wb_rqsts.all [Not rejected writebacks from L1D to L2 cache lines in any state] l2_l1d_wb_rqsts.hit_e [Not rejected writebacks from L1D to L2 cache lines in E state] l2_l1d_wb_rqsts.hit_m [Not rejected writebacks from L1D to L2 cache lines in M state] l2_l1d_wb_rqsts.miss [Count the number of modified Lines evicted from L1 and missed L2. (Non-rejected WBs from the DCU.)] l2_lines_in.all [L2 cache lines filling L2] ... There are some exceptions, but generally it works this way. The stem could be put into a separate header, but it would seem redundant to me. We don't just want to import the unstructured mess that these event files are - we want to turn them into real structure. We can still keep the messy vendor names as well, like IDQ.DSB_CYCLES, but we want to impose structure as well. The vendor names directly map to the micro architecture, which is the whole point of the events.
IDQ is a part of the CPU, and is described in the CPU manuals. One of the main motivations for adding event lists is to make perf match that documentation. 3) There should be good 'perf list' visualization for these events: grouping, individual names, with a good interface to query details if needed. I.e. it should be possible to browse and discover events relevant to the CPU the tool is executing on. I suppose we could change perf list to give the stem names as section headers to make the long list a bit more readable. Generally you need to have some knowledge of the micro architecture to use these events. There is no way around that. -Andi
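The stem-based querying described above is plain shell-glob matching over event names, as `perf list br*` does. A minimal Python sketch using a few of the event names shown earlier in the thread:

```python
from fnmatch import fnmatch

events = ["br_inst_exec.all_branches", "br_inst_exec.all_conditional",
          "l2_lines_in.all", "l1d.replacement"]

def list_events(pattern, names):
    # Shell-style wildcard filtering over event name stems.
    return [n for n in names if fnmatch(n, pattern)]

list_events("br*", events)
# -> ['br_inst_exec.all_branches', 'br_inst_exec.all_conditional']
```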
Re: [PATCH 4/4] perf: Add power8 PMU events in JSON format
On Thu, May 28, 2015 at 12:01:31AM +0900, Namhyung Kim wrote: On Wed, May 27, 2015 at 11:41 PM, Andi Kleen a...@linux.intel.com wrote: + { + "EventCode": "0x2505e", + "EventName": "PM_BACK_BR_CMPL", + "BriefDescription": "Branch instruction completed with a target address less than current instruction address,", + "PublicDescription": "Branch instruction completed with a target address less than current instruction address.", Can't we remove the PublicDescription field if it's identical to BriefDescription? It seems just wasting spaces.. It's not always identical. There are events where PublicDescription is much longer (several paragraphs) I know. What I said is make it optional so that we can drop it if it's identical. Should be easy enough. It's already optional in the jevents parser. -Andi
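Dropping the redundant field when generating the JSON could look like this; a hedged sketch (the trailing-punctuation comparison is an assumption about how "identical" should be judged, since the two fields above differ only by a comma versus a period):

```python
def drop_redundant_desc(events):
    # Remove PublicDescription when it merely repeats BriefDescription
    # (ignoring trailing punctuation); jevents treats it as optional.
    for ev in events:
        pub = ev.get("PublicDescription", "")
        brief = ev.get("BriefDescription", "")
        if pub and pub.rstrip(".,") == brief.rstrip(".,"):
            del ev["PublicDescription"]
    return events
```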
Re: [PATCH 4/4] perf: Add power8 PMU events in JSON format
+ { + "EventCode": "0x2505e", + "EventName": "PM_BACK_BR_CMPL", + "BriefDescription": "Branch instruction completed with a target address less than current instruction address,", + "PublicDescription": "Branch instruction completed with a target address less than current instruction address.", Can't we remove the PublicDescription field if it's identical to BriefDescription? It seems just wasting spaces.. It's not always identical. There are events where PublicDescription is much longer (several paragraphs) -Andi
Re: [PATCH 2/4] perf: jevents: Program to convert JSON file to C style file
So we build tables of all models in the architecture, and choose matching one when compiling perf, right? Can't we do that when building the tables? IOW, why don't we check the VFM and discard non-matching tables? Those non-matching tables are also needed? We build it for all cpus in an architecture, not all architectures. So e.g. for an x86 binary power is not included, and vice versa. It always includes all CPUs for a given architecture, so it's possible to use the perf binary on other systems than just the one it was built on. -andi
Re: [PATCH 2/4] perf: jevents: Program to convert JSON file to C style file
Sure, but shouldn't we allow JSON files to be in subdirs pmu-events/arch/x86/HSX/Haswell_core.json and this could go to arbitrary levels? I used a flat hierarchy. Should be good enough. -Andi
Re: [PATCH 2/4] perf: jevents: Program to convert JSON file to C style file
pmu-events.c depends only on JSON files relevant to the arch perf is being built on and there could be several JSON files per arch. So it would complicate the Makefiles. Could just use a wildcard dependency on */$(ARCH)/*.json Also it would be good to move the generated file into the object directory. I tried it but it needs some more changes to the Makefiles. -Andi
Re: [PATCH 3/4] perf: Use pmu_events_map table to create event aliases
On Wed, May 20, 2015 at 10:02:04PM -0700, Sukadev Bhattiprolu wrote: Andi Kleen [a...@linux.intel.com] wrote: | If you need something else in vfm to identify the CPU | can't you just add it there? I wouldn't really call it vfm, it's | really an abstract cpu identifier per architecture. So if you | need pvr just add it there. Ok. I will change vfm to cpuid_str and include pvr in it. Thanks. With that change it would also be cleaner to provide a get_cpuid_str() function by the architecture code, and then strcmp in the matching code, instead of having architecture specific compare code. -Andi
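With a per-arch get_cpuid_str(), the table lookup suggested here reduces to a plain string compare against the mapfile entries. An illustrative Python sketch (the mapping layout is an assumption for this example, not perf's actual data structure):

```python
def events_table_for_cpu(cpuid_str, mapping):
    # One canonical cpuid string per architecture makes matching a
    # simple strcmp-style comparison against each mapfile entry.
    for entry_id, table in mapping:
        if entry_id == cpuid_str:
            return table
    return None

mapping = [("GenuineIntel-6-3E", "pme_IvyTown_core")]
events_table_for_cpu("GenuineIntel-6-3E", mapping)  # -> 'pme_IvyTown_core'
```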
Re: [PATCH 3/4] perf: Use pmu_events_map table to create event aliases
+/* + * Return TRUE if the CPU identified by @vfm, @version, and @type + * matches the current CPU. vfm refers to [Vendor, Family, Model], + * + * Return FALSE otherwise. + * + * For Powerpc, we only compare @version to the processor PVR. + */ +bool arch_pmu_events_match_cpu(const char *vfm __maybe_unused, + const char *version, + const char *type __maybe_unused) +{ + char *cpustr; + bool rc; + + cpustr = get_cpu_str(); + rc = !strcmp(version, cpustr); Surely against vfm, not version? I think your mapfile is wrong if that works. That's the Intel format: .vfm = "GenuineIntel-6-3E", .version = "V16", .type = "core", .table = pme_IvyTown_core -Andi
Re: [PATCH 3/4] perf: Use pmu_events_map table to create event aliases
Obviously, that does not fit into the VFM field. We could either add a new PVR field to the mapfile: [vfm, version, type, pvr] or, as the patch currently does, let architectures interpret the version field as they see fit? IOW, leave it to architectures to keep arch_pmu_events_match_cpu() consistent with _their_ mapfile? version is the version number of the event file. This way you can't signify the version number if you ever change something. If you need something else in vfm to identify the CPU can't you just add it there? I wouldn't really call it vfm, it's really an abstract cpu identifier per architecture. So if you need pvr just add it there. -Andi
Re: [RFC][PATCH 4/4] perf: Create aliases for PMU events
I personally like having a set of event files in JSON notation rather than having them directly in a C structure Yes, strings are better and JSON input is also better. I prototyped translating JSON into the proposed structures. I already had to add three new fields, and it wouldn't work for uncore. The string format is much more extensible. BTW as expected the binary sizes are gigantic (for 14 CPU types): % size all.o text data bss dec hex filename 662698 0 0 662698 a1caa all.o % gcc -E all.c | wc -l 55475 -Andi
Re: 'perf upgrade' (was: Re: [PATCH v9 00/11] Add support for JSON event files.)
On Fri, Apr 17, 2015 at 05:31:26PM +0200, Jiri Olsa wrote: On Wed, Apr 15, 2015 at 01:50:42PM -0700, Sukadev Bhattiprolu wrote: SNIP | | - to blindly follow some poorly constructed vendor format with no |high level structure, that IMHO didn't work very well when OProfile |was written, and misrepresenting it as 'symbolic event names'. | |Take a look at: | | https://download.01.org/perfmon/HSW/Haswell_core_V17.json | |and weep. Evil vendor formats, but to be fair, here is what _we_ have today: perf stat -e r10068,r20036,r40060,r40ac sleep 1 hum, you could also use the 'cpu/event=.../' syntax right? That's even worse -- same hex numbers, just more redundancy. All the other profilers support symbolic names, which is what users want. -Andi
Re: 'perf upgrade' (was: Re: [PATCH v9 00/11] Add support for JSON event files.)
My suggestion to resolve the technical objections and lift the NAK would be: - to add the tables to the source code, in a more human readable format and (optionally) structure the event names better into a higher level hierarchy, than the humungous linear dumps with no explanations that you propose - while still supporting the 'raw' vendor event names you want to use, for those people who are used to them. Trying to understand what you mean by high level hierarchy: Do you mean something like the oprofile event / unit mask split, with events having a default unit mask? This one actually works poorly for modern Intel systems, as unit masks can completely change the behavior, so there is not necessarily a direct relation between the name before the dot and the one after, or a concept of a default unit mask. Or do you mean someone creating a tree hierarchy of events to systematically debug some particular problem? I implemented this in my toplev tool here: https://github.com/andikleen/pmu-tools/wiki/toplev-manual http://github.com/andikleen/pmu-tools But it's quite complicated and cannot really be done portably in a full way without completely turning perf into an architecture specific tool (which I assume you're opposed to) Or just some sections grouping events into different topics? branches, caches, execution or somesuch. I think this could be done with the JSON format. Just add a new header. Print them as sections in perf list. Or do you mean something like the perf cache events tables? We already have those. But they cannot replace direct micro architectural specific events, as the cache event tables do not necessarily express everything a given micro architecture supports. And also there is a lot of documentation using the vendor event names, which the users then want to use. If it's something else please explain.
- to pre-parse the event descriptions at build time - beyond the speedup FWIW I just measured it and the parse overhead for the haswell JSON file on my laptop is about 33ms. # with event map % time ./obj-perf/perf list >/dev/null real 0m0.045s user 0m0.041s sys 0m0.003s % export EVENTMAP=/dev/zero % time perf list >/dev/null real 0m0.011s user 0m0.004s sys 0m0.006s - to upgrade perf as a whole unit: this helps not just your usecase but many other usecases as well. With the downloader it actually automatically downloads any missing files, so the upgrading just works. -Andi
Re: 'perf upgrade' (was: Re: [PATCH v9 00/11] Add support for JSON event files.)
On Tue, Apr 14, 2015 at 10:55:41AM +0200, Ingo Molnar wrote:
> * Sukadev Bhattiprolu <suka...@linux.vnet.ibm.com> wrote:
> > This is another attempt to resurrect Andi Kleen's patchset so users can specify perf events by their event names rather than raw codes.
> >
> > This is a rebase of Andi Kleen's patchset from Jul 30, 2014[1] to 4.0. (I fixed minor and not so minor conflicts).
>
> So this series shows some progress, but instead of this limited checkout ability I'd still prefer it if 'perf download' downloaded the latest perf code itself and built it - it shouldn't be limited to just a small subset of the perf source code!

I'm not against that idea (and can see it being useful in some cases[1]), but this would need some authentication. Otherwise it would be a "$bigbrother/nsa/anyone controlling your network can run everything on your computer" style hole.

The original JSON files don't have this problem because they didn't contain any code.

-Andi

[1] especially to work around the broken perf setup on d./u. distros.
Re: perf report broken for branch stack samples
On Wed, Apr 01, 2015 at 02:32:54PM +0530, Anshuman Khandual wrote:
> Hello, perf report is not showing up the branch stack sample results in the from_symbol to to_symbol format even if the perf.data file has got the samples (through a 'perf record -b workload' session).

Sorry for the late answer. This was already fixed in tip by

commit fefd2d9619de3bf0bf02a8622e9f445c3d19cc3f
Author: He Kuang <heku...@huawei.com>
Date:   Sun Feb 15 10:33:37 2015 +0800

    perf report: Fix branch stack mode cannot be set

-Andi
Re: [PATCH 1/3] perf/e6500: Make event translations available in sysfs
Thanks for supporting the JSON format too.

> (c) If not, given we don't know how to get us out of the current status quo, can this patch series still be applied, given the original complaint was the size of our events-list.h (whereas power7-events-list.h is almost twice the size)?

The Intel core event lists are far larger even (and will grow even more when uncore gets added).

> If not, patch 3/3 in this series is still valid, no matter what, and it should still be applied (let us know if we need to resubmit).

Could also just leave out the downloader for now, so that you have to get your own event file and set it up with

export EVENTMAP=...

That's basically the patchkit, minus one patch.

-Andi
Re: [PATCH 1/3] perf/e6500: Make event translations available in sysfs
Well I'm tired of discussing this. I don't think what you proposed makes sense, putting 3.4MB[1] of changing blob into perf.

I'll resubmit the JSON parser without the downloader. Then users have the option to get their own events and use that. If you don't like that, standard perf just has to stay with the limited events and raw codes as before, with users having to use external tools or libraries for names for more events[2][3].

-Andi

[1] Current size of https://download.01.org/perfmon/
[2] ocperf in https://github.com/andikleen/pmu-tools
[3] http://perfmon2.sourceforge.net/

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH 1/3] perf/e6500: Make event translations available in sysfs
> I'll NAK any external 'download area' (and I told that Andi before): tools/perf/event-tables/ or so is a good enough 'download area' with fast enough update cycles.

The proposal was to put it on kernel.org, similar to how external firmware blobs are distributed.

CPU event lists are data sheets, so they are like firmware. They do not follow the normal kernel code licenses. They are not source code. They cannot be reviewed in the normal way.

> If any 'update' of event descriptions is needed it can happen through the distro package mechanism, or via a simple 'git pull' if it's compiled directly. Let's not overengineer this with any dependence on an external site and with a separate update mechanism - let's just get the tables into tools/ and see it from there...

That experiment has been already done for oprofile; it didn't work very well.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
Anton Blanchard <an...@samba.org> writes:
> Thoughts? It seems like we could hit a similar situation if a machine is balanced but we run out of memory on a single node.

Yes I agree, but your patch doesn't seem to attempt to handle this?

-Andi

> Index: b/mm/slub.c
> ===
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2278,10 +2278,17 @@ redo:
>
>  	if (unlikely(!node_match(page, node))) {
>  		stat(s, ALLOC_NODE_MISMATCH);
> -		deactivate_slab(s, page, c->freelist);
> -		c->page = NULL;
> -		c->freelist = NULL;
> -		goto new_slab;
> +
> +		/*
> +		 * If the node contains no memory there is no point in trying
> +		 * to allocate a new node local slab
> +		 */
> +		if (node_spanned_pages(node)) {
> +			deactivate_slab(s, page, c->freelist);
> +			c->page = NULL;
> +			c->freelist = NULL;
> +			goto new_slab;
> +		}
>  	}
>
>  	/*

--
a...@linux.intel.com -- Speaking for myself only
Re: [PATCH V4 04/10] x86, perf: Add conditional branch filtering support
On Wed, Dec 04, 2013 at 04:02:36PM +0530, Anshuman Khandual wrote:
> This patch adds conditional branch filtering support, enabling it for PERF_SAMPLE_BRANCH_COND in perf branch stack sampling framework by utilizing an available software filter X86_BR_JCC.

Newer Intel CPUs have a hardware filter for conditional branches too. I can look at implementing that. The software option seems fine for now.

-Andi

> Signed-off-by: Anshuman Khandual <khand...@linux.vnet.ibm.com>
> Reviewed-by: Stephane Eranian <eran...@google.com>
> ---
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> index d82d155..9dd2459 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> @@ -384,6 +384,9 @@ static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
>  	if (br_type & PERF_SAMPLE_BRANCH_NO_TX)
>  		mask |= X86_BR_NO_TX;
>
> +	if (br_type & PERF_SAMPLE_BRANCH_COND)
> +		mask |= X86_BR_JCC;
> +
>  	/*
>  	 * stash actual user request into reg, it may
>  	 * be used by fixup code for some CPU
> @@ -678,6 +681,7 @@ static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
>  	 * NHM/WSM erratum: must include IND_JMP to capture IND_CALL
>  	 */
>  	[PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP,
> +	[PERF_SAMPLE_BRANCH_COND] = LBR_JCC,
>  };
>
>  static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
> @@ -689,6 +693,7 @@ static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
>  	[PERF_SAMPLE_BRANCH_ANY_CALL] = LBR_REL_CALL | LBR_IND_CALL | LBR_FAR,
>  	[PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL,
> +	[PERF_SAMPLE_BRANCH_COND] = LBR_JCC,
>  };
>
>  /* core */
> --
> 1.7.11.7

--
a...@linux.intel.com -- Speaking for myself only
Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory
Linus Torvalds <torva...@linux-foundation.org> writes:
> On Mon, Sep 30, 2013 at 1:01 PM, Waiman Long <waiman.l...@hp.com> wrote:
> > I think this patch is worth a trial if relevant hardware is more widely available.
>
> The TSX code certainly needs to be moved to an architecture specific area and should be runtime enabled using a static key. We also need more TSX support infrastructure in place first. I think we can pick that up from Andi's patches, he should have that. Although that did have very x86-specific naming (ie xbegin). And I don't think he used asm goto to quite the same advantage as this - and I think we do want to make sure that the overhead is minimal.

FWIW my version #0 used asm goto directly, but I later switched to not using it, to support more compilers and higher level abstractions (locks etc.) and to use the same intrinsics as the user level guys are using. The two extra instructions from not using asm goto for xbegin don't matter all that much in the end.

That's the old asm goto stuff I wrote originally (user level version):

https://github.com/andikleen/tsx-tools/blob/master/include/rtm-goto.h

There was also a kernel version of it that patched; right now this is done in the main TSX patchkit like this:

https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/commit/?h=hle312/rtm-base&id=9190346d57a9bc89e746aee774d07e54cd1e6e75

Essentially without RTM it just becomes an unconditional jump to the abort handler, xabort is a nop, and xtest always returns 0.

-Andi

--
a...@linux.intel.com -- Speaking for myself only
Re: [PATCH 8/10] memory-hotplug : remove page table of x86_64 architecture
Yasuaki Ishimatsu <isimatu.yasu...@jp.fujitsu.com> writes:

> +	}
> +
> +	/*
> +	 * We use 2M page, but we need to remove part of them,
> +	 * so split 2M page to 4K page.
> +	 */
> +	pte = alloc_low_page(&pte_phys);

What happens when the allocation fails?

alloc_low_page seems to be buggy there too, it would __pa a NULL pointer.

> +	if (pud_large(*pud)) {
> +		if ((addr & ~PUD_MASK) == 0 && next <= end) {
> +			set_pud(pud, __pud(0));
> +			pages++;
> +			continue;
> +		}
> +
> +		/*
> +		 * We use 1G page, but we need to remove part of them,
> +		 * so split 1G page to 2M page.
> +		 */
> +		pmd = alloc_low_page(&pmd_phys);

Same here.

> +		__split_large_page((pte_t *)pud, addr, (pte_t *)pmd);
> +
> +		spin_lock(&init_mm.page_table_lock);
> +		pud_populate(&init_mm, pud, __va(pmd_phys));
> +		spin_unlock(&init_mm.page_table_lock);
> +	}
> +
> +	pmd = map_low_page(pmd_offset(pud, 0));
> +	phys_pmd_remove(pmd, addr, end);
> +	unmap_low_page(pmd);
> +	__flush_tlb_all();
> +	}
> +	__flush_tlb_all();

This doesn't flush the other CPUs, does it?

-Andi

--
a...@linux.intel.com -- Speaking for myself only
Re: [PATCH v2 4/6] x86: Add clear_page_nocache
> Moving 64 bytes per cycle is faster on Sandy Bridge, but slower on Westmere. Any preference? ;)

You have to be careful with these benchmarks.

- You need to make sure the data is cache cold; cache hot is misleading.
- The numbers can change if you have multiple CPUs doing this in parallel.

-Andi
[PATCH] [6/99] seqlock: Don't smp_rmb in seqlock reader spin loop
2.6.35-longterm review patch. If anyone has any objections, please let me know.

--

From: Milton Miller <milt...@bga.com>

commit 5db1256a5131d3b133946fa02ac9770a784e6eb2 upstream.

Move the smp_rmb after the cpu_relax loop in read_seqlock and add ACCESS_ONCE to make sure the test and return are consistent.

A multi-threaded core in the lab didn't like the update from 2.6.35 to 2.6.36, to the point it would hang during boot when multiple threads were active. Bisection showed af5ab277ded04bd9bc6b048c5a2f0e7d70ef0867 (clockevents: Remove the per cpu tick skew) as the culprit and it is supported with stack traces showing xtime_lock waits including tick_do_update_jiffies64 and/or update_vsyscall.

Experimentation showed the combination of cpu_relax and smp_rmb was significantly slowing the progress of other threads sharing the core, and this patch is effective in avoiding the hang.

A theory is that the rmb is affecting the whole core while the cpu_relax is causing a resource rebalance flush; together they cause an interference cadence that is unbroken when the seqlock reader has interrupts disabled.

At first I was confused why the refactor in 3c22cd5709e8143444a6d08682a87f4c57902df3 (kernel: optimise seqlock) didn't affect this patch application, but after some study that affected seqcount, not seqlock. The new seqcount was not factored back into the seqlock. I defer that to the future.

While the removal of the timer interrupt offset created contention for the xtime lock while a cpu does the additional work to update the system clock, the seqlock implementation with the tight rmb spin loop goes back much further, and was just waiting for the right trigger.
Signed-off-by: Milton Miller <milt...@bga.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Andi Kleen <a...@firstfloor.org>
Cc: Nick Piggin <npig...@kernel.dk>
Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
Cc: Anton Blanchard <an...@samba.org>
Cc: Paul McKenney <paul...@linux.vnet.ibm.com>
Acked-by: Eric Dumazet <eric.duma...@gmail.com>
Signed-off-by: Andi Kleen <a...@linux.intel.com>
Link: http://lkml.kernel.org/r/%3Cseqlock-rmb%40mdm.bga.com%3E
Signed-off-by: Thomas Gleixner <t...@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gre...@suse.de>
---
 include/linux/seqlock.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.35.y/include/linux/seqlock.h
===
--- linux-2.6.35.y.orig/include/linux/seqlock.h
+++ linux-2.6.35.y/include/linux/seqlock.h
@@ -88,12 +88,12 @@ static __always_inline unsigned read_seq
 	unsigned ret;

 repeat:
-	ret = sl->sequence;
-	smp_rmb();
+	ret = ACCESS_ONCE(sl->sequence);
 	if (unlikely(ret & 1)) {
 		cpu_relax();
 		goto repeat;
 	}
+	smp_rmb();

 	return ret;
 }
Re: [PATCH] seqlock: don't smp_rmb in seqlock reader spin loop
On Thu, May 12, 2011 at 04:13:54AM -0500, Milton Miller wrote:
> Move the smp_rmb after cpu_relax loop in read_seqlock and add ACCESS_ONCE to make sure the test and return are consistent.
>
> A multi-threaded core in the lab didn't like the update

Which core was that?

-Andi
Re: [PATCH 0/5] make *_gate_vma accept mm_struct instead of task_struct
On Tue, Mar 08, 2011 at 07:31:56PM -0500, Stephen Wilson wrote:
> Morally, the question of whether an address lies in a gate vma should be asked with respect to an mm, not a particular task. Practically, dropping the dependency on task_struct will help make current and future operations on mm's more flexible and convenient. In particular, it allows some code paths to avoid the need to hold task_lock.
>
> The only architecture this change impacts in any significant way is x86_64. The principal change on that architecture is to mirror TIF_IA32 via a new flag in mm_context_t.

The problem is: you're adding a likely cache miss on mm_struct for every 32bit compat syscall now, even if they don't need mm_struct currently (and a lot of them do not).

Unless there's a very good justification to make up for this performance issue elsewhere (including numbers) this seems like a bad idea.

> This is the first of a two part series that implements safe writes to /proc/pid/mem. I will be posting the second series to lkml shortly. These

Making every syscall slower for /proc/pid/mem doesn't seem like a good tradeoff to me. Please solve this in some other way.

-Andi
Re: [PATCH 0/5] make *_gate_vma accept mm_struct instead of task_struct II
On Thu, Mar 10, 2011 at 08:00:32AM -0800, Andi Kleen wrote:
> On Tue, Mar 08, 2011 at 07:31:56PM -0500, Stephen Wilson wrote:
> > Morally, the question of whether an address lies in a gate vma should be asked with respect to an mm, not a particular task. Practically, dropping the dependency on task_struct will help make current and future operations on mm's more flexible and convenient. In particular, it allows some code paths to avoid the need to hold task_lock.
> >
> > The only architecture this change impacts in any significant way is x86_64. The principal change on that architecture is to mirror TIF_IA32 via a new flag in mm_context_t.
>
> The problem is -- you're adding a likely cache miss on mm_struct for every 32bit compat syscall now, even if they don't need mm_struct currently (and a lot of them do not)
>
> Unless there's a very good justification to make up for this performance issue elsewhere (including numbers) this seems like a bad idea.

Hmm, I see you're only setting it at exec time actually, on rereading the patches. I thought you were changing TS_COMPAT, which is in the syscall path. Never mind. I have no problems with doing such a change at exec time.

-Andi
Re: [PATCH 0/5] make *_gate_vma accept mm_struct instead of task_struct II
On Thu, Mar 10, 2011 at 11:54:14AM -0500, Stephen Wilson wrote:
> On Thu, Mar 10, 2011 at 08:38:09AM -0800, Andi Kleen wrote:
> > On Thu, Mar 10, 2011 at 08:00:32AM -0800, Andi Kleen wrote:
> > > The problem is -- you're adding a likely cache miss on mm_struct for every 32bit compat syscall now, even if they don't need mm_struct currently (and a lot of them do not)
> > >
> > > Unless there's a very good justification to make up for this performance issue elsewhere (including numbers) this seems like a bad idea.
> >
> > Hmm, I see you're only setting it at exec time actually, on rereading the patches. I thought you were changing TS_COMPAT, which is in the syscall path. Never mind. I have no problems with doing such a change at exec time.
>
> OK. Great! Does this mean I have your Acked-by or Reviewed-by?

I didn't read it all, but the first two patches looked ok.

-Andi
Re: [RFC] [PATCH] allow low HZ values?
Thomas Gleixner <t...@linutronix.de> writes:
> On Mon, 11 Oct 2010, Tim Pepper wrote:
> > I'm not necessarily wanting to open up the age old question of what is a good HZ, but we were doing some testing on timer tick overheads for HPC applications and this came up...
>
> Yeah. This comes up always when the timer tick overhead on HPC is tested. And this patch is again the fundamentally wrong answer.

That's an unfair description of the proposal.

> We have told HPC folks for years that we need a kind of NOHZ mode for HPC where we can transparently switch off the tick when only one user space bound thread is active and switch back to normal once this thing terminates or goes into the kernel via a syscall. Sigh, nothing happened ever except for repeating the same crap patches over and over.

Jan Blunck posted a patch for this exactly a few months ago. Unfortunately it didn't get the accounting right, but other than that it seemed like a reasonable starting point.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH 31/40] trace syscalls: Convert various generic compat syscalls
Ian Munsie wrote:
> From: Ian Munsie <imun...@au1.ibm.com>
>
> This patch converts numerous trivial compat syscalls through the generic kernel code to use the COMPAT_SYSCALL_DEFINE family of macros.

Why? This just makes the code look uglier and the functions harder to grep for.

-Andi
Re: [PATCH 31/40] trace syscalls: Convert various generic compat syscalls
Frederic Weisbecker wrote:
> On Wed, Jun 23, 2010 at 12:19:38PM +0200, Andi Kleen wrote:
> > Ian Munsie wrote:
> > > This patch converts numerous trivial compat syscalls through the generic kernel code to use the COMPAT_SYSCALL_DEFINE family of macros.
> >
> > Why? This just makes the code look uglier and the functions harder to grep for.
>
> Because it makes them usable with syscall tracing.

Ok, that information is missing in the changelog then.

Also I hope the uglification-usefulness factor is really worth it. The patch is certainly no slouch on the uglification side.

It also has maintenance costs, e.g. I doubt ctags and cscope will be able to deal with these kinds of macros, so it has a high cost for everyone using these tools. For those it would be actually better if you used a separate annotation that does not confuse standard C parsers.

-Andi
Re: [PATCH 31/40] trace syscalls: Convert various generic compat syscalls
> I haven't heard any complaints about existing syscall wrappers.

At least for me they always interrupt my grepping.

> What kind of annotations could solve that?

If you put the annotation in a separate macro and leave the original prototype alone, then C parsers could still parse it.

-Andi
Re: [PATCH 2/4] panic: Allow taint flag for warnings to be changed from TAINT_WARN
Ben Hutchings <b...@decadent.org.uk> writes:
> WARN() is used in some places to report firmware or hardware bugs that are then worked around. These bugs do not affect the stability of the kernel and should not set the usual TAINT_WARN flag. To allow for this, add WARN_TAINT() and WARN_TAINT_ONCE() macros that take a taint flag as argument.
>
> Architectures that implement warnings using trap instructions instead of calls to warn_slowpath_*() must now implement __WARN_TAINT(taint) instead of __WARN().

I guess this should enforce that at least some taint flag is set? (e.g. with a BUILD_BUG_ON)

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [v8 PATCH 2/8]: cpuidle: implement a list based approach to register a set of idle routines.
> How about something like this: if the arch does not enable CONFIG_CPU_IDLE, the cpuidle_idle_call which is called from cpu_idle() should call default_idle without involving the cpuidle registration steps. This should prevent bloating up of the kernel for archs which don't want to use cpuidle.

On x86 some people want a small kernel too, so selecting it at architecture granularity is not good.

Also you can make it default, you just need to slim it down first.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [v8 PATCH 2/8]: cpuidle: implement a list based approach to register a set of idle routines.
Peter Zijlstra <a.p.zijls...@chello.nl> writes:
> So does it make sense to have a set of sets? Why not integrate them all into one set to be ruled by this governor thing?

cpuidle is currently optional; that is why the two level hierarchy is there, so that you can still have simple idle selection without it.

% size drivers/cpuidle/*.o
   text	   data	    bss	    dec	    hex	filename
   5514	   1416	     44	   6974	   1b3e	drivers/cpuidle/built-in.o

Adding it unconditionally would add ~7k to everyone who wants idle functions. I think making it unconditional would require putting it on a serious diet first.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH 1/3] Support for PCI Express reset type
Mike Mason <mm...@us.ibm.com> writes:
> These patches supersede the previously submitted patch that implemented a fundamental reset bit field. Please review and let me know of any concerns.

Any plans to implement that for x86 too? Right now it seems to be a PPC specific hack.

And where is the driver that is using it?

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: question about softirqs
Chris Friesen <cfrie...@nortel.com> writes:
> One of the reasons I brought up this issue is that there is a lot of documentation out there that says softirqs will be processed on return from a syscall. The fact that it actually depends on the scheduler parameters of the task issuing the syscall isn't ever mentioned.

It's not mentioned because it is not true currently. However some network TCP RX processing can happen in process context, which gives you most of the benefit anyways.

> In fact, Documentation/DocBook/kernel-hacking.tmpl in the kernel source still has the following:
>
> "Whenever a system call is about to return to userspace, or a hardware interrupt handler exits, any 'software interrupts' which are marked pending (usually by hardware interrupts) are run (kernel/softirq.c)."
>
> If anyone is looking at changing this code, it might be good to ensure that at least the kernel docs are updated.

So far the code is not changed in mainline. There have been some proposals only.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: question about softirqs
> If a soft irq is raised in process context, raise_softirq() in kernel/softirq.c calls wakeup_softirqd() to make sure that ksoftirqd

softirqd is only used when the softirq runs for too long or when there are no suitable irq exits for a long time. In normal situations (not excessive time in softirq) they don't do anything.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: question about softirqs
Thomas Gleixner <t...@linutronix.de> writes:
> Err, no. Chris is completely correct:
>
> 	if (!in_interrupt())
> 		wakeup_softirqd();

Yes, you have to wake it up just in case, but it doesn't normally process the data because a normal softirq comes in faster. It's just a safety policy.

You can check this by checking the accumulated CPU time on your ksoftirqds. Mine are all 0 even on long running systems.

The reason Andrea originally added the softirqds was just that if you have very softirq intensive workloads they would tie up too much CPU time or not make enough progress with the default "don't loop too often" heuristics.

> We can not rely on irqs coming in when the softirq is raised from

You can't rely on it, but it happens in near all cases.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: question about softirqs
> I have one machine SMP flooded by network frames, CPU0 handling all

Yes, that's the case softirqd is supposed to handle. When you spend a significant part of your CPU time in softirq context it kicks in to provide somewhat fair additional CPU time. But most systems (like mine) don't do that.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: question about softirqs
On Wed, May 13, 2009 at 09:05:01AM -0600, Chris Friesen wrote:
> Andi Kleen wrote:
> > Thomas Gleixner <t...@linutronix.de> writes:
> > > Err, no. Chris is completely correct:
> > > 	if (!in_interrupt())
> > > 		wakeup_softirqd();
> >
> > Yes you have to wake it up just in case, but it doesn't normally process the data because a normal softirq comes in faster. It's just a safety policy.
>
> What about the scenario I raised earlier, where we have incoming network packets?

Network packets are normally processed by the network packet interrupt's softirq, or alternatively in the NAPI poll loop.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: question about softirqs
On Wed, May 13, 2009 at 01:04:09PM -0600, Chris Friesen wrote:
> Andi Kleen wrote:
> > Network packets are normally processed by the network packet interrupt's softirq, or alternatively in the NAPI poll loop.
>
> If we have a high priority task, ksoftirqd may not get a chance to run.

In this case the next interrupt will also process them. It will just go more slowly, because interrupts limit the work compared to ksoftirqd.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: question about softirqs
On Wed, May 13, 2009 at 01:44:59PM -0600, Chris Friesen wrote:
> Andi Kleen wrote:
> > In this case the next interrupt will also process them. It will just go more slowly, because interrupts limit the work compared to ksoftirqd.
>
> I realize that they will eventually get processed. My point is that the documentation (in-kernel, online, and in various books) says that softirqs will be processed _on the return from a syscall_.

They are. The documentation is correct.

What might not be all processed are the packets that are in the per CPU backlog queue when the network softirq runs (for non NAPI; for NAPI that's obsolete anyways). That's because there are limits. Or when new work comes in in parallel it doesn't process it all.

But that's always the case: no queue is infinite, so you always have situations where it can drop or delay items.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
Markus Gutschke <mar...@google.com> writes:
> There are a large number of system calls that normal C/C++ code uses quite frequently, and that are not security sensitive. A typical example would be gettimeofday().

At least on x86-64, gettimeofday() (and time(2)) work inside seccomp because they're vsyscalls that run in ring 3 only.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [RFC/PATCH] Block device for the ISS simulator
Benjamin Herrenschmidt <[EMAIL PROTECTED]> writes:
> The ISS simulator is a simple powerpc simulator used among other things for hardware bringup. It implements a simple memory mapped block device interface.
> ...
> --- /dev/null	1970-01-01 00:00:00.0 +
> +++ linux-work/drivers/block/iss_blk.c	2008-09-23 11:12:03.0 +1000
> @@ -0,0 +1,365 @@
> +/*
> + * Simple block device for the ISS simulator
> + */

The first paragraph in your description above should be in this comment.

-Andi
Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
Andrew Morton <[EMAIL PROTECTED]> writes:
> Do we expect that this change will be replicated in other memory-intensive apps? (I do).

The catch with 2MB pages on x86 is that x86 CPUs generally have far fewer 2MB TLB entries than 4K entries. So if you're unlucky and access a lot of mappings you might actually thrash more with them. That is why they are not necessarily a universal win.

-Andi
Re: [PATCH 0/4] 16G huge page support for powerpc
FWIW I turned over the hugepages patchkit to Nick Piggin, so please send all future patches to him.

-Andi
Re: [PATCH 1/4] allow arch specific function for allocating gigantic pages
Haven't reviewed it in detail, just noticed something.

> @@ -614,6 +610,7 @@ static int __init hugetlb_init(void)
>  {
>  	if (HPAGE_SHIFT == 0)
>  		return 0;
> +	INIT_LIST_HEAD(&huge_boot_pages);
>  	return hugetlb_init_hstate(&global_hstate);

I don't think adding the INIT_LIST_HEAD here is correct. There can be huge pages added by the __setup handlers before hugetlb_init runs.

-Andi
Re: [PATCH 1/3] Fix Unlikely(x) == y
On Tue, Feb 19, 2008 at 08:46:46PM +1100, Nick Piggin wrote:

> > > I actually once measured context switching performance in the
> > > scheduler, and removing the unlikely hint for testing RT tasks IIRC
> > > gave about 5% performance drop.
> >
> > OT: what benchmarks did you use for that? I had a change some time ago
> > to the CFS scheduler to avoid unpredicted indirect calls for the common
> > case, but I wasn't able to benchmark a difference with the usual suspect
> > benchmark (lmbench). Since it increased code size by a few bytes it was
> > rejected then.
>
> I think it was just a simple context switch benchmark, but not lmbench
> (which I found to be a bit too variable). But it was a long time ago...

Do you still have it? I thought about writing my own but ended up being too lazy for that :)

> > > However, the P4's branch predictor is pretty good, and it should easily
> >
> > I think it depends on the generation. Prescott class branch prediction
> > should be much better than the earlier ones.
>
> I was using a Nocona Xeon, which I think is a Prescott class?

Yes.

> And don't they have much higher mispredict penalty (than older P4s)?

They do have a longer pipeline, so yes more penalty (5 or 6 stages more iirc), but also a lot better branch predictor which makes up for that.

> > > Actually one thing I don't like about gcc is that I think it still
> > > emits cmovs for likely/unlikely branches,
> >
> > That's -Os.
>
> And -O2 and -O3, on the gccs that I'm using, AFAIKS.

Well if it still happens on gcc 4.2 with P4 tuning you should perhaps open a gcc PR. They tend to ignore these bugs mostly in my experience, but sometimes they act on them.

> > > which is silly (the gcc developers
> >
> > It depends on the CPU. e.g. on K8 and P6 using CMOV if possible makes
> > sense. P4 doesn't like it though.
>
> If the branch is completely predictable (eg. annotated), then I think
> branches should be used anyway. Even on well predicted branches, cmov is
> similar speed on microbenchmarks, but it will increase data hazards I
> think, so it will probably be worse for some real world situations.

At least the respective optimization manuals say they should be used. I presume they only made this recommendation after some extensive benchmarking.

> the quite good numbers that cold CPU predictors can attain. However for
> really performance critical code (or really never executed code), then I
> think it is OK to have the hints and not have to rely on gcc heuristics.

But only when the explicit hints are different from what the implicit branch predictors would predict anyways. And if you look at the heuristics that is not often the case...

> But a likely branch will be _strongly_ predicted to be taken, whereas a
> lot of the gcc heuristics simply have slightly more or slightly less
> probability. So it's not just a question of which way is more likely,
> but also _how_ likely it is to go that way.

Yes, but a lot of the heuristics are pretty strong (80%) and gcc will act on them unless it has a very strong contra cue. And that should normally not be the case.

-Andi
Re: [PATCH 1/3] Fix Unlikely(x) == y
Arjan van de Ven [EMAIL PROTECTED] writes:

> you have more faith in the author's knowledge of how his code actually
> behaves than I think is warranted :)

iirc there was a mm patch some time ago to keep track of the actual unlikely values at runtime and it showed indeed some wrong ones. But the far majority of them are probably correct.

> Or faith in that he knows what unlikely means. I should write docs
> about this; but unlikely() means:
> 1) It happens less than 0.01% of the cases.
> 2) The compiler couldn't have figured this out by itself (NULL pointer
>    checks are compiler done already, same for some other conditions)
> 3) It's a hot codepath where shaving 0.5 cycles (less even on x86)
>    matters (and the author is ok with taking a 500 cycles hit if he's
>    wrong)

One more thing unlikely() does is to move the unlikely code out of line. So it should conserve some icache in critical functions, which might well be worth some more cycles (don't have numbers though).

But overall I agree with you that unlikely is in most cases a bad idea (and I submitted the original patch introducing it :)). That is because it is often used in situations where gcc's default branch prediction heuristics would make exactly the same decision.

if (unlikely(x == NULL)) is simply totally useless because gcc already assumes all x == NULL tests are unlikely.

I appended some of the builtin heuristics from a recent gcc source so people can see them. Note in particular the last predictors: branches ending with goto, containing a call, causing early function return, or returning a negative constant are assumed not taken. Just these alone likely cover 95+% of the unlikelies in the kernel.

-Andi

/* Use number of loop iterations determined by # of iterations analysis
   to set probability.  We don't want to use Dempster-Shaffer theory
   here, as the predictions is exact.  */
DEF_PREDICTOR (PRED_LOOP_ITERATIONS, "loop iterations", PROB_ALWAYS,
	       PRED_FLAG_FIRST_MATCH)

/* Hints dropped by user via __builtin_expect feature.  */
DEF_PREDICTOR (PRED_BUILTIN_EXPECT, "__builtin_expect", PROB_VERY_LIKELY,
	       PRED_FLAG_FIRST_MATCH)

/* Use number of loop iterations guessed by the contents of the loop.  */
DEF_PREDICTOR (PRED_LOOP_ITERATIONS_GUESSED, "guessed loop iterations",
	       PROB_ALWAYS, PRED_FLAG_FIRST_MATCH)

/* Branch containing goto is probably not taken.  */
DEF_PREDICTOR (PRED_CONTINUE, "continue", HITRATE (56), 0)

/* Branch to basic block containing call marked by noreturn attribute.  */
DEF_PREDICTOR (PRED_NORETURN, "noreturn call", HITRATE (99),
	       PRED_FLAG_FIRST_MATCH)

/* Branch to basic block containing call marked by cold function attribute.  */
DEF_PREDICTOR (PRED_COLD_FUNCTION, "cold function call", HITRATE (99),
	       PRED_FLAG_FIRST_MATCH)

/* Loopback edge is taken.  */
DEF_PREDICTOR (PRED_LOOP_BRANCH, "loop branch", HITRATE (86),
	       PRED_FLAG_FIRST_MATCH)

/* Edge causing loop to terminate is probably not taken.  */
DEF_PREDICTOR (PRED_LOOP_EXIT, "loop exit", HITRATE (91),
	       PRED_FLAG_FIRST_MATCH)

/* Pointers are usually not NULL.  */
DEF_PREDICTOR (PRED_POINTER, "pointer", HITRATE (85), 0)
DEF_PREDICTOR (PRED_TREE_POINTER, "pointer (on trees)", HITRATE (85), 0)

/* NE is probable, EQ not etc...  */
DEF_PREDICTOR (PRED_OPCODE_POSITIVE, "opcode values positive", HITRATE (79), 0)
DEF_PREDICTOR (PRED_OPCODE_NONEQUAL, "opcode values nonequal", HITRATE (71), 0)
DEF_PREDICTOR (PRED_FPOPCODE, "fp_opcode", HITRATE (90), 0)
DEF_PREDICTOR (PRED_TREE_OPCODE_POSITIVE, "opcode values positive (on trees)",
	       HITRATE (70), 0)
DEF_PREDICTOR (PRED_TREE_OPCODE_NONEQUAL, "opcode values nonequal (on trees)",
	       HITRATE (69), 0)
DEF_PREDICTOR (PRED_TREE_FPOPCODE, "fp_opcode (on trees)", HITRATE (90), 0)

/* Branch guarding call is probably taken.  */
DEF_PREDICTOR (PRED_CALL, "call", HITRATE (69), 0)

/* Branch causing function to terminate is probably not taken.  */
DEF_PREDICTOR (PRED_TREE_EARLY_RETURN, "early return (on trees)",
	       HITRATE (54), 0)

/* Branch containing goto is probably not taken.  */
DEF_PREDICTOR (PRED_GOTO, "goto", HITRATE (70), 0)

/* Branch ending with return constant is probably not taken.  */
DEF_PREDICTOR (PRED_CONST_RETURN, "const return", HITRATE (67), 0)

/* Branch ending with return negative constant is probably not taken.  */
DEF_PREDICTOR (PRED_NEGATIVE_RETURN, "negative return", HITRATE (96), 0)

/* Branch ending with return; is probably not taken */
DEF_PREDICTOR (PRED_NULL_RETURN, "null return", HITRATE (96), 0)

/* Branches to a mudflap bounds check are extremely unlikely.  */
DEF_PREDICTOR
Re: [PATCH 00/10] x86: Reduce Memory Usage and Inter-Node message traffic (v3)
On Wednesday 12 September 2007 03:56, [EMAIL PROTECTED] wrote:

> Note: This patch consolidates all the previous patches regarding the
> conversion of static arrays sized by NR_CPUS into per_cpu data arrays
> and is referenced against 2.6.23-rc6.

Looks good to me from the x86 side. I'll leave it to Andrew to handle for now though because it touches too many files outside x86.

-Andi
Re: Too verbose compat_ioctl messages?
H. Peter Anvin [EMAIL PROTECTED] writes:

> For one thing, it looks like we're returning the wrong thing (EINVAL
> rather than ENOTTY) across the board. This was unfortunately a common
> misunderstanding with non-tty-related ioctls in the early days of Linux.

ENOTTY is so excessively misnamed that it is actually surprising anybody ever got that right (for very small values of "right", I guess).

-Andi