Re: [v5] powerpc/powernv: Add poweroff (EPOW, DPO) events support for PowerNV platform
Hi Michael, thanks for the review. Responses below.

On 06/03/2015 10:43 AM, Michael Ellerman wrote:
> On Mon, 18 May 2015 at 15:18:04 UTC, Vipin K Parashar wrote:
>> This patch adds support for FSP EPOW (Early Power Off Warning) and
> Please spell out the acronyms the first time you use them, including FSP.

Will do.

>> DPO (Delayed Power Off) events for PowerNV platform. EPOW events are
>                                ^ the PowerNV platform.

Will edit.

>> generated by SPCN/FSP due to various critical system conditions that
> SPCN?

Will remove SPCN. FSP should be sufficient.

>> need system shutdown. Few examples of these conditions are high
>       ^ s/need/require/ ?  ^ A few

Agreed.

>> ambient temperature or system running on UPS power with low UPS
>> battery. DPO event is generated in response to admin initiated
>> system request.
> Blank line between paragraphs please.

Sure.

>> Upon receipt of EPOW and DPO events host kernel invokes
>                                    ^ the host kernel

Will edit.

>> orderly_poweroff for performing graceful system shutdown.
> I like it if you spell functions with a trailing () to make it clear they are functions, so this would be orderly_poweroff().

Agreed.

>> System admin can also add systemd service shutdown scripts to perform
>> any specific actions like graceful guest shutdown upon system poweroff.
>> libvirt-guests is a systemd service available on recent distros for
>> management of guests at system start/shutdown time.
> This last part about the scripts is not relevant to the kernel patch so just leave it out please.

Agreed.
Signed-off-by: Vipin K Parashar vi...@linux.vnet.ibm.com
Reviewed-by: Joel Stanley j...@jms.id.au
Reviewed-by: Vaibhav Jain vaib...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/opal-api.h            | 44
 arch/powerpc/include/asm/opal.h                |  3 +-
 arch/powerpc/platforms/powernv/opal-power.c    | 147 ++---
 arch/powerpc/platforms/powernv/opal-wrappers.S |  1 +
 4 files changed, 179 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 0321a90..90fa364 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -355,6 +355,10 @@ enum opal_msg_type {
 	OPAL_MSG_TYPE_MAX,
 };

+/* OPAL_MSG_SHUTDOWN parameter values */
+#define SOFT_OFF	0x00
+#define SOFT_REBOOT	0x01

> I don't see this in the skiboot version of opal-api.h? They should be kept in sync. If it's a Linux-only define it should go in opal.h.

Agreed. Won't add these definitions to opal-api.h as they are not present in the skiboot version of opal-api.h.

 struct opal_msg {
 	__be32 msg_type;
 	__be32 reserved;
@@ -730,6 +734,46 @@ struct opal_i2c_request {
 	__be64 buffer_ra;	/* Buffer real address */
 };

+/*
+ * EPOW status sharing (OPAL and the host)
+ *
+ * The host will pass on OPAL, a buffer of length OPAL_SYSEPOW_MAX
+ * with individual elements being 16 bits wide to fetch the system
+ * wide EPOW status. Each element in the buffer will contain the
+ * EPOW status in its bit representation for a particular EPOW sub
+ * class as defined here. So multiple detailed EPOW status bits
+ * specific for any sub class can be represented in a single buffer
+ * element as its bit representation.
+ */
+
+/* System EPOW type */
+enum OpalSysEpow {
+	OPAL_SYSEPOW_POWER	= 0,	/* Power EPOW */
+	OPAL_SYSEPOW_TEMP	= 1,	/* Temperature EPOW */
+	OPAL_SYSEPOW_COOLING	= 2,	/* Cooling EPOW */
+	OPAL_SYSEPOW_MAX	= 3,	/* Max EPOW categories */
+};
+
+/* Power EPOW */
+enum OpalSysPower {
+	OPAL_SYSPOWER_UPS	= 0x0001, /* System on UPS power */
+	OPAL_SYSPOWER_CHNG	= 0x0002, /* System power config change */
+	OPAL_SYSPOWER_FAIL	= 0x0004, /* System impending power failure */
+	OPAL_SYSPOWER_INCL	= 0x0008, /* System incomplete power */
+};
+
+/* Temperature EPOW */
+enum OpalSysTemp {
+	OPAL_SYSTEMP_AMB	= 0x0001, /* System over ambient temperature */
+	OPAL_SYSTEMP_INT	= 0x0002, /* System over internal temperature */
+	OPAL_SYSTEMP_HMD	= 0x0004, /* System over ambient humidity */
+};
+
+/* Cooling EPOW */
+enum OpalSysCooling {
+	OPAL_SYSCOOL_INSF	= 0x0001, /* System insufficient cooling */
+};

> I don't see the last three of these enums used at all, so please drop them.

OPAL_SYSPOWER_CHNG / FAIL / INCL, OPAL_SYSTEMP_HMD and OPAL_SYSCOOL_INSF enums aren't used here but they are part of the skiboot version of opal-api.h and thus need to be retained. PKVM2.1 uses these enums and thus they can't be removed from the skiboot opal-api.h.

 #endif /* __ASSEMBLY__ */
 #endif /* __OPAL_API_H */
diff
Re: [PATCH v13 11/14] perf, tools: Support long descriptions with perf list -v
On Tue, Jun 02, 2015 at 10:12:11AM -0700, Sukadev Bhattiprolu wrote: From: Andi Kleen a...@linux.intel.com Previously we were dropping the useful longer descriptions that some events have in the event list completely. This patch makes them appear with perf list. Old perf list: baclears: baclears.all [Counts the number of baclears] vs new: perf list -v: ... baclears: baclears.all [The BACLEARS event counts the number of times the front end is resteered, mainly when the Branch Prediction Unit cannot provide a correct prediction and this is corrected by the Branch Address Calculator at the front end. The BACLEARS.ANY event counts the number of baclears for any type of branch] Signed-off-by: Andi Kleen a...@linux.intel.com Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com --- tools/perf/builtin-list.c |9 ++--- tools/perf/pmu-events/jevents.c| 29 - tools/perf/pmu-events/jevents.h|2 +- tools/perf/pmu-events/pmu-events.h |1 + tools/perf/util/parse-events.c |4 ++-- tools/perf/util/parse-events.h |2 +- tools/perf/util/pmu.c | 17 - tools/perf/util/pmu.h |4 +++- 8 files changed, 46 insertions(+), 22 deletions(-) I think this change should be split into: - jevents update of parsing out PublicDescription tag - alias support for long_desc - perf list update jirka ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH kernel v11 26/34] powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window
On 06/02/2015 09:30 AM, David Gibson wrote: On Fri, May 29, 2015 at 06:44:50PM +1000, Alexey Kardashevskiy wrote: This is a part of moving DMA window programming to an iommu_ops callback. pnv_pci_ioda2_set_window() takes an iommu_table_group as a first parameter (not pnv_ioda_pe) as it is going to be used as a callback for VFIO DDW code. This adds pnv_pci_ioda2_tvt_invalidate() to invalidate TVT as it is I'm assuming that's what's now called pnv_pci_ioda2_invalidate_entire()? Yes, my bad... And the patch is not adding it at all... a good thing to do. It does not have immediate effect now as the table is never recreated after reboot but it will in the following patches. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Reviewed-by: David Gibson da...@gibson.dropbear.id.au Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com --- Changes: v11: * replaced some 1it_page_shift with IOMMU_PAGE_SIZE() macro v9: * initialize pe-table_group.tables[0] at the very end when tbl is fully initialized * moved pnv_pci_ioda2_tvt_invalidate() from earlier patch --- arch/powerpc/platforms/powernv/pci-ioda.c | 47 +-- 1 file changed, 38 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 3d29fe3..fda01c1 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1968,6 +1968,43 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, } } +static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group, + int num, struct iommu_table *tbl) +{ + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe, + table_group); + struct pnv_phb *phb = pe-phb; + int64_t rc; + const __u64 start_addr = tbl-it_offset tbl-it_page_shift; + const __u64 win_size = tbl-it_size tbl-it_page_shift; + + pe_info(pe, Setting up window %llx..%llx pg=%x\n, + start_addr, start_addr + win_size - 1, + IOMMU_PAGE_SIZE(tbl)); 
+ + /* +* Map TCE table through TVT. The TVE index is the PE number +* shifted by 1 bit for 32-bits DMA space. +*/ + rc = opal_pci_map_pe_dma_window(phb-opal_id, + pe-pe_number, + pe-pe_number 1, + 1, + __pa(tbl-it_base), + tbl-it_size 3, + IOMMU_PAGE_SIZE(tbl)); + if (rc) { + pe_err(pe, Failed to configure TCE table, err %ld\n, rc); + return rc; + } + + pnv_pci_link_table_and_group(phb-hose-node, num, + tbl, pe-table_group); + pnv_pci_ioda2_tce_invalidate_entire(pe); + + return 0; +} + static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable) { uint16_t window_id = (pe-pe_number 1 ) + 1; @@ -2123,21 +2160,13 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, pe-table_group.ops = pnv_pci_ioda2_ops; #endif - /* -* Map TCE table through TVT. The TVE index is the PE number -* shifted by 1 bit for 32-bits DMA space. -*/ - rc = opal_pci_map_pe_dma_window(phb-opal_id, pe-pe_number, - pe-pe_number 1, 1, __pa(tbl-it_base), - tbl-it_size 3, 1ULL tbl-it_page_shift); + rc = pnv_pci_ioda2_set_window(pe-table_group, 0, tbl); if (rc) { pe_err(pe, Failed to configure 32-bit TCE table, err %ld\n, rc); goto fail; } - pnv_pci_ioda2_tce_invalidate_entire(pe); - /* OPAL variant of PHB3 invalidated TCEs */ if (phb-ioda.tce_inval_reg) tbl-it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE); -- Alexey ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v13 04/14] perf, tools: Allow events with dot
On Tue, Jun 02, 2015 at 10:12:04AM -0700, Sukadev Bhattiprolu wrote: From: Andi Kleen a...@linux.intel.com The Intel events use a dot to separate event name and unit mask. Allow dot in names in the scanner, and remove special handling of dot as EOF. Also remove the hack in jevents to replace dot with underscore. This way dotted events can be specified directly by the user. I'm not fully sure this change to the scanner is correct (what was the dot special case good for?), but I haven't found anything that breaks with it so far at least. can't see anything either Robert, does it ring a bell? seems like you introduced it ;-) thanks, jirka V2: Add the dot to name too, to handle events outside cpu// Acked-by: Namhyung Kim namhy...@kernel.org Signed-off-by: Andi Kleen a...@linux.intel.com Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com --- tools/perf/util/parse-events.l |5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l index 09e738f..13cef3c 100644 --- a/tools/perf/util/parse-events.l +++ b/tools/perf/util/parse-events.l @@ -119,8 +119,8 @@ event [^,{}/]+ num_dec [0-9]+ num_hex 0x[a-fA-F0-9]+ num_raw_hex [a-fA-F0-9]+ -name [a-zA-Z_*?][a-zA-Z0-9_*?]* -name_minus [a-zA-Z_*?][a-zA-Z0-9\-_*?]* +name [a-zA-Z_*?][a-zA-Z0-9_*?.]* +name_minus [a-zA-Z_*?][a-zA-Z0-9\-_*?.]* /* If you add a modifier you need to update check_modifier() */ modifier_event [ukhpGHSDI]+ modifier_bp [rwx]{1,3} @@ -165,7 +165,6 @@ modifier_bp [rwx]{1,3} return PE_EVENT_NAME; } -.| EOF { BEGIN(INITIAL); REWIND(0); -- 1.7.9.5 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v13 06/14] perf, tools: Support alias descriptions
On Tue, Jun 02, 2015 at 10:12:06AM -0700, Sukadev Bhattiprolu wrote: SNIP @@ -1033,37 +1064,49 @@ void print_pmu_events(const char *event_glob, bool name_only) event_glob continue; - if (is_cpu !name_only) + if (is_cpu !name_only !alias-desc) name = format_alias_or(buf, sizeof(buf), pmu, alias); - aliases[j] = strdup(name); - if (aliases[j] == NULL) - goto out_enomem; + aliases[j].name = name; + if (is_cpu !name_only !alias-desc) + aliases[j].name = format_alias_or(buf, sizeof(buf), + pmu, alias); + aliases[j].name = strdup(aliases[j].name); + /* failure harmless */ yea but we still try to care everywhere.. ;-) we would print NULL for name in the code below right? please keep the above pattern: if (aliases[j].name == NULL) goto out_enomem; + aliases[j].desc = alias-desc; j++; } if (pmu-selectable) { char *s; if (asprintf(s, %s//, pmu-name) 0) goto out_enomem; - aliases[j] = s; + aliases[j].name = s; j++; } } len = j; - qsort(aliases, len, sizeof(char *), cmp_string); + qsort(aliases, len, sizeof(struct pair), cmp_pair); for (j = 0; j len; j++) { if (name_only) { - printf(%s , aliases[j]); + printf(%s , aliases[j].name); continue; } - printf( %-50s [Kernel PMU event]\n, aliases[j]); + if (aliases[j].desc) { + if (numdesc++ == 0) + printf(\n); + printf( %-50s\n, aliases[j].name); + printf(%*s, 8, [); + wordwrap(aliases[j].desc, 8, columns, 0); + printf(]\n); + } else + printf( %-50s [Kernel PMU event]\n, aliases[j].name); printed++; SNIP ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v13 03/14] perf, tools: Use pmu_events_map table to create event aliases
On Tue, Jun 02, 2015 at 10:12:03AM -0700, Sukadev Bhattiprolu wrote:

SNIP

+/*
+ * Return the CPU id as a raw string.
+ *
+ * Each architecture should provide a more precise id string that
+ * can be used to match the architecture's mapfile.
+ */
+char *__attribute__((weak)) get_cpuid_str(void)

we have the '__weak' define in the linux/compiler.h include

jirka
Re: [PATCH v13 03/14] perf, tools: Use pmu_events_map table to create event aliases
On Tue, Jun 02, 2015 at 10:12:03AM -0700, Sukadev Bhattiprolu wrote:

At run time (i.e. when perf is starting up), locate the specific events table for the current CPU and create event aliases for each of the events. Use these aliases to parse the user's specified perf event.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com

Changelog[v3]
	[Jiri Olsa] Fix a memory leak with cpuid.
Changelog[v2]
	[Andi Kleen] Replace the pmu_events_map->vfm field with a simple generic cpuid string and use that string to find the matching mapfile entry.
---
 tools/perf/arch/powerpc/util/header.c |  11
 tools/perf/util/header.h              |   3 +-
 tools/perf/util/pmu.c                 | 104 -
 3 files changed, 104 insertions(+), 14 deletions(-)

I think this patch is doing too much, it should be split into 3 pieces:
- introduce get_cpuid_str for powerpc
- introducing __perf_pmu__new_alias/perf_pmu__new_alias functions split
- adding pmu_add_cpu_aliases functionality

jirka
Re: [PATCH v13 02/14] perf, tools, jevents: Program to convert JSON file to C style file
On Tue, Jun 02, 2015 at 10:12:02AM -0700, Sukadev Bhattiprolu wrote: SNIP + +static int process_mapfile(FILE *outfp, char *fpath) +{ + int n = 16384; + FILE *mapfp; + char *save; + char *line, *p; + int line_num; + char *tblname; + + pr_info(%s: Processing mapfile %s\n, prog, fpath); SNIP + + cpuid = strtok_r(p, ,, save); + version = strtok_r(NULL, ,, save); + fname = strtok_r(NULL, ,, save); + type = strtok_r(NULL, ,, save); + + tblname = file_name_to_table_name(fname); + fprintf(outfp, {\n); + fprintf(outfp, \t.cpuid = \%s\,\n, cpuid); + fprintf(outfp, \t.version = \%s\,\n, version); + fprintf(outfp, \t.type = \%s\,\n, type); got build failure for make DEBUG=1: CC pmu-events/jevents.o pmu-events/jevents.c: In function ‘process_mapfile’: pmu-events/jevents.c:498:10: error: ‘save’ may be used uninitialized in this function [-Werror=maybe-uninitialized] fprintf(outfp, \t.type = \%s\,\n, type); ^ jirka ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC 07/24] x86/thinkpad_acpi: Use arch_nvram_ops methods instead of nvram_read_byte() and nvram_write_byte()
On Wed, Jun 3, 2015, at 00:34, Darren Hart wrote: On Tue, Jun 02, 2015 at 07:09:28AM -0300, Henrique de Moraes Holschuh wrote: Test results were sent to me privately, and they are correct, so... Finn, unless there is some compelling reason not to - like they are MBs worth of data, please submit these to the list in the future so we have them for reference. After I told him which exact bitmask to use on a T43 to test hotkey_source_mask, his test results can be summarized as I could see no difference in behavior, which is *exactly* what I expected to happen. If anything went wrong with the thinkpad-acpi NVRAM code, you'd notice a very large change in behavior (typical: hotkeys don't work, less typical: random hotkey keypresses, hotkey press bursts, low responsivity of hotkeys). Acked-by: Henrique de Moraes Holschuh h...@hmh.eng.br I'm fine with the changes, but they need to be submitted with the other changes as this one change cannot compile independently in my tree. Finn, please work with whomever is pulling the series to include this in their pull request. Reviewed-by: Darren Hart dvh...@linux.intel.com -- One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie. -- The Silicon Valley Tarot Henrique Holschuh ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v13 02/14] perf, tools, jevents: Program to convert JSON file to C style file
On Tue, Jun 02, 2015 at 10:12:02AM -0700, Sukadev Bhattiprolu wrote: SNIP + +static char *file_name_to_table_name(char *fname) +{ + unsigned int i, j; + int c; + int n = 1024; /* use max variable length? */ I think this should be at least PATH_MAX, or you might actually use asprintf and have all below done within one line or so jirka + char *tblname; + char *p; + + tblname = malloc(n); + if (!tblname) + return NULL; + + p = basename(fname); + + memset(tblname, 0, n); + + /* Ensure table name starts with an alphabetic char */ + strcpy(tblname, pme_); + + n = strlen(fname) + strlen(tblname); + n = min(1024, n); + + for (i = 0, j = strlen(tblname); i strlen(fname); i++, j++) { + c = p[i]; + if (isalnum(c) || c == '_') + tblname[j] = c; + else if (c == '-') + tblname[j] = '_'; + else if (c == '.') { + tblname[j] = '\0'; + break; + } else { + pr_err(%s: Invalid character '%c' in file name %s\n, + prog, c, p); + free(tblname); + return NULL; + } + } + + return tblname; +} SNIP ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v13 03/14] perf, tools: Use pmu_events_map table to create event aliases
On Tue, Jun 02, 2015 at 10:12:03AM -0700, Sukadev Bhattiprolu wrote: SNIP + +/* + * From the pmu_events_map, find the table of PMU events that corresponds + * to the current running CPU. Then, add all PMU events from that table + * as aliases. + */ +static int pmu_add_cpu_aliases(void *data) any reason why the argument is not 'head' directly? jirka +{ + struct list_head *head = (struct list_head *)data; + int i; + struct pmu_events_map *map; + struct pmu_event *pe; + char *cpuid; + + cpuid = get_cpuid_str(); + if (!cpuid) + return 0; + + i = 0; + while (1) { + map = pmu_events_map[i++]; + if (!map-table) { + goto out; + } + + if (!strcmp(map-cpuid, cpuid)) + break; + } + + /* + * Found a matching PMU events table. Create aliases + */ + i = 0; + while (1) { + pe = map-table[i++]; + if (!pe-name) + break; + + /* need type casts to override 'const' */ + __perf_pmu__new_alias(head, (char *)pe-name, NULL, + (char *)pe-desc, (char *)pe-event); + } + +out: + free(cpuid); + return 0; +} + + static struct perf_pmu *pmu_lookup(const char *name) { struct perf_pmu *pmu; @@ -464,6 +540,8 @@ static struct perf_pmu *pmu_lookup(const char *name) if (pmu_aliases(name, aliases)) return NULL; + if (!strcmp(name, cpu)) + (void)pmu_add_cpu_aliases(aliases); if (pmu_type(name, type)) return NULL; -- 1.7.9.5 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC 07/24] x86/thinkpad_acpi: Use arch_nvram_ops methods instead of nvram_read_byte() and nvram_write_byte()
On Tue, 2 Jun 2015, Darren Hart wrote: On Tue, Jun 02, 2015 at 07:09:28AM -0300, Henrique de Moraes Holschuh wrote: Test results were sent to me privately, and they are correct, so... Finn, unless there is some compelling reason not to - like they are MBs worth of data, please submit these to the list in the future so we have them for reference. Sure. Those results were just confirmation that this patch series doesn't affect input events read directly from /dev/input/by-path/platform-thinkpad_acpi-event given the the hotkey_source_mask settings discussed in this thread. Acked-by: Henrique de Moraes Holschuh h...@hmh.eng.br I'm fine with the changes, but they need to be submitted with the other changes as this one change cannot compile independently in my tree. Finn, please work with whomever is pulling the series to include this in their pull request. Right. Reviewed-by: Darren Hart dvh...@linux.intel.com Thanks for your review. -- ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] of: clean-up unnecessary libfdt include paths
On Wed, Jun 03, 2015 at 12:10:25AM -0500, Rob Herring wrote:

With the latest dtc import include fixups, it is no longer necessary to add explicit include paths to use libfdt. Remove these across the kernel.

Signed-off-by: Rob Herring r...@kernel.org
Cc: Ralf Baechle r...@linux-mips.org
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Michael Ellerman m...@ellerman.id.au
Cc: Grant Likely grant.lik...@linaro.org
Cc: linux-m...@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org

For the MIPS bits;

Acked-by: Ralf Baechle r...@linux-mips.org

Ralf
Re: [PATCH V5 12/13] selftests, powerpc: Add thread based stress test for DSCR sysfs interfaces
On 05/21/2015 12:13 PM, Anshuman Khandual wrote: This patch adds a test to update the system wide DSCR value repeatedly and then verifies that any thread on any given CPU on the system must be able to see the same DSCR value whether its is being read through the problem state based SPR or the privilege state based SPR. This test can fail on a system if some kind of cpu hotplug activity is happening when this test is being run at the same time. Then call to sched_setaffinity() might fail as the test does not check for the CPU availability/online every time before changing the affinity of the thread. Here is one changed version of this test which achieves similar test objective. Michael, Please let me know if the patch here would be okay or I need to re-spin the patch series for this change. Thank you. -- [PATCH] selftests, powerpc: Add thread based stress test for DSCR sysfs interfaces This patch adds a test to update the system wide DSCR value repeatedly and then verifies that any thread on any given CPU on the system must be able to see the same DSCR value whether its is being read through the problem state based SPR or the privilege state based SPR. 
Acked-by: Shuah Khan shua...@osg.samsung.com Signed-off-by: Anshuman Khandual khand...@linux.vnet.ibm.com --- tools/testing/selftests/powerpc/dscr/Makefile | 2 +- .../powerpc/dscr/dscr_sysfs_thread_test.c | 81 ++ 2 files changed, 82 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/powerpc/dscr/dscr_sysfs_thread_test.c diff --git a/tools/testing/selftests/powerpc/dscr/Makefile b/tools/testing/selftests/powerpc/dscr/Makefile index fada526..834ef88 100644 --- a/tools/testing/selftests/powerpc/dscr/Makefile +++ b/tools/testing/selftests/powerpc/dscr/Makefile @@ -1,6 +1,6 @@ PROGS := dscr_default_test dscr_explicit_test dscr_user_test \ dscr_inherit_test dscr_inherit_exec_test \ -dscr_sysfs_test +dscr_sysfs_test dscr_sysfs_thread_test CFLAGS := $(CFLAGS) -lpthread diff --git a/tools/testing/selftests/powerpc/dscr/dscr_sysfs_thread_test.c b/tools/testing/selftests/powerpc/dscr/dscr_sysfs_thread_test.c new file mode 100644 index 000..9671d52 --- /dev/null +++ b/tools/testing/selftests/powerpc/dscr/dscr_sysfs_thread_test.c @@ -0,0 +1,81 @@ +/* + * POWER Data Stream Control Register (DSCR) sysfs thread test + * + * This test updates the system wide DSCR default value through + * sysfs interface which should then update all the CPU specific + * DSCR default values which must also be then visible to threads + * executing on individual CPUs on the system. + * + * Copyright (C) 2015 Anshuman Khandual khand...@linux.vnet.ibm.com, IBM + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. 
+ */ +#define _GNU_SOURCE +#include dscr.h + +static int test_thread_dscr(unsigned long val) +{ + unsigned long cur_dscr, cur_dscr_usr; + + cur_dscr = get_dscr(); + cur_dscr_usr = get_dscr_usr(); + + if (val != cur_dscr) { + printf([cpu %d] Kernel DSCR should be %ld but is %ld\n, + sched_getcpu(), val, cur_dscr); + return 1; + } + + if (val != cur_dscr_usr) { + printf([cpu %d] User DSCR should be %ld but is %ld\n, + sched_getcpu(), val, cur_dscr_usr); + return 1; + } + return 0; +} + +static int check_cpu_dscr_thread(unsigned long val) +{ + cpu_set_t mask; + int cpu; + + for (cpu = 0; cpu CPU_SETSIZE; cpu++) { + CPU_ZERO(mask); + CPU_SET(cpu, mask); + if (sched_setaffinity(0, sizeof(mask), mask)) + continue; + + if (test_thread_dscr(val)) + return 1; + } + return 0; + +} + +int dscr_sysfs_thread(void) +{ + unsigned long orig_dscr_default; + int i, j; + + orig_dscr_default = get_default_dscr(); + for (i = 0; i COUNT; i++) { + for (j = 0; j DSCR_MAX; j++) { + set_default_dscr(j); + if (check_cpu_dscr_thread(j)) + goto fail; + } + } + set_default_dscr(orig_dscr_default); + return 0; +fail: + set_default_dscr(orig_dscr_default); + return 1; +} + +int main(int argc, char *argv[]) +{ + return test_harness(dscr_sysfs_thread, dscr_sysfs_thread_test); +} -- 2.1.0 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v13 03/14] perf, tools: Use pmu_events_map table to create event aliases
On Tue, Jun 02, 2015 at 10:12:03AM -0700, Sukadev Bhattiprolu wrote: SNIP @@ -225,26 +221,47 @@ static int perf_pmu__new_alias(struct list_head *list, char *dir, char *name, FI alias-unit[0] = '\0'; alias-per_pkg = false; - ret = parse_events_terms(alias-terms, buf); + ret = parse_events_terms(alias-terms, val); if (ret) { + pr_err(Cannot parse alias %s: %d\n, val, ret); free(alias); return ret; } alias-name = strdup(name); + if (dir) { + /* + * load unit name and scale if available + */ + perf_pmu__parse_unit(alias, dir, name); + perf_pmu__parse_scale(alias, dir, name); + perf_pmu__parse_per_pkg(alias, dir, name); + perf_pmu__parse_snapshot(alias, dir, name); + } + /* - * load unit name and scale if available + * TODO: pickup description from Andi's patchset */ - perf_pmu__parse_unit(alias, dir, name); - perf_pmu__parse_scale(alias, dir, name); - perf_pmu__parse_per_pkg(alias, dir, name); - perf_pmu__parse_snapshot(alias, dir, name); + //alias-desc = desc ? strdpu(desc) : NULL; please remove the TODO line and above commented code, it is addressed later in this patchset jirka ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v13 02/14] perf, tools, jevents: Program to convert JSON file to C style file
On Tue, Jun 02, 2015 at 10:12:02AM -0700, Sukadev Bhattiprolu wrote: SNIP + * If we fail to locate/process JSON and map files, create a NULL mapping + * table. This would at least allow perf to build even if we can't find/use + * the aliases. + */ +static void create_empty_mapping(const char *output_file) +{ + FILE *outfp; + + pr_info(%s: Creating empty pmu_events_map[] table\n, prog); + + /* Unlink file to clear any partial writes to it */ + unlink(output_file); + + outfp = fopen(output_file, a); you could open with w+ and save the unlink call SNIP +int main(int argc, char *argv[]) +{ + int rc; + int flags; + int maxfds; + char dirname[PATH_MAX]; + + const char *arch; + const char *output_file; + const char *start_dirname; + + prog = basename(argv[0]); + if (argc 4) { + pr_err(Usage: %s arch starting_dir output_file\n, prog); + return 1; + } + + arch = argv[1]; + start_dirname = argv[2]; + output_file = argv[3]; + + if (argc 4) + verbose = atoi(argv[4]); + + unlink(output_file); + eventsfp = fopen(output_file, a); ditto SNIP ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v13 12/14] perf, tools: Add support for event list topics
On Tue, Jun 02, 2015 at 12:16:41PM -0700, Sukadev Bhattiprolu wrote: SNIP [Speculative and retired macro-conditional branches] br_inst_exec.all_direct_jmp [Speculative and retired macro-unconditional branches excluding calls and indirects] br_inst_exec.all_direct_near_call [Speculative and retired direct near calls] br_inst_exec.all_indirect_jump_non_call_ret Signed-off-by: Andi Kleen a...@linux.intel.com Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com Changelog[v2] Dropped an unnecessary patch before this and fixed resulting conflicts in tools/perf/util/pmu.c --- tools/perf/pmu-events/jevents.c| 16 +++- tools/perf/pmu-events/jevents.h| 3 ++- tools/perf/pmu-events/pmu-events.h | 1 + tools/perf/util/pmu.c | 34 -- tools/perf/util/pmu.h | 1 + 5 files changed, 39 insertions(+), 16 deletions(-) please split at least the jevents Topic parsing from the rest idelay also the alias update and the display change jirka ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v13 13/14] perf, tools: Handle header line in mapfile
On Tue, Jun 02, 2015 at 10:12:13AM -0700, Sukadev Bhattiprolu wrote: From: Andi Kleen a...@linux.intel.com Support a header line in the mapfile.csv, to match the existing mapfiles 'Suport' means 'skip' in here jirka Signed-off-by: Andi Kleen a...@linux.intel.com Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com Changelog[v2] All architectures may not use the Family to identify. So, assume first line is header. --- tools/perf/pmu-events/jevents.c |9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/tools/perf/pmu-events/jevents.c b/tools/perf/pmu-events/jevents.c index 14707fb..8d365f2 100644 --- a/tools/perf/pmu-events/jevents.c +++ b/tools/perf/pmu-events/jevents.c @@ -461,7 +461,12 @@ static int process_mapfile(FILE *outfp, char *fpath) print_mapping_table_prefix(outfp); - line_num = 0; + /* Skip first line (header) */ + p = fgets(line, n, mapfp); + if (!p) + goto out; + + line_num = 1; while (1) { char *cpuid, *version, *type, *fname; @@ -505,8 +510,8 @@ static int process_mapfile(FILE *outfp, char *fpath) fprintf(outfp, },\n); } +out: print_mapping_table_suffix(outfp); - return 0; } -- 1.7.9.5 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH kernel v11 09/34] vfio: powerpc/spapr: Move locked_vm accounting to helpers
On 06/01/2015 02:28 PM, David Gibson wrote: On Fri, May 29, 2015 at 06:44:33PM +1000, Alexey Kardashevskiy wrote: There moves locked pages accounting to helpers. Later they will be reused for Dynamic DMA windows (DDW). This reworks debug messages to show the current value and the limit. This stores the locked pages number in the container so when unlocking the iommu table pointer won't be needed. This does not have an effect now but it will with the multiple tables per container as then we will allow attaching/detaching groups on fly and we may end up having a container with no group attached but with the counter incremented. While we are here, update the comment explaining why RLIMIT_MEMLOCK might be required to be bigger than the guest RAM. This also prints pid of the current process in pr_warn/pr_debug. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru [aw: for the vfio related changes] Acked-by: Alex Williamson alex.william...@redhat.com Reviewed-by: David Gibson da...@gibson.dropbear.id.au Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com --- Changes: v4: * new helpers do nothing if @npages == 0 * tce_iommu_disable() now can decrement the counter if the group was detached (not possible now but will be in the future) --- drivers/vfio/vfio_iommu_spapr_tce.c | 82 - 1 file changed, 63 insertions(+), 19 deletions(-) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 64300cc..40583f9 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -29,6 +29,51 @@ static void tce_iommu_detach_group(void *iommu_data, struct iommu_group *iommu_group); +static long try_increment_locked_vm(long npages) +{ + long ret = 0, locked, lock_limit; + + if (!current || !current-mm) + return -ESRCH; /* process exited */ + + if (!npages) + return 0; + + down_write(current-mm-mmap_sem); + locked = current-mm-locked_vm + npages; Is there a possibility of userspace triggering an integer overflow here, if npages is 
really huge? I do not see how. I just do not accept npages bigger than the host RAM size in pages. And it is long. For (let's say) a 128GB host, the number of 4KB pages is (128<<30)/4096 = 33554432. + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) + ret = -ENOMEM; + else + current->mm->locked_vm += npages; + + pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid, + npages << PAGE_SHIFT, + current->mm->locked_vm << PAGE_SHIFT, + rlimit(RLIMIT_MEMLOCK), + ret ? " - exceeded" : ""); + + up_write(&current->mm->mmap_sem); + + return ret; +} + +static void decrement_locked_vm(long npages) +{ + if (!current || !current->mm || !npages) + return; /* process exited */ + + down_write(&current->mm->mmap_sem); + if (npages > current->mm->locked_vm) + npages = current->mm->locked_vm; Can this case ever occur (without there being a leak bug somewhere else in the code)? It should not. Safety measure. Having a warning here might make sense but I believe if this happens, there will be many, many warnings in other places :) + current->mm->locked_vm -= npages; + pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid, + npages << PAGE_SHIFT, + current->mm->locked_vm << PAGE_SHIFT, + rlimit(RLIMIT_MEMLOCK)); + up_write(&current->mm->mmap_sem); +} + /* * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation * @@ -45,6 +90,7 @@ struct tce_container { struct mutex lock; struct iommu_table *tbl; bool enabled; + unsigned long locked_pages; }; static bool tce_page_is_contained(struct page *page, unsigned page_shift) @@ -60,7 +106,7 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift) static int tce_iommu_enable(struct tce_container *container) { int ret = 0; - unsigned long locked, lock_limit, npages; + unsigned long locked; struct iommu_table *tbl = container->tbl; if (!container->tbl) @@ -89,21 +135,22 @@ static int tce_iommu_enable(struct tce_container *container) * Also we don't have a nice way to fail on H_PUT_TCE due to ulimits, * that would effectively kill
the guest at random points, much better * enforcing the limit based on the max that the guest can map. +* +* Unfortunately at the moment it counts whole tables, no matter how +* much memory the guest has. I.e. for 4GB guest and 4 IOMMU groups +* each with 2GB DMA window, 8GB will be counted here. The reason for +* this is that we cannot tell here the amount of RAM used by the guest +* as this
Re: [PATCH kernel v11 27/34] powerpc/powernv: Implement multilevel TCE tables
On 06/02/2015 09:50 AM, David Gibson wrote: On Fri, May 29, 2015 at 06:44:51PM +1000, Alexey Kardashevskiy wrote: TCE tables might get too big in case of 4K IOMMU pages and DDW enabled on huge guests (hundreds of GB of RAM) so the kernel might be unable to allocate contiguous chunk of physical memory to store the TCE table. To address this, POWER8 CPU (actually, IODA2) supports multi-level TCE tables, up to 5 levels which splits the table into a tree of smaller subtables. This adds multi-level TCE tables support to pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages() helpers. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: v10: * fixed multiple comments received for v9 v9: * moved from ioda2 to common powernv pci code * fixed cleanup if allocation fails in a middle * removed check for the size - all boundary checks happen in the calling code anyway --- arch/powerpc/include/asm/iommu.h | 2 + arch/powerpc/platforms/powernv/pci-ioda.c | 98 --- arch/powerpc/platforms/powernv/pci.c | 13 3 files changed, 104 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 4636734..706cfc0 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -96,6 +96,8 @@ struct iommu_pool { struct iommu_table { unsigned long it_busno; /* Bus number this table belongs to */ unsigned long it_size; /* Size of iommu table in entries */ + unsigned long it_indirect_levels; + unsigned long it_level_size; unsigned long it_offset;/* Offset into global table */ unsigned long it_base; /* mapped address of tce table */ unsigned long it_index; /* which iommu table this is */ diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index fda01c1..68ffc7a 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -49,6 +49,9 @@ /* 256M DMA window, 4K TCE pages, 8 bytes TCE */ #define 
TCE32_TABLE_SIZE ((0x10000000 / 0x1000) * 8) +#define POWERNV_IOMMU_DEFAULT_LEVELS 1 +#define POWERNV_IOMMU_MAX_LEVELS 5 + static void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl); static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, @@ -1975,6 +1978,8 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group, table_group); struct pnv_phb *phb = pe->phb; int64_t rc; + const unsigned long size = tbl->it_indirect_levels ? + tbl->it_level_size : tbl->it_size; const __u64 start_addr = tbl->it_offset << tbl->it_page_shift; const __u64 win_size = tbl->it_size << tbl->it_page_shift; @@ -1989,9 +1994,9 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group, rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number, pe->pe_number << 1, - 1, + tbl->it_indirect_levels + 1, __pa(tbl->it_base), - tbl->it_size << 3, + size << 3, IOMMU_PAGE_SIZE(tbl)); if (rc) { pe_err(pe, "Failed to configure TCE table, err %ld\n", rc); @@ -2071,11 +2076,19 @@ static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb) phb->ioda.tce_inval_reg = ioremap(phb->ioda.tce_inval_reg_phys, 8); } -static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned shift) +static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned shift, + unsigned levels, unsigned long limit, + unsigned long *tce_table_allocated) { struct page *tce_mem = NULL; - __be64 *addr; + __be64 *addr, *tmp; unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT; + unsigned long local_allocated = 1UL << (order + PAGE_SHIFT); + unsigned entries = 1UL << (shift - 3); + long i; + + if (*tce_table_allocated >= limit) + return NULL; I'm not quite clear what case this limit logic is trying to catch. The function is allocating some number of entries which may sit in one chunk of memory or be spread between multiple chunks in multiple levels. "limit" is the amount of memory for actual TCEs (not intermediate levels).
If the user requests 5 levels and I do not check this, more memory will be allocated than actually needed, because the size of the window is limited. tce_mem = alloc_pages_node(nid, GFP_KERNEL, order); if (!tce_mem) { @@ -2083,31 +2096,69 @@ static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned shift) return NULL; } addr = page_address(tce_mem); - memset(addr, 0, 1UL << (order +
Re: [PATCH kernel v11 33/34] vfio: powerpc/spapr: Register memory and define IOMMU v2
On 06/02/2015 02:17 PM, David Gibson wrote: On Fri, May 29, 2015 at 06:44:57PM +1000, Alexey Kardashevskiy wrote: The existing implementation accounts the whole DMA window in the locked_vm counter. This is going to be worse with multiple containers and huge DMA windows. Also, real-time accounting would require additional tracking of accounted pages due to the page size difference - the IOMMU uses 4K pages while the system uses 4K or 64K pages. Another issue is that actual page pinning/unpinning happens on every DMA map/unmap request. This does not affect performance much now, as we spend way too much time switching context between guest/userspace/host, but it will start to matter when we add in-kernel DMA map/unmap acceleration. This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU. The new IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces 2 new ioctls to register/unregister DMA memory - VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - which receive the user space address and size of a memory region which needs to be pinned/unpinned and counted in locked_vm. The new IOMMU splits physical page pinning and TCE table updates into 2 different operations. It requires: 1) guest pages to be registered first 2) subsequent map/unmap requests to work only with pre-registered memory. For the default single window case this means that the entire guest (instead of 2GB) needs to be pinned before using VFIO. When a huge DMA window is added, no additional pinning will be required, otherwise it would be guest RAM + 2GB. The new memory registration ioctls are not supported by VFIO_SPAPR_TCE_IOMMU. Dynamic DMA windows and in-kernel acceleration will require memory to be preregistered in order to work. The accounting is done per user process. This advertises the v2 SPAPR TCE IOMMU and restricts what the userspace can do with v1 or v2 IOMMUs.
In order to support memory pre-registration, we need a way to track the use of every registered memory region and only allow unregistration if a region is not in use anymore. So we need a way to tell what region the just-cleared TCE came from. This adds a userspace view of the TCE table into the iommu_table struct. It contains a userspace address, one per TCE entry. The table is only allocated when ownership over an IOMMU group is taken, which means it is only used from outside of the powernv code (such as VFIO). Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru [aw: for the vfio related changes] Acked-by: Alex Williamson alex.william...@redhat.com --- Changes: v11: * mm_iommu_put() does not return a code so this does not check it * moved v2 in tce_container to pack the struct v10: * moved it_userspace allocation to vfio_iommu_spapr_tce as it is a VFIO specific thing * squashed "powerpc/iommu: Add userspace view of TCE table" into this as it is a part of IOMMU v2 * s/tce_iommu_use_page_v2/tce_iommu_prereg_ua_to_hpa/ * fixed some function names to have tce_iommu_ in the beginning rather than just tce_ * as mm_iommu_mapped_inc() can now fail, check for the return code v9: * s/tce_get_hva_cached/tce_iommu_use_page_v2/ v7: * now memory is registered per mm (i.e.
process) * moved memory registration code to powerpc/mmu * merged "vfio: powerpc/spapr: Define v2 IOMMU" into this * limited new ioctls to v2 IOMMU * updated doc * unsupported ioctls return -ENOTTY instead of -EPERM v6: * tce_get_hva_cached() returns hva via a pointer v4: * updated docs * s/kzmalloc/vzalloc/ * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and replaced offset with index * renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory and removed duplicating vfio_iommu_spapr_register_memory --- Documentation/vfio.txt | 31 ++- arch/powerpc/include/asm/iommu.h | 6 + drivers/vfio/vfio_iommu_spapr_tce.c | 512 ++-- include/uapi/linux/vfio.h | 27 ++ 4 files changed, 487 insertions(+), 89 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index 96978ec..7dcf2b5 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note This implementation has some specifics: -1) Only one IOMMU group per container is supported as an IOMMU group -represents the minimal entity which isolation can be guaranteed for and -groups are allocated statically, one per a Partitionable Endpoint (PE) +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per +container is supported as an IOMMU table is allocated at the boot time, +one table per a IOMMU group which is a Partitionable Endpoint (PE) (PE is often a PCI domain but not always). +Newer systems (POWER8 with IODA2) have improved hardware design which allows +to remove this limitation and have multiple IOMMU groups per a VFIO container. 2) The hardware supports so called DMA windows - the PCI address range within which DMA transfer is allowed,
Re: [PATCH v13 12/14] perf, tools: Add support for event list topics
On Wed, Jun 03, 2015 at 05:57:33AM -0700, Andi Kleen wrote: please split at least the jevents Topic parsing from the rest, ideally also the alias update and the display change What's the point of all these splits? It's already one logical unit, not too large, and is bisectable. splitting the patch in logical pieces helps review and distro backporting You changed the parsing tool and the perf alias code that uses the new output. IMO it's separate enough to be placed into separate patches. I believe the review would have been easier for me if those changes were separate, also easing my job when backporting this change later into the distro jirka ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v13 12/14] perf, tools: Add support for event list topics
please split at least the jevents Topic parsing from the rest, ideally also the alias update and the display change What's the point of all these splits? It's already one logical unit, not too large, and is bisectable. -andi -- a...@linux.intel.com -- Speaking for myself only
Re: [PATCH 2/4] ppc64 ftrace configuration
Add Kconfig variables and Makefile magic for ftrace with -mprofile-kernel Signed-off-by: Torsten Duwe d...@suse.de diff --git a/Makefile b/Makefile index 3d16bcc..bbd5e87 100644 --- a/Makefile +++ b/Makefile @@ -733,7 +733,10 @@ export CC_FLAGS_FTRACE ifdef CONFIG_HAVE_FENTRY CC_USING_FENTRY := $(call cc-option, -mfentry -DCC_USING_FENTRY) endif -KBUILD_CFLAGS += $(CC_FLAGS_FTRACE) $(CC_USING_FENTRY) +ifdef CONFIG_HAVE_MPROFILE_KERNEL +CC_USING_MPROFILE_KERNEL := $(call cc-option, -mprofile-kernel -DCC_USING_MPROFILE_KERNEL) +endif +KBUILD_CFLAGS += $(CC_FLAGS_FTRACE) $(CC_USING_FENTRY) $(CC_USING_MPROFILE_KERNEL) KBUILD_AFLAGS += $(CC_USING_FENTRY) ifdef CONFIG_DYNAMIC_FTRACE ifdef CONFIG_HAVE_C_RECORDMCOUNT diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 22b0940..566f204 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -94,8 +94,10 @@ config PPC select OF_RESERVED_MEM select HAVE_FTRACE_MCOUNT_RECORD select HAVE_DYNAMIC_FTRACE + select HAVE_DYNAMIC_FTRACE_WITH_REGS select HAVE_FUNCTION_TRACER select HAVE_FUNCTION_GRAPH_TRACER + select HAVE_MPROFILE_KERNEL select SYSCTL_EXCEPTION_TRACE select ARCH_WANT_OPTIONAL_GPIOLIB select VIRT_TO_BUS if !PPC64 diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index a5da09c..dd53f3d 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -52,6 +52,11 @@ config HAVE_FENTRY help Arch supports the gcc options -pg with -mfentry +config HAVE_MPROFILE_KERNEL + bool + help + Arch supports the gcc options -pg with -mprofile-kernel + config HAVE_C_RECORDMCOUNT bool help
Re: [PATCH 3/4] ppc64 ftrace: spare early boot and low level code
Using -mprofile-kernel on early boot code not only confuses the checker but is also useless, as the infrastructure is not yet in place. Proceed like with -pg, equally with time.o and ftrace itself. Signed-off-by: Torsten Duwe d...@suse.de diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile index 502cf69..fb33fc5 100644 --- a/arch/powerpc/kernel/Makefile +++ b/arch/powerpc/kernel/Makefile @@ -17,14 +17,14 @@ endif ifdef CONFIG_FUNCTION_TRACER # Do not trace early boot code -CFLAGS_REMOVE_cputable.o = -pg -mno-sched-epilog -CFLAGS_REMOVE_prom_init.o = -pg -mno-sched-epilog -CFLAGS_REMOVE_btext.o = -pg -mno-sched-epilog -CFLAGS_REMOVE_prom.o = -pg -mno-sched-epilog +CFLAGS_REMOVE_cputable.o = -pg -mno-sched-epilog -mprofile-kernel +CFLAGS_REMOVE_prom_init.o = -pg -mno-sched-epilog -mprofile-kernel +CFLAGS_REMOVE_btext.o = -pg -mno-sched-epilog -mprofile-kernel +CFLAGS_REMOVE_prom.o = -pg -mno-sched-epilog -mprofile-kernel # do not trace tracer code -CFLAGS_REMOVE_ftrace.o = -pg -mno-sched-epilog +CFLAGS_REMOVE_ftrace.o = -pg -mno-sched-epilog -mprofile-kernel # timers used by tracing -CFLAGS_REMOVE_time.o = -pg -mno-sched-epilog +CFLAGS_REMOVE_time.o = -pg -mno-sched-epilog -mprofile-kernel endif obj-y := cputable.o ptrace.o syscalls.o \
Re: [PATCH 4/4] ppc64 ftrace recursion protection
As suggested by you and Jikos, a flag in task_struct's trace_recursion is used to block a tracer function from recursing into itself, especially on a data access fault. This should catch all functions called by the fault handlers which are not yet attributed notrace. Signed-off-by: Torsten Duwe d...@suse.de diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 4717859..ae10752 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -72,6 +72,7 @@ int main(void) DEFINE(THREAD, offsetof(struct task_struct, thread)); DEFINE(MM, offsetof(struct task_struct, mm)); DEFINE(MMCONTEXTID, offsetof(struct mm_struct, context.id)); + DEFINE(TASK_TRACEREC, offsetof(struct task_struct, trace_recursion)); #ifdef CONFIG_PPC64 DEFINE(AUDITCONTEXT, offsetof(struct task_struct, audit_context)); DEFINE(SIGSEGV, SIGSEGV); diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S index a4132ef..4768104 100644 --- a/arch/powerpc/kernel/entry_64.S +++ b/arch/powerpc/kernel/entry_64.S @@ -1202,7 +1202,13 @@ _GLOBAL(ftrace_caller) SAVE_8GPRS(16,r1) SAVE_8GPRS(24,r1) - + ld r3, PACACURRENT(r13) + ld r4, TASK_TRACEREC(r3) + andi. r5, r4, 0x0010 // ( 1 << TRACE_FTRACE_BIT ) + ori r4, r4, 0x0010 + std r4, TASK_TRACEREC(r3) + bne 3f // ftrace in progress - avoid recursion! + LOAD_REG_IMMEDIATE(r3,function_trace_op) ld r5,0(r3) @@ -1224,9 +1230,14 @@ ftrace_call: bl ftrace_stub nop + ld r3, PACACURRENT(r13) + ld r4, TASK_TRACEREC(r3) + andi. r4, r4, 0xffef // ~( 1 << TRACE_FTRACE_BIT ) + std r4, TASK_TRACEREC(r3) + ld r3, _NIP(r1) mtlr r3 - +3: REST_8GPRS(0,r1) REST_8GPRS(8,r1) REST_8GPRS(16,r1)
Re: [PATCH v13 12/14] perf, tools: Add support for event list topics
Em Wed, Jun 03, 2015 at 05:57:33AM -0700, Andi Kleen escreveu: please split at least the jevents Topic parsing from the rest, ideally also the alias update and the display change What's the point of all these splits? It's already one logical unit, not too large, and is bisectable. Eases review, improves bisectability, and it's a reasonable request from a reviewer/maintainer that has to look at an ever-growing number of patch flows. - Arnaldo
[PATCH 0/4] ppc64 ftrace implementation
On Tue, May 19, 2015 at 11:52:47AM +0200, Jiri Kosina wrote: On Tue, 19 May 2015, Michael Ellerman wrote: ftrace already handles recursion protection by itself (depending on the per-ftrace-ops FTRACE_OPS_FL_RECURSION_SAFE flag). OK, so I wonder why that's not working for us? The situation when a traced function recurses into itself is different from the situation when the tracing core infrastructure recurses into itself while performing tracing. I have used this inspiration to add a catch-all parachute for ftrace_caller, see my last reply. It reappears here as patch 4/4. Expect a noticeable performance impact compared to the selective notrace attribution discussed here. This should still be done in a second step, especially for the hardware assistance functions I mentioned. Torsten
Re: [PATCH 1/4] ppc64 ftrace implementation
Implement ftrace on ppc64 Signed-off-by: Torsten Duwe d...@suse.de diff --git a/arch/powerpc/include/asm/ftrace.h b/arch/powerpc/include/asm/ftrace.h index e366187..691 100644 --- a/arch/powerpc/include/asm/ftrace.h +++ b/arch/powerpc/include/asm/ftrace.h @@ -46,6 +46,8 @@ extern void _mcount(void); #ifdef CONFIG_DYNAMIC_FTRACE +# define FTRACE_ADDR ((unsigned long)ftrace_caller+8) +# define FTRACE_REGS_ADDR FTRACE_ADDR static inline unsigned long ftrace_call_adjust(unsigned long addr) { /* relocation of mcount call site is the same as the address */ @@ -58,6 +60,9 @@ struct dyn_arch_ftrace { #endif /* CONFIG_DYNAMIC_FTRACE */ #endif /* __ASSEMBLY__ */ +#ifdef CONFIG_DYNAMIC_FTRACE +#define ARCH_SUPPORTS_FTRACE_OPS 1 +#endif #endif #if defined(CONFIG_FTRACE_SYSCALLS) && defined(CONFIG_PPC64) && !defined(__ASSEMBLY__) diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S index d180caf..a4132ef 100644 --- a/arch/powerpc/kernel/entry_64.S +++ b/arch/powerpc/kernel/entry_64.S @@ -1152,32 +1152,107 @@ _GLOBAL(enter_prom) #ifdef CONFIG_FUNCTION_TRACER #ifdef CONFIG_DYNAMIC_FTRACE -_GLOBAL(mcount) + +#define TOCSAVE 24 + _GLOBAL(_mcount) - blr + nop // REQUIRED for ftrace, to calculate local/global entry diff +.localentry _mcount,.-_mcount + mflr r0 + mtctr r0 + + LOAD_REG_ADDR_PIC(r12,ftrace_trace_function) + ld r12,0(r12) + LOAD_REG_ADDR_PIC(r0,ftrace_stub) + cmpd r0,r12 + ld r0,LRSAVE(r1) + bne- 2f + + mtlr r0 + bctr + +2: /* here we have (*ftrace_trace_function)() in r12, + selfpc in CTR + and frompc in r0 */ + + mtlr r0 + bctr + +_GLOBAL(ftrace_caller) + mr r0,r2 // global (module) call: save module TOC + b 1f +.localentry ftrace_caller,.-ftrace_caller + mr r0,r2 // local call: callee's TOC == our TOC + b 2f + +1: addis r2,r12,(.TOC.-0b)@ha + addi r2,r2,(.TOC.-0b)@l + +2: // Here we have our proper TOC ptr in R2, // and the one we need to restore on return in r0.
+ + ld r12, 16(r1) // get caller's address + + stdu r1,-SWITCH_FRAME_SIZE(r1) + + std r12, _LINK(r1) + SAVE_8GPRS(0,r1) + std r0,TOCSAVE(r1) + SAVE_8GPRS(8,r1) + SAVE_8GPRS(16,r1) + SAVE_8GPRS(24,r1) + + + LOAD_REG_IMMEDIATE(r3,function_trace_op) + ld r5,0(r3) + + mflr r3 + std r3, _NIP(r1) + std r3, 16(r1) + subi r3, r3, MCOUNT_INSN_SIZE + mfmsr r4 + std r4, _MSR(r1) + mfctr r4 + std r4, _CTR(r1) + mfxer r4 + std r4, _XER(r1) + mr r4, r12 + addi r6, r1, STACK_FRAME_OVERHEAD -_GLOBAL_TOC(ftrace_caller) - /* Taken from output of objdump from lib64/glibc */ - mflr r3 - ld r11, 0(r1) - stdu r1, -112(r1) - std r3, 128(r1) - ld r4, 16(r11) - subi r3, r3, MCOUNT_INSN_SIZE .globl ftrace_call ftrace_call: bl ftrace_stub nop + + ld r3, _NIP(r1) + mtlr r3 + + REST_8GPRS(0,r1) + REST_8GPRS(8,r1) + REST_8GPRS(16,r1) + REST_8GPRS(24,r1) + + addi r1, r1, SWITCH_FRAME_SIZE + + ld r12, 16(r1) // get caller's address + mr r2,r0 // restore callee's TOC + mflr r0 // move this LR to CTR + mtctr r0 + mr r0,r12 // restore callee's lr at _mcount site + mtlr r0 + bctr // jump after _mcount site + #ifdef CONFIG_FUNCTION_GRAPH_TRACER .globl ftrace_graph_call ftrace_graph_call: b ftrace_graph_stub _GLOBAL(ftrace_graph_stub) #endif - ld r0, 128(r1) - mtlr r0 - addi r1, r1, 112 + _GLOBAL(ftrace_stub) + nop + nop +.localentry ftrace_stub,.-ftrace_stub blr #else _GLOBAL_TOC(_mcount) @@ -1211,12 +1286,12 @@ _GLOBAL(ftrace_stub) #ifdef CONFIG_FUNCTION_GRAPH_TRACER _GLOBAL(ftrace_graph_caller) /* load r4 with local address */ - ld r4, 128(r1) + ld r4, LRSAVE+SWITCH_FRAME_SIZE(r1) subi r4, r4, MCOUNT_INSN_SIZE /* Grab the LR out of the caller stack frame */ - ld r11, 112(r1) - ld r3, 16(r11) + ld r11, SWITCH_FRAME_SIZE(r1) + ld r3, LRSAVE(r11) bl prepare_ftrace_return nop @@ -1228,10 +1303,7 @@ _GLOBAL(ftrace_graph_caller) ld r11, 112(r1) std r3, 16(r11) - ld r0, 128(r1) - mtlr r0 - addi r1, r1, 112 - blr + b ftrace_graph_stub _GLOBAL(return_to_handler) /* need to save return values */ diff --git
a/arch/powerpc/kernel/ftrace.c b/arch/powerpc/kernel/ftrace.c index 44d4d8e..349d07c 100644 ---
Re: [PATCH] cpuidle: powernv/pseries: Decrease the snooze residency
* Benjamin Herrenschmidt b...@au1.ibm.com [2015-05-30 20:38:22]: On Sat, 2015-05-30 at 11:31 +0530, Vaidyanathan Srinivasan wrote: In the shared lpar case, spinning in guest context may potentially take away cycles from other lpars waiting to run on the same physical cpu. So the policy in the shared lpar case is to let the PowerVM hypervisor know immediately that the guest cpu is idle, which will allow the hypervisor to use the cycles for other tasks/lpars. But that will have negative side effects under KVM no ? Yes, you have a good point. If one of the threads in the core goes to cede, it can still come back quickly since the KVM guest context is not switched yet. But in a single threaded guest, this can force unnecessary exit/context switch overhead. Now that we have fixed the snooze loop to be bounded and exit predictably, a KVM guest should actually use the snooze state to improve latency. I will test this scenario and enable the snooze state for KVM guests. Suresh mentioned something with his new directed interrupts code that we had many cases where the interrupts ended up arriving shortly after we exited to host for NAP'ing ... Snooze might fix it... Right. This scenario is worth experimenting with, and then introducing the snooze loop for guests. --Vaidy
Re: [PATCH V7 06/10] powerpc/eeh: Create PE for VFs
On Wed, Jun 03, 2015 at 03:10:23PM +1000, Gavin Shan wrote: On Wed, Jun 03, 2015 at 11:31:42AM +0800, Wei Yang wrote: On Mon, Jun 01, 2015 at 06:46:45PM -0500, Bjorn Helgaas wrote: On Tue, May 19, 2015 at 06:50:08PM +0800, Wei Yang wrote: Current EEH recovery code works with the assumption: the PE has a primary bus. Unfortunately, that's not true for VF PEs, which generally contain one or multiple VFs (for the VF group case). The patch creates PEs for VFs at PCI final fixup time. Those PEs for VFs are identified with the newly introduced flag EEH_PE_VF so that we handle them differently during EEH recovery. [gwshan: changelog and code refactoring] Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com Acked-by: Gavin Shan gws...@linux.vnet.ibm.com --- arch/powerpc/include/asm/eeh.h | 1 + arch/powerpc/kernel/eeh_pe.c | 10 -- arch/powerpc/platforms/powernv/eeh-powernv.c | 17 + 3 files changed, 26 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h index 1b3614d..c1fde48 100644 --- a/arch/powerpc/include/asm/eeh.h +++ b/arch/powerpc/include/asm/eeh.h @@ -70,6 +70,7 @@ struct pci_dn; #define EEH_PE_PHB (1 << 1) /* PHB PE */ #define EEH_PE_DEVICE (1 << 2) /* Device PE */ #define EEH_PE_BUS (1 << 3) /* Bus PE */ +#define EEH_PE_VF (1 << 4) /* VF PE */ #define EEH_PE_ISOLATED (1 << 0) /* Isolated PE */ #define EEH_PE_RECOVERING (1 << 1) /* Recovering PE */ diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c index 35f0b62..260a701 100644 --- a/arch/powerpc/kernel/eeh_pe.c +++ b/arch/powerpc/kernel/eeh_pe.c @@ -299,7 +299,10 @@ static struct eeh_pe *eeh_pe_get_parent(struct eeh_dev *edev) * EEH device already having associated PE, but * the direct parent EEH device doesn't have yet. */ - pdn = pdn ? pdn->parent : NULL; + if (edev->physfn) + pdn = pci_get_pdn(edev->physfn); + else + pdn = pdn ?
pdn->parent : NULL; while (pdn) { /* We're poking out of PCI territory */ parent = pdn_to_eeh_dev(pdn); @@ -382,7 +385,10 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev) } /* Create a new EEH PE */ - pe = eeh_pe_alloc(edev->phb, EEH_PE_DEVICE); + if (edev->physfn) + pe = eeh_pe_alloc(edev->phb, EEH_PE_VF); + else + pe = eeh_pe_alloc(edev->phb, EEH_PE_DEVICE); if (!pe) { pr_err("%s: out of memory!\n", __func__); return -ENOMEM; } diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c index ce738ab..c505036 100644 --- a/arch/powerpc/platforms/powernv/eeh-powernv.c +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c @@ -1520,6 +1520,23 @@ static struct eeh_ops pnv_eeh_ops = { .restore_config = pnv_eeh_restore_config }; +static void pnv_eeh_vf_final_fixup(struct pci_dev *pdev) +{ + struct pci_dn *pdn = pci_get_pdn(pdev); + + if (!pdev->is_virtfn) + return; + + /* + * The following operations will fail if VF's sysfs files + * aren't created or its resources aren't finalized. + */ I don't understand this comment. "The following operations" seems to refer to eeh_add_device_early() and eeh_add_device_late(), and "VF's sysfs files being created" seems to refer to eeh_sysfs_add_device(). So the comment suggests that eeh_add_device_early() and eeh_add_device_late() will fail because they're called before eeh_sysfs_add_device(). So I think you must be talking about some other "following operations", not eeh_add_device_early() and eeh_add_device_late(). Sorry for the confusion. The comment here wants to say that eeh_sysfs_add_device() will fail if the VF's sysfs is not created properly. It will also fail if the VF's resources are not set properly, since we cache the VF's BARs in eeh_add_device_late(). Gavin, if my understanding is not correct please let me know. It's correct. "The following operations" refers to eeh_add_device_late() and eeh_sysfs_add_device().
The former one requires the resources for one particular PCI device (VF here) are finalized (assigned). eeh_sysfs_add_device() will fail if the sysfs entry for the PCI device isn't populated yet. eeh_add_device_late() contains several things that read config space: eeh_save_bars() caches the entire config header, and eeh_addr_cache_insert_dev() looks at the device resources (which are determined by BARs in config space). I think this is an error-prone approach. I think it would be simpler and safer for you to capture what you need in your PCI config accessors. eeh_add_device_late() also contains code to deal with an EEH cache that might not be removed correctly because of unbalanced kref to the device
Re: [v3,33/36] genirq: Use helper function to access irq_data-msi_desc
On Tue, 2 Jun 2015, Michael Ellerman wrote: Though I would point out we already have 16 irq_foo_get_bar() accessors?! Say thanks to the people who mindlessly fiddled in the core managed fields of irqdesc and irqdata. The hard-to-track-down wreckage they caused made me enforce accessors. The new accessors are there to help migrating members of irq_data to a new data structure without breaking the world, and some more. Thanks, tglx
Re: [PATCH kernel v11 09/34] vfio: powerpc/spapr: Move locked_vm accounting to helpers
On Wed, Jun 03, 2015 at 09:11:09PM +1000, Alexey Kardashevskiy wrote: On 06/01/2015 02:28 PM, David Gibson wrote: On Fri, May 29, 2015 at 06:44:33PM +1000, Alexey Kardashevskiy wrote: This moves locked pages accounting to helpers. Later they will be reused for Dynamic DMA windows (DDW). This reworks debug messages to show the current value and the limit. This stores the locked pages number in the container so when unlocking, the iommu table pointer won't be needed. This does not have an effect now but it will with multiple tables per container, as then we will allow attaching/detaching groups on the fly and we may end up having a container with no group attached but with the counter incremented. While we are here, update the comment explaining why RLIMIT_MEMLOCK might be required to be bigger than the guest RAM. This also prints the pid of the current process in pr_warn/pr_debug. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru [aw: for the vfio related changes] Acked-by: Alex Williamson alex.william...@redhat.com Reviewed-by: David Gibson da...@gibson.dropbear.id.au Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com --- Changes: v4: * new helpers do nothing if @npages == 0 * tce_iommu_disable() now can decrement the counter if the group was detached (not possible now but will be in the future) --- drivers/vfio/vfio_iommu_spapr_tce.c | 82 +- 1 file changed, 63 insertions(+), 19 deletions(-) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 64300cc..40583f9 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -29,6 +29,51 @@ static void tce_iommu_detach_group(void *iommu_data, struct iommu_group *iommu_group); +static long try_increment_locked_vm(long npages) +{ + long ret = 0, locked, lock_limit; + + if (!current || !current->mm) + return -ESRCH; /* process exited */ + + if (!npages) + return 0; + + down_write(&current->mm->mmap_sem); + locked = current->mm->locked_vm + npages; Is there a
possibility of userspace triggering an integer overflow here, if npages is really huge? I do not see how. I just do not accept npages bigger than the host RAM size in pages. And it is long. For (let's say) a 128GB host, the number of 4KB pages is (128<<30)/4096=33554432. Ah, yes, npages has already been shifted right so it should be safe. Ok. + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) + ret = -ENOMEM; + else + current->mm->locked_vm += npages; + + pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid, + npages << PAGE_SHIFT, + current->mm->locked_vm << PAGE_SHIFT, + rlimit(RLIMIT_MEMLOCK), + ret ? " - exceeded" : ""); + + up_write(&current->mm->mmap_sem); + + return ret; +} + +static void decrement_locked_vm(long npages) +{ + if (!current || !current->mm || !npages) + return; /* process exited */ + + down_write(&current->mm->mmap_sem); + if (npages > current->mm->locked_vm) + npages = current->mm->locked_vm; Can this case ever occur (without there being a leak bug somewhere else in the code)? It should not. Safety measure. Having a warning here might make sense but I believe if this happens, there will be many, many warnings in other places :) Ok. It would be nice to see a WARN_ON() as documentation that this isn't a situation that should ever happen. I wouldn't nack on that basis alone though.
+ current->mm->locked_vm -= npages; + pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid, + npages << PAGE_SHIFT, + current->mm->locked_vm << PAGE_SHIFT, + rlimit(RLIMIT_MEMLOCK)); + up_write(&current->mm->mmap_sem); +} + /* * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation * @@ -45,6 +90,7 @@ struct tce_container { struct mutex lock; struct iommu_table *tbl; bool enabled; + unsigned long locked_pages; }; static bool tce_page_is_contained(struct page *page, unsigned page_shift) @@ -60,7 +106,7 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift) static int tce_iommu_enable(struct tce_container *container) { int ret = 0; - unsigned long locked, lock_limit, npages; + unsigned long locked; struct iommu_table *tbl = container->tbl; if (!container->tbl) @@ -89,21 +135,22 @@ static int tce_iommu_enable(struct tce_container *container) * Also we don't have a nice way to fail on H_PUT_TCE due to ulimits, * that would effectively kill the guest at random points, much better * enforcing the limit based on the max that the guest can map. +* +* Unfortunately at the moment it counts whole tables, no matter how +* much
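The accounting rule the helpers above implement — bump locked_vm under mmap_sem, refuse to cross RLIMIT_MEMLOCK unless CAP_IPC_LOCK, and clamp on the way down (where the reviewers suggested a WARN_ON) — can be sketched as a minimal userspace model. All names here are hypothetical stand-ins; the real helpers also take the semaphore and report in pr_debug:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the kernel helpers above: counts are in pages,
 * as in mm_struct; locking and pr_debug reporting are omitted. */
struct mm_model {
	long locked_vm;    /* pages currently accounted as locked */
	long lock_limit;   /* rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT */
	bool cap_ipc_lock; /* CAP_IPC_LOCK bypasses the limit */
};

static long try_increment_locked_vm(struct mm_model *mm, long npages)
{
	if (!npages)
		return 0; /* the v4 change: do nothing for zero pages */
	/* Fail if the new total would exceed the limit, unless privileged. */
	if (mm->locked_vm + npages > mm->lock_limit && !mm->cap_ipc_lock)
		return -1; /* -ENOMEM in the kernel */
	mm->locked_vm += npages;
	return 0;
}

static void decrement_locked_vm(struct mm_model *mm, long npages)
{
	/* Clamp, as the patch does; a WARN_ON() here would flag the
	 * leak case discussed in the review. */
	if (npages > mm->locked_vm)
		npages = mm->locked_vm;
	mm->locked_vm -= npages;
}
```

The clamp makes a double-decrement harmless for the counter itself, which is why the reviewers saw it as a safety measure rather than expected behavior.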
[RFT v2 26/48] powerpc, irq: Prepare for killing the first parameter 'irq' of irq_flow_handler_t
Change irq flow handler to prepare for killing the first parameter 'irq' of irq_flow_handler_t. Signed-off-by: Jiang Liu jiang@linux.intel.com --- arch/powerpc/platforms/512x/mpc5121_ads_cpld.c |4 +++- arch/powerpc/platforms/85xx/socrates_fpga_pic.c |2 +- arch/powerpc/platforms/cell/interrupt.c |3 ++- 3 files changed, 6 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/platforms/512x/mpc5121_ads_cpld.c b/arch/powerpc/platforms/512x/mpc5121_ads_cpld.c index ca3a062ed1b9..4411ed51803e 100644 --- a/arch/powerpc/platforms/512x/mpc5121_ads_cpld.c +++ b/arch/powerpc/platforms/512x/mpc5121_ads_cpld.c @@ -105,8 +105,10 @@ cpld_pic_get_irq(int offset, u8 ignore, u8 __iomem *statusp, } static void -cpld_pic_cascade(unsigned int irq, struct irq_desc *desc) +cpld_pic_cascade(unsigned int __irq, struct irq_desc *desc) { + unsigned int irq; + irq = cpld_pic_get_irq(0, PCI_IGNORE, &cpld_regs->pci_status, &cpld_regs->pci_mask); if (irq != NO_IRQ) { diff --git a/arch/powerpc/platforms/85xx/socrates_fpga_pic.c b/arch/powerpc/platforms/85xx/socrates_fpga_pic.c index 55a9682b9529..5153e58654f7 100644 --- a/arch/powerpc/platforms/85xx/socrates_fpga_pic.c +++ b/arch/powerpc/platforms/85xx/socrates_fpga_pic.c @@ -100,7 +100,7 @@ void socrates_fpga_pic_cascade(unsigned int irq, struct irq_desc *desc) * See if we actually have an interrupt, call generic handling code if * we do.
*/ - cascade_irq = socrates_fpga_pic_get_irq(irq); + cascade_irq = socrates_fpga_pic_get_irq(irq_desc_get_irq(desc)); if (cascade_irq != NO_IRQ) generic_handle_irq(cascade_irq); diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c index 3af8324c122e..e2dd6c9d3a78 100644 --- a/arch/powerpc/platforms/cell/interrupt.c +++ b/arch/powerpc/platforms/cell/interrupt.c @@ -99,8 +99,9 @@ static void iic_ioexc_eoi(struct irq_data *d) { } -static void iic_ioexc_cascade(unsigned int irq, struct irq_desc *desc) +static void iic_ioexc_cascade(unsigned int __irq, struct irq_desc *desc) { + unsigned int irq = irq_desc_get_irq(desc); struct irq_chip *chip = irq_desc_get_chip(desc); struct cbe_iic_regs __iomem *node_iic = (void __iomem *)irq_desc_get_handler_data(desc); -- 1.7.10.4 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
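The pattern the series applies above — a cascade flow handler no longer receives the irq number and instead recovers it from the descriptor only where it is actually needed — can be illustrated with a small hypothetical model (the struct and accessor below are stand-ins for the kernel's irq_desc and irq_desc_get_irq(), not the real API):

```c
#include <assert.h>

/* Hypothetical stand-in for struct irq_desc: the irq number lives in
 * the descriptor (desc->irq_data.irq in the kernel), so passing it as
 * a separate parameter is redundant. */
struct irq_desc_model {
	unsigned int irq;   /* mirrors irq_desc_get_irq(desc) */
	void *handler_data; /* mirrors irq_desc_get_handler_data(desc) */
};

static unsigned int model_irq_desc_get_irq(struct irq_desc_model *desc)
{
	return desc->irq;
}

/* Old style: void cascade(unsigned int irq, struct irq_desc *desc)
 * New style: only the descriptor is passed; handlers that still need
 * the number look it up, as socrates_fpga_pic_cascade() does above. */
static unsigned int cascade_model(struct irq_desc_model *desc)
{
	return model_irq_desc_get_irq(desc);
}
```

This is why the interim patches rename the parameter to `__irq`: the body is converted first, so the final tree-wide signature change becomes mechanical.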
Re: [PATCH V7 06/10] powerpc/eeh: Create PE for VFs
On Wed, Jun 03, 2015 at 10:46:38AM -0500, Bjorn Helgaas wrote: On Wed, Jun 03, 2015 at 03:10:23PM +1000, Gavin Shan wrote: On Wed, Jun 03, 2015 at 11:31:42AM +0800, Wei Yang wrote: On Mon, Jun 01, 2015 at 06:46:45PM -0500, Bjorn Helgaas wrote: On Tue, May 19, 2015 at 06:50:08PM +0800, Wei Yang wrote: Current EEH recovery code works with the assumption: the PE has a primary bus. Unfortunately, that's not true for VF PEs, which generally contain one or multiple VFs (for the VF group case). The patch creates PEs for VFs at PCI final fixup time. Those PEs for VFs are identified with the newly introduced flag EEH_PE_VF so that we handle them differently during EEH recovery. [gwshan: changelog and code refactoring] Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com Acked-by: Gavin Shan gws...@linux.vnet.ibm.com --- arch/powerpc/include/asm/eeh.h |1 + arch/powerpc/kernel/eeh_pe.c | 10 -- arch/powerpc/platforms/powernv/eeh-powernv.c | 17 + 3 files changed, 26 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h index 1b3614d..c1fde48 100644 --- a/arch/powerpc/include/asm/eeh.h +++ b/arch/powerpc/include/asm/eeh.h @@ -70,6 +70,7 @@ struct pci_dn; #define EEH_PE_PHB (1 << 1)/* PHB PE*/ #define EEH_PE_DEVICE (1 << 2)/* Device PE */ #define EEH_PE_BUS (1 << 3)/* Bus PE*/ +#define EEH_PE_VF (1 << 4)/* VF PE */ #define EEH_PE_ISOLATED (1 << 0)/* Isolated PE */ #define EEH_PE_RECOVERING (1 << 1)/* Recovering PE */ diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c index 35f0b62..260a701 100644 --- a/arch/powerpc/kernel/eeh_pe.c +++ b/arch/powerpc/kernel/eeh_pe.c @@ -299,7 +299,10 @@ static struct eeh_pe *eeh_pe_get_parent(struct eeh_dev *edev) * EEH device already having associated PE, but * the direct parent EEH device doesn't have yet. */ -pdn = pdn ? pdn->parent : NULL; +if (edev->physfn) +pdn = pci_get_pdn(edev->physfn); +else +pdn = pdn ?
pdn->parent : NULL; while (pdn) { /* We're poking out of PCI territory */ parent = pdn_to_eeh_dev(pdn); @@ -382,7 +385,10 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev) } /* Create a new EEH PE */ -pe = eeh_pe_alloc(edev->phb, EEH_PE_DEVICE); +if (edev->physfn) +pe = eeh_pe_alloc(edev->phb, EEH_PE_VF); +else +pe = eeh_pe_alloc(edev->phb, EEH_PE_DEVICE); if (!pe) { pr_err("%s: out of memory!\n", __func__); return -ENOMEM; diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c index ce738ab..c505036 100644 --- a/arch/powerpc/platforms/powernv/eeh-powernv.c +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c @@ -1520,6 +1520,23 @@ static struct eeh_ops pnv_eeh_ops = { .restore_config = pnv_eeh_restore_config }; +static void pnv_eeh_vf_final_fixup(struct pci_dev *pdev) +{ +struct pci_dn *pdn = pci_get_pdn(pdev); + +if (!pdev->is_virtfn) +return; + +/* + * The following operations will fail if VF's sysfs files + * aren't created or its resources aren't finalized. + */ I don't understand this comment. "The following operations" seems to refer to eeh_add_device_early() and eeh_add_device_late(), and "VF's sysfs files being created" seems to refer to eeh_sysfs_add_device(). So the comment suggests that eeh_add_device_early() and eeh_add_device_late() will fail because they're called before eeh_sysfs_add_device(). So I think you must be talking about some other "following operations", not eeh_add_device_early() and eeh_add_device_late(). Sorry for this confusion. The comment here wants to say that eeh_sysfs_add_device() will fail if the VF's sysfs is not created well. Or it will fail if the VF's resources are not set properly, since we would cache the VF's BARs in eeh_add_device_late(). Gavin, if my understanding is not correct please let me know. It's correct. "The following operations" refers to eeh_add_device_late() and eeh_sysfs_add_device().
The former one requires the resources for one particular PCI device (VF here) are finalized (assigned). eeh_sysfs_add_device() will fail if the sysfs entry for the PCI device isn't populated yet. eeh_add_device_late() contains several things that read config space: eeh_save_bars() caches the entire config header, and eeh_addr_cache_insert_dev() looks at the device resources (which are determined by BARs in config space). I think this is an error-prone
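The core of the patch under review is the PE-type decision: a device that has a physfn backing it is a VF and gets a VF PE, everything else gets a device PE. A hypothetical userspace sketch of that selection, using the flag values from the hunk above (the struct and function names here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

/* Flag values mirror the eeh.h hunk above; note the type flags
 * (PHB/DEVICE/BUS/VF) and state flags (ISOLATED/RECOVERING) are kept
 * in separate fields in the kernel, so the bit reuse is intentional. */
#define EEH_PE_PHB	(1 << 1)
#define EEH_PE_DEVICE	(1 << 2)
#define EEH_PE_BUS	(1 << 3)
#define EEH_PE_VF	(1 << 4)

/* Hypothetical stand-in for struct eeh_dev. */
struct eeh_dev_model {
	void *physfn; /* non-NULL only for an SR-IOV virtual function */
};

/* Models the eeh_pe_alloc() type choice in eeh_add_to_parent_pe(). */
static int eeh_pe_type(const struct eeh_dev_model *edev)
{
	return edev->physfn ? EEH_PE_VF : EEH_PE_DEVICE;
}
```

The same physfn test drives the parent lookup in eeh_pe_get_parent(): a VF's EEH parent is found via its physical function's pci_dn rather than the bus hierarchy.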
Re: [PATCH kernel v11 27/34] powerpc/powernv: Implement multilevel TCE tables
On Wed, Jun 03, 2015 at 09:27:10PM +1000, Alexey Kardashevskiy wrote: On 06/02/2015 09:50 AM, David Gibson wrote: On Fri, May 29, 2015 at 06:44:51PM +1000, Alexey Kardashevskiy wrote: TCE tables might get too big in case of 4K IOMMU pages and DDW enabled on huge guests (hundreds of GB of RAM) so the kernel might be unable to allocate contiguous chunk of physical memory to store the TCE table. To address this, POWER8 CPU (actually, IODA2) supports multi-level TCE tables, up to 5 levels which splits the table into a tree of smaller subtables. This adds multi-level TCE tables support to pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages() helpers. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: v10: * fixed multiple comments received for v9 v9: * moved from ioda2 to common powernv pci code * fixed cleanup if allocation fails in a middle * removed check for the size - all boundary checks happen in the calling code anyway --- arch/powerpc/include/asm/iommu.h | 2 + arch/powerpc/platforms/powernv/pci-ioda.c | 98 --- arch/powerpc/platforms/powernv/pci.c | 13 3 files changed, 104 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 4636734..706cfc0 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -96,6 +96,8 @@ struct iommu_pool { struct iommu_table { unsigned long it_busno; /* Bus number this table belongs to */ unsigned long it_size; /* Size of iommu table in entries */ + unsigned long it_indirect_levels; + unsigned long it_level_size; unsigned long it_offset;/* Offset into global table */ unsigned long it_base; /* mapped address of tce table */ unsigned long it_index; /* which iommu table this is */ diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index fda01c1..68ffc7a 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -49,6 +49,9 
@@ /* 256M DMA window, 4K TCE pages, 8 bytes TCE */ #define TCE32_TABLE_SIZE ((0x10000000 / 0x1000) * 8) +#define POWERNV_IOMMU_DEFAULT_LEVELS 1 +#define POWERNV_IOMMU_MAX_LEVELS 5 + static void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl); static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, @@ -1975,6 +1978,8 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group, table_group); struct pnv_phb *phb = pe->phb; int64_t rc; + const unsigned long size = tbl->it_indirect_levels ? + tbl->it_level_size : tbl->it_size; const __u64 start_addr = tbl->it_offset << tbl->it_page_shift; const __u64 win_size = tbl->it_size << tbl->it_page_shift; @@ -1989,9 +1994,9 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group, rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number, pe->pe_number << 1, - 1, + tbl->it_indirect_levels + 1, __pa(tbl->it_base), - tbl->it_size << 3, + size << 3, IOMMU_PAGE_SIZE(tbl)); if (rc) { pe_err(pe, "Failed to configure TCE table, err %ld\n", rc); @@ -2071,11 +2076,19 @@ static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb) phb->ioda.tce_inval_reg = ioremap(phb->ioda.tce_inval_reg_phys, 8); } -static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned shift) +static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned shift, + unsigned levels, unsigned long limit, + unsigned long *tce_table_allocated) { struct page *tce_mem = NULL; - __be64 *addr; + __be64 *addr, *tmp; unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT; + unsigned long local_allocated = 1UL << (order + PAGE_SHIFT); + unsigned entries = 1UL << (shift - 3); + long i; + + if (*tce_table_allocated >= limit) + return NULL; I'm not quite clear what case this limit logic is trying to catch. The function is allocating some amount of entries which may be in one chunk of memory and spread between multiple chunks in multiple levels. "limit" is the amount of memory for actual TCEs (not intermediate levels).
If I do not do this, and the user requests 5 levels, and I do not check this, more memory will be allocated than actually needed because the size of the window is limited. Ah, ok. It's to handle the case where the requested window size doesn't match a whole number of levels. It seems a rather counter-intuitive way of handling it to me - tracking the amount of memory allocated at the leaf level, rather than tracking what window offset you're
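The sizing arithmetic being debated above is straightforward to check numerically: a window of `win_size` bytes with `1 << page_shift` IOMMU pages needs `it_size = win_size >> page_shift` TCEs of 8 bytes each, and with multilevel tables each leaf chunk of `1 << chunk_shift` bytes holds `1 << (chunk_shift - 3)` entries — the "limit" test stops leaf allocation once enough TCEs exist for the window. A minimal sketch of that arithmetic (function names are illustrative, not the kernel's):

```c
#include <assert.h>

/* Number of TCE entries needed to map a DMA window: one entry per
 * IOMMU page (it_size in the iommu_table above). */
static unsigned long tce_entries(unsigned long win_size, unsigned page_shift)
{
	return win_size >> page_shift;
}

/* With multilevel tables, each leaf chunk of (1 << chunk_shift) bytes
 * holds 1 << (chunk_shift - 3) eight-byte TCEs; this computes how many
 * leaf chunks a window needs, rounding up for a partial last chunk --
 * the case the "limit" check above exists to catch. */
static unsigned long leaf_chunks_needed(unsigned long entries,
					unsigned chunk_shift)
{
	unsigned long per_chunk = 1UL << (chunk_shift - 3);

	return (entries + per_chunk - 1) / per_chunk;
}
```

For example, a 2GB window of 4K pages needs 524288 TCEs (4MB of table); split into 64KB chunks that is 64 leaf chunks, regardless of how many intermediate levels point at them.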
[RFT v2 38/48] genirq, powerpc: Kill the first parameter 'irq' of irq_flow_handler_t
Now most IRQ flow handlers make no use of the first parameter 'irq'. And for those who do make use of 'irq', we could easily get the irq number through irq_desc->irq_data->irq. So kill the first parameter 'irq' of irq_flow_handler_t. To ease review, I have split the changes into several parts, though they should be merged as one to support bisecting. Signed-off-by: Jiang Liu jiang@linux.intel.com --- arch/powerpc/include/asm/qe_ic.h| 23 +-- arch/powerpc/include/asm/tsi108_pci.h |2 +- arch/powerpc/platforms/512x/mpc5121_ads_cpld.c |2 +- arch/powerpc/platforms/52xx/media5200.c |2 +- arch/powerpc/platforms/52xx/mpc52xx_gpt.c |2 +- arch/powerpc/platforms/82xx/pq2ads-pci-pic.c|2 +- arch/powerpc/platforms/85xx/common.c|2 +- arch/powerpc/platforms/85xx/mpc85xx_cds.c |5 ++--- arch/powerpc/platforms/85xx/mpc85xx_ds.c|2 +- arch/powerpc/platforms/85xx/socrates_fpga_pic.c |2 +- arch/powerpc/platforms/86xx/pic.c |2 +- arch/powerpc/platforms/8xx/m8xx_setup.c |2 +- arch/powerpc/platforms/cell/axon_msi.c |2 +- arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/cell/spider-pic.c|2 +- arch/powerpc/platforms/chrp/setup.c |2 +- arch/powerpc/platforms/embedded6xx/hlwd-pic.c |3 +-- arch/powerpc/platforms/embedded6xx/mvme5100.c |2 +- arch/powerpc/platforms/pseries/setup.c |2 +- arch/powerpc/sysdev/ge/ge_pic.c |2 +- arch/powerpc/sysdev/ge/ge_pic.h |2 +- arch/powerpc/sysdev/mpic.c |2 +- arch/powerpc/sysdev/qe_lib/qe_ic.c |4 ++-- arch/powerpc/sysdev/tsi108_pci.c|2 +- arch/powerpc/sysdev/uic.c |2 +- arch/powerpc/sysdev/xilinx_intc.c |2 +- 26 files changed, 36 insertions(+), 43 deletions(-) diff --git a/arch/powerpc/include/asm/qe_ic.h b/arch/powerpc/include/asm/qe_ic.h index 25784cc959a0..1e155ca6d33c 100644 --- a/arch/powerpc/include/asm/qe_ic.h +++ b/arch/powerpc/include/asm/qe_ic.h @@ -59,14 +59,14 @@ enum qe_ic_grp_id { #ifdef CONFIG_QUICC_ENGINE void qe_ic_init(struct device_node *node, unsigned int flags, - void (*low_handler)(unsigned int irq, struct irq_desc *desc), -
void (*high_handler)(unsigned int irq, struct irq_desc *desc)); + void (*low_handler)(struct irq_desc *desc), + void (*high_handler)(struct irq_desc *desc)); unsigned int qe_ic_get_low_irq(struct qe_ic *qe_ic); unsigned int qe_ic_get_high_irq(struct qe_ic *qe_ic); #else static inline void qe_ic_init(struct device_node *node, unsigned int flags, - void (*low_handler)(unsigned int irq, struct irq_desc *desc), - void (*high_handler)(unsigned int irq, struct irq_desc *desc)) + void (*low_handler)(struct irq_desc *desc), + void (*high_handler)(struct irq_desc *desc)) {} static inline unsigned int qe_ic_get_low_irq(struct qe_ic *qe_ic) { return 0; } @@ -78,8 +78,7 @@ void qe_ic_set_highest_priority(unsigned int virq, int high); int qe_ic_set_priority(unsigned int virq, unsigned int priority); int qe_ic_set_high_priority(unsigned int virq, unsigned int priority, int high); -static inline void qe_ic_cascade_low_ipic(unsigned int irq, - struct irq_desc *desc) +static inline void qe_ic_cascade_low_ipic(struct irq_desc *desc) { struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); unsigned int cascade_irq = qe_ic_get_low_irq(qe_ic); @@ -88,8 +87,7 @@ static inline void qe_ic_cascade_low_ipic(unsigned int irq, generic_handle_irq(cascade_irq); } -static inline void qe_ic_cascade_high_ipic(unsigned int irq, - struct irq_desc *desc) +static inline void qe_ic_cascade_high_ipic(struct irq_desc *desc) { struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); unsigned int cascade_irq = qe_ic_get_high_irq(qe_ic); @@ -98,8 +96,7 @@ static inline void qe_ic_cascade_high_ipic(unsigned int irq, generic_handle_irq(cascade_irq); } -static inline void qe_ic_cascade_low_mpic(unsigned int irq, - struct irq_desc *desc) +static inline void qe_ic_cascade_low_mpic(struct irq_desc *desc) { struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); unsigned int cascade_irq = qe_ic_get_low_irq(qe_ic); @@ -111,8 +108,7 @@ static inline void qe_ic_cascade_low_mpic(unsigned int irq, 
chip->irq_eoi(&desc->irq_data); } -static inline void qe_ic_cascade_high_mpic(unsigned int irq, - struct irq_desc *desc) +static inline void qe_ic_cascade_high_mpic(struct irq_desc *desc) { struct qe_ic *qe_ic =
Re: [RFC 07/24] x86/thinkpad_acpi: Use arch_nvram_ops methods instead of nvram_read_byte() and nvram_write_byte()
On Wed, Jun 03, 2015 at 07:37:13AM -0300, Henrique de Moraes Holschuh wrote: On Wed, Jun 3, 2015, at 00:34, Darren Hart wrote: On Tue, Jun 02, 2015 at 07:09:28AM -0300, Henrique de Moraes Holschuh wrote: Test results were sent to me privately, and they are correct, so... Finn, unless there is some compelling reason not to - like they are MBs worth of data, please submit these to the list in the future so we have them for reference. After I told him which exact bitmask to use on a T43 to test hotkey_source_mask, his test results can be summarized as I could see no difference in behavior, which is *exactly* what I expected to happen. If anything went wrong with the thinkpad-acpi NVRAM code, you'd notice a very large change in behavior (typical: hotkeys don't work, less typical: random hotkey keypresses, hotkey press bursts, low responsivity of hotkeys). Perfect, thanks for the update so we have it recorded here on the list. -- Darren Hart Intel Open Source Technology Center
Re: [PATCH kernel v11 33/34] vfio: powerpc/spapr: Register memory and define IOMMU v2
On Wed, Jun 03, 2015 at 09:40:49PM +1000, Alexey Kardashevskiy wrote: On 06/02/2015 02:17 PM, David Gibson wrote: On Fri, May 29, 2015 at 06:44:57PM +1000, Alexey Kardashevskiy wrote: The existing implementation accounts the whole DMA window in the locked_vm counter. This is going to be worse with multiple containers and huge DMA windows. Also, real-time accounting would requite additional tracking of accounted pages due to the page size difference - IOMMU uses 4K pages and system uses 4K or 64K pages. Another issue is that actual pages pinning/unpinning happens on every DMA map/unmap request. This does not affect the performance much now as we spend way too much time now on switching context between guest/userspace/host but this will start to matter when we add in-kernel DMA map/unmap acceleration. This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU. New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces 2 new ioctls to register/unregister DMA memory - VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - which receive user space address and size of a memory region which needs to be pinned/unpinned and counted in locked_vm. New IOMMU splits physical pages pinning and TCE table update into 2 different operations. It requires: 1) guest pages to be registered first 2) consequent map/unmap requests to work only with pre-registered memory. For the default single window case this means that the entire guest (instead of 2GB) needs to be pinned before using VFIO. When a huge DMA window is added, no additional pinning will be required, otherwise it would be guest RAM + 2GB. The new memory registration ioctls are not supported by VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration will require memory to be preregistered in order to work. The accounting is done per the user process. This advertises v2 SPAPR TCE IOMMU and restricts what the userspace can do with v1 or v2 IOMMUs. 
In order to support memory pre-registration, we need a way to track the use of every registered memory region and only allow unregistration if a region is not in use anymore. So we need a way to tell from what region the just cleared TCE was from. This adds a userspace view of the TCE table into iommu_table struct. It contains userspace address, one per TCE entry. The table is only allocated when the ownership over an IOMMU group is taken which means it is only used from outside of the powernv code (such as VFIO). Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru [aw: for the vfio related changes] Acked-by: Alex Williamson alex.william...@redhat.com --- Changes: v11: * mm_iommu_put() does not return a code so this does not check it * moved v2 in tce_container to pack the struct v10: * moved it_userspace allocation to vfio_iommu_spapr_tce as it VFIO specific thing * squashed powerpc/iommu: Add userspace view of TCE table into this as it is a part of IOMMU v2 * s/tce_iommu_use_page_v2/tce_iommu_prereg_ua_to_hpa/ * fixed some function names to have tce_iommu_ in the beginning rather just tce_ * as mm_iommu_mapped_inc() can now fail, check for the return code v9: * s/tce_get_hva_cached/tce_iommu_use_page_v2/ v7: * now memory is registered per mm (i.e. 
process) * moved memory registration code to powerpc/mmu * merged vfio: powerpc/spapr: Define v2 IOMMU into this * limited new ioctls to v2 IOMMU * updated doc * unsupported ioclts return -ENOTTY instead of -EPERM v6: * tce_get_hva_cached() returns hva via a pointer v4: * updated docs * s/kzmalloc/vzalloc/ * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and replaced offset with index * renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory and removed duplicating vfio_iommu_spapr_register_memory --- Documentation/vfio.txt | 31 ++- arch/powerpc/include/asm/iommu.h| 6 + drivers/vfio/vfio_iommu_spapr_tce.c | 512 ++-- include/uapi/linux/vfio.h | 27 ++ 4 files changed, 487 insertions(+), 89 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index 96978ec..7dcf2b5 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note This implementation has some specifics: -1) Only one IOMMU group per container is supported as an IOMMU group -represents the minimal entity which isolation can be guaranteed for and -groups are allocated statically, one per a Partitionable Endpoint (PE) +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per +container is supported as an IOMMU table is allocated at the boot time, +one table per a IOMMU group which is a Partitionable Endpoint (PE) (PE is often a PCI domain but not always). +Newer systems (POWER8 with IODA2) have improved hardware design which allows +to
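The pre-registration lifecycle described in this patch — memory is registered once and accounted up front, every TCE mapping into a region bumps a use count, and unregistration is refused while mappings remain (mm_iommu_mapped_inc()/dec() in the real code) — can be sketched as a small hypothetical model; the names below are stand-ins, not the kernel API:

```c
#include <assert.h>

/* Hypothetical model of a pre-registered memory region: locked_vm is
 * charged once at registration time, and 'mapped' counts outstanding
 * TCE mappings that resolve into this region. */
struct mem_region_model {
	unsigned long mapped; /* outstanding TCE mappings using this region */
};

/* Called on each TCE map that targets the region; in the kernel this
 * can fail if the region is being torn down. */
static int region_mapped_inc(struct mem_region_model *r)
{
	r->mapped++;
	return 0;
}

/* Called when a TCE pointing into the region is cleared. */
static void region_mapped_dec(struct mem_region_model *r)
{
	if (r->mapped)
		r->mapped--;
}

/* VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY must refuse while in use. */
static int region_unregister(struct mem_region_model *r)
{
	return r->mapped ? -1 /* -EBUSY */ : 0;
}
```

This is why the v2 IOMMU keeps a userspace view of each TCE table: when a TCE is cleared, the kernel needs the original userspace address to find which registered region's use count to drop.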
RE: [PATCH 2/2] rheap: move rheap.c from arch/powerpc/lib/ to lib/
On Thu, 2015-05-28 at 1:37AM +0800, Wood Scott wrote: -Original Message- From: Wood Scott-B07421 Sent: Thursday, May 28, 2015 1:37 AM To: Zhao Qiang-B45475 Cc: linuxppc-dev@lists.ozlabs.org; Wood Scott-B07421; Xie Xiaobo-R63061 Subject: Re: [PATCH 2/2] rheap: move rheap.c from arch/powerpc/lib/ to lib/ On Wed, 2015-05-27 at 17:12 +0800, Zhao Qiang wrote: qe need to use the rheap, so move it to public directory. You've been previously asked to use lib/genalloc.c rather than introduce duplicate functionality into /lib. NACK. Can't use lib/genalloc.c instead of rheap.c. Qe need to alloc muram of qe, not DIMM. Also, please don't use coreid-based e-mail addresses with no real names associated, which makes it hard to tell who has been CCed. -Scott Best Regards Zhao Qiang
[RFT v2 04/48] powerpc, irq: Use irq_desc_get_xxx() to avoid redundant lookup of irq_desc
Use irq_desc_get_xxx() to avoid redundant lookup of irq_desc while we already have a pointer to corresponding irq_desc. Note: this patch has been queued for 4.2 by Michael Ellerman m...@ellerman.id.au Signed-off-by: Jiang Liu jiang@linux.intel.com --- arch/powerpc/platforms/52xx/mpc52xx_gpt.c |2 +- arch/powerpc/platforms/cell/axon_msi.c|2 +- arch/powerpc/platforms/embedded6xx/hlwd-pic.c |2 +- arch/powerpc/sysdev/uic.c |2 +- arch/powerpc/sysdev/xics/xics-common.c|2 +- 5 files changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/platforms/52xx/mpc52xx_gpt.c b/arch/powerpc/platforms/52xx/mpc52xx_gpt.c index c949ca055712..63016621aff8 100644 --- a/arch/powerpc/platforms/52xx/mpc52xx_gpt.c +++ b/arch/powerpc/platforms/52xx/mpc52xx_gpt.c @@ -193,7 +193,7 @@ static struct irq_chip mpc52xx_gpt_irq_chip = { void mpc52xx_gpt_irq_cascade(unsigned int virq, struct irq_desc *desc) { - struct mpc52xx_gpt_priv *gpt = irq_get_handler_data(virq); + struct mpc52xx_gpt_priv *gpt = irq_desc_get_handler_data(desc); int sub_virq; u32 status; diff --git a/arch/powerpc/platforms/cell/axon_msi.c b/arch/powerpc/platforms/cell/axon_msi.c index 623bd961465a..817d0e6747ea 100644 --- a/arch/powerpc/platforms/cell/axon_msi.c +++ b/arch/powerpc/platforms/cell/axon_msi.c @@ -95,7 +95,7 @@ static void msic_dcr_write(struct axon_msic *msic, unsigned int dcr_n, u32 val) static void axon_msi_cascade(unsigned int irq, struct irq_desc *desc) { struct irq_chip *chip = irq_desc_get_chip(desc); - struct axon_msic *msic = irq_get_handler_data(irq); + struct axon_msic *msic = irq_desc_get_handler_data(desc); u32 write_offset, msi; int idx; int retry = 0; diff --git a/arch/powerpc/platforms/embedded6xx/hlwd-pic.c b/arch/powerpc/platforms/embedded6xx/hlwd-pic.c index c269caee58f9..9dd154d6f89a 100644 --- a/arch/powerpc/platforms/embedded6xx/hlwd-pic.c +++ b/arch/powerpc/platforms/embedded6xx/hlwd-pic.c @@ -124,7 +124,7 @@ static void hlwd_pic_irq_cascade(unsigned int cascade_virq, struct irq_desc 
*desc) { struct irq_chip *chip = irq_desc_get_chip(desc); - struct irq_domain *irq_domain = irq_get_handler_data(cascade_virq); + struct irq_domain *irq_domain = irq_desc_get_handler_data(desc); unsigned int virq; raw_spin_lock(&desc->lock); diff --git a/arch/powerpc/sysdev/uic.c b/arch/powerpc/sysdev/uic.c index 7c37157d4c24..e763fe215cf5 100644 --- a/arch/powerpc/sysdev/uic.c +++ b/arch/powerpc/sysdev/uic.c @@ -198,7 +198,7 @@ void uic_irq_cascade(unsigned int virq, struct irq_desc *desc) { struct irq_chip *chip = irq_desc_get_chip(desc); struct irq_data *idata = irq_desc_get_irq_data(desc); - struct uic *uic = irq_get_handler_data(virq); + struct uic *uic = irq_desc_get_handler_data(desc); u32 msr; int src; int subvirq; diff --git a/arch/powerpc/sysdev/xics/xics-common.c b/arch/powerpc/sysdev/xics/xics-common.c index 878a54036a25..76be7b00dd80 100644 --- a/arch/powerpc/sysdev/xics/xics-common.c +++ b/arch/powerpc/sysdev/xics/xics-common.c @@ -227,7 +227,7 @@ void xics_migrate_irqs_away(void) /* Locate interrupt server */ server = -1; - ics = irq_get_chip_data(virq); + ics = irq_desc_get_chip_data(desc); if (ics) server = ics->get_server(ics, irq); if (server < 0) { -- 1.7.10.4
Re: [PATCH] rtc/rtc-opal: Disable rtc-alarms when opal doesn't support tpo
On 06/03/2015 10:21 AM, Vaibhav Jain wrote: The rtc-opal driver provides support for rtc alarms via timed-power-on (TPO). However some platforms like BML use a fake rtc clock and don't support TPO. Such platforms are indicated by the missing 'has-tpo' property in the device tree. The current implementation however enables the rtc_class_ops.read/set alarm callbacks irrespective of TPO support on the platform. This results in a failed opal call when the kernel tries to read an existing alarm via opal_get_tpo_time during rtc device registration. This patch fixes this issue by setting the opal_rtc_ops.read/set_alarm callback pointers only when TPO is supported. Signed-off-by: Vaibhav Jain vaib...@linux.vnet.ibm.com Acked-by: Neelesh Gupta neele...@linux.vnet.ibm.com Thanks, Neelesh. --- drivers/rtc/rtc-opal.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/rtc/rtc-opal.c b/drivers/rtc/rtc-opal.c index 7061dca..1125641 100644 --- a/drivers/rtc/rtc-opal.c +++ b/drivers/rtc/rtc-opal.c @@ -190,11 +190,9 @@ exit: return rc; } -static const struct rtc_class_ops opal_rtc_ops = { +static struct rtc_class_ops opal_rtc_ops = { .read_time = opal_get_rtc_time, .set_time = opal_set_rtc_time, - .read_alarm = opal_get_tpo_time, - .set_alarm = opal_set_tpo_time, }; static int opal_rtc_probe(struct platform_device *pdev) @@ -202,8 +200,11 @@ static int opal_rtc_probe(struct platform_device *pdev) struct rtc_device *rtc; if (pdev->dev.of_node && of_get_property(pdev->dev.of_node, "has-tpo", -NULL)) +NULL)) { device_set_wakeup_capable(&pdev->dev, true); + opal_rtc_ops.read_alarm = opal_get_tpo_time; + opal_rtc_ops.set_alarm = opal_set_tpo_time; + } rtc = devm_rtc_device_register(&pdev->dev, DRVNAME, &opal_rtc_ops, THIS_MODULE);
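The fix above follows a common driver pattern: leave optional callbacks in an ops table NULL and install them at probe time only when the platform advertises the feature, so the core layer never calls into firmware that would fail. A minimal hypothetical sketch of that pattern (the names below are illustrative stand-ins, not the rtc-opal symbols):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an ops table with optional alarm callbacks;
 * in the driver above these are rtc_class_ops.read_alarm/set_alarm. */
struct rtc_ops_model {
	int (*read_time)(void);
	int (*read_alarm)(void); /* NULL => core skips alarm support */
};

static int model_read_time(void)  { return 0; }
static int model_read_alarm(void) { return 0; }

/* Models the probe-time decision keyed on the 'has-tpo' DT property:
 * alarm callbacks are installed only when TPO is supported. */
static void rtc_probe_model(struct rtc_ops_model *ops, bool has_tpo)
{
	ops->read_time = model_read_time;
	ops->read_alarm = has_tpo ? model_read_alarm : NULL;
}
```

Note the trade-off the patch accepts: the ops table loses its `const` qualifier so that probe can patch it, which is why the hunk changes `static const struct rtc_class_ops` to `static struct rtc_class_ops`.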