[PATCH] pseries/iommu: Tweak ddw behavior in presence of pmem

2020-03-10 Thread Vaibhav Jain
Recently we discovered an issue on pseries guests that prevents pci
devices from accessing pmem memory via DMA. Performing such an
operation will cause PHB to freeze the corresponding partition
endpoint and in some scenarios will shutdown the disk that hosts the
rootfs.

A fix for this is in works until then this patch proposes to disable
DDW if pmem nodes are present in the device-tree. This would force all
DMA from I/O adapters through default 2-GB window and prevent direct
access of pmem memory which is located beyond 4-TB guest physical
address.

Since this change can have performance penalty for cases where its
known that i/o adapters wont be performing DMA to pmem, the patch
adds new args to the 'disable_ddw' kernel commanline flag with
following possible values:

'default' : Enable DDW only if no Pmem nodes present in device-tree
'Yes' : Disable DDW always
'No'  : Force enable DDW even if Pmem nodes present in device-tree

Signed-off-by: Vaibhav Jain 
---
 .../admin-guide/kernel-parameters.txt | 10 ++-
 arch/powerpc/platforms/pseries/iommu.c| 67 +--
 2 files changed, 70 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index c07815d230bc..58e09f7a2cb9 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -901,9 +901,13 @@
The feature only exists starting from
Arch Perfmon v4 (Skylake and newer).
 
-   disable_ddw [PPC/PSERIES]
-   Disable Dynamic DMA Window support. Use this if
-   to workaround buggy firmware.
+   disable_ddw=[PPC/PSERIES]
+   Controls weather Dynamic DMA Window support is enabled.
+   Use this if to workaround buggy firmware. Following
+   values are supported:
+on  Disable ddw always
+off Enable ddw always
+   default Enable ddw if no Persistent memory present (default)
 
disable_ipv6=   [IPV6]
See Documentation/networking/ipv6.txt.
diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 2e0a8eab5588..97498cc25c9f 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -12,6 +12,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -755,12 +756,39 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
   pci_name(dev));
 }
 
+/*
+ * Following values for the variable are handled
+ * '-1': Force enable ddw even if Persistent memory is present
+ * '0' : Enable ddw if no Persistent memory present (default)
+ * '1' : Disable ddw always
+ */
 static int __read_mostly disable_ddw;
 
-static int __init disable_ddw_setup(char *str)
+static int __init disable_ddw_setup(char *param)
 {
-   disable_ddw = 1;
-   printk(KERN_INFO "ppc iommu: disabling ddw.\n");
+   bool val;
+   int res;
+
+   /* Maintain old behaviour that disables DDW when flag given */
+   if (!param) {
+   disable_ddw = 1;
+   return 0;
+   }
+
+   res = strtobool(param, );
+
+   if (!res) {
+   if (val) {
+   disable_ddw = 1;
+   pr_info("ppc iommu: disabling ddw.\n");
+   } else if (!val) {
+   /* Force enable of DDW even if pmem is available */
+   disable_ddw = -1;
+   pr_info("ppc iommu: will force enable ddw.\n");
+   }
+   } else if (strcmp(param, "default") == 0) {
+   disable_ddw = 0;
+   }
 
return 0;
 }
@@ -1313,6 +1341,37 @@ static struct notifier_block iommu_reconfig_nb = {
.notifier_call = iommu_reconfig_notifier,
 };
 
+/* Check if DDW can be supported for this lpar */
+int ddw_supported(void)
+{
+   struct device_node *dn;
+
+   if (disable_ddw == -1) /* force enable ddw */
+   goto out;
+
+   if (disable_ddw == 1)
+   return 0;
+
+   /*
+* Due to DMA window limitations currently DDW is not supported
+* for persistent memory. This is due 1 TiB size of direct mapped
+* DMA window size limitation enforce by phyp. Since pmem memory
+* will be mapped at phy address > 4TiB, we cannot accmodate pmem
+* in the DDW window and DMA's to/from the pmem memory will result in
+* PHBs getting frozen triggering EEH. Hence for the the time being
+* disable DDW in presence of a 'ibm,pmemory' node.
+*/
+   dn = of_find_compatible_node(NULL, NULL, "ibm,pmemory");
+   if (dn) {
+   pr_info("IOMMU: Disabling DDW as pmem memory available\n");
+   of_node_put(dn);
+   return 0;
+   }
+ out:

[PATCH -next 000/491] treewide: use fallthrough;

2020-03-10 Thread Joe Perches
There is a new fallthrough pseudo-keyword macro that can be used
to replace the various /* fallthrough */ style comments that are
used to indicate a case label code block is intended to fallthrough
to the next case label block.

See commit 294f69e662d1 ("compiler_attributes.h: Add 'fallthrough'
pseudo keyword for switch/case use")

These patches are intended to allow clang to detect missing
switch/case fallthrough uses.

Do a depth-first pass on the MAINTAINERS file and find the various
F: pattern files and convert the fallthrough comments to fallthrough;
for all files matched by all  F: patterns in in each section.

Done via the perl script below and the previously posted
cvt_fallthrough.pl script.

Link: 
https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe.com/

These patches are based on next-20200310 and are available in

git://repo.or.cz/linux-2.6/trivial-mods.git in branch 20200310_fallthrough_2

$ cat commit_fallthrough.pl
#!/usr/bin/env perl

use sort 'stable';

#
# Reorder a sorted array so file entries are before directory entries
# depends on a trailing / for directories
# so:
#   foo/
#   foo/bar.c
# becomes
#   foo/bar.c
#   foo/
#
sub file_before_directory {
my ($array_ref) = (@_);

my $count = scalar(@$array_ref);

for (my $i = 1; $i < $count; $i++) {
if (substr(@$array_ref[$i - 1], -1) eq '/' &&
substr(@$array_ref[$i], 0, length(@$array_ref[$i - 1])) eq 
@$array_ref[$i - 1]) {
my $string = @$array_ref[$i - 1];
@$array_ref[$i - 1] = @$array_ref[$i];
@$array_ref[$i] = $string;
}
}
}

sub uniq {
my (@parms) = @_;

my %saw;
@parms = grep(!$saw{$_}++, @parms);

return @parms;
}

# Get all the F: file patterns in MAINTAINERS that could be a .[ch] file
my $maintainer_patterns = `grep -P '^F:\\s+' MAINTAINERS`;
my @patterns = split('\n', $maintainer_patterns);
s/^F:\s*// for @patterns;
@patterns = grep(!/^(?:Documentation|tools|scripts)\//, @patterns);
@patterns = grep(!/\.(?:dtsi?|rst|config)$/, @patterns);
@patterns = sort @patterns;
@patterns = sort { $b =~ tr/\//\// cmp $a =~ tr/\//\// } @patterns;
file_before_directory(\@patterns);

my %sections_done;

foreach my $pattern (@patterns) {

# Find the files the pattern matches
my $pattern_files = `git ls-files -- $pattern`;
my @new_patterns = split('\n', $pattern_files);
$pattern_files = join(' ', @new_patterns);
next if ($pattern_files =~ /^\s*$/);

# Find the section the first file matches
my $pattern_file = @new_patterns[0];
my $section_output = `./scripts/get_maintainer.pl --nogit --nogit-fallback 
--sections --pattern-depth=1 $pattern_file`;
my @section = split('\n', $section_output);
my $section_header = @section[0];

print("Section: <$section_header>\n");

# Skip the section if it's already done
print("Already done '$section_header'\n") if 
($sections_done{$section_header});
next if ($sections_done{$section_header}++);

# Find all the .[ch] files in all F: lines in that section
my @new_section;
foreach my $line (@section) {
last if ($line =~ /^\s*$/);
push(@new_section, $line);
}
@section = grep(/^F:/, @new_section);
s/^F:\s*// for @section;

@section = grep(!/^(?:Documentation|tools|scripts)\//, @section);
@section = grep(!/\.(?:dtsi?|rst|config)$/, @section);
@section = sort @section;
@section = uniq(@section);

my $section_files = join(' ', @section);

print("section_files: <$section_files>\n");

next if ($section_files =~ /^\s*$/);

my $cvt_files = `git ls-files -- $section_files`;
my @files = split('\n', $cvt_files);

@files = grep(!/^(?:Documentation|tools|scripts)\//, @files);
@files = grep(!/\.(?:dtsi?|rst|config)$/, @files);
@files = grep(/\.[ch]$/, @files);
@files = sort @files;
@files = uniq(@files);

$cvt_files = join(' ', @files);
print("files: <$cvt_files>\n");

next if (scalar(@files) < 1);

# Convert fallthroughs for all [.ch] files in the section
print("doing cvt_fallthrough.pl -- $cvt_files\n");

`cvt_fallthrough.pl -- $cvt_files`;

# If nothing changed, nothing to commit
`git diff-index --quiet HEAD --`;
next if (!$?);

# Commit the changes
my $fh;

open($fh, "+>", "cvt_fallthrough.commit_msg") or die "$0: can't create 
temporary file: $!\n";
print $fh <https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git@perches.com/
EOF
;
close $fh;

`git commit -s -a -F cvt_fallthrough.commit_msg`;
}

Joe Perches (491):
  MELLANOX ETHERNET INNOVA DRIVERS: Use fallthrough;
  MARVELL OCTEONTX2 RVU ADMIN FUNCTION DRIVER: Use fallthrough;
  MELLANOX MLX5 core VPI driver: Use fallthrough;
  PERFORMANCE EVENTS SUBSYSTEM: Use fallthrough;
  ARM/UNI

[PATCH -next 017/491] CELL BROADBAND ENGINE ARCHITECTURE: Use fallthrough;

2020-03-10 Thread Joe Perches
Convert the various uses of fallthrough comments to fallthrough;

Done via script
Link: 
https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe.com/

Signed-off-by: Joe Perches 
---
 arch/powerpc/platforms/cell/spufs/switch.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/cell/spufs/switch.c 
b/arch/powerpc/platforms/cell/spufs/switch.c
index 5c3f5d0..d56b4e3 100644
--- a/arch/powerpc/platforms/cell/spufs/switch.c
+++ b/arch/powerpc/platforms/cell/spufs/switch.c
@@ -177,7 +177,7 @@ static inline void save_mfc_cntl(struct spu_state *csa, 
struct spu *spu)
POLL_WHILE_FALSE((in_be64(>mfc_control_RW) &
  MFC_CNTL_SUSPEND_DMA_STATUS_MASK) ==
 MFC_CNTL_SUSPEND_COMPLETE);
-   /* fall through */
+   fallthrough;
case MFC_CNTL_SUSPEND_COMPLETE:
if (csa)
csa->priv2.mfc_control_RW =
-- 
2.24.0



[PATCH -next 016/491] KERNEL VIRTUAL MACHINE FOR POWERPC (KVM/powerpc): Use fallthrough;

2020-03-10 Thread Joe Perches
Convert the various uses of fallthrough comments to fallthrough;

Done via script
Link: 
https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe.com/

Signed-off-by: Joe Perches 
---
 arch/powerpc/kvm/book3s_32_mmu.c | 2 +-
 arch/powerpc/kvm/book3s_64_mmu.c | 2 +-
 arch/powerpc/kvm/book3s_pr.c | 2 +-
 arch/powerpc/kvm/booke.c | 6 +++---
 arch/powerpc/kvm/powerpc.c   | 1 -
 5 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_32_mmu.c b/arch/powerpc/kvm/book3s_32_mmu.c
index f21e734..3fbd57 100644
--- a/arch/powerpc/kvm/book3s_32_mmu.c
+++ b/arch/powerpc/kvm/book3s_32_mmu.c
@@ -234,7 +234,7 @@ static int kvmppc_mmu_book3s_32_xlate_pte(struct kvm_vcpu 
*vcpu, gva_t eaddr,
case 2:
case 6:
pte->may_write = true;
-   /* fall through */
+   fallthrough;
case 3:
case 5:
case 7:
diff --git a/arch/powerpc/kvm/book3s_64_mmu.c b/arch/powerpc/kvm/book3s_64_mmu.c
index 5991332..26b8b2 100644
--- a/arch/powerpc/kvm/book3s_64_mmu.c
+++ b/arch/powerpc/kvm/book3s_64_mmu.c
@@ -311,7 +311,7 @@ static int kvmppc_mmu_book3s_64_xlate(struct kvm_vcpu 
*vcpu, gva_t eaddr,
case 2:
case 6:
gpte->may_write = true;
-   /* fall through */
+   fallthrough;
case 3:
case 5:
case 7:
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 729a0f..7db3695 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -740,7 +740,7 @@ int kvmppc_handle_pagefault(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
(vcpu->arch.hflags & BOOK3S_HFLAG_SPLIT_HACK) &&
((pte.raddr & SPLIT_HACK_MASK) == SPLIT_HACK_OFFS))
pte.raddr &= ~SPLIT_HACK_MASK;
-   /* fall through */
+   fallthrough;
case MSR_IR:
vcpu->arch.mmu.esid_to_vsid(vcpu, eaddr >> SID_SHIFT, );
 
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 7b27604..be47815 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -421,11 +421,11 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu 
*vcpu,
case BOOKE_IRQPRIO_DATA_STORAGE:
case BOOKE_IRQPRIO_ALIGNMENT:
update_dear = true;
-   /* fall through */
+   fallthrough;
case BOOKE_IRQPRIO_INST_STORAGE:
case BOOKE_IRQPRIO_PROGRAM:
update_esr = true;
-   /* fall through */
+   fallthrough;
case BOOKE_IRQPRIO_ITLB_MISS:
case BOOKE_IRQPRIO_SYSCALL:
case BOOKE_IRQPRIO_FP_UNAVAIL:
@@ -459,7 +459,7 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu 
*vcpu,
case BOOKE_IRQPRIO_DECREMENTER:
case BOOKE_IRQPRIO_FIT:
keep_irq = true;
-   /* fall through */
+   fallthrough;
case BOOKE_IRQPRIO_EXTERNAL:
case BOOKE_IRQPRIO_DBELL:
allowed = vcpu->arch.shared->msr & MSR_EE;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 1af96fb5..c242a79 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -525,7 +525,6 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
r = 1;
break;
case KVM_CAP_PPC_GUEST_DEBUG_SSTEP:
-   /* fall through */
case KVM_CAP_PPC_PAIRED_SINGLES:
case KVM_CAP_PPC_OSI:
case KVM_CAP_PPC_GET_PVINFO:
-- 
2.24.0



Re: [PATCH v3 21/27] powerpc/powernv/pmem: Add an IOCTL to request controller health & perf data

2020-03-10 Thread Alastair D'Silva
On Wed, 2020-03-04 at 12:06 +0100, Frederic Barrat wrote:
> 
> Le 28/02/2020 à 07:12, Andrew Donnellan a écrit :
> > On 21/2/20 2:27 pm, Alastair D'Silva wrote:
> > > From: Alastair D'Silva 
> > > 
> > > When health & performance data is requested from the controller,
> > > it responds with an error log containing the requested
> > > information.
> > > 
> > > This patch allows the request to me issued via an IOCTL.
> > 
> > A better explanation would be good - this IOCTL triggers a request
> > to 
> > the controller to collect controller health/perf data, and the 
> > controller will later respond with an error log that can be picked
> > up 
> > via the error log IOCTL that you've defined earlier.
> 
> And even more precisely (to also check my understanding):
> 
>  > this IOCTL triggers a request to
>  > the controller to collect controller health/perf data, and the
>  > controller will later respond
> 
> by raising an interrupt to let the user app know that
> 
>  > an error log that can be picked up
>  > via the error log IOCTL that you've defined earlier.
> 
> 
> The rest of the patch looks ok to me.
> 
>Fred

Ok

-- 
Alastair D'Silva
Open Source Developer
Linux Technology Centre, IBM Australia
mob: 0423 762 819



Re: [PATCH v3 20/27] powerpc/powernv/pmem: Forward events to userspace

2020-03-10 Thread Alastair D'Silva
On Wed, 2020-03-04 at 12:00 +0100, Frederic Barrat wrote:
> 
> Le 21/02/2020 à 04:27, Alastair D'Silva a écrit :
> > From: Alastair D'Silva 
> > 
> > Some of the interrupts that the card generates are better handled
> > by the userspace daemon, in particular:
> > Controller Hardware/Firmware Fatal
> > Controller Dump Available
> > Error Log available
> > 
> > This patch allows a userspace application to register an eventfd
> > with
> > the driver via SCM_IOCTL_EVENTFD to receive notifications of these
> > interrupts.
> > 
> > Userspace can then identify what events have occurred by calling
> > SCM_IOCTL_EVENT_CHECK and checking against the SCM_IOCTL_EVENT_FOO
> > masks.
> > 
> > Signed-off-by: Alastair D'Silva 
> > ---
> >   arch/powerpc/platforms/powernv/pmem/ocxl.c| 216
> > ++
> >   .../platforms/powernv/pmem/ocxl_internal.h|   5 +
> >   include/uapi/nvdimm/ocxl-pmem.h   |  16 ++
> >   3 files changed, 237 insertions(+)
> > 
> > diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c
> > b/arch/powerpc/platforms/powernv/pmem/ocxl.c
> > index 009d4fd29e7d..e46696d3cc36 100644
> > --- a/arch/powerpc/platforms/powernv/pmem/ocxl.c
> > +++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
> > @@ -10,6 +10,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   #include 
> >   #include 
> >   #include 
> > @@ -335,11 +336,22 @@ static void free_ocxlpmem(struct ocxlpmem
> > *ocxlpmem)
> >   {
> > int rc;
> >   
> > +   // Disable doorbells
> > +   (void)ocxl_global_mmio_set64(ocxlpmem->ocxl_afu,
> > GLOBAL_MMIO_CHIEC,
> > +OCXL_LITTLE_ENDIAN,
> > +GLOBAL_MMIO_CHI_ALL);
> > +
> > if (ocxlpmem->nvdimm_bus)
> > nvdimm_bus_unregister(ocxlpmem->nvdimm_bus);
> >   
> > free_minor(ocxlpmem);
> >   
> > +   if (ocxlpmem->irq_addr[1])
> > +   iounmap(ocxlpmem->irq_addr[1]);
> > +
> > +   if (ocxlpmem->irq_addr[0])
> > +   iounmap(ocxlpmem->irq_addr[0]);
> > +
> > if (ocxlpmem->cdev.owner)
> > cdev_del(>cdev);
> >   
> > @@ -443,6 +455,11 @@ static int file_release(struct inode *inode,
> > struct file *file)
> >   {
> > struct ocxlpmem *ocxlpmem = file->private_data;
> >   
> > +   if (ocxlpmem->ev_ctx) {
> > +   eventfd_ctx_put(ocxlpmem->ev_ctx);
> > +   ocxlpmem->ev_ctx = NULL;
> > +   }
> > +
> > ocxlpmem_put(ocxlpmem);
> > return 0;
> >   }
> > @@ -938,6 +955,51 @@ static int ioctl_controller_stats(struct
> > ocxlpmem *ocxlpmem,
> > return rc;
> >   }
> >   
> > +static int ioctl_eventfd(struct ocxlpmem *ocxlpmem,
> > +struct ioctl_ocxl_pmem_eventfd __user *uarg)
> > +{
> > +   struct ioctl_ocxl_pmem_eventfd args;
> > +
> > +   if (copy_from_user(, uarg, sizeof(args)))
> > +   return -EFAULT;
> > +
> > +   if (ocxlpmem->ev_ctx)
> > +   return -EINVAL;
> 
> EBUSY?
> 
Ok

> 
> > +
> > +   ocxlpmem->ev_ctx = eventfd_ctx_fdget(args.eventfd);
> > +   if (!ocxlpmem->ev_ctx)
> > +   return -EFAULT;
> 
> Why not use what eventfd_ctx_fdget() returned? (through some
> IS_ERR() 
> and PTR_ERR() convolution)
> 

Ok
> 
> > +
> > +   return 0;
> > +}
> > +
> > +static int ioctl_event_check(struct ocxlpmem *ocxlpmem, u64 __user
> > *uarg)
> > +{
> > +   u64 val = 0;
> > +   int rc;
> > +   u64 chi = 0;
> > +
> > +   rc = ocxlpmem_chi(ocxlpmem, );
> > +   if (rc < 0)
> > +   return rc;
> > +
> > +   if (chi & GLOBAL_MMIO_CHI_ELA)
> > +   val |= IOCTL_OCXL_PMEM_EVENT_ERROR_LOG_AVAILABLE;
> > +
> > +   if (chi & GLOBAL_MMIO_CHI_CDA)
> > +   val |= IOCTL_OCXL_PMEM_EVENT_CONTROLLER_DUMP_AVAILABLE;
> > +
> > +   if (chi & GLOBAL_MMIO_CHI_CFFS)
> > +   val |= IOCTL_OCXL_PMEM_EVENT_FIRMWARE_FATAL;
> > +
> > +   if (chi & GLOBAL_MMIO_CHI_CHFS)
> > +   val |= IOCTL_OCXL_PMEM_EVENT_HARDWARE_FATAL;
> > +
> > +   rc = copy_to_user((u64 __user *) uarg, , sizeof(val));
> > +
> 
> copy_to_user doesn't return an errno. Should be:
> 
> if (copy_to_user((u64 __user *) uarg, , sizeof(val)))
>   return -EFAULT;
> 
Ok

> 
> > +   return rc;
> > +}
> > +
> >   static long file_ioctl(struct file *file, unsigned int cmd,
> > unsigned long args)
> >   {
> > struct ocxlpmem *ocxlpmem = file->private_data;
> > @@ -966,6 +1028,15 @@ static long file_ioctl(struct file *file,
> > unsigned int cmd, unsigned long args)
> > rc = ioctl_controller_stats(ocxlpmem,
> > (struct
> > ioctl_ocxl_pmem_controller_stats __user *)args);
> > break;
> > +
> > +   case IOCTL_OCXL_PMEM_EVENTFD:
> > +   rc = ioctl_eventfd(ocxlpmem,
> > +  (struct ioctl_ocxl_pmem_eventfd
> > __user *)args);
> > +   break;
> > +
> > +   case IOCTL_OCXL_PMEM_EVENT_CHECK:
> > +   rc = ioctl_event_check(ocxlpmem, (u64 __user *)args);
> > +   break;
> > }
> >   
> > return rc;
> > @@ -1107,6 

Re: [PATCH] KVM: PPC: Book3S HV: Fix H_CEDE return code for nested guests

2020-03-10 Thread David Gibson
On Tue, Mar 10, 2020 at 04:11:28PM -0500, Michael Roth wrote:
> The h_cede_tm kvm-unit-test currently fails when run inside an L1 guest
> via the guest/nested hypervisor.
> 
>   ./run-tests.sh -v
>   ...
>   TESTNAME=h_cede_tm TIMEOUT=90s ACCEL= ./powerpc/run powerpc/tm.elf -smp 
> 2,threads=2 -machine cap-htm=on -append "h_cede_tm"
>   FAIL h_cede_tm (2 tests, 1 unexpected failures)
> 
> While the test relates to transactional memory instructions, the actual
> failure is due to the return code of the H_CEDE hypercall, which is
> reported as 224 instead of 0. This happens even when no TM instructions
> are issued.
> 
> 224 is the value placed in r3 to execute a hypercall for H_CEDE, and r3
> is where the caller expects the return code to be placed upon return.
> 
> In the case of guest running under a nested hypervisor, issuing H_CEDE
> causes a return from H_ENTER_NESTED. In this case H_CEDE is
> specially-handled immediately rather than later in
> kvmppc_pseries_do_hcall() as with most other hcalls, but we forget to
> set the return code for the caller, hence why kvm-unit-test sees the
> 224 return code and reports an error.
> 
> Guest kernels generally don't check the return value of H_CEDE, so
> that likely explains why this hasn't caused issues outside of
> kvm-unit-tests so far.
> 
> Fix this by setting r3 to 0 after we finish processing the H_CEDE.
> 
> RHBZ: 1778556
> 
> Fixes: 4bad77799fed ("KVM: PPC: Book3S HV: Handle hypercalls correctly when 
> nested")
> Cc: linuxppc-...@ozlabs.org
> Cc: David Gibson 
> Cc: Paul Mackerras 
> Signed-off-by: Michael Roth 

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/kvm/book3s_hv.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 2cefd071b848..c0c43a733830 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -3616,6 +3616,7 @@ int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 
> time_limit,
>   if (trap == BOOK3S_INTERRUPT_SYSCALL && !vcpu->arch.nested &&
>   kvmppc_get_gpr(vcpu, 3) == H_CEDE) {
>   kvmppc_nested_cede(vcpu);
> + kvmppc_set_gpr(vcpu, 3, 0);
>   trap = 0;
>   }
>   } else {

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH v3] powerpc: setup_64: set up PACA earlier to avoid kcov problems

2020-03-10 Thread Daniel Axtens
>>> So:
>>>  - change the test when setting up a PACA to consider the actual value of
>>>the MSR rather than the CPU feature.
>>> 
>>>  - move the PACA setup to before the cpu feature parsing.
>>
>> Hmm. Problem is that equally we want PACA to be sane before we call too
>> far into the rest of the kernel ("generic dt parsing code").
>
> But currently we call into that code with no paca at all. Or rather,
> with r13 pointing somewhere random that will be interpreted as being a
> paca.
>
> This took a while for Daniel to debug because depending on how you boot
> r13 contains a different junk value. That junk value may not point to
> memory at all, or if it does the memory it points to may or may not send
> you down the wrong path, depending on which exact bit you're looking at
> in some random location.
>
> So this is really not about kcov from my POV, that's just how we
> discovered it.

Ah, yes. I agree with mpe, and reading back over my commit message I
think I did a pretty poor job of explaining it. How about this for a
commit message:

---

powerpc: setup_64: set up PACA before parsing device tree

Currently, we set up the PACA after parsing the device tree for CPU
features. Before that, r13 contains random data, which means there is
random data in r13 while we're running the generic dt parsing code.

This random data varies depending on whether we boot through a vmlinux or a
zImage: for the vmlinux case it's usually around zero, but for zImages we
see random values like 912a72603d420015.

This is poor practice, and can also lead to difficult-to-debug crashes. For
example, when kcov is enabled, the kcov instrumentation attempts to read
preempt_count out of the current task, which goes via the PACA. This then
crashes in the zImage case.

To resolve this:

 - move the PACA setup to before the cpu feature parsing.

 - because we no longer have access to cpu feature flags in PACA setup,
   change the HV feature test in the PACA setup path to consider the actual
   value of the MSR rather than the CPU feature.

Translations get switched on once we leave early_setup, so I think we'd
already catch any other cases where the PACA or task aren't set up.

Boot tested on a P9 guest and host.

Fixes: fb0b0a73b223 ("powerpc: Enable kcov")
Cc: Andrew Donnellan 
Cc: sta...@vger.kernel.org
Reviewed-by: Andrew Donnellan 
Suggested-by: Michael Ellerman 
Signed-off-by: Daniel Axtens 

---

Regards,
Daniel

>
> cheers


Re: [PATCH net v2] ibmvnic: Do not process device remove during device reset

2020-03-10 Thread David Miller
From: Juliet Kim 
Date: Tue, 10 Mar 2020 09:23:58 -0500

> The ibmvnic driver does not check the device state when the device
> is removed. If the device is removed while a device reset is being
> processed, the remove may free structures needed by the reset,
> causing an oops.
> 
> Fix this by checking the device state before processing device remove.
> 
> Signed-off-by: Juliet Kim 

Applied, thank you.


[PATCH] KVM: PPC: Book3S HV: Fix H_CEDE return code for nested guests

2020-03-10 Thread Michael Roth
The h_cede_tm kvm-unit-test currently fails when run inside an L1 guest
via the guest/nested hypervisor.

  ./run-tests.sh -v
  ...
  TESTNAME=h_cede_tm TIMEOUT=90s ACCEL= ./powerpc/run powerpc/tm.elf -smp 
2,threads=2 -machine cap-htm=on -append "h_cede_tm"
  FAIL h_cede_tm (2 tests, 1 unexpected failures)

While the test relates to transactional memory instructions, the actual
failure is due to the return code of the H_CEDE hypercall, which is
reported as 224 instead of 0. This happens even when no TM instructions
are issued.

224 is the value placed in r3 to execute a hypercall for H_CEDE, and r3
is where the caller expects the return code to be placed upon return.

In the case of guest running under a nested hypervisor, issuing H_CEDE
causes a return from H_ENTER_NESTED. In this case H_CEDE is
specially-handled immediately rather than later in
kvmppc_pseries_do_hcall() as with most other hcalls, but we forget to
set the return code for the caller, hence why kvm-unit-test sees the
224 return code and reports an error.

Guest kernels generally don't check the return value of H_CEDE, so
that likely explains why this hasn't caused issues outside of
kvm-unit-tests so far.

Fix this by setting r3 to 0 after we finish processing the H_CEDE.

RHBZ: 1778556

Fixes: 4bad77799fed ("KVM: PPC: Book3S HV: Handle hypercalls correctly when 
nested")
Cc: linuxppc-...@ozlabs.org
Cc: David Gibson 
Cc: Paul Mackerras 
Signed-off-by: Michael Roth 
---
 arch/powerpc/kvm/book3s_hv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2cefd071b848..c0c43a733830 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3616,6 +3616,7 @@ int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 
time_limit,
if (trap == BOOK3S_INTERRUPT_SYSCALL && !vcpu->arch.nested &&
kvmppc_get_gpr(vcpu, 3) == H_CEDE) {
kvmppc_nested_cede(vcpu);
+   kvmppc_set_gpr(vcpu, 3, 0);
trap = 0;
}
} else {
-- 
2.17.1



[PATCH 2/2] powerpc/fadump: consider reserved ranges while reserving memory

2020-03-10 Thread Hari Bathini
Commit 0962e8004e97 ("powerpc/prom: Scan reserved-ranges node for
memory reservations") enabled support to parse reserved-ranges DT
node and reserve kernel memory falling in these ranges for F/W
purposes. Memory reserved for FADump should not overlap with these
ranges as it could corrupt memory meant for F/W or crash'ed kernel
memory to be exported as vmcore.

But since commit 579ca1a27675 ("powerpc/fadump: make use of memblock's
bottom up allocation mode"), memblock_find_in_range() is being used to
find the appropriate area to reserve memory for FADump, which can't
account for reserved-ranges as these ranges are reserved only after
FADump memory reservation.

With reserved-ranges now being populated during early boot, look out
for these memory ranges while reserving memory for FADump. Without
this change, MPIPL on PowerNV systems aborts with hostboot failure,
when memory reserved for FADump is less than 4096MB.

Fixes: 579ca1a27675 ("powerpc/fadump: make use of memblock's bottom up 
allocation mode")
Cc: sta...@vger.kernel.org # v5.4+
Signed-off-by: Hari Bathini 
---
 arch/powerpc/kernel/fadump.c |   76 --
 1 file changed, 66 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 7fcf4a8f..ab83be9 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -443,10 +443,70 @@ static int __init fadump_get_boot_mem_regions(void)
return ret;
 }
 
+/*
+ * Returns true, if the given range overlaps with reserved memory ranges
+ * starting at idx. Also, updates idx to index of overlapping memory range
+ * with the given memory range.
+ * False, otherwise.
+ */
+static bool overlaps_reserved_ranges(u64 base, u64 end, int *idx)
+{
+   bool ret = false;
+   int i;
+
+   for (i = *idx; i < reserved_mrange_info.mem_range_cnt; i++) {
+   u64 rbase = reserved_mrange_info.mem_ranges[i].base;
+   u64 rend = rbase + reserved_mrange_info.mem_ranges[i].size;
+
+   if (end <= rbase)
+   break;
+
+   if ((end > rbase) &&  (base < rend)) {
+   *idx = i;
+   ret = true;
+   break;
+   }
+   }
+
+   return ret;
+}
+
+/*
+ * Locate a suitable memory area to reserve memory for FADump. While at it,
+ * lookup reserved-ranges & avoid overlap with them, as they are used by F/W.
+ */
+static u64 __init fadump_locate_reserve_mem(u64 base, u64 size)
+{
+   struct fadump_memory_range *mrngs;
+   phys_addr_t mstart, mend;
+   int idx = 0;
+   u64 i;
+
+   mrngs = reserved_mrange_info.mem_ranges;
+   for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
+   , , NULL) {
+   pr_debug("%llu) mstart: %llx, mend: %llx, base: %llx\n",
+i, mstart, mend, base);
+
+   if (mstart > base)
+   base = PAGE_ALIGN(mstart);
+
+   while ((mend > base) && ((mend - base) >= size)) {
+   if (!overlaps_reserved_ranges(base, base + size, ))
+   goto out;
+
+   base = mrngs[idx].base + mrngs[idx].size;
+   base = PAGE_ALIGN(base);
+   }
+   }
+
+out:
+   return base;
+}
+
 int __init fadump_reserve_mem(void)
 {
-   u64 base, size, mem_boundary, bootmem_min, align = PAGE_SIZE;
-   bool is_memblock_bottom_up = memblock_bottom_up();
+   u64 base, size, mem_boundary, bootmem_min;
int ret = 1;
 
if (!fw_dump.fadump_enabled)
@@ -467,9 +527,9 @@ int __init fadump_reserve_mem(void)
PAGE_ALIGN(fadump_calculate_reserve_size());
 #ifdef CONFIG_CMA
if (!fw_dump.nocma) {
-   align = FADUMP_CMA_ALIGNMENT;
fw_dump.boot_memory_size =
-   ALIGN(fw_dump.boot_memory_size, align);
+   ALIGN(fw_dump.boot_memory_size,
+ FADUMP_CMA_ALIGNMENT);
}
 #endif
 
@@ -537,13 +597,9 @@ int __init fadump_reserve_mem(void)
 * Reserve memory at an offset closer to bottom of the RAM to
 * minimize the impact of memory hot-remove operation.
 */
-   memblock_set_bottom_up(true);
-   base = memblock_find_in_range(base, mem_boundary, size, align);
-
-   /* Restore the previous allocation mode */
-   memblock_set_bottom_up(is_memblock_bottom_up);
+   base = fadump_locate_reserve_mem(base, size);
 
-   if (!base) {
+   if (base > (mem_boundary - size)) {
pr_err("Failed to find memory chunk for 
reservation!\n");
goto error_out;
}



[PATCH 1/2] powerpc/fadump: use static allocation for reserved memory ranges

2020-03-10 Thread Hari Bathini
At times, memory ranges have to be looked up during early boot, when
kernel couldn't be initialized for dynamic memory allocation. In fact,
reserved-ranges look up is needed during FADump memory reservation.
Without accounting for reserved-ranges in reserving memory for FADump,
MPIPL boot fails with memory corruption issues. So, extend memory
ranges handling to support static allocation and populate reserved
memory ranges during early boot.

Fixes: dda9dbfeeb7a ("powerpc/fadump: consider reserved ranges while releasing 
memory")
Cc: sta...@vger.kernel.org # v5.4+
Signed-off-by: Hari Bathini 
---
 arch/powerpc/include/asm/fadump-internal.h |4 +
 arch/powerpc/kernel/fadump.c   |   77 
 2 files changed, 48 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump-internal.h 
b/arch/powerpc/include/asm/fadump-internal.h
index c814a2b..8d61c8f 100644
--- a/arch/powerpc/include/asm/fadump-internal.h
+++ b/arch/powerpc/include/asm/fadump-internal.h
@@ -64,12 +64,14 @@ struct fadump_memory_range {
 };
 
 /* fadump memory ranges info */
+#define RNG_NAME_SZ16
 struct fadump_mrange_info {
-   charname[16];
+   charname[RNG_NAME_SZ];
struct fadump_memory_range  *mem_ranges;
u32 mem_ranges_sz;
u32 mem_range_cnt;
u32 max_mem_ranges;
+   boolis_static;
 };
 
 /* Platform specific callback functions */
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index ff0114a..7fcf4a8f 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -38,8 +38,17 @@ static void __init fadump_reserve_crash_area(u64 base);
 
 #ifndef CONFIG_PRESERVE_FA_DUMP
 static DEFINE_MUTEX(fadump_mutex);
-struct fadump_mrange_info crash_mrange_info = { "crash", NULL, 0, 0, 0 };
-struct fadump_mrange_info reserved_mrange_info = { "reserved", NULL, 0, 0, 0 };
+struct fadump_mrange_info crash_mrange_info = { "crash", NULL, 0, 0, 0, false 
};
+
+#define RESERVED_RNGS_SZ   16384 /* 16K - 128 entries */
+#define RESERVED_RNGS_CNT  (RESERVED_RNGS_SZ / \
+sizeof(struct fadump_memory_range))
+static struct fadump_memory_range rngs[RESERVED_RNGS_CNT];
+struct fadump_mrange_info reserved_mrange_info = { "reserved", rngs,
+  RESERVED_RNGS_SZ, 0,
+  RESERVED_RNGS_CNT, true };
+
+static void __init early_init_dt_scan_reserved_ranges(unsigned long node);
 
 #ifdef CONFIG_CMA
 static struct cma *fadump_cma;
@@ -108,6 +117,11 @@ static int __init fadump_cma_init(void) { return 1; }
 int __init early_init_dt_scan_fw_dump(unsigned long node, const char *uname,
  int depth, void *data)
 {
+   if (depth == 0) {
+   early_init_dt_scan_reserved_ranges(node);
+   return 0;
+   }
+
if (depth != 1)
return 0;
 
@@ -726,10 +740,14 @@ void fadump_free_cpu_notes_buf(void)
 
 static void fadump_free_mem_ranges(struct fadump_mrange_info *mrange_info)
 {
+   if (mrange_info->is_static) {
+   mrange_info->mem_range_cnt = 0;
+   return;
+   }
+
kfree(mrange_info->mem_ranges);
-   mrange_info->mem_ranges = NULL;
-   mrange_info->mem_ranges_sz = 0;
-   mrange_info->max_mem_ranges = 0;
+   memset((void *)((u64)mrange_info + RNG_NAME_SZ), 0,
+  (sizeof(struct fadump_mrange_info) - RNG_NAME_SZ));
 }
 
 /*
@@ -786,6 +804,12 @@ static inline int fadump_add_mem_range(struct 
fadump_mrange_info *mrange_info,
if (mrange_info->mem_range_cnt == mrange_info->max_mem_ranges) {
int ret;
 
+   if (mrange_info->is_static) {
+   pr_err("Reached array size limit for %s memory 
ranges\n",
+  mrange_info->name);
+   return -ENOSPC;
+   }
+
ret = fadump_alloc_mem_ranges(mrange_info);
if (ret)
return ret;
@@ -1202,20 +1226,19 @@ static void sort_and_merge_mem_ranges(struct 
fadump_mrange_info *mrange_info)
  * Scan reserved-ranges to consider them while reserving/releasing
  * memory for FADump.
  */
-static inline int fadump_scan_reserved_mem_ranges(void)
+static void __init early_init_dt_scan_reserved_ranges(unsigned long node)
 {
-   struct device_node *root;
const __be32 *prop;
int len, ret = -1;
unsigned long i;
 
-   root = of_find_node_by_path("/");
-   if (!root)
-   return ret;
+   /* reserved-ranges already scanned */
+   if (reserved_mrange_info.mem_range_cnt != 0)
+   

Re: [PATCH v4 6/8] perf/tools: Enhance JSON/metric infrastructure to handle "?"

2020-03-10 Thread Arnaldo Carvalho de Melo
Em Mon, Mar 09, 2020 at 11:55:50AM +0530, Kajol Jain escreveu:
> Patch enhances current metric infrastructure to handle "?" in the metric
> expression. The "?" can be use for parameters whose value not known while
> creating metric events and which can be replace later at runtime to
> the proper value. It also add flexibility to create multiple events out
> of single metric event added in json file.
> 
> Patch adds function 'arch_get_runtimeparam' which is a arch specific
> function, returns the count of metric events need to be created.
> By default it return 1.
> 
> One loop is added in function 'metricgroup__add_metric', which create
> multiple events at run time depend on return value of
> 'arch_get_runtimeparam' and merge that event in 'group_list'.
> 
> This infrastructure needed for hv_24x7 socket/chip level events.
> "hv_24x7" chip level events needs specific chip-id to which the
> data is requested. Function 'arch_get_runtimeparam' implemented
> in header.c which extract number of sockets from sysfs file
> "sockets" under "/sys/devices/hv_24x7/interface/".
> 
> Signed-off-by: Kajol Jain 
> ---
>  tools/perf/arch/powerpc/util/header.c |  22 +
>  tools/perf/util/expr.h|   1 +
>  tools/perf/util/expr.l|  19 +++-
>  tools/perf/util/metricgroup.c | 124 --
>  tools/perf/util/metricgroup.h |   1 +
>  tools/perf/util/stat-shadow.c |   8 ++
>  6 files changed, 148 insertions(+), 27 deletions(-)
> 
> diff --git a/tools/perf/arch/powerpc/util/header.c 
> b/tools/perf/arch/powerpc/util/header.c
> index 3b4cdfc5efd6..036f6b2ce202 100644
> --- a/tools/perf/arch/powerpc/util/header.c
> +++ b/tools/perf/arch/powerpc/util/header.c
> @@ -7,6 +7,11 @@
>  #include 
>  #include 
>  #include "header.h"
> +#include "metricgroup.h"
> +#include "evlist.h"
> +#include 
> +#include "pmu.h"
> +#include 
>  
>  #define mfspr(rn)   ({unsigned long rval; \
>asm volatile("mfspr %0," __stringify(rn) \
> @@ -16,6 +21,8 @@
>  #define PVR_VER(pvr)(((pvr) >>  16) & 0x) /* Version field */
>  #define PVR_REV(pvr)(((pvr) >>   0) & 0x) /* Revison field */
>  
> +#define SOCKETS_INFO_FILE_PATH "/devices/hv_24x7/interface/"
> +
>  int
>  get_cpuid(char *buffer, size_t sz)
>  {
> @@ -44,3 +51,18 @@ get_cpuid_str(struct perf_pmu *pmu __maybe_unused)
>  
>   return bufp;
>  }
> +
> +int arch_get_runtimeparam(void)
> +{
> + int count;
> + char path[PATH_MAX];
> + char filename[] = "sockets";
> +
> + snprintf(path, PATH_MAX,
> +  SOCKETS_INFO_FILE_PATH "%s", filename);
> +
> + if (sysfs__read_ull(path, (unsigned long long *)) < 0)
> + return 1;
> + else
> + return count;

Why this cast dance? We have sysfs__read_int(path, ).

Also this is more compact:

return sysfs__read_int(path, ) < 0 ? 1 : count;

> +}
> diff --git a/tools/perf/util/expr.h b/tools/perf/util/expr.h
> index 9377538f4097..d17664e628db 100644
> --- a/tools/perf/util/expr.h
> +++ b/tools/perf/util/expr.h
> @@ -15,6 +15,7 @@ struct parse_ctx {
>   struct parse_id ids[MAX_PARSE_ID];
>  };
>  
> +int expr__runtimeparam;
>  void expr__ctx_init(struct parse_ctx *ctx);
>  void expr__add_id(struct parse_ctx *ctx, const char *id, double val);
>  int expr__parse(double *final_val, struct parse_ctx *ctx, const char *expr);
> diff --git a/tools/perf/util/expr.l b/tools/perf/util/expr.l
> index 1928f2a3dddc..ec4b00671f67 100644
> --- a/tools/perf/util/expr.l
> +++ b/tools/perf/util/expr.l
> @@ -45,6 +45,21 @@ static char *normalize(char *str)
>   *dst++ = '/';
>   else if (*str == '\\')
>   *dst++ = *++str;
> +else if (*str == '?') {
> +
> + int size = snprintf(NULL, 0, "%d", expr__runtimeparam);

TIL that C99 allows for a NULL str to format and return the number of
bytes it would write if the string was large enough... wonders never
cease :-)

> + char * paramval = (char *)malloc(size);

No need for the cast, malloc returns void *, or has that changed? 8-)
and please no space before the variable name.

Humm this is all complicated, why not use asprintf and have something
like:

> + int i = 0;
> +
> + if(!paramval)
> + *dst++ = '0';
> + else {
> + sprintf(paramval, "%d", expr__runtimeparam);
> + while(i < size)
> + *dst++ = paramval[i++];
> + free(paramval);
> + }

char *paramval;
int size = asprintf(, "%d", 
expr__runtimeparam);

if (size < 0)
*dst++ = '0';
else {
while (i < size)
 

Re: [PATCH v4 0/8] powerpc/perf: Add json file metric support for the hv_24x7 socket/chip level events

2020-03-10 Thread Jiri Olsa
On Tue, Mar 10, 2020 at 03:18:36PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Mon, Mar 09, 2020 at 10:35:06AM +0100, Jiri Olsa escreveu:
> > On Mon, Mar 09, 2020 at 11:55:44AM +0530, Kajol Jain wrote:
> > > First patch of the patchset fix inconsistent results we are getting when
> > > we run multiple 24x7 events.
> > > 
> > > Patchset adds json file metric support for the hv_24x7 socket/chip level
> > > events. "hv_24x7" pmu interface events needs system dependent parameter
> > > like socket/chip/core. For example, hv_24x7 chip level events needs
> > > specific chip-id to which the data is requested should be added as part
> > > of pmu events.
> > > 
> > > So to enable JSON file support to "hv_24x7" interface, patchset expose
> > > total number of sockets and chips per-socket details in sysfs
> > > files (sockets, chips) under "/sys/devices/hv_24x7/interface/".
> > > 
> > > To get sockets and number of chips per sockets, patchset adds a rtas call
> > > with token "PROCESSOR_MODULE_INFO" to get these details. Patchset also
> > > handles partition migration case to re-init these system depended
> > > parameters by adding proper calls in post_mobility_fixup() (mobility.c).
> > > 
> > > Patch 6 & 8 of the patchset handles perf tool plumbing needed to replace
> > > the "?" character in the metric expression to proper value and hv_24x7
> > > json metric file for different Socket/chip resources.
> > > 
> > > Patch set also enable Hz/hz prinitg for --metric-only option to print
> > > metric data for bus frequency.
> > > 
> > > Applied and tested all these patches cleanly on top of jiri's flex changes
> > > with the changes done by Kan Liang for "Support metric group constraint"
> > > patchset and made required changes.
> > > 
> > > Changelog:
> > > v3 -> v4
> > > - Made changes suggested by jiri.
> > 
> > could you please mention them next time? ;-)
> > 
> > > - Apply these patch on top of Kan liang changes.
> > 
> > Arnaldo, could you please pull the expr flex changes and Kan's
> > metric group constraint changes? it's both prereq of this patchset
> 
> Both are now in my perf/core branch, will go upstream soon, should I go
> and pickup the perf tooling bits in this patchkit?

Kajol mentioned there are still some issues with the
global variable, I plan to check on it this week

thanks,
jirka



Re: [PATCH] powerpc/pseries: fix of_read_drc_info_cell() to point at next record

2020-03-10 Thread Tyrel Datwyler
On 3/10/20 10:25 AM, Nathan Lynch wrote:
> Tyrel Datwyler  writes:
>> The expectation is that when calling of_read_drc_info_cell()
>> repeatedly to parse multiple drc-info records that the in/out curval
>> parameter points at the start of the next record on return. However,
>> the current behavior has curval still pointing at the final value of
>> the record just parsed. The result of which is that if the
>> ibm,drc-info property contains multiple properties the parsed value
>> of the drc_type for any record after the first has the power_domain
>> value of the previous record appended to the type string.
>>
>> Ex: observed the following 0x prepended to PHB
>>
>> [   69.485037] drc-info: type: \xff\xff\xff\xffPHB, prefix: PHB , 
>> index_start: 0x2001
>> [   69.485038] drc-info: suffix_start: 1, sequential_elems: 3072, 
>> sequential_inc: 1
>> [   69.485038] drc-info: power-domain: 0x, last_index: 0x2c00
>>
>> Fix by incrementing curval past the power_domain value to point at
>> drc_type string of next record.
>>
>> Fixes: a29396653b8bf ("pseries/drc-info: Search DRC properties for CPU 
>> indexes")
> 
> I have a different commit hash for that:
> e83636ac3334 pseries/drc-info: Search DRC properties for CPU indexes

Oof, looks like I grabbed the commit hash from the SLES 15 SP1 branch. You have
the correct upstream commit.

Fixes: e83636ac3334 ("pseries/drc-info: Search DRC properties for CPU indexes")

Michael, let me know if you want me to resubmit, or if you will fixup the Fixes
tag on your end?

-Tyrel

> 
>> Signed-off-by: Tyrel Datwyler 
> 
> Otherwise:
> Acked-by: Nathan Lynch 
> 



Re: [PATCH v4 0/8] powerpc/perf: Add json file metric support for the hv_24x7 socket/chip level events

2020-03-10 Thread Arnaldo Carvalho de Melo
Em Mon, Mar 09, 2020 at 10:35:06AM +0100, Jiri Olsa escreveu:
> On Mon, Mar 09, 2020 at 11:55:44AM +0530, Kajol Jain wrote:
> > First patch of the patchset fix inconsistent results we are getting when
> > we run multiple 24x7 events.
> > 
> > Patchset adds json file metric support for the hv_24x7 socket/chip level
> > events. "hv_24x7" pmu interface events needs system dependent parameter
> > like socket/chip/core. For example, hv_24x7 chip level events needs
> > specific chip-id to which the data is requested should be added as part
> > of pmu events.
> > 
> > So to enable JSON file support to "hv_24x7" interface, patchset expose
> > total number of sockets and chips per-socket details in sysfs
> > files (sockets, chips) under "/sys/devices/hv_24x7/interface/".
> > 
> > To get sockets and number of chips per sockets, patchset adds a rtas call
> > with token "PROCESSOR_MODULE_INFO" to get these details. Patchset also
> > handles partition migration case to re-init these system depended
> > parameters by adding proper calls in post_mobility_fixup() (mobility.c).
> > 
> > Patch 6 & 8 of the patchset handles perf tool plumbing needed to replace
> > the "?" character in the metric expression to proper value and hv_24x7
> > json metric file for different Socket/chip resources.
> > 
> > Patch set also enable Hz/hz prinitg for --metric-only option to print
> > metric data for bus frequency.
> > 
> > Applied and tested all these patches cleanly on top of jiri's flex changes
> > with the changes done by Kan Liang for "Support metric group constraint"
> > patchset and made required changes.
> > 
> > Changelog:
> > v3 -> v4
> > - Made changes suggested by jiri.
> 
> could you please mention them next time? ;-)
> 
> > - Apply these patch on top of Kan liang changes.
> 
> Arnaldo, could you please pull the expr flex changes and Kan's
> metric group constraint changes? it's both prereq of this patchset

Both are now in my perf/core branch, will go upstream soon, should I go
and pickup the perf tooling bits in this patchkit?

- Arnaldo


Re: [PATCH 1/4] powerpc/xive: Use XIVE_BAD_IRQ instead of zero to catch non configured IPIs

2020-03-10 Thread Cédric Le Goater
On 3/10/20 4:09 PM, Greg Kurz wrote:
> On Fri,  6 Mar 2020 16:01:40 +0100
> Cédric Le Goater  wrote:
> 
>> When a CPU is brought up, an IPI number is allocated and recorded
>> under the XIVE CPU structure. Invalid IPI numbers are tracked with
>> interrupt number 0x0.
>>
>> On the PowerNV platform, the interrupt number space starts at 0x10 and
>> this works fine. However, on the sPAPR platform, it is possible to
>> allocate the interrupt number 0x0 and this raises an issue when CPU 0
>> is unplugged. The XIVE spapr driver tracks allocated interrupt numbers
>> in a bitmask and it is not correctly updated when interrupt number 0x0
>> is freed. It stays allocated and it is then impossible to reallocate.
>>
>> Fix by using the XIVE_BAD_IRQ value instead of zero on both platforms.
>>
>> Reported-by: David Gibson 
>> Fixes: eac1e731b59e ("powerpc/xive: guest exploitation of the XIVE interrupt 
>> controller")
>> Cc: sta...@vger.kernel.org # v4.14+
>> Signed-off-by: Cédric Le Goater 
>> ---
> 
> This looks mostly good. I'm juste wondering about potential overlooks:

Yes. Thanks for doing so. There are some non-obvious use of interrupt
numbers under the hood. 

One thing we should be adding is a comment on the different interrupt 
number spaces. The HW interrupt numbers, the EISN numbers found on the 
XIVE even queue and the Linux interrupt numbers are different spaces 
and have different limits. XIVE_BAD_IRQ was introduced for the EISN 
numbers and the patch is using this value for HW numbers. This is ok 
because it's more a tag than a limit.

> $ git grep 'if.*hw_i' arch/powerpc/ | egrep -v 'xics|XIVE_BAD_IRQ'
>
>
> arch/powerpc/kvm/book3s_xive.h: if (out_hw_irq)
> arch/powerpc/kvm/book3s_xive.h: if (out_hw_irq)
> arch/powerpc/kvm/book3s_xive_template.c:else if (hw_irq && xd->flags 
> & XIVE_IRQ_FLAG_EOI_FW)
> arch/powerpc/sysdev/xive/common.c:  else if (hw_irq && xd->flags & 
> XIVE_IRQ_FLAG_EOI_FW) {
>
> This hw_irq check in xive_do_source_eoi() for example is related to:
> 
>   /*
>* Note: We pass "0" to the hw_irq argument in order to
>* avoid calling into the backend EOI code which we don't
>* want to do in the case of a re-trigger. Backends typically
>* only do EOI for LSIs anyway.
>*/
>   xive_do_source_eoi(0, xd);

that's a hack to skip the following part of the code in case of EOI 
being done through FW:

else if (hw_irq && xd->flags & XIVE_IRQ_FLAG_EOI_FW) {
/*
 * The FW told us to call it. This happens for some
 * interrupt sources that need additional HW whacking
 * beyond the ESB manipulation. For example LPC interrupts
 * on P9 DD1.0 needed a latch to be clared in the LPC bridge
 * itself. The Firmware will take care of it.
 */
if (WARN_ON_ONCE(!xive_ops->eoi))
return;
xive_ops->eoi(hw_irq);

That was the case for PHB4 LSIs on P9 DD1 only. We could probably drop
the code in Linux unless similar bugs show up on new platforms.

> but it can get hw_irq from:
> 
>   xive_do_source_eoi(xc->hw_ipi, >ipi_data);

That part is fine. It's an IPI EOI.

> 
> It seems that these should use XIVE_BAD_IRQ as well or I'm missing
> something ?
> 
> arch/powerpc/sysdev/xive/common.c:  if (hw_irq)

This tests the XIVE IPI number which is mapped to 0 for all CPUs.
See xive_request_ipi() and xive_irq_domain_map()

C.

> arch/powerpc/sysdev/xive/common.c:  if (d->domain != 
> xive_irq_domain || hw_irq == 0)
> 
> 
> 
>>  arch/powerpc/sysdev/xive/xive-internal.h |  7 +++
>>  arch/powerpc/sysdev/xive/common.c| 12 +++-
>>  arch/powerpc/sysdev/xive/native.c|  4 ++--
>>  arch/powerpc/sysdev/xive/spapr.c |  4 ++--
>>  4 files changed, 14 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/powerpc/sysdev/xive/xive-internal.h 
>> b/arch/powerpc/sysdev/xive/xive-internal.h
>> index 59cd366e7933..382980f4de2d 100644
>> --- a/arch/powerpc/sysdev/xive/xive-internal.h
>> +++ b/arch/powerpc/sysdev/xive/xive-internal.h
>> @@ -5,6 +5,13 @@
>>  #ifndef __XIVE_INTERNAL_H
>>  #define __XIVE_INTERNAL_H
>>  
>> +/*
>> + * A "disabled" interrupt should never fire, to catch problems
>> + * we set its logical number to this
>> + */
>> +#define XIVE_BAD_IRQ0x7fff
>> +#define XIVE_MAX_IRQ(XIVE_BAD_IRQ - 1)
>> +
>>  /* Each CPU carry one of these with various per-CPU state */
>>  struct xive_cpu {
>>  #ifdef CONFIG_SMP
>> diff --git a/arch/powerpc/sysdev/xive/common.c 
>> b/arch/powerpc/sysdev/xive/common.c
>> index fa49193206b6..550baba98ec9 100644
>> --- a/arch/powerpc/sysdev/xive/common.c
>> +++ b/arch/powerpc/sysdev/xive/common.c
>> @@ -68,13 +68,6 @@ static u32 xive_ipi_irq;
>>  /* Xive state for each CPU */
>>  static DEFINE_PER_CPU(struct xive_cpu *, xive_cpu);
>>  
>> -/*
>> - * A "disabled" 

[PATCH v3] powerpc/32s: reorder Linux PTE bits to better match Hash PTE bits.

2020-03-10 Thread Christophe Leroy
Reorder Linux PTE bits to (almost) match Hash PTE bits.

RW Kernel : PP = 00
RO Kernel : PP = 00
RW User   : PP = 01
RO User   : PP = 11

So naturally, we should have
_PAGE_USER = 0x001
_PAGE_RW   = 0x002

Today 0x001 and 0x002 and _PAGE_PRESENT and _PAGE_HASHPTE which
both are software only bits.

Switch _PAGE_USER and _PAGE_PRESET
Switch _PAGE_RW and _PAGE_HASHPTE

This allows to remove a few insns.

Signed-off-by: Christophe Leroy 
---
v3: rebased on today's powerpc/merge

v2: rebased on today's powerpc/merge
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/32/hash.h |  8 
 arch/powerpc/kernel/head_32.S |  9 +++--
 arch/powerpc/mm/book3s32/hash_low.S   | 14 ++
 3 files changed, 13 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/32/hash.h 
b/arch/powerpc/include/asm/book3s/32/hash.h
index 2a0a467d2985..34a7215ae81e 100644
--- a/arch/powerpc/include/asm/book3s/32/hash.h
+++ b/arch/powerpc/include/asm/book3s/32/hash.h
@@ -17,9 +17,9 @@
  * updating the accessed and modified bits in the page table tree.
  */
 
-#define _PAGE_PRESENT  0x001   /* software: pte contains a translation */
-#define _PAGE_HASHPTE  0x002   /* hash_page has made an HPTE for this pte */
-#define _PAGE_USER 0x004   /* usermode access allowed */
+#define _PAGE_USER 0x001   /* usermode access allowed */
+#define _PAGE_RW   0x002   /* software: user write access allowed */
+#define _PAGE_PRESENT  0x004   /* software: pte contains a translation */
 #define _PAGE_GUARDED  0x008   /* G: prohibit speculative access */
 #define _PAGE_COHERENT 0x010   /* M: enforce memory coherence (SMP systems) */
 #define _PAGE_NO_CACHE 0x020   /* I: cache inhibit */
@@ -27,7 +27,7 @@
 #define _PAGE_DIRTY0x080   /* C: page changed */
 #define _PAGE_ACCESSED 0x100   /* R: page referenced */
 #define _PAGE_EXEC 0x200   /* software: exec allowed */
-#define _PAGE_RW   0x400   /* software: user write access allowed */
+#define _PAGE_HASHPTE  0x400   /* hash_page has made an HPTE for this pte */
 #define _PAGE_SPECIAL  0x800   /* software: Special page */
 
 #ifdef CONFIG_PTE_64BIT
diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index 97c887950c3c..daaa153950c2 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -348,7 +348,7 @@ BEGIN_MMU_FTR_SECTION
andis.  r0, r5, (DSISR_BAD_FAULT_32S | DSISR_DABRMATCH)@h
 #endif
bne handle_page_fault_tramp_2   /* if not, try to put a PTE */
-   rlwinm  r3, r5, 32 - 15, 21, 21 /* DSISR_STORE -> _PAGE_RW */
+   rlwinm  r3, r5, 32 - 24, 30, 30 /* DSISR_STORE -> _PAGE_RW */
bl  hash_page
b   handle_page_fault_tramp_1
 FTR_SECTION_ELSE
@@ -497,7 +497,6 @@ InstructionTLBMiss:
andc.   r1,r1,r0/* check access & ~permission */
bne-InstructionAddressInvalid /* return if access not permitted */
/* Convert linux-style PTE to low word of PPC-style PTE */
-   rlwimi  r0,r0,32-2,31,31/* _PAGE_USER -> PP lsb */
ori r1, r1, 0xe06   /* clear out reserved bits */
andcr1, r0, r1  /* PP = user? 1 : 0 */
 BEGIN_FTR_SECTION
@@ -565,9 +564,8 @@ DataLoadTLBMiss:
 * we would need to update the pte atomically with lwarx/stwcx.
 */
/* Convert linux-style PTE to low word of PPC-style PTE */
-   rlwinm  r1,r0,32-9,30,30/* _PAGE_RW -> PP msb */
-   rlwimi  r0,r0,32-1,30,30/* _PAGE_USER -> PP msb */
-   rlwimi  r0,r0,32-1,31,31/* _PAGE_USER -> PP lsb */
+   rlwinm  r1,r0,0,30,30   /* _PAGE_RW -> PP msb */
+   rlwimi  r0,r0,1,30,30   /* _PAGE_USER -> PP msb */
ori r1,r1,0xe04 /* clear out reserved bits */
andcr1,r0,r1/* PP = user? rw? 1: 3: 0 */
 BEGIN_FTR_SECTION
@@ -645,7 +643,6 @@ DataStoreTLBMiss:
 * we would need to update the pte atomically with lwarx/stwcx.
 */
/* Convert linux-style PTE to low word of PPC-style PTE */
-   rlwimi  r0,r0,32-2,31,31/* _PAGE_USER -> PP lsb */
li  r1,0xe06/* clear out reserved bits & PP msb */
andcr1,r0,r1/* PP = user? 1: 0 */
 BEGIN_FTR_SECTION
diff --git a/arch/powerpc/mm/book3s32/hash_low.S 
b/arch/powerpc/mm/book3s32/hash_low.S
index 877d880890fe..6d236080cb1a 100644
--- a/arch/powerpc/mm/book3s32/hash_low.S
+++ b/arch/powerpc/mm/book3s32/hash_low.S
@@ -35,7 +35,7 @@ mmu_hash_lock:
 /*
  * Load a PTE into the hash table, if possible.
  * The address is in r4, and r3 contains an access flag:
- * _PAGE_RW (0x400) if a write.
+ * _PAGE_RW (0x002) if a write.
  * r9 contains the SRR1 value, from which we use the MSR_PR bit.
  * SPRG_THREAD contains the physical address of the current task's thread.
  *
@@ -69,7 +69,7 @@ _GLOBAL(hash_page)
blt+112f

Re: [PATCH] powerpc/pseries: fix of_read_drc_info_cell() to point at next record

2020-03-10 Thread Nathan Lynch
Tyrel Datwyler  writes:
> The expectation is that when calling of_read_drc_info_cell()
> repeatedly to parse multiple drc-info records that the in/out curval
> parameter points at the start of the next record on return. However,
> the current behavior has curval still pointing at the final value of
> the record just parsed. The result of which is that if the
> ibm,drc-info property contains multiple properties the parsed value
> of the drc_type for any record after the first has the power_domain
> value of the previous record appended to the type string.
>
> Ex: observed the following 0x prepended to PHB
>
> [   69.485037] drc-info: type: \xff\xff\xff\xffPHB, prefix: PHB , 
> index_start: 0x2001
> [   69.485038] drc-info: suffix_start: 1, sequential_elems: 3072, 
> sequential_inc: 1
> [   69.485038] drc-info: power-domain: 0x, last_index: 0x2c00
>
> Fix by incrementing curval past the power_domain value to point at
> drc_type string of next record.
>
> Fixes: a29396653b8bf ("pseries/drc-info: Search DRC properties for CPU 
> indexes")

I have a different commit hash for that:
e83636ac3334 pseries/drc-info: Search DRC properties for CPU indexes

> Signed-off-by: Tyrel Datwyler 

Otherwise:
Acked-by: Nathan Lynch 


Re: [PATCH 2/4] powerpc/xive: Fix xmon support on the PowerNV platform

2020-03-10 Thread Greg Kurz
On Fri,  6 Mar 2020 16:01:41 +0100
Cédric Le Goater  wrote:

> The PowerNV platform has multiple IRQ chips and the xmon command
> dumping the state of the XIVE interrupt should only operate on the
> XIVE IRQ chip.
> 
> Fixes: 5896163f7f91 ("powerpc/xmon: Improve output of XIVE interrupts")
> Cc: sta...@vger.kernel.org # v5.4+
> Signed-off-by: Cédric Le Goater 
> ---

Reviewed-by: Greg Kurz 

>  arch/powerpc/sysdev/xive/common.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index 550baba98ec9..8155adc2225a 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -261,11 +261,15 @@ notrace void xmon_xive_do_dump(int cpu)
>  
>  int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d)
>  {
> + struct irq_chip *chip = irq_data_get_irq_chip(d);
>   int rc;
>   u32 target;
>   u8 prio;
>   u32 lirq;
>  
> + if (!is_xive_irq(chip))
> + return -EINVAL;
> +
>   rc = xive_ops->get_irq_config(hw_irq, , , );
>   if (rc) {
>   xmon_printf("IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);



Re: [PATCH 3/4] powerpc/xmon: Add source flags to output of XIVE interrupts

2020-03-10 Thread Greg Kurz
On Fri,  6 Mar 2020 16:01:42 +0100
Cédric Le Goater  wrote:

> Some firmwares or hypervisors can advertise different source
> characteristics. Track their value under XMON. What we are mostly
> interested in is the StoreEOI flag.
> 
> Signed-off-by: Cédric Le Goater 
> ---

Reviewed-by: Greg Kurz 

>  arch/powerpc/sysdev/xive/common.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index 8155adc2225a..c865ae554605 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -283,7 +283,10 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data 
> *d)
>   struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
>   u64 val = xive_esb_read(xd, XIVE_ESB_GET);
>  
> - xmon_printf("PQ=%c%c",
> + xmon_printf("flags=%c%c%c PQ=%c%c",
> + xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ',
> + xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ',
> + xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ',
>   val & XIVE_ESB_VAL_P ? 'P' : '-',
>   val & XIVE_ESB_VAL_Q ? 'Q' : '-');
>   }



Re: [PATCH 1/4] powerpc/xive: Use XIVE_BAD_IRQ instead of zero to catch non configured IPIs

2020-03-10 Thread Greg Kurz
On Fri,  6 Mar 2020 16:01:40 +0100
Cédric Le Goater  wrote:

> When a CPU is brought up, an IPI number is allocated and recorded
> under the XIVE CPU structure. Invalid IPI numbers are tracked with
> interrupt number 0x0.
> 
> On the PowerNV platform, the interrupt number space starts at 0x10 and
> this works fine. However, on the sPAPR platform, it is possible to
> allocate the interrupt number 0x0 and this raises an issue when CPU 0
> is unplugged. The XIVE spapr driver tracks allocated interrupt numbers
> in a bitmask and it is not correctly updated when interrupt number 0x0
> is freed. It stays allocated and it is then impossible to reallocate.
> 
> Fix by using the XIVE_BAD_IRQ value instead of zero on both platforms.
> 
> Reported-by: David Gibson 
> Fixes: eac1e731b59e ("powerpc/xive: guest exploitation of the XIVE interrupt 
> controller")
> Cc: sta...@vger.kernel.org # v4.14+
> Signed-off-by: Cédric Le Goater 
> ---

This looks mostly good. I'm juste wondering about potential overlooks:

$ git grep 'if.*hw_i' arch/powerpc/ | egrep -v 'xics|XIVE_BAD_IRQ'
arch/powerpc/kvm/book3s_xive.h: if (out_hw_irq)
arch/powerpc/kvm/book3s_xive.h: if (out_hw_irq)
arch/powerpc/kvm/book3s_xive_template.c:else if (hw_irq && xd->flags & 
XIVE_IRQ_FLAG_EOI_FW)
arch/powerpc/sysdev/xive/common.c:  else if (hw_irq && xd->flags & 
XIVE_IRQ_FLAG_EOI_FW) {

This hw_irq check in xive_do_source_eoi() for example is related to:

/*
 * Note: We pass "0" to the hw_irq argument in order to
 * avoid calling into the backend EOI code which we don't
 * want to do in the case of a re-trigger. Backends typically
 * only do EOI for LSIs anyway.
 */
xive_do_source_eoi(0, xd);

but it can get hw_irq from:

xive_do_source_eoi(xc->hw_ipi, >ipi_data);

It seems that these should use XIVE_BAD_IRQ as well or I'm missing
something ?

arch/powerpc/sysdev/xive/common.c:  if (hw_irq)
arch/powerpc/sysdev/xive/common.c:  if (d->domain != 
xive_irq_domain || hw_irq == 0)



>  arch/powerpc/sysdev/xive/xive-internal.h |  7 +++
>  arch/powerpc/sysdev/xive/common.c| 12 +++-
>  arch/powerpc/sysdev/xive/native.c|  4 ++--
>  arch/powerpc/sysdev/xive/spapr.c |  4 ++--
>  4 files changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/xive-internal.h 
> b/arch/powerpc/sysdev/xive/xive-internal.h
> index 59cd366e7933..382980f4de2d 100644
> --- a/arch/powerpc/sysdev/xive/xive-internal.h
> +++ b/arch/powerpc/sysdev/xive/xive-internal.h
> @@ -5,6 +5,13 @@
>  #ifndef __XIVE_INTERNAL_H
>  #define __XIVE_INTERNAL_H
>  
> +/*
> + * A "disabled" interrupt should never fire, to catch problems
> + * we set its logical number to this
> + */
> +#define XIVE_BAD_IRQ 0x7fff
> +#define XIVE_MAX_IRQ (XIVE_BAD_IRQ - 1)
> +
>  /* Each CPU carry one of these with various per-CPU state */
>  struct xive_cpu {
>  #ifdef CONFIG_SMP
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index fa49193206b6..550baba98ec9 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -68,13 +68,6 @@ static u32 xive_ipi_irq;
>  /* Xive state for each CPU */
>  static DEFINE_PER_CPU(struct xive_cpu *, xive_cpu);
>  
> -/*
> - * A "disabled" interrupt should never fire, to catch problems
> - * we set its logical number to this
> - */
> -#define XIVE_BAD_IRQ 0x7fff
> -#define XIVE_MAX_IRQ (XIVE_BAD_IRQ - 1)
> -
>  /* An invalid CPU target */
>  #define XIVE_INVALID_TARGET  (-1)
>  
> @@ -1153,7 +1146,7 @@ static int xive_setup_cpu_ipi(unsigned int cpu)
>   xc = per_cpu(xive_cpu, cpu);
>  
>   /* Check if we are already setup */
> - if (xc->hw_ipi != 0)
> + if (xc->hw_ipi != XIVE_BAD_IRQ)
>   return 0;
>  
>   /* Grab an IPI from the backend, this will populate xc->hw_ipi */
> @@ -1190,7 +1183,7 @@ static void xive_cleanup_cpu_ipi(unsigned int cpu, 
> struct xive_cpu *xc)
>   /* Disable the IPI and free the IRQ data */
>  
>   /* Already cleaned up ? */
> - if (xc->hw_ipi == 0)
> + if (xc->hw_ipi == XIVE_BAD_IRQ)
>   return;
>  
>   /* Mask the IPI */
> @@ -1346,6 +1339,7 @@ static int xive_prepare_cpu(unsigned int cpu)
>   if (np)
>   xc->chip_id = of_get_ibm_chip_id(np);
>   of_node_put(np);
> + xc->hw_ipi = XIVE_BAD_IRQ;
>  
>   per_cpu(xive_cpu, cpu) = xc;
>   }
> diff --git a/arch/powerpc/sysdev/xive/native.c 
> b/arch/powerpc/sysdev/xive/native.c
> index 0ff6b739052c..50e1a8e02497 100644
> --- a/arch/powerpc/sysdev/xive/native.c
> +++ b/arch/powerpc/sysdev/xive/native.c
> @@ -312,7 +312,7 @@ static void xive_native_put_ipi(unsigned int cpu, struct 
> xive_cpu *xc)
>   s64 rc;
>  
>   /* Free the IPI */
> - if (!xc->hw_ipi)
> + if (xc->hw_ipi == 

Re: [5.6.0-rc2-next-20200218/powerpc] Boot failure on POWER9

2020-03-10 Thread Michal Hocko
On Thu 27-02-20 19:26:54, Michal Hocko wrote:
> [Cc ppc maintainers]
[...]
> Please have a look at 
> http://lkml.kernel.org/r/52ef4673-7292-4c4c-b459-af583951b...@linux.vnet.ibm.com
> for the boot log with the debugging patch which tracks set_numa_mem.
> This seems to lead to a crash in the slab allocator bebcause
> node_to_mem_node(0) for memory less node resolves to the memory less
> node http://lkml.kernel.org/r/dd450314-d428-6776-af07-f92c04c7b...@suse.cz.
> The original report is 
> http://lkml.kernel.org/r/3381cd91-ab3d-4773-ba04-e7a072a63...@linux.vnet.ibm.com

ping 
-- 
Michal Hocko
SUSE Labs


[PATCH net v2] ibmvnic: Do not process device remove during device reset

2020-03-10 Thread Juliet Kim
The ibmvnic driver does not check the device state when the device
is removed. If the device is removed while a device reset is being
processed, the remove may free structures needed by the reset,
causing an oops.

Fix this by checking the device state before processing device remove.

Signed-off-by: Juliet Kim 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 24 ++--
 drivers/net/ethernet/ibm/ibmvnic.h |  6 +-
 2 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index c75239d8820f..4bd33245bad6 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2142,6 +2142,8 @@ static void __ibmvnic_reset(struct work_struct *work)
 {
struct ibmvnic_rwi *rwi;
struct ibmvnic_adapter *adapter;
+   bool saved_state = false;
+   unsigned long flags;
u32 reset_state;
int rc = 0;
 
@@ -2153,17 +2155,25 @@ static void __ibmvnic_reset(struct work_struct *work)
return;
}
 
-   reset_state = adapter->state;
-
rwi = get_next_rwi(adapter);
while (rwi) {
+   spin_lock_irqsave(>state_lock, flags);
+
if (adapter->state == VNIC_REMOVING ||
adapter->state == VNIC_REMOVED) {
+   spin_unlock_irqrestore(>state_lock, flags);
kfree(rwi);
rc = EBUSY;
break;
}
 
+   if (!saved_state) {
+   reset_state = adapter->state;
+   adapter->state = VNIC_RESETTING;
+   saved_state = true;
+   }
+   spin_unlock_irqrestore(>state_lock, flags);
+
if (rwi->reset_reason == VNIC_RESET_CHANGE_PARAM) {
/* CHANGE_PARAM requestor holds rtnl_lock */
rc = do_change_param_reset(adapter, rwi, reset_state);
@@ -5091,6 +5101,7 @@ static int ibmvnic_probe(struct vio_dev *dev, const 
struct vio_device_id *id)
  __ibmvnic_delayed_reset);
INIT_LIST_HEAD(>rwi_list);
spin_lock_init(>rwi_lock);
+   spin_lock_init(>state_lock);
mutex_init(>fw_lock);
init_completion(>init_done);
init_completion(>fw_done);
@@ -5163,8 +5174,17 @@ static int ibmvnic_remove(struct vio_dev *dev)
 {
struct net_device *netdev = dev_get_drvdata(>dev);
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
+   unsigned long flags;
+
+   spin_lock_irqsave(>state_lock, flags);
+   if (adapter->state == VNIC_RESETTING) {
+   spin_unlock_irqrestore(>state_lock, flags);
+   return -EBUSY;
+   }
 
adapter->state = VNIC_REMOVING;
+   spin_unlock_irqrestore(>state_lock, flags);
+
rtnl_lock();
unregister_netdevice(netdev);
 
diff --git a/drivers/net/ethernet/ibm/ibmvnic.h 
b/drivers/net/ethernet/ibm/ibmvnic.h
index 60eccaf91b12..f8416e1d4cf0 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.h
+++ b/drivers/net/ethernet/ibm/ibmvnic.h
@@ -941,7 +941,8 @@ enum vnic_state {VNIC_PROBING = 1,
 VNIC_CLOSING,
 VNIC_CLOSED,
 VNIC_REMOVING,
-VNIC_REMOVED};
+VNIC_REMOVED,
+VNIC_RESETTING};
 
 enum ibmvnic_reset_reason {VNIC_RESET_FAILOVER = 1,
   VNIC_RESET_MOBILITY,
@@ -1090,4 +1091,7 @@ struct ibmvnic_adapter {
 
struct ibmvnic_tunables desired;
struct ibmvnic_tunables fallback;
+
+   /* Used for serializatin of state field */
+   spinlock_t state_lock;
 };
-- 
2.25.0



[Bug 206669] Little-endian kernel crashing on POWER8 on heavy big-endian PowerKVM load

2020-03-10 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=206669

--- Comment #10 from John Paul Adrian Glaubitz (glaub...@physik.fu-berlin.de) 
---
(In reply to Aneesh Kumar KV from comment #9)
> Also, can you try disabling THP. echo "never" >
> /sys/kernel/mm/transparent_hugepage/enabled 

Yes. Just disabled.

FWIW, the machine just crashed some minutes ago but I didn't have a serial
console open to capture the trace.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[Bug 206669] Little-endian kernel crashing on POWER8 on heavy big-endian PowerKVM load

2020-03-10 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=206669

Aneesh Kumar KV (aneesh.ku...@linux.ibm.com) changed:

   What|Removed |Added

 CC||aneesh.ku...@linux.ibm.com

--- Comment #9 from Aneesh Kumar KV (aneesh.ku...@linux.ibm.com) ---
Also, can you try disabling THP. echo "never" >
/sys/kernel/mm/transparent_hugepage/enabled 

-aneesh

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: [PATCH v2] powerpc/Makefile: Mark phony targets as PHONY

2020-03-10 Thread Michael Ellerman
Masahiro Yamada  writes:
> On Fri, Mar 6, 2020 at 9:27 AM Michael Ellerman
>  wrote:
>>
>> On Wed, 2020-02-19 at 00:04:34 UTC, Michael Ellerman wrote:
>> > Some of our phony targets are not marked as such. This can lead to
>> > confusing errors, eg:
>> >
>> >   $ make clean
>> >   $ touch install
>> >   $ make install
>> >   make: 'install' is up to date.
>> >   $
>> >
>> > Fix it by adding them to the PHONY variable which is marked phony in
>> > the top-level Makefile, or in scripts/Makefile.build for the boot
>> > Makefile.
>> >
>> > Suggested-by: Masahiro Yamada 
>> > Signed-off-by: Michael Ellerman 
>>
>> Applied to powerpc next.
>>
>> https://git.kernel.org/powerpc/c/d42c6d0f8d004c3661dde3c376ed637e9f292c22
>>
>
> You do not have to double your Signed-off-by.

Oops :/

My scripts don't cope with applying my own patches very well. Will try
to fix it.

cheers


Re: [PATCH v3] powerpc: setup_64: set up PACA earlier to avoid kcov problems

2020-03-10 Thread Michael Ellerman
Nicholas Piggin  writes:
> Daniel Axtens's on March 6, 2020 5:30 pm:
>> kcov instrumentation is collected the __sanitizer_cov_trace_pc hook in
>> kernel/kcov.c. The compiler inserts these hooks into every basic block
>> unless kcov is disabled for that file.
>> 
>> We then have a deep call-chain:
>>  - __sanitizer_cov_trace_pc calls to check_kcov_mode()
>>  - check_kcov_mode() (kernel/kcov.c) calls in_task()
>>  - in_task() (include/linux/preempt.h) calls preempt_count().
>>  - preempt_count() (include/asm-generic/preempt.h) calls
>>  current_thread_info()
>>  - because powerpc has THREAD_INFO_IN_TASK, current_thread_info()
>>  (include/linux/thread_info.h) is defined to 'current'
>>  - current (arch/powerpc/include/asm/current.h) is defined to
>>  get_current().
>>  - get_current (same file) loads an offset of r13.
>>  - arch/powerpc/include/asm/paca.h makes r13 a register variable
>>  called local_paca - it is the PACA for the current CPU, so
>>  this has the effect of loading the current task from PACA.
>>  - get_current returns the current task from PACA,
>>  - current_thread_info returns the task cast to a thread_info
>>  - preempt_count dereferences the thread_info to load preempt_count
>>  - that value is used by in_task and so on up the chain
>> 
>> The problem is:
>> 
>>  - kcov instrumentation is enabled for arch/powerpc/kernel/dt_cpu_ftrs.c
>> 
>>  - even if it were not, dt_cpu_ftrs_init calls generic dt parsing code
>>which should definitely have instrumentation enabled.
>> 
>>  - setup_64.c calls dt_cpu_ftrs_init before it sets up a PACA.
>> 
>>  - If we don't set up a paca, r13 will contain unpredictable data.
>> 
>>  - In a zImage compiled with kcov and KASAN, we see r13 containing a value
>>that leads to dereferencing invalid memory (something like
>>912a72603d420015).
>> 
>>  - Weirdly, the same kernel as a vmlinux loaded directly by qemu does not
>>crash. Investigating with gdb, it seems that in the vmlinux boot case,
>>r13 is near enough to zero that we just happen to be able to read that
>>part of memory (we're operating with translation off at this point) and
>>the current pointer also happens to land in readable memory and
>>everything just works.
>> 
>>  - PACA setup refers to CPU features - setup_paca() looks at
>>early_cpu_has_feature(CPU_FTR_HVMODE)
>> 
>> There's no generic kill switch for kcov (as far as I can tell), and we
>> don't want to have to turn off instrumentation in the generic dt parsing
>> code (which lives outside arch/powerpc/) just because we don't have a real
>> paca or task yet.
>> 
>> So:
>>  - change the test when setting up a PACA to consider the actual value of
>>the MSR rather than the CPU feature.
>> 
>>  - move the PACA setup to before the cpu feature parsing.
>
> Hmm. Problem is that equally we want PACA to be sane before we call too
> far into the rest of the kernel ("generic dt parsing code").

But currently we call into that code with no paca at all. Or rather,
with r13 pointing somewhere random that will be interpreted as being a
paca.

This took a while for Daniel to debug because depending on how you boot
r13 contains a different junk value. That junk value may not point to
memory at all, or if it does the memory it points to may or may not send
you down the wrong path, depending on which exact bit you're looking at
in some random location.

So this is really not about kcov from my POV, that's just how we
discovered it.

cheers


Re: [PATCH] macintosh: windfarm: fix MODINFO regression

2020-03-10 Thread Wolfram Sang
On Tue, Mar 03, 2020 at 01:50:46PM +0100, Wolfram Sang wrote:
> Commit af503716ac14 made sure OF devices get an OF style modalias with
> I2C events. It assumed all in-tree users were converted, yet it missed
> some Macintosh drivers.
> 
> Add an OF module device table for all windfarm drivers to make them
> automatically load again.
> 
> Fixes: af503716ac14 ("i2c: core: report OF style module alias for devices 
> registered via OF")
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=199471
> Reported-by: Erhard Furtner 
> Tested-by: Erhard Furtner 
> Signed-off-by: Wolfram Sang 

Thanks, Michael! Applied to for-current.



signature.asc
Description: PGP signature


Re: ppc32 panic on boot on linux-next

2020-03-10 Thread Laurentiu Tudor

Hello,

On 10.03.2020 06:44, Michael Ellerman wrote:

Christophe Leroy  writes:

Le 07/03/2020 à 09:42, Christophe Leroy a écrit :

Le 06/03/2020 à 20:05, Nick Desaulniers a écrit :

As a heads up, our CI went red last night, seems like a panic from
free_initmem?  Is this a known issue?


Thanks for the heads up.

No such issue with either 8xx or book3s/32.

I've now been able to reproduce it with bamboo QEMU.

Reverting 2efc7c085f05 makes it disappear. I'll investigate.



Ok, I found the problem. virt_to_kpte() lacks a NULL pmd check. I'll
send a patch for that.

However, if there is no PMD I guess this area is mapped through some
kind of block mapping. Therefore it should bail out of the function through:

if (v_block_mapped(address))
return 0;


Can someone who knows BOOKE investigate that ?


Not sure we have anyone left?

cheers



With latest linux-next, PPC32 Book-E boots to prompt:

https://paste.debian.net/1134272/

---
Best Regards, Laurentiu


Re: [PATCH] macintosh: windfarm: fix MODINFO regression

2020-03-10 Thread Michael Ellerman
Wolfram Sang  writes:
> On Tue, Mar 03, 2020 at 01:50:46PM +0100, Wolfram Sang wrote:
>> Commit af503716ac14 made sure OF devices get an OF style modalias with
>> I2C events. It assumed all in-tree users were converted, yet it missed
>> some Macintosh drivers.
>> 
>> Add an OF module device table for all windfarm drivers to make them
>> automatically load again.
>> 
>> Fixes: af503716ac14 ("i2c: core: report OF style module alias for devices 
>> registered via OF")
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=199471
>> Reported-by: Erhard Furtner 
>> Tested-by: Erhard Furtner 
>> Signed-off-by: Wolfram Sang 
>
> Michael, I can take this via I2C again, if you ack it.

Thanks, done.

cheers


Re: [PATCH] macintosh: windfarm: fix MODINFO regression

2020-03-10 Thread Michael Ellerman
Wolfram Sang  writes:

> Commit af503716ac14 made sure OF devices get an OF style modalias with
> I2C events. It assumed all in-tree users were converted, yet it missed
> some Macintosh drivers.
>
> Add an OF module device table for all windfarm drivers to make them
> automatically load again.
>
> Fixes: af503716ac14 ("i2c: core: report OF style module alias for devices 
> registered via OF")
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=199471
> Reported-by: Erhard Furtner 
> Tested-by: Erhard Furtner 
> Signed-off-by: Wolfram Sang 
> ---
>
> This should also help with this: 
> https://lists.debian.org/debian-powerpc/2020/01/msg00062.html
> Some more testing would be appreciated because lm75 also has some code
> changes I can't test myself obviusly.
>
> By grepping, I found some more potential candidates (using a "MAC,"
> prefix but not defining a OF MODULE DEVICE TABLE). Does someone know
> about bugreports filed for those? I don't want to change them for no
> reason:
>
> drivers/macintosh/ams/ams-i2c.c
> drivers/macintosh/therm_adt746x.c
> sound/aoa/codecs/onyx.c
> sound/aoa/codecs/tas.c
> sound/ppc/keywest.c
>
> Happy hacking,
>
>Wolfram
>
>  drivers/macintosh/windfarm_ad7417_sensor.c  |  7 +++
>  drivers/macintosh/windfarm_fcu_controls.c   |  7 +++
>  drivers/macintosh/windfarm_lm75_sensor.c| 16 +++-
>  drivers/macintosh/windfarm_lm87_sensor.c|  7 +++
>  drivers/macintosh/windfarm_max6690_sensor.c |  7 +++
>  drivers/macintosh/windfarm_smu_sat.c|  7 +++
>  6 files changed, 50 insertions(+), 1 deletion(-)

Acked-by: Michael Ellerman  (powerpc)

cheers



CDC ethernet gadget: complete system freeze

2020-03-10 Thread Joakim Tjernlund
We have an embedded T1042 NXP CDC ethernet gadget which seems to completely 
freeze when an
usb0 I/F is established and one do 1 of two things:
 1) reboot the connected Linux laptop -> CDC gadget appears to enter complete 
system freeze.
 2) on laptop, ifconfig usb0 down; rmmod cdc_ether -> CDC gaget freezes

I cannot finns and OOPS or other trace in console or syslog. I could really use 
som ideas here.
Linux kernel 4.19.108 (4.19.76 has the same issue)

 Jocke



Re: [PATCH] i2c: powermac: correct comment about custom handling

2020-03-10 Thread Wolfram Sang
On Tue, Feb 25, 2020 at 03:26:13PM +0100, Wolfram Sang wrote:
> The comment had some flaws which are now fixed:
> - the prefix is 'MAC' not 'AAPL'
> - no kernel coding style and too short length
> - 'we do' instead of 'we to'
> 
> Signed-off-by: Wolfram Sang 

Applied to for-next, thanks!



signature.asc
Description: PGP signature


Re: [PATCH] macintosh: windfarm: fix MODINFO regression

2020-03-10 Thread Wolfram Sang
On Tue, Mar 03, 2020 at 01:50:46PM +0100, Wolfram Sang wrote:
> Commit af503716ac14 made sure OF devices get an OF style modalias with
> I2C events. It assumed all in-tree users were converted, yet it missed
> some Macintosh drivers.
> 
> Add an OF module device table for all windfarm drivers to make them
> automatically load again.
> 
> Fixes: af503716ac14 ("i2c: core: report OF style module alias for devices 
> registered via OF")
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=199471
> Reported-by: Erhard Furtner 
> Tested-by: Erhard Furtner 
> Signed-off-by: Wolfram Sang 

Michael, I can take this via I2C again, if you ack it.

Thanks,

   Wolfram



signature.asc
Description: PGP signature