RE: [PATCH] scsi: smartpqi_init: Reporting 'logical unit failure'

2019-02-28 Thread Elliott, Robert (Persistent Memory)



> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org 
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of
> Erwan Velu
> Sent: Wednesday, February 27, 2019 10:32 AM
> Subject: [PATCH] scsi: smartpqi_init: Reporting 'logical unit failure'
> 
> When this HARDWARE_ERROR/0x3e/0x1 case is triggered, the logical volume is
> offlined.
> When reading the kernel log, the reason why the device got offlined isn't
> reported to the user.
> This situation makes it difficult for admins to determine _why_ the volume
> got offlined.
> Reading this part of the code makes it clear this is because the driver
> received a HARDWARE_ERROR/0x3e/0x1, which is a 'logical unit failure'.
> 
> This patch is just about reporting that fact, to help admins make the
> connection between this event and the offlining.
> 
> Signed-off-by: Erwan Velu 
> ---
>  drivers/scsi/smartpqi/smartpqi_init.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/scsi/smartpqi/smartpqi_init.c 
> b/drivers/scsi/smartpqi/smartpqi_init.c
> index f564af8949e8..89f37d76735c 100644
> --- a/drivers/scsi/smartpqi/smartpqi_init.c
> +++ b/drivers/scsi/smartpqi/smartpqi_init.c
> @@ -2764,6 +2764,12 @@ static void pqi_process_raid_io_error(struct 
> pqi_io_request *io_request)
>   sshdr.sense_key == HARDWARE_ERROR &&
>   sshdr.asc == 0x3e &&
>   sshdr.ascq == 0x1) {
> +		struct pqi_ctrl_info *ctrl_info =
> +			shost_to_hba(scmd->device->host);
> +		struct pqi_scsi_dev *device = scmd->device->hostdata;
> +
> +		dev_err(&ctrl_info->pci_dev->dev,
> +			"received 'logical unit failure' from controller for scsi %d:%d:%d:%d\n",
> +			ctrl_info->scsi_host->host_no, device->bus,
> +			device->target, device->lun);
>   pqi_take_device_offline(scmd->device, "RAID");
>   host_byte = DID_NO_CONNECT;
>   }

Be careful printing errors per-IO; you could get thousands of them if things
go bad. The block layer print_req_error() uses printk_ratelimited(KERN_ERR)
for that reason, and the SCSI layer scsi_io_completion_action() maintains a
ratelimit on its own.

The dev_err_ratelimited() macro might be a good fit here.
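
Purely for illustration, a minimal sketch of the rate-limited variant using
the names from the patch above (dev_err_ratelimited() takes the same
arguments as dev_err()):

	dev_err_ratelimited(&ctrl_info->pci_dev->dev,
		"received 'logical unit failure' from controller for scsi %d:%d:%d:%d\n",
		ctrl_info->scsi_host->host_no, device->bus,
		device->target, device->lun);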


---
Robert Elliott, HPE Persistent Memory




RE: [PATCH] nvme: Enable acceleration feature of A64FX processor

2019-02-14 Thread Elliott, Robert (Persistent Memory)



> -Original Message-
> From: Linux-nvme [mailto:linux-nvme-boun...@lists.infradead.org] On Behalf Of 
> Keith Busch
> Sent: Tuesday, February 5, 2019 8:39 AM
> To: Takao Indoh 
> Cc: Takao Indoh ; s...@grimberg.me; 
> linux-kernel@vger.kernel.org; linux-
> n...@lists.infradead.org; ax...@fb.com; h...@lst.de
> Subject: Re: [PATCH] nvme: Enable acceleration feature of A64FX processor
> 
> On Tue, Feb 05, 2019 at 09:56:05PM +0900, Takao Indoh wrote:
> > On Fri, Feb 01, 2019 at 07:54:14AM -0700, Keith Busch wrote:
> > > On Fri, Feb 01, 2019 at 09:46:15PM +0900, Takao Indoh wrote:
> > > > From: Takao Indoh 
> > > >
> > > > Fujitsu A64FX processor has a feature to accelerate data transfer of
> > > > internal bus by relaxed ordering. It is enabled when the bit 56 of dma
> > > > address is set to 1.
> > >
> > > Wait, what? RO is a standard PCIe TLP attribute. Why would we need this?
> >
> > I should have explained this patch more carefully.
> >
> > Standard PCIe devices can use Relaxed Ordering (RO) by setting Attr
> > field in the TLP header, however, this mechanism cannot be utilized if
> > the device does not support RO feature. Fujitsu A64FX processor has an
> > alternate feature to enable RO in its Root Port by setting the bit 56 of
> > DMA address. This mechanism enables to utilize RO feature even if the
> > device does not support standard PCIe RO.
> 
> I think you're better off just purchasing devices that support the
> capability per spec rather than with a non-standard workaround.
> 

The PCIe and NVMe specifications don't standardize a way to tell the device
when to use RO, which leads to system workarounds like this.

The Enable Relaxed Ordering bit defined by PCIe tells the device when it
cannot use RO, but doesn't advise when it should or shall use RO.

For SCSI Express (SOP+PQI), we were going to allow specifying these
on a per-command basis:
* TLP attributes (No Snoop, Relaxed Ordering, ID-based Ordering)
* TLP processing hints (Processing Hints and Steering Tags)

to be used by the data transfers for the command. In some systems, one
setting per queue or per device might suffice. Transactions to the
queues and doorbells require stronger ordering.

For this workaround:
* making an extra pass through the SGL to set the address bit is
  inefficient; it should be done as the SGL is created (see the sketch
  after this list).
* why doesn't it support PRP Lists?
* how does this interact with an iommu, if there is one? Must the
  address with bit 56 also be granted permission, or is that
  stripped off before any iommu comparisons?
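
As a rough sketch of the first point (hypothetical names, not taken from the
posted patch), the bit could be applied wherever each SGL entry's address is
produced, rather than in a separate pass afterwards:

    #include <linux/bits.h>
    #include <linux/types.h>

    /* Hypothetical name: A64FX_DMA_RELAXED_ORDER stands in for the "bit 56"
     * tag described above.  Setting it as each SGL entry is built avoids a
     * second walk over the list. */
    #define A64FX_DMA_RELAXED_ORDER	BIT_ULL(56)

    static inline dma_addr_t a64fx_mark_relaxed(dma_addr_t addr)
    {
    	return addr | A64FX_DMA_RELAXED_ORDER;
    }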








RE: [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500

2019-02-11 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  ow...@vger.kernel.org> On Behalf Of Jens Axboe
> Sent: Monday, February 11, 2019 8:50 PM
> To: James Bottomley ; Mikael
> Pettersson 
> Cc: Linux SPARC Kernel Mailing List ;
> linux-bl...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-scsi
> 
> Subject: Re: [5.0-rc5 regression] "scsi: kill off the legacy IO path"
> causes 5 minute delay during boot on Sun Blade 2500
...

> > We can either fix this by setting the QUEUE_FLAG_ADD_RANDOM if
> > there's no 0xB1 page or by setting the default as Jens proposed.
> 
> I'd recommend just doing my patch, since that'll be the same behavior
> that SCSI had before.

When blk-mq and scsi-mq were introduced to boost SAS SSDs over
1 million IOPS, this was purposely kept off, because it generated
a noticeable 3-12 us latency blip whenever the "fast_mix" code ran
out of bits. If Jens' patch changes the default to be enabled for
all scsi-mq users, that'll be a behavior change for users who have
only ever run the newer scsi-mq path (except for users that don't
trust the default and always set the value in sysfs).
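
For context, a rough sketch of what an enabled-by-default policy could look
like for a rotational device, given its struct request_queue *q (this is an
illustration of the flag in question, not the contents of Jens' patch):

	/* Illustration only: contribute request timing to the entropy pool
	 * for rotational devices, keep it off for non-rotational ones. */
	if (blk_queue_nonrot(q))
		blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
	else
		blk_queue_flag_set(QUEUE_FLAG_ADD_RANDOM, q);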




RE: remove exofs, the T10 OSD code and block/scsi bidi support V3

2018-12-20 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  ow...@vger.kernel.org> On Behalf Of Douglas Gilbert
> Sent: Wednesday, December 19, 2018 9:02 PM
> Subject: Re: remove exofs, the T10 OSD code and block/scsi bidi support V3
> 
... 
> While on the subject of bidi, the order of transfers: is the data-out
> (to the target) always before the data-in or is it the target device
> that decides (depending on the semantics of the command) who is first?

In SCSI, that was command-specific. Some commands necessitated intermixing
the transfers, while others did all of one direction before all of the
other.

---
Robert Elliott, HPE Persistent Memory






RE: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node

2018-11-26 Thread Elliott, Robert (Persistent Memory)



> -Original Message-
> From: Daniel Jordan [mailto:daniel.m.jor...@oracle.com]
> Sent: Monday, November 19, 2018 10:02 AM
> On Mon, Nov 12, 2018 at 10:15:46PM +0000, Elliott, Robert (Persistent Memory) 
> wrote:
> >
> > > -Original Message-
> > > From: Daniel Jordan 
> > > Sent: Monday, November 12, 2018 11:54 AM
> > >
> > > On Sat, Nov 10, 2018 at 03:48:14AM +, Elliott, Robert (Persistent
> > > Memory) wrote:
> > > > > -Original Message-
> > > > > From: linux-kernel-ow...@vger.kernel.org  > > > > ow...@vger.kernel.org> On Behalf Of Daniel Jordan
> > > > > Sent: Monday, November 05, 2018 10:56 AM
> > > > > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > > > > initialization within each node
> > > > >
> > ...
> > > > > In testing, a reasonable value turned out to be about a quarter of the
> > > > > CPUs on the node.
> > > > ...
> > > > > + /*
> > > > > +  * We'd like to know the memory bandwidth of the chip to
> > > > > calculate the
> > > > > +  * most efficient number of threads to start, but we can't.
> > > > > +  * In testing, a good value for a variety of systems was a
> > > > > quarter of the CPUs on the node.
> > > > > +  */
> > > > > + nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
> > > >
> > > >
> > > > You might want to base that calculation on and limit the threads to
> > > > physical cores, not hyperthreaded cores.
> > >
> > > Why?  Hyperthreads can be beneficial when waiting on memory.  That said, I
> > > don't have data that shows that in this case.
> >
> > I think that's only if there are some register-based calculations to do 
> > while
> > waiting. If both threads are just doing memory accesses, they'll both 
> > stall, and
> > there doesn't seem to be any benefit in having two contexts generate the IOs
> > rather than one (at least on the systems I've used). I think it takes longer
> > to switch contexts than to just turnaround the next IO.
> 
> (Sorry for the delay, Plumbers is over now...)
> 
> I guess we're both just waving our hands without data.  I've only got x86, so
> using a quarter of the CPUs rules out HT on my end.  Do you have a system that
> you can test this on, where using a quarter of the CPUs will involve HT?

I ran a short test with:
* HPE ProLiant DL360 Gen9 system
* Intel Xeon E5-2699 CPU with 18 physical cores (0-17) and 
  18 hyperthreaded cores (36-53)
* DDR4 NVDIMM-Ns (which run at regular DRAM DIMM speeds)
* fio workload generator
* cores on one CPU socket talking to a pmem device on the same CPU
* large (1 MiB) random writes (to minimize the threads getting CPU cache
  hits from each other)

Results:
* 31.7 GB/s  four threads, four physical cores (0,1,2,3)
* 22.2 GB/s  four threads, two physical cores (0,1,36,37)
* 21.4 GB/s  two threads, two physical cores (0,1)
* 12.1 GB/s  two threads, one physical core (0,36)
* 11.2 GB/s  one thread, one physical core (0)

So, I think it's important that the initialization threads run on
separate physical cores.

For the number of cores to use, one approach is:
memory bandwidth (number of interleaved channels * speed)
divided by 
CPU core max sustained write bandwidth

For example, this 2133 MT/s system is roughly:
68 GB/s    (4 * 17 GB/s nominal)
divided by
11.2 GB/s  (one core's performance)
which is
6 cores

ACPI HMAT will report that 68 GB/s number.  I'm not sure of
a good way to discover the 11.2 GB/s number.
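
A sketch of that arithmetic as user-space C (the 68 GB/s would come from
something like the HMAT; the 11.2 GB/s per-core figure still has to be
measured or assumed):

    #include <stdio.h>

    int main(void)
    {
    	/* Node memory bandwidth: interleaved channels * per-channel speed,
    	 * e.g., 4 * 17 GB/s = 68 GB/s (what ACPI HMAT would report). */
    	double node_bw = 4 * 17.0;
    	/* One core's max sustained write bandwidth, measured above. */
    	double core_bw = 11.2;

    	printf("suggested threads: %d\n", (int)(node_bw / core_bw)); /* 6 */
    	return 0;
    }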


fio job file:
[global]
direct=1
ioengine=sync
norandommap
randrepeat=0
bs=1M
runtime=20
time_based=1
group_reporting
thread
gtod_reduce=1
zero_buffers
cpus_allowed_policy=split
# pick the desired number of threads
numjobs=4
numjobs=2
numjobs=1

# CPU0: cores 0-17, hyperthreaded cores 36-53
[pmem0]
filename=/dev/pmem0
# pick the desired cpus_allowed list
cpus_allowed=0,1,2,3
cpus_allowed=0,1,36,37
cpus_allowed=0,36
cpus_allowed=0,1
cpus_allowed=0
rw=randwrite

Although most CPU time is in movnti instructions (non-temporal stores),
there is overhead in clearing the page cache and in the pmem block
driver; those won't be present in your initialization function. 
perf top shows:
  82.00%  [kernel]  [k] memcpy_flushcache
   5.23%  [kernel]  [k] gup_pgd_range
   3.41%  [kernel]  [k] __blkdev_direct_IO_simple
   2.38%  [kernel]  [k] pmem_make_request
   1.46%  [kernel]  [k] write_pmem
   1.29%  [kernel]  [k] pmem_do_bvec


---
Robert Elliott, HPE Persistent Memory






RE: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node

2018-11-12 Thread Elliott, Robert (Persistent Memory)



> -Original Message-
> From: Daniel Jordan 
> Sent: Monday, November 12, 2018 11:54 AM
> To: Elliott, Robert (Persistent Memory) 
> Cc: Daniel Jordan ; linux...@kvack.org;
> k...@vger.kernel.org; linux-kernel@vger.kernel.org; aarca...@redhat.com;
> aaron...@intel.com; a...@linux-foundation.org; alex.william...@redhat.com;
> b...@redhat.com; darrick.w...@oracle.com; dave.han...@linux.intel.com;
> j...@mellanox.com; jwad...@google.com; jiangshan...@gmail.com;
> mho...@kernel.org; mike.krav...@oracle.com; pavel.tatas...@microsoft.com;
> prasad.singamse...@oracle.com; rdun...@infradead.org;
> steven.sist...@oracle.com; tim.c.c...@intel.com; t...@kernel.org;
> vba...@suse.cz
> Subject: Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> initialization within each node
> 
> On Sat, Nov 10, 2018 at 03:48:14AM +, Elliott, Robert (Persistent
> Memory) wrote:
> > > -Original Message-
> > > From: linux-kernel-ow...@vger.kernel.org  > > ow...@vger.kernel.org> On Behalf Of Daniel Jordan
> > > Sent: Monday, November 05, 2018 10:56 AM
> > > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > > initialization within each node
> > >
...
> > > In testing, a reasonable value turned out to be about a quarter of the
> > > CPUs on the node.
> > ...
> > > + /*
> > > +  * We'd like to know the memory bandwidth of the chip to
> > > calculate the
> > > +  * most efficient number of threads to start, but we can't.
> > > +  * In testing, a good value for a variety of systems was a
> > > quarter of the CPUs on the node.
> > > +  */
> > > + nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
> >
> >
> > You might want to base that calculation on and limit the threads to
> > physical cores, not hyperthreaded cores.
> 
> Why?  Hyperthreads can be beneficial when waiting on memory.  That said, I
> don't have data that shows that in this case.

I think that's only if there are some register-based calculations to do while
waiting. If both threads are just doing memory accesses, they'll both stall, and
there doesn't seem to be any benefit in having two contexts generate the IOs
rather than one (at least on the systems I've used). I think it takes longer
to switch contexts than to just turnaround the next IO.


---
Robert Elliott, HPE Persistent Memory






RE: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node

2018-11-09 Thread Elliott, Robert (Persistent Memory)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  ow...@vger.kernel.org> On Behalf Of Daniel Jordan
> Sent: Monday, November 05, 2018 10:56 AM
> Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> initialization within each node
> 
> ...  The kernel doesn't
> know the memory bandwidth of a given system to get the most efficient
> number of threads, so there's some guesswork involved.  

The ACPI HMAT (Heterogeneous Memory Attribute Table) is designed to report
that kind of information, and could facilitate automatic tuning.

There was discussion last year about kernel support for it:
https://lore.kernel.org/lkml/20171214021019.13579-1-ross.zwis...@linux.intel.com/


> In testing, a reasonable value turned out to be about a quarter of the
> CPUs on the node.
...
> + /*
> +  * We'd like to know the memory bandwidth of the chip to
> calculate the
> +  * most efficient number of threads to start, but we can't.
> +  * In testing, a good value for a variety of systems was a
> quarter of the CPUs on the node.
> +  */
> + nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);


You might want to base that calculation on, and limit the threads to,
physical cores rather than hyperthreaded cores.
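
A rough sketch of how that could look in the quoted code, assuming
hyperthread siblings are visible through topology_sibling_cpumask() and
that the node cpumask covers whole cores (illustration only, not a tested
patch):

    #include <linux/cpumask.h>
    #include <linux/kernel.h>
    #include <linux/topology.h>

    /* Count one CPU per physical core in the node's cpumask, assuming all
     * hyperthread siblings of a core are present or absent together. */
    static unsigned int nr_physical_cores(const struct cpumask *cpumask)
    {
    	unsigned int cpu, cores = 0;

    	for_each_cpu(cpu, cpumask)
    		if (cpu == cpumask_first(topology_sibling_cpumask(cpu)))
    			cores++;

    	return cores;
    }

    /* ... then something like:
     *	nr_node_cpus = DIV_ROUND_UP(nr_physical_cores(cpumask), 4);
     */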

---
Robert Elliott, HPE Persistent Memory





RE: [PATCH 0/9] Allow persistent memory to be used like normal RAM

2018-10-23 Thread Elliott, Robert (Persistent Memory)



> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of 
> Dan Williams
> Sent: Monday, October 22, 2018 8:05 PM
> Subject: Re: [PATCH 0/9] Allow persistent memory to be used like normal RAM
> 
> On Mon, Oct 22, 2018 at 1:18 PM Dave Hansen  
> wrote:
...
> This series adds a new "driver" to which pmem devices can be
> attached.  Once attached, the memory "owned" by the device is
> hot-added to the kernel and managed like any other memory.  On

Would this memory be considered volatile (with the driver initializing
it to zeros), or persistent (contents are presented unchanged,
applications may guarantee persistence by using cache flush
instructions, fence instructions, and writing to flush hint addresses
per the persistent memory programming model)?

> > 1. The device re-binding hacks are ham-fisted at best.  We
> >need a better way of doing this, especially so the kmem
> >driver does not get in the way of normal pmem devices.
...
> To me this looks like teaching the nvdimm-bus and this dax_kmem driver
> to require explicit matching based on 'id'. The attachment scheme
> would look like this:
> 
> modprobe dax_kmem
> echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/new_id
> echo dax0.0 > /sys/bus/nd/drivers/dax_pmem/unbind
> echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/bind
> 
> At step1 the dax_kmem drivers will match no devices and stays out of
> the way of dax_pmem. It learns about devices it cares about by being
> explicitly told about them. Then unbind from the typical dax_pmem
> driver and attach to dax_kmem to perform the one way hotplug.
> 
> I expect udev can automate this by setting up a rule to watch for
> device-dax instances by UUID and call a script to do the detach /
> reattach dance.

Where would that rule be stored? Storing it on another device
is problematic. If that rule is lost, it could confuse other
drivers trying to grab device DAX devices for use as persistent
memory.

A new namespace mode would record the intended usage in the
device itself, eliminating dependencies. It could join the
other modes like:

ndctl create-namespace -m raw
    create /dev/pmem4 block device
ndctl create-namespace -m sector
    create /dev/pmem4s block device
ndctl create-namespace -m fsdax
    create /dev/pmem4 block device
ndctl create-namespace -m devdax
    create /dev/dax4.3 character device
    for use as persistent memory
ndctl create-namespace -m mem
    create /dev/mem4.3 character device
    for use as volatile memory

---
Robert Elliott, HPE Persistent Memory






RE: [PATCH 0/3] mm: Randomize free memory

2018-09-21 Thread Elliott, Robert (Persistent Memory)

> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  ow...@vger.kernel.org> On Behalf Of Kees Cook
> Sent: Friday, September 21, 2018 2:13 PM
> Subject: Re: [PATCH 0/3] mm: Randomize free memory
...
> I'd be curious to hear more about the mentioned cache performance
> improvements. I love it when a security feature actually _improves_
> performance. :)

It's been a problem in the HPC space:
http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

A kernel module called zonesort is available to try to help:
https://software.intel.com/en-us/articles/xeon-phi-software

and this abandoned patch series proposed that for the kernel:
https://lkml.org/lkml/2017/8/23/195

Dan's patch series doesn't attempt to ensure buffers won't conflict, but
it does reduce the chance that the buffers will. This will make performance
more consistent, albeit slower than "optimal" (which is near impossible
to attain in a general-purpose kernel).  That's better than forcing
users to deploy remedies like:
"To eliminate this gradual degradation, we have added a Stream
 measurement to the Node Health Check that follows each job;
 nodes are rebooted whenever their measured memory bandwidth
 falls below 300 GB/s."

---
Robert Elliott, HPE Persistent Memory





RE: [PATCH v2] RFC: clear 1G pages with streaming stores on x86

2018-07-24 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  ow...@vger.kernel.org> On Behalf Of Cannon Matthews
> Sent: Tuesday, July 24, 2018 9:37 PM
> Subject: Re: [PATCH v2] RFC: clear 1G pages with streaming stores on
> x86
> 
> Reimplement clear_gigantic_page() to clear gigabytes pages using the
> non-temporal streaming store instructions that bypass the cache
> (movnti), since an entire 1GiB region will not fit in the cache
> anyway.
>
> Doing an mlock() on a 512GiB 1G-hugetlb region previously would take
> on average 134 seconds, about 260ms/GiB which is quite slow. Using
> `movnti` and optimizing the control flow over the constituent small
> pages, this can be improved roughly by a factor of 3-4x, with the
> 512GiB mlock() taking only 34 seconds on average, or 67ms/GiB.

...
> - Are there any obvious pitfalls or caveats that have not been
> considered? 

Note that Kirill attempted something like this in 2012 - see
https://www.spinics.net/lists/linux-mm/msg40575.html

...
> +++ b/arch/x86/lib/clear_gigantic_page.c
> @@ -0,0 +1,29 @@
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) ||
> defined(CONFIG_HUGETLBFS)
> +#define PAGES_BETWEEN_RESCHED 64
> +void clear_gigantic_page(struct page *page,
> + unsigned long addr,

The previous attempt used cacheable stores in the page containing
addr to prevent an inevitable cache miss after the clearing completes.
This function is not using addr at all.

> + unsigned int pages_per_huge_page)
> +{
> + int i;
> + void *dest = page_to_virt(page);
> + int resched_count = 0;
> +
> + BUG_ON(pages_per_huge_page % PAGES_BETWEEN_RESCHED != 0);
> + BUG_ON(!dest);

Are those really possible conditions?  Is there a safer fallback
than crashing the whole kernel?

> +
> + for (i = 0; i < pages_per_huge_page; i +=
> PAGES_BETWEEN_RESCHED) {
> + __clear_page_nt(dest + (i * PAGE_SIZE),
> + PAGES_BETWEEN_RESCHED * PAGE_SIZE);
> + resched_count += cond_resched();
> + }
> + /* __clear_page_nt requrires and `sfence` barrier. */

requires an

...
> diff --git a/arch/x86/lib/clear_page_64.S
...
> +/*
> + * Zero memory using non temporal stores, bypassing the cache.
> + * Requires an `sfence` (wmb()) afterwards.
> + * %rdi - destination.
> + * %rsi - page size. Must be 64 bit aligned.
> +*/
> +ENTRY(__clear_page_nt)
> +	leaq	(%rdi,%rsi), %rdx
> +	xorl	%eax, %eax
> +	.p2align 4,,10
> +	.p2align 3
> +.L2:
> +	movnti	%rax, (%rdi)
> +	addq	$8, %rdi

Also consider using the AVX vmovntdq instruction (if available), the
most recent of which does 64-byte (cache line) sized transfers to
zmm registers. There's a hefty context switching overhead (e.g.,
304 clocks), but it might be worthwhile for 1 GiB (which
is 16,777,216 cache lines).

glibc memcpy() makes that choice for transfers > 75% of the L3 cache
size divided by the number of cores.  (last I tried, it was still
selecting "rep stosb" for large memset()s, although it has an
AVX-512 function available)

Even with that, one CPU core won't saturate the memory bus; multiple
CPU cores (preferably on the same NUMA node as the memory) need to
share the work.
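
As a user-space sketch of the vmovntdq idea (assumes AVX-512F, a 64-byte
aligned buffer, and a size that is a multiple of 64 bytes; in the kernel
this would also need kernel_fpu_begin()/kernel_fpu_end() around it):

    #include <immintrin.h>
    #include <stddef.h>

    /* Zero 'bytes' bytes with 64-byte non-temporal stores (vmovntdq via the
     * _mm512_stream_si512 intrinsic), then fence so the streaming stores are
     * ordered before later accesses.  Compile with -mavx512f. */
    static void clear_nt_avx512(void *dest, size_t bytes)
    {
    	__m512i zero = _mm512_setzero_si512();
    	char *p = dest;
    	size_t i;

    	for (i = 0; i < bytes; i += 64)
    		_mm512_stream_si512((__m512i *)(p + i), zero);

    	_mm_sfence();
    }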

---
Robert Elliott, HPE Persistent Memory






RE: [PATCH v2] acpi: nfit: document sysfs interface

2018-05-09 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
> Aishwarya Pant
> Sent: Friday, February 23, 2018 6:55 AM
> To: Dan Williams ; Rafael J. Wysocki
> ; Len Brown ; linux-
> nvd...@lists.01.org; linux-a...@vger.kernel.org; linux-
> ker...@vger.kernel.org; Jonathan Corbet ; Greg KH
> 
> Cc: Julia Lawall 
> Subject: [PATCH v2] acpi: nfit: document sysfs interface
> 
> This is an attempt to document the nfit sysfs interface. The
> descriptions have been collected from git commit logs and the ACPI
> specification 6.2.
> 
> Signed-off-by: Aishwarya Pant 
> ---
> Changes in v2:
> - Add descriptions for range_index and ecc_unit_size
> - Edit various descriptions as suggested
> 
>  Documentation/ABI/testing/sysfs-bus-nfit | 233
> +++
>  1 file changed, 233 insertions(+)
>  create mode 100644 Documentation/ABI/testing/sysfs-bus-nfit
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-nfit
> b/Documentation/ABI/testing/sysfs-bus-nfit
> new file mode 100644
> index ..619eb8ca0f99
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-bus-nfit
> @@ -0,0 +1,233 @@
> +For all of the nmem device attributes under nfit/*, see the 'NVDIMM
> Firmware
> +Interface Table (NFIT)' section in the ACPI specification
> +(http://www.uefi.org/specifications) for more details.
> +
> +What:/sys/bus/nd/devices/nmemX/nfit/serial
> +Date:Jun, 2015
> +KernelVersion:   v4.2
> +Contact: linux-nvd...@lists.01.org
> +Description:
> + (RO) Serial number of the NVDIMM (non-volatile dual in-line
> + memory module), assigned by the module vendor.

The endianness should be indicated. 

After some churn, I think it is always presented as big-endian now
(originally printed using the CPU's native endianness).

> +
> +What:/sys/bus/nd/devices/nmemX/nfit/handle
> +Date:Apr, 2015
> +KernelVersion:   v4.2
> +Contact: linux-nvd...@lists.01.org
> +Description:
> + (RO) The address (given by the _ADR object) of the device on
> its
> + parent bus of the NVDIMM device containing the NVDIMM region.
> +
> +
> +What:/sys/bus/nd/devices/nmemX/nfit/device
> +Date:Apr, 2015
> +KernelVersion:   v4.1
> +Contact: linux-nvd...@lists.01.org
> +Description:
> + (RO) Device id for the NVDIMM, assigned by the module vendor.

Since "device" is confusing (a problem with the ACPI field name), this should 
state that it is the "Identifier for the NVDIMM", i.e., the module itself,
not some device on the module.

> +
> +What:/sys/bus/nd/devices/nmemX/nfit/rev_id
> +Date:Jun, 2015
> +KernelVersion:   v4.2
> +Contact: linux-nvd...@lists.01.org
> +Description:
> + (RO) Revision of the NVDIMM, assigned by the module vendor.
> +
> +
> +What:/sys/bus/nd/devices/nmemX/nfit/phys_id
> +Date:Apr, 2015
> +KernelVersion:   v4.2
> +Contact: linux-nvd...@lists.01.org
> +Description:
> + (RO) Handle (i.e., instance number) for the SMBIOS (system
> + management BIOS) Memory Device structure describing the NVDIMM
> + containing the NVDIMM region.

Perhaps mention "dmidecode" as a tool to see the SMBIOS structures?

> +
> +What:/sys/bus/nd/devices/nmemX/nfit/flags
> +Date:Jun, 2015
> +KernelVersion:   v4.2
> +Contact: linux-nvd...@lists.01.org
> +Description:
> + (RO) The flags in the NFIT memory device sub-structure
> indicate
> + the state of the data on the nvdimm relative to its energy
> + source or last "flush to persistence".

Capitalize nvdimm.

The bits in this field are more varied than that description, though,
and you can't predict what future standards will toss into this field.
A more generic description might be better.

> +
> + The attribute is a translation of the 'NVDIMM State Flags'
> field
> + in section 5.2.25.3 'NVDIMM Region Mapping' Structure of the
> + ACPI specification 6.2.
> +
> + The health states are "save_fail", "restore_fail", "flush_fail",
> + "not_armed", "smart_event", "map_fail" and "smart_notify".
> +
> +
> +What:/sys/bus/nd/devices/nmemX/nfit/format
> +What:/sys/bus/nd/devices/nmemX/nfit/format1
> +What:/sys/bus/nd/devices/nmemX/nfit/formats
> +Date:Apr, 2016
> +KernelVersion:   v4.7
> +Contact: linux-nvd...@lists.01.org
> +Description:
> + (RO) The interface codes indicate support for persistent
> memory
> + mapped directly into system physical address space and / or a
> + block aperture access mechanism to the NVDIMM media.


RE: [PATCH] block: ratelimite pr_err on IO path

2018-04-11 Thread Elliott, Robert (Persistent Memory)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Jack Wang
> Sent: Wednesday, April 11, 2018 8:21 AM
> Subject: [PATCH] block: ratelimite pr_err on IO path
> 
> From: Jack Wang 
...
> - pr_err("%s: ref tag error at location %llu " \
> -"(rcvd %u)\n", iter->disk_name,
> -(unsigned long long)
> -iter->seed, be32_to_cpu(pi->ref_tag));
> + pr_err_ratelimited("%s: ref tag error at "
> +"location %llu (rcvd %u)\n",

Per process/coding-style.rst, you should keep a string like that on
one line even if that exceeds 80 columns:

  Statements longer than 80 columns will be broken into sensible chunks, unless
  exceeding 80 columns significantly increases readability and does not hide
  information. ... However, never break user-visible strings such as
  printk messages, because that breaks the ability to grep for them.
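
In other words, something along these lines (a sketch of the recommended
style, not the exact replacement patch):

	pr_err_ratelimited("%s: ref tag error at location %llu (rcvd %u)\n",
			   iter->disk_name,
			   (unsigned long long)iter->seed,
			   be32_to_cpu(pi->ref_tag));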





RE: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-20 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
> Ross Zwisler
...
> 
> On Wed, Dec 20, 2017 at 10:19:37AM -0800, Matthew Wilcox wrote:
...
> > initiator is a CPU?  I'd have expected you to expose a memory controller
> > abstraction rather than re-use storage terminology.
> 
> Yea, I agree that at first blush it seems weird.  It turns out that
> looking at it in sort of a storage initiator/target way is beneficial,
> though, because it allows us to cut down on the number of data values
> we need to represent.
> 
> For example the SLIT, which doesn't differentiate between initiator and
> target proximity domains (and thus nodes) always represents a system
> with N proximity domains using a NxN distance table.  This makes sense
> if every node contains both CPUs and memory.
> 
> With the introduction of the HMAT, though, we can have memory-only
> initiator nodes and we can explicitly associate them with their local 
> CPU.  This is necessary so that we can separate memory with different
> performance characteristics (HBM vs normal memory vs persistent memory,
> for example) that are all attached to the same CPU.
> 
> So, say we now have a system with 4 CPUs, and each of those CPUs has 3
> different types of memory attached to it.  We now have 16 total proximity
> domains, 4 CPU and 12 memory.

The CPU cores that make up a node can have performance restrictions of
their own; for example, they might max out at 10 GB/s even though the
memory controller supports 120 GB/s (meaning you need to use 12 cores
on the node to fully exercise memory).  It'd be helpful to report this,
so software can decide how many cores to use for bandwidth-intensive work.

> If we represent this with the SLIT we end up with a 16 X 16 distance table
> (256 entries), most of which don't matter because they are memory-to-
> memory distances which don't make sense.
> 
> In the HMAT, though, we separate out the initiators and the targets and
> put them into separate lists.  (See 5.2.27.4 System Locality Latency and
> Bandwidth Information Structure in ACPI 6.2 for details.)  So, this same
> config in the HMAT only has 4*12=48 performance values of each type, all
> of which convey meaningful information.
> 
> The HMAT indeed even uses the storage "initiator" and "target"
> terminology. :)

Centralized DMA engines (e.g., as used by the "DMA based blk-mq pmem
driver") have performance differences too.  A CPU might include
CPU cores that reach 10 GB/s, DMA engines that reach 60 GB/s, and
memory controllers that reach 120 GB/s.  I guess these would be
represented as extra initiators on the node?


---
Robert Elliott, HPE Persistent Memory






RE: [PATCH-resend] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages

2017-08-17 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: Andrew Morton [mailto:a...@linux-foundation.org]
> Sent: Thursday, August 17, 2017 5:10 PM
> To: Luck, Tony <tony.l...@intel.com>
> Cc: Borislav Petkov <b...@suse.de>; Dave Hansen <dave.han...@intel.com>;
> Naoya Horiguchi <n-horigu...@ah.jp.nec.com>; Elliott, Robert (Persistent
> Memory) <elli...@hpe.com>; x...@kernel.org; linux...@kvack.org; linux-
> ker...@vger.kernel.org
> Subject: Re: [PATCH-resend] mm/hwpoison: Clear PRESENT bit for kernel 1:1
> mappings of poison pages
> 
> On Wed, 16 Aug 2017 10:18:03 -0700 "Luck, Tony" <tony.l...@intel.com>
> wrote:
> 
> > Speculative processor accesses may reference any memory that has a
> > valid page table entry.  While a speculative access won't generate
> > a machine check, it will log the error in a machine check bank. That
> > could cause escalation of a subsequent error since the overflow bit
> > will be then set in the machine check bank status register.
> >
> > Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual
> > address of the page we want to map out otherwise we may trigger the
> > very problem we are trying to avoid.  We use a non-canonical address
> > that passes through the usual Linux table walking code to get to the
> > same "pte".
> >
> > Thanks to Dave Hansen for reviewing several iterations of this.
> 
> It's unclear (to lil ole me) what the end-user-visible effects of this
> are.
> 
> Could we please have a description of that?  So a) people can
> understand your decision to cc:stable and b) people whose kernels are
> misbehaving can use your description to decide whether your patch might
> fix the issue their users are reporting.

In general, the system is subject to halting due to uncorrectable
memory errors at addresses that software is not even accessing.  

The first error doesn't cause the crash, but if a second error happens
before the machine check handler services the first one, it'll find
the Overflow bit set and won't know what errors or how many errors
happened (e.g., it might have been problems in an instruction fetch,
and the instructions the CPU is slated to run are bogus).  Halting is 
the only safe thing to do.

For persistent memory, the BIOS reports known-bad addresses in the
ACPI ARS (address range scrub) table.  They are likely to keep
reappearing every boot since it is persistent memory, so you can't
just reboot and hope they go away.  Software is supposed to avoid
reading those addresses until it fixes them (e.g., writes new data
to those locations).  Even if it follows this rule, the system can
still crash due to speculative reads (e.g., prefetches) touching
those addresses.

Tony's patch marks those addresses in the page tables so the CPU
won't speculatively try to read them.

---
Robert Elliott, HPE Persistent Memory








RE: [PATCH 05/13] mpt3sas: Set NVMe device queue depth as 128

2017-07-11 Thread Elliott, Robert (Persistent Memory)
> +++ b/drivers/scsi/mpt3sas/mpt3sas_base.h
> @@ -115,7 +115,7 @@
> 
>  #define MPT3SAS_RAID_MAX_SECTORS 8192
>  #define MPT3SAS_HOST_PAGE_SIZE_4K12
> -
> +#define MPT3SAS_NVME_QUEUE_DEPTH 128
...
> + /*TODO-right Queue Depth?*/
> + qdepth = MPT3SAS_NVME_QUEUE_DEPTH;
> + ds = "NVMe";

The native NVMe driver is getting a modparam to set that value (rather than
using a #define of 1024) in this patch:
http://lists.infradead.org/pipermail/linux-nvme/2017-July/011734.html

Perhaps this driver should do the same.
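To make that concrete, here is a minimal sketch of what such a module
parameter could look like (the parameter name and wiring are illustrative,
not taken from the actual mpt3sas or NVMe patches):

static int nvme_qdepth = MPT3SAS_NVME_QUEUE_DEPTH;
module_param(nvme_qdepth, int, 0444);
MODULE_PARM_DESC(nvme_qdepth, "queue depth for NVMe devices (default=128)");

	...
	/* in the device-add path, instead of the bare #define */
	qdepth = nvme_qdepth;
	ds = "NVMe";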

---
Robert Elliott, HPE Persistent Memory




RE: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages

2017-06-23 Thread Elliott, Robert (Persistent Memory)
> > > + if (set_memory_np(decoy_addr, 1))
> > > + pr_warn("Could not invalidate pfn=0x%lx from 1:1 map \n",

Another concept to consider is mapping the page as UC rather than
completely unmapping it.

The uncorrectable error scope could be smaller than a page size, like:
* memory ECC width (e.g., 8 bytes)
* cache line size (e.g., 64 bytes)
* block device logical block size (e.g., 512 bytes, for persistent memory)

UC preserves the ability to access adjacent data within the page that
hasn't gone bad, and is particularly useful for persistent memory.
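A rough sketch of that alternative, reusing the names from the quoted patch
(untested; whether set_memory_uc() tolerates the decoy-address trick or needs
the real 1:1 address would have to be checked):

	if (set_memory_uc(decoy_addr, 1))
		pr_warn("Could not mark pfn=0x%lx UC in 1:1 map\n", pfn);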

---
Robert Elliott, HPE Persistent Memory




RE: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages

2017-06-21 Thread Elliott, Robert (Persistent Memory)
> + decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
> +#else
> +#error "no unused virtual bit available"
> +#endif
> +
> + if (set_memory_np(decoy_addr, 1))
> + pr_warn("Could not invalidate pfn=0x%lx from 1:1 map \n", pfn);

Does this patch handle breaking up 512 GiB, 1 GiB or 2 MiB page mappings
if it's just trying to mark a 4 KiB page as bad?

Although the kernel doesn't use MTRRs itself anymore, what if the system
BIOS still uses them for some memory regions, and the bad address falls in
an MTRR region?

---
Robert Elliott, HPE Persistent Memory






RE: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages

2017-06-21 Thread Elliott, Robert (Persistent Memory)

> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Luck, Tony
> Sent: Wednesday, June 21, 2017 12:54 PM
> To: Naoya Horiguchi 
> Cc: Borislav Petkov ; Dave Hansen ;
> x...@kernel.org; linux...@kvack.org; linux-kernel@vger.kernel.org

(adding linux-nvdimm list in this reply)

> Subject: Re: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1
> mappings of poison pages
> 
> On Wed, Jun 21, 2017 at 02:12:27AM +, Naoya Horiguchi wrote:
> 
> > We had better have a reverse operation of this to cancel the unmapping
> > when unpoisoning?
> 
> When we have unpoisoning, we can add something.  We don't seem to have
> an inverse function for "set_memory_np" to just flip the _PRESENT bit
> back on again. But it would be trivial to write a set_memory_pp().
> 
> Since we'd be doing this after the poison has been cleared, we wouldn't
> need to play games with the address.  We'd just use:
> 
>   set_memory_pp((unsigned long)pfn_to_kaddr(pfn), 1);
> 
> -Tony

Persistent memory does have unpoisoning and would require this inverse
operation - see drivers/nvdimm/pmem.c pmem_clear_poison() and core.c
nvdimm_clear_poison().

---
Robert Elliott, HPE Persistent Memory






RE: [PATCH 2/3] tools lib api fs: Add sysfs__write_int function

2017-03-27 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of kan.li...@intel.com
> Sent: Thursday, March 23, 2017 1:26 PM
> Subject: [PATCH 2/3] tools lib api fs: Add sysfs__write_int function
...
> diff --git a/tools/lib/api/fs/fs.c b/tools/lib/api/fs/fs.c
...
> +
> +int sysfs__write_int(const char *entry, int value)
> +{
> + char path[PATH_MAX];
> + const char *sysfs = sysfs__mountpoint();
> +
> + if (!sysfs)
> + return -1;
> +
> + snprintf(path, sizeof(path), "%s/%s", sysfs, entry);
> +
> + return filename__write_int(path, value);

In the unlikely event of an overflow, it would be safer to confirm that
the string fit into the path array (by using scnprintf()?) before trying
to open that path.
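For example (a sketch only; snprintf()'s return value is enough to detect
truncation, while scnprintf() would need a slightly different check):

	int ret;

	ret = snprintf(path, sizeof(path), "%s/%s", sysfs, entry);
	if (ret < 0 || ret >= (int)sizeof(path))
		return -1;	/* path was truncated; don't try to open it */

	return filename__write_int(path, value);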





RE: [RFC PATCH v4 15/28] Add support to access persistent memory in the clear

2017-03-17 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Tom Lendacky
> Sent: Thursday, February 16, 2017 9:45 AM
> Subject: [RFC PATCH v4 15/28] Add support to access persistent memory in
> the clear
> 
> Persistent memory is expected to persist across reboots. The encryption
> key used by SME will change across reboots which will result in corrupted
> persistent memory.  Persistent memory is handed out by block devices
> through memory remapping functions, so be sure not to map this memory as
> encrypted.

The system might be able to save and restore the correct encryption key for a 
region of persistent memory, in which case it does need to be mapped as
encrypted.

This might deserve a new EFI_MEMORY_ENCRYPTED attribute bit so the
system firmware can communicate that information to the OS (in the
UEFI memory map and the ACPI NFIT SPA Range structures).  It wouldn't
likely ever be added to the E820h table - ACPI 6.1 already obsoleted the
Extended Attribute for AddressRangeNonVolatile.

> 
> Signed-off-by: Tom Lendacky 
> ---
>  arch/x86/mm/ioremap.c |2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> index b0ff6bc..c6cb921 100644
> --- a/arch/x86/mm/ioremap.c
> +++ b/arch/x86/mm/ioremap.c
> @@ -498,6 +498,8 @@ static bool
> memremap_should_map_encrypted(resource_size_t phys_addr,
>   case E820_TYPE_ACPI:
>   case E820_TYPE_NVS:
>   case E820_TYPE_UNUSABLE:
> + case E820_TYPE_PMEM:
> + case E820_TYPE_PRAM:
>   return false;
>   default:
>   break;

E820_TYPE_RESERVED is also used to report persistent memory in
some systems (patch 16 adds that for other reasons).

You might want to intercept the persistent memory types in the 
efi_mem_type(phys_addr) switch statement earlier in the function
as well.  https://lkml.org/lkml/2017/3/13/357 recently mentioned that
"in qemu hotpluggable memory isn't put into E820," with the latest 
information only in the UEFI memory map.

Persistent memory can be reported there as:
* EfiReservedMemoryType type with the EFI_MEMORY_NV attribute
* EfiPersistentMemory type with the EFI_MEMORY_NV attribute

Even the UEFI memory map is not authoritative, though.  To really
determine what is in these regions requires parsing the ACPI NFIT
SPA Ranges structures.  Parts of the E820 or UEFI regions could be
reported as volatile there and should thus be encrypted.
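As a sketch of the efi_mem_type() check suggested above (placement and the
exact cases are illustrative; the NFIT parsing would still be needed on top
of this):

	switch (efi_mem_type(phys_addr)) {
	case EFI_PERSISTENT_MEMORY:
		return false;			/* pmem: map unencrypted */
	case EFI_RESERVED_TYPE:
		if (efi_mem_attributes(phys_addr) & EFI_MEMORY_NV)
			return false;		/* reserved + NV: also pmem */
		break;
	default:
		break;
	}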

---
Robert Elliott, HPE Persistent Memory





RE: [PATCHv3 4/4] MAINTAINERS: Remove powerpc's opal match

2017-02-17 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: linux-block-ow...@vger.kernel.org [mailto:linux-block-
> ow...@vger.kernel.org] On Behalf Of Jon Derrick
> Sent: Thursday, February 16, 2017 10:15 AM
> To: Michael Ellerman 
> Cc: Jens Axboe ; Rafael Antognolli
> ; Greg Kroah-Hartman
> ; linux-kernel@vger.kernel.org; linux-
> bl...@vger.kernel.org; linuxppc-...@lists.ozlabs.org; Christoph Hellwig
> ; Scott Bauer 
> Subject: Re: [PATCHv3 4/4] MAINTAINERS: Remove powerpc's opal match
> 
> Thanks everyone. Sorry about the mess :)
> 
> On 02/15/2017 10:23 PM, Michael Ellerman wrote:
> > Jon Derrick  writes:
> >
> >> PPC's 'opal' match pattern also matches block/sed-opal.c, where it looks
> >> like the 'arch/powerpc' file pattern should be enough to match powerpc
> >> opal code by itself. Remove the opal regex pattern from powerpc.
> >
> > We thought of it first.
> >
> > Can't you just rename your driver, Opal Storage Specification, so "oss",
> > that should be pretty unique?
> >
> > ... :)

The library could easily be used for devices supporting the Opalite and Pyrite 
SSCs, not just the Opal SSC. With some effort, I suspect that Enterprise SSC
could also be supported.  So, a broader name might indeed be useful.

The full names of the specifications are:
TCG Storage Security Subsystem Class: Opal
TCG Storage Security Subsystem Class: Opalite
TCG Storage Security Subsystem Class: Pyrite
TCG Storage Security Subsystem Class: Enterprise

---
Robert Elliott, HPE Persistent Memory





RE: [RFC] memcpy_nocache() and memcpy_writethrough()

2017-01-01 Thread Elliott, Robert (Persistent Memory)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Al Viro
> Sent: Friday, December 30, 2016 8:26 PM
> Subject: [RFC] memcpy_nocache() and memcpy_writethrough()
> 
...
> Why does pmem need writethrough warranties, anyway?  

Using either 
* nontemporal store instructions; or
* following regular store instructions with a sequence of cache flush
and store fence instructions (e.g., clflushopt or clwb + sfence)

ensures that write data has reached an "ADR-safe zone" that the system
promises will be persistent even if there is a surprise power loss or
a CPU suffers from an error that isn't totally catastrophic (e.g., the
CPU getting disconnected from the SDRAM will always lose data on an
NVDIMM-N).

The ACPI NFIT Flush Hints provide a guarantee that data is safe even
in the case of a CPU error, but that feature is not present in all
systems for all types of persistent memory.

> All explanations I've found on the net had been along the lines of
> "we should not store a pointer to pmem data structure until the
> structure itself had been committed to pmem itself" and it looks
> like something that ought to be a job for barriers - after all,
> we don't want the pointer store to be observed by _anything_
> in the system until the earlier stores are visible, so what makes
> pmem different from e.g. another CPU or a PCI busmaster, or...

Newly written data becomes globally visible before it becomes ADR-safe.
This means software could act on the new data before a power loss, then
see the old data reappear after the power loss - not good.  Software
needs to understand that any data in the process of being written is
indeterminate until the persistence guarantee is met.  The BTT shows
one way that software can avoid that problem.
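For reference, a sketch of the second option above using the x86 helpers
(alignment is ignored and CLWB support is assumed; clflushopt or clflush
would be the fallbacks):

	char *ptr;

	memcpy(dst, src, len);		/* globally visible, not yet ADR-safe */
	for (ptr = dst; ptr < (char *)dst + len;
	     ptr += boot_cpu_data.x86_clflush_size)
		clwb(ptr);		/* push each cache line toward the ADR domain */
	wmb();				/* sfence: flushes complete before later stores */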

---
Robert Elliott, HPE Persistent Memory




RE: Observing Softlockup's while running heavy IOs

2016-08-19 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: Sreekanth Reddy [mailto:sreekanth.re...@broadcom.com]
> Sent: Friday, August 19, 2016 6:45 AM
> To: Elliott, Robert (Persistent Memory) <elli...@hpe.com>
> Subject: Re: Observing Softlockup's while running heavy IOs
> 
...
> Yes I am also observing that all the interrupts are routed to one
> CPU.  But still I observing softlockups (sometime hardlockups)
> even when I set rq_affinity to 2.

That'll ensure the block layer's completion handling is done there,
but not your driver's interrupt handler (which precedes the block
layer completion handling).

 
> Is their any way to route the interrupts the same CPUs which has
> submitted the corresponding IOs?
> or
> Is their any way/option in the irqbalance/kernel which can route
> interrupts to CPUs (enabled in affinity_hint) in round robin manner
> after specific time period.

Ensure your driver creates one MSIX interrupt per CPU core, uses
that interrupt for all submissions from that core, and reports
that it would like that interrupt to be serviced by that core
in /proc/irq/nnn/affinity_hint.  

Even with hyperthreading, this needs to be based on the logical
CPU cores, not just the physical core or the physical socket.
You can swamp a logical CPU core as easily as a physical CPU core.

Then, provide an irqbalance policy script that honors the
affinity_hint for your driver, or turn off irqbalance and
manually set /proc/irq/nnn/smp_affinity to match the
affinity_hint.  

Some versions of irqbalance honor the hints; some purposely
don't and need to be overridden with a policy script.
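As a sketch of the per-CPU hint setup in the driver (msix_entries and the
loop are illustrative, not the actual mpt3sas code):

	int cpu, index = 0;

	for_each_online_cpu(cpu)
		irq_set_affinity_hint(msix_entries[index++].vector,
				      cpumask_of(cpu));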


---
Robert Elliott, HPE Persistent Memory




RE: Observing Softlockup's while running heavy IOs

2016-08-18 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Sreekanth Reddy
> Sent: Thursday, August 18, 2016 12:56 AM
> Subject: Observing Softlockup's while running heavy IOs
> 
> Problem statement:
> Observing softlockups while running heavy IOs on 8 SSD drives
> connected behind our LSI SAS 3004 HBA.
> 
...
> Observing a loop in the IO path, i.e only one CPU is busy with
> processing the interrupts and other CPUs (in the affinity_hint mask)
> are busy with sending the IOs (these CPUs are not yet all receiving
> any interrupts). For example, only CPU6 is busy with processing the
> interrupts from IRQ 219 and remaining CPUs i.e CPU 7,8,9,10 & 11 are
> just busy with pumping the IOs and they never processed any IO
> interrupts from IRQ 219. So we are observing softlockups due to
> existence this loop in the IO Path.
> 
> We may not observe these softlockups if irqbalancer might have
> balanced the interrupts among the CPUs enabled in the particular
> irq's
> affinity_hint mask. so that all the CPUs are equaly busy with send
> IOs
> and processing the interrupts. I am not sure how irqbalancer balance
> the load among the CPUs, but here I see only one CPU from irq's
> affinity_hint mask is busy with interrupts and remaining CPUs won't
> receive any interrupts from this IRQ.
> 
> Please help me with any suggestions/recomendations to slove/limit
> these kind of softlockups. Also please let me known if I have missed
> any setting in the irqbalance.
> 

The CPUs need to be forced to self-throttle by processing interrupts for 
their own submissions, which reduces the time they can submit more IOs.

See https://lkml.org/lkml/2014/9/9/931 for discussion of this
problem when blk-mq was added.


---
Robert Elliott, HPE Persistent Memory





RE: [PATCH] x86/microcode/intel: Quieten down microcode updates on large systems

2016-06-08 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Andi Kleen
> Sent: Tuesday, May 31, 2016 4:23 PM
> To: x...@kernel.org
> Cc: linux-kernel@vger.kernel.org; Andi Kleen 
> Subject: [PATCH] x86/microcode/intel: Quieten down microcode updates
> on large systems
...
> diff --git a/arch/x86/kernel/cpu/microcode/intel.c
> @@ -808,8 +809,14 @@ static int collect_cpu_info(int cpu_num, struct
> cpu_signature *csig)
>   }
> 
>   csig->rev = c->microcode;
> - pr_info("CPU%d sig=0x%x, pf=0x%x, revision=0x%x\n",
> - cpu_num, csig->sig, csig->pf, csig->rev);
> +
> + /* No extra locking on prev, races are harmless. */
> + if (csig->sig != prev.sig || csig->pf != prev.pf ||
> + csig->rev != prev.rev) {
> + pr_info("CPU sig=0x%x, pf=0x%x, revision=0x%x\n",
> + csig->sig, csig->pf, csig->rev);
> + prev = *csig;
> + }
...
> - pr_info("CPU%d updated to revision 0x%x, date = %04x-%02x-
> %02x\n",
> - cpu, val[1],
> - mc->hdr.date & 0x,
> - mc->hdr.date >> 24,
> - (mc->hdr.date >> 16) & 0xff);
> + if (val[1] != prev_rev) {
> + pr_info("CPU updated to revision 0x%x, date = %04x-%02x-
> %02x\n",

"CPU(s)" would help convey that the messages now apply to one or 
more CPUs.



RE: [PATCH] mpt3sas: Do scsi_remove_host() before deleting SAS PHY objects

2016-05-13 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Calvin Owens
> Sent: Friday, May 13, 2016 3:28 PM
...
> Subject: [PATCH] mpt3sas: Do scsi_remove_host() before deleting SAS PHY
> objects
> 
...
> The issue is that enclosure_remove_device() expects to be able to re-add
> the device that was previously enclosured: so with SES running, the order
> we unwind things matters in a way it otherwise wouldn't.
> 
> Since mpt3sas deletes the SAS objects before the SCSI hosts, enclosure
> ends up trying to re-parent the SCSI device from the enclosure to the SAS
> PHY which has already been deleted. This obviously ends in sadness.
> 
> The fix appears to be simple: just call scsi_remove_host() before we call
> sas_port_delete() and/or sas_remove_host().
> 
...
> diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> @@ -8149,6 +8149,8 @@ void scsih_remove(struct pci_dev *pdev)
>   _scsih_raid_device_remove(ioc, raid_device);
>   }
> 
> + scsi_remove_host(shost);
> +
>   /* free ports attached to the sas_host */
>   list_for_each_entry_safe(mpt3sas_port, next_port,
>  >sas_hba.sas_port_list, port_list) {
> @@ -8172,7 +8174,6 @@ void scsih_remove(struct pci_dev *pdev)
>   }
> 
>   sas_remove_host(shost);
> - scsi_remove_host(shost);
>   mpt3sas_base_detach(ioc);
>   spin_lock(_lock);
>   list_del(>list);

That change matches the pattern of all other drivers calling
sas_remove_host, except for one:
drivers/message/fusion/mptsas.c

That consensus means this comment in drivers/scsi/scsi_transport_sas.c
is wrong:

/**
 * sas_remove_host  -  tear down a Scsi_Host's SAS data structures
 * @shost:  Scsi Host that is torn down
 *
 * Removes all SAS PHYs and remote PHYs for a given Scsi_Host.
 * Must be called just before scsi_remove_host for SAS HBAs.
 */
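If the ordering used by the other drivers (and by this patch) is the intended
one, the kernel-doc could be updated to match, something like:

 * Removes all SAS PHYs and remote PHYs for a given Scsi_Host.
 * Must be called after scsi_remove_host for SAS HBAs.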




RE: [PATCH 2/5] efibc: Fix excessive stack footprint warning

2016-05-09 Thread Elliott, Robert (Persistent Memory)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Matt Fleming
> Sent: Friday, May 06, 2016 4:39 PM
...
> Subject: [PATCH 2/5] efibc: Fix excessive stack footprint warning
> 
> From: Jeremy Compostella 
> 
...
> 
> -static void efibc_set_variable(const char *name, const char *value)
> +static int efibc_set_variable(const char *name, const char *value)
>  {
>   int ret;
>   efi_guid_t guid = LINUX_EFI_LOADER_ENTRY_GUID;
> - struct efivar_entry entry;
> + struct efivar_entry *entry;
>   size_t size = (strlen(value) + 1) * sizeof(efi_char16_t);
> 
> - if (size > sizeof(entry.var.Data))
> + if (size > sizeof(entry->var.Data)) {
>   pr_err("value is too large");

That pr_err is introduced by patch 25/40 of the first series.

How about including the name of the variable for which this is failing, 
like the final pr_err?

> + return -EINVAL;
> + }
> 
> - efibc_str_to_str16(name, entry.var.VariableName);
> - efibc_str_to_str16(value, (efi_char16_t *)entry.var.Data);
> - memcpy(, , sizeof(guid));
> + entry = kmalloc(sizeof(*entry), GFP_KERNEL);
> + if (!entry) {
> + pr_err("failed to allocate efivar entry");

How about including the name of the variable for which this
is failing, like the final pr_err?

> + return -ENOMEM;
> + }
> 
> - ret = efivar_entry_set(,
> + efibc_str_to_str16(name, entry->var.VariableName);
> + efibc_str_to_str16(value, (efi_char16_t *)entry->var.Data);
> + memcpy(>var.VendorGuid, , sizeof(guid));
> +
> + ret = efivar_entry_set(entry,
>  EFI_VARIABLE_NON_VOLATILE
>  | EFI_VARIABLE_BOOTSERVICE_ACCESS
>  | EFI_VARIABLE_RUNTIME_ACCESS,
> -size, entry.var.Data, NULL);
> +size, entry->var.Data, NULL);
>   if (ret)
>   pr_err("failed to set %s EFI variable: 0x%x\n",
>  name, ret);
> +
> + kfree(entry);
> + return ret;
>  }
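For example (illustrative only), both of the early messages could carry the
variable name the same way the final pr_err already does:

		pr_err("value for %s is too large", name);
	...
		pr_err("failed to allocate efivar entry for %s", name);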



bnx2x in 4.6rc7 with FW 7.13.1.0 not present

2016-05-09 Thread Elliott, Robert (Persistent Memory)
After upgrading a system from kernel 4.2 to 4.6rc7, there is an extra
two-minute delay while booting due to these problems:

[   47.977221] bnx2x :04:00.1: Direct firmware load for 
bnx2x/bnx2x-e2-7.13.1.0.fw failed with error -2
[   48.029997] bnx2x :04:00.1: Falling back to user helper
[**] A start job is running for Network Manager Wait Online (1min 27s / 
no limit)
[  108.064132] bnx2x: [bnx2x_init_firmware:13472(eno50)]Can't load firmware 
file bnx2x/bnx2x-e2-7.13.1.0.fw
[  108.119951] bnx2x: [bnx2x_func_hw_init:5785(eno50)]Error loading firmware
[  108.156522] bnx2x: [bnx2x_nic_load:2727(eno50)]HW init failed, aborting
[* ] A start job is running for Network Manager Wait Online (1min 28s / 
no limit)

These are the versions in /lib/firmware/bnx2x (a RHEL installation):
bnx2x-e2-6.2.9.0.fwbnx2x-e1-7.8.19.0.fwbnx2x-e1h-7.8.2.0.fw
bnx2x-e1h-6.2.9.0.fw   bnx2x-e1-7.8.2.0.fw bnx2x-e2-6.0.34.0.fw
bnx2x-e1-6.2.9.0.fwbnx2x-e1h-6.0.34.0.fw   bnx2x-e2-6.2.5.0.fw
bnx2x-e1-6.0.34.0.fw   bnx2x-e1h-6.2.5.0.fwbnx2x-e2-7.0.20.0.fw
bnx2x-e1-6.2.5.0.fwbnx2x-e1h-7.0.20.0.fw   bnx2x-e2-7.0.23.0.fw
bnx2x-e1-7.0.20.0.fw   bnx2x-e1h-7.0.23.0.fw   bnx2x-e2-7.0.29.0.fw
bnx2x-e1-7.0.23.0.fw   bnx2x-e1h-7.0.29.0.fw   bnx2x-e2-7.10.51.0.fw
bnx2x-e1-7.0.29.0.fw   bnx2x-e1h-7.10.51.0.fw  bnx2x-e2-7.12.30.0.fw
bnx2x-e1-7.10.51.0.fw  bnx2x-e1h-7.12.30.0.fw  bnx2x-e2-7.2.16.0.fw
bnx2x-e1-7.12.30.0.fw  bnx2x-e1h-7.2.16.0.fw   bnx2x-e2-7.2.51.0.fw
bnx2x-e1-7.2.16.0.fw   bnx2x-e1h-7.2.51.0.fw   bnx2x-e2-7.8.17.0.fw
bnx2x-e1-7.2.51.0.fw   bnx2x-e1h-7.8.17.0.fw   bnx2x-e2-7.8.19.0.fw
bnx2x-e1-7.8.17.0.fw   bnx2x-e1h-7.8.19.0.fw   bnx2x-e2-7.8.2.0.fw

Reverting 5e091e7a "bnx2x: Utilize FW 7.13.1.0" avoids the problem.
I assume getting 7.13.1.0 from the linux-firmware project will also do so.

Could the driver fall back to an older firmware version more gracefully?
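A hypothetical sketch of such a fallback, trying the newest firmware first
(the file list and surrounding names are illustrative, not the actual bnx2x
code):

static const char * const bnx2x_e2_fw_names[] = {
	"bnx2x/bnx2x-e2-7.13.1.0.fw",
	"bnx2x/bnx2x-e2-7.12.30.0.fw",
	"bnx2x/bnx2x-e2-7.10.51.0.fw",
};

	for (i = 0; i < ARRAY_SIZE(bnx2x_e2_fw_names); i++) {
		rc = request_firmware(&bp->firmware, bnx2x_e2_fw_names[i],
				      &bp->pdev->dev);
		if (!rc)
			break;	/* use the newest firmware actually installed */
	}

(The driver would of course have to keep its firmware-dependent structures
compatible with each version it is willing to fall back to.)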

4.6rc7
==
[    0.00] Linux version 4.6.0-rc7+ ... (gcc version 4.8.5 20150623 (Red 
Hat 4.8.5-4) (GCC) ) #7 SMP Mon May 9 14:22:53 EDT 2016
[    0.00] Command line: BOOT_IMAGE=/vmlinuz-4.6.0-rc7+ 
root=/dev/mapper/rhel-root ro crashkernel=auto rd.lvm.lv=rhel/root 
rd.lvm.lv=rhel/swap rhgb quiet efi=debug ... ignore_loglevel 
console=ttyS1,115200 console=tty0 LANG=en_US.UTF-8 
...
[   16.143816] bnx2x: QLogic 5771x/578xx 10/20-Gigabit Ethernet Driver bnx2x 
1.712.30-0 (2014/02/10)
[   16.144780] bnx2x :04:00.0: msix capability found
[   16.144995] bnx2x :04:00.0: part number 0-0-0-0
[   16.237686] bnx2x :04:00.1: msix capability found
[   16.237859] bnx2x :04:00.1: part number 0-0-0-0
[   16.525718] bnx2x :04:00.1 eno50: renamed from eth2
[   16.648816] bnx2x :04:00.0 eno49: renamed from eth1
...
[   47.333227] IPv6: ADDRCONF(NETDEV_UP): eno2: link is not ready
[   47.498841] IPv6: ADDRCONF(NETDEV_UP): eno2: link is not ready
[   47.530225] IPv6: ADDRCONF(NETDEV_UP): eno3: link is not ready
[   47.680716] IPv6: ADDRCONF(NETDEV_UP): eno3: link is not ready
[   47.714908] IPv6: ADDRCONF(NETDEV_UP): eno1: link is not ready
[   47.874737] IPv6: ADDRCONF(NETDEV_UP): eno1: link is not ready
[   47.907457] IPv6: ADDRCONF(NETDEV_UP): eno50: link is not ready
[   47.977221] bnx2x :04:00.1: Direct firmware load for 
bnx2x/bnx2x-e2-7.13.1.0.fw failed with error -2
[   48.029997] bnx2x :04:00.1: Falling back to user helper
[   51.104604] tg3 :02:00.1 eno2: Link is up at 1000 Mbps, full duplex
[   51.142403] tg3 :02:00.1 eno2: Flow control is off for TX and off for RX
[   51.180648] tg3 :02:00.1 eno2: EEE is disabled
[**    ] A start job is running for Network Manager Wait Online (1min 27s / no 
limit)
[  108.064132] bnx2x: [bnx2x_init_firmware:13472(eno50)]Can't load firmware 
file bnx2x/bnx2x-e2-7.13.1.0.fw
[  108.119951] bnx2x: [bnx2x_func_hw_init:5785(eno50)]Error loading firmware
[  108.156522] bnx2x: [bnx2x_nic_load:2727(eno50)]HW init failed, aborting
[* ] A start job is running for Network Manager Wait Online (1min 28s / no 
limit)
[  108.435477] IPv6: ADDRCONF(NETDEV_CHANGE): eno2: link becomes ready
[  108.475065] IPv6: ADDRCONF(NETDEV_UP): eno49: link is not ready
[  108.556168] bnx2x :04:00.0: Direct firmware load for 
bnx2x/bnx2x-e2-7.13.1.0.fw failed with error -2
[  108.608229] bnx2x :04:00.0: Falling back to user helper
[  169.337933] IPv6: ADDRCONF(NETDEV_UP): eno1: link is not ready
[  169.338749] IPv6: ADDRCONF(NETDEV_UP): eno50: link is not ready
[  169.339772] IPv6: ADDRCONF(NETDEV_UP): eno49: link is not ready
[  169.340630] IPv6: ADDRCONF(NETDEV_UP): eno4: link is not ready
 Starting WPA Supplicant daemon...
[  OK  ] Started WPA Supplicant daemon.
 Starting Network Manager Script Dispatcher Service...
[  OK  ] Started Network Manager Script Dispatcher Service.
[  OK  ] Started Network Manager Wait Online.
 Starting LSB: Bring up/down networking...
[FAILED] Failed to start LSB: Bring up/down networking.
See 'systemctl 

RE: [PATCH v3 2/2] i2c, i2c_imc: Add DIMM bus code

2016-05-02 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: Andy Lutomirski [mailto:l...@kernel.org]
> Sent: Thursday, April 28, 2016 8:03 PM
> To: Wolfram Sang ; Christoph Hellwig 
> Cc: Boaz Harrosh ; One Thousand Gnomes
> ; Rui Wang ; Jean Delvare
> ; Alun Evans ; Robert Elliott
> ; linux-...@vger.kernel.org; Mauro Carvalho Chehab
> ; Paul Bolle ; Tony Luck
> ; linux-kernel@vger.kernel.org; Guenter Roeck
> ; Andy Lutomirski 
> Subject: [PATCH v3 2/2] i2c, i2c_imc: Add DIMM bus code
> 
> Add i2c_scan_dimm_bus to declare that a particular i2c_adapter
> contains DIMMs.  This will probe (and autoload modules!) for useful
> SMBUS devices that live on DIMMs.  i2c_imc calls it.
> 
> As more SMBUS-addressable DIMM components become supported, this
> code can be extended to probe for them.
> 
> Signed-off-by: Andy Lutomirski 
> ---
>  drivers/i2c/busses/Kconfig|   5 ++
>  drivers/i2c/busses/Makefile   |   4 ++
>  drivers/i2c/busses/dimm-bus.c | 107
> ++
>  drivers/i2c/busses/i2c-imc.c  |   3 ++
>  include/linux/i2c/dimm-bus.h  |  20 
>  5 files changed, 139 insertions(+)
>  create mode 100644 drivers/i2c/busses/dimm-bus.c
>  create mode 100644 include/linux/i2c/dimm-bus.h
> 
> diff --git a/drivers/i2c/busses/Kconfig b/drivers/i2c/busses/Kconfig
> index 3c05de897566..10aa87872408 100644
> --- a/drivers/i2c/busses/Kconfig
> +++ b/drivers/i2c/busses/Kconfig
> @@ -152,9 +152,14 @@ config I2C_ISMT
> This driver can also be built as a module.  If so, the module will
> be
> called i2c-ismt.
> 
> +config I2C_DIMM_BUS
> +   tristate
> +   default n
> +
>  config I2C_IMC
>   tristate "Intel iMC (LGA 2011) SMBus Controller"
>   depends on PCI && X86
> + select I2C_DIMM_BUS
>   help
> If you say yes to this option, support will be included for the
> Intel
> Integrated Memory Controller SMBus host controller interface.  This
> diff --git a/drivers/i2c/busses/Makefile b/drivers/i2c/busses/Makefile
> index ab3cdf1b3ca1..093591935bc8 100644
> --- a/drivers/i2c/busses/Makefile
> +++ b/drivers/i2c/busses/Makefile
> @@ -25,6 +25,10 @@ obj-$(CONFIG_I2C_SIS96X)   += i2c-sis96x.o
>  obj-$(CONFIG_I2C_VIA)+= i2c-via.o
>  obj-$(CONFIG_I2C_VIAPRO) += i2c-viapro.o
> 
> +# DIMM busses
> +obj-$(CONFIG_I2C_DIMM_BUS)   += dimm-bus.o
> +obj-$(CONFIG_I2C_IMC)+= i2c-imc.o
> +
>  # Mac SMBus host controller drivers
>  obj-$(CONFIG_I2C_HYDRA)  += i2c-hydra.o
>  obj-$(CONFIG_I2C_POWERMAC)   += i2c-powermac.o
> diff --git a/drivers/i2c/busses/dimm-bus.c b/drivers/i2c/busses/dimm-bus.c
> new file mode 100644
> index ..d41c1095c093
> --- /dev/null
> +++ b/drivers/i2c/busses/dimm-bus.c
> @@ -0,0 +1,107 @@
> +/*
> + * Copyright (c) 2013-2016 Andrew Lutomirski 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2
> + * as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static bool probe_addr(struct i2c_adapter *adapter, int addr)
> +{
> + /*
> +  * So far, all known devices that live on DIMMs can be safely
> +  * and reliably detected by trying to read a byte at address
> +  * zero.  (The exception is the SPD write protection control,
> +  * which can't be probed and requires special hardware and/or
> +  * quick writes to access, and has no driver.)
> +  */

If the bus is in the middle of any other kind of access or sequence of
accesses, it's hard to predict what this will do.

If it's a 512 byte SPD EEPROM and the last page select was to page 1,
this will read byte 256 rather than byte 0.  The thread needs to do
its own set page 0 write to ensure it knows what it is reading.
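
A rough sketch of that extra write (the 0x36 SPA0 page-select address is an
assumption to verify against the DDR4 SPD spec, and error handling is
omitted):

	union i2c_smbus_data dummy;

	/* select SPD page 0 first; 0x36 (SPA0) is assumed here */
	i2c_smbus_xfer(adapter, 0x36, 0, I2C_SMBUS_WRITE, 0,
		       I2C_SMBUS_BYTE, NULL);

	return i2c_smbus_xfer(adapter, addr, 0, I2C_SMBUS_READ, 0,
			      I2C_SMBUS_BYTE_DATA, &dummy) >= 0;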

> + union i2c_smbus_data dummy;
> +
> + return i2c_smbus_xfer(adapter, addr, 0, I2C_SMBUS_READ, 0,
> +   I2C_SMBUS_BYTE_DATA, &dummy) >= 0;
> +}
> +
> +/**
> + * i2c_scan_dimm_bus() - Scans an SMBUS segment known to contain DIMMs
> + * @adapter: The SMBUS adapter to scan
> + *
> + * This function tells the DIMM-bus code that the adapter is known to
> + * contain DIMMs.  i2c_scan_dimm_bus will probe for devices known to
> + * live on DIMMs.
> + *
> + * Do NOT call this function on general-purpose system SMBUS segments
> + * unless you know that the only things on the bus are DIMMs.
> + * Otherwise is it very likely to mis-identify other things on the
> + * bus.
> + *
> + * Callers are advised not to set adapter->class = I2C_CLASS_SPD to
> + * avoid having two separate mechanisms trying to automatically claim
> + * devices on the bus.
> + */

RE: [RFC PATCH v1 00/18] x86: Secure Memory Encryption (AMD)

2016-04-30 Thread Elliott, Robert (Persistent Memory)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Tom Lendacky
> Sent: Tuesday, April 26, 2016 5:56 PM
> Subject: [RFC PATCH v1 00/18] x86: Secure Memory Encryption (AMD)
> 
> This RFC patch series provides support for AMD's new Secure Memory
> Encryption (SME) feature.
> 
> SME can be used to mark individual pages of memory as encrypted through the
> page tables. A page of memory that is marked encrypted will be automatically
> decrypted when read from DRAM and will be automatically encrypted when
> written to DRAM. Details on SME can found in the links below.
> 
...
> ...  Certain data must be accounted for
> as having been placed in memory before SME was enabled (EFI, initrd, etc.)
> and accessed accordingly.
> 
...
>   x86/efi: Access EFI related tables in the clear
>   x86: Access device tree in the clear
>   x86: Do not specify encrypted memory for VGA mapping

If the SME encryption key "is created randomly each time a system is booted,"
data on NVDIMMs won't decrypt properly on the next boot.  You need to exclude
persistent memory regions (reported in the UEFI memory map as 
EfiReservedMemoryType with the NV attribute, or as EfiPersistentMemory).
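
Something like this is the kind of check I have in mind (a sketch
only - where it would hook into the SME setup path is an open
question, and sme_must_skip() is just a placeholder name):

	static bool sme_must_skip(efi_memory_desc_t *md)
	{
		/* persistent ranges must not use a key that changes each boot */
		if (md->type == EFI_PERSISTENT_MEMORY)
			return true;
		if (md->type == EFI_RESERVED_TYPE &&
		    (md->attribute & EFI_MEMORY_NV))
			return true;
		return false;
	}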

Perhaps the SEV feature will allow key export/import that could work for
NVDIMMs.

---
Robert Elliott, HPE Persistent Memory





RE: [PATCH] block: partitions: efi: Always check for alternative GPT at end of drive

2016-04-26 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Davidlohr Bueso
> Sent: Tuesday, April 26, 2016 1:34 PM
> To: Karel Zak 
> Cc: Julius Werner ; linux-...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux-bl...@vger.kernel.org; Gwendal
> Grignou ; Doug Anderson 
> Subject: Re: [PATCH] block: partitions: efi: Always check for
> alternative GPT at end of drive
> 
> On Tue, 26 Apr 2016, Karel Zak wrote:
> 
> >On Mon, Apr 25, 2016 at 06:06:46PM -0700, Julius Werner wrote:
> >> The GUID Partiton Table layout maintains two synonymous partition
> >> tables on a block device, one starting in sector 1 and one in the
> >> very last sectors of the block device. This is useful if one of
> >> the tables gets
> >> accidentally corrupted (e.g. through a partial write because of an
> >> unexpected power loss).
> >>
> >> Linux normally only boots if the primary GPT is valid. It will not
> >> even try to find the alternative GPT to an invalid primary one
> >> unless the "gpt" command line option forces more aggressive
> >> detection. This doesn't
> >> really make any sense... if the "gpt" option is not set, the code
> >> validates the protective or hybrid MBR in sector 0 anyway before
> >> it even starts looking for the actual GPTs. If we get to the point
> >> where a valid proctective or hybrid MBR was found but the primary
> >> GPT was not found (valid), checking the alternative GPT is our 
> >> best bet: we know that this
> 
> 'best bet' in a kernel is not enough :) Which is why userland tools
> can fix and/or do any sort of crazy stuff with the backup and recover
> the primary etc etc.

Drive blocks go bad; the redundant GPTs are there to let the
system keep booting and running if that happens.

Rewriting the bad GPTs is what should require user intervention.

> 
> >> block device is meant to use GPT (because any other partitioning
> system
> >> would've presumably overwritten sector 0), and we know that if the
> >> alternative GPT is valid it should contain more accurate
> information
> >> than parsing the protective/hybrid MBR with msdos_partition()
> would
> >> yield (which would otherwise be what happens next).
> 
> >I guess "force_gpt" (and "gpt" on kernel command line) exists to
> >force users to think and care about a reason why the device has
> >unreadable (broken) primary GPT header.
> 
> Yes, from find_valid_gpt():
> 
>  * If the Primary GPT header is not valid, the Alternate GPT header
>  * is not checked unless the 'gpt' kernel command line option is
> passed.
>  * This protects against devices which misreport their size, and
> forces
>  * the user to decide to use the Alternate GPT.
> 
> ... so users are at least forced in some way to think about this.
> 
> >It seems like bad (and dangerous) idea to silently ignore corrupted
> >primary GTP header and boot from such device.
> 
> Yeah, there's no way in hell I trust a backup gpt in kernel space.
> We simply have no way of distinguishing between good and bad devices.
>
> >And note that alternative GPT header and the end of the device is a
> >just guess. The proper location of the alternative header is
> >specified with-in primary header (pgpt->alternate_lba). The header
> >at the end of
> >the device (as used for "force_gpt") is a fallback solution only.
> 
> And this only illustrates the ambiguity of the backup.

The UEFI specification is not ambiguous - you should always look
for the backup GPT Header at the last LBA:

"Two GPT Header structures are stored on the device: the primary
and the backup. The primary GPT Header must be located in LBA 1
(i.e., the second logical block), and the backup GPT Header must
be located in the last LBA of the device."

If the primary GPT Header is corrupted (e.g., CRC is bad), you
cannot trust any fields in it, including the Alternate LBA field.
The Alternate LBA field is there to help you tolerate failures 
while growing or shrinking the block device size (not important
for individual physical drives, but an issue for logical drives
presented by RAID controllers).
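
In other words, something like this for picking where to read the
backup header (a sketch, not the existing find_valid_gpt() code;
is_primary_valid() stands in for the signature/CRC checks):

	static u64 backup_gpt_lba(gpt_header *pgpt, u64 last_lba)
	{
		/* only trust alternate_lba if the primary header checked out */
		if (pgpt && is_primary_valid(pgpt))
			return le64_to_cpu(pgpt->alternate_lba);

		/* otherwise, the spec says the backup lives in the last LBA */
		return last_lba;
	}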




RE: [PATCH 0/2] Add Opal unlock support to NVMe.

2016-04-25 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: linux-block-ow...@vger.kernel.org [mailto:linux-block-
> ow...@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Monday, April 25, 2016 3:24 AM
> To: Rafael Antognolli 
> Cc: linux-n...@lists.infradead.org; linux-kernel@vger.kernel.org;
> linux-bl...@vger.kernel.org
> Subject: Re: [PATCH 0/2] Add Opal unlock support to NVMe.
> 
> On Fri, Apr 22, 2016 at 04:12:10PM -0700, Rafael Antognolli wrote:
> > This patch series implement a small set of the Opal protocol for
> > self encrypting devices. It's implemented only what is needed for
> > saving a password and unlocking a given "locking range". The
> > password is saved on the driver and replayed back to the device
> > on resume from suspend to RAM. It is specifically supporting
> > the single user mode.

Passwords stored in memory are subject to cold boot attacks.

Could you tie this into the keyring infrastructure, so it would at
least be no worse than other kernel modules?  This would allow
support for TPM-based keys (if present) to resist more attacks.
If register-based key storage or other techniques prove viable,
they would probably show up there first.

> > It is not planned to implement the full Opal protocol (at least
> > not for now).
> 
> I think the OPAL code should be a generic library outside the NVMe
> code so that we can use it for SATA and SAS as well, just with a
> little glue code for the Security Send / Receive commands to wire
> it up to NVMe.

NVDIMMs would benefit from that as well.





RE: [PATCH v2] x86: PAT: Documentation: rewrite "MTRR effects on PAT / non-PAT systems"

2016-03-04 Thread Elliott, Robert (Persistent Memory)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Luis R. Rodriguez
> Sent: Friday, March 04, 2016 4:45 PM
> Subject: [PATCH v2] x86: PAT: Documentation: rewrite "MTRR effects on
> PAT / non-PAT systems"
...
> +MMIO and another PCI BAR for write-combing, if needed.

typo: combining






RE: [PATCH 1/4] block: bio: introduce helpers to get the 1st and last bvec

2016-02-16 Thread Elliott, Robert (Persistent Memory)
> -Original Message-
> From: linux-block-ow...@vger.kernel.org [mailto:linux-block-
> ow...@vger.kernel.org] On Behalf Of Ming Lei
> Sent: Monday, February 15, 2016 3:42 AM
> Subject: Re: [PATCH 1/4] block: bio: introduce helpers to get the 1st and
> last bvec
...
> diff --git a/include/linux/bio.h b/include/linux/bio.h
...
>  static inline void bio_get_last_bvec(struct bio *bio, struct bio_vec *bv)
>  {
> - struct bvec_iter iter;
> + struct bvec_iter iter = bio->bi_iter;
> + int idx;
> +
> + bio_advance_iter(bio, &iter, iter.bi_size);
> +
> + WARN_ON(!iter.bi_idx && !iter.bi_bvec_done);

If this ever did trigger, I don't think you'd want it for every bio
with a problem.  WARN_ONCE would be safer.
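
For example (sketch only - the message text is illustrative):

	WARN_ONCE(!iter.bi_idx && !iter.bi_bvec_done,
		  "empty bio or malformed iterator\n");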

---
Robert Elliott, HPE Persistent Memory




RE: [PATCH 14/14] x86/efi: Print size in binary units in efi_print_memmap

2016-02-03 Thread Elliott, Robert (Persistent Memory)

> -Original Message-
> From: Matt Fleming [mailto:m...@codeblueprint.co.uk]
> Sent: Wednesday, February 3, 2016 5:28 AM
> To: Ingo Molnar 
> Cc: Laszlo Ersek ; H . Peter Anvin ;
> Thomas Gleixner ; linux-...@vger.kernel.org; linux-
> ker...@vger.kernel.org; Elliott, Robert (Persistent Memory)
> ; Andy Shevchenko ;
> Ard Biesheuvel ; Taku Izumi
> ; Linus Torvalds  foundation.org>; Andrew Morton 
> Subject: Re: [PATCH 14/14] x86/efi: Print size in binary units in
> efi_print_memmap
...
> OK, this patch has caused enough headaches. Let's drop it from this
> series.
> 
> Robert, Andy, feel free to resubmit it after you've addressed
> everyone's concerns and we can discuss it in isolation.

We could just delete the size print altogether - better to print
nothing than a silently rounded number.  The end address already
communicates the size - it's just not as readable.  

The e820 table prints don't bother with a size print. 

That would also shorten these extremely wide prints to 116
characters (131 if printk time is enabled).

[    0.000000] BIOS-e820: [mem 0x0000001880000000-0x000000207fffffff] reserved
vs.
[    0.000000] efi: mem62: [Reserved   |   |  |NV|  |  |  |  |  |WB|WT|WC|UC] range=[0x0000001880000000-0x000000207fffffff] (32 GiB)

---
Robert Elliott, HPE Persistent Memory




RE: [PATCH v3 3/4] x86/efi: print size in binary units in efi_print_memmap

2016-01-25 Thread Elliott, Robert (Persistent Memory)


---
Robert Elliott, HPE Persistent Memory


> -Original Message-
> From: Andy Shevchenko [mailto:andy.shevche...@gmail.com]
> Sent: Monday, January 25, 2016 2:01 PM
> To: James Bottomley 
> Cc: Elliott, Robert (Persistent Memory) ; Andy Shevchenko
> ; Matt Fleming
> ; Thomas Gleixner ; Ingo
> Molnar ; H . Peter Anvin ; linux-
> e...@vger.kernel.org; Rasmus Villemoes ; Andrew
> Morton ; linux-kernel @ vger . kernel . org
> 
> Subject: Re: [PATCH v3 3/4] x86/efi: print size in binary units in
> efi_print_memmap
> 
> On Mon, Jan 25, 2016 at 9:45 PM, James Bottomley
>  wrote:
> > On Mon, 2016-01-25 at 21:28 +0200, Andy Shevchenko wrote:
> >> On Mon, Jan 25, 2016 at 8:56 PM, James Bottomley
> >>  wrote:
> >> > On Mon, 2016-01-25 at 18:02 +, Elliott, Robert (Persistent
> >> > Memory)
> >> > wrote:
> >>
> >> >  Using ffs leads to precision runaway
> >>
> >> How exactly?!
> >
> > Off by one.  A size of 0x prints 18446744073709551615 B
> > rather than 20 GiB.
> 
> Because it's not a 20 GiB. It's exactly 20 GiB - 1 B.
> 
> AFAIU, the intention was to show _exact_ size.

For the UEFI memory map, that was indeed my intention.  I
don't want it silently round to "20 GiB".  Even rounding
to "19.999 GiB" is imprecise.

Another option could be to use a "~" prefix for imperfect
values, like "~20 GiB".  That would serve as a warning
that something's not quite right.




RE: [PATCH v3 3/4] x86/efi: print size in binary units in efi_print_memmap

2016-01-25 Thread Elliott, Robert (Persistent Memory)



> -Original Message-
> From: James Bottomley [mailto:james.bottom...@hansenpartnership.com]
> Sent: Saturday, January 23, 2016 10:44 AM
> To: Andy Shevchenko ; Matt Fleming
> ; Thomas Gleixner ; Ingo
> Molnar ; H . Peter Anvin ; linux-
> e...@vger.kernel.org; Rasmus Villemoes ; Andrew
> Morton ; linux-kernel @ vger . kernel . org
> 
> Cc: Elliott, Robert (Persistent Memory) 
> Subject: Re: [PATCH v3 3/4] x86/efi: print size in binary units in
> efi_print_memmap
> 
> On Sat, 2016-01-23 at 16:55 +0200, Andy Shevchenko wrote:
> > From: Robert Elliott 
> >
> > Print the size in the best-fit B, KiB, MiB, etc. units rather than
> > always MiB. This avoids rounding, which can be misleading.
> >

...
> 
> What if size is zero, which might happen on a UEFI screw up?  

> Also it gives really odd results for non power of two memory sizes. 
> 16384MB prints as 16GiB but 16385 prints as 16385MiB.
> If the goal is to have a clean interface reporting only the first four
> significant figures and a size exponent, then a helper would be much
> better than trying to open code this ad hoc.

An impetus for the patch was to stop rounding the sub-MiB values,
which is misleading and can hide bugs.  For my systems, the
minimum size of a range happens to be 4 KiB, so I wanted at least
that resolution. However, I don't want to print everything as KiB,
because that makes big sizes less clear.

Example - old output:
efi: mem00: [Conventional Memory...] range=[0x0000000000000000-0x0000000000001000) (0MB)
efi: mem01: [Loader Data...] range=[0x0000000000001000-0x0000000000002000) (0MB)
efi: mem02: [Conventional Memory...] range=[0x0000000000002000-0x0000000000093000) (0MB)
efi: mem03: [Reserved   ...] range=[0x0000000000093000-0x0000000000094000) (0MB)

Proposed output:
efi: mem00: [Conventional Memory...] range=[0x0000000000000000-0x0000000000092fff] (588 KiB @ 0 B)
efi: mem01: [Reserved   ...] range=[0x0000000000093000-0x0000000000093fff] (4 KiB @ 588 KiB)
efi: mem02: [Conventional Memory...] range=[0x0000000000094000-0x000000000009ffff] (48 KiB @ 592 KiB)
efi: mem03: [Loader Data...] range=[0x0000000000100000-0x00000000013e8fff] (19364 KiB @ 1 MiB)
(notes:
 - from a different system
 - including both base and size
 - Matt didn't like printing the base so that's been removed)

With persistent memory (NVDIMMs) bringing storage device capacities
into the memory subsystem, MiB is too small.  Seeing a 1 TiB NVDIMM
as 1 TiB is a lot clearer than having to recognize 1048576 MiB as
the same value (especially since these power-of-two quantities
don't just chop off zeros on the right).

Examples:
efi: mem50: [Runtime Data   ...] range=[0x00000000784ff000-0x00000000788fefff] (4 MiB @ 1971196 KiB)
efi: mem56: [Conventional Memory...] range=[0x0000000100000000-0x000000087fffffff] (30 GiB @ 4 GiB)
efi: mem58: [Memory Mapped I/O  ...] range=[0x0000000080000000-0x000000008fffffff] (256 MiB @ 2 GiB)
efi: mem60: [Persistent Memory  ...] range=[0x0000001480000000-0x0000001a7fffffff] (24 GiB @ 82 GiB)


---
Robert Elliott, HPE Persistent Memory



RE: [PATCH v2 3/5] dax: improve documentation for fsync/msync

2016-01-22 Thread Elliott, Robert (Persistent Memory)


---
Robert Elliott, HPE Persistent Memory


> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
> Ross Zwisler
> Sent: Friday, January 22, 2016 9:58 AM
> To: Jan Kara 
> Cc: Andrew Morton ; linux-nvd...@lists.01.org;
> Dave Chinner ; linux-kernel@vger.kernel.org;
> Alexander Viro ; Jan Kara ; linux-
> fsde...@vger.kernel.org
> Subject: Re: [PATCH v2 3/5] dax: improve documentation for fsync/msync
> 
> On Fri, Jan 22, 2016 at 04:01:29PM +0100, Jan Kara wrote:
> > On Thu 21-01-16 10:46:02, Ross Zwisler wrote:
...
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index d589113..55ae394 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -350,6 +350,13 @@ static int dax_radix_entry(struct address_space
> *mapping, pgoff_t index,
> > >
> > >   if (!pmd_entry || type == RADIX_DAX_PMD)
> > >   goto dirty;
> > > +
> > > + /*
> > > +  * We only insert dirty PMD entries into the radix tree.  This
> > > +  * means we don't need to worry about removing a dirty PTE
> > > +  * entry and inserting a clean PMD entry, thus reducing the
> > > +  * range we would flush with a follow-up fsync/msync call.
> > > +  */
> >
> > May be acompany this with:
> >
> > WARN_ON(pmd_entry && !dirty);
> >
> > somewhere in dax_radix_entry()?
> 
> Sure, I'll add one.

If this is something that could trigger due to I/O traffic, please
use WARN_ONCE rather than WARN_ON to avoid the risk of swamping
the serial output.
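
i.e. something like (sketch only, message text is illustrative):

	WARN_ONCE(pmd_entry && !dirty,
		  "dax: clean PMD entry inserted over a PTE entry\n");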




RE: linux-next: manual merge of the akpm-current tree with the jc_docs tree

2015-12-31 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: Stephen Rothwell [mailto:s...@canb.auug.org.au]
> Sent: Thursday, December 31, 2015 4:43 AM
> To: Andrew Morton ; Jonathan Corbet
> 
> Cc: linux-n...@vger.kernel.org; linux-kernel@vger.kernel.org;
> Elliott, Robert (Persistent Memory) ; Taku Izumi
> 
> Subject: linux-next: manual merge of the akpm-current tree with the
> jc_docs tree
> 
> Hi Andrew,
> 
> Today's linux-next merge of the akpm-current tree got a conflict in:
> 
>   Documentation/kernel-parameters.txt
> 
> between commit:
> 
>   9daacf51b428 ("Documentation/kernel-parameters: update KMG units")
> 
> from the jc_docs tree and commit:
> 
>   f0a906868be1 ("mm/page_alloc.c: introduce kernelcore=mirror
> option")
> 
> from the akpm-current tree.
> 
> I fixed it up (see below) and can carry the fix as necessary (no
> action
> is required).
> 
> --
> Cheers,
> Stephen Rothwells...@canb.auug.org.au
> 
> diff --cc Documentation/kernel-parameters.txt
> index adf540032a9d,2cfb638d138b..
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@@ -1710,7 -1696,8 +1714,8 @@@ Such letter suffixes can also be
> entire
> 
>   keepinitrd  [HW,ARM]
> 
> - kernelcore=nn[KMGTPE]   [KNL,X86,IA-64,PPC] This parameter
>  -kernelcore= Format: nn[KMG] | "mirror"
> ++kernelcore= Format: nn[KMGTPE] | "mirror"
> + [KNL,X86,IA-64,PPC] This parameter
>   specifies the amount of memory usable by the
> kernel
>   for non-movable allocations.  The requested
> amount is
>   spread evenly throughout all nodes in the system.
> The

There's a pretty strong convention that the [KNL,X86,IA-64,PPC] line
stay on the first line for grep, which seems more important than
keeping the format on that line. So, it might be better to resolve
using three lines like this:

kernelcore=  [KNL,X86,IA-64,PPC] 
Format: nn[KMGTPE] | "mirror"
This parameter specifies...





RE: A blocksize problem about dax and ext4

2015-12-24 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: Christoph Hellwig [mailto:h...@infradead.org]
> Sent: Thursday, December 24, 2015 4:11 AM
> Subject: Re: A blocksize problem about dax and ext4
> 
> On Thu, Dec 24, 2015 at 02:47:07AM +0000, Elliott, Robert (Persistent
> Memory) wrote:
> > > Did you mean that I should make the blocksize bigger until the mount
> > > command tell me that dax is enabled?
> >
> > To really use DAX, the filesystem block size must match the
> > system CPU's page size, which is probably 4096 bytes.
> 
> No, it doesn't.  File you use for DAX must be aligne at page size
> granularity.  For XFS you could do this with the per-inode extent
> size hint for example even if the overall block size is smaller.

I think that's a future goal.

Currently, the checks are like this:

if (sb->s_blocksize != PAGE_SIZE) {
xfs_alert(mp,
"Filesystem block size invalid for DAX Turning DAX off.");





RE: A blocksize problem about dax and ext4

2015-12-23 Thread Elliott, Robert (Persistent Memory)
> -Original Message-
> From: Cholerae Hu [mailto:cholerae...@gmail.com]
> Sent: Wednesday, December 23, 2015 8:36 PM
> Subject: Re: A blocksize problem about dax and ext4
...
> xfs will silently disable dax when the fs block size is too small,
> i.e. your mmap() operations are backed by page cache in this case.
> Currently the only indication of whether a mapping is DAX backed or
> not is the presence of the VM_MIXEDMAP flag ("mm" in the VmFlags field
> of /proc//smaps)
> 
> Did you mean that I should make the blocksize bigger until the mount
> command tell me that dax is enabled?

To really use DAX, the filesystem block size must match the
system CPU's page size, which is probably 4096 bytes.

---
Robert Elliott, HPE Persistent Memory


RE: [PATCH v6 3/7] mm: add find_get_entries_tag()

2015-12-23 Thread Elliott, Robert (Persistent Memory)
> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
> Ross Zwisler
> Sent: Wednesday, December 23, 2015 1:39 PM
> Subject: [PATCH v6 3/7] mm: add find_get_entries_tag()
> 
...
> diff --git a/mm/filemap.c b/mm/filemap.c
...
> +unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
> + int tag, unsigned int nr_entries,
> + struct page **entries, pgoff_t *indices)
> +{
> + void **slot;
> + unsigned int ret = 0;
...
> + radix_tree_for_each_tagged(slot, &mapping->page_tree,
> +&iter, start, tag) {
...
> + indices[ret] = iter.index;
> + entries[ret] = page;
> + if (++ret == nr_entries)
> + break;
> + }

Using >= would provide more safety from buffer overflow
problems in case ret ever jumped ahead by more than one.
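
i.e. (sketch):

	if (++ret >= nr_entries)
		break;
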
---
Robert Elliott, HPE Persistent Memory



RE: A blocksize problem about dax and ext4

2015-12-23 Thread Elliott, Robert (Persistent Memory)

> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
> Dan Williams
> Sent: Wednesday, December 23, 2015 11:16 AM
> To: Cholerae Hu 
> Cc: linux-nvd...@lists.01.org
> Subject: Re: A blocksize problem about dax and ext4
> 
> On Wed, Dec 23, 2015 at 4:03 AM, Cholerae Hu 
> wrote:
...
> > [root@localhost cholerae]# mount -o dax /dev/pmem0 /mnt/mem
> > mount: wrong fs type, bad option, bad superblock on /dev/pmem0,
> >missing codepage or helper program, or other error
> >
> >In some cases useful info is found in syslog - try
> >dmesg | tail or so.
> > [root@localhost cholerae]# dmesg | tail
...
> > [   81.779582] EXT4-fs (pmem0): error: unsupported blocksize for dax
...

> What's the fs block size?  For example:
> # dumpe2fs -h /dev/pmem0  | grep "Block size"
> dumpe2fs 1.42.9 (28-Dec-2013)
> Block size:   4096
> Depending on the size of /dev/pmem0 it may have automatically set it
> to a block size less than 4 KiB which is incompatible with "-o dax".

I noticed a few things while trying that out on both ext4 and xfs.

$ sudo mkfs.ext4 -F -b 1024 /dev/pmem0
$ sudo mount -o dax /dev/pmem0 /mnt/ext4-pmem0
$ sudo mkfs.xfs -f -b size=1024 /dev/pmem0
$ sudo mount -o dax /dev/pmem0 /mnt/xfs-pmem0

[  199.679195] EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your 
own risk
[  199.724931] EXT4-fs (pmem0): error: unsupported block size 1024 for dax 
[  859.077766] XFS (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own 
risk
[  859.118106] XFS (pmem0): Filesystem block size invalid for DAX Turning DAX 
off.
[  859.156950] XFS (pmem0): Mounting V4 Filesystem
[  859.183626] XFS (pmem0): Ending clean mount


1. ext4 fails to mount the filesystem, while xfs just disables DAX.
It seems like they should be the same.

2. if CONFIG_FS_DAX is not supported, ext4 fails to mount, but prints
the message at the KERN_INFO level. All the rest of its mount errors
use KERN_ERR.

Completely unknown mount options are reported like this at the
KERN_ERR level:
[ 2188.194775] EXT4-fs (pmem0): Unrecognized mount option "xyzzy" or missing 
value

In contrast, if CONFIG_FS_DAX is not supported, then xfs lumps it in
with the rest of the unknown mount options, which are reported with 
xfs_warn():
[ 2347.654182] XFS (pmem0): unknown mount option [xyzzy].


3. It might be worth printing the problematic filesystem block size
 (here and in a few other similar messages).  

I like how xfs' wording of "Filesystem block size" helps distinguish
the value from the block device's logical block size.
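
For example (sketch only, not a patch against either filesystem):

	xfs_alert(mp,
	"Filesystem block size %lu invalid for DAX Turning DAX off.",
		  sb->s_blocksize);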


Code excerpts
=
fs/xfs/xfs_super.c:
#ifdef CONFIG_FS_DAX
} else if (!strcmp(this_char, MNTOPT_DAX)) {
mp->m_flags |= XFS_MOUNT_DAX;
#endif
...
if (mp->m_flags & XFS_MOUNT_DAX) {
xfs_warn(mp,
"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
if (sb->s_blocksize != PAGE_SIZE) {
xfs_alert(mp,
"Filesystem block size invalid for DAX Turning DAX off.");
mp->m_flags &= ~XFS_MOUNT_DAX;
} else if (!sb->s_bdev->bd_disk->fops->direct_access) {
xfs_alert(mp,
"Block device does not support DAX Turning DAX off.");
mp->m_flags &= ~XFS_MOUNT_DAX;
}
}

fs/ext4/super.c:
} else if (token == Opt_dax) {
#ifdef CONFIG_FS_DAX
ext4_msg(sb, KERN_WARNING,
"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
sbi->s_mount_opt |= m->mount_opt;
#else
ext4_msg(sb, KERN_INFO, "dax option not supported");
return -1;
#endif

---
Robert Elliott, HPE Persistent Memory




RE: [PATCH v6 3/7] mm: add find_get_entries_tag()

2015-12-23 Thread Elliott, Robert (Persistent Memory)
> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
> Ross Zwisler
> Sent: Wednesday, December 23, 2015 1:39 PM
> Subject: [PATCH v6 3/7] mm: add find_get_entries_tag()
> 
...
> diff --git a/mm/filemap.c b/mm/filemap.c
...
> +unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
> + int tag, unsigned int nr_entries,
> + struct page **entries, pgoff_t *indices)
> +{
> + void **slot;
> + unsigned int ret = 0;
...
> + radix_tree_for_each_tagged(slot, >page_tree,
> +, start, tag) {
...
> + indices[ret] = iter.index;
> + entries[ret] = page;
> + if (++ret == nr_entries)
> + break;
> + }

Using >= would provide more safety from buffer overflow
problems in case ret ever jumped ahead by more than one.
---
Robert Elliott, HPE Persistent Memory

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: A blocksize problem about dax and ext4

2015-12-23 Thread Elliott, Robert (Persistent Memory)
> -Original Message-
> From: Cholerae Hu [mailto:cholerae...@gmail.com]
> Sent: Wednesday, December 23, 2015 8:36 PM
> Subject: Re: A blocksize problem about dax and ext4
...
> xfs will silently disable dax when the fs block size is too small,
> i.e. your mmap() operations are backed by page cache in this case.
> Currently the only indication of whether a mapping is DAX backed or
> not is the presence of the VM_MIXEDMAP flag ("mm" in the VmFlags field
> of /proc//smaps)
> 
> Did you mean that I should make the blocksize bigger until the mount
> command tell me that dax is enabled?

To really use DAX, the filesystem block size must match the
system CPU's page size, which is probably 4096 bytes.

---
Robert Elliott, HPE Persistent Memory

N�r��yb�X��ǧv�^�)޺{.n�+{zX����ܨ}���Ơz�:+v���zZ+��+zf���h���~i���z��w���?�&�)ߢf��^jǫy�m��@A�a���
0��h���i

RE: A blocksize problem about dax and ext4

2015-12-23 Thread Elliott, Robert (Persistent Memory)

> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
> Dan Williams
> Sent: Wednesday, December 23, 2015 11:16 AM
> To: Cholerae Hu 
> Cc: linux-nvd...@lists.01.org
> Subject: Re: A blocksize problem about dax and ext4
> 
> On Wed, Dec 23, 2015 at 4:03 AM, Cholerae Hu 
> wrote:
...
> > [root@localhost cholerae]# mount -o dax /dev/pmem0 /mnt/mem
> > mount: wrong fs type, bad option, bad superblock on /dev/pmem0,
> >missing codepage or helper program, or other error
> >
> >In some cases useful info is found in syslog - try
> >dmesg | tail or so.
> > [root@localhost cholerae]# dmesg | tail
...
> > [   81.779582] EXT4-fs (pmem0): error: unsupported blocksize for dax
...

> What's the fs block size?  For example:
> # dumpe2fs -h /dev/pmem0  | grep "Block size"
> dumpe2fs 1.42.9 (28-Dec-2013)
> Block size:   4096
> Depending on the size of /dev/pmem0 it may have automatically set it
> to a block size less than 4 KiB which is incompatible with "-o dax".

I noticed a few things while trying that out on both ext4 and xfs.

$ sudo mkfs.ext4 -F -b 1024 /dev/pmem0
$ sudo mount -o dax /dev/pmem0 /mnt/ext4-pmem0
$ sudo mkfs.xfs -f -b size=1024 /dev/pmem0
$ sudo mount -o dax /dev/pmem0 /mnt/xfs-pmem0

[  199.679195] EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your 
own risk
[  199.724931] EXT4-fs (pmem0): error: unsupported block size 1024 for dax 
[  859.077766] XFS (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own 
risk
[  859.118106] XFS (pmem0): Filesystem block size invalid for DAX Turning DAX 
off.
[  859.156950] XFS (pmem0): Mounting V4 Filesystem
[  859.183626] XFS (pmem0): Ending clean mount


1. ext4 fails to mount the filesystem, while xfs just disables DAX.
It seems like they should they be the same.

2. if CONFIG_FS_DAX is not supported, ext4 fails to mount, but prints
the message at the KERN_INFO level. All the rest of its mount errors
use KERN_ERR.

Completely unknown mount options are reported like this at the
KERN_ERR level:
[ 2188.194775] EXT4-fs (pmem0): Unrecognized mount option "xyzzy" or missing 
value

In contrast, if CONFIG_FS_DAX is not supported, then xfs lumps it in
with the rest of the unknown mount options, which are reported with 
xfs_warn():
[ 2347.654182] XFS (pmem0): unknown mount option [xyzzy].


3. It might be worth printing the problematic filesystem block size
 (here and in a few other similar messages).  

I like how xfs' wording of "Filesystem block size" helps distinguish
the value from the block device's logical block size.


Code excerpts
=
fs/xfs/xfs_super.c:
#ifdef CONFIG_FS_DAX
} else if (!strcmp(this_char, MNTOPT_DAX)) {
mp->m_flags |= XFS_MOUNT_DAX;
#endif
...
if (mp->m_flags & XFS_MOUNT_DAX) {
xfs_warn(mp,
"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
if (sb->s_blocksize != PAGE_SIZE) {
xfs_alert(mp,
"Filesystem block size invalid for DAX Turning DAX off.");
mp->m_flags &= ~XFS_MOUNT_DAX;
} else if (!sb->s_bdev->bd_disk->fops->direct_access) {
xfs_alert(mp,
"Block device does not support DAX Turning DAX off.");
mp->m_flags &= ~XFS_MOUNT_DAX;
}
}

fs/ext4/super.c:
	} else if (token == Opt_dax) {
#ifdef CONFIG_FS_DAX
		ext4_msg(sb, KERN_WARNING,
		"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
		sbi->s_mount_opt |= m->mount_opt;
#else
		ext4_msg(sb, KERN_INFO, "dax option not supported");
		return -1;
#endif
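
For suggestion 3, a sketch of what including the size could look like
in the xfs message (illustrative wording only, not a posted patch; it
reuses the sb->s_blocksize value already tested just above):

	if (sb->s_blocksize != PAGE_SIZE) {
		xfs_alert(mp,
	"Filesystem block size %lu invalid for DAX. Turning DAX off.",
			sb->s_blocksize);
		mp->m_flags &= ~XFS_MOUNT_DAX;
	}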

---
Robert Elliott, HPE Persistent Memory




RE: [PATCH 4/4] x86/efi: print size and base in binary units in efi_print_memmap

2015-12-22 Thread Elliott, Robert (Persistent Memory)

> -Original Message-
> From: Matt Fleming [mailto:m...@codeblueprint.co.uk]
> Sent: Monday, December 21, 2015 10:16 AM
> To: Elliott, Robert (Persistent Memory) 
> Cc: t...@linutronix.de; mi...@redhat.com; h...@zytor.com; x...@kernel.org;
> linux-...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH 4/4] x86/efi: print size and base in binary units in
> efi_print_memmap
> 
> On Thu, 17 Dec, at 07:28:34PM, Robert Elliott wrote:
> > Print the base address for each range in decimal alongside the size.
> > Use a "(size @ base)" format similar to the fake_memmap kernel
> parameter.
> >
> > Print the range and base in the best-fit B, KiB, MiB, etc. units rather
> > than always MiB.  This avoids rounding, which can be misleading.
> >
> > Use proper IEC binary units (KiB, MiB, etc.) rather than misuse SI
> > decimal units (KB, MB, etc.).
> >
> > old:
> > efi: mem61: [Persistent Memory  |   |  |  |  |  |  |   |WB|WT|WC|UC]
> range=[0x00088000-0x000c7fff) (16384MB)
> >
> > new:
> > efi: mem61: [Persistent Memory  |   |  |  |  |  |  |   |WB|WT|WC|UC]
> range=[0x00088000-0x000c7fff] (16 GiB @ 34 GiB)
> >
> > Signed-off-by: Robert Elliott 
> > ---
> >  arch/x86/platform/efi/efi.c | 27 ---
> >  1 file changed, 24 insertions(+), 3 deletions(-)
> 
> I'm not at all sure of the value of printing the physical address as a
> size. I would have thought that you'd have to convert it back to an
> address whenever you wanted to use it anyway.

I was trying to make it resemble the memmap=size@address 
kernel parameter format for creating e820 entries, which
does accept abbreviations in addition to hex values:
memmap=nn[KMG]@ss[KMG] for usable DRAM
memmap=nn[KMG]#ss[KMG] for ACPI data
memmap=nn[KMG]$ss[KMG] for reserved
memmap=nn[KMG]!ss[KMG] for persistent memory
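
For example, the 16 GiB persistent memory range at 34 GiB in the new
output above would be written as (illustrative only):
    memmap=16G!34G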

Mapping the UEFI type to the corresponding @, #, $, or ! was
more than I wanted to tackle, so it's not a drop-in
replacement string.

memparse() also accepts T, P, and E units; I guess those
need to be added to Documentation/kernel-parameters.txt.

> > diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
> > index 635a955..030ba91 100644
> > --- a/arch/x86/platform/efi/efi.c
> > +++ b/arch/x86/platform/efi/efi.c
> > @@ -222,6 +222,25 @@ int __init efi_memblock_x86_reserve_range(void)
> > return 0;
> >  }
> >
> > +char * __init efi_size_format(char *buf, size_t size, u64 bytes)
> > +{
> > +   if (!bytes || (bytes & 0x3ff))
> > +   snprintf(buf, size, "%llu B", bytes);
> > +   else if (bytes & 0xfffff)
> > +   snprintf(buf, size, "%llu KiB", bytes >> 10);
> > +   else if (bytes & 0x3fffffff)
> > +   snprintf(buf, size, "%llu MiB", bytes >> 20);
> > +   else if (bytes & 0xffffffffff)
> > +   snprintf(buf, size, "%llu GiB", bytes >> 30);
> > +   else if (bytes & 0x3ffffffffffff)
> > +   snprintf(buf, size, "%llu TiB", bytes >> 40);
> > +   else if (bytes & 0xfffffffffffffff)
> > +   snprintf(buf, size, "%llu PiB", bytes >> 50);
> > +   else
> > +   snprintf(buf, size, "%llu EiB", bytes >> 60);
> > +   return buf;
> > +}
> > +
> 
> Can we use string_get_size() instead of rolling our own function?

Thanks for the pointer; I wondered if there was a similar
function somewhere.  However, that function throws away
precision in favor of printing just 3 significant digits;
I think that's dangerous.  Its non-integer output is not
supported by memmap=, and the function appears to use
assembly code to get CPU divide instructions, losing the
ability to use shifts for these power of two divisions.

Example results...

efi: mem01:... range=[0x00093000-0x00093fff] (4 KiB @ 588 KiB)
efi: mem01:... range=[0x00093000-0x00093fff] (4.00 KiB @ 588 KiB) SGS

efi: mem03:... range=[0x0010-0x013e8fff] (19364 KiB @ 1 MiB)
efi: mem03:... range=[0x0010-0x013e8fff] (18.9 MiB @ 1.00 MiB) SGS
(example of lost precision: 19364 KiB is really 18.91015625 MiB)

efi: mem04:... range=[0x013e9000-0x01ff] (12380 KiB @ 20388 KiB)
efi: mem04:... range=[0x013e9000-0x01ff] (12.0 MiB @ 19.9 MiB) SGS

efi: mem28:... range=[0x717c2000-0x72acafff] (19492 KiB @ 1859336 KiB)
efi: mem28:... range=[0x717c2000-0x72acafff] (19.0 MiB @ 1.77 GiB) SGS

efi: mem57:... range=[0x00088000-0x000e7fff] (24 GiB @ 34 GiB)
efi: mem57:... range=[0x00088000-0x000e7fff] (24.0 GiB @ 34.0 GiB) SGS

---
Robert Elliott, HPE Persistent Memory



RE: [PATCH 1/4] x86/efi: show actual ending addresses in efi_print_memmap

2015-12-22 Thread Elliott, Robert (Persistent Memory)


---
Robert Elliott, HPE Persistent Memory


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Matt Fleming
> Sent: Monday, December 21, 2015 10:06 AM
> To: Elliott, Robert (Persistent Memory) 
> Cc: t...@linutronix.de; mi...@redhat.com; h...@zytor.com; x...@kernel.org;
> linux-...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH 1/4] x86/efi: show actual ending addresses in
> efi_print_memmap
> 
> On Mon, 21 Dec, at 03:50:38PM, Matt Fleming wrote:
> > On Thu, 17 Dec, at 07:28:31PM, Robert Elliott wrote:
> > > Adjust efi_print_memmap to print the real end address of each
> > > range, not 1 byte beyond. This matches other prints like those for
> > > SRAT and nosave memory.
> > >
> > > Change the closing ) to ] to match the opening [.
> > >
> > > old:
> > > efi: mem61: [Persistent Memory  |   |  |  |  |  |  |
> |WB|WT|WC|UC] range=[0x00088000-0x000c8000) (16384MB)
> > >
> > > new:
> > > efi: mem61: [Persistent Memory  |   |  |  |  |  |  |
> |WB|WT|WC|UC] range=[0x00088000-0x000c7fff] (16384MB)
> > >
> > > Example other address range prints:
> > > SRAT: Node 1 PXM 1 [mem 0x48000-0x87fff]
> > > PM: Registered nosave memory: [mem 0x88000-0xc7fff]
> > >
> > > Signed-off-by: Robert Elliott 
> > > ---
> > >  arch/x86/platform/efi/efi.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > Is this change purely for aesthetic reasons? We're usually not in the
> > habit of changing the output of print messages without a good reason
> > because people downstream do rely on them being consistent.
> 
> I just noticed the bracket change.
> 
> My comment above only refers to printing the range addresses. Swapping
> ')' for ']' is a perfectly valid change.

While investigating the grub persistent memory corruption
issues, it was helpful to make this table match the ending
address convention used by:
* the kernel's e820 table prints
* the kernel's nosave memory prints
* grub's lsmmap and lsefimmap commands
* the UEFI shell's memmap command

Using the excerpts from the patch, if you grep all the various
logs for c7fff, you won't find the line with c8000.
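
For reference, a minimal sketch of the adjusted print (assuming the
surrounding efi_print_memmap() context; not the exact patch hunk):

	pr_info("mem%02u: %s range=[0x%016llx-0x%016llx] (%lluMB)\n",
		i, efi_md_typeattr_format(buf, sizeof(buf), md),
		md->phys_addr,
		md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1,
		(md->num_pages >> (20 - EFI_PAGE_SHIFT)));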



RE: [PATCH 1/4] x86/efi: show actual ending addresses in efi_print_memmap

2015-12-21 Thread Elliott, Robert (Persistent Memory)
> -Original Message-
> From: Matt Fleming [mailto:m...@codeblueprint.co.uk]
> Sent: Monday, December 21, 2015 9:51 AM
> To: Elliott, Robert (Persistent Memory) 
> Cc: t...@linutronix.de; mi...@redhat.com; h...@zytor.com; x...@kernel.org;
> linux-...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH 1/4] x86/efi: show actual ending addresses in
> efi_print_memmap
> 
> On Thu, 17 Dec, at 07:28:31PM, Robert Elliott wrote:
> > Adjust efi_print_memmap to print the real end address of each
> > range, not 1 byte beyond. This matches other prints like those for
> > SRAT and nosave memory.
> >
> > Change the closing ) to ] to match the opening [.
> >
> > old:
> > efi: mem61: [Persistent Memory  |   |  |  |  |  |  |   |WB|WT|WC|UC]
> range=[0x00088000-0x000c8000) (16384MB)
> >
> > new:
> > efi: mem61: [Persistent Memory  |   |  |  |  |  |  |   |WB|WT|WC|UC]
> range=[0x00088000-0x000c7fff] (16384MB)
> >
> > Example other address range prints:
> > SRAT: Node 1 PXM 1 [mem 0x48000-0x87fff]
> > PM: Registered nosave memory: [mem 0x88000-0xc7fff]
> >
> > Signed-off-by: Robert Elliott 
> > ---
> >  arch/x86/platform/efi/efi.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> Is this change purely for aesthetic reasons? We're usually not in the
> habit of changing the output of print messages without a good reason
> because people downstream do rely on them being consistent.

The format of that line is architecture-specific and only appears
under the efi=debug kernel parameter, so I don't know how much
anyone relies on the specific format.  Good question for the lists.

arch/ia64/kernel/efi.c shares the range=[...) format, but prints
the range after the bitmask rather than before:
printk("mem%02d: %s "
"range=[0x%016lx-0x%016lx) (%4lu%s)\n",
i, efi_md_typeattr_format(buf, sizeof(buf), md),
md->phys_addr,
md->phys_addr + efi_md_size(md), size, unit);

arch/arm64/kernel/efi.c has no mem prefix, or range=[...) text
surrounding the range:
pr_info("  0x%012llx-0x%012llx %s",
paddr, paddr + (npages << EFI_PAGE_SHIFT) - 1,
efi_md_typeattr_format(buf, sizeof(buf), md));
pr_cont("*");
pr_cont("\n");

The x86 code is inside ifdef EFI_DEBUG, which is always
defined in that file.  I wonder if it was supposed to change
to efi_enabled(EFI_DBG) to be based off the efi=debug kernel
parameter?  efi_init() qualified the call to this function
based on that:
if (efi_enabled(EFI_DBG))
efi_print_memmap();

In arch/ia64/kernel/efi.c efi_init(), EFI_DEBUG is set to 0
so the print doesn't happen at all without editing the
source code.  It doesn't use efi_enabled(EFI_DBG).

arch/arm64/kernel/efi.c uses efi_enabled(EFI_DBG) exclusively.
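
If the x86 function were changed to follow the arm64 approach, one
sketch (an assumption, not a posted patch) is to drop the always-defined
EFI_DEBUG ifdef and gate on the runtime flag inside the function:

void __init efi_print_memmap(void)
{
	if (!efi_enabled(EFI_DBG))
		return;
	/* existing per-descriptor pr_info() loop unchanged */
}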

---
Robert Elliott, HPE Persistent Memory



arm64/efi handling of persistent memory

2015-12-17 Thread Elliott, Robert (Persistent Memory)
Similar to the questions about the arm64 efi boot stub
handing persistent memory, some of the arm64 kernel code 
looks fishy.

In arch/arm64/kernel/efi.c:

static int __init is_normal_ram(efi_memory_desc_t *md)
{
if (md->attribute & EFI_MEMORY_WB)
return 1;
return 0;
}

static __init int is_reserve_region(efi_memory_desc_t *md)
{
switch (md->type) {
case EFI_LOADER_CODE:
case EFI_LOADER_DATA:
case EFI_BOOT_SERVICES_CODE:
case EFI_BOOT_SERVICES_DATA:
case EFI_CONVENTIONAL_MEMORY:
case EFI_PERSISTENT_MEMORY:
return 0;
default:
break;
}
return is_normal_ram(md);
}

static __init void reserve_regions(void)
{
...
if (is_normal_ram(md))
early_init_dt_add_memory_arch(paddr, size);

if (is_reserve_region(md)) {
memblock_reserve(paddr, size);
...

static bool __init efi_virtmap_init(void)
{
...
if (!is_normal_ram(md))
prot = __pgprot(PROT_DEVICE_nGnRE);
else if (md->type == EFI_RUNTIME_SERVICES_CODE ||
 !PAGE_ALIGNED(md->phys_addr))
prot = PAGE_KERNEL_EXEC;
else
prot = PAGE_KERNEL;

Concerns include:

1. is_normal_ram() will see the WB bit and return 1 regardless
of the type.  That seems similar to the arm EFI boot stub issue.
The three callers are shown above.

2. is_reserve_region() treating EFI_PERSISTENT_MEMORY the same
as EFI_CONVENTIONAL_MEMORY looks wrong.

3. We're contemplating working around the grub problem by
reporting EFI_RESERVED_MEMORY plus the NV attribute rather
than EFI_PERSISTENT_MEMORY.

If this is done, then is_reserve_region() will fall through
to is_normal_ram(), which will see the WB bit and return 1.
That seems backwards... but seems correct for persistent
memory, reporting it as a reserved region.  That might avoid the
EFI_PERSISTENT_MEMORY handling problem (if the preceding
call to is_normal_ram() didn't already cause problems).
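
For concern 2, a sketch of one possible fix (illustrative only, not a
tested patch): stop treating EfiPersistentMemory as free RAM so that
reserve_regions() keeps it out of the normal memory pool:

static __init int is_reserve_region(efi_memory_desc_t *md)
{
	switch (md->type) {
	case EFI_LOADER_CODE:
	case EFI_LOADER_DATA:
	case EFI_BOOT_SERVICES_CODE:
	case EFI_BOOT_SERVICES_DATA:
	case EFI_CONVENTIONAL_MEMORY:
		return 0;
	case EFI_PERSISTENT_MEMORY:
		return 1;	/* never hand this out as normal RAM */
	default:
		break;
	}
	return is_normal_ram(md);
}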

---
Robert Elliott, HPE Persistent Memory




RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

2015-12-15 Thread Elliott, Robert (Persistent Memory)


---
Robert Elliott, HPE Persistent Memory


> -Original Message-
> From: Borislav Petkov [mailto:b...@alien8.de]
> Sent: Tuesday, December 15, 2015 1:29 PM
> To: Elliott, Robert (Persistent Memory) 
> Cc: Dan Williams ; Luck, Tony
> ; linux-nvdimm ; X86 ML
> ; linux-kernel@vger.kernel.org; Linux MM  m...@kvack.org>; Andy Lutomirski ; Andrew Morton
> ; Ingo Molnar 
> Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to
> recover from machine checks
> 
> On Tue, Dec 15, 2015 at 07:19:58PM +0000, Elliott, Robert (Persistent
> Memory) wrote:
> 
> ...
> 
> > Due to the historic long latency of storage devices,
> > applications don't re-read from storage again; they
> > save the results.
> > So, the streaming-load instructions are beneficial:
> 
> That's the theory...
> 
> Do you also have some actual performance numbers where non-temporal
> operations are better than the REP; MOVSB and *actually* show
> improvements? And no microbenchmarks please.
> 
> Thanks.
> 

This isn't exactly what you're looking for, but here is 
an example of fio doing reads from pmem devices (reading
from NVDIMMs, writing to DIMMs) with various transfer
sizes.

At 256 KiB, all the main memory buffers fit in the CPU
caches, so no write traffic appears on DDR (just the reads
from the NVDIMMs).  At 1 MiB, the data spills out of the
caches, and writes to the DIMMs end up on DDR.

Although DDR is busier, fio gets a lot less work done:
* 256 KiB: 90 GiB/s by fio
*   1 MiB: 49 GiB/s by fio

We could try modifying pmem to use its own non-temporal
memcpy functions (I've posted experimental patches
before that did this) to see if that transition point
shifts.  We can also watch the CPU cache statistics
while running.

Here are statistics from Intel's pcm-memory.x 
(pardon the wide formatting):

256 KiB
===
pmem0: (groupid=0, jobs=40): err= 0: pid=20867: Tue Nov 24 18:20:08 2015
  read : io=5219.1GB, bw=89079MB/s, iops=356314, runt= 60006msec
  cpu  : usr=1.74%, sys=96.16%, ctx=49576, majf=0, minf=21997

Run status group 0 (all jobs):
   READ: io=5219.1GB, aggrb=89079MB/s, minb=89079MB/s, maxb=89079MB/s, mint=60006msec, maxt=60006msec

|---||---|
|-- Socket  0 --||-- Socket  1 --|
|---||---|
|-- Memory Channel Monitoring --||-- Memory Channel Monitoring --|
|---||---|
|-- Mem Ch  0: Reads (MB/s): 11778.11 --||-- Mem Ch  0: Reads (MB/s): 11743.99 --|
|--Writes(MB/s):51.83 --||--Writes(MB/s):43.25 --|
|-- Mem Ch  1: Reads (MB/s): 11779.90 --||-- Mem Ch  1: Reads (MB/s): 11736.06 --|
|--Writes(MB/s):48.73 --||--Writes(MB/s):37.86 --|
|-- Mem Ch  4: Reads (MB/s): 11784.79 --||-- Mem Ch  4: Reads (MB/s): 11746.94 --|
|--Writes(MB/s):52.90 --||--Writes(MB/s):43.73 --|
|-- Mem Ch  5: Reads (MB/s): 11778.48 --||-- Mem Ch  5: Reads (MB/s): 11741.55 --|
|--Writes(MB/s):47.62 --||--Writes(MB/s):37.80 --|
|-- NODE 0 Mem Read (MB/s) : 47121.27 --||-- NODE 1 Mem Read (MB/s) : 46968.53 --|
|-- NODE 0 Mem Write(MB/s) :   201.08 --||-- NODE 1 Mem Write(MB/s) :   162.65 --|
|-- NODE 0 P. Write (T/s): 190927 --||-- NODE 1 P. Write (T/s): 182961 --|
|-- NODE 0 Memory (MB/s):47322.36 --||-- NODE 1 Memory (MB/s):47131.17 --|
|---||---|
|---||---|
|--   System Read Throughput(MB/s):  94089.80  --|
|--  System Write Throughput(MB/s):363.73  --|
|-- System Memory Throughput(MB/s):  94453.52  --|
|---||---|

1 MiB
=
|---||---|
|-- Socket  0 --||-- Socket  1 --|
|---||---|
|-- Memory Channel Monitoring --||-- Memory Channel Monitoring --|
|---||---|
|-- Mem Ch  0: Reads (MB/s):  7227.83 --||-- Mem Ch  0: Reads (MB/s):  7047.45 --|
|--Writes(MB/s):  5894.47 --||--Writes(MB/s):  6010.66 --|
|-- Mem Ch  1: Reads (MB/s):  7229.32 --||-- Mem Ch  1: Reads (MB/s):  7041.79 --|
|--Writes(MB/s):  5891.38 --||--Writes(MB/s):  6003.19 --|
|-- Mem Ch  4: Read

RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

2015-12-15 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf
> Of Borislav Petkov
> Sent: Tuesday, December 15, 2015 12:39 PM
> To: Dan Williams 
> Cc: Luck, Tony ; linux-nvdimm  nvd...@ml01.01.org>; X86 ML ; linux-
> ker...@vger.kernel.org; Linux MM ; Andy Lutomirski
> ; Andrew Morton ; Ingo Molnar
> 
> Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to
> recover from machine checks
> 
> On Tue, Dec 15, 2015 at 10:35:49AM -0800, Dan Williams wrote:
> > Correction we have MOVNTDQA, but that requires saving the fpu state
> > and marking the memory as WC, i.e. probably not worth it.
> 
> Not really. Last time I tried an SSE3 memcpy in the kernel like glibc
> does, it wasn't worth it. The enhanced REP; MOVSB is hands down faster.

Reading from NVDIMM, rep movsb is efficient, but it 
fills the CPU caches with the NVDIMM addresses. For
large data moves (not uncommon for storage) this
will crowd out more important cacheable data.

For normal block device reads made through the pmem
block device driver, this CPU cache consumption is
wasteful, since it is unlikely the application will
ask pmem to read the same addresses anytime soon.
Due to the historic long latency of storage devices,
applications don't re-read from storage again; they
save the results.  So, the streaming-load
instructions are beneficial:
* movntdqa (16-byte xmm registers) 
* vmovntdqa (32-byte ymm registers)
* vmovntdqa (64-byte zmm registers)

Dan Williams wrote:
> Correction we have MOVNTDQA, but that requires
> saving the fpu state and marking the memory as WC
> i.e. probably not worth it.

Although the WC memory type is described in the SDM
in the most detail:
"An implementation may also make use of the
non-temporal hint associated with this instruction
if the memory source is WB (write back) memory
type. ... may optimize cache reads generated by 
(V)MOVNTDQA on WB memory type to reduce cache 
evictions."

For applications doing loads from mmap() DAX memory, 
the CPU cache usage could be worthwhile, because
applications expect mmap() regions to consist of
traditional writeback-cached memory and might do
lots of loads/stores.

Writing to the NVDIMM requires either:
* non-temporal stores; or
* normal stores + cache flushes + fences

movnti is OK for small transfers, but these are
better for bulk moves:
* movntdq (16-byte xmm registers)
* vmovntdq (32-byte ymm registers)
* vmovntdq (64-byte zmm registers)
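
As a rough illustration only (a userspace sketch with intrinsics, not
the kernel's memcpy and not code from this thread; assumes 16-byte
aligned buffers, a length that is a multiple of 16, and SSE4.1 support):

#include <immintrin.h>
#include <stddef.h>

/* Streaming copy: movntdqa loads from the pmem source, movntdq stores
 * to the DRAM destination, so neither side pollutes the CPU caches. */
static void nt_copy(void *dst, const void *src, size_t len)
{
	__m128i *d = dst;
	const __m128i *s = src;

	for (size_t i = 0; i < len / 16; i++) {
		__m128i v = _mm_stream_load_si128((__m128i *)&s[i]);
		_mm_stream_si128(&d[i], v);
	}
	_mm_sfence();	/* order the streaming stores before later writes */
}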

---
Robert Elliott, HPE Persistent Memory



RE: [PATCHv3] printf: Add format for 8-byte EUI-64 type

2015-12-09 Thread Elliott, Robert (Persistent Memory)

> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Keith Busch
> Sent: Wednesday, December 9, 2015 5:07 PM
...
> Subject: [PATCHv3] printf: Add format for 8-byte EUI-64 type
> 
...
> @@ -1496,6 +1496,8 @@ char *pointer(const char *fmt, char *buf, char
> *end, void *ptr,
>   case 'm':   /* Contiguous: 000102030405 */
>   /* [mM]F (FDDI) */
>   /* [mM]R (Reverse order; Bluetooth) */
> + /* [mM]l (EUI-64) */
> + /* [mM][FM]l (FDDI/Reverse order 
> EUI-64) */

Should that be [FR] rather than [FM]?


---
Robert Elliott, HPE Persistent Memory



RE: ARM EFI stub and the EfiPersistentMemory type

2015-12-07 Thread Elliott, Robert (Persistent Memory)

> -Original Message-
> From: Mark Rutland [mailto:mark.rutl...@arm.com]
> Sent: Thursday, December 3, 2015 10:05 PM
> To: Ard Biesheuvel 
> Cc: Elliott, Robert (Persistent Memory) ;
> leif.lindh...@linaro.org; dan.j.willi...@intel.com; Kani, Toshimitsu
> ; linux-nvd...@lists.01.org; matt.flem...@intel.com;
> b...@suse.de; ying...@kernel.org; msal...@redhat.com; roy.fr...@linaro.org;
> linux-kernel@vger.kernel.org
> Subject: Re: ARM EFI stub and the EfiPersistentMemory type
> 
> On Sat, Dec 05, 2015 at 11:11:46AM +0100, Ard Biesheuvel wrote:
> > On 4 December 2015 at 04:02, Mark Rutland  wrote:

>> For EfiPersistentMemory, I understand that such allocations would 
>> be valid, (given that such memory may be mapped as EFI_MEMORY_WB),
>> but would be suboptimal (i.e. that memory will be slower, and
>> would be better suited for other data).
>>
>> Is that understanding correct?
>>
>> Or are there correctness issues with placing the kernel in
>> persistent memory, even if using attributes consistent with
>> EFI_MEMORY_WB?

The contents of persistent memory depend on the drivers being
used.  The pmem block device driver treats the memory as 512 byte 
logical blocks holding anything that can be stored on a block
device - partitions, filesystems, etc.  Other drivers could
have different interpretations.  The user must manage any driver
conflicts.

Any UEFI driver that manages this space (e.g., a UEFI pmem
driver to facilitate booting) must match the OS driver.

UEFI's pool and page allocators need to avoid that area so that
loadable applications treating it as ordinary memory cannot
destroy the persistent data.

> [...]
> 
> > > Is AllocatePages expected to allocate EfiPersistentMemory? The spec
> > > seems vague on this point.
> > >
> >
> > AllocatePages should not return EfiPersistentMemory. That is the
> > consensus, and if the spec does not convey this clearly, we should fix
> > that.
> > The primary reason is that the UEFI spec does not allow the nature of
> > the contents of such regions to be described, and obviously, arbitrary
> > allocations should not be served from regions which are not known to
> > be vacant.
> 
> That was my feeling too.
> 
> AllocatePages will refuse an allocation if the MemoryType parameter was
> EfiPersistentMemory, but I couldn't find any statement that a region of
> EfiPersistentMemory will not be converted to EfiLoaderData or similar to
> fulfill an allocation, as can happen for EfiConventionalMemory.
> 
> I think that could be clarified.

The description of AllocatePool() includes:
"This function allocates pages from EfiConventionalMemory as needed
 to grow the requested pool type."

but that doesn't include EfiPersistentMemory, so it shouldn't be
interpreted as being the source for a pool.

AllocatePages() doesn't say anything like that for EfiConventionalMemory,
so shouldn't imply anything about EfiPersistentMemory.

> > On top of that, the kernel routines that back efi_low_alloc() and
> > efi_high_alloc() traverse the memory map and look for
> > EfiConventionalMemory regions, and trying to allocate from the
> > explicitly. (i.e., they ultimately invoke AllocatePages() with a fixed
> > address which is a priori known to be backed by EfiConventionalMemory)
> 
> Ok. That renders us safe for the case Robert had concerns.
> 
> > > Regarding EfiReservedMemory, the UEFI spec states that such
> > > regions should not be allocated by UEFI applications, drivers,
> > > or OS loaders, so if we can allocate from such regions, that
> > > is a bug that we should correct. I would hope that AllocatePages
> > > would refuse to allocate from such regions, but I don't see
> > > any guarantee to that effect.

AllocatePages() supports EfiReservedMemoryType for internal 
allocations, so doesn't reject such requests like it does
for EfiPersistentMemory.  I guess "just don't do that" applies
for loadable applications - they'd just be removing memory
from their own memory map.

...
> I understand that distinction. I think there's a problem if a region can
> have the WB flag and yet not be safe to map with WB attributes (i.e. the
> type takes precedence).

EfiPersistentMemory will usually be marked as WB|WT|WC|UC, but
persistent memory software must be careful using WB. Writes are 
not persistent until the CPU caches are flushed and appropriate
fence instructions are completed by the CPU.
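
As a rough userspace sketch of that sequence (an assumption, not code
from this thread; assumes CLFLUSHOPT support and 64-byte cache lines):

#include <stdint.h>
#include <stddef.h>

/* Flush every cache line covering [addr, addr+len), then fence so the
 * flushes complete before anything that follows depends on durability. */
static void flush_for_persistence(const void *addr, size_t len)
{
	uintptr_t p = (uintptr_t)addr & ~(uintptr_t)63;
	uintptr_t end = (uintptr_t)addr + len;

	for (; p < end; p += 64)
		asm volatile("clflushopt %0" : "+m" (*(volatile char *)p));
	asm volatile("sfence" ::: "memory");
}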

The ACPI NFIT reports a lot more information about the memory
regions - GUIDs, Flush Hint Addresses, etc.




ARM EFI stub and the EfiPersistentMemory type

2015-12-04 Thread Elliott, Robert (Persistent Memory)
drivers/firmware/efi/libstub/efi-stub-helper.c get_dram_base() 
parses the UEFI memory map, but just looks at the EFI_MEMORY_WB 
attribute while searching for the base memory address, 
not the type:

unsigned long get_dram_base(efi_system_table_t *sys_table_arg) {
...
for_each_efi_memory_desc(, md)
if (md->attribute & EFI_MEMORY_WB)
if (membase > md->phys_addr)
membase = md->phys_addr;


This is used by drivers/firmware/efi/libstub/arm-stub.c to
decide where to place the kernel image:

unsigned long efi_entry(void *handle, efi_system_table_t *sys_table, {
...
dram_base = get_dram_base(sys_table);
if (dram_base == EFI_ERROR) {
pr_efi_err(sys_table, "Failed to find DRAM base\n");
goto fail;
}
status = handle_kernel_image(sys_table, image_addr, _size,
 _addr,
 _size,
 dram_base, image);

Most types, including EfiPersistentMemory (14) and 
EfiReservedMemoryType (0), end up with WB|WT|WC|UC attributes.

So, if persistent memory happened to be below conventional memory,
it appears that this code would end up loading the kernel into
persistent memory, which would not be good.  The same for 
reserved memory ranges. I think it needs to check and only
operate on EfiConventionalMemory (maybe with a few others).
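
A sketch of that check (illustrative only, not a posted patch) would be
to have get_dram_base() skip descriptors whose type is not allocatable
RAM before looking at the WB attribute, e.g. with a helper like:

static bool is_usable_ram(efi_memory_desc_t *md)
{
	switch (md->type) {
	case EFI_LOADER_CODE:
	case EFI_LOADER_DATA:
	case EFI_BOOT_SERVICES_CODE:
	case EFI_BOOT_SERVICES_DATA:
	case EFI_CONVENTIONAL_MEMORY:
		return (md->attribute & EFI_MEMORY_WB) != 0;
	default:
		return false;
	}
}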

Example UEFI memory map (on an x86 system):

efi: mem00: [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x-0x00092fff] (0 B for 588 KiB)
efi: mem01: [Reserved   |   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x00093000-0x00093fff] (588 KiB for 4 KiB)
efi: mem02: [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x00094000-0x0009] (592 KiB for 48 KiB)
efi: mem03: [Loader Data|   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x0010-0x013e8fff] (1 MiB for 19364 KiB)
efi: mem04: [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x013e9000-0x01ff] (20388 KiB for 12380 KiB)
efi: mem05: [Loader Data|   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x0200-0x032e8fff] (32 MiB for 19364 KiB)
efi: mem06: [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x032e9000-0x0fff] (52132 KiB for 210012 KiB)
efi: mem07: [Boot Code  |   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x1000-0x10048fff] (256 MiB for 292 KiB)
efi: mem08: [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x10049000-0x234f4fff] (262436 KiB for 316080 KiB)
efi: mem09: [Loader Data|   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x234f5000-0x3ecf] (578516 KiB for 450604 KiB)
efi: mem10: [Boot Data  |   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x3ed0-0x3ed7] (1005 MiB for 512 KiB)
efi: mem11: [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x3ed8-0x6affbfff] (1029632 KiB for 723440 KiB)
efi: mem12: [Reserved   |   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x6affc000-0x6b5fbfff] (1753072 KiB for 6 MiB)
efi: mem13: [Boot Data  |   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x6b5fc000-0x6b5fcfff] (1759216 KiB for 4 KiB)
efi: mem14: [Runtime Data   |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x6b5fd000-0x6b67dfff] (1759220 KiB for 516 KiB)
efi: mem15: [Boot Data  |   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x6b67e000-0x6ecfafff] (1759736 KiB for 55796 KiB)
efi: mem16: [Boot Code  |   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x6ecfb000-0x6ecfbfff] (1815532 KiB for 4 KiB)
efi: mem17: [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x6ecfc000-0x712a5fff] (1815536 KiB for 38568 KiB)
efi: mem18: [Boot Data  |   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x712a6000-0x71337fff] (1854104 KiB for 584 KiB)
efi: mem19: [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x71338000-0x7135dfff] (1854688 KiB for 152 KiB)
efi: mem20: [Boot Data  |   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x7135e000-0x71387fff] (1854840 KiB for 168 KiB)
efi: mem21: [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x71388000-0x7138bfff] (1855008 KiB for 16 KiB)
efi: mem22: [Boot Data  |   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x7138c000-0x71835fff] (1855024 KiB for 4776 KiB)
efi: mem23: [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x71836000-0x71836fff] (1859800 KiB for 4 KiB)
efi: mem24: [Boot Data
