Re: [PATCH v3 2/2] tpm: add support for nonblocking operation

2018-06-13 Thread J Freyensee




On 6/12/18 10:58 AM, Tadeusz Struk wrote:

Currently the TPM driver only supports blocking calls, which doesn't allow
asynchronous IO operations to the TPM hardware.
This patch changes it and adds support for nonblocking write and a new poll
function to enable applications, which want to take advantage of this.

Signed-off-by: Tadeusz Struk 
---

snip
.
.
.


@@ -84,10 +124,9 @@ ssize_t tpm_common_write(struct file *file, const char 
__user *buf,
 size_t size, loff_t *off)
  {
struct file_priv *priv = file->private_data;
-   size_t in_size = size;
-   ssize_t out_size;
+   int ret = 0;
  
-	if (in_size > TPM_BUFSIZE)

+   if (size > TPM_BUFSIZE)
return -E2BIG;
  
  	mutex_lock(&priv->buffer_mutex);

@@ -97,20 +136,19 @@ ssize_t tpm_common_write(struct file *file, const char 
__user *buf,
 * buffered writes from blocking here.
 */
if (priv->data_pending != 0) {
-   mutex_unlock(&priv->buffer_mutex);
-   return -EBUSY;
+   ret = -EBUSY;
+   goto out;
}
  
-	if (copy_from_user

-   (priv->data_buffer, (void __user *) buf, in_size)) {
-   mutex_unlock(&priv->buffer_mutex);
-   return -EFAULT;
+   if (copy_from_user(priv->data_buffer, buf, size)) {
+   ret = -EFAULT;
+   goto out;
}
  
-	if (in_size < 6 ||

-   in_size < be32_to_cpu(*((__be32 *) (priv->data_buffer + 2 {
-   mutex_unlock(&priv->buffer_mutex);
-   return -EINVAL;
+   if (size < 6 ||
+   size < be32_to_cpu(*((__be32 *)(priv->data_buffer + 2 {
+   ret = -EINVAL;
+   goto out;
}
  
  	/* atomic tpm command send and result receive. We only hold the ops

@@ -118,25 +156,48 @@ ssize_t tpm_common_write(struct file *file, const char 
__user *buf,
 * the char dev is held open.
 */
if (tpm_try_get_ops(priv->chip)) {
-   mutex_unlock(&priv->buffer_mutex);
-   return -EPIPE;
+   ret = -EPIPE;
+   goto out;
}
-   out_size = tpm_transmit(priv->chip, priv->space, priv->data_buffer,
-   sizeof(priv->data_buffer), 0);
  
-	tpm_put_ops(priv->chip);

-   if (out_size < 0) {
-   mutex_unlock(&priv->buffer_mutex);
-   return out_size;
+   /*
+* If in nonblocking mode schedule an async job to send
+* the command return the size.
+* In case of error the err code will be returned in
+* the subsequent read call.
+*/
+   if (file->f_flags & O_NONBLOCK) {
+   queue_work(tpm_dev_wq, &priv->async_work);
+   return size;


Apologies for the question, but should there be a mutex_unlock() here?  
It's about the only return statement I am seeing where I cannot tell if 
a mutex_unlock() will be called before return or is needed before 
return.  The rest of the code is pretty obvious the return statements 
are being re-factored to an out: block where the mutex_unlock() will 
always be called before returning.


Thanks,
Jay





Re: [PATCH V3 3/5 selinux-next] selinux: sidtab_clone switch to use rwlock.

2018-05-30 Thread J Freyensee



  
+int sidtab_clone(struct sidtab *s, struct sidtab *d)

+{
+   int i, rc = 0;
If s or d are NULL (see if() below), why would we want rc, the return 
value, to be 0?  How about defaulting rc to an error value (-EINVAL)?

+   struct sidtab_node *cur;
+
+   if (!s || !d)
+   goto errout;
+
+   read_lock(&s->lock);
+   for (i = 0; i < SIDTAB_SIZE; i++) {
+   cur = s->htable[i];
+   while (cur) {
+   if (cur->sid > SECINITSID_NUM)
+   rc =  sidtab_insert(d, cur->sid, &cur->context);
+   if (rc)
+   goto out;
+   cur = cur->next;
+   }
+   }
+out:
+   read_unlock(&s->lock);
+errout:
+   return rc;
+}


Thanks,
Jay


Re: [PATCH v3 2/2] tpm: reduce polling time to usecs for even finer granularity

2018-05-08 Thread J Freyensee



On 5/7/18 9:07 AM, Nayna Jain wrote:

The TPM burstcount and status commands are supposed to return very
quickly [2][3]. This patch further reduces the TPM poll sleep time to usecs
in get_burstcount() and wait_for_tpm_stat() by calling usleep_range()
directly.

After this change, performance on a system[1] with a TPM 1.2 with an 8 byte
burstcount for 1000 extends improved from ~10.7 sec to ~7 sec.

[1] All tests are performed on an x86 based, locked down, single purpose
closed system. It has Infineon TPM 1.2 using LPC Bus.

[2] From the TCG Specification "TCG PC Client Specific TPM Interface
Specification (TIS), Family 1.2":

"NOTE : It takes roughly 330 ns per byte transfer on LPC. 256 bytes would
take 84 us, which is a long time to stall the CPU. Chipsets may not be
designed to post this much data to LPC; therefore, the CPU itself is
stalled for much of this time. Sending 1 kB would take 350 μs. Therefore,
even if the TPM_STS_x.burstCount field is a high value, software SHOULD
be interruptible during this period."

[3] From the TCG Specification 2.0, "TCG PC Client Platform TPM Profile
(PTP) Specification":

"It takes roughly 330 ns per byte transfer on LPC. 256 bytes would take
84 us. Chipsets may not be designed to post this much data to LPC;
therefore, the CPU itself is stalled for much of this time. Sending 1 kB
would take 350 us. Therefore, even if the TPM_STS_x.burstCount field is a
high value, software should be interruptible during this period. For SPI,
assuming 20MHz clock and 64-byte transfers, it would take about 120 usec
to move 256B of data. Sending 1kB would take about 500 usec. If the
transactions are done using 4 bytes at a time, then it would take about
1 msec. to transfer 1kB of data."

Signed-off-by: Nayna Jain 
Reviewed-by: Mimi Zohar 
Reviewed-by: Jarkko Sakkinen 
---


Acked-by: Jay Freyensee 



Re: [PATCH v3 1/2] tpm: reduce poll sleep time in tpm_transmit()

2018-05-08 Thread J Freyensee



do {
-   tpm_msleep(TPM_POLL_SLEEP);
+   tpm_msleep(TPM_TIMEOUT_POLL);

I'm just curious why it was decided to still use tpm_msleep() here 
instead of usleep_range() which was used in the 2nd patch.


Otherwise,

Acked-by: Jay Freyensee 



Re: [PATCH 4/8] struct page: add field for vm_struct

2018-03-15 Thread J Freyensee



On 3/15/18 2:38 AM, Igor Stoppa wrote:

On 14/03/18 19:43, J Freyensee wrote:

On 3/13/18 3:00 PM, Matthew Wilcox wrote:

[...]


Signed-off-by: Igor Stoppa 

Reviewed-by: Matthew Wilcox 

Igor, do you mind sticking these tags on the files that have spent some
time reviewing a revision of your patchset (like the Reviewed-by: tags I
provided last revision?)

Apologies, that was not intentional, I forgot it.
I will do it, although most of the files will now change so much that I
am not sure what will survive, beside this patch, in the form that you
reviewed.

I suppose the Review-by tag drops, if the patch changes.


That's true, if so much of the patch changes it basically looks like 
something different, the Reviewed-by: would drop.


Jay



--
igor




Re: [PATCH 4/8] struct page: add field for vm_struct

2018-03-14 Thread J Freyensee



On 3/13/18 3:00 PM, Matthew Wilcox wrote:

On Tue, Mar 13, 2018 at 11:45:50PM +0200, Igor Stoppa wrote:

When a page is used for virtual memory, it is often necessary to obtain
a handler to the corresponding vm_struct, which refers to the virtually
continuous area generated when invoking vmalloc.

The struct page has a "mapping" field, which can be re-used, to store a
pointer to the parent area.

This will avoid more expensive searches, later on.

Signed-off-by: Igor Stoppa 

Reviewed-by: Matthew Wilcox 


Igor, do you mind sticking these tags on the files that have spent some 
time reviewing a revision of your patchset (like the Reviewed-by: tags I 
provided last revision?)


Thanks,
Jay




Re: [PATCH 5/8] Protectable Memory

2018-03-14 Thread J Freyensee




+struct pmalloc_data {
+   struct gen_pool *pool;  /* Link back to the associated pool. */
+   bool protected; /* Status of the pool: RO or RW. */
+   struct kobj_attribute attr_protected; /* Sysfs attribute. */
+   struct kobj_attribute attr_avail; /* Sysfs attribute. */
+   struct kobj_attribute attr_size;  /* Sysfs attribute. */
+   struct kobj_attribute attr_chunks;/* Sysfs attribute. */
+   struct kobject *pool_kobject;
+   struct list_head node; /* list of pools */
+};

sysfs attributes aren't free, you know.  I appreciate you want something
to help debug / analyse, but having one file for the whole subsystem or
at least one per pool would be a better idea.

Which means that it should not be normal sysfs, but rather debugfs, if I
understand correctly, since in sysfs 1 value -> 1 file.


Yes, that is a good idea, to use debugfs so you still have a means to 
debug/analyze but can be also turned off for normal system execution.  
Sorry I didn't think about that earlier to save a revision, that's one 
of my favorite things I like to use for diagnosis.


Jay



Re: [PATCH v3 3/5] tpm: migrate tpm2_probe() to use struct tpm_buf

2018-03-08 Thread J Freyensee



On 3/5/18 8:56 AM, Jarkko Sakkinen wrote:

In order to make struct tpm_buf the first class object for constructing TPM
commands, migrate tpm2_probe() to use it.

Signed-off-by: Jarkko Sakkinen 
---
  drivers/char/tpm/tpm2-cmd.c | 27 +++
  1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/drivers/char/tpm/tpm2-cmd.c b/drivers/char/tpm/tpm2-cmd.c
index abe6ef4a7a0b..890d83c5c78b 100644
--- a/drivers/char/tpm/tpm2-cmd.c
+++ b/drivers/char/tpm/tpm2-cmd.c
@@ -851,22 +851,25 @@ static int tpm2_do_selftest(struct tpm_chip *chip)
   */
  int tpm2_probe(struct tpm_chip *chip)
  {
-   struct tpm2_cmd cmd;
+   struct tpm_output_header *out;
+   struct tpm_buf buf;
int rc;
  
-	cmd.header.in = tpm2_get_tpm_pt_header;

-   cmd.params.get_tpm_pt_in.cap_id = cpu_to_be32(TPM2_CAP_TPM_PROPERTIES);
-   cmd.params.get_tpm_pt_in.property_id = cpu_to_be32(0x100);
-   cmd.params.get_tpm_pt_in.property_cnt = cpu_to_be32(1);
-
-   rc = tpm_transmit_cmd(chip, NULL, &cmd, sizeof(cmd), 0, 0, NULL);
-   if (rc <  0)
+   rc = tpm_buf_init(&buf, TPM2_ST_NO_SESSIONS, TPM2_CC_GET_CAPABILITY);
+   if (rc)
return rc;
-
-   if (be16_to_cpu(cmd.header.out.tag) == TPM2_ST_NO_SESSIONS)
+   tpm_buf_append_u32(&buf, TPM2_CAP_TPM_PROPERTIES);
+   tpm_buf_append_u32(&buf, TPM_PT_TOTAL_COMMANDS);
+   tpm_buf_append_u32(&buf, 1);
+   rc = tpm_transmit_cmd(chip, NULL, buf.data, PAGE_SIZE, 0, 0, NULL);
+   if (rc <  0)
+   goto out;
+   out = (struct tpm_output_header *)buf.data;
+   if (be16_to_cpu(out->tag) == TPM2_ST_NO_SESSIONS)
chip->flags |= TPM_CHIP_FLAG_TPM2;
-
-   return 0;
+out:
+   tpm_buf_destroy(&buf);
+   return rc;



Looks better :-).


Acked-by: Jay Freyensee 



  }
  EXPORT_SYMBOL_GPL(tpm2_probe);
  




Re: [PATCH 6/7] lkdtm: crash on overwriting protected pmalloc var

2018-03-07 Thread J Freyensee



On 3/7/18 5:18 AM, Igor Stoppa wrote:


On 06/03/18 19:20, J Freyensee wrote:


On 2/28/18 12:06 PM, Igor Stoppa wrote:

[...]


   void __init lkdtm_perms_init(void);
   void lkdtm_WRITE_RO(void);
   void lkdtm_WRITE_RO_AFTER_INIT(void);
+void lkdtm_WRITE_RO_PMALLOC(void);

Does this need some sort of #ifdef too?

Not strictly. It's just a function declaration.
As long as it is not used, the linker will not complain.
The #ifdef placed around the use and definition is sufficient, from a
correctness perspective.

But it's a different question if there is any standard in linux about
hiding also the declaration.



I'd prefer hiding it if it's contents are being ifdef'ed out, but I 
really think it's more of a maintainer preference question.





I am not very fond of #ifdefs, so when I can I try to avoid them.







Re: [PATCH 7/7] Documentation for Pmalloc

2018-03-06 Thread J Freyensee


Minus the comment-fixes Mike Rapoport mentioned, looks good:

Reviewed-by: Jay Freyensee 


On 2/28/18 12:06 PM, Igor Stoppa wrote:

Detailed documentation about the protectable memory allocator.

Signed-off-by: Igor Stoppa 
---
  Documentation/core-api/index.rst   |   1 +
  Documentation/core-api/pmalloc.rst | 111 +
  2 files changed, 112 insertions(+)
  create mode 100644 Documentation/core-api/pmalloc.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index c670a8031786..8f5de42d6571 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -25,6 +25,7 @@ Core utilities
 genalloc
 errseq
 printk-formats
+   pmalloc
  
  Interfaces for kernel debugging

  ===
diff --git a/Documentation/core-api/pmalloc.rst 
b/Documentation/core-api/pmalloc.rst
new file mode 100644
index ..8fb9c9d3171b
--- /dev/null
+++ b/Documentation/core-api/pmalloc.rst
@@ -0,0 +1,111 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Protectable memory allocator
+
+
+Purpose
+---
+
+The pmalloc library is meant to provide R/O status to data that, for some
+reason, could neither be declared as constant, nor could it take advantage
+of the qualifier __ro_after_init, but is write-once and read-only in spirit.
+It protects data from both accidental and malicious overwrites.
+
+Example: A policy that is loaded from userspace.
+
+
+Concept
+---
+
+pmalloc builds on top of genalloc, using the same concept of memory pools.
+
+The value added by pmalloc is that now the memory contained in a pool can
+become R/O, for the rest of the life of the pool.
+
+Different kernel drivers and threads can use different pools, for finer
+control of what becomes R/O and when. And for improved lockless concurrency.
+
+
+Caveats
+---
+
+- Memory freed while a pool is not yet protected will be reused.
+
+- Once a pool is protected, it's not possible to allocate any more memory
+  from it.
+
+- Memory "freed" from a protected pool indicates that such memory is not
+  in use anymore by the requester; however, it will not become available
+  for further use, until the pool is destroyed.
+
+- pmalloc does not provide locking support with respect to allocating vs
+  protecting an individual pool, for performance reasons.
+  It is recommended not to share the same pool between unrelated functions.
+  Should sharing be a necessity, the user of the shared pool is expected
+  to implement locking for that pool.
+
+- pmalloc uses genalloc to optimize the use of the space it allocates
+  through vmalloc. Some more TLB entries will be used, however less than
+  in the case of using vmalloc directly. The exact number depends on the
+  size of each allocation request and possible slack.
+
+- Considering that not much data is supposed to be dynamically allocated
+  and then marked as read-only, it shouldn't be an issue that the address
+  range for pmalloc is limited, on 32-bit systems.
+
+- Regarding SMP systems, the allocations are expected to happen mostly
+  during an initial transient, after which there should be no more need to
+  perform cross-processor synchronizations of page tables.
+
+- To facilitate the conversion of existing code to pmalloc pools, several
+  helper functions are provided, mirroring their kmalloc counterparts.
+
+
+Use
+---
+
+The typical sequence, when using pmalloc, is:
+
+1. create a pool
+
+.. kernel-doc:: include/linux/pmalloc.h
+   :functions: pmalloc_create_pool
+
+2. [optional] pre-allocate some memory in the pool
+
+.. kernel-doc:: include/linux/pmalloc.h
+   :functions: pmalloc_prealloc
+
+3. issue one or more allocation requests to the pool with locking as needed
+
+.. kernel-doc:: include/linux/pmalloc.h
+   :functions: pmalloc
+
+.. kernel-doc:: include/linux/pmalloc.h
+   :functions: pzalloc
+
+4. initialize the memory obtained with desired values
+
+5. [optional] iterate over points 3 & 4 as needed
+
+6. write-protect the pool
+
+.. kernel-doc:: include/linux/pmalloc.h
+   :functions: pmalloc_protect_pool
+
+7. use in read-only mode the handles obtained through the allocations
+
+8. [optional] release all the memory allocated
+
+.. kernel-doc:: include/linux/pmalloc.h
+   :functions: pfree
+
+9. [optional, but depends on point 8] destroy the pool
+
+.. kernel-doc:: include/linux/pmalloc.h
+   :functions: pmalloc_destroy_pool
+
+API
+---
+
+.. kernel-doc:: include/linux/pmalloc.h




Re: [PATCH 6/7] lkdtm: crash on overwriting protected pmalloc var

2018-03-06 Thread J Freyensee



On 2/28/18 12:06 PM, Igor Stoppa wrote:

Verify that pmalloc read-only protection is in place: trying to
overwrite a protected variable will crash the kernel.

Signed-off-by: Igor Stoppa 
---
  drivers/misc/lkdtm.h   |  1 +
  drivers/misc/lkdtm_core.c  |  3 +++
  drivers/misc/lkdtm_perms.c | 28 
  3 files changed, 32 insertions(+)

diff --git a/drivers/misc/lkdtm.h b/drivers/misc/lkdtm.h
index 9e513dcfd809..dcda3ae76ceb 100644
--- a/drivers/misc/lkdtm.h
+++ b/drivers/misc/lkdtm.h
@@ -38,6 +38,7 @@ void lkdtm_READ_BUDDY_AFTER_FREE(void);
  void __init lkdtm_perms_init(void);
  void lkdtm_WRITE_RO(void);
  void lkdtm_WRITE_RO_AFTER_INIT(void);
+void lkdtm_WRITE_RO_PMALLOC(void);


Does this need some sort of #ifdef too?


  void lkdtm_WRITE_KERN(void);
  void lkdtm_EXEC_DATA(void);
  void lkdtm_EXEC_STACK(void);
diff --git a/drivers/misc/lkdtm_core.c b/drivers/misc/lkdtm_core.c
index 2154d1bfd18b..c9fd42bda6ee 100644
--- a/drivers/misc/lkdtm_core.c
+++ b/drivers/misc/lkdtm_core.c
@@ -155,6 +155,9 @@ static const struct crashtype crashtypes[] = {
CRASHTYPE(ACCESS_USERSPACE),
CRASHTYPE(WRITE_RO),
CRASHTYPE(WRITE_RO_AFTER_INIT),
+#ifdef CONFIG_PROTECTABLE_MEMORY
+   CRASHTYPE(WRITE_RO_PMALLOC),
+#endif
CRASHTYPE(WRITE_KERN),
CRASHTYPE(REFCOUNT_INC_OVERFLOW),
CRASHTYPE(REFCOUNT_ADD_OVERFLOW),
diff --git a/drivers/misc/lkdtm_perms.c b/drivers/misc/lkdtm_perms.c
index 53b85c9d16b8..0ac9023fd2b0 100644
--- a/drivers/misc/lkdtm_perms.c
+++ b/drivers/misc/lkdtm_perms.c
@@ -9,6 +9,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  
  /* Whether or not to fill the target memory area with do_nothing(). */

@@ -104,6 +105,33 @@ void lkdtm_WRITE_RO_AFTER_INIT(void)
*ptr ^= 0xabcd1234;
  }
  
+#ifdef CONFIG_PROTECTABLE_MEMORY

+void lkdtm_WRITE_RO_PMALLOC(void)
+{
+   struct gen_pool *pool;
+   int *i;
+
+   pool = pmalloc_create_pool("pool", 0);
+   if (unlikely(!pool)) {
+   pr_info("Failed preparing pool for pmalloc test.");
+   return;
+   }
+
+   i = (int *)pmalloc(pool, sizeof(int), GFP_KERNEL);
+   if (unlikely(!i)) {
+   pr_info("Failed allocating memory for pmalloc test.");
+   pmalloc_destroy_pool(pool);
+   return;
+   }
+
+   *i = INT_MAX;
+   pmalloc_protect_pool(pool);
+
+   pr_info("attempting bad pmalloc write at %p\n", i);
+   *i = 0;


OK, now I'm on the right version of this patch series, same comment 
applies.  I don't get the local *i assignment at the end of the 
function, but seems harmless.


Except the two minor comments, otherwise,
Reviewed-by: Jay Freyensee 


+}
+#endif
+
  void lkdtm_WRITE_KERN(void)
  {
size_t size;




Re: [PATCH 5/7] Pmalloc selftest

2018-03-06 Thread J Freyensee

Looks good, and a bit more thorough test than last iteration.

Reviewed-by: Jay Freyensee 


On 2/28/18 12:06 PM, Igor Stoppa wrote:

Add basic self-test functionality for pmalloc.

The testing is introduced as early as possible, right after the main
dependency, genalloc, has passed successfully, so that it can help
diagnosing failures in pmalloc users.

Signed-off-by: Igor Stoppa 
---
  include/linux/test_pmalloc.h |  24 +++
  init/main.c  |   2 +
  mm/Kconfig   |  10 +
  mm/Makefile  |   1 +
  mm/test_pmalloc.c| 100 +++
  5 files changed, 137 insertions(+)
  create mode 100644 include/linux/test_pmalloc.h
  create mode 100644 mm/test_pmalloc.c

diff --git a/include/linux/test_pmalloc.h b/include/linux/test_pmalloc.h
new file mode 100644
index ..c7e2e451c17c
--- /dev/null
+++ b/include/linux/test_pmalloc.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * test_pmalloc.h
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+
+#ifndef __LINUX_TEST_PMALLOC_H
+#define __LINUX_TEST_PMALLOC_H
+
+
+#ifdef CONFIG_TEST_PROTECTABLE_MEMORY
+
+void test_pmalloc(void);
+
+#else
+
+static inline void test_pmalloc(void){};
+
+#endif
+
+#endif
diff --git a/init/main.c b/init/main.c
index 2bf1312fd2fe..ea44c940070a 100644
--- a/init/main.c
+++ b/init/main.c
@@ -91,6 +91,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include 

  #include 
@@ -663,6 +664,7 @@ asmlinkage __visible void __init start_kernel(void)
mem_encrypt_init();
  
  	test_genalloc();

+   test_pmalloc();
  #ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
diff --git a/mm/Kconfig b/mm/Kconfig
index 016d29b9400b..47b0843b02d2 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -767,3 +767,13 @@ config PROTECTABLE_MEMORY
  depends on ARCH_HAS_SET_MEMORY
  select GENERIC_ALLOCATOR
  default y
+
+config TEST_PROTECTABLE_MEMORY
+   bool "Run self test for pmalloc memory allocator"
+depends on MMU
+   depends on ARCH_HAS_SET_MEMORY
+   select PROTECTABLE_MEMORY
+   default n
+   help
+ Tries to verify that pmalloc works correctly and that the memory
+ is effectively protected.
diff --git a/mm/Makefile b/mm/Makefile
index 959fdbdac118..1de4be5fd0bc 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -66,6 +66,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
  obj-$(CONFIG_SLOB) += slob.o
  obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
  obj-$(CONFIG_PROTECTABLE_MEMORY) += pmalloc.o
+obj-$(CONFIG_TEST_PROTECTABLE_MEMORY) += test_pmalloc.o
  obj-$(CONFIG_KSM) += ksm.o
  obj-$(CONFIG_PAGE_POISONING) += page_poison.o
  obj-$(CONFIG_SLAB) += slab.o
diff --git a/mm/test_pmalloc.c b/mm/test_pmalloc.c
new file mode 100644
index ..df7ecc91c6a4
--- /dev/null
+++ b/mm/test_pmalloc.c
@@ -0,0 +1,100 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * test_pmalloc.c
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#define SIZE_1 (PAGE_SIZE * 3)
+#define SIZE_2 1000
+
+static inline bool validate_alloc(bool expected, void *addr,
+ unsigned long size)
+{
+   bool test;
+
+   test = is_pmalloc_object(addr, size) > 0;
+   pr_notice("must be %s: %s",
+ expected ? "ok" : "no", test ? "ok" : "no");
+   return test == expected;
+}
+
+#define is_alloc_ok(variable, size)\
+   validate_alloc(true, variable, size)
+
+#define is_alloc_no(variable, size)\
+   validate_alloc(false, variable, size)
+
+void test_pmalloc(void)
+{
+   struct gen_pool *pool_unprot;
+   struct gen_pool *pool_prot;
+   void *var_prot, *var_unprot, *var_vmall;
+
+   pr_notice("pmalloc-selftest");
+   pool_unprot = pmalloc_create_pool("unprotected", 0);
+   if (unlikely(!pool_unprot))
+   goto error;
+   pool_prot = pmalloc_create_pool("protected", 0);
+   if (unlikely(!(pool_prot)))
+   goto error_release;
+
+   pr_notice("Testing allocation capability");
+   var_unprot = pmalloc(pool_unprot,  SIZE_1 - 1, GFP_KERNEL);
+   var_prot = pmalloc(pool_prot,  SIZE_1, GFP_KERNEL);
+   *(int *)var_prot = 0;
+   var_vmall = vmalloc(SIZE_2);
+
+
+   pr_notice("Test correctness of is_pmalloc_object()");
+   WARN_ON(unlikely(!is_alloc_ok(var_unprot, 10)));
+   WARN_ON(unlikely(!is_alloc_ok(var_unprot, SIZE_1)));
+   WARN_ON(unlikely(!is_alloc_ok(var_unprot, PAGE_SIZE)));
+   WARN_ON(unlikely(!is_alloc_no(var_unprot, SIZE_1 + 1)));
+   WARN_ON(unlikely(!is_alloc_no(var_vmall, 10)));
+
+
+   pfree(pool_unprot, var_unprot);
+   vfree(var_vmall);
+
+   pmalloc_protect_pool(pool_prot);
+
+   /*
+   

Re: [PATCH 6/7] lkdtm: crash on overwriting protected pmalloc var

2018-03-06 Thread J Freyensee



On 3/6/18 9:05 AM, J Freyensee wrote:



  +#ifdef CONFIG_PROTECTABLE_MEMORY
+void lkdtm_WRITE_RO_PMALLOC(void)
+{
+    struct gen_pool *pool;
+    int *i;
+
+    pool = pmalloc_create_pool("pool", 0);
+    if (unlikely(!pool)) {
+    pr_info("Failed preparing pool for pmalloc test.");
+    return;
+    }
+
+    i = (int *)pmalloc(pool, sizeof(int), GFP_KERNEL);
+    if (unlikely(!i)) {
+    pr_info("Failed allocating memory for pmalloc test.");
+    pmalloc_destroy_pool(pool);
+    return;
+    }
+
+    *i = INT_MAX;
+    pmalloc_protect_pool(pool);
+
+    pr_info("attempting bad pmalloc write at %p\n", i);
+    *i = 0;




Opps, disregard this, this is the last series of this patch series, not 
the most recent one :-(.




Seems harmless, but I don't get why *i local variable needs to be set 
to 0 at the end of this function.



Otherwise,

Reviewed-by: Jay Freyensee 





Re: [PATCH 6/7] lkdtm: crash on overwriting protected pmalloc var

2018-03-06 Thread J Freyensee


  
+#ifdef CONFIG_PROTECTABLE_MEMORY

+void lkdtm_WRITE_RO_PMALLOC(void)
+{
+   struct gen_pool *pool;
+   int *i;
+
+   pool = pmalloc_create_pool("pool", 0);
+   if (unlikely(!pool)) {
+   pr_info("Failed preparing pool for pmalloc test.");
+   return;
+   }
+
+   i = (int *)pmalloc(pool, sizeof(int), GFP_KERNEL);
+   if (unlikely(!i)) {
+   pr_info("Failed allocating memory for pmalloc test.");
+   pmalloc_destroy_pool(pool);
+   return;
+   }
+
+   *i = INT_MAX;
+   pmalloc_protect_pool(pool);
+
+   pr_info("attempting bad pmalloc write at %p\n", i);
+   *i = 0;


Seems harmless, but I don't get why *i local variable needs to be set to 
0 at the end of this function.



Otherwise,

Reviewed-by: Jay Freyensee 



Re: [PATCH 5/7] Pmalloc selftest

2018-03-06 Thread J Freyensee

Looks good,

Reviewed-by: Jay Freyensee 


On 2/23/18 6:48 AM, Igor Stoppa wrote:

Add basic self-test functionality for pmalloc.

The testing is introduced as early as possible, right after the main
dependency, genalloc, has passed successfully, so that it can help
diagnosing failures in pmalloc users.

Signed-off-by: Igor Stoppa 
---
  include/linux/test_pmalloc.h | 24 ++
  init/main.c  |  2 ++
  mm/Kconfig   |  9 +
  mm/Makefile  |  1 +
  mm/test_pmalloc.c| 79 
  5 files changed, 115 insertions(+)
  create mode 100644 include/linux/test_pmalloc.h
  create mode 100644 mm/test_pmalloc.c

diff --git a/include/linux/test_pmalloc.h b/include/linux/test_pmalloc.h
new file mode 100644
index ..c7e2e451c17c
--- /dev/null
+++ b/include/linux/test_pmalloc.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * test_pmalloc.h
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+
+#ifndef __LINUX_TEST_PMALLOC_H
+#define __LINUX_TEST_PMALLOC_H
+
+
+#ifdef CONFIG_TEST_PROTECTABLE_MEMORY
+
+void test_pmalloc(void);
+
+#else
+
+static inline void test_pmalloc(void){};
+
+#endif
+
+#endif
diff --git a/init/main.c b/init/main.c
index bfccf1fd463c..a61e3a5fd321 100644
--- a/init/main.c
+++ b/init/main.c
@@ -90,6 +90,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include 

  #include 
@@ -662,6 +663,7 @@ asmlinkage __visible void __init start_kernel(void)
mem_encrypt_init();
  
  	test_genalloc();

+   test_pmalloc();
  #ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
diff --git a/mm/Kconfig b/mm/Kconfig
index be578fbdce6d..81dcc5345631 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -766,3 +766,12 @@ config PROTECTABLE_MEMORY
  depends on ARCH_HAS_SET_MEMORY
  select GENERIC_ALLOCATOR
  default y
+
+config TEST_PROTECTABLE_MEMORY
+   bool "Run self test for pmalloc memory allocator"
+   depends on ARCH_HAS_SET_MEMORY
+   select PROTECTABLE_MEMORY
+   default n
+   help
+ Tries to verify that pmalloc works correctly and that the memory
+ is effectively protected.
diff --git a/mm/Makefile b/mm/Makefile
index 959fdbdac118..1de4be5fd0bc 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -66,6 +66,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
  obj-$(CONFIG_SLOB) += slob.o
  obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
  obj-$(CONFIG_PROTECTABLE_MEMORY) += pmalloc.o
+obj-$(CONFIG_TEST_PROTECTABLE_MEMORY) += test_pmalloc.o
  obj-$(CONFIG_KSM) += ksm.o
  obj-$(CONFIG_PAGE_POISONING) += page_poison.o
  obj-$(CONFIG_SLAB) += slab.o
diff --git a/mm/test_pmalloc.c b/mm/test_pmalloc.c
new file mode 100644
index ..5de21c462d3d
--- /dev/null
+++ b/mm/test_pmalloc.c
@@ -0,0 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * test_pmalloc.c
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+#include 
+#include 
+#include 
+
+#define SIZE_1 (PAGE_SIZE * 3)
+#define SIZE_2 1000
+
+#define validate_alloc(expected, variable, size)   \
+   pr_notice("must be " expected ": %s",   \
+ is_pmalloc_object(variable, size) > 0 ? "ok" : "no")
+
+#define is_alloc_ok(variable, size)\
+   validate_alloc("ok", variable, size)
+
+#define is_alloc_no(variable, size)\
+   validate_alloc("no", variable, size)
+
+void test_pmalloc(void)
+{
+   struct gen_pool *pool_unprot;
+   struct gen_pool *pool_prot;
+   void *var_prot, *var_unprot, *var_vmall;
+
+   pr_notice("pmalloc-selftest");
+   pool_unprot = pmalloc_create_pool("unprotected", 0);
+   if (unlikely(!pool_unprot))
+   goto error;
+   pool_prot = pmalloc_create_pool("protected", 0);
+   if (unlikely(!(pool_prot)))
+   goto error_release;
+
+   var_unprot = pmalloc(pool_unprot,  SIZE_1 - 1, GFP_KERNEL);
+   var_prot = pmalloc(pool_prot,  SIZE_1, GFP_KERNEL);
+   *(int *)var_prot = 0;
+   var_vmall = vmalloc(SIZE_2);
+   is_alloc_ok(var_unprot, 10);
+   is_alloc_ok(var_unprot, SIZE_1);
+   is_alloc_ok(var_unprot, PAGE_SIZE);
+   is_alloc_no(var_unprot, SIZE_1 + 1);
+   is_alloc_no(var_vmall, 10);
+
+
+   pfree(pool_unprot, var_unprot);
+   vfree(var_vmall);
+
+   pmalloc_protect_pool(pool_prot);
+
+   /*
+* This will intentionally trigger a WARN because the pool being
+* destroyed is not protected, which is unusual and should happen
+* on error paths only, where probably other warnings are already
+* displayed.
+*/
+   pr_notice("pmalloc-selftest:"
+ " Expect WARN in pmalloc_pool_set_protection below.");
+   pmalloc_destroy_pool(pool_unprot);
+   pr_notice("pmalloc-selftest

Re: [PATCH 4/7] Protectable Memory

2018-03-05 Thread J Freyensee

snip
.
.
.


+
+config PROTECTABLE_MEMORY
+bool
+depends on MMU



Curious, would you also want to depend on "SECURITY" as well, as this is 
being advertised as a compliment to __read_only_after_init, per the file 
header comments, as I'm assuming ro_after_init would be disabled if the 
SECURITY Kconfig selection is *NOT* selected?




+depends on ARCH_HAS_SET_MEMORY
+select GENERIC_ALLOCATOR
+default y
diff --git a/mm/Makefile b/mm/Makefile
index e669f02c5a54..959fdbdac118 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -65,6 +65,7 @@ obj-$(CONFIG_SPARSEMEM)   += sparse.o
  obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
  obj-$(CONFIG_SLOB) += slob.o
  obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
+obj-$(CONFIG_PROTECTABLE_MEMORY) += pmalloc.o
  obj-$(CONFIG_KSM) += ksm.o
  obj-$(CONFIG_PAGE_POISONING) += page_poison.o
  obj-$(CONFIG_SLAB) += slab.o
diff --git a/mm/pmalloc.c b/mm/pmalloc.c
new file mode 100644
index ..acdec0fbdde6
--- /dev/null
+++ b/mm/pmalloc.c
@@ -0,0 +1,468 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pmalloc.c: Protectable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+/*
+ * pmalloc_data contains the data specific to a pmalloc pool,
+ * in a format compatible with the design of gen_alloc.
+ * Some of the fields are used for exposing the corresponding parameter
+ * to userspace, through sysfs.
+ */
+struct pmalloc_data {
+   struct gen_pool *pool;  /* Link back to the associated pool. */
+   bool protected; /* Status of the pool: RO or RW. */


nitpick, you could probably get a tad bit better byte packing alignment 
of this struct if "bool protected" was stuck as the last element in this 
data structure.



+   struct kobj_attribute attr_protected; /* Sysfs attribute. */
+   struct kobj_attribute attr_avail; /* Sysfs attribute. */
+   struct kobj_attribute attr_size;  /* Sysfs attribute. */
+   struct kobj_attribute attr_chunks;/* Sysfs attribute. */
+   struct kobject *pool_kobject;
+   struct list_head node; /* list of pools */
+};
+
+static LIST_HEAD(pmalloc_final_list);
+static LIST_HEAD(pmalloc_tmp_list);
+static struct list_head *pmalloc_list = &pmalloc_tmp_list;
+static DEFINE_MUTEX(pmalloc_mutex);
+static struct kobject *pmalloc_kobject;
+
+static ssize_t pmalloc_pool_show_protected(struct kobject *dev,
+  struct kobj_attribute *attr,
+  char *buf)
+{
+   struct pmalloc_data *data;
+
+   data = container_of(attr, struct pmalloc_data, attr_protected);
+   if (data->protected)
+   return sprintf(buf, "protected\n");
+   else
+   return sprintf(buf, "unprotected\n");
+}
+
+static ssize_t pmalloc_pool_show_avail(struct kobject *dev,
+  struct kobj_attribute *attr,
+  char *buf)
+{
+   struct pmalloc_data *data;
+
+   data = container_of(attr, struct pmalloc_data, attr_avail);
+   return sprintf(buf, "%lu\n",
+  (unsigned long)gen_pool_avail(data->pool));
+}
+
+static ssize_t pmalloc_pool_show_size(struct kobject *dev,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+   struct pmalloc_data *data;
+
+   data = container_of(attr, struct pmalloc_data, attr_size);
+   return sprintf(buf, "%lu\n",
+  (unsigned long)gen_pool_size(data->pool));
+}



Curious, will this show the size in bytes?



+
+static void pool_chunk_number(struct gen_pool *pool,
+ struct gen_pool_chunk *chunk, void *data)
+{
+   unsigned long *counter = data;
+
+   (*counter)++;
+}
+
+static ssize_t pmalloc_pool_show_chunks(struct kobject *dev,
+   struct kobj_attribute *attr,
+   char *buf)
+{
+   struct pmalloc_data *data;
+   unsigned long chunks_num = 0;
+
+   data = container_of(attr, struct pmalloc_data, attr_chunks);
+   gen_pool_for_each_chunk(data->pool, pool_chunk_number, &chunks_num);
+   return sprintf(buf, "%lu\n", chunks_num);
+}
+
+/* Exposes the pool and its attributes through sysfs. */
+static struct kobject *pmalloc_connect(struct pmalloc_data *data)
+{
+   const struct attribute *attrs[] = {
+   &data->attr_protected.attr,
+   &data->attr_avail.attr,
+   &data->attr_size.attr,
+   &data->attr_chunks.attr,
+   NULL
+   };
+   struct kobject *kobj;
+
+   kobj = kobject_create_and_add(data->pool->name, pmalloc_kobject);
+   if (unlikely(!kobj))
+  

Re: [PATCH 3/7] struct page: add field for vm_struct

2018-03-05 Thread J Freyensee

Reviewed-by: Jay Freyensee 

On 2/28/18 12:06 PM, Igor Stoppa wrote:

When a page is used for virtual memory, it is often necessary to obtain
a handler to the corresponding vm_struct, which refers to the virtually
continuous area generated when invoking vmalloc.

The struct page has a "mapping" field, which can be re-used, to store a
pointer to the parent area.

This will avoid more expensive searches, later on.

Signed-off-by: Igor Stoppa 
---
  include/linux/mm_types.h | 1 +
  mm/vmalloc.c | 2 ++
  2 files changed, 3 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fd1af6b9591d..c3a4825e10c0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -84,6 +84,7 @@ struct page {
void *s_mem;/* slab first object */
atomic_t compound_mapcount; /* first tail page */
/* page_deferred_list().next -- second tail page */
+   struct vm_struct *area;
};
  
  	/* Second double word */

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ebff729cc956..61a1ca22b0f6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1536,6 +1536,7 @@ static void __vunmap(const void *addr, int 
deallocate_pages)
struct page *page = area->pages[i];
  
  			BUG_ON(!page);

+   page->area = NULL;
__free_pages(page, 0);
}
  
@@ -1705,6 +1706,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,

area->nr_pages = i;
goto fail;
}
+   page->area = area;
area->pages[i] = page;
if (gfpflags_allow_blocking(gfp_mask|highmem_mask))
cond_resched();




Re: [PATCH 2/7] genalloc: selftest

2018-03-05 Thread J Freyensee



+
+/*
+ * In case of failure of any of these tests, memory corruption is almost
+ * guarranteed; allowing the boot to continue means risking to corrupt
+ * also any filesystem/block device accessed write mode.
+ * Therefore, BUG_ON() is used, when testing.
+ */
+
+


I like the explanation; good background info on why something is 
implemented the way it is :-).


Reviewed-by: Jay Freyensee 



Re: [PATCH 1/7] genalloc: track beginning of allocations

2018-03-05 Thread J Freyensee

.
.


On 2/28/18 12:06 PM, Igor Stoppa wrote:

+
+/**
+ * gen_pool_dma_alloc() - allocate special memory from the pool for DMA usage
+ * @pool: pool to allocate from
+ * @size: number of bytes to allocate from the pool
+ * @dma: dma-view physical address return value.  Use NULL if unneeded.
+ *
+ * Allocate the requested number of bytes from the specified pool.
+ * Uses the pool allocation function (with first-fit algorithm by default).
+ * Can not be used in NMI handler on architectures without
+ * NMI-safe cmpxchg implementation.
+ *
+ * Return:
+ * * address of the memory allocated   - success
+ * * NULL  - error
+ */
+void *gen_pool_dma_alloc(struct gen_pool *pool, size_t size, dma_addr_t *dma);
+


OK, so gen_pool_dma_alloc() is defined here, which believe is the API 
line being drawn for this series.


so,
.
.
.


  
  /**

- * gen_pool_dma_alloc - allocate special memory from the pool for DMA usage
+ * gen_pool_dma_alloc() - allocate special memory from the pool for DMA usage
   * @pool: pool to allocate from
   * @size: number of bytes to allocate from the pool
   * @dma: dma-view physical address return value.  Use NULL if unneeded.
@@ -342,14 +566,15 @@ EXPORT_SYMBOL(gen_pool_alloc_algo);
   * Uses the pool allocation function (with first-fit algorithm by default).
   * Can not be used in NMI handler on architectures without
   * NMI-safe cmpxchg implementation.
+ *
+ * Return:
+ * * address of the memory allocated   - success
+ * * NULL  - error
   */
  void *gen_pool_dma_alloc(struct gen_pool *pool, size_t size, dma_addr_t *dma)
  {
unsigned long vaddr;
  
-	if (!pool)

-   return NULL;
-
why is this being removed?  I don't believe this code was getting 
removed from your v17 series patches.

vaddr = gen_pool_alloc(pool, size);
if (!vaddr)
return NULL;
@@ -362,10 +587,10 @@ void *gen_pool_dma_alloc(struct gen_pool *pool, size_t 
size, dma_addr_t *dma)
  EXPORT_SYMBOL(gen_pool_dma_alloc);
  


Otherwise, looks good,

Reviewed-by: Jay Freyensee 


Re: [PATCH 3/5] tpm: migrate tpm2_probe() to use struct tpm_buf

2018-03-01 Thread J Freyensee

.
.
.
I'm new to this area of the kernel, but I'm not getting these lines:


+   rc = tpm_transmit_cmd(chip, NULL, buf.data, PAGE_SIZE, 0, 0, NULL);
+   tpm_buf_destroy(&buf);
if (rc <  0)
Why is this if() check not directly after the tpm_transmit_cmd() call 
that sets rc?  Is it correct you want to destroy buf regardless of the 
tpm_transmit_cmd() outcome?

return rc;
-
-   if (be16_to_cpu(cmd.header.out.tag) == TPM2_ST_NO_SESSIONS)
+   out = (struct tpm_output_header *)buf.data;


So buf has been destroyed, buf.data sill has something valid to assign 
to out?

+   if (be16_to_cpu(out->tag) == TPM2_ST_NO_SESSIONS)
chip->flags |= TPM_CHIP_FLAG_TPM2;
  
  	return 0;

Thanks,
Jay


Re: [PATCH 7/7] Documentation for Pmalloc

2018-02-26 Thread J Freyensee

[...]

On 2/26/18 7:39 AM, Igor Stoppa wrote:


On 24/02/18 02:26, J Freyensee wrote:


On 2/23/18 6:48 AM, Igor Stoppa wrote:

[...]


+- Before destroying a pool, all the memory allocated from it must be
+  released.

Is that true?  pmalloc_destroy_pool() has:

.
.
+    pmalloc_pool_set_protection(pool, false);
+    gen_pool_for_each_chunk(pool, pmalloc_chunk_free, NULL);
+    gen_pool_destroy(pool);
+    kfree(data);

which to me looks like is the opposite, the data (ie, "memory") is being
released first, then the pool is destroyed.

well, this is embarrassing ... yes I had this prototype code, because I
was wondering if it wouldn't make more sense to tear down the pool as
fast as possible. It slipped in, apparently.

I'm actually tempted to leave it in and fix the comment.


Sure, one or the other.



[...]


+
+- pmalloc does not provide locking support with respect to allocating vs
+  protecting an individual pool, for performance reasons.

What is the recommendation to using locks then, as the computing
real-world mainly operates in multi-threaded/process world?

How common are multi-threaded allocations of write-once memory?
Here we are talking exclusively about the part of the memory life-cycle
where it is allocated (from pmalloc).


Yah, that's true, good point.




Maybe show
an example of an issue that occur if locks aren't used and give a coding
example.

An example of how to use a mutex to access a shared resource? :-O

This part below, under your question, was supposed to be the answer :-(


+  It is recommended not to share the same pool between unrelated functions.
+  Should sharing be a necessity, the user of the shared pool is expected
+  to implement locking for that pool.


My bad, I was suggesting a code sample, if there was a simple code 
sample to provide (like 5-10 lines?).  If it's a lot of code to write, 
no bother.



[...]


+- pmalloc uses genalloc to optimize the use of the space it allocates
+  through vmalloc. Some more TLB entries will be used, however less than
+  in the case of using vmalloc directly. The exact number depends on the
+  size of each allocation request and possible slack.
+
+- Considering that not much data is supposed to be dynamically allocated
+  and then marked as read-only, it shouldn't be an issue that the address
+  range for pmalloc is limited, on 32-bit systems.

Why is 32-bit systems mentioned and not 64-bit?

Because, as written, on 32 bit system the vmalloc range is relatively
small, so one might wonder if there are enough addresses.


   Is there a problem with 64-bit here?

Quite the opposite.
I thought it was clear, but obviously it isn't, I'll reword this.


Sounds good, thank you,
Jay



-igor






Re: [PATCH 4/7] Protectable Memory

2018-02-26 Thread J Freyensee

inlined responses.


On 2/26/18 6:28 AM, Igor Stoppa wrote:


On 24/02/18 02:10, J Freyensee wrote:

On 2/23/18 6:48 AM, Igor Stoppa wrote:

[...]


+struct gen_pool *pmalloc_create_pool(const char *name,
+int min_alloc_order);

Same comments as earlier.  If this is new API with new code being
introduced into the kernel, can the variables be declared to avoid weird
problems?  Like min_alloc_order being a negative value makes little
sense (based on the description here), so can it be declared as size_t
or unsigned int?

in this case, yes, but I see it as different case


OK.



[...]


+ * * NULL  - either no memory available or
+ *   pool already read-only
+ */

I don't know if an errno value is being set, but setting a variable
somewhere using EROFS or ENOMEM would more accurate diagnose those two
NULL conditions.

I expect that the latter is highly unlikely to happen, because the user
of the API controls if the pool is locked or not.

I think it shouldn't come as a surprise to the one who locked the pool,
if the pool is locked.

If the pool is used with concurrent users, attention should be paid to
not lock it before ever user is happy (this is where the user of the API
has to provide own locking.)

Since the information if the pool is already protected is actually
present in the pmalloc_data structure associated with the pool, I was
tempted to make it available through the API, but that seemed wrong.

The user of the API should be very aware of the state of the pool, since
the user is the one who sets it. Why would it have to be read back?


OK, sounds good :-).




+void *pmalloc(struct gen_pool *pool, size_t size, gfp_t gfp);
+
+
+/**
+ * pzalloc() - zero-initialized version of pmalloc
+ * @pool: handle to the pool to be used for memory allocation
+ * @size: amount of memory (in bytes) requested
+ * @gfp: flags for page allocation
+ *
+ * Executes pmalloc, initializing the memory requested to 0,
+ * before returning the pointer to it.
+ *
+ * Return:
+ * * pointer to the memory requested   - success
+ * * NULL  - either no memory available or
+ *   pool already read-only
+ */

Same comment here, though that inline function below looks highly
optimized...

The same applies here.
I'm not very fond of the idea of returning the status elsewhere and in a
way that is not intrinsically connected with the action that has
determined the change of state.

AFAIK *alloc functions return either the memory requested or NULL.
I wonder how realistic this case is.


OK.



[...]


+   if (unlikely(!(pool && n && size)))

Has this code been run through sparse?

I use "make C=1 W=1"


OK.




I know one thing sparse looks at
is if NULL is being treated like a 0, and sparse does check cases when 0
is being used in place for NULL for pointer checks, and I'm wondering if
that line of code would pass.

It's a logical AND: wouldn't NULL translate to false, rather 0?
I can add an explicit check against NULL, it's probably more readable
too, but I don't think that the current construct treats NULL as 0.


If it passes sparse, I think it's fine.



[...]


+   if (unlikely(pool == NULL || s == NULL))

Here, the right check is being done, so at the very least, I would make
the last line I commented on the same as this one for code continuity.

ok

[...]


+   if (unlikely(!(pool && chunk)))

Please make this check the same as the last line I commented on,
especially since it's the same struct being checked.

yes


OK :-)




[...]


+   if (!name) {
+   WARN_ON(1);

??  Maybe the name check should be in WARN_ON()?

true :-(


OK.




[...]


+   if (unlikely(!req_size || !pool))

same unlikely() check problem mentioned before.

+   return -1;

Can we use an errno value instead for better diagnosibility?

+
+   data = pool->data;
+
+   if (data == NULL)
+   return -1;

Same here (ENOMEM or ENXIO comes to mind).

+
+   if (unlikely(data->protected)) {
+   WARN_ON(1);

Maybe re-write this with the check inside WARN_ON()?

+   return -1;

Same here, how about a different errno value for this case?

yes, to all of the above


OK.



[...]


+static void pmalloc_chunk_set_protection(struct gen_pool *pool,
+
+struct gen_pool_chunk *chunk,
+void *data)
+{
+   const bool *flag = data;
+   size_t chunk_size = chunk->end_addr + 1 - chunk->start_addr;
+   unsigned long pages = chunk_size / PAGE_SIZE;
+
+   BUG_ON(chunk_size & (PAGE_SIZE - 1));

Re-think WARN_ON() for BUG_ON()?  And also check chunk as well, as it's
being used below?

ok


+
+   if (*flag)
+   set

Re: [PATCH 2/7] genalloc: selftest

2018-02-26 Thread J Freyensee



On 2/26/18 4:11 AM, Igor Stoppa wrote:


On 24/02/18 00:42, J Freyensee wrote:

+   locations[action->location] = gen_pool_alloc(pool, action->size);
+   BUG_ON(!locations[action->location]);

Again, I'd think it through if you really want to use BUG_ON() or not:

https://lwn.net/Articles/13183/
https://lkml.org/lkml/2016/10/4/1

Is it acceptable to display only a WARNing, in case of risking damaging
a mounted filesystem?


That's a good question.  Based upon those articles, 'yes'.  But it seems 
like a 'darned-if-you-do, darned-if-you-don't' question as couldn't you 
also corrupt a mounted filesystem by crashing the kernel, yes/no?


If you really want a system crash, maybe just do a panic() like 
filesystems also use?


--
igor




Re: [PATCH 1/7] genalloc: track beginning of allocations

2018-02-26 Thread J Freyensee

My replies also inlined.

On 2/26/18 4:09 AM, Igor Stoppa wrote:

Hello,
and thanks for the reviews, my replies inlined below.

On 24/02/18 00:28, J Freyensee wrote:

some code snipping
.
.
.

+/**
+ * get_bitmap_entry() - extracts the specified entry from the bitmap
+ * @map: pointer to a bitmap
+ * @entry_index: the index of the desired entry in the bitmap
+ *
+ * Return: The requested bitmap.
+ */
+static inline unsigned long get_bitmap_entry(unsigned long *map,
+   int entry_index)
+{

Apologies if this has been mentioned before, but since this function is
expecting map to not be NULL, shouldn't something like:

WARN_ON(map == NULL);

be used to check the parameters for bad values?

TBH I do not know.
Actually I'd rather ask back: when is it preferred to do (and not do)
parameter sanitation?

I was thinking that doing it at API level is the right balance.
This function, for example, is not part of the API, it's used only
internally in this file.



I agree.  But some of the code looks API'like to me, partly because of 
all the function header documentation, which thank you for that, but I 
wasn't sure where you drew your "API line" where the checks would be.




Is it assuming too much that the function will be used correctly, inside
the module it belongs to?

And even at API level, I'd tend to say that if there are chances that
the data received is corrupted, then it should be sanitized, but otherwise,
why adding overhead?


It's good secure coding practice to check your parameters, you are 
adding code to a security module after all ;-).


If it's brand-new code entering the kernel, it's better to err on the 
side of having the extra checks and have a maintainer tell you to remove 
it than the other way around- especially since this code is part of the 
LSM solution.  What's worse- a tad bit of overhead catching a 
corner-case scenario that can be more easily fixed or something not 
caught that makes the kernel unstable?




Unless you expect some form of memory corruption. Is that what you have
in mind?

[...]


   static inline size_t chunk_size(const struct gen_pool_chunk *chunk)
   {

Same problem here, always expecting chunk to not but NULL.

What would be the case that makes it not NULL?
There are already tests in place when the memory is allocated.

If I really have to pick a place where to do the test, it's at API
level,


I agree, and if that is the case, I'm fine.


where the user of the API might fail to notice that the creation
of a pool failed and try to get memory from a non-existing pool.
That is the only scenario I can think of, where bogus data would be
received.


return chunk->end_addr - chunk->start_addr + 1;
   }
   
-static int set_bits_ll(unsigned long *addr, unsigned long mask_to_set)

+
+/**
+ * set_bits_ll() - based on value and mask, sets bits at address
+ * @addr: where to write
+ * @mask: filter to apply for the bits to alter
+ * @value: actual configuration of bits to store
+ *
+ * Return:
+ * * 0 - success
+ * * -EBUSY- otherwise
+ */
+static int set_bits_ll(unsigned long *addr,
+  unsigned long mask, unsigned long value)
   {
-   unsigned long val, nval;
+   unsigned long nval;
+   unsigned long present;
+   unsigned long target;
   
   	nval = *addr;

Same issue here with addr.

Again, I am more leaning toward believing that the user of the API might
forget to check for errors,


Same in agreement, so if that is the case, I'm ok.  It was a little hard 
to tell what is exactly your API is.  I'm used to reviewing kernel code 
where important API-like functions were heavily documented, and inner 
routines were not...so seeing the function documentation (which is a 
good thing :-)) made me think this was some sort of new API code I was 
looking at.



and pass a NULL pointer as pool, than to
believe something like this would happen.

This is an address obtained from data managed automatically by the library.

Can you please explain why you think it would be NULL?


Why would it be NULL?  I don't know, I'm not intimately familiar with 
the code; but I default to implementing code defensively.  But I'll turn 
the question around on you- why would it NOT be NULL?  Are you sure this 
will never be NULL?  Are you going to trust the library that it always 
provides a good address?  You should add to your function header 
documentation why addr will NOT be NULL.




I'll skip further similar comment.

[...]


+   /*
+* Prepare for writing the initial part of the allocation, from
+* starting entry, to the end of the UL bitmap element which
+* contains it. It might be larger than the actual allocation.
+*/
+   start_bit = ENTRIES_TO_BITS(start_entry);
+   end_bit = ENTRIES_TO_BITS(start_entry + nentries);
+   nbits = ENTRIES_TO_BIT

Re: [PATCH 7/7] Documentation for Pmalloc

2018-02-23 Thread J Freyensee



On 2/23/18 6:48 AM, Igor Stoppa wrote:

Detailed documentation about the protectable memory allocator.

Signed-off-by: Igor Stoppa 
---
  Documentation/core-api/index.rst   |   1 +
  Documentation/core-api/pmalloc.rst | 114 +
  2 files changed, 115 insertions(+)
  create mode 100644 Documentation/core-api/pmalloc.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index c670a8031786..8f5de42d6571 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -25,6 +25,7 @@ Core utilities
 genalloc
 errseq
 printk-formats
+   pmalloc
  
  Interfaces for kernel debugging

  ===
diff --git a/Documentation/core-api/pmalloc.rst 
b/Documentation/core-api/pmalloc.rst
new file mode 100644
index ..d9725870444e
--- /dev/null
+++ b/Documentation/core-api/pmalloc.rst
@@ -0,0 +1,114 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Protectable memory allocator
+
+
+Purpose
+---
+
+The pmalloc library is meant to provide R/O status to data that, for some
+reason, could neither be declared as constant, nor could it take advantage
+of the qualifier __ro_after_init, but is write-once and read-only in spirit.
+It protects data from both accidental and malicious overwrites.
+
+Example: A policy that is loaded from userspace.
+
+
+Concept
+---
+
+pmalloc builds on top of genalloc, using the same concept of memory pools.
+
+The value added by pmalloc is that now the memory contained in a pool can
+become R/O, for the rest of the life of the pool.
+
+Different kernel drivers and threads can use different pools, for finer
+control of what becomes R/O and when. And for improved lockless concurrency.
+
+
+Caveats
+---
+
+- Memory freed while a pool is not yet protected will be reused.
+
+- Once a pool is protected, it's not possible to allocate any more memory
+  from it.
+
+- Memory "freed" from a protected pool indicates that such memory is not
+  in use anymore by the requester; however, it will not become available
+  for further use, until the pool is destroyed.
+
+- Before destroying a pool, all the memory allocated from it must be
+  released.


Is that true?  pmalloc_destroy_pool() has:

.
.
+    pmalloc_pool_set_protection(pool, false);
+    gen_pool_for_each_chunk(pool, pmalloc_chunk_free, NULL);
+    gen_pool_destroy(pool);
+    kfree(data);

which to me looks like is the opposite, the data (ie, "memory") is being 
released first, then the pool is destroyed.





+
+- pmalloc does not provide locking support with respect to allocating vs
+  protecting an individual pool, for performance reasons.


What is the recommendation to using locks then, as the computing 
real-world mainly operates in multi-threaded/process world?  Maybe show 
an example of an issue that occur if locks aren't used and give a coding 
example.



+  It is recommended not to share the same pool between unrelated functions.
+  Should sharing be a necessity, the user of the shared pool is expected
+  to implement locking for that pool.
+
+- pmalloc uses genalloc to optimize the use of the space it allocates
+  through vmalloc. Some more TLB entries will be used, however less than
+  in the case of using vmalloc directly. The exact number depends on the
+  size of each allocation request and possible slack.
+
+- Considering that not much data is supposed to be dynamically allocated
+  and then marked as read-only, it shouldn't be an issue that the address
+  range for pmalloc is limited, on 32-bit systems.


Why is 32-bit systems mentioned and not 64-bit?  Is there a problem with 
64-bit here?


Thanks,
Jay


Re: [PATCH 4/7] Protectable Memory

2018-02-23 Thread J Freyensee

On 2/23/18 6:48 AM, Igor Stoppa wrote:


The MMU available in many systems running Linux can often provide R/O
protection to the memory pages it handles.

However, the MMU-based protection works efficiently only when said pages
contain exclusively data that will not need further modifications.

Statically allocated variables can be segregated into a dedicated
section, but this does not sit very well with dynamically allocated
ones.

Dynamic allocation does not provide, currently, any means for grouping
variables in memory pages that would contain exclusively data suitable
for conversion to read only access mode.

The allocator here provided (pmalloc - protectable memory allocator)
introduces the concept of pools of protectable memory.

A module can request a pool and then refer any allocation request to the
pool handler it has received.

Once all the chunks of memory associated to a specific pool are
initialized, the pool can be protected.

After this point, the pool can only be destroyed (it is up to the module
to avoid any further references to the memory from the pool, after
the destruction is invoked).

The latter case is mainly meant for releasing memory, when a module is
unloaded.

A module can have as many pools as needed, for example to support the
protection of data that is initialized in sufficiently distinct phases.

Since pmalloc memory is obtained from vmalloc, an attacker that has
gained access to the physical mapping, still has to identify where the
target of the attack is actually located.

At the same time, being also based on genalloc, pmalloc does not
generate as much trashing of the TLB as it would be caused by using
directly only vmalloc.

Signed-off-by: Igor Stoppa 
---
  include/linux/genalloc.h |   3 +
  include/linux/pmalloc.h  | 242 +++
  include/linux/vmalloc.h  |   1 +
  lib/genalloc.c   |  27 +++
  mm/Kconfig   |   6 +
  mm/Makefile  |   1 +
  mm/pmalloc.c | 499 +++
  mm/usercopy.c|  33 
  8 files changed, 812 insertions(+)
  create mode 100644 include/linux/pmalloc.h
  create mode 100644 mm/pmalloc.c

diff --git a/include/linux/genalloc.h b/include/linux/genalloc.h
index dcaa33e74b1c..b6c4cea9fbd8 100644
--- a/include/linux/genalloc.h
+++ b/include/linux/genalloc.h
@@ -121,6 +121,9 @@ extern unsigned long gen_pool_alloc_algo(struct gen_pool *, 
size_t,
  extern void *gen_pool_dma_alloc(struct gen_pool *pool, size_t size,
dma_addr_t *dma);
  extern void gen_pool_free(struct gen_pool *, unsigned long, size_t);
+
+extern void gen_pool_flush_chunk(struct gen_pool *pool,
+struct gen_pool_chunk *chunk);
  extern void gen_pool_for_each_chunk(struct gen_pool *,
void (*)(struct gen_pool *, struct gen_pool_chunk *, void *), void *);
  extern size_t gen_pool_avail(struct gen_pool *);
diff --git a/include/linux/pmalloc.h b/include/linux/pmalloc.h
new file mode 100644
index ..afc2068d5545
--- /dev/null
+++ b/include/linux/pmalloc.h
@@ -0,0 +1,242 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * pmalloc.h: Header for Protectable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+#ifndef _LINUX_PMALLOC_H
+#define _LINUX_PMALLOC_H
+
+
+#include 
+#include 
+
+#define PMALLOC_DEFAULT_ALLOC_ORDER (-1)
+
+/*
+ * Library for dynamic allocation of pools of memory that can be,
+ * after initialization, marked as read-only.
+ *
+ * This is intended to complement __read_only_after_init, for those cases
+ * where either it is not possible to know the initialization value before
+ * init is completed, or the amount of data is variable and can be
+ * determined only at run-time.
+ *
+ * ***WARNING***
+ * The user of the API is expected to synchronize:
+ * 1) allocation,
+ * 2) writes to the allocated memory,
+ * 3) write protection of the pool,
+ * 4) freeing of the allocated memory, and
+ * 5) destruction of the pool.
+ *
+ * For a non-threaded scenario, this type of locking is not even required.
+ *
+ * Even if the library were to provide support for locking, point 2)
+ * would still depend on the user taking the lock.
+ */
+
+
+/**
+ * pmalloc_create_pool() - create a new protectable memory pool
+ * @name: the name of the pool, enforced to be unique
+ * @min_alloc_order: log2 of the minimum allocation size obtainable
+ *   from the pool
+ *
+ * Creates a new (empty) memory pool for allocation of protectable
+ * memory. Memory will be allocated upon request (through pmalloc).
+ *
+ * Return:
+ * * pointer to the new pool   - success
+ * * NULL  - error
+ */
+struct gen_pool *pmalloc_create_pool(const char *name,
+int min_alloc_order);


Same comments as earlier.  If this is new API with new code being 
introduced into the kernel, can the variables be declared to avoid weird 
proble

Re: [PATCH 2/7] genalloc: selftest

2018-02-23 Thread J Freyensee



+   locations[action->location] = gen_pool_alloc(pool, action->size);
+   BUG_ON(!locations[action->location]);


Again, I'd think it through if you really want to use BUG_ON() or not:

https://lwn.net/Articles/13183/
https://lkml.org/lkml/2016/10/4/1

Thanks,
Jay



Re: [PATCH 1/7] genalloc: track beginning of allocations

2018-02-23 Thread J Freyensee

some code snipping
.
.
.

+/**
+ * get_bitmap_entry() - extracts the specified entry from the bitmap
+ * @map: pointer to a bitmap
+ * @entry_index: the index of the desired entry in the bitmap
+ *
+ * Return: The requested bitmap.
+ */
+static inline unsigned long get_bitmap_entry(unsigned long *map,
+   int entry_index)
+{


Apologies if this has been mentioned before, but since this function is 
expecting map to not be NULL, shouldn't something like:


WARN_ON(map == NULL);

be used to check the parameters for bad values?

BTW, excellent commenting :-).

+   return (map[ENTRIES_DIV_LONGS(entry_index)] >>
+   ENTRIES_TO_BITS(entry_index % ENTRIES_PER_LONG)) &
+   ENTRY_MASK;
+}
+
+
+/**
+ * mem_to_units() - convert references to memory into orders of allocation
+ * @size: amount in bytes
+ * @order: power of 2 represented by each entry in the bitmap
+ *
+ * Return: the number of units representing the size.
+ */
+static inline unsigned long mem_to_units(unsigned long size,
+unsigned long order)
+{
+   return (size + (1UL << order) - 1) >> order;
+}
+
+/**
+ * chunk_size() - dimension of a chunk of memory, in bytes
+ * @chunk: pointer to the struct describing the chunk
+ *
+ * Return: The size of the chunk, in bytes.
+ */
  static inline size_t chunk_size(const struct gen_pool_chunk *chunk)
  {

Same problem here, always expecting chunk to not but NULL.


return chunk->end_addr - chunk->start_addr + 1;
  }
  
-static int set_bits_ll(unsigned long *addr, unsigned long mask_to_set)

+
+/**
+ * set_bits_ll() - based on value and mask, sets bits at address
+ * @addr: where to write
+ * @mask: filter to apply for the bits to alter
+ * @value: actual configuration of bits to store
+ *
+ * Return:
+ * * 0 - success
+ * * -EBUSY- otherwise
+ */
+static int set_bits_ll(unsigned long *addr,
+  unsigned long mask, unsigned long value)
  {
-   unsigned long val, nval;
+   unsigned long nval;
+   unsigned long present;
+   unsigned long target;
  
  	nval = *addr;


Same issue here with addr.

do {
-   val = nval;
-   if (val & mask_to_set)
+   present = nval;
+   if (present & mask)
return -EBUSY;
+   target =  present | value;
cpu_relax();
-   } while ((nval = cmpxchg(addr, val, val | mask_to_set)) != val);
-
+   } while ((nval = cmpxchg(addr, present, target)) != target);
return 0;
  }
  
-static int clear_bits_ll(unsigned long *addr, unsigned long mask_to_clear)

+
+/**
+ * clear_bits_ll() - based on value and mask, clears bits at address
+ * @addr: where to write
+ * @mask: filter to apply for the bits to alter
+ * @value: actual configuration of bits to clear
+ *
+ * Return:
+ * * 0 - success
+ * * -EBUSY- otherwise
+ */
+static int clear_bits_ll(unsigned long *addr,
+unsigned long mask, unsigned long value)


Dido here for addr too.

  {
-   unsigned long val, nval;
+   unsigned long nval;
+   unsigned long present;
+   unsigned long target;
  
  	nval = *addr;

+   present = nval;
+   if (unlikely((present & mask) ^ value))
+   return -EBUSY;
do {
-   val = nval;
-   if ((val & mask_to_clear) != mask_to_clear)
+   present = nval;
+   if (unlikely((present & mask) ^ value))
return -EBUSY;
+   target =  present & ~mask;
cpu_relax();
-   } while ((nval = cmpxchg(addr, val, val & ~mask_to_clear)) != val);
-
+   } while ((nval = cmpxchg(addr, present, target)) != target);
return 0;
  }
  
-/*

- * bitmap_set_ll - set the specified number of bits at the specified position
+
+/**
+ * get_boundary() - verifies address, then measure length.
   * @map: pointer to a bitmap
- * @start: a bit position in @map
- * @nr: number of bits to set
+ * @start_entry: the index of the first entry in the bitmap
+ * @nentries: number of entries to alter
   *
- * Set @nr bits start from @start in @map lock-lessly. Several users
- * can set/clear the same bitmap simultaneously without lock. If two
- * users set the same bit, one user will return remain bits, otherwise
- * return 0.
+ * Return:
+ * * length of an allocation   - success
+ * * -EINVAL   - invalid parameters
   */
-static int bitmap_set_ll(unsigned long *map, int start, int nr)
+static int get_boundary(unsigned long *map, int start_entry, int nentries)

Same problem for map.


  {
-   unsigned long *p = map + BIT_WORD(start);
-   const int size = start + nr;
-   int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG);
-   unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);
-
-   while (nr - bits_to_set >= 0) {
-   if (set_bits_ll(p, mask_t

Re: [PATCH v3 01/13] mpt3sas: Update MPI Header

2017-08-08 Thread J Freyensee


Looks like your header has a white space error:

Applying: mpt3sas: Update MPI Header
.git/rebase-apply/patch:1452: new blank line at EOF.
+

Also, FYI, this project has a lot of sparse errors that look like existed
before your patchset.  As your patchset touches a few of the files that
have sparse warnings (such as mpt3sas_base.c, mpt3sas_scsih.c, etc), maybe
you'll want to investigate fixing these things?


[mainline-linux]$ make C=1
  CHK include/config/kernel.release
  CHK include/generated/uapi/linux/version.h
  CHK include/generated/utsrelease.h
  CHK include/generated/bounds.h
  CHK include/generated/timeconst.h
  CHK include/generated/asm-offsets.h
  CALLscripts/checksyscalls.sh
  CHK scripts/mod/devicetable-offsets.h
  CHK include/generated/compile.h
  AR  drivers/scsi/mpt3sas/built-in.o
  CHECK   drivers/scsi/mpt3sas/mpt3sas_base.c
drivers/scsi/mpt3sas/mpt3sas_base.c:861:42: warning: incorrect type in
assignment (different base types)
drivers/scsi/mpt3sas/mpt3sas_base.c:861:42:expected unsigned short
[unsigned] [usertype] Event
drivers/scsi/mpt3sas/mpt3sas_base.c:861:42:got restricted __le16
[usertype] Event
drivers/scsi/mpt3sas/mpt3sas_base.c:862:49: warning: incorrect type in
assignment (different base types)
drivers/scsi/mpt3sas/mpt3sas_base.c:862:49:expected unsigned int
[unsigned] [usertype] EventContext
drivers/scsi/mpt3sas/mpt3sas_base.c:862:49:got restricted __le32
[usertype] EventContext
drivers/scsi/mpt3sas/mpt3sas_base.c:1080:64: warning: incorrect type in
argument 2 (different address spaces)
drivers/scsi/mpt3sas/mpt3sas_base.c:1080:64:expected void volatile
[noderef] *addr
drivers/scsi/mpt3sas/mpt3sas_base.c:1080:64:got unsigned long long
[usertype] *
drivers/scsi/mpt3sas/mpt3sas_base.c:1129:52: warning: incorrect type in
argument 2 (different address spaces)
drivers/scsi/mpt3sas/mpt3sas_base.c:1129:52:expected void volatile
[noderef] *addr
drivers/scsi/mpt3sas/mpt3sas_base.c:1129:52:got unsigned long long
[usertype] *
drivers/scsi/mpt3sas/mpt3sas_base.c:1519:36: warning: incorrect type in
assignment (different base types)
drivers/scsi/mpt3sas/mpt3sas_base.c:1519:36:expected unsigned long long
[unsigned] [long] [long long] [usertype] 
drivers/scsi/mpt3sas/mpt3sas_base.c:1519:36:got restricted __le64
[usertype] 
drivers/scsi/mpt3sas/mpt3sas_base.c:1532:37: warning: incorrect type in
assignment (different base types)
drivers/scsi/mpt3sas/mpt3sas_base.c:1532:37:expected unsigned long long
[unsigned] [long] [long long] [usertype] 
drivers/scsi/mpt3sas/mpt3sas_base.c:1532:37:got restricted __le64
[usertype] 
drivers/scsi/mpt3sas/mpt3sas_base.c:1552:45: warning: incorrect type in
assignment (different base types)
drivers/scsi/mpt3sas/mpt3sas_base.c:1552:45:expected unsigned long long
[unsigned] [long] [long long] [usertype] 
drivers/scsi/mpt3sas/mpt3sas_base.c:1552:45:got restricted __le64
[usertype] 
drivers/scsi/mpt3sas/mpt3sas_base.c:1565:45: warning: incorrect type in
assignment (different base types)
drivers/scsi/mpt3sas/mpt3sas_base.c:1565:45:expected unsigned long long
[unsigned] [long] [long long] [usertype] 
drivers/scsi/mpt3sas/mpt3sas_base.c:1565:45:got restricted __le64
[usertype] 
drivers/scsi/mpt3sas/mpt3sas_base.c:1575:36: warning: incorrect type in
assignment (different base types)
drivers/scsi/mpt3sas/mpt3sas_base.c:1575:36:expected unsigned long long
[unsigned] [long] [long long] [usertype] 
drivers/scsi/mpt3sas/mpt3sas_base.c:1575:36:got restricted __le64
[usertype] 
drivers/scsi/mpt3sas/mpt3sas_base.c:1594:5: warning: symbol 'base_mod64'
was not declared. Should it be static?
drivers/scsi/mpt3sas/mpt3sas_base.c:1717:36: warning: incorrect type in
assignment (different base types)
drivers/scsi/mpt3sas/mpt3sas_base.c:1717:36:expected unsigned long long
[unsigned] [long] [long long] [usertype] 
drivers/scsi/mpt3sas/mpt3sas_base.c:1717:36:got restricted __le64
[usertype] 
drivers/scsi/mpt3sas/mpt3sas_base.c:1722:28: warning: incorrect type in
assignment (different base types)
drivers/scsi/mpt3sas/mpt3sas_base.c:1722:28:expected unsigned long long
[unsigned] [long] [long long] [usertype] 
drivers/scsi/mpt3sas/mpt3sas_base.c:1722:28:got restricted __le64
[usertype] 
drivers/scsi/mpt3sas/mpt3sas_base.c:1620:1: warning: symbol
'base_make_prp_nvme' was not declared. Should it be static?
drivers/scsi/mpt3sas/mpt3sas_base.c:1761:21: warning: incorrect type in
assignment (different base types)
drivers/scsi/mpt3sas/mpt3sas_base.c:1761:21:expected unsigned int
[unsigned] [usertype] data_length
drivers/scsi/mpt3sas/mpt3sas_base.c:1761:21:got restricted __le32
[usertype] 
drivers/scsi/mpt3sas/mpt3sas_base.c:2771:32: warning: cast removes address
space of expression
drivers/scsi/mpt3sas/mpt3sas_base.c:3064:16: warning: incorrect type in
argument 1 (different base types)
drivers/scsi/mpt3sas/mpt3sas_base.c:3064:16:expected unsigned long
[unsigned] 

Re: [PATCH v4] add u64 number parser

2016-09-26 Thread J Freyensee
On Sun, 2016-09-25 at 09:14 -0700, James Smart wrote:
> add u64 number parser
> 
> Reverted back to version 2 of the patch.  This adds the interface
> using existing logic. Comments from the prior reviewers to move to
> kasprintf were rejected by Linus.
> 
> Signed-off-by: James Smart 

Acked-by: Jay Freyensee 



Re: [PATCH] lightnvm: add missing \n to end of dev_err message

2016-09-26 Thread J Freyensee
On Sun, 2016-09-25 at 14:56 -0700, Colin King wrote:
> From: Colin Ian King 
> 
> Trival fix, dev_err message is missing a \n, so add it.
> 
> Signed-off-by: Colin Ian King 
> 
Reviewed-by: Jay Freyensee 



Re: [PATCH] nvme: add missing \n to end of dev_warn message

2016-09-26 Thread J Freyensee
On Sun, 2016-09-25 at 15:04 -0700, Colin King wrote:
> From: Colin Ian King 
> 
> Trival fix, dev_warn message is missing a \n, so add it.
> 
> Signed-off-by: Colin Ian King 

Reviewed-by: Jay Freyensee 



Re: [PATCH] nvme-rdma: add missing \n to end of dev_err message

2016-09-26 Thread J Freyensee
On Sun, 2016-09-25 at 15:01 -0700, Colin King wrote:
> From: Colin Ian King 
> 
> Trival fix, dev_err message is missing a \n, so add it. Also
> break line as it was over 80 chars wide.
> 
> Signed-off-by: Colin Ian King 

Reviewed-by: Jay Freyensee 

> ---
>  drivers/nvme/host/rdma.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 69e1fb7..3f32995 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1227,7 +1227,8 @@ static int nvme_rdma_conn_rejected(struct
> nvme_rdma_queue *queue,
>   (struct nvme_rdma_cm_rej *)ev-
> >param.conn.private_data;
>  
>   dev_err(queue->ctrl->ctrl.device,
> - "Connect rejected, status %d.",
> le16_to_cpu(rej->sts));
> + "Connect rejected, status %d.\n",
> + le16_to_cpu(rej->sts));
>   /* XXX: Think of something clever to do here... */
>   } else {
>   dev_err(queue->ctrl->ctrl.device,


Re: [PATCH v4 0/3] nvme power saving

2016-09-22 Thread J Freyensee
On Thu, 2016-09-22 at 14:43 -0600, Jens Axboe wrote:
> On 09/22/2016 02:11 PM, Andy Lutomirski wrote:
> > 
> > On Thu, Sep 22, 2016 at 7:23 AM, Jens Axboe  wrote:
> > > 
> > > 
> > > On 09/16/2016 12:16 PM, Andy Lutomirski wrote:
> > > > 
> > > > 
> > > > Hi all-
> > > > 
> > > > Here's v4 of the APST patch set.  The biggest bikesheddable
> > > > thing (I
> > > > think) is the scaling factor.  I currently have it hardcoded so
> > > > that
> > > > we wait 50x the total latency before entering a power saving
> > > > state.
> > > > On my Samsung 950, this means we enter state 3 (70mW, 0.5ms
> > > > entry
> > > > latency, 5ms exit latency) after 275ms and state 4 (5mW, 2ms
> > > > entry
> > > > latency, 22ms exit latency) after 1200ms.  I have the default
> > > > max
> > > > latency set to 25ms.
> > > > 
> > > > FWIW, in practice, the latency this introduces seems to be well
> > > > under 22ms, but my benchmark is a bit silly and I might have
> > > > measured it wrong.  I certainly haven't observed a slowdown
> > > > just
> > > > using my laptop.
> > > > 
> > > > This time around, I changed the names of parameters after Jay
> > > > Frayensee got confused by the first try.  Now they are:
> > > > 
> > > >  - ps_max_latency_us in sysfs: actually controls it.
> > > >  - nvme_core.default_ps_max_latency_us: sets the default.
> > > > 
> > > > Yeah, they're mouthfuls, but they should be clearer now.
> > > 
> > > 
> > > The only thing I don't like about this is the fact that's it's a
> > > driver private thing. Similar to ALPM on SATA, it's yet another
> > > knob that needs to be set. It we put it somewhere generic, then
> > > at least we could potentially use it in a generic fashion.
> > 
> > Agreed.  I'm hoping to hear back from Rafael soon about the
> > dev_pm_qos
> > thing.
> > 
> > > 
> > > 
> > > Additionally, it should not be on by default.
> > 
> > I think I disagree with this.  Since we don't have anything like
> > laptop-mode AFAIK, I think we do want it on by default.  For the
> > server workloads that want to consume more idle power for faster
> > response when idle, I think the servers should be willing to make
> > this
> > change, just like they need to disable overly deep C states, etc.
> > (Admittedly, unifying the configuration would be nice.)
> 
> I can see two reasons why we don't want it the default:
> 
> 1) Changes like this has a tendency to cause issues on various types
> of
> hardware. How many NVMe devices have you tested this on? ALPM on SATA
> had a lot of initial problems, where slowed down some SSDs unberably.

...and some SSDs don't even support this feature yet, so the number of
different NVMe devices available to test initially will most likely be
small (like the Fultondales I have, all I could check is to see if the
code broke anything if the device did not have this power-save
feature).

I agree with Jens, makes a lot of sense to start with this feature
'off'.

To 'advertise' the feature, maybe make the feature a new selection in
Kconfig?  Example, initially make it "EXPERIMENTAL", and later when
more devices implement this feature it can be integrated more tightly
into the NVMe solution and default to on.



Re: [PATCH v4 0/3] nvme power saving

2016-09-16 Thread J Freyensee
On Fri, 2016-09-16 at 11:16 -0700, Andy Lutomirski wrote:
> Hi all-
> 
> Here's v4 of the APST patch set.  The biggest bikesheddable thing (I
> think) is the scaling factor.  I currently have it hardcoded so that
> we wait 50x the total latency before entering a power saving state.
> On my Samsung 950, this means we enter state 3 (70mW, 0.5ms entry
> latency, 5ms exit latency) after 275ms and state 4 (5mW, 2ms entry
> latency, 22ms exit latency) after 1200ms.  I have the default max
> latency set to 25ms.
> 
> FWIW, in practice, the latency this introduces seems to be well
> under 22ms, but my benchmark is a bit silly and I might have
> measured it wrong.  I certainly haven't observed a slowdown just
> using my laptop.
> 
> This time around, I changed the names of parameters after Jay
> Frayensee got confused by the first try.  Now they are:
> 
>  - ps_max_latency_us in sysfs: actually controls it.
>  - nvme_core.default_ps_max_latency_us: sets the default.
> 
> Yeah, they're mouthfuls, but they should be clearer now.
> 

I took the patches and applied them to one of my NVMe fabric hosts on
my NVMe-over-Fabrics setup.  Basically, it doesn't test much other than
Andy's explanation that "ps_max_latency_us" does not appear in any of
/sys/block/nvmeXnY sysfs nodes (I have 7) so seems good to me on this
front.

Tested-by: Jay Freyensee 
[jpf: defaults benign to NVMe-over-Fabrics]


Re: [PATCH v4 1/3] nvme/scsi: Remove power management support

2016-09-16 Thread J Freyensee
On Fri, 2016-09-16 at 11:16 -0700, Andy Lutomirski wrote:
> As far as I can tell, there is basically nothing correct about this
> code.  It misinterprets npss (off-by-one).  It hardcodes a bunch of
> power states, which is nonsense, because they're all just indices
> into a table that software needs to parse.  It completely ignores
> the distinction between operational and non-operational states.
> And, until 4.8, if all of the above magically succeeded, it would
> dereference a NULL pointer and OOPS.
> 
> Since this code appears to be useless, just delete it.

Acked-by: Jay Freyensee 

> 
> Signed-off-by: Andy Lutomirski 
> ---
> 


Re: [PATCH v3 1/3] nvme/scsi: Remove power management support

2016-09-16 Thread J Freyensee
On Fri, 2016-09-16 at 10:33 +0200, Christoph Hellwig wrote:
> On Thu, Sep 15, 2016 at 10:24:19PM -0700, Andy Lutomirski wrote:
> > 
> > As far as I can tell, there is basically nothing correct about this
> > code.  It misinterprets npss (off-by-one).  It hardcodes a bunch of
> > power states, which is nonsense, because they're all just indices
> > into a table that software needs to parse.  It completely ignores
> > the distinction between operational and non-operational states.
> > And, until 4.8, if all of the above magically succeeded, it would
> > dereference a NULL pointer and OOPS.
> > 
> > Since this code appears to be useless, just delete it.
> 
> Yes, please!
> 
> Acked-by: Christoph Hellwig 

Acked-by: Jay Freyensee 




Re: [PATCH v2 3/3] nvme: Enable autonomous power state transitions

2016-09-02 Thread J Freyensee
On Fri, 2016-09-02 at 14:43 -0700, Andy Lutomirski wrote:
> On Fri, Sep 2, 2016 at 2:15 PM, J Freyensee
>  wrote:
> > 
> > On Tue, 2016-08-30 at 14:59 -0700, Andy Lutomirski wrote:
> > > 
> > > NVME devices can advertise multiple power states.  These states
> > > can
> > > be either "operational" (the device is fully functional but
> > > possibly
> > > slow) or "non-operational" (the device is asleep until woken up).
> > > Some devices can automatically enter a non-operational state when
> > > idle for a specified amount of time and then automatically wake
> > > back
> > > up when needed.
> > > 
> > > The hardware configuration is a table.  For each state, an entry
> > > in
> > > the table indicates the next deeper non-operational state, if
> > > any,
> > > to autonomously transition to and the idle time required before
> > > transitioning.
> > > 
> > > This patch teaches the driver to program APST so that each
> > > successive non-operational state will be entered after an idle
> > > time
> > > equal to 100% of the total latency (entry plus exit) associated
> > > with
> > > that state.  A sysfs attribute 'apst_max_latency_us' gives the
> > > maximum acceptable latency in ns; non-operational states with
> > > total
> > > latency greater than this value will not be used.  As a special
> > > case, apst_max_latency_us=0 will disable APST entirely.
> > 
> > May I ask a dumb question?
> > 
> > How does this work with multiple NVMe devices plugged into a
> > system?  I
> > would have thought we'd want one apst_max_latency_us entry per NVMe
> > controller for individual control of each device?  I have two
> > Fultondale-class devices plugged into a system I tried these
> > patches on
> > (the 4.8-rc4 kernel) and I'm not sure how the single
> > /sys/module/nvme_core/parameters/apst_max_latency_us would work per
> > my
> > 2 devices (and the value is using the default 25000).
> > 
> 
> Ah, I faked you out :(
> 
> The module parameter (nvme_core/parameters/apst_max_latency_us) just
> sets the default for newly probed devices.  The actual setting is in
> /sys/devices/whatever (symlinked from /sys/block/nvme0n1/devices, for
> example).  Perhaps I should name the former
> default_apst_max_latency_us.

It would certainly be more describable to understand what the entry is
for, but then the name is also getting longer.

Just "default_apst_latency_us"? Or maybe it's probably fine as-is.



Re: [PATCH v2 3/3] nvme: Enable autonomous power state transitions

2016-09-02 Thread J Freyensee
On Tue, 2016-08-30 at 14:59 -0700, Andy Lutomirski wrote:
> NVME devices can advertise multiple power states.  These states can
> be either "operational" (the device is fully functional but possibly
> slow) or "non-operational" (the device is asleep until woken up).
> Some devices can automatically enter a non-operational state when
> idle for a specified amount of time and then automatically wake back
> up when needed.
> 
> The hardware configuration is a table.  For each state, an entry in
> the table indicates the next deeper non-operational state, if any,
> to autonomously transition to and the idle time required before
> transitioning.
> 
> This patch teaches the driver to program APST so that each
> successive non-operational state will be entered after an idle time
> equal to 100% of the total latency (entry plus exit) associated with
> that state.  A sysfs attribute 'apst_max_latency_us' gives the
> maximum acceptable latency in ns; non-operational states with total
> latency greater than this value will not be used.  As a special
> case, apst_max_latency_us=0 will disable APST entirely.

May I ask a dumb question?

How does this work with multiple NVMe devices plugged into a system?  I
would have thought we'd want one apst_max_latency_us entry per NVMe
controller for individual control of each device?  I have two
Fultondale-class devices plugged into a system I tried these patches on
(the 4.8-rc4 kernel) and I'm not sure how the single
/sys/module/nvme_core/parameters/apst_max_latency_us would work per my
2 devices (and the value is using the default 25000).

Now from 
nvme id-ctrl /dev/nvme0 (or nvme1)

NVME Identify Controller:
vid : 0x8086
ssvid   : 0x8086
sn  : CVFT41720018800HGN  
mn  : INTEL SSDPE2MD800G4 
fr  : 8DV10151
rab : 0
ieee: 5cd2e4
cmic: 0
mdts: 5
cntlid  : 0
ver : 0
rtd3r   : 0
rtd3e   : 0
oaes: 0
oacs: 0x6
acl : 3
aerl: 3
frmw: 0x2
lpa : 0x2
elpe: 63
npss: 0
avscc   : 0
apsta   : 0 <-

the Fultondales don't support apst.

But I'd still like to ask the dumb question :-).

> 
> On hardware without APST support, apst_max_latency_us will not be
> exposed in sysfs.

Not sure that is true, as from what I see so far, Fultondales don't
support apst yet I still see:

[root@nvme-fabric-host01 nvme-cli]# cat
/sys/module/nvme_core/parameters/apst_max_latency_us
25000

> 
> In theory, the device can expose "default" APST table, but this
> doesn't seem to function correctly on my device (Samsung 950), nor
> does it seem particularly useful.  There is also an optional
> mechanism by which a configuration can be "saved" so it will be
> automatically loaded on reset.  This can be configured from
> userspace, but it doesn't seem useful to support in the driver.
> 
> On my laptop, enabling APST seems to save nearly 1W.
> 
> The hardware tables can be decoded in userspace with nvme-cli.
> 'nvme id-ctrl /dev/nvmeN' will show the power state table and
> 'nvme get-feature -f 0x0c -H /dev/nvme0' will show the current APST
> configuration.

nvme get-feature -f 0x0c -H /dev/nvme0

isn't working for me, I get a:

[root@nvme-fabric-host01 nvme-cli]# ./nvme get-feature -f 0x0c -H
/dev/nvme0
NVMe Status:INVALID_FIELD(2)

I don't have the time right now to investigate further, but I'll assume
it's because I have Fultondales (though I would have thought this patch
would have provided enough code for the latest nvme-cli code to do this
new get-feature as-is).

Jay



Re: [PATCH 3/3] nvme: Enable autonomous power state transitions

2016-09-02 Thread J Freyensee

> > > 
> > > 
> > > > 
> > > > + /*
> > > > +  * By default, allow up to 25ms of APST-induced
> > > > latency.  This will
> > > > +  * have no effect on non-APST supporting controllers
> > > > (i.e.
> > > > any
> > > > +  * controller with APSTA == 0).
> > > > +  */
> > > > + ctrl->apst_max_latency_ns = 2500;
> > > 
> > > Is it possible to make that a #define please?
> > 
> > I'll make it a module parameter as Keith suggested.
> 
> One question, though: should we call this and the sysfs parameter
> apst_max_latency or should it be more generically
> power_save_max_latency?  The idea is that we might want to support
> non-automonous transitions some day or even runtime D3.  Or maybe
> those should be separately configured if used.

I read the spec and reviewed your latest patchset.  Personally for me I
like having the field names from the NVMe spec in the names of the
Linux implementation because it makes it easier to find and relate the
two.  So apst_max_latency makes more sense to me, as this is a
'apst'(e/a) NVMe feature.

> 
> --Andy
> 
> ___
> Linux-nvme mailing list
> linux-n...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme


Re: [PATCH 3/3] nvme: Enable autonomous power state transitions

2016-08-29 Thread J Freyensee
On Mon, 2016-08-29 at 02:25 -0700, Andy Lutomirski wrote:
> NVME devices can advertise multiple power states.  These states can
> be either "operational" (the device is fully functional but possibly
> slow) or "non-operational" (the device is asleep until woken up).
> Some devices can automatically enter a non-operational state when
> idle for a specified amount of time and then automatically wake back
> up when needed.
> 
> The hardware configuration is a table.  For each state, an entry in
> the table indicates the next deeper non-operational state, if any,
> to autonomously transition to and the idle time required before
> transitioning.
> 
> This patch teaches the driver to program APST so that each
> successive non-operational state will be entered after an idle time
> equal to 100% of the total latency (entry plus exit) associated with
> that state.  A sysfs attribute 'apst_max_latency_ns' gives the
> maximum acceptable latency in ns; non-operational states with total
> latency greater than this value will not be used.  As a special
> case, apst_max_latency_ns=0 will disable APST entirely.
> 
> On hardware without APST support, apst_max_latency_ns will not be
> exposed in sysfs.
> 
> In theory, the device can expose "default" APST table, but this
> doesn't seem to function correctly on my device (Samsung 950), nor
> does it seem particularly useful.  There is also an optional
> mechanism by which a configuration can be "saved" so it will be
> automatically loaded on reset.  This can be configured from
> userspace, but it doesn't seem useful to support in the driver.
> 
> On my laptop, enabling APST seems to save nearly 1W.
> 
> The hardware tables can be decoded in userspace with nvme-cli.
> 'nvme id-ctrl /dev/nvmeN' will show the power state table and
> 'nvme get-feature -f 0x0c -H /dev/nvme0' will show the current APST
> configuration.
> 
> Signed-off-by: Andy Lutomirski 
> ---
>  drivers/nvme/host/core.c | 167
> +++
>  drivers/nvme/host/nvme.h |   6 ++
>  include/linux/nvme.h |   6 ++
>  3 files changed, 179 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 3f7561ab54dc..042137ad2437 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1223,6 +1223,98 @@ static void nvme_set_queue_limits(struct
> nvme_ctrl *ctrl,
>   blk_queue_write_cache(q, vwc, vwc);
>  }
>  
> +static void nvme_configure_apst(struct nvme_ctrl *ctrl)
> +{
> + /*
> +  * APST (Autonomous Power State Transition) lets us program
> a
> +  * table of power state transitions that the controller will
> +  * perform automatically.  We configure it with a simple
> +  * heuristic: we are willing to spend at most 2% of the time
> +  * transitioning between power states.  Therefore, when
> running
> +  * in any given state, we will enter the next lower-power
> +  * non-operational state after waiting 100 * (enlat + exlat)
> +  * microseconds, as long as that state's total latency is
> under
> +  * the requested maximum latency.
> +  *
> +  * We will not autonomously enter any non-operational state
> for
> +  * which the total latency exceeds
> apst_max_latency_ns.  Users
> +  * can set apst_max_latency_ns to zero to turn off APST.
> +  */
> +
> + unsigned apste;
> + struct nvme_feat_auto_pst *table;
> + int ret;
> +
> + if (!ctrl->apsta)
> + return; /* APST isn't supported. */
> +
> + if (ctrl->npss > 31) {
> + dev_warn(ctrl->device, "NPSS is invalid; disabling
> APST\n");

Quick question.  A little bit below in a later if() block, apste is set
to 0 to turn off APST, which is to be used later in a
nvme_set_features() call to actually turn it off.  You wouldn't want to
also set apste to zero too and call a nvme_set_features() to "disable
APST"?

I guess I'm a little confused on the error statement, "disabling APST",
when it doesn't seem like anything is being done to actually disable
APST, it's just more of an invalid state retrieved from the HW.
 

> + return;
> + }
> +
> + table = kzalloc(sizeof(*table), GFP_KERNEL);
> + if (!table)
> + return;
> +
> + if (ctrl->apst_max_latency_ns == 0) {
> + /* Turn off APST. */
> + apste = 0;
> + } else {
> + __le64 target = cpu_to_le64(0);
> + int state;
> +
> + /*
> +  * Walk through all states from lowest- to highest-
> power.
> +  * According to the spec, lower-numbered states use
> more
> +  * power.  NPSS, despite the name, is the index of
> the
> +  * lowest-power state, not the number of states.
> +  */
> + for (state = (int)ctrl->npss; state >= 0; state--) {
> + u64 total_latency_us, transition_ms;
> +
> + if (target)
> + table->entries

Re: [PATCH v3 RFC 2/2] nvme: improve performance for virtual NVMe devices

2016-08-16 Thread J Freyensee
On Mon, 2016-08-15 at 22:41 -0300, Helen Koike wrote:


>  
> +struct nvme_doorbell_memory {
> + __u8opcode;
> + __u8flags;
> + __u16   command_id;
> + __u32   rsvd1[5];
> + __le64  prp1;
> + __le64  prp2;
> + __u32   rsvd12[6];
> +};
> +
>  struct nvme_command {
>   union {
>   struct nvme_common_command common;
> @@ -845,6 +858,7 @@ struct nvme_command {
>   struct nvmf_connect_command connect;
>   struct nvmf_property_set_command prop_set;
>   struct nvmf_property_get_command prop_get;
> + struct nvme_doorbell_memory doorbell_memory;
>   };
>  };

This looks like a new NVMe command being introduced, not found in the
latest NVMe specs (NVMe 1.2.1 spec or NVMe-over-Fabrics 1.0 spec)?

This is a big NACK, the command needs to be part of the NVMe standard
before adding it to the NVMe code base (this is exactly how NVMe-over-
Fabrics standard got implemented).  I would bring your proposal to
nvmexpress.org.

Jay


>  
> @@ -934,6 +948,9 @@ enum {
>   /*
>    * Media and Data Integrity Errors:
>    */
> +#ifdef CONFIG_NVME_VDB
> + NVME_SC_DOORBELL_MEMORY_INVALID = 0x1C0,
> +#endif
>   NVME_SC_WRITE_FAULT = 0x280,
>   NVME_SC_READ_ERROR  = 0x281,
>   NVME_SC_GUARD_CHECK = 0x282,


Re: [PATCH -next] nvme-rdma: fix the return value of nvme_rdma_reinit_request()

2016-07-12 Thread J Freyensee
On Tue, 2016-07-12 at 11:06 +, weiyj...@163.com wrote:
> From: Wei Yongjun 
> 
> PTR_ERR should be applied before its argument is reassigned,
> otherwise the
> return value will be set to 0, not error code.

Another good catch.

Reviewed-by: Jay Freyensee 

> 
> Signed-off-by: Wei Yongjun 

>   if (IS_ERR(req->mr)) {
> - req->mr = NULL;
>   ret = PTR_ERR(req->mr);
> + req->mr = NULL;
>   }



Re: [PATCH] nvme-loop: add configfs dependency

2016-07-11 Thread J Freyensee
On Sun, 2016-07-10 at 14:14 +0200, Christoph Hellwig wrote:
> On Thu, Jul 07, 2016 at 08:35:17AM -0600, Jens Axboe wrote:
> > Thanks Arnd, applied.
> 
> Actually I think we should replace the select with the depends.  In
> fact I though I had done that a while ago, but I must have messed it
> up.
> 
> Btw - do you plan to grab patches directly from the list now or
> do you want me to queue them up?  There are a few more pending on
> the list that should go in:
> 
> http://lists.infradead.org/pipermail/linux-nvme/2016-June/005150.html
> (whole series)
> 
> http://lists.infradead.org/pipermail/linux-nvme/2016-July/005290.html
> (needs the attribution fixed to be from Ming)

I can re-send the patch changing "Fix by: " to "From: " if that helps.

Jay



Re: [PATCH -next] nvmet: fix return value check in nvmet_subsys_alloc()

2016-07-08 Thread J Freyensee
On Wed, 2016-07-06 at 12:02 +, weiyj...@163.com wrote:
> From: Wei Yongjun 
> 
> In case of error, the function kstrndup() returns NULL pointer
> not ERR_PTR(). The IS_ERR() test in the return value check
> should be replaced with NULL test.
> 
> Signed-off-by: Wei Yongjun 

Looks good and inline with how other drivers use kstrndup(), thanks for
the fix.

Reviewed-by: Jay Freyensee 

> ---
>  drivers/nvme/target/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
> index e0b3f01..8a891ca 100644
> --- a/drivers/nvme/target/core.c
> +++ b/drivers/nvme/target/core.c
> @@ -895,7 +895,7 @@ struct nvmet_subsys *nvmet_subsys_alloc(const
> char *subsysnqn,
>   subsys->type = type;
>   subsys->subsysnqn = kstrndup(subsysnqn, NVMF_NQN_SIZE,
>   GFP_KERNEL);
> - if (IS_ERR(subsys->subsysnqn)) {
> + if (!subsys->subsysnqn) {
>   kfree(subsys);
>   return NULL;
>   }
> 
> 
> 
> ___
> Linux-nvme mailing list
> linux-n...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme


Re: [PATCH 5/6] lightnvm: expose device geometry through sysfs

2016-06-30 Thread J Freyensee
On Wed, 2016-06-29 at 16:51 +0200, Matias Bjørling wrote:
> From: "Simon A. F. Lund" 
> 
> For a host to access an Open-Channel SSD, it has to know its
> geometry,
> so that it writes and reads at the appropriate device bounds.
> 
> Currently, the geometry information is kept within the kernel, and
> not
> exported to user-space for consumption. This patch exposes the
> configuration through sysfs and enables user-space libraries, such as
> liblightnvm, to use the sysfs implementation to get the geometry of
> an
> Open-Channel SSD.
> 
> The sysfs entries are stored within the device hierarchy, and can be
> found using the "lightnvm" device type.
> 
> An example configuration looks like this:
> 
> /sys/class/nvme/
> └── nvme0n1
>├── capabilities: 3
>├── device_mode: 1
>├── channel_parallelism: 0
>├── erase_max: 100
>├── erase_typ: 100
>├── flash_media_type: 0
>├── media_capabilities: 0x0001
>├── media_type: 0
>├── multiplane: 0x00010101
>├── num_blocks: 1022
>├── num_channels: 1
>├── num_luns: 4
>├── num_pages: 64
>├── num_planes: 1
>├── page_size: 4096
>├── prog_max: 10
>├── prog_typ: 10
>├── read_max: 1
>├── read_typ: 1
>├── sector_oob_size: 0
>├── sector_size: 4096
>├── media_manager: gennvm
>├── ppa_format: 0x380830082808001010102008
>├── vendor_opcode: 0
>└── version: 1
> 

That is an awful lot of new things to add under nvme0n1-type sysfs
entries when there is already a decent amount of stuff under it.

Any chance these new things could be stuck under a separate sysfs
directory under each nvmeXnY device?  If these things are mainly
beneficial to LightNVM, it will be easier for a LightNVM newbie to
find, recognize, and consider all the important things in an Open
Channel SSD solution if it's under a separate directory.  And for
current SSD solutions that don't seem to need these things exposed in
sysfs for operation, it will make what is directly under nvmeXnY
directories less cluttered.

Thanks,
Jay



Re: [PATCHv1] NVMe: nvme_queue made cache friendly.

2015-05-21 Thread J Freyensee
On Wed, 2015-05-20 at 16:43 -0400, Parav Pandit wrote:
> nvme_queue structure made 64B cache friendly so that majority of the
> data elements of the structure during IO and completion path can be
> found in typical single 64B cache line size which was previously spanning
> beyond single 64B cache line size.
> 
> By aligning most of the fields are found at start of the structure.
> Elements which are not used in frequent IO path are moved at the
> end of structure.

I'll repeat the same question Matthew said last time:

"Have you done any performance measurements on this?"

If the answer is no, then I'm not sure why the patch is even being sent
to apply to the code base if the main reason is performance-related.
>From the comments from the last patch attempt, it did not even sound
like there was a good understanding where the q_lock should go for best
performance.

I think it would be better to have some results to go along with the
patch request.  At least it would be known for sure where the q_lock
should go.  And that would be good knowledge to know for future
programming projects.

> 
> Signed-off-by: Parav Pandit 
> ---
>  drivers/block/nvme-core.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> index b9ba36f..58041c7 100644
> --- a/drivers/block/nvme-core.c
> +++ b/drivers/block/nvme-core.c
> @@ -98,23 +98,23 @@ struct async_cmd_info {
>  struct nvme_queue {
>   struct device *q_dmadev;
>   struct nvme_dev *dev;
> - char irqname[24];   /* nvme4294967295-65535\0 */
> - spinlock_t q_lock;
>   struct nvme_command *sq_cmds;
> + struct blk_mq_hw_ctx *hctx;
>   volatile struct nvme_completion *cqes;
> - dma_addr_t sq_dma_addr;
> - dma_addr_t cq_dma_addr;
>   u32 __iomem *q_db;
> + spinlock_t q_lock;
>   u16 q_depth;
> - s16 cq_vector;
>   u16 sq_head;
>   u16 sq_tail;
>   u16 cq_head;
>   u16 qid;
> + s16 cq_vector;
>   u8 cq_phase;
>   u8 cqe_seen;
>   struct async_cmd_info cmdinfo;
> - struct blk_mq_hw_ctx *hctx;
> + char irqname[24];   /* nvme4294967295-65535\0 */
> + dma_addr_t sq_dma_addr;
> + dma_addr_t cq_dma_addr;
>  };
>  
>  /*


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] NVMe: Fix error handling of class_create("nvme")

2015-03-10 Thread J Freyensee
On Sat, 2015-03-07 at 01:43 +0300, Alexey Khoroshilov wrote:
> class_create() returns ERR_PTR on failure,
> so IS_ERR() should be used instead of check for NULL.
> 
> Found by Linux Driver Verification project (linuxtesting.org).
> 
> Signed-off-by: Alexey Khoroshilov 
> ---
>  drivers/block/nvme-core.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> index ceb32dd52a6c..30ac50b3009d 100644
> --- a/drivers/block/nvme-core.c
> +++ b/drivers/block/nvme-core.c
> @@ -3165,8 +3165,10 @@ static int __init nvme_init(void)
>   nvme_char_major = result;
>  
>   nvme_class = class_create(THIS_MODULE, "nvme");
> - if (!nvme_class)
> + if (IS_ERR(nvme_class)) {

Looking at IS_ERR(), it uses IS_ERR_VALUE() which uses unlikely(), which
from what I understand is a compiler hint that means "this error is
unlikely to happen".

Well, is this error unlikely to happen?  Looks like the error could be
caused by kzalloc() failing, which could more likely happen in a
low-cost Chromebook OS laptop (nvme boot driver just got accepted into
the OS). Or a number of other things going wrong within the
__class_register() call.

Without understanding the code in fine-print detail, it seems like how
'nvme_class' condition is checked can be a bit arbitrary.

> + result = PTR_ERR(nvme_class);
>   goto unregister_chrdev;
> + }
>  
>   result = pci_register_driver(&nvme_driver);
>   if (result)


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/