date:20181115

Re: [PATCH] mm: cleancache: fix corruption on missed inode invalidation

2018-11-15 Thread Vasily Averin

On 11/16/18 1:31 AM, Andrew Morton wrote:
> On Mon, 12 Nov 2018 12:57:34 +0300 Pavel Tikhomirov wrote:
> 
>> If all pages are deleted from the mapping by memory reclaim and also
>> moved to the cleancache:
>>
>> __delete_from_page_cache
>>   (no shadow case)
>>   unaccount_page_cache_page
>> cleancache_put_page
>>   page_cache_delete
>> mapping->nrpages -= nr
>> (nrpages becomes 0)
>>
>> We don't clean the cleancache for an inode after final file truncation
>> (removal).
>>
>> truncate_inode_pages_final
>>   check (nrpages || nrexceptional) is false
>> no truncate_inode_pages
>>   no cleancache_invalidate_inode(mapping)
>>
>> This way, when reading a new file created with the same inode, we may get
>> these trash leftover pages from cleancache and see wrong data instead of
>> the contents of the new file.
>>
>> Fix it by always calling truncate_inode_pages, which already handles the
>> nrpages == 0 && nrexceptional == 0 case and just invalidates the inode.
>>
> 
> Data corruption sounds serious.  Shouldn't we backport this into
> -stable kernels?

Yes, it was broken in the 4.14 kernel and it should affect everyone who uses cleancache.
Fixes: commit 91b0abe36a7b ("mm + fs: store shadow entries in page cache")
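
The failure sequence described in the quoted commit message can be
sketched as a toy user-space model. This is only an illustration, not
kernel code: every toy_* name is hypothetical, the cache holds a single
page, and the always_truncate flag stands in for the behavior before
(false) and after (true) the fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model (hypothetical names): a cleancache holding at most one page. */
struct toy_cleancache {
	bool has_page;
	unsigned long ino;
	const char *data;
};

struct toy_mapping {
	unsigned long ino;
	unsigned long nrpages;
	unsigned long nrexceptional;
};

static void toy_cleancache_put(struct toy_cleancache *cc, unsigned long ino,
			       const char *data)
{
	cc->has_page = true;
	cc->ino = ino;
	cc->data = data;
}

static void toy_cleancache_invalidate_inode(struct toy_cleancache *cc,
					    unsigned long ino)
{
	if (cc->has_page && cc->ino == ino) {
		cc->has_page = false;
		cc->data = NULL;
	}
}

/* Reclaim: the page leaves the page cache and is moved to cleancache. */
static void toy_reclaim(struct toy_mapping *m, struct toy_cleancache *cc,
			const char *data)
{
	assert(m->nrpages > 0);
	toy_cleancache_put(cc, m->ino, data);
	m->nrpages--;
}

/* always_truncate == false models the old check, true models the fix. */
static void toy_truncate_final(struct toy_mapping *m,
			       struct toy_cleancache *cc, bool always_truncate)
{
	if (always_truncate || m->nrpages || m->nrexceptional) {
		m->nrpages = 0;
		m->nrexceptional = 0;
		toy_cleancache_invalidate_inode(cc, m->ino);
	}
}

/* True if a new file reusing the inode number could see the old page. */
static bool toy_stale_after_reuse(bool always_truncate)
{
	struct toy_cleancache cc = { false, 0, NULL };
	struct toy_mapping m = { 42, 1, 0 };

	toy_reclaim(&m, &cc, "old contents");
	toy_truncate_final(&m, &cc, always_truncate);
	return cc.has_page && cc.ino == m.ino;
}
```

In this model, toy_stale_after_reuse(false) leaves the old page behind
because nrpages is already 0 when the final truncate runs, while
toy_stale_after_reuse(true) invalidates it.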

Re: [PATCH v2 1/2] pci: prevent sk hynix nvme from entering D3

2018-11-15 Thread Christoph Hellwig

On Thu, Nov 15, 2018 at 11:30:15AM -0600, Bjorn Helgaas wrote:
> 
> But I guess you have to do this anyway just to add the vendor/device
> ID to the driver, so maybe this isn't a big deal to you.  If you can
> do a quirk like this in the driver, it would be invisible to me and I
> wouldn't care.  I just don't want to deal with ongoing tweaks like
> this in the PCI core :)

No, NVMe is a spec with a class code, and a specification that is
vendor independent.  NVMe devices declare individual features based
on common fields.

APST is an optional feature with all kinds of parameters, but there
is absolutely no language that a host should not put the device into
D3 mode if APST is supported anywhere in the NVMe spec, and such
behavior is also rather counterintuitive.  If SK Hynix thinks this
is sensible behavior they should bring it up in the NVMe technical
working group.  I've pinged a contact there to see what this whole
story is about.

Re: [PATCH 1/1] Input: synaptics - enable SMBus for HP 15-ay000 (SYN3221).

2018-11-15 Thread Benjamin Tissoires

On Fri, Nov 16, 2018 at 6:00 AM Teika Kazura  wrote:
>
> SMBus works fine for the touchpad with id SYN3221, used in the HP 15-ay000 series,

Nice.

Thanks for the patch.

Reviewed-by: Benjamin Tissoires 

Cheers,
Benjamin

>
> This device has been reported in these messages in the "linux-input" mailing list:
> * https://marc.info/?l=linux-input&m=152016683003369&w=2
> * https://www.spinics.net/lists/linux-input/msg52525.html
>
> Reported-by: Nitesh Debnath 
> Reported-by: Teika Kazura 
> Signed-off-by: Teika Kazura 
> ---
>  drivers/input/mouse/synaptics.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/input/mouse/synaptics.c b/drivers/input/mouse/synaptics.c
> index 55d33500d5..591b776f22 100644
> --- a/drivers/input/mouse/synaptics.c
> +++ b/drivers/input/mouse/synaptics.c
> @@ -179,6 +179,7 @@ static const char * const smbus_pnp_ids[] = {
> "LEN0096", /* X280 */
> "LEN0097", /* X280 -> ALPS trackpoint */
> "LEN200f", /* T450s */
> +   "SYN3221", /* HP 15-ay000 */
> NULL
>  };
>
> --
> 2.18.1

Re: [PATCH v2 1/2] Makefile: Fix distcc compilation with x86 macros

2018-11-15 Thread Masahiro Yamada

On Thu, Nov 15, 2018 at 1:01 PM Nadav Amit  wrote:
>
> Introducing the use of asm macros in c-code broke distcc, since it only
> sends the preprocessed source file. The solution is to break the
> compilation into two separate phases of compilation and assembly, and
> between the two concatenate the assembly macros and the compiled (yet
> not assembled) source file. Since this is less efficient, this
> compilation mode is only used when distcc or icecc are used.
>
> Note that the assembly stage should also be distributed, if distcc is
> configured using "CFLAGS=-DENABLE_REMOTE_ASSEMBLE".
>
> Reported-by: Logan Gunthorpe 
> Signed-off-by: Nadav Amit 


Wow, this is so ugly.
I realized how much I hated this by now.

My question is, how long do we need to carry this?


As far as I understood from the long discussion
https://lkml.org/lkml/2018/10/7/92
people are trying to deal with it on the compiler side.
Is it right?


https://gcc.gnu.org/ml/gcc-patches/2018-10/msg01932.html

Once it is supported, what would happen on the kernel side?





> ---
>  Makefile   |  4 +++-
>  arch/x86/Makefile  |  7 +--
>  scripts/Makefile.build | 30 --
>  3 files changed, 36 insertions(+), 5 deletions(-)
>
> diff --git a/Makefile b/Makefile
> index 9fce8b91c15f..c07349fc38c7 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -743,7 +743,9 @@ KBUILD_CFLAGS   += $(call cc-option, -gsplit-dwarf, -g)
>  else
>  KBUILD_CFLAGS  += -g
>  endif
> -KBUILD_AFLAGS  += -Wa,-gdwarf-2
> +AFLAGS_DEBUG_INFO = -Wa,-gdwarf-2
> +export AFLAGS_DEBUG_INFO
> +KBUILD_AFLAGS  += $(AFLAGS_DEBUG_INFO)
>  endif
>  ifdef CONFIG_DEBUG_INFO_DWARF4
>  KBUILD_CFLAGS  += $(call cc-option, -gdwarf-4,)
> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> index f5d7f4134524..b5953cbcc9c8 100644
> --- a/arch/x86/Makefile
> +++ b/arch/x86/Makefile
> @@ -235,10 +235,13 @@ archscripts: scripts_basic
>  archheaders:
> $(Q)$(MAKE) $(build)=arch/x86/entry/syscalls all
>
> +ASM_MACRO_FILE = arch/x86/kernel/macros.s
> +export ASM_MACRO_FILE
> +
>  archmacros:
> -   $(Q)$(MAKE) $(build)=arch/x86/kernel arch/x86/kernel/macros.s
> +   $(Q)$(MAKE) $(build)=arch/x86/kernel $(ASM_MACRO_FILE)
>
> -ASM_MACRO_FLAGS = -Wa,arch/x86/kernel/macros.s
> +ASM_MACRO_FLAGS = -Wa,$(ASM_MACRO_FILE)
>  export ASM_MACRO_FLAGS
>  KBUILD_CFLAGS += $(ASM_MACRO_FLAGS)
>
> diff --git a/scripts/Makefile.build b/scripts/Makefile.build
> index 6a6be9f440cf..b8d26bdf48b0 100644
> --- a/scripts/Makefile.build
> +++ b/scripts/Makefile.build
> @@ -155,8 +155,34 @@ $(obj)/%.ll: $(src)/%.c FORCE
>
>  quiet_cmd_cc_o_c = CC $(quiet_modtag)  $@
>
> +# If distcc (or icecc) are used, and when assembly macro files are needed, the
> +# compilation stage and the assembly stage needs to be separated. Providing the
> +# "IGNORE_DISTCC=y" option disables separate compilation and assembly.
> +
> +cmd_cc_o_c_direct = $(CC) $(c_flags) -c -o $(1) $<
> +
> +ifneq ($(if $(IGNORE_DISTCC),,$(shell $(CC) --version 2>&1 | head -n 1 | grep -E 'distcc|ICECC')),)
> +a_flags_no_debug = $(filter-out $(AFLAGS_DEBUG_INFO), $(a_flags))
> +c_flags_no_macros = $(filter-out $(ASM_MACRO_FLAGS), $(c_flags))
> +
> +cmd_cc_o_c_two_steps = \
> +   $(CC) $(c_flags_no_macros) $(DISABLE_LTO) -fverbose-asm -S  \
> +   -o $(@D)/.$(@F:.o=.s) $<;   \
> +   cat $(ASM_MACRO_FILE) $(@D)/.$(@F:.o=.s) >  \
> +   $(@D)/.tmp_$(@F:.o=.s); \
> +   $(CC) $(a_flags_no_debug) -c -o $(1) $(@D)/.tmp_$(@F:.o=.s);\
> +   rm -f $(@D)/.$(@F:.o=.s) $(@D)/.tmp_$(@F:.o=.s) \
> +
> +cmd_cc_o_c_helper =\
> +   $(if $(findstring $(ASM_MACRO_FLAGS),$(c_flags)),   \
> +   $(call cmd_cc_o_c_two_steps, $(1)), \
> +   $(call cmd_cc_o_c_direct, $(1)))
> +else
> +cmd_cc_o_c_helper = $(call cmd_cc_o_c_direct, $(1))
> +endif
> +
>  ifndef CONFIG_MODVERSIONS
> -cmd_cc_o_c = $(CC) $(c_flags) -c -o $@ $<
> +cmd_cc_o_c = $(call cmd_cc_o_c_helper,$@)
>
>  else
>  # When module versioning is enabled the following steps are executed:
> @@ -171,7 +197,7 @@ else
>  #   replace the unresolved symbols __crc_exported_symbol with
>  #   the actual value of the checksum generated by genksyms
>
> -cmd_cc_o_c = $(CC) $(c_flags) -c -o $(@D)/.tmp_$(@F) $<
> +cmd_cc_o_c = $(call cmd_cc_o_c_helper,$(@D)/.tmp_$(@F))
>
> cmd_modversions_c = \
> if $(OBJDUMP) -h $(@D)/.tmp_$(@F) | grep -q __ksymtab; then \
> --
> 2.17.1
>


-- 
Best Regards
Masahiro Yamada

[PATCH 2/2] ASoC: sdm845: Add support for Secondary MI2S interface

2018-11-15 Thread Rohit kumar

Add support to configure bit clock for secondary MI2S
TX interface.

Signed-off-by: Rohit kumar 
---
 sound/soc/qcom/sdm845.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/sound/soc/qcom/sdm845.c b/sound/soc/qcom/sdm845.c
index 84e6ee7..58593db 100644
--- a/sound/soc/qcom/sdm845.c
+++ b/sound/soc/qcom/sdm845.c
@@ -19,6 +19,7 @@
 struct sdm845_snd_data {
struct snd_soc_card *card;
uint32_t pri_mi2s_clk_count;
+   uint32_t sec_mi2s_clk_count;
uint32_t quat_tdm_clk_count;
 };
 
@@ -121,6 +122,15 @@ static int sdm845_snd_startup(struct snd_pcm_substream *substream)
snd_soc_dai_set_fmt(cpu_dai, fmt);
break;
 
+   case SECONDARY_MI2S_TX:
+   if (++(data->sec_mi2s_clk_count) == 1) {
+   snd_soc_dai_set_sysclk(cpu_dai,
+   Q6AFE_LPASS_CLK_ID_SEC_MI2S_IBIT,
+   MI2S_BCLK_RATE, SNDRV_PCM_STREAM_CAPTURE);
+   }
+   snd_soc_dai_set_fmt(cpu_dai, fmt);
+   break;
+
case QUATERNARY_TDM_RX_0:
case QUATERNARY_TDM_TX_0:
if (++(data->quat_tdm_clk_count) == 1) {
@@ -157,6 +167,14 @@ static void  sdm845_snd_shutdown(struct snd_pcm_substream *substream)
};
break;
 
+   case SECONDARY_MI2S_TX:
+   if (--(data->sec_mi2s_clk_count) == 0) {
+   snd_soc_dai_set_sysclk(cpu_dai,
+   Q6AFE_LPASS_CLK_ID_SEC_MI2S_IBIT,
+   0, SNDRV_PCM_STREAM_CAPTURE);
+   }
+   break;
+
case QUATERNARY_TDM_RX_0:
case QUATERNARY_TDM_TX_0:
if (--(data->quat_tdm_clk_count) == 0) {
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.,
is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
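
The startup/shutdown hunks above follow a use-count pattern: the bit
clock is programmed when the first stream starts and set to 0 when the
last one stops. A minimal user-space sketch of that pattern, with
hypothetical toy_* names standing in for the driver state and for the
snd_soc_dai_set_sysclk() calls:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one clock and its per-interface use count. */
struct toy_clk {
	uint32_t users;
	unsigned int rate;
};

/* First user programs the rate, later users only bump the count. */
static void toy_clk_startup(struct toy_clk *clk, unsigned int rate)
{
	if (++clk->users == 1)
		clk->rate = rate;
}

/* Last user sets the rate to 0; earlier ones only drop the count. */
static void toy_clk_shutdown(struct toy_clk *clk)
{
	assert(clk->users > 0);
	if (--clk->users == 0)
		clk->rate = 0;
}
```

With two overlapping streams, the rate stays programmed until the second
toy_clk_shutdown() call.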

[PATCH 0/2] ASoC: SDM845: Update MI2S and TDM configuration

2018-11-15 Thread Rohit kumar

Update bit clock rate, slot width for TDM and MI2S
interfaces. Also add support for secondary MI2S TX
interface in SDM845 machine driver.

Rohit kumar (2):
  ASoC: sdm845: Update slot_width for Quaternary TDM port
  ASoC: sdm845: Add support for Secondary MI2S interface

 sound/soc/qcom/sdm845.c | 27 +++
 1 file changed, 23 insertions(+), 4 deletions(-)

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.,
is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

[PATCH 1/2] ASoC: sdm845: Update slot_width for Quaternary TDM port

2018-11-15 Thread Rohit kumar

Change slot_width for quaternary TDM port to 16 and
update bclk rate for TDM and MI2S interfaces
accordingly.

Signed-off-by: Rohit kumar 
---
 sound/soc/qcom/sdm845.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/sound/soc/qcom/sdm845.c b/sound/soc/qcom/sdm845.c
index 8d0cdff..84e6ee7 100644
--- a/sound/soc/qcom/sdm845.c
+++ b/sound/soc/qcom/sdm845.c
@@ -13,7 +13,8 @@
 
 #define DEFAULT_SAMPLE_RATE_48K 48000
 #define DEFAULT_MCLK_RATE  24576000
-#define DEFAULT_BCLK_RATE  12288000
+#define TDM_BCLK_RATE  6144000
+#define MI2S_BCLK_RATE 1536000
 
 struct sdm845_snd_data {
struct snd_soc_card *card;
@@ -33,7 +34,7 @@ static int sdm845_tdm_snd_hw_params(struct snd_pcm_substream *substream,
 
switch (params_format(params)) {
case SNDRV_PCM_FORMAT_S16_LE:
-   slot_width = 32;
+   slot_width = 16;
break;
default:
dev_err(rtd->dev, "%s: invalid param format 0x%x\n",
@@ -115,7 +116,7 @@ static int sdm845_snd_startup(struct snd_pcm_substream *substream)
DEFAULT_MCLK_RATE, SNDRV_PCM_STREAM_PLAYBACK);
snd_soc_dai_set_sysclk(cpu_dai,
Q6AFE_LPASS_CLK_ID_PRI_MI2S_IBIT,
-   DEFAULT_BCLK_RATE, SNDRV_PCM_STREAM_PLAYBACK);
+   MI2S_BCLK_RATE, SNDRV_PCM_STREAM_PLAYBACK);
}
snd_soc_dai_set_fmt(cpu_dai, fmt);
break;
@@ -125,7 +126,7 @@ static int sdm845_snd_startup(struct snd_pcm_substream *substream)
if (++(data->quat_tdm_clk_count) == 1) {
snd_soc_dai_set_sysclk(cpu_dai,
Q6AFE_LPASS_CLK_ID_QUAD_TDM_IBIT,
-   DEFAULT_BCLK_RATE, SNDRV_PCM_STREAM_PLAYBACK);
+   TDM_BCLK_RATE, SNDRV_PCM_STREAM_PLAYBACK);
}
break;
 
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.,
is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
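
The new constants are consistent with the usual relation
bit clock = sample rate * slot width * number of slots. The slot counts
used below (2 for MI2S, 8 for TDM) are an assumption chosen because they
reproduce the values in the patch; the patch itself does not state them.
A small sketch with a hypothetical helper:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bit clock in Hz for a given frame layout. */
static uint32_t toy_bclk_rate(uint32_t sample_rate, uint32_t slot_width,
			      uint32_t slots)
{
	assert(sample_rate && slot_width && slots);
	return sample_rate * slot_width * slots;
}
```

Under that assumption, 48000 * 16 * 2 gives MI2S_BCLK_RATE (1536000),
48000 * 16 * 8 gives TDM_BCLK_RATE (6144000), and the old
DEFAULT_BCLK_RATE (12288000) corresponds to 48000 * 32 * 8.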

Re: [PATCH v2 2/2] x86: set a dependency on macros.S

2018-11-15 Thread Masahiro Yamada

On Thu, Nov 15, 2018 at 1:01 PM Nadav Amit  wrote:
>
> Changes in macros.S should trigger the recompilation of all C files, as
> the macros might need to affect their compilation.
>
> Acked-by: Ingo Molnar 
> Signed-off-by: Nadav Amit 
> ---

When we talked about this last time,
we agreed to not do this
because a single line change in asm headers
would cause global rebuilding.

Did you change your mind?




>  scripts/Makefile.build | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/scripts/Makefile.build b/scripts/Makefile.build
> index b8d26bdf48b0..efec77991c2b 100644
> --- a/scripts/Makefile.build
> +++ b/scripts/Makefile.build
> @@ -313,13 +313,13 @@ cmd_undef_syms = echo
>  endif
>
>  # Built-in and composite module parts
> -$(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) FORCE
> +$(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) $(ASM_MACRO_FILE:.s=.S) FORCE
> $(call cmd,force_checksrc)
> $(call if_changed_rule,cc_o_c)
>
>  # Single-part modules are special since we need to mark them in $(MODVERDIR)
>
> -$(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) FORCE
> +$(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) $(ASM_MACRO_FILE:.s=.S) FORCE
> $(call cmd,force_checksrc)
> $(call if_changed_rule,cc_o_c)
> @{ echo $(@:.o=.ko); echo $@; \
> --
> 2.17.1
>


-- 
Best Regards
Masahiro Yamada

Re: [RFC PATCH 5/5] mm, memory_hotplug: be more verbose for memory offline failures

2018-11-15 Thread Michal Hocko

On Thu 15-11-18 16:07:16, Andrew Morton wrote:
> On Wed,  7 Nov 2018 11:18:30 +0100 Michal Hocko  wrote:
> 
> > From: Michal Hocko 
> > 
> > There is only very limited information printed when the memory offlining
> > fails:
> > [ 1984.506184] rac1 kernel: memory offlining [mem 0x826-0x8267fff] failed due to signal backoff
> > 
> > This tells us that the failure is triggered by the userspace
> > intervention but it doesn't tell us much more about the underlying
> > reason. It might be that the page migration fails repeatedly and the
> > userspace timeout expires and sends a signal, or it might be that some of
> > the earlier steps (isolation, memory notifier) take too long.
> > 
> > If the migration fails then it would be really helpful to see which
> > page that is and its state. The same applies to the isolation phase. If we
> > fail to isolate a page from the allocator then knowing the state of the
> > page would be helpful as well.
> > 
> > Dump the page state that fails to get isolated or migrated. This will
> > tell us more about the failure and what to focus on during debugging.
> > 
> > ...
> >
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1388,10 +1388,8 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> > page_is_file_cache(page));
> >  
> > } else {
> > -#ifdef CONFIG_DEBUG_VM
> > -   pr_alert("failed to isolate pfn %lx\n", pfn);
> > +   pr_warn("failed to isolate pfn %lx\n", pfn);
> > dump_page(page, "isolation failed");
> > -#endif
> > put_page(page);
> > /* Because we don't have big zone->lock. we should
> >check this again here. */
> > @@ -1411,8 +1409,14 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> > /* Allocate a new page from the nearest neighbor node */
> > ret = migrate_pages(&source, new_node_page, NULL, 0,
> > MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
> > -   if (ret)
> > +   if (ret) {
> > +   list_for_each_entry(page, &source, lru) {
> > +   pr_warn("migrating pfn %lx failed ",
> > +  page_to_pfn(page), ret);
> > +   dump_page(page, NULL);
> > +   }
> 
> ./include/linux/kern_levels.h:5:18: warning: too many arguments for format [-Wformat-extra-args]
>  #define KERN_SOH "\001"  /* ASCII Start Of Header */
>   ^
> ./include/linux/kern_levels.h:12:22: note: in expansion of macro ‘KERN_SOH’
>  #define KERN_WARNING KERN_SOH "4" /* warning conditions */
>   ^~~~
> ./include/linux/printk.h:310:9: note: in expansion of macro ‘KERN_WARNING’
>   printk(KERN_WARNING pr_fmt(fmt), ##__VA_ARGS__)
>  ^~~~
> ./include/linux/printk.h:311:17: note: in expansion of macro ‘pr_warning’
>  #define pr_warn pr_warning
>  ^~
> mm/memory_hotplug.c:1414:5: note: in expansion of macro ‘pr_warn’
>  pr_warn("migrating pfn %lx failed ",
>  ^~~

yeah, 0day already complained and I've posted a follow up fix
http://lkml.kernel.org/r/20181108081231.gn27...@dhcp22.suse.cz

Let me post a version 2 with all the fixups.
 
Thanks!

> --- a/mm/memory_hotplug.c~mm-memory_hotplug-be-more-verbose-for-memory-offline-failures-fix
> +++ a/mm/memory_hotplug.c
> @@ -1411,7 +1411,7 @@ do_migrate_range(unsigned long start_pfn
>   MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
>   if (ret) {
>   list_for_each_entry(page, &source, lru) {
> - pr_warn("migrating pfn %lx failed ",
> + pr_warn("migrating pfn %lx failed: %d",
>  page_to_pfn(page), ret);
>   dump_page(page, NULL);
>   }
> 

-- 
Michal Hocko
SUSE Labs
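
The warning quoted above comes from a format string with one conversion
being given two arguments. A minimal user-space sketch of the corrected
message, using snprintf() in place of pr_warn() (the helper name is
hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: formats the message with one conversion per
 * argument, matching the "%lx failed: %d" form of the follow-up fix.
 * Returns what snprintf() returns.
 */
static int toy_format_migrate_failure(char *buf, size_t len,
				      unsigned long pfn, int ret)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "migrating pfn %lx failed: %d", pfn, ret);
}
```

Dropping the ": %d" from the format while still passing ret reproduces
the -Wformat-extra-args diagnostic when format checking is enabled.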

[PATCH 5/6] zram: add bd_stat statistics

2018-11-15 Thread Minchan Kim

bd_stat represents things that happened in the backing device. Currently,
it supports bd_count, bd_reads and bd_writes, which are helpful
for understanding flash wearout and memory saving.

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |  8 ++
 Documentation/blockdev/zram.txt| 11 
 drivers/block/zram/zram_drv.c  | 31 ++
 drivers/block/zram/zram_drv.h  |  5 
 4 files changed, 55 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index d1f80b077885..a4daca7e5043 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -113,3 +113,11 @@ Contact:   Minchan Kim 
 Description:
The writeback file is write-only and trigger idle and/or
huge page writeback to backing device.
+
+What:  /sys/block/zram/bd_stat
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The bd_stat file is read-only and represents backing device's
+   statistics (bd_count, bd_reads, bd_writes.) in a format
+   similar to block layer statistics file format.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 60b585dab6e0..1f4907307a0d 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -221,6 +221,17 @@ The stat file represents device's mm statistics. It consists of a single
  pages_compacted  the number of pages freed during compaction
  huge_pages  the number of incompressible pages
 
+File /sys/block/zram/bd_stat
+
+The stat file represents device's backing device statistics. It consists of
+a single line of text and contains the following stats separated by whitespace:
+ bd_count  size of data written in backing device.
+   Unit: pages
+ bd_reads  the number of reads from backing device
+   Unit: pages
+ bd_writes the number of writes to backing device
+   Unit: pages
+
 9) Deactivate:
swapoff /dev/zram0
umount /dev/zram1
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index b7b5c9e5f0cd..17d566d9a321 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -505,6 +505,8 @@ static unsigned long alloc_block_bdev(struct zram *zram)
ret = blk_idx;
 out:
spin_unlock_irq(&zram->bitmap_lock);
+   if (ret != 0)
+   atomic64_inc(&zram->stats.bd_count);
 
return ret;
 }
@@ -518,6 +520,7 @@ static void free_block_bdev(struct zram *zram, unsigned long blk_idx)
was_set = test_and_clear_bit(blk_idx, zram->bitmap);
spin_unlock_irqrestore(&zram->bitmap_lock, flags);
WARN_ON_ONCE(!was_set);
+   atomic64_dec(&zram->stats.bd_count);
 }
 
 static void zram_page_end_io(struct bio *bio)
@@ -661,6 +664,7 @@ static ssize_t writeback_store(struct device *dev,
continue;
}
 
+   atomic64_inc(&zram->stats.bd_writes);
/*
 * We released zram_slot_lock so need to check if the slot was
 * changed. If there is freeing for the slot, we can catch it
@@ -748,6 +752,7 @@ static int read_from_bdev_sync(struct zram *zram, struct bio_vec *bvec,
 static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
unsigned long entry, struct bio *parent, bool sync)
 {
+   atomic64_inc(&zram->stats.bd_reads);
if (sync)
return read_from_bdev_sync(zram, bvec, entry, parent);
else
@@ -790,6 +795,7 @@ static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
 
submit_bio(bio);
*pentry = entry;
+   atomic64_inc(&zram->stats.bd_writes);
 
return 0;
 }
@@ -1053,6 +1059,25 @@ static ssize_t mm_stat_show(struct device *dev,
return ret;
 }
 
+#ifdef CONFIG_ZRAM_WRITEBACK
+static ssize_t bd_stat_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct zram *zram = dev_to_zram(dev);
+   ssize_t ret;
+
+   down_read(&zram->init_lock);
+   ret = scnprintf(buf, PAGE_SIZE,
+   "%8llu %8llu %8llu\n",
+   (u64)atomic64_read(&zram->stats.bd_count),
+   (u64)atomic64_read(&zram->stats.bd_reads),
+   (u64)atomic64_read(&zram->stats.bd_writes));
+   up_read(&zram->init_lock);
+
+   return ret;
+}
+#endif
+
 static ssize_t debug_stat_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
@@ -1073,6 +1098,9 @@ static ssize_t debug_stat_show(struct device *dev,
 
 static DEVICE_ATTR_RO(io_stat);
 static DEVICE_ATTR_RO(mm_stat);
+#ifdef CONFIG_ZRAM_WRITEBACK
+static DEVICE_ATTR_RO(bd_stat);
+#endif
 static DEVICE_ATTR_RO(debug_stat);
 
 static void zram_meta_free(struct zram *zr
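
The counter scheme in this patch can be sketched in user space with C11
atomics standing in for the kernel's atomic64_t. All toy_* names are
hypothetical, and the sketch folds block allocation and the write into
one step, which the driver does in separate functions.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical user-space stand-in for the three bd_stat counters. */
struct toy_bd_stats {
	_Atomic uint64_t bd_count;	/* blocks currently held on the device */
	_Atomic uint64_t bd_reads;	/* pages read back from the device */
	_Atomic uint64_t bd_writes;	/* pages written to the device */
};

static void toy_bd_stats_init(struct toy_bd_stats *s)
{
	atomic_init(&s->bd_count, 0);
	atomic_init(&s->bd_reads, 0);
	atomic_init(&s->bd_writes, 0);
}

/* Writing a page allocates a block and counts one write. */
static void toy_writeback_page(struct toy_bd_stats *s)
{
	atomic_fetch_add(&s->bd_count, 1);
	atomic_fetch_add(&s->bd_writes, 1);
}

static void toy_read_page(struct toy_bd_stats *s)
{
	atomic_fetch_add(&s->bd_reads, 1);
}

/* Freeing a block only lowers bd_count; reads and writes never decrease. */
static void toy_free_block(struct toy_bd_stats *s)
{
	assert(atomic_load(&s->bd_count) > 0);
	atomic_fetch_sub(&s->bd_count, 1);
}

/* Same "%8llu %8llu %8llu\n" layout as bd_stat_show() in the patch. */
static int toy_bd_stat_show(struct toy_bd_stats *s, char *buf, size_t len)
{
	return snprintf(buf, len, "%8" PRIu64 " %8" PRIu64 " %8" PRIu64 "\n",
			(uint64_t)atomic_load(&s->bd_count),
			(uint64_t)atomic_load(&s->bd_reads),
			(uint64_t)atomic_load(&s->bd_writes));
}
```

The point of the sketch is that bd_count tracks current occupancy while
bd_reads and bd_writes are monotonic totals, which is what makes them
usable for estimating wearout.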

[PATCH 6/6] zram: writeback throttle

2018-11-15 Thread Minchan Kim

On small-memory systems, there is a lot of write IO, so if we use
a flash device as swap, there would be serious flash wearout.
To overcome the problem, system developers need to design a write
limitation strategy to guarantee flash health for the entire product life.

This patch creates a new knob "writeback_limit" on zram. With that,
if the current writeback IO count (/sys/block/zramX/io_stat) exceeds
the limit, zram stops further writeback until the admin resets
the limit.

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |  9 
 Documentation/blockdev/zram.txt|  2 +
 drivers/block/zram/zram_drv.c  | 55 --
 drivers/block/zram/zram_drv.h  |  2 +
 4 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index a4daca7e5043..210f2cdac752 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -121,3 +121,12 @@ Contact:   Minchan Kim 
The bd_stat file is read-only and represents backing device's
statistics (bd_count, bd_reads, bd_writes.) in a format
similar to block layer statistics file format.
+
+What:  /sys/block/zram/writeback_limit
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The writeback_limit file is read-write and specifies the maximum
+   amount of writeback ZRAM can do. The limit could be changed
+   in run time and "0" means disable the limit.
+   No limit is the initial state.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 1f4907307a0d..39ee416bf552 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -164,6 +164,8 @@ reset WOtrigger device reset
 mem_used_max  WOreset the `mem_used_max' counter (see later)
 mem_limit WOspecifies the maximum amount of memory ZRAM can use
 to store the compressed data
+writeback_limit  WOspecifies the maximum amount of write IO zram can
+   write out to backing device
 max_comp_streams  RWthe number of possible concurrent compress operations
 comp_algorithmRWshow and change the compression algorithm
 compact   WOtrigger memory compaction
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 17d566d9a321..b263febaed10 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -317,6 +317,40 @@ static ssize_t idle_store(struct device *dev,
 }
 
 #ifdef CONFIG_ZRAM_WRITEBACK
+
+static ssize_t writeback_limit_store(struct device *dev,
+   struct device_attribute *attr, const char *buf, size_t len)
+{
+   struct zram *zram = dev_to_zram(dev);
+   u64 val;
+   ssize_t ret = -EINVAL;
+
+   if (kstrtoull(buf, 10, &val))
+   return ret;
+
+   down_read(&zram->init_lock);
+   atomic64_set(&zram->stats.bd_wb_limit, val);
+   if (val == 0 || val > atomic64_read(&zram->stats.bd_writes))
+   zram->stop_writeback = false;
+   up_read(&zram->init_lock);
+   ret = len;
+
+   return ret;
+}
+
+static ssize_t writeback_limit_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   u64 val;
+   struct zram *zram = dev_to_zram(dev);
+
+   down_read(&zram->init_lock);
+   val = atomic64_read(&zram->stats.bd_wb_limit);
+   up_read(&zram->init_lock);
+
+   return scnprintf(buf, PAGE_SIZE, "%llu\n", val);
+}
+
 static void reset_bdev(struct zram *zram)
 {
struct block_device *bdev;
@@ -575,6 +609,7 @@ static ssize_t writeback_store(struct device *dev,
ssize_t ret;
unsigned long mode;
unsigned long blk_idx = 0;
+   u64 wb_count, wb_limit;
 
 #define HUGE_WRITEBACK 0x1
 #define IDLE_WRITEBACK 0x2
@@ -610,6 +645,11 @@ static ssize_t writeback_store(struct device *dev,
bvec.bv_len = PAGE_SIZE;
bvec.bv_offset = 0;
 
+   if (zram->stop_writeback) {
+   ret = -EIO;
+   break;
+   }
+
if (!blk_idx) {
blk_idx = alloc_block_bdev(zram);
if (!blk_idx) {
@@ -664,7 +704,7 @@ static ssize_t writeback_store(struct device *dev,
continue;
}
 
-   atomic64_inc(&zram->stats.bd_writes);
+   wb_count = atomic64_inc_return(&zram->stats.bd_writes);
/*
 * We released zram_slot_lock so need to check if the slot was
 * changed. If there is freeing for the slot, we can catch it
@@ -687,6 +727,9 @@ static ssize_t writeback_store(struct device *dev,
zram_set_element(zram, index, blk
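
The throttle logic can be sketched in user space as follows. The limit
handling in toy_set_limit() mirrors writeback_limit_store() above; the
point at which stop_writeback gets set is cut off in this archive copy,
so the "count >= limit" check in toy_writeback_one() is an assumption,
and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the zram fields used by the throttle. */
struct toy_zram {
	uint64_t bd_writes;
	uint64_t wb_limit;	/* 0 means no limit */
	bool stop_writeback;
};

/* Mirrors the store handler: 0 or a limit above bd_writes resumes IO. */
static void toy_set_limit(struct toy_zram *z, uint64_t val)
{
	z->wb_limit = val;
	if (val == 0 || val > z->bd_writes)
		z->stop_writeback = false;
}

/* One writeback attempt; returns 0 on success or -EIO when throttled. */
static int toy_writeback_one(struct toy_zram *z)
{
	uint64_t count;

	if (z->stop_writeback)
		return -EIO;

	count = ++z->bd_writes;
	/* Assumed trigger: stop once the write count reaches the limit. */
	if (z->wb_limit != 0 && count >= z->wb_limit)
		z->stop_writeback = true;
	return 0;
}
```

With a limit of 2, the first two attempts succeed and the third returns
-EIO until the limit is raised or reset to 0.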

[PATCH 4/6] zram: support idle page writeback

2018-11-15 Thread Minchan Kim

This patch adds a new feature, "zram idle page writeback".
In the zram-swap use case, zram usually holds idle swap pages that come
from many processes. It's pointless to keep them in memory (i.e., zram).

To solve the problem, this feature writes idle pages back to the
backing device, so the goal is to save more memory space
on embedded systems.

Normal sequence to use the feature is as follows,

while (1) {
# mark allocated zram slot to idle
echo 1 > /sys/block/zram0/idle
sleep several hours
# idle zram slots are still IDLE marked.
echo 3 > /sys/block/zram0/writeback
# write the IDLE marked slot into backing device and free
# the memory.
}

echo 'val' > /sys/block/zramX/writeback

val is a combination of bits.

0th bit: hugepage writeback
1st bit: idlepage writeback

Thus,
1 -> hugepage writeback
2 -> idlepage writeback
3 -> writeback both pages

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |   7 +
 Documentation/blockdev/zram.txt|  19 +++
 drivers/block/zram/Kconfig |   5 +-
 drivers/block/zram/zram_drv.c  | 166 +++--
 drivers/block/zram/zram_drv.h  |   1 +
 5 files changed, 187 insertions(+), 11 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index 04c9a5980bc7..d1f80b077885 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -106,3 +106,10 @@ Contact:   Minchan Kim 
idle file is write-only and mark zram slot as idle.
If system has mounted debugfs, user can see which slots
are idle via /sys/kernel/debug/zram/zram/block_state
+
+What:  /sys/block/zram/writeback
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The writeback file is write-only and trigger idle and/or
+   huge page writeback to backing device.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index f3bcd716d8a9..60b585dab6e0 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -244,6 +244,25 @@ to backing storage rather than keeping it in memory.
 User should set up backing device via /sys/block/zramX/backing_dev
 before disksize setting.
 
+User can writeback idle pages to backing device. To use the feature,
+first, user need to mark zram slots allocated currently as idle.
+Afterward, slots not accessed since then will have still idle mark.
+Then, if user does,
+   "echo val > /sys/block/zramX/writeback"
+
+  val is combination of bits.
+
+  0th bit: hugepage writeback
+  1th bit: idlepage writeback
+
+  Thus,
+  1 -> hugepage writeback
+  2 -> idlepage writeabck
+  3 -> writeback both pages
+
+zram will writeback the idle/huge pages to backing device and free the
+memory space pages occupied so save memory.
+
 = memory tracking
 
 With CONFIG_ZRAM_MEMORY_TRACKING, user can know information of the
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index fcd055457364..1ffc64770643 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -15,7 +15,7 @@ config ZRAM
  See Documentation/blockdev/zram.txt for more information.
 
 config ZRAM_WRITEBACK
-   bool "Write back incompressible page to backing device"
+   bool "Write back incompressible or idle page to backing device"
depends on ZRAM
help
 With incompressible page, there is no memory saving to keep it
@@ -23,6 +23,9 @@ config ZRAM_WRITEBACK
 For this feature, admin should set up backing device via
 /sys/block/zramX/backing_dev.
 
+With /sys/block/zramX/{idle,writeback}, application could ask
+idle page's writeback to the backing device to save in memory.
+
 See Documentation/blockdev/zram.txt for more information.
 
 config ZRAM_MEMORY_TRACKING
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f956179076ce..b7b5c9e5f0cd 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -52,6 +52,9 @@ static unsigned int num_devices = 1;
 static size_t huge_class_size;
 
 static void zram_free_page(struct zram *zram, size_t index);
+static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
+   u32 index, int offset, struct bio *bio);
+
 
 static int zram_slot_trylock(struct zram *zram, u32 index)
 {
@@ -73,13 +76,6 @@ static inline bool init_done(struct zram *zram)
return zram->disksize;
 }
 
-static inline bool zram_allocated(struct zram *zram, u32 index)
-{
-
-   return (zram->table[index].flags >> (ZRAM_FLAG_SHIFT + 1)) ||
-   zram->table[index].handle;
-}
-
 static inline struct zram *dev_to_zram(struct device *dev)
 {
return (struct zram *)dev_to_disk(dev)->private_data;
@@ -138,6 +134,13 @@ static void zram_se
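
The bit encoding of the value written to /sys/block/zramX/writeback can
be sketched as below. The mask names TOY_HUGE_WRITEBACK and
TOY_IDLE_WRITEBACK follow the 0x1 and 0x2 values described in the commit
message; the per-slot decision is an assumption based on "3 -> writeback
both pages", since the driver code is truncated in this archive copy.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_HUGE_WRITEBACK 0x1UL	/* 0th bit: hugepage writeback */
#define TOY_IDLE_WRITEBACK 0x2UL	/* 1st bit: idlepage writeback */

/* Hypothetical check: only the two documented bits may be set. */
static bool toy_mode_valid(unsigned long mode)
{
	return mode != 0 &&
	       (mode & ~(TOY_HUGE_WRITEBACK | TOY_IDLE_WRITEBACK)) == 0;
}

/*
 * Assumed per-slot decision: a slot is written back if its kind
 * (huge or idle) is selected by the mode bits.
 */
static bool toy_should_writeback(unsigned long mode, bool slot_huge,
				 bool slot_idle)
{
	assert(toy_mode_valid(mode));
	if ((mode & TOY_HUGE_WRITEBACK) && slot_huge)
		return true;
	if ((mode & TOY_IDLE_WRITEBACK) && slot_idle)
		return true;
	return false;
}
```

Under this reading, mode 3 selects a slot that is either huge or idle,
and a slot that is neither is never written back.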

[PATCH 1/6] zram: fix lockdep warning of free block handling

2018-11-15 Thread Minchan Kim

[  254.519728] 
[  254.520311] WARNING: inconsistent lock state
[  254.520898] 4.19.0+ #390 Not tainted
[  254.521387] 
[  254.521732] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[  254.521732] zram_verify/2095 [HC0[0]:SC1[1]:HE1:SE0] takes:
[  254.521732] b1828693 (&(&zram->bitmap_lock)->rlock){+.?.}, at: put_entry_bdev+0x1e/0x50
[  254.521732] {SOFTIRQ-ON-W} state was registered at:
[  254.521732]   _raw_spin_lock+0x2c/0x40
[  254.521732]   zram_make_request+0x755/0xdc9
[  254.521732]   generic_make_request+0x373/0x6a0
[  254.521732]   submit_bio+0x6c/0x140
[  254.521732]   __swap_writepage+0x3a8/0x480
[  254.521732]   shrink_page_list+0x1102/0x1a60
[  254.521732]   shrink_inactive_list+0x21b/0x3f0
[  254.521732]   shrink_node_memcg.constprop.99+0x4f8/0x7e0
[  254.521732]   shrink_node+0x7d/0x2f0
[  254.521732]   do_try_to_free_pages+0xe0/0x300
[  254.521732]   try_to_free_pages+0x116/0x2b0
[  254.521732]   __alloc_pages_slowpath+0x3f4/0xf80
[  254.521732]   __alloc_pages_nodemask+0x2a2/0x2f0
[  254.521732]   __handle_mm_fault+0x42e/0xb50
[  254.521732]   handle_mm_fault+0x55/0xb0
[  254.521732]   __do_page_fault+0x235/0x4b0
[  254.521732]   page_fault+0x1e/0x30
[  254.521732] irq event stamp: 228412
[  254.521732] hardirqs last  enabled at (228412): [] __slab_free+0x3e6/0x600
[  254.521732] hardirqs last disabled at (228411): [] __slab_free+0x1c5/0x600
[  254.521732] softirqs last  enabled at (228396): [] __do_softirq+0x31e/0x427
[  254.521732] softirqs last disabled at (228403): [] irq_exit+0xd1/0xe0
[  254.521732]
[  254.521732] other info that might help us debug this:
[  254.521732]  Possible unsafe locking scenario:
[  254.521732]
[  254.521732]CPU0
[  254.521732]
[  254.521732]   lock(&(&zram->bitmap_lock)->rlock);
[  254.521732]   
[  254.521732] lock(&(&zram->bitmap_lock)->rlock);
[  254.521732]
[  254.521732]  *** DEADLOCK ***
[  254.521732]
[  254.521732] no locks held by zram_verify/2095.
[  254.521732]
[  254.521732] stack backtrace:
[  254.521732] CPU: 5 PID: 2095 Comm: zram_verify Not tainted 4.19.0+ #390
[  254.521732] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[  254.521732] Call Trace:
[  254.521732]  
[  254.521732]  dump_stack+0x67/0x9b
[  254.521732]  print_usage_bug+0x1bd/0x1d3
[  254.521732]  mark_lock+0x4aa/0x540
[  254.521732]  ? check_usage_backwards+0x160/0x160
[  254.521732]  __lock_acquire+0x51d/0x1300
[  254.521732]  ? free_debug_processing+0x24e/0x400
[  254.521732]  ? bio_endio+0x6d/0x1a0
[  254.521732]  ? lockdep_hardirqs_on+0x9b/0x180
[  254.521732]  ? lock_acquire+0x90/0x180
[  254.521732]  lock_acquire+0x90/0x180
[  254.521732]  ? put_entry_bdev+0x1e/0x50
[  254.521732]  _raw_spin_lock+0x2c/0x40
[  254.521732]  ? put_entry_bdev+0x1e/0x50
[  254.521732]  put_entry_bdev+0x1e/0x50
[  254.521732]  zram_free_page+0xf6/0x110
[  254.521732]  zram_slot_free_notify+0x42/0xa0
[  254.521732]  end_swap_bio_read+0x5b/0x170
[  254.521732]  blk_update_request+0x8f/0x340
[  254.521732]  scsi_end_request+0x2c/0x1e0
[  254.521732]  scsi_io_completion+0x98/0x650
[  254.521732]  blk_done_softirq+0x9e/0xd0
[  254.521732]  __do_softirq+0xcc/0x427
[  254.521732]  irq_exit+0xd1/0xe0
[  254.521732]  do_IRQ+0x93/0x120
[  254.521732]  common_interrupt+0xf/0xf
[  254.521732]  

With the writeback feature, zram_slot_free_notify can be called
in softirq context by end_swap_bio_read. However, bitmap_lock
is not taken in a softirq-safe way, so lockdep complains.

The problem is not only bitmap_lock but also zram_slot_lock,
so a straightforward solution would be to disable irqs while holding
zram_slot_lock, which covers every bitmap_lock use, too.
Although irqs would be disabled only briefly in most places where
zram_slot_lock is used, one place (i.e., decompression) is not fast
enough to run with irqs disabled, depending on the compression
algorithm, so that is not an option.

The approach in this patch is just "best effort"; it does not guarantee
that an orphan zpage is freed. If zram_slot_lock is contended, the
kernel cannot free the zpage until it recycles the block. However,
such contention between zram_slot_free_notify and other holders of
zram_slot_lock should be very rare in practice.
To see how often it happens, this patch adds a new debug stat,
"miss_free".

It also adds irq locking in get/put_block_bdev to prevent the deadlock
lockdep reported. The reason I disable irqs rather than bottom halves
is that swap_slot_free_notify could be called with irqs disabled,
which breaks local_bh_enable's rule. The irq lock is taken only for
written-back zram slot entries, so it should not be taken frequently.
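
The "best effort" idea above can be sketched in userspace C roughly as
follows. This is only an illustration under stated assumptions: struct
demo_slot and the demo_* helpers are hypothetical, a C11 atomic_flag
stands in for the per-slot bit spinlock, and only the counter name
miss_free is taken from the description above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for one zram table slot. */
struct demo_slot {
	atomic_flag lock;      /* models the per-slot bit spinlock */
	unsigned long handle;  /* non-zero while the slot owns a zpage */
};

/* Models the new debug counter; only the name comes from the patch text. */
static atomic_ulong miss_free;

static bool demo_slot_trylock(struct demo_slot *s)
{
	/* test_and_set returns the old value: false means we took the lock. */
	return !atomic_flag_test_and_set_explicit(&s->lock,
						  memory_order_acquire);
}

static void demo_slot_unlock(struct demo_slot *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/*
 * Best-effort free for a context that must not spin on the slot lock.
 * Returns true if the slot was freed, false if it was skipped.
 */
static bool demo_slot_free_notify(struct demo_slot *s)
{
	assert(s != NULL);
	if (!demo_slot_trylock(s)) {
		/* Contended: give up and only record the miss. */
		atomic_fetch_add(&miss_free, 1);
		return false;
	}
	s->handle = 0;
	demo_slot_unlock(s);
	return true;
}
```

A skipped slot keeps its handle, which matches the statement above that
the zpage stays around until the block is recycled.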

Cc: sta...@vger.kernel.org # 4.14+
Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 56 +--
 drivers/block/zram/zram_drv.h |  1 +
 2 files changed, 42 insertions(+), 15 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 4879595200e1..472027eaed60

[PATCH 3/6] zram: introduce ZRAM_IDLE flag

2018-11-15 Thread Minchan Kim

To support idle page writeback with upcoming patches, this patch
introduces a new ZRAM_IDLE flag.

Userspace can mark zram slots as "idle" via
"echo 1 > /sys/block/zramX/idle"
which marks every allocated zram slot as ZRAM_IDLE.
Users can see the result in /sys/kernel/debug/zram/zram0/block_state.

  300    75.033841 ...i
  301    63.806904 s..i
  302    63.806919 ..hi

Once there is I/O to a slot, its idle mark disappears.

  300    75.033841 ...
  301    63.806904 s..i
  302    63.806919 ..hi

Therefore, the 300th block is no longer an idle zpage, while the other
two still are. With this feature, the user can see how many idle pages
zram holds, which are a waste of memory.
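
The mark-and-clear behaviour described above can be sketched in plain C
like this. It is a simplified model, not the driver code: DEMO_IDLE,
struct demo_entry and the demo_* helpers are hypothetical, and "allocated"
is reduced to a non-zero handle, whereas the driver's zram_allocated()
also looks at the flag bits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bit; the real ZRAM_IDLE value lives in zram_drv.h. */
#define DEMO_IDLE (1UL << 0)

struct demo_entry {
	unsigned long handle;  /* non-zero when the slot is allocated */
	unsigned long flags;
};

/* Models "echo 1 > /sys/block/zramX/idle": mark every allocated slot. */
static void demo_mark_idle(struct demo_entry *table, size_t nr)
{
	assert(table != NULL);
	for (size_t i = 0; i < nr; i++)
		if (table[i].handle)
			table[i].flags |= DEMO_IDLE;
}

/* Models zram_accessed(): any I/O to the slot clears the idle mark. */
static void demo_accessed(struct demo_entry *table, size_t index)
{
	assert(table != NULL);
	table[index].flags &= ~DEMO_IDLE;
}

/* Count the slots that are still marked idle. */
static size_t demo_count_idle(const struct demo_entry *table, size_t nr)
{
	size_t n = 0;

	assert(table != NULL);
	for (size_t i = 0; i < nr; i++)
		if (table[i].flags & DEMO_IDLE)
			n++;
	return n;
}
```

Marking, then touching one slot, leaves every other allocated slot
counted as idle, which is the number an admin would be interested in.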

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |  8 
 Documentation/blockdev/zram.txt| 10 +++--
 drivers/block/zram/zram_drv.c  | 44 --
 drivers/block/zram/zram_drv.h  |  1 +
 4 files changed, 56 insertions(+), 7 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index c1513c756af1..04c9a5980bc7 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -98,3 +98,11 @@ Contact: Minchan Kim 
The backing_dev file is read-write and set up backing
device for zram to write incompressible pages.
For using, user should enable CONFIG_ZRAM_WRITEBACK.
+
+What:  /sys/block/zram/idle
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   idle file is write-only and mark zram slot as idle.
+   If system has mounted debugfs, user can see which slots
+   are idle via /sys/kernel/debug/zram/zram/block_state
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 3c1b5ab54bc0..f3bcd716d8a9 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -169,6 +169,7 @@ comp_algorithm    RW    show and change the compression algorithm
 compact           WO    trigger memory compaction
 debug_stat        RO    this file is used for zram debugging purposes
 backing_dev       RW    set up backend storage for zram to write out
+idle              WO    mark allocated slot as idle
 
 
 User space is advised to use the following files to read the device statistics.
@@ -251,16 +252,17 @@ pages of the process with*pagemap.
 If you enable the feature, you could see block state via
 /sys/kernel/debug/zram/zram0/block_state". The output is as follows,
 
-  300    75.033841 .wh
-  301    63.806904 s..
-  302    63.806919 ..h
+  300    75.033841 .wh.
+  301    63.806904 s...
+  302    63.806919 ..hi
 
 First column is zram's block index.
 Second column is access time since the system was booted
 Third column is state of the block.
 (s: same page
 w: written page to backing store
-h: huge page)
+h: huge page
+i: idle page)
 
 First line of above example says 300th block is accessed at 75.033841sec
 and the block's state is huge so it is written back to the backing
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index bc59db2b1036..f956179076ce 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -281,6 +281,34 @@ static ssize_t mem_used_max_store(struct device *dev,
return len;
 }
 
+static ssize_t idle_store(struct device *dev,
+   struct device_attribute *attr, const char *buf, size_t len)
+{
+   struct zram *zram = dev_to_zram(dev);
+   unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
+   int index;
+
+   down_read(&zram->init_lock);
+   if (!init_done(zram)) {
+   up_read(&zram->init_lock);
+   return -EINVAL;
+   }
+
+   for (index = 0; index < nr_pages; index++) {
+   zram_slot_lock(zram, index);
+   if (!zram_allocated(zram, index))
+   goto next;
+
+   zram_set_flag(zram, index, ZRAM_IDLE);
+next:
+   zram_slot_unlock(zram, index);
+   }
+
+   up_read(&zram->init_lock);
+
+   return len;
+}
+
 #ifdef CONFIG_ZRAM_WRITEBACK
 static void reset_bdev(struct zram *zram)
 {
@@ -658,6 +686,7 @@ static void zram_debugfs_destroy(void)
 
 static void zram_accessed(struct zram *zram, u32 index)
 {
+   zram_clear_flag(zram, index, ZRAM_IDLE);
zram->table[index].ac_time = ktime_get_boottime();
 }
 
@@ -690,12 +719,13 @@ static ssize_t read_block_state(struct file *file, char __user *buf,
 
ts = ktime_to_timespec64(zram->table[index].ac_time);
copied = snprintf(kbuf + written, count,
-   "%12zd %12lld.%06lu %c%c%c\n",
+   "%12zd %12lld.%06lu %c%c%c%c\n",
index, (s64)ts.tv_sec,
ts.tv_nsec / NSEC_PER_USEC,
zram_te

[PATCH 2/6] zram: refactoring flags and writeback stuff

2018-11-15 Thread Minchan Kim

This patch renames some variables and restructures some code
for better readability in writeback and zs_free_page.

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 105 +-
 drivers/block/zram/zram_drv.h |   8 +--
 2 files changed, 44 insertions(+), 69 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 472027eaed60..bc59db2b1036 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -55,17 +55,17 @@ static void zram_free_page(struct zram *zram, size_t index);
 
 static int zram_slot_trylock(struct zram *zram, u32 index)
 {
-   return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].value);
+   return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static void zram_slot_lock(struct zram *zram, u32 index)
 {
-   bit_spin_lock(ZRAM_LOCK, &zram->table[index].value);
+   bit_spin_lock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static void zram_slot_unlock(struct zram *zram, u32 index)
 {
-   bit_spin_unlock(ZRAM_LOCK, &zram->table[index].value);
+   bit_spin_unlock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static inline bool init_done(struct zram *zram)
@@ -76,7 +76,7 @@ static inline bool init_done(struct zram *zram)
 static inline bool zram_allocated(struct zram *zram, u32 index)
 {
 
-   return (zram->table[index].value >> (ZRAM_FLAG_SHIFT + 1)) ||
+   return (zram->table[index].flags >> (ZRAM_FLAG_SHIFT + 1)) ||
zram->table[index].handle;
 }
 
@@ -99,19 +99,19 @@ static void zram_set_handle(struct zram *zram, u32 index, unsigned long handle)
 static bool zram_test_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
 {
-   return zram->table[index].value & BIT(flag);
+   return zram->table[index].flags & BIT(flag);
 }
 
 static void zram_set_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
 {
-   zram->table[index].value |= BIT(flag);
+   zram->table[index].flags |= BIT(flag);
 }
 
 static void zram_clear_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
 {
-   zram->table[index].value &= ~BIT(flag);
+   zram->table[index].flags &= ~BIT(flag);
 }
 
 static inline void zram_set_element(struct zram *zram, u32 index,
@@ -127,15 +127,15 @@ static unsigned long zram_get_element(struct zram *zram, u32 index)
 
 static size_t zram_get_obj_size(struct zram *zram, u32 index)
 {
-   return zram->table[index].value & (BIT(ZRAM_FLAG_SHIFT) - 1);
+   return zram->table[index].flags & (BIT(ZRAM_FLAG_SHIFT) - 1);
 }
 
 static void zram_set_obj_size(struct zram *zram,
u32 index, size_t size)
 {
-   unsigned long flags = zram->table[index].value >> ZRAM_FLAG_SHIFT;
+   unsigned long flags = zram->table[index].flags >> ZRAM_FLAG_SHIFT;
 
-   zram->table[index].value = (flags << ZRAM_FLAG_SHIFT) | size;
+   zram->table[index].flags = (flags << ZRAM_FLAG_SHIFT) | size;
 }
 
 #if PAGE_SIZE != 4096
@@ -282,16 +282,11 @@ static ssize_t mem_used_max_store(struct device *dev,
 }
 
 #ifdef CONFIG_ZRAM_WRITEBACK
-static bool zram_wb_enabled(struct zram *zram)
-{
-   return zram->backing_dev;
-}
-
 static void reset_bdev(struct zram *zram)
 {
struct block_device *bdev;
 
-   if (!zram_wb_enabled(zram))
+   if (!zram->backing_dev)
return;
 
bdev = zram->bdev;
@@ -318,7 +313,7 @@ static ssize_t backing_dev_show(struct device *dev,
ssize_t ret;
 
down_read(&zram->init_lock);
-   if (!zram_wb_enabled(zram)) {
+   if (!zram->backing_dev) {
memcpy(buf, "none\n", 5);
up_read(&zram->init_lock);
return 5;
@@ -446,7 +441,7 @@ static ssize_t backing_dev_store(struct device *dev,
return err;
 }
 
-static unsigned long get_entry_bdev(struct zram *zram)
+static unsigned long alloc_block_bdev(struct zram *zram)
 {
unsigned long blk_idx;
unsigned long ret = 0;
@@ -479,13 +474,13 @@ static unsigned long get_entry_bdev(struct zram *zram)
return ret;
 }
 
-static void put_entry_bdev(struct zram *zram, unsigned long entry)
+static void free_block_bdev(struct zram *zram, unsigned long blk_idx)
 {
int was_set;
unsigned long flags;
 
spin_lock_irqsave(&zram->bitmap_lock, flags);
-   was_set = test_and_clear_bit(entry, zram->bitmap);
+   was_set = test_and_clear_bit(blk_idx, zram->bitmap);
spin_unlock_irqrestore(&zram->bitmap_lock, flags);
WARN_ON_ONCE(!was_set);
 }
@@ -599,7 +594,7 @@ static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
if (!bio)
return -ENOMEM;
 
-   entry = get_entry_bdev(zram);
+   entry = alloc_block_bdev(zram);
if (!entry) {
bio_put(bio);
return -ENOSPC;
@@ -

[PATCH 0/6] zram idle page writeback

2018-11-15 Thread Minchan Kim

Inherently, a swap device has many idle pages which are rarely touched
after they are allocated. That is never a problem if we use a storage
device as swap. However, it is just a waste for zram-swap.

This patchset adds an idle page writeback feature to zram.

* Admin can define what an idle page is: "no access since X time ago"
* Admin can define when zram should write them back
* Admin can define when zram should stop writeback to prevent wearout

Details are in each patch's description.

Minchan Kim (6):
  zram: fix lockdep warning of free block handling
  zram: refactoring flags and writeback stuff
  zram: introduce ZRAM_IDLE flag
  zram: support idle page writeback
  zram: add bd_stat statistics
  zram: writeback throttle

 Documentation/ABI/testing/sysfs-block-zram |  32 ++
 Documentation/blockdev/zram.txt|  42 +-
 drivers/block/zram/Kconfig |   5 +-
 drivers/block/zram/zram_drv.c  | 435 +
 drivers/block/zram/zram_drv.h  |  18 +-
 5 files changed, 438 insertions(+), 94 deletions(-)

-- 
2.19.1.1215.g8438c0b245-goog

Re: [alsa-devel] [PATCH v3 1/5] ALSA: soc-compress: add support to snd_compr_set_runtime_buffer()

2018-11-15 Thread Daniel Baluta

Hi Srinivas,

One minor comment:



> struct snd_compr_ops *ops;
> +   struct snd_dma_buffer *dma_buffer_p;

I don't think it is necessary to encode the type in the variable name.
So dma_buffer would sound better to me than dma_buffer_p.

> void *buffer;

It is also consistent with this ^


> +static inline void snd_compr_set_runtime_buffer(
> +   struct snd_compr_stream *substream,
> +   struct snd_dma_buffer *bufp)

Also buf instead of bufp here.

thanks,
Daniel.

Re: [PATCH 4/8] kbuild: simplify dependency generation for CONFIG_TRIM_UNUSED_KSYMS

2018-11-15 Thread Masahiro Yamada

On Fri, Nov 16, 2018 at 2:13 PM Nicolas Pitre  wrote:
>
> On Thu, 15 Nov 2018, Masahiro Yamada wrote:
>
> > My main motivation of this commit is to clean up scripts/Kbuild.include
> > and scripts/Makefile.build.
> >
> > Currently, CONFIG_TRIM_UNUSED_KSYMS works with a tricky gimmick;
> > possibly exported symbols are detected by letting $(CPP) replace
> > EXPORT_SYMBOL* with a special string '=== __KSYM_*===', which is
> > post-processed by sed, and passed to fixdep. The extra preprocessing
> > is costly, and hacking cmd_and_fixdep is ugly.
> >
> > I came up with a new way to find exported symbols; insert a dummy
> > symbol __ksym_marker_* to each potentially exported symbol. Those
> > dummy symbols are picked up by $(NM), post-processed by sed, then
> > appended to .*.cmd files. I collected the post-process part to a
> > new shell script scripts/gen_ksymdeps.sh for readability. The dummy
> > symbols are put into the .discard.* section so that the linker
> > script rips them off the final vmlinux or modules.
>
> Brilliant!  I really like it.
>
> Minor comments below.
>
> > diff --git a/include/asm-generic/export.h b/include/asm-generic/export.h
> > index 4d73e6e..294d6ae 100644
> > --- a/include/asm-generic/export.h
> > +++ b/include/asm-generic/export.h
> > @@ -59,16 +59,19 @@ __kcrctab_\name:
> >  .endm
> >  #undef __put
> >
> > -#if defined(__KSYM_DEPS__)
> > -
> > -#define __EXPORT_SYMBOL(sym, val, sec)   === __KSYM_##sym ===
> > -
> > -#elif defined(CONFIG_TRIM_UNUSED_KSYMS)
> > +#if defined(CONFIG_TRIM_UNUSED_KSYMS)
> >
> >  #include 
> >  #include 
> >
> > +.macro __ksym_marker sym
> > + .section ".discard.ksym","a"
> > +__ksym_marker_\sym:
> > +  .previous
>
> Does this work as intended? I have vague memories about having problems
> with sections being discarded when they don't allocate any space.

What I can tell is, this patch produces the same size kernel
(after dropping debug info by 'strip' command).


> > +.endm
> > +
> >  #define __EXPORT_SYMBOL(sym, val, sec)   \
> > + __ksym_marker sym;  \
> >   __cond_export_sym(sym, val, sec, __is_defined(__KSYM_##sym))
> >  #define __cond_export_sym(sym, val, sec, conf)   \
> >   ___cond_export_sym(sym, val, sec, conf)
> > diff --git a/include/linux/export.h b/include/linux/export.h
> > index ce764a5..0413a3d 100644
> > --- a/include/linux/export.h
> > +++ b/include/linux/export.h
> > @@ -92,22 +92,22 @@ struct kernel_symbol {
> >   */
> >  #define __EXPORT_SYMBOL(sym, sec)
> >
> > -#elif defined(__KSYM_DEPS__)
> > +#elif defined(CONFIG_TRIM_UNUSED_KSYMS)
> > +
> > +#include 
> >
> >  /*
> >   * For fine grained build dependencies, we want to tell the build system
> >   * about each possible exported symbol even if they're not actually exported.
> > - * We use a string pattern that is unlikely to be valid code that the build
> > - * system filters out from the preprocessor output (see ksym_dep_filter
> > - * in scripts/Kbuild.include).
> > + * We use a symbol pattern __ksym_marker_ that the build system filters
> > + * from the $(NM) output (see scripts/gen_ksymdep.sh). These symbols are
> > + * discarded in the final link stage.
> >   */
> > -#define __EXPORT_SYMBOL(sym, sec)=== __KSYM_##sym ===
> > -
> > -#elif defined(CONFIG_TRIM_UNUSED_KSYMS)
> > -
> > -#include 
> > +#define __ksym_marker(sym)   \
> > + static int __ksym_marker_##sym[0] __section(".discard.ksym") __used
>
> Even if this is discarded during the final link, maybe this could save
> a tiny amount of disk space by using a char instead?


I am afraid you missed the '[0]' after the symbol name.
This is actually a zero-length array.

No memory is allocated for this dummy section.

As far as I tested, this is working.
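
For illustration only, the zero-size claim can be checked with a small
standalone program along these lines. DEMO_KSYM_MARKER and
demo_marker_size are hypothetical names; the zero-length array is a GNU C
extension (GCC and Clang accept it), and the kernel's section attribute
is left out here so that the sketch does not depend on an ELF target.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical marker in the style discussed above: a zero-length array
 * gives the symbol a name and an address but no storage.
 */
#define DEMO_KSYM_MARKER(sym) \
	static int demo_ksym_marker_##sym[0] __attribute__((used))

DEMO_KSYM_MARKER(foo);

/* The size is a compile-time constant, so it can be checked statically. */
static_assert(sizeof(demo_ksym_marker_foo) == 0,
	      "a zero-length array occupies no storage");

/* Storage occupied by the marker object, in bytes. */
static size_t demo_marker_size(void)
{
	return sizeof(demo_ksym_marker_foo);
}
```

Switching the element type from int to char would not change the result,
since the object size is the element size multiplied by zero.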





> > diff --git a/scripts/Makefile.build b/scripts/Makefile.build
> > index 7f3ca6e..e5ba9b1 100644
> > --- a/scripts/Makefile.build
> > +++ b/scripts/Makefile.build
> > @@ -254,9 +254,18 @@ objtool_dep = $(objtool_obj)   
> >   \
> > $(wildcard include/config/orc/unwinder.h  \
> >include/config/stack/validation.h)
> >
> > +ifdef CONFIG_TRIM_UNUSED_KSYMS
> > +cmd_gen_ksymdeps = \
> > + $(CONFIG_SHELL) $(srctree)/scripts/gen_ksymdeps.sh $@ > $(dot-target).tmp; \
> > + cat $(dot-target).tmp >> $(dot-target).cmd; \
> > + rm -f $(dot-target).tmp;
>
> Why don't you append to $(dot-target).cmd directly?


If scripts/gen_ksymdeps.sh fails for some reason,
the build will error out immediately thanks to the 'set -e' flag.

Appending an incomplete portion might end up with a corrupted .*.cmd file.

That would probably not happen, but I just wanted to be sure.




>
> Nicolas



--
Best Regards
Masahiro Yamada

Re: [PATCH V3 1/3] mmc: sdhci: Allow platform controlled voltage switching

2018-11-15 Thread Adrian Hunter

On 16/11/18 1:17 AM, Evan Green wrote:
> On Wed, Nov 14, 2018 at 6:36 AM Veerabhadrarao Badiganti
>  wrote:
>>
> diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
> index b001cf4..3c28152 100644
> --- a/drivers/mmc/host/sdhci.h
> +++ b/drivers/mmc/host/sdhci.h
> @@ -524,6 +524,7 @@ struct sdhci_host {
>  bool pending_reset; /* Cmd/data reset is pending */
>  bool irq_wake_enabled;  /* IRQ wakeup is enabled */
>  bool v4_mode;   /* Host Version 4 Enable */
> +   bool vqmmc_enabled; /* Vqmmc is enabled */
 I still don't love this, since it doesn't mean what it says. Everyone
 else that has a vqmmc_enabled member uses it to actually mean "vqmmc
 is enabled", but this doesn't mean that. For example, you don't clear
 this when you disable the regulator in patch 3, so this would be set
 even if the regulator is disabled, and you don't set it when sdhci
 enables the regulator, so the regulator is on when this flag is not
 set.

>> Hi Evan
>>
>> This flag is meant to say "disable vqmmc *only* if it is enabled by host
>> driver (sdhci_host)".
>> If host driver doesn't enable vqmmc (enabled by platfrm driver) or if it
>> fails to enable it, then don't call disable vqmmc.
>>
>> Agree with you, the present name is not conveying its purpose.
>> It must be something like "vqmmc_enabled_by_host".
>>
>> Please let me know if you have any suggestions on this name.
> 
> Yeah. Maybe vqmmc_pltfrm_controlled? Or vqmmc_enabled_by_platfrm as
> you suggested?

"pltfrm" doesn't mean anything here.  Just change the comment "vqmmc enabled
in sdhci.c"

Re: [PATCH tip/core/rcu 6/7] mm: Replace spin_is_locked() with lockdep

2018-11-15 Thread Paul E. McKenney

On Thu, Nov 15, 2018 at 10:49:17AM -0800, Davidlohr Bueso wrote:
> On Sun, 11 Nov 2018, Paul E. McKenney wrote:
> 
> >From: Lance Roy 
> >
> >lockdep_assert_held() is better suited to checking locking requirements,
> >since it only checks if the current thread holds the lock regardless of
> >whether someone else does. This is also a step towards possibly removing
> >spin_is_locked().
> 
> So fyi I'm not crazy about these kind of patches simply because lockdep
> is a lot less used out of anything that's not a lab, and we can be missing
> potential offenders. There's obviously nothing wrong about what you describe
> above perse, just my two cents.

Fair point!

One countervailing advantage of lockdep is that it is not subject to the
false negatives that can happen if someone else happens to be currently
holding the lock.  But what would you suggest instead?
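
To make the false-negative point concrete, here is a toy single-threaded
model. It is only a sketch: struct demo_lock and the demo_* helpers are
made up for illustration and are not kernel APIs; they merely mimic the
difference between an "is anyone holding it" check and an "am I holding
it" check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy lock that records its owner; not a kernel API. */
struct demo_lock {
	bool locked;
	int owner;  /* task id of the holder, valid only while locked */
};

static void demo_lock_acquire(struct demo_lock *l, int task)
{
	assert(!l->locked);  /* single-threaded model: no contention */
	l->locked = true;
	l->owner = task;
}

/* spin_is_locked()-style check: true if anyone holds the lock. */
static bool demo_is_locked(const struct demo_lock *l)
{
	return l->locked;
}

/* lockdep_assert_held()-style check: true only if this task holds it. */
static bool demo_held_by(const struct demo_lock *l, int task)
{
	return l->locked && l->owner == task;
}
```

If task 2 holds the lock and task 1 forgot to take it, demo_is_locked()
still reports true for task 1, so the bug goes unnoticed, whereas
demo_held_by(l, 1) reports false.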

Thanx, Paul

RE: [PATCH v3 1/2] mtd: spi-nor: add macros related to MICRON flash

2018-11-15 Thread Yogesh Narayan Gaur

Hi Boris,

Please apply this patch series [1] in the coming release.

--
Regards
Yogesh Gaur
[1] https://patchwork.ozlabs.org/project/linux-mtd/list/?series=70384


> -Original Message-
> From: Yogesh Narayan Gaur
> Sent: Tuesday, October 23, 2018 3:31 PM
> To: 'Boris Brezillon' 
> Cc: Mark Brown ; Tudor Ambarus
> ; linux-...@lists.infradead.org; linux-
> s...@vger.kernel.org; marek.va...@gmail.com; cyrille.pitc...@wedev4u.fr;
> computersforpe...@gmail.com; frieder.schre...@exceet.de; linux-
> ker...@vger.kernel.org
> Subject: RE: [PATCH v3 1/2] mtd: spi-nor: add macros related to MICRON flash
> 
> Hi Boris,
> 
> > -Original Message-
> > From: Boris Brezillon [mailto:boris.brezil...@bootlin.com]
> > Sent: Tuesday, October 23, 2018 3:27 PM
> > To: Yogesh Narayan Gaur 
> > Cc: Mark Brown ; Tudor Ambarus
> > ; linux-...@lists.infradead.org; linux-
> > s...@vger.kernel.org; marek.va...@gmail.com;
> > cyrille.pitc...@wedev4u.fr; computersforpe...@gmail.com;
> > frieder.schre...@exceet.de; linux- ker...@vger.kernel.org
> > Subject: Re: [PATCH v3 1/2] mtd: spi-nor: add macros related to MICRON
> > flash
> >
> > Hi Yogesh,
> >
> > On Tue, 23 Oct 2018 09:39:25 +
> > Yogesh Narayan Gaur  wrote:
> >
> > > Hi,
> > >
> > > Did we have have any comments or remarks about this patch-series,
> > > if not
> > please apply.
> >
> > Sorry, but it was already too late for this release, and the merge
> > window just started, so it will have to wait at least 2 more weeks.
> Ok.
> 
> >
> > We've been lagging with SPI NOR patches for the last couple releases
> > because I clearly don't have time to review those contributions, and
> > it seems Marek does not have time either.
> >
> > >
> > > Both patches in the series been reviewed by Tudor.
> >
> > Things are improving a bit thanks to Tudor's involvement in the review
> > process, but I'd like to remember you that you, as a regular
> > contributor to the spi-nor subsystem, can help us with that too. That
> > is, help review patches coming from others instead of only focusing on your
> own contributions.
> >
> Sure, I would start doing the review of other contributor patches.
> 
> --
> Regards
> Yogesh Gaur.
> 
> > Regards,
> >
> > Boris

Re: [resend PATCH 1/3] pwm: mediatek: drop flag 'has_clks'

2018-11-15 Thread Uwe Kleine-König

On Wed, Nov 14, 2018 at 01:47:52PM +0100, Thierry Reding wrote:
> On Tue, Nov 13, 2018 at 10:08:22AM +0800, Ryder Lee wrote:
> > The flag 'has_clks' and related checks are superfluous as the CCF
> > subsystem does this for you.
> 
> Both of these mechanisms aren't equivalent. While CCF can deal with
> optional clocks, what the has_clks flag actually means is that the
> device doesn't need a clock (or doesn't have a clock input) on the
> devices where it is cleared.
> 
> So I'd actually be in favor of keeping the has_clks property because it
> serves as an additional sanity check. For example if you run this driver
> on an SoC that "has clocks" but if you don't list them in DT, then after
> this patch the driver will happily continue without clocks, even though
> it may break completely without those clocks. I've seen SoCs respond to
> disabled clocks for a hardware block in different ways, in many cases an
> access to any of the registers will completely hang the CPU. In other
> cases it may just crash in some other way or give you some sort of
> machine exception. None of those are good, and make the tiny bit of
> additional code required to support the has_clks flag very attractive.
> 
> But that's just my opinion. If you prefer to throw away that safety
> barrier, be my guest. But if you do, please move this functionality into
> the clock framework first and then make the driver use it.

The usual policy is: If the things specified in the dt are
wrong or incomplete, it's ok to fail however you like. So from a
correctness POV I think the change is fine.

I don't know about the mips details that John pointed out in a followup
to this mail though.

Best regards
Uwe

-- 
Pengutronix e.K.   | Uwe Kleine-König|
Industrial Linux Solutions | http://www.pengutronix.de/  |

Re: [PATCH v2 4/4] ARM: dts: meson: consistently disable pin bias

2018-11-15 Thread Martin Blumenstingl

On Fri, Nov 9, 2018 at 3:05 PM Jerome Brunet  wrote:
>
> On Amlogic chipsets, the bias set through pinconf applies to the pad
> itself, not only the GPIO function. This means that even when we change
> the function of the pad from GPIO to anything else, the bias previously
> set still applies.
>
> As we have seen with the eMMC, depending on the bias type and the function,
> it may trigger problems.
>
> The underlying issue is that we inherit whatever was left by previous user
> of the pad (pinconf, u-boot or the ROM code). As a consequence, the actual
> setup we will get is undefined.
>
> There is nothing mentioned in the documentation about pad bias and pinmux
> function, however leaving it undefined is not an option.
>
> This change consistently disable the pad bias for every pinmux functions.
> It seems to work well, we can only assume that the necessary bias (if any)
> is already provided by the pin function itself.
>
> Signed-off-by: Jerome Brunet 
Acked-by: Martin Blumenstingl

my Odroid-C1 still boots fine from SD card and Ethernet (ping) also still works

Kevin, can you please move this patch from your v4.21/dt64 branch to
the v4.21/dt (32-bit) branch?
all other patches from this series are for the 64-bit SoCs, so only
this single patch has to be moved


Thank you!
Regards
Martin

Re: [PATCH 0/3] tools/memory-model: Add SRCU support

2018-11-15 Thread Paul E. McKenney

On Thu, Nov 15, 2018 at 11:19:24AM -0500, Alan Stern wrote:
> Paul and other LKMM maintainers:
> 
> The following series of patches adds support for SRCU to the Linux
> Kernel Memory Model.  That is, it adds the srcu_read_lock(),
> srcu_read_unlock(), and synchronize_srcu() primitives to the model.
> 
>   Patch 1/3 does some renaming of the RCU parts of the
>   memory model's existing CAT code, to help distinguish them
>   from the upcoming SRCU parts.
> 
>   Patch 2/3 refactors the definitions of some RCU relations
>   in the CAT code, in a way that the SRCU portions will need.
> 
>   Patch 3/3 actually adds the SRCU support.
> 
> This new code requires herd7 version 7.51+4(dev) or later (now 
> available in the herdtools7 github repository) to run.  Thanks to Luc 
> for making the necessary changes to support SRCU.

These patches pass the tests that I have constructed, and also regression
tests, very nice!  Applied and pushed, thank you.

> The code does not check that the index argument passed to 
> srcu_read_unlock() is the same as the value returned by the 
> corresponding srcu_read_lock() call.  This is deemed to be a semantic 
> issue, not directly relevant to the memory model.

Agreed.

If I understand correctly, there are in theory some use cases that these
patches do not support, for example:

r1 = srcu_read_lock(a);
do_1();
r2 = srcu_read_lock(a);
do_2();
srcu_read_unlock(a, r1);
do_3();
srcu_read_unlock(a, r2);

In practice, I would be more worried about this had I ever managed to
find a non-bogus use case for this pattern.  ;-)

Thanx, Paul

[PATCH v2 1/2] build_bug.h: remove negative-array fallback for BUILD_BUG_ON()

2018-11-15 Thread Masahiro Yamada

The kernel can only be compiled with an optimization option (-O2, -Os,
or the currently proposed -Og). Hence, __OPTIMIZE__ is always defined
in the kernel source.

The fallback for -O0 case is just hypothetical and pointless. Moreover,
commit 0bb95f80a38f ("Makefile: Globally enable VLA warning") enabled
-Wvla warning. The use of variable length arrays is banned.
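
For reference, a standalone sketch of the two styles of compile-time
check. The DEMO_* names are hypothetical: the first macro copies the
negative-array form removed by this patch, while the second uses C11
static_assert as a stand-in, which is not how BUILD_BUG_ON_MSG() is
implemented in the kernel.

```c
#include <assert.h>

/* Negative-array form, as in the fallback removed by this patch. */
#define DEMO_BUILD_BUG_ON_ARRAY(cond) \
	((void)sizeof(char[1 - 2 * !!(cond)]))

/* C11 stand-in for an attribute-based check; an assumption, see above. */
#define DEMO_BUILD_BUG_ON_STATIC(cond) \
	static_assert(!(cond), "DEMO_BUILD_BUG_ON failed: " #cond)

static int demo_compile_time_checks(void)
{
	/* Both compile because the condition is false. */
	DEMO_BUILD_BUG_ON_ARRAY(sizeof(char) != 1);
	DEMO_BUILD_BUG_ON_STATIC(sizeof(char) != 1);

	/* A true condition in either macro would stop the build instead. */
	return 0;
}
```

With a constant condition both forms are diagnosed at compile time; with
a non-constant one, the array form turns into a variable length array,
which is what -Wvla flags.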

Signed-off-by: Masahiro Yamada 
---

Changes in v2: None

 include/linux/build_bug.h | 14 --
 1 file changed, 14 deletions(-)

diff --git a/include/linux/build_bug.h b/include/linux/build_bug.h
index 43d1fd5..d415c64 100644
--- a/include/linux/build_bug.h
+++ b/include/linux/build_bug.h
@@ -51,23 +51,9 @@
  * If you have some code which relies on certain constants being equal, or
  * some other compile-time-evaluated condition, you should use BUILD_BUG_ON to
  * detect if someone changes it.
- *
- * The implementation uses gcc's reluctance to create a negative array, but gcc
- * (as of 4.4) only emits that error for obvious cases (e.g. not arguments to
- * inline functions).  Luckily, in 4.3 they added the "error" function
- * attribute just for this type of case.  Thus, we use a negative sized array
- * (should always create an error on gcc versions older than 4.4) and then call
- * an undefined function with the error attribute (should always create an
- * error on gcc 4.3 and later).  If for some reason, neither creates a
- * compile-time error, we'll still have a link-time error, which is harder to
- * track down.
  */
-#ifndef __OPTIMIZE__
-#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
-#else
 #define BUILD_BUG_ON(condition) \
BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
-#endif
 
 /**
  * BUILD_BUG - break compile if used.
-- 
2.7.4

Re: [PATCH 0/7] ACPI HMAT memory sysfs representation

2018-11-15 Thread Anshuman Khandual

On 11/15/2018 04:19 AM, Keith Busch wrote:
> This series provides a new sysfs representation for heterogeneous
> system memory.
> 
> The previous series that was specific to HMAT that this series was based
> on was last posted here: https://lkml.org/lkml/2017/12/13/968
> 
> Platforms may provide multiple types of cpu attached system memory. The
> memory ranges for each type may have different characteristics that
> applications may wish to know about when considering what node they want
> their memory allocated from. 
> 
> It had previously been difficult to describe these setups as memory
> rangers were generally lumped into the NUMA node of the CPUs. New
> platform attributes have been created and in use today that describe
> the more complex memory hierarchies that can be created.
> 
> This series first creates new generic APIs under the kernel's node
> representation. These new APIs can be used to create links among local
> memory and compute nodes and export characteristics about the memory
> nodes. Documentation desribing the new representation are provided.
> 
> Finally the series adds a kernel user for these new APIs from parsing
> the ACPI HMAT.

I am not able to see the patches from this series either on the list or
in the archive (https://lkml.org/lkml/2018/11/15/331). IIRC, the last time
we discussed this, the concern I raised was that, in the absence of a
broader NUMA rework for multi-attribute memory, it might not be a good idea
to settle on and freeze a sysfs interface for user space.

[PATCH v2 2/2] build_bug.h: remove all dummy BUILD_BUG_ON stubs for sparse

2018-11-15 Thread Masahiro Yamada

The introduction of these dummy BUILD_BUG_ON stubs dates back to
commit 903c0c7cdc21 ("sparse: define dummy BUILD_BUG_ON definition
for sparse"). At that time, BUILD_BUG_ON() was implemented with the
negative array trick, which Sparse complains about even if the
condition can be optimized and evaluated to 0 at compile-time.

With the previous commit, the leftover negative array trick is gone.
Sparse is happy with the current BUILD_BUG_ON(), which is implemented
by using the 'error' attribute.

There might be a little room for argument about BUILD_BUG_ON_ZERO().
Sparse reports 'invalid bitfield width, -1' for a non-zero value,
and 'bad integer constant expression' for a non-constant value.
These are the same criteria GCC uses. So, if those Sparse errors
occurred, they would cause errors for GCC as well. (Hence, such
errors would have been detected by the normal compile test process.)

Signed-off-by: Masahiro Yamada 
---

Changes in v2:
 - Fix a coding style error (two consecutive blank lines)

 include/linux/build_bug.h | 12 
 1 file changed, 12 deletions(-)

diff --git a/include/linux/build_bug.h b/include/linux/build_bug.h
index d415c64..6625c88 100644
--- a/include/linux/build_bug.h
+++ b/include/linux/build_bug.h
@@ -4,16 +4,6 @@
 
 #include 
 
-#ifdef __CHECKER__
-#define __BUILD_BUG_ON_NOT_POWER_OF_2(n) (0)
-#define BUILD_BUG_ON_NOT_POWER_OF_2(n) (0)
-#define BUILD_BUG_ON_ZERO(e) (0)
-#define BUILD_BUG_ON_INVALID(e) (0)
-#define BUILD_BUG_ON_MSG(cond, msg) (0)
-#define BUILD_BUG_ON(condition) (0)
-#define BUILD_BUG() (0)
-#else /* __CHECKER__ */
-
 /* Force a compilation error if a constant expression is not a power of 2 */
 #define __BUILD_BUG_ON_NOT_POWER_OF_2(n)   \
BUILD_BUG_ON(((n) & ((n) - 1)) != 0)
@@ -64,6 +54,4 @@
  */
 #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
 
-#endif /* __CHECKER__ */
-
 #endif /* _LINUX_BUILD_BUG_H */
-- 
2.7.4
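
For readers unfamiliar with the bitfield-width trick that the commit message refers to, here is a minimal sketch in GNU C. The DEMO_ names and the table size are made up for illustration; only the shape of the macro follows what include/linux/build_bug.h used at the time, and a struct containing nothing but an unnamed bitfield relies on a GNU extension.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative copy of the bitfield-width trick (GNU C). The struct holds
 * only an unnamed bitfield whose width is 0 when e is false and -1 when e
 * is true. A width of -1 is rejected at compile time, and a non-constant
 * e is rejected because a bitfield width must be a constant expression.
 */
#define DEMO_BUILD_BUG_ON_ZERO(e) (sizeof(struct { int:(-!!(e)); }))

/* Hypothetical constant that is required to be a power of two. */
enum { DEMO_TABLE_SIZE = 16 };

/*
 * Evaluates to 0 at run time. The function only compiles while
 * DEMO_TABLE_SIZE stays a power of two.
 */
static size_t demo_table_size_check(void)
{
	return DEMO_BUILD_BUG_ON_ZERO(DEMO_TABLE_SIZE & (DEMO_TABLE_SIZE - 1));
}
```

Changing DEMO_TABLE_SIZE to 12 should make GCC stop with a negative-width error, which is the same class of failure that Sparse reports as 'invalid bitfield width, -1'.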

[PATCH 1/2] build_bug.h: remove negative-array fallback for BUILD_BUG_ON()

2018-11-15 Thread Masahiro Yamada

The kernel can only be compiled with an optimization option (-O2, -Os,
or the currently proposed -Og). Hence, __OPTIMIZE__ is always defined
in the kernel source.

A fallback for the -O0 case is purely hypothetical and pointless. Moreover,
commit 0bb95f80a38f ("Makefile: Globally enable VLA warning") enabled the
-Wvla warning. The use of variable length arrays is banned.

Signed-off-by: Masahiro Yamada 
---

 include/linux/build_bug.h | 14 --
 1 file changed, 14 deletions(-)

diff --git a/include/linux/build_bug.h b/include/linux/build_bug.h
index 43d1fd5..d415c64 100644
--- a/include/linux/build_bug.h
+++ b/include/linux/build_bug.h
@@ -51,23 +51,9 @@
  * If you have some code which relies on certain constants being equal, or
  * some other compile-time-evaluated condition, you should use BUILD_BUG_ON to
  * detect if someone changes it.
- *
- * The implementation uses gcc's reluctance to create a negative array, but gcc
- * (as of 4.4) only emits that error for obvious cases (e.g. not arguments to
- * inline functions).  Luckily, in 4.3 they added the "error" function
- * attribute just for this type of case.  Thus, we use a negative sized array
- * (should always create an error on gcc versions older than 4.4) and then call
- * an undefined function with the error attribute (should always create an
- * error on gcc 4.3 and later).  If for some reason, neither creates a
- * compile-time error, we'll still have a link-time error, which is harder to
- * track down.
  */
-#ifndef __OPTIMIZE__
-#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
-#else
 #define BUILD_BUG_ON(condition) \
BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
-#endif
 
 /**
  * BUILD_BUG - break compile if used.
-- 
2.7.4
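
As a rough illustration of the fallback this patch removes, the sketch below reproduces the negative-array trick in user space. The DEMO_ macro and the struct are hypothetical; the macro body is copied from the removed line.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Same shape as the removed fallback: the array type is char[1] when the
 * condition is false and char[-1] (a compile error) when it is true.
 * If the condition is not an integer constant expression, the array
 * becomes a variable length array, which is what -Wvla complains about.
 */
#define DEMO_BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))

/* Hypothetical on-disk header used only for this example. */
struct demo_header {
	unsigned int magic;
	unsigned int length;
};

/* Compiles only while the header is exactly two unsigned ints wide. */
static size_t demo_header_size(void)
{
	DEMO_BUILD_BUG_ON(sizeof(struct demo_header) != 2 * sizeof(unsigned int));
	return sizeof(struct demo_header);
}

/* Size of the array type produced for a false condition. */
static size_t demo_array_size_when_false(void)
{
	return sizeof(char[1 - 2*!!(0)]);
}
```

Adding a third member to struct demo_header should turn the check into a "size of array is negative" compile error.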

[PATCH 2/2] build_bug.h: remove all dummy BUILD_BUG_ON stubs for sparse

2018-11-15 Thread Masahiro Yamada

The introduction of these dummy BUILD_BUG_ON stubs dates back to
commit 903c0c7cdc21 ("sparse: define dummy BUILD_BUG_ON definition
for sparse"). At that time, BUILD_BUG_ON() was implemented with the
negative array trick, which Sparse complains about even if the
condition can be optimized and evaluated to 0 at compile-time.

With the previous commit, the leftover negative array trick is gone.
Sparse is happy with the current BUILD_BUG_ON(), which is implemented
by using the 'error' attribute.

There might be a little room for argument about BUILD_BUG_ON_ZERO().
Sparse reports 'invalid bitfield width, -1' for a non-zero value,
and 'bad integer constant expression' for a non-constant value.
These are the same criteria GCC uses. So, if those Sparse errors
occurred, they would cause errors for GCC as well. (Hence, such
errors would have been detected by the normal compile test process.)

Signed-off-by: Masahiro Yamada 
---

 include/linux/build_bug.h | 11 ---
 1 file changed, 11 deletions(-)

diff --git a/include/linux/build_bug.h b/include/linux/build_bug.h
index d415c64..b0828f7 100644
--- a/include/linux/build_bug.h
+++ b/include/linux/build_bug.h
@@ -4,16 +4,6 @@
 
 #include 
 
-#ifdef __CHECKER__
-#define __BUILD_BUG_ON_NOT_POWER_OF_2(n) (0)
-#define BUILD_BUG_ON_NOT_POWER_OF_2(n) (0)
-#define BUILD_BUG_ON_ZERO(e) (0)
-#define BUILD_BUG_ON_INVALID(e) (0)
-#define BUILD_BUG_ON_MSG(cond, msg) (0)
-#define BUILD_BUG_ON(condition) (0)
-#define BUILD_BUG() (0)
-#else /* __CHECKER__ */
-
 /* Force a compilation error if a constant expression is not a power of 2 */
 #define __BUILD_BUG_ON_NOT_POWER_OF_2(n)   \
BUILD_BUG_ON(((n) & ((n) - 1)) != 0)
@@ -64,6 +54,5 @@
  */
 #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
 
-#endif /* __CHECKER__ */
 
 #endif /* _LINUX_BUILD_BUG_H */
-- 
2.7.4

[PATCH] initramfs: clean old path before creating a hardlink

2018-11-15 Thread Li Zhijian

Previously, sys_link() would fail because the new path already existed.
This case often appears when we use a concatenated initrd. Below is a
sample:

1) prepare a basic rootfs that contains a regular file, rc.local
lizhijian@:~/yocto-tiny-i386-2016-04-22$ cat etc/rc.local
 #!/bin/sh
 echo "Running /etc/rc.local..."
yocto-tiny-i386-2016-04-22$ find . | sed 's,^\./,,' | cpio -o -H newc | gzip -n -9 >../rootfs.cgz

2) create an extra initrd which also includes an etc/rc.local
lizhijian@:~/lkp-x86_64/etc$ echo "append initrd" >rc.local
lizhijian@:~/lkp/lkp-x86_64/etc$ cat rc.local
append initrd
lizhijian@:~/lkp/lkp-x86_64/etc$ ln rc.local rc.local.hardlink
append initrd
lizhijian@:~/lkp/lkp-x86_64/etc$ stat rc.local rc.local.hardlink
  File: 'rc.local'
  Size: 14  Blocks: 8  IO Block: 4096   regular file
Device: 801h/2049d  Inode: 11296086Links: 2
Access: (0664/-rw-rw-r--)  Uid: ( 1002/lizhijian)   Gid: ( 1002/lizhijian)
Access: 2018-11-15 16:08:28.654464815 +0800
Modify: 2018-11-15 16:07:57.514903210 +0800
Change: 2018-11-15 16:08:24.180228872 +0800
 Birth: -
  File: 'rc.local.hardlink'
  Size: 14  Blocks: 8  IO Block: 4096   regular file
Device: 801h/2049d  Inode: 11296086Links: 2
Access: (0664/-rw-rw-r--)  Uid: ( 1002/lizhijian)   Gid: ( 1002/lizhijian)
Access: 2018-11-15 16:08:28.654464815 +0800
Modify: 2018-11-15 16:07:57.514903210 +0800
Change: 2018-11-15 16:08:24.180228872 +0800
 Birth: -

lizhijian@:~/lkp/lkp-x86_64$ find . | sed 's,^\./,,' | cpio -o -H newc | gzip -n -9 >../rc-local.cgz
lizhijian@:~/lkp/lkp-x86_64$ gzip -dc ../rc-local.cgz | cpio -t
.
etc
etc/rc.local.hardlink <<< it will be extracted first at this initrd
etc/rc.local

3) concatenate the 2 initrds and boot
lizhijian@:~/lkp$ cat rootfs.cgz rc-local.cgz >concate-initrd.cgz
lizhijian@:~/lkp$ qemu-system-x86_64 -nographic -enable-kvm -cpu host -smp 1 -m 1024 -kernel ~/lkp/linux/arch/x86/boot/bzImage -append "console=ttyS0 earlyprint=ttyS0 ignore_loglevel" -initrd ./concate-initrd.cgz -serial stdio -nodefaults

In this case, sys_link(2) will fail and return -EEXIST, so we only get
the rc.local from rootfs.cgz instead of the one from rc-local.cgz.

CC: Philip Li 
Signed-off-by: Li Zhijian 
---
 init/initramfs.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/init/initramfs.c b/init/initramfs.c
index 6405577..381fd4c 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -291,12 +291,16 @@ static int __init do_reset(void)
return 1;
 }
 
+static void __init clean_path(char *path, umode_t fmode);
+
 static int __init maybe_link(void)
 {
if (nlink >= 2) {
char *old = find_link(major, minor, ino, mode, collected);
-   if (old)
+   if (old) {
+   clean_path(collected, 0);
return (ksys_link(old, collected) < 0) ? -1 : 1;
+   }
}
return 0;
 }
-- 
2.7.4
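
The failure mode can be reproduced in user space. The sketch below is only an analogy to the patch, not the initramfs code: link(2) refuses to overwrite an existing path, so the existing path is removed before linking. All demo_ helpers and file names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create (or truncate) a file holding `text`. */
static int demo_write_file(const char *path, const char *text)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs(text, f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}

/*
 * Remove whatever already sits at new_path (a missing path is fine),
 * then create the hard link. This loosely mirrors calling clean_path()
 * before ksys_link() in the patch.
 */
static int demo_link_replace(const char *old, const char *new_path)
{
	if (unlink(new_path) != 0 && errno != ENOENT)
		return -1;
	return link(old, new_path);
}

/* Returns 1 if both paths refer to the same inode, 0 otherwise. */
static int demo_same_file(const char *a, const char *b)
{
	struct stat sa, sb;

	if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
		return 0;
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

A plain link() onto an existing path returns -1 with errno set to EEXIST, which corresponds to the -EEXIST the commit message describes. After demo_link_replace() the two names share one inode.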

linux-next: Tree for Nov 16

2018-11-15 Thread Stephen Rothwell

Hi all,

Changes since 20181115:

The xtensa tree gained a conflict against Linus' tree.

The block tree gained a conflict against Linus' tree.

The tip tree still had its build failure for which I applied a fix patch.

Non-merge commits (relative to Linus' tree): 3059
 3132 files changed, 125660 insertions(+), 103426 deletions(-)



I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
(patches at http://www.kernel.org/pub/linux/kernel/next/ ).  If you
are tracking the linux-next tree using git, you should not use "git pull"
to do so as that will try to merge the new linux-next release with the
old one.  You should use "git fetch" and checkout or reset to the new
master.

You can see which trees have been included by looking in the Next/Trees
file in the source.  There are also quilt-import.log and merge.log
files in the Next directory.  Between each merge, the tree was built
with a ppc64_defconfig for powerpc, an allmodconfig for x86_64, a
multi_v7_defconfig for arm and a native build of tools/perf. After
the final fixups (if any), I do an x86_64 modules_install followed by
builds for x86_64 allnoconfig, powerpc allnoconfig (32 and 64 bit),
ppc44x_defconfig, allyesconfig and pseries_le_defconfig and i386, sparc
and sparc64 defconfig. And finally, a simple boot test of the powerpc
pseries_le_defconfig kernel in qemu (with and without kvm enabled).

Below is a summary of the state of the merge.

I am currently merging 298 trees (counting Linus' and 68 trees of bug
fix patches pending for the current merge release).

Stats about the size of the tree over time can be seen at
http://neuling.org/linux-next-size.html .

Status of my local build tests will be at
http://kisskb.ellerman.id.au/linux-next .  If maintainers want to give
advice about cross compilers/configs that work, we are always open to add
more builds.

Thanks to Randy Dunlap for doing many randconfig builds.  And to Paul
Gortmaker for triage and bug fixes.

-- 
Cheers,
Stephen Rothwell

$ git checkout master
$ git reset --hard stable
Merging origin/master (da5322e65940 Merge tag 'selinux-pr-20181115' of 
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux)
Merging fixes/master (7c6c54b505b8 Merge branch 'i2c/for-next' of 
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux)
Merging kbuild-current/fixes (ccda4af0f4b9 Linux 4.20-rc2)
Merging arc-current/for-curr (121e38e5acdc ARC: mm: fix uninitialised signal 
code in do_page_fault)
Merging arm-current/fixes (e46daee53bb5 ARM: 8806/1: kprobes: Fix false 
positive with FORTIFY_SOURCE)
Merging arm64-fixes/for-next/fixes (24cc61d8cb5a arm64: memblock: don't permit 
memblock resizing until linear mapping is up)
Merging m68k-current/for-linus (58c116fb7dc6 m68k/sun3: Remove is_medusa and 
m68k_pgtable_cachemode)
Merging powerpc-fixes/fixes (b2fed34a628d selftests/powerpc: Adjust wild_bctr 
to build with old binutils)
Merging sparc/master (1f2b5b8e2df4 sparc64: Wire up compat getpeername and 
getsockname.)
Merging fscrypt-current/for-stable (ae64f9bd1d36 Linux 4.15-rc2)
Merging net/master (08e14fe429a0 net_sched: sch_fq: ensure maxrate fq parameter 
applies to EDT flows)
Merging bpf/master (da85d8bfd151 kselftests/bpf: use ping6 as the default ipv6 
ping binary when it exists)
Merging ipsec/master (ca92e173ab34 xfrm: Fix bucket count reported to userspace)
Merging netfilter/master (29e3880109e3 netfilter: nf_tables: fix use-after-free 
when deleting compat expressions)
Merging ipvs/master (feb9f55c33e5 netfilter: nft_dynset: allow dynamic updates 
of non-anonymous set)
Merging wireless-drivers/master (b374e8686fc3 mt76: fix building without 
CONFIG_LEDS_CLASS)
Merging mac80211/master (113f3aaa81bd cfg80211: Prevent regulatory restore 
during STA disconnect in concurrent interfaces)
Merging rdma-fixes/for-rc (99b77fef3c6c net/mlx5: Fix XRC SRQ umem valid bits)
Merging sound-current/for-linus (d99501b8575d ALSA: hda/ca0132 - Call 
pci_iounmap() instead of iounmap())
Merging sound-asoc-fixes/for-linus (0e6f4dd73a24 Merge branch 'asoc-4.20' into 
asoc-linus)
Merging regmap-fixes/for-linus (ccda4af0f4b9 Linux 4.20-rc2)
Merging regulator-fixes/for-linus (97f11d95dd75 Merge branch 'regulator-4.20' 
into regulator-linus)
Merging spi-fixes/for-linus (7c9a8d31fb2a Merge branch 'spi-4.20' into 
spi-linus)
Merging pci-current/for-linus (1a87119b7bcf Revert "ACPI/PCI: Pay attention to 
device-specific _PXM node values")
Merging driver-core.current/driver-core-linus (a66d972465d1 devres: Align 
data[] to ARCH_KMALLOC_MINALIGN)
Merging tty.current/tty-linus (ccda4af0f4b9 Linux 4.20-rc2)
Merging usb.current/usb-linus (2f31a67f01a8 usb: xhci: Prevent bus suspend if a 
port connect change or polling state is detected)
Merging usb-gadget-fixes/fixes (2fc6d4be35fb usb: d

RE: [PATCH v5 3/9] spi: Add a driver for the Freescale/NXP QuadSPI controller

2018-11-15 Thread Yogesh Narayan Gaur

Hi Frieder,

> -Original Message-
> From: Schrempf Frieder [mailto:frieder.schre...@kontron.de]
> Sent: Thursday, November 15, 2018 7:32 PM
> To: Yogesh Narayan Gaur 
> Cc: Boris Brezillon ; 
> linux-...@lists.infradead.org;
> linux-...@vger.kernel.org; Marek Vasut ; Mark
> Brown ; Han Xu ;
> dw...@infradead.org; computersforpe...@gmail.com; rich...@nod.at;
> miquel.ray...@bootlin.com; David Wolfe ; Fabio
> Estevam ; Prabhakar Kushwaha
> ; shawn...@kernel.org; linux-
> ker...@vger.kernel.org
> Subject: Re: [PATCH v5 3/9] spi: Add a driver for the Freescale/NXP QuadSPI
> controller
> 
> Hi Yogesh,
> 
> On 15.11.18 14:12, Boris Brezillon wrote:
> > On Thu, 15 Nov 2018 11:43:05 +
> > Schrempf Frieder  wrote:
> >
> >> On 15.11.18 07:22, Yogesh Narayan Gaur wrote:
> >>> Hi Frieder,
> >>>
> >>> With below patch on top of your v5, Read/Write/Erase on CS1 is working fine for me.
> >>
> >> Ok, are you sure, that AHB read is working too with this patch?
> >> You are removing the memmap_phy offset from SFAR and the SFXXAD
> >> register values.
> >>
> >> I can understand that selection of the CS and IP commands will work
> >> like this, but I can't understand how AHB read should work without
> >> the base address of the mapped memory.
> >>
> >> I'm afraid I still don't fully understand the background of these
> >> things,
> >
> > Same here. Yogesh, can you give us more detail on why you decided to
> > drop the memmap_phy offset?
> 
> Your changes do not work on my setup (i.MX6UL). It looks like your hardware is
> different.
> 
> I found this patch for LS2080A: [1]. This would explain why you need to remove
> the offset to make it work.
> 
> To verify this, could you please test your setup with the current spi-nor driver
> (fsl_quadspi.c). If our assumptions are right, it should only work on CS0 and CS1
> with [1] applied.
> 

Yes, I need to remove the offset to make it work, and this is required for the
NXP Layerscape-2.x SoCs such as LS208x/LS108x.

I have modified the patch and introduced a quirk entry for ls2080a. With
this, Read/Write/Erase work for me on both chip selects.

diff --git a/drivers/spi/spi-fsl-qspi.c b/drivers/spi/spi-fsl-qspi.c
index ce45e8e..5d26f73 100644
--- a/drivers/spi/spi-fsl-qspi.c
+++ b/drivers/spi/spi-fsl-qspi.c
@@ -175,6 +175,9 @@
 /* TKT245618, the controller cannot wake up from wait mode */
 #define QUADSPI_QUIRK_TKT245618        BIT(3)

+/* QSPI_AMBA_BASE is internally added by SOC design for LS-2.x architecture */
+#define QUADSPI_AMBA_BASE_INTERNAL BIT(4)
+
 struct fsl_qspi_devtype_data {
unsigned int rxfifo;
unsigned int txfifo;
@@ -227,7 +230,7 @@ static const struct fsl_qspi_devtype_data ls2080a_data = {
.rxfifo = SZ_128,
.txfifo = SZ_64,
.ahb_buf_size = SZ_1K,
-   .quirks = QUADSPI_QUIRK_TKT253890,
+   .quirks = QUADSPI_QUIRK_TKT253890 | QUADSPI_AMBA_BASE_INTERNAL,
.little_endian = true,
 };

@@ -235,6 +238,7 @@ struct fsl_qspi {
void __iomem *iobase;
void __iomem *ahb_addr;
u32 memmap_phy;
+   u32 amba_base_addr;
struct clk *clk, *clk_en;
struct device *dev;
struct completion c;
@@ -264,6 +268,11 @@ static inline int needs_wakeup_wait_mode(struct fsl_qspi 
*q)
return q->devtype_data->quirks & QUADSPI_QUIRK_TKT245618;
 }

+static inline int has_added_amba_base_internal(struct fsl_qspi *q)
+{
+   return q->devtype_data->quirks & QUADSPI_AMBA_BASE_INTERNAL;
+}
+
 /*
  * An IC bug makes it necessary to rearrange the 32-bit data.
  * Later chips, such as IMX6SLX, have fixed this bug.
@@ -489,29 +498,11 @@ static void fsl_qspi_invalidate(struct fsl_qspi *q)
 static void fsl_qspi_select_mem(struct fsl_qspi *q, struct spi_device *spi)
 {
unsigned long rate = spi->max_speed_hz;
-   int ret, i;
-   u32 map_addr;
+   int ret;

if (q->selected == spi->chip_select)
return;

-   /*
-* In HW there can be a maximum of four chips on two buses with
-* two chip selects on each bus. We use four chip selects in SW
-* to differentiate between the four chips.
-* We use the SFA1AD, SFA2AD, SFB1AD, SFB2AD registers to select
-* the chip we want to access.
-*/
-   for (i = 0; i < 4; i++) {
-   if (i < spi->chip_select)
-   map_addr = q->memmap_phy;
-   else
-   map_addr = q->memmap_phy +
-  2 * q->devtype_data->ahb_buf_size;
-
-   qspi_writel(q, map_addr, q->iobase + QUADSPI_SFA1AD + (i * 4));
-   }
-
if (needs_4x_clock(q))
rate *= 4;

@@ -534,7 +525,9 @@ static void fsl_qspi_select_mem(struct fsl_qspi *q, struct 
spi_device *spi)

 static void fsl_qspi_read_ahb(struct fsl_qspi *q, const struct spi_mem_op *op)
 {
-   memcpy_fromio(op->data.buf.in, q->ahb_addr, op->data.nbytes);
+   memcpy_fromio(op->data.buf.i

[PATCH] x86/cpu/AMD: Fix CPB bit for more processors

2018-11-15 Thread Jiaxun Yang

CPUID Fn8000_0007_EDX[CPB] is wrongly 0 on Model 17,
Stepping 0, but the revision guide has not been released for
newer Family 17h models.

Tested on AMD "Ryzen 7 2700U with Radeon Vega Mobile Gfx"
and "AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx".
Their CPUID Fn_0001_EAX is 0x00810f10 and they should have the
CPB feature according to AMD product specifications. However,
their Fn8000_0007_EDX is 0x6599, indicating they don't
support the CPB feature.

Signed-off-by: Jiaxun Yang 
---
 arch/x86/kernel/cpu/amd.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index eeea634bee0a..7db43ef8e97e 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -821,8 +821,12 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
/*
 * Fix erratum 1076: CPB feature bit not being set in CPUID. It affects
 * all up to and including B1.
+*
+* Revision guide for Family 17h, Model 17 has not been released, but
+* Model 17, Stepping 0 have the same issue.
 */
-   if (c->x86_model <= 1 && c->x86_stepping <= 1)
+   if ((c->x86_model <= 1 && c->x86_stepping <= 1) ||  \
+   (c->x86_model == 17 && c->x86_stepping == 0))
set_cpu_cap(c, X86_FEATURE_CPB);
 }
 
-- 
2.19.1
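
To make the model/stepping numbers concrete, here is a small user-space sketch that decodes the 0x00810f10 signature quoted above and applies the same condition as the patched init_amd_zn(). It assumes the value is a standard CPUID leaf 1 EAX signature; struct demo_cpu and the demo_ functions are hypothetical stand-ins for the kernel's cpuinfo_x86 handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the cpuinfo_x86 fields the patch reads. */
struct demo_cpu {
	unsigned int family;
	unsigned int model;
	unsigned int stepping;
};

/* Decode a CPUID leaf 1 EAX signature into family, model and stepping. */
static struct demo_cpu demo_decode_signature(uint32_t eax)
{
	struct demo_cpu c;

	c.family = (eax >> 8) & 0xf;
	if (c.family == 0xf)
		c.family += (eax >> 20) & 0xff;	/* add extended family */

	c.model = (eax >> 4) & 0xf;
	if (c.family >= 0x6)
		c.model += ((eax >> 16) & 0xf) << 4;	/* add extended model */

	c.stepping = eax & 0xf;
	return c;
}

/*
 * Same condition as the patched code. The family is not checked here
 * because init_amd_zn() is only reached for Family 17h parts.
 */
static bool demo_needs_cpb_fixup(struct demo_cpu c)
{
	return (c.model <= 1 && c.stepping <= 1) ||
	       (c.model == 17 && c.stepping == 0);
}
```

Under that assumption, 0x00810f10 decodes to family 0x17, model 17, stepping 0, so the new branch of the condition applies to the two tested parts.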

[PATCH] slab: fix 'dubious: x & !y' warning from Sparse

2018-11-15 Thread Masahiro Yamada

Sparse reports:
./include/linux/slab.h:332:43: warning: dubious: x & !y

Signed-off-by: Masahiro Yamada 
---

 include/linux/slab.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 918f374..d395c73 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -329,7 +329,7 @@ static __always_inline enum kmalloc_cache_type 
kmalloc_type(gfp_t flags)
 * If an allocation is both __GFP_DMA and __GFP_RECLAIMABLE, return
 * KMALLOC_DMA and effectively ignore __GFP_RECLAIMABLE
 */
-   return type_dma + (is_reclaimable & !is_dma) * KMALLOC_RECLAIM;
+   return type_dma + (is_reclaimable && !is_dma) * KMALLOC_RECLAIM;
 }
 
 /*
-- 
2.7.4
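
The change is a no-op at run time because both operands are already 0 or 1. A minimal sketch, with made-up flag values and enum names standing in for the gfp flags and kmalloc_cache_type, shows that the bitwise and logical forms select the same cache type for every flag combination:

```c
#include <assert.h>

/* Hypothetical flag bits and cache types, not the kernel's real values. */
#define DEMO_GFP_DMA		0x1u
#define DEMO_GFP_RECLAIMABLE	0x2u

enum demo_cache_type { DEMO_NORMAL = 0, DEMO_RECLAIM = 1, DEMO_DMA = 2 };

/* Old form: bitwise AND of two values that are each 0 or 1. */
static enum demo_cache_type demo_type_bitwise(unsigned int flags)
{
	int is_dma = !!(flags & DEMO_GFP_DMA);
	int type_dma = is_dma * DEMO_DMA;
	int is_reclaimable = !!(flags & DEMO_GFP_RECLAIMABLE);

	return (enum demo_cache_type)
		(type_dma + (is_reclaimable & !is_dma) * DEMO_RECLAIM);
}

/* New form: logical AND, which Sparse does not flag as dubious. */
static enum demo_cache_type demo_type_logical(unsigned int flags)
{
	int is_dma = !!(flags & DEMO_GFP_DMA);
	int type_dma = is_dma * DEMO_DMA;
	int is_reclaimable = !!(flags & DEMO_GFP_RECLAIMABLE);

	return (enum demo_cache_type)
		(type_dma + (is_reclaimable && !is_dma) * DEMO_RECLAIM);
}
```

In both variants a request that is both DMA and reclaimable resolves to the DMA type, matching the comment in kmalloc_type().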

Re: [PATCH] ARM: dts: imx6sll: remove unused property in gpc node

2018-11-15 Thread Shawn Guo

On Tue, Nov 06, 2018 at 09:19:36AM +, Anson Huang wrote:
> The "fsl,mf-mix-wakeup-irq" is ONLY used as a temporary
> solution in NXP's internal tree for Mega/Fast Mix off
> feature after suspend, upstream kernel does NOT need it,
> remove it.
> 
> Signed-off-by: Anson Huang 

Applied both, thanks.

Re: [PATCH] arm64: dts: rockchip: rk3399: Add xin32k clk

2018-11-15 Thread dbasehore .

On Thu, Nov 15, 2018 at 9:03 PM Doug Anderson  wrote:
>
> Hi,
>
> On Thu, Nov 15, 2018 at 4:42 PM Derek Basehore  wrote:
> >
> > This adds the xin32k clock to the RK3399 CPU. Even though it's not
> > directly used, muxes will end up traversing the entire clk tree on
> > calls to determine_rate if it doesn't exist.
> >
> > Signed-off-by: Derek Basehore 
> > ---
> >  arch/arm64/boot/dts/rockchip/rk3399.dtsi | 7 +++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/arch/arm64/boot/dts/rockchip/rk3399.dtsi 
> > b/arch/arm64/boot/dts/rockchip/rk3399.dtsi
> > index 99e7f65c1779..6a32293982d0 100644
> > --- a/arch/arm64/boot/dts/rockchip/rk3399.dtsi
> > +++ b/arch/arm64/boot/dts/rockchip/rk3399.dtsi
> > @@ -191,6 +191,13 @@
> > #clock-cells = <0>;
> > };
> >
> > +   xin32k: xin32k {
>
> nit: xin32k is the name of the clock that rk3399 consumes.  It seems a
> little weird to name this node with that name.  Can you call this:
>
> ap_rtc_clk: ap-rtc-clk
>
> ...after the gru schematic?  You wouldn't change the
> clock-output-names, just the node name / label.
>
>
> > +   compatible = "fixed-clock";
> > +   clock-frequency = <32000>;
>
> I checked the datasheet for the 32K clock and it shows that this is a
> 32768 Hz clock, not a 32000 Hz one.  I also checked the rk808 clock
> driver (which is supposed to be compatible with rk3399) and it
> produces a 32768 clock.

Ok, sending out an updated patch that addresses these concerns.

[PATCH] arm64: dts: rockchip: rk3399: Add xin32k clk

2018-11-15 Thread Derek Basehore

This adds the xin32k clock to the RK3399 CPU. Even though it's not
directly used, muxes will end up traversing the entire clk tree on
calls to determine_rate if it doesn't exist.

Signed-off-by: Derek Basehore 
---
 arch/arm64/boot/dts/rockchip/rk3399.dtsi | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/arm64/boot/dts/rockchip/rk3399.dtsi 
b/arch/arm64/boot/dts/rockchip/rk3399.dtsi
index 99e7f65c1779..3d09472978f8 100644
--- a/arch/arm64/boot/dts/rockchip/rk3399.dtsi
+++ b/arch/arm64/boot/dts/rockchip/rk3399.dtsi
@@ -191,6 +191,13 @@
#clock-cells = <0>;
};
 
+   ap_rtc_clk: ap-rtc-clk {
+   compatible = "fixed-clock";
+   clock-frequency = <32768>;
+   clock-output-names = "xin32k";
+   #clock-cells = <0>;
+   };
+
amba {
compatible = "simple-bus";
#address-cells = <2>;
-- 
2.19.1.1215.g8438c0b245-goog

Re: [PATCH 4/8] kbuild: simplify dependency generation for CONFIG_TRIM_UNUSED_KSYMS

2018-11-15 Thread Nicolas Pitre

On Thu, 15 Nov 2018, Masahiro Yamada wrote:

> My main motivation of this commit is to clean up scripts/Kbuild.include
> and scripts/Makefile.build.
> 
> Currently, CONFIG_TRIM_UNUSED_KSYMS works with a tricky gimmick;
> possibly exported symbols are detected by letting $(CPP) replace
> EXPORT_SYMBOL* with a special string '=== __KSYM_*===', which is
> post-processed by sed, and passed to fixdep. The extra preprocessing
> is costly, and hacking cmd_and_fixdep is ugly.
> 
> I came up with a new way to find exported symbols; insert a dummy
> symbol __ksym_marker_* to each potentially exported symbol. Those
> dummy symbols are picked up by $(NM), post-processed by sed, then
> appended to .*.cmd files. I collected the post-process part to a
> new shell script scripts/gen_ksymdeps.sh for readability. The dummy
> symbols are put into the .discard.* section so that the linker
> script rips them off the final vmlinux or modules.

Brilliant!  I really like it.

Minor comments below.

> diff --git a/include/asm-generic/export.h b/include/asm-generic/export.h
> index 4d73e6e..294d6ae 100644
> --- a/include/asm-generic/export.h
> +++ b/include/asm-generic/export.h
> @@ -59,16 +59,19 @@ __kcrctab_\name:
>  .endm
>  #undef __put
>  
> -#if defined(__KSYM_DEPS__)
> -
> -#define __EXPORT_SYMBOL(sym, val, sec)   === __KSYM_##sym ===
> -
> -#elif defined(CONFIG_TRIM_UNUSED_KSYMS)
> +#if defined(CONFIG_TRIM_UNUSED_KSYMS)
>  
>  #include 
>  #include 
>  
> +.macro __ksym_marker sym
> + .section ".discard.ksym","a"
> +__ksym_marker_\sym:
> +  .previous

Does this work as intended? I have vague memories about having problems 
with sections being discarded when they don't allocate any space.

> +.endm
> +
>  #define __EXPORT_SYMBOL(sym, val, sec)   \
> + __ksym_marker sym;  \
>   __cond_export_sym(sym, val, sec, __is_defined(__KSYM_##sym))
>  #define __cond_export_sym(sym, val, sec, conf)   \
>   ___cond_export_sym(sym, val, sec, conf)
> diff --git a/include/linux/export.h b/include/linux/export.h
> index ce764a5..0413a3d 100644
> --- a/include/linux/export.h
> +++ b/include/linux/export.h
> @@ -92,22 +92,22 @@ struct kernel_symbol {
>   */
>  #define __EXPORT_SYMBOL(sym, sec)
>  
> -#elif defined(__KSYM_DEPS__)
> +#elif defined(CONFIG_TRIM_UNUSED_KSYMS)
> +
> +#include 
>  
>  /*
>   * For fine grained build dependencies, we want to tell the build system
>   * about each possible exported symbol even if they're not actually exported.
> - * We use a string pattern that is unlikely to be valid code that the build
> - * system filters out from the preprocessor output (see ksym_dep_filter
> - * in scripts/Kbuild.include).
> + * We use a symbol pattern __ksym_marker_ that the build system 
> filters
> + * from the $(NM) output (see scripts/gen_ksymdep.sh). These symbols are
> + * discarded in the final link stage.
>   */
> -#define __EXPORT_SYMBOL(sym, sec)=== __KSYM_##sym ===
> -
> -#elif defined(CONFIG_TRIM_UNUSED_KSYMS)
> -
> -#include 
> +#define __ksym_marker(sym)   \
> + static int __ksym_marker_##sym[0] __section(".discard.ksym") __used

Even if this is discarded during the final link, maybe this could save 
a tiny amount of disk space by using a char instead?

> diff --git a/scripts/Makefile.build b/scripts/Makefile.build
> index 7f3ca6e..e5ba9b1 100644
> --- a/scripts/Makefile.build
> +++ b/scripts/Makefile.build
> @@ -254,9 +254,18 @@ objtool_dep = $(objtool_obj) 
> \
> $(wildcard include/config/orc/unwinder.h  \
>include/config/stack/validation.h)
>  
> +ifdef CONFIG_TRIM_UNUSED_KSYMS
> +cmd_gen_ksymdeps = \
> + $(CONFIG_SHELL) $(srctree)/scripts/gen_ksymdeps.sh $@ > 
> $(dot-target).tmp; \
> + cat $(dot-target).tmp >> $(dot-target).cmd; \
> + rm -f $(dot-target).tmp;

Why don't you append to $(dot-target).cmd directly?


Nicolas

Re: [PATCH] arm64: dts: rockchip: rk3399: Add xin32k clk

2018-11-15 Thread Doug Anderson

Hi,

On Thu, Nov 15, 2018 at 4:42 PM Derek Basehore  wrote:
>
> This adds the xin32k clock to the RK3399 CPU. Even though it's not
> directly used, muxes will end up traversing the entire clk tree on
> calls to determine_rate if it doesn't exist.
>
> Signed-off-by: Derek Basehore 
> ---
>  arch/arm64/boot/dts/rockchip/rk3399.dtsi | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/arch/arm64/boot/dts/rockchip/rk3399.dtsi 
> b/arch/arm64/boot/dts/rockchip/rk3399.dtsi
> index 99e7f65c1779..6a32293982d0 100644
> --- a/arch/arm64/boot/dts/rockchip/rk3399.dtsi
> +++ b/arch/arm64/boot/dts/rockchip/rk3399.dtsi
> @@ -191,6 +191,13 @@
> #clock-cells = <0>;
> };
>
> +   xin32k: xin32k {

nit: xin32k is the name of the clock that rk3399 consumes.  It seems a
little weird to name this node with that name.  Can you call this:

ap_rtc_clk: ap-rtc-clk

...after the gru schematic?  You wouldn't change the
clock-output-names, just the node name / label.

> +   compatible = "fixed-clock";
> +   clock-frequency = <32000>;

I checked the datasheet for the 32K clock and it shows that this is a
32768 Hz clock, not a 32000 Hz one.  I also checked the rk808 clock
driver (which is supposed to be compatible with rk3399) and it
produces a 32768 clock.

[PATCH 1/1] Input: synaptics - enable SMBus for HP 15-ay000 (SYN3221).

2018-11-15 Thread Teika Kazura

SMBus works fine for the touchpad with id SYN3221, used in the HP 15-ay000
series.

This device has been reported in these messages on the "linux-input" mailing
list:
* https://marc.info/?l=linux-input&m=152016683003369&w=2
* https://www.spinics.net/lists/linux-input/msg52525.html

Reported-by: Nitesh Debnath 
Reported-by: Teika Kazura 
Signed-off-by: Teika Kazura 
---
 drivers/input/mouse/synaptics.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/input/mouse/synaptics.c b/drivers/input/mouse/synaptics.c
index 55d33500d5..591b776f22 100644
--- a/drivers/input/mouse/synaptics.c
+++ b/drivers/input/mouse/synaptics.c
@@ -179,6 +179,7 @@ static const char * const smbus_pnp_ids[] = {
"LEN0096", /* X280 */
"LEN0097", /* X280 -> ALPS trackpoint */
"LEN200f", /* T450s */
+   "SYN3221", /* HP 15-ay000 */
NULL
 };

--
2.18.1

Re: possible deadlock in acct_pin_kill

2018-11-15 Thread syzbot


syzbot has found a reproducer for the following crash on:

HEAD commit:da5322e65940 Merge tag 'selinux-pr-20181115' of git://git...
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1570390540
kernel config:  https://syzkaller.appspot.com/x/.config?x=4a0a89f12ca9b0f5
dashboard link: https://syzkaller.appspot.com/bug?extid=2a73a6ea9507b7112141
compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=11950d4d40

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+2a73a6ea9507b7112...@syzkaller.appspotmail.com

overlayfs: failed to resolve './file1': -2
overlayfs: filesystem on './file0' not supported as upperdir

==
overlayfs: filesystem on './file0' not supported as upperdir
WARNING: possible circular locking dependency detected
4.20.0-rc2+ #336 Not tainted
--
syz-executor0/7612 is trying to acquire lock:
a1ecfa3f (&acct->lock#2){+.+.}, at: acct_pin_kill+0x26/0x100  
kernel/acct.c:173

Process accounting resumed

but task is already holding lock:
1109cf86 (sb_writers#3){.+.+}, at: sb_start_write  
include/linux/fs.h:1597 [inline]
1109cf86 (sb_writers#3){.+.+}, at: mnt_want_write+0x3f/0xc0  
fs/namespace.c:360


which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (sb_writers#3){.+.+}:
   percpu_down_read_preempt_disable include/linux/percpu-rwsem.h:36  
[inline]

   percpu_down_read include/linux/percpu-rwsem.h:59 [inline]
   __sb_start_write+0x214/0x370 fs/super.c:1387
   file_start_write include/linux/fs.h:2810 [inline]
   ovl_write_iter+0x9a7/0xd10 fs/overlayfs/file.c:243
   call_write_iter include/linux/fs.h:1857 [inline]
   new_sync_write fs/read_write.c:474 [inline]
   __vfs_write+0x6b8/0x9f0 fs/read_write.c:487
   __kernel_write+0x10c/0x370 fs/read_write.c:506
   do_acct_process+0x1144/0x1660 kernel/acct.c:520
   slow_acct_process kernel/acct.c:579 [inline]
   acct_process+0x6b1/0x875 kernel/acct.c:605
   do_exit+0x1b89/0x26d0 kernel/exit.c:857
   do_group_exit+0x177/0x440 kernel/exit.c:970
   get_signal+0x8b0/0x1980 kernel/signal.c:2517
   do_signal+0x9c/0x21c0 arch/x86/kernel/signal.c:816
   exit_to_usermode_loop+0x2e5/0x380 arch/x86/entry/common.c:162
   prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
   syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
   do_syscall_64+0x6be/0x820 arch/x86/entry/common.c:293
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #1 (&ovl_i_mutex_key[depth]){+.+.}:
   down_write+0x8a/0x130 kernel/locking/rwsem.c:70
   inode_lock include/linux/fs.h:757 [inline]
   ovl_write_iter+0x151/0xd10 fs/overlayfs/file.c:231
   call_write_iter include/linux/fs.h:1857 [inline]
   new_sync_write fs/read_write.c:474 [inline]
   __vfs_write+0x6b8/0x9f0 fs/read_write.c:487
   __kernel_write+0x10c/0x370 fs/read_write.c:506
   do_acct_process+0x1144/0x1660 kernel/acct.c:520
   slow_acct_process kernel/acct.c:579 [inline]
   acct_process+0x6b1/0x875 kernel/acct.c:605
   do_exit+0x1b89/0x26d0 kernel/exit.c:857
   do_group_exit+0x177/0x440 kernel/exit.c:970
   get_signal+0x8b0/0x1980 kernel/signal.c:2517
   do_signal+0x9c/0x21c0 arch/x86/kernel/signal.c:816
   exit_to_usermode_loop+0x2e5/0x380 arch/x86/entry/common.c:162
   prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
   syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
   do_syscall_64+0x6be/0x820 arch/x86/entry/common.c:293
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #0 (&acct->lock#2){+.+.}:
   lock_acquire+0x1ed/0x520 kernel/locking/lockdep.c:3844
   __mutex_lock_common kernel/locking/mutex.c:925 [inline]
   __mutex_lock+0x166/0x16f0 kernel/locking/mutex.c:1072
   mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1087
   acct_pin_kill+0x26/0x100 kernel/acct.c:173
   pin_kill+0x29d/0xab0 fs/fs_pin.c:50
   acct_on+0x665/0x940 kernel/acct.c:254
   __do_sys_acct kernel/acct.c:286 [inline]
   __se_sys_acct kernel/acct.c:273 [inline]
   __x64_sys_acct+0xc2/0x1f0 kernel/acct.c:273
   do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

other info that might help us debug this:

Chain exists of:
  &acct->lock#2 --> &ovl_i_mutex_key[depth] --> sb_writers#3

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sb_writers#3);
                               lock(&ovl_i_mutex_key[depth]);
                               lock(sb_writers#3);
  lock(&acct->lock#2
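
The dependency chain in the report can be read as a directed graph: an edge
A --> B means "B was taken while A was held". A new acquisition is unsafe
when the wanted lock already leads back to a lock that is held. The sketch
below is a user-space illustration of that idea only, not lockdep's actual
implementation; the enum names and helper functions are hypothetical.

```c
#include <stdbool.h>

enum lock_id { ACCT_LOCK, OVL_I_MUTEX, SB_WRITERS, NR_LOCKS };

/* dep[a][b] is true when lock b was acquired while lock a was held. */
bool lock_reachable(bool dep[NR_LOCKS][NR_LOCKS], int from, int to)
{
	bool seen[NR_LOCKS] = { false };
	int stack[NR_LOCKS];
	int top = 0;

	/* Depth-first walk; each lock is pushed at most once. */
	stack[top++] = from;
	seen[from] = true;
	while (top > 0) {
		int cur = stack[--top];

		if (cur == to)
			return true;
		for (int next = 0; next < NR_LOCKS; next++) {
			if (dep[cur][next] && !seen[next]) {
				seen[next] = true;
				stack[top++] = next;
			}
		}
	}
	return false;
}

/*
 * Build the chain from the report and ask whether taking `want` while
 * holding `held` would close a cycle (want already leads back to held).
 */
bool chain_rejects(int held, int want)
{
	bool dep[NR_LOCKS][NR_LOCKS] = { { false } };

	dep[ACCT_LOCK][OVL_I_MUTEX] = true;   /* &acct->lock --> ovl_i_mutex */
	dep[OVL_I_MUTEX][SB_WRITERS] = true;  /* ovl_i_mutex --> sb_writers */

	return lock_reachable(dep, want, held);
}
```

With this model, chain_rejects(SB_WRITERS, ACCT_LOCK) is true, which is the
situation reported for acct_pin_kill(), while taking the locks in chain
order, e.g. chain_rejects(ACCT_LOCK, SB_WRITERS), is accepted.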

Re: [PATCH] mm: use managed_zone() for more exact check in zone iteration

2018-11-15 Thread Wei Yang

On Thu, Nov 15, 2018 at 01:37:35PM -0800, Andrew Morton wrote:
>On Thu, 15 Nov 2018 07:50:40 +0800 Wei Yang  wrote:
>
>> For one zone, there are three counters that describe its space range:
>> 
>> spanned_pages
>> present_pages
>> managed_pages
>> 
>> The detailed meaning is documented in include/linux/mmzone.h. This patch
>> concerns the last two.
>> 
>> present_pages is physical pages existing within the zone
>> managed_pages is present pages managed by the buddy system
>> 
>> From the definition, managed_pages is a stricter condition than
>> present_pages.
>> 
>> There are two functions using zone's present_pages as a boundary:
>> 
>> populated_zone()
>> for_each_populated_zone()
>> 
>> Going through the kernel tree, most of their users want to access
>> pages managed by the buddy system, which means it is more exact to
>> validate against the zone's managed_pages.
>> 
>> This patch replaces those checks on present_pages with checks on
>> managed_pages by:
>> 
>> * change for_each_populated_zone() to for_each_managed_zone()
>> * convert for_each_populated_zone() to for_each_zone() and check
>>   populated_zone() where necessary
>> * change populated_zone() to managed_zone() at proper places
>> 
>> Signed-off-by: Wei Yang 
>> 
>> ---
>> 
>> Michal, after last mail, I did one more thing to replace
>> populated_zone() with managed_zone() at proper places.
>> 
>> One thing I am not sure about is the call sites in mm/compaction.c. I
>> have changed them. If that is wrong, please let me know.
>> 
>> BTW, I did a boot test with the patched kernel and it looks smooth.
>
>Seems sensible, but a bit scary.  A basic boot test is unlikely to
>expose subtle gremlins.
>

Agree.

>Worse, the situations in which managed_zone() != populated_zone() are
>rare(?), so it will take a long time for problems to be discovered, I
>expect.

Hmm... I created a virtual machine with 4 nodes, which has a total of 6
populated zones. All of them are different.

This is a bit beyond my expectation.
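
To make the distinction concrete, here is a minimal user-space sketch of the
two checks. The struct and function names are hypothetical stand-ins, not
the kernel's definitions; the assumption is that each check simply tests
whether its counter is non-zero.

```c
#include <stdbool.h>

/* Simplified stand-in for the three counters in struct zone. */
struct zone_model {
	unsigned long spanned_pages;  /* total span, including holes */
	unsigned long present_pages;  /* physical pages existing in the zone */
	unsigned long managed_pages;  /* present pages managed by the buddy system */
};

bool populated_zone_model(const struct zone_model *z)
{
	return z->present_pages != 0;
}

bool managed_zone_model(const struct zone_model *z)
{
	return z->managed_pages != 0;
}

/* True when the two checks give different answers for a zone. */
bool checks_disagree(unsigned long present, unsigned long managed)
{
	struct zone_model z = { present, present, managed };

	return populated_zone_model(&z) != managed_zone_model(&z);
}
```

Under this model the counters can differ (say 1000 present, 900 managed)
while both checks still agree; the checks only disagree for a zone that has
present pages but none of them managed.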

>
>I'll toss it in there for now, let's see who breaks :(

Thanks.

-- 
Wei Yang
Help you, Help me

[PATCH] perf stat: Fix shadow stats for clock events

2018-11-15 Thread Ravi Bangoria

Commit 0aa802a79469 ("perf stat: Get rid of extra clock display
function") introduced scale and unit for clock events. Thus,
perf_stat__update_shadow_stats() now saves scaled values of
clock events in msecs, instead of original nsecs. But while
calculating values of shadow stats we still consider clock
event values in nsecs. This results in wrong shadow stat
values. For example:

  # ./perf stat -e task-clock,cycles ls

          2.60 msec task-clock:u              #    0.877 CPUs utilized
     2,430,564      cycles:u                  # 1215282.000 GHz

Fix this by saving original nsec values for clock events in
perf_stat__update_shadow_stats(). After patch:

  # ./perf stat -e task-clock,cycles ls

          3.14 msec task-clock:u              #    0.839 CPUs utilized
     3,094,528      cycles:u                  #    0.985 GHz

Reported-by: Anton Blanchard 
Suggested-by: Jiri Olsa 
Fixes: 0aa802a79469 ("perf stat: Get rid of extra clock display function")
Signed-off-by: Ravi Bangoria 
---
 tools/perf/util/stat-shadow.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index f0a8cec55c47..3c22c58b3e90 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -209,11 +209,12 @@ void perf_stat__update_shadow_stats(struct perf_evsel 
*counter, u64 count,
int cpu, struct runtime_stat *st)
 {
int ctx = evsel_context(counter);
+   u64 count_ns = count;
 
count *= counter->scale;
 
if (perf_evsel__is_clock(counter))
-   update_runtime_stat(st, STAT_NSECS, 0, cpu, count);
+   update_runtime_stat(st, STAT_NSECS, 0, cpu, count_ns);
else if (perf_evsel__match(counter, HARDWARE, HW_CPU_CYCLES))
update_runtime_stat(st, STAT_CYCLES, ctx, cpu, count);
else if (perf_stat_evsel__is(counter, CYCLES_IN_TX))
-- 
2.17.1
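
The bogus GHz figure in the "before" output can be reproduced with a little
arithmetic. The sketch below is a user-space model, not the perf code
itself: the helper names are hypothetical, the scale of 1e-6 (nsecs to
msecs) is an assumption about what the referenced commit sets for clock
events, and 2,600,000 ns is an assumed raw value consistent with the
"2.60 msec" shown.

```c
#include <stdint.h>

/* Mimics `count *= counter->scale` on a u64: the product is truncated. */
uint64_t scale_count(uint64_t count, double scale)
{
	return (uint64_t)((double)count * scale);
}

/* Shadow GHz figure: cycles divided by the stored clock value (expected in nsecs). */
double shadow_ghz(uint64_t cycles, uint64_t stored_clock)
{
	return stored_clock ? (double)cycles / (double)stored_clock : 0.0;
}
```

With those assumptions, scale_count(2600000, 1e-6) yields 2, and
shadow_ghz(2430564, 2) yields 1215282.0, matching the wrong output above.
Dividing by the unscaled value instead, shadow_ghz(2430564, 2600000), gives
roughly 0.93, which is in the expected range.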

Re: Crash in msm serial on dragonboard with ftrace bootargs

2018-11-15 Thread Viresh Kumar

On Thu, Nov 15, 2018 at 4:23 PM Srinivas Kandagatla
 wrote:

> Yes, this is not the solution, but it proves that the hand-off between
> bootloaders and kernel is the issue.
>
> In general there is a wider issue with resource hand-off between
> bootloader and kernel.
>
> There was a proposal in the past by Viresh for a new framework
> called boot-constraints (https://lkml.org/lkml/2017/12/14/440), which I
> am not sure is still actively looked at. But something similar should
> be the way to address such issues.

It isn't dead code yet, and I am waiting to gain a few more use cases
before I attempt to convince Greg again :)

Here is the code:

git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/linux.git boot-constraint

--
viresh

Re: [PATCH 1/2] clocksource: Demote dbx500 PRCMU clocksource

2018-11-15 Thread Baolin Wang

On 15 November 2018 at 21:32, Linus Walleij  wrote:
> Demote the DBx500 PRCMU clocksource to quality 100 and
> mark it as NONSTOP so it will still be used for
> timekeeping across suspend/resume.
>
> The Nomadik MTU timer which has higher precision will
> be used when the system is up and running, thanks to
> the recent changes properly utilizing the suspend
> clocksources.
>
> This was discussed back in 2011 when the driver was
> written, but the infrastructure was not available
> upstream to use this timer properly. Now the
> infrastructure is there, so let's finalize the work.
>
> Cc: Baolin Wang 
> Signed-off-by: Linus Walleij 
> ---

Glad to see a new driver using the suspend clocksource.
Reviewed-by: Baolin Wang 

>  drivers/clocksource/clksrc-dbx500-prcmu.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/clocksource/clksrc-dbx500-prcmu.c 
> b/drivers/clocksource/clksrc-dbx500-prcmu.c
> index c1b96dc5f444..4054539fe066 100644
> --- a/drivers/clocksource/clksrc-dbx500-prcmu.c
> +++ b/drivers/clocksource/clksrc-dbx500-prcmu.c
> @@ -46,10 +46,10 @@ static u64 notrace clksrc_dbx500_prcmu_read(struct 
> clocksource *cs)
>
>  static struct clocksource clocksource_dbx500_prcmu = {
> .name   = "dbx500-prcmu-timer",
> -   .rating = 300,
> +   .rating = 100,
> .read   = clksrc_dbx500_prcmu_read,
> .mask   = CLOCKSOURCE_MASK(32),
> -   .flags  = CLOCK_SOURCE_IS_CONTINUOUS,
> +   .flags  = CLOCK_SOURCE_IS_CONTINUOUS | 
> CLOCK_SOURCE_SUSPEND_NONSTOP,
>  };
>
>  #ifdef CONFIG_CLKSRC_DBX500_PRCMU_SCHED_CLOCK
> --
> 2.17.2
>
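
As a rough illustration of what the rating and flag change is meant to
achieve, here is a small user-space model of clocksource selection. It is a
sketch under stated assumptions, not the timekeeping core: the struct, the
flag values, the "mtu-model" entry and its rating of 200 are all
hypothetical, and the selection rule (highest rating wins; only NONSTOP
sources qualify across suspend) is a simplification.

```c
#include <stddef.h>

#define CS_IS_CONTINUOUS    0x01u  /* illustrative values, not the kernel's */
#define CS_SUSPEND_NONSTOP  0x02u

struct cs_model {
	const char *name;
	int rating;
	unsigned int flags;
};

/* Highest rating wins while the system is up and running. */
const struct cs_model *pick_running(const struct cs_model *cs, size_t n)
{
	const struct cs_model *best = NULL;

	for (size_t i = 0; i < n; i++)
		if (!best || cs[i].rating > best->rating)
			best = &cs[i];
	return best;
}

/* Only NONSTOP sources qualify for timekeeping across suspend. */
const struct cs_model *pick_suspend(const struct cs_model *cs, size_t n)
{
	const struct cs_model *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!(cs[i].flags & CS_SUSPEND_NONSTOP))
			continue;
		if (!best || cs[i].rating > best->rating)
			best = &cs[i];
	}
	return best;
}

/* Two sources modelled on the patch; the second entry is an assumption. */
static const struct cs_model demo[] = {
	{ "dbx500-prcmu-timer", 100, CS_IS_CONTINUOUS | CS_SUSPEND_NONSTOP },
	{ "mtu-model",          200, CS_IS_CONTINUOUS },
};

int demo_running_rating(void) { return pick_running(demo, 2)->rating; }
int demo_suspend_rating(void) { return pick_suspend(demo, 2)->rating; }
```

In this model the higher-rated source is picked while running and the
demoted PRCMU timer is still picked for suspend, which is the split the
commit message describes.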



-- 
Baolin Wang
Best Regards

Re: [PATCH v6 0/2] arm64: dts: add prng-ee nodes

2018-11-15 Thread Vinod Koul

On 15-11-18, 11:20, Andy Gross wrote:
> On Thu, Nov 15, 2018 at 09:15:18AM +0530, Vinod Koul wrote:
> > On 01-10-18, 11:51, Vinod Koul wrote:
> > > This adds prng-ee nodes for msm8996 and sdm845
> > 
> > Ping Andy, would appreciate if you can pick these up.
> 
> Done.  I did have to massage the location in dts for both of these patches to
> keep the address order correct.

Thank you Andy, appreciate it.

I have two more series in the queue, the QCS404 DTS series
https://patchwork.kernel.org/project/linux-arm-msm/list/?series=40841
and the QCS404 defconfigs
https://patchwork.kernel.org/project/linux-arm-msm/list/?series=40853

It would be great if you could review these as well.

Regards
-- 
~Vinod

Re: [PATCH] ASoC: imx-audmux: complete dt-bindings for imx6

2018-11-15 Thread Shawn Guo

On Mon, Nov 05, 2018 at 02:58:02PM +0100, Clément Péron wrote:
> From: Colin Didier 
> 
> The MX6 Audmux differs from MX51.
> 
> This patch adds the audmux for i.MX6 family.
> 
> Signed-off-by: Colin Didier 
> Signed-off-by: Clément Péron 

I think you should send it to the ASoC maintainer and list to get it applied.

Shawn

> ---
>  include/dt-bindings/sound/fsl-imx-audmux.h | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/include/dt-bindings/sound/fsl-imx-audmux.h 
> b/include/dt-bindings/sound/fsl-imx-audmux.h
> index 15f138bebe16..a1d0741d9ed1 100644
> --- a/include/dt-bindings/sound/fsl-imx-audmux.h
> +++ b/include/dt-bindings/sound/fsl-imx-audmux.h
> @@ -25,6 +25,14 @@
>  #define MX51_AUDMUX_PORT6       5
>  #define MX51_AUDMUX_PORT7       6
>  
> +#define MX6_AUDMUX_PORT1_SSI1   0
> +#define MX6_AUDMUX_PORT2_SSI2   1
> +#define MX6_AUDMUX_PORT3        2
> +#define MX6_AUDMUX_PORT4        3
> +#define MX6_AUDMUX_PORT5        4
> +#define MX6_AUDMUX_PORT6        5
> +#define MX6_AUDMUX_PORT7_SSI3   6
> +
>  /*
>   * TFCSEL/RFCSEL (i.MX27) or TFSEL/TCSEL/RFSEL/RCSEL (i.MX31/51/53/6Q)
>   * can be sourced from Rx/Tx.
> -- 
> 2.19.1
>

Re: [PATCH] ARM: dts: imx6ul: ccimx6ulsom: Fix indentation on iomuxc nodes

2018-11-15 Thread Shawn Guo

On Mon, Nov 05, 2018 at 11:48:04AM +0100, Alex Gonzalez wrote:
> This patch corrects indentation problems in the gpmigrp and i2c1grp nodes.
> 
> Signed-off-by: Alex Gonzalez 

Applied, thanks.

Re: [PATCH v2] ARM: dts: imx6ul: ccimx6ulsom: Add support for wireless SOM variant

2018-11-15 Thread Shawn Guo

On Mon, Nov 05, 2018 at 11:43:42AM +0100, Alex Gonzalez wrote:
> The wireless variants of the ConnectCore 6UL SOM include a Qualcomm
> QCA6564 wireless chip with dual WiFi and Bluetooth.
> 
> Both the ConnectCore 6UL SBC Express and Pro boards fit a wireless SOM.
> 
> The Wifi is connected through the SDIO interface on usdhc1 and the
> Bluetooth is connected via uart1.
> 
> Reviewed-by: Fabio Estevam 
> Signed-off-by: Alex Gonzalez 

Applied, thanks.

Re: [PATCHv3 1/6] atomics: add common header generation files

2018-11-15 Thread Mark Rutland

Hi Andrew,

On Thu, Nov 15, 2018 at 03:10:48PM -0800, Andrew Morton wrote:
> On Tue,  4 Sep 2018 11:48:25 +0100 Mark Rutland  wrote:
> 
> > To minimize repetition, to allow for future rework, and to ensure
> > regularity of the various atomic APIs, we'd like to automatically
> > generate (the bulk of) a number of headers related to atomics.
> > 
> > This patch adds the infrastructure to do so, leaving actual conversion
> > of headers to subsequent patches.
> 
> This thing is appallingly slow.  `sh scripts/atomic/check-atomics.sh'
> takes 8 seconds on a machine which builds an allnoconfig kernel in 30
> seconds.

Hmm... on my laptop it's less than half that, and allnoconfig takes ~35s, so
clearly there's a major difference between our setups.

For reference, which distro are you using, and what is /bin/sh on your box?

> Um, no.  Just no.  Please find a way to make this overhead go away.

Will do.

Trivially, switching to diff -q halves the check runtime for me, and I'm sure
there are other parts of the scripting which can be optimized.

Thanks,
Mark.

[PATCH] misc/pvpanic: resolve compile errors for arch=um

2018-11-15 Thread Peng Hao

Resolve compile error for arch=um
pvpanic.c:(.text+0xb6): undefined reference to `devm_ioremap_resource'

Signed-off-by: Peng Hao 
---
 drivers/misc/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 642626a..f417b06 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -515,7 +515,7 @@ config MISC_RTSX
 
 config PVPANIC
tristate "pvpanic device support"
-   depends on ACPI || OF
+   depends on HAS_IOMEM && (ACPI || OF)
help
  This driver provides support for the pvpanic device.  pvpanic is
  a paravirtualized device provided by QEMU; it lets a virtual machine
-- 
1.8.3.1

Oops: 0003 [#1] PREEMPT SMP NOPTI

2018-11-15 Thread Kyle Sanderson

2008(!) dual-core Atom box.
model name  : Intel(R) Atom(TM) CPU  330   @ 1.60GHz

[1027540.690329] perf: interrupt took too long (12579 > 12560),
lowering kernel.perf_event_max_sample_rate to 15000
[1027540.774026] perf: interrupt took too long (15754 > 15723),
lowering kernel.perf_event_max_sample_rate to 12000
[1027541.963573] BUG: unable to handle kernel paging request at b0428a44
[1027541.963647] IP: format_decode+0x20/0x3d0
[1027541.963676] PGD 2100c067 P4D 2100c067 PUD 2100d063 PMD 800020e001e1
[1027541.963723] Oops: 0003 [#1] PREEMPT SMP NOPTI
[1027541.963752] Modules linked in: af_packet_diag sctp_diag sctp
tcp_diag udp_diag dccp_diag dccp inet_diag unix_diag nfsd auth_rpcgss
nfs_acl tda18271 s5h1411 cfg80211 rfkill 8021q garp stp llc
xt_hashlimit iptable_filter ip_tables xt_length nf_conntrack_ipv6
nf_defrag_ipv6 xt_conntrack ip6table_filter ip6_tables ipv6 crc_ccitt
cachefiles snd_hda_codec_hdmi snd_hda_codec_realtek nouveau
snd_hda_codec_generic mxm_wmi video ttm wmi_bmof saa7164
nf_conntrack_ftp drm_kms_helper nf_conntrack tveeprom dvb_core bcache
drm videodev snd_hda_intel coretemp media syscopyarea sysfillrect
snd_hda_codec pcspkr serio_raw sysimgblt snd_hda_core fb_sys_fops
snd_hwdep snd_pcm snd_timer snd forcedeth soundcore i2c_nforce2 wmi
xts crypto_simd glue_helper cryptd aes_x86_64 crc32_generic cbc
sha256_generic ixgb ixgbe tulip
[1027541.964143]  cxgb3 cxgb mdio cxgb4 vxge bonding vxlan
ip6_udp_tunnel udp_tunnel macvlan vmxnet3 tg3 sky2 r8169 pcnet32 mii
igb ptp pps_core dca i2c_algo_bit i2c_core e1000 bnx2 atl1c msdos fat
configfs cramfs squashfs fuse xfs nfs lockd grace sunrpc fscache jfs
reiserfs btrfs zstd_decompress zstd_compress xxhash ext4 jbd2 ext2
mbcache linear raid10 raid1 raid0 dm_zero dm_verity reed_solomon
dm_thin_pool dm_switch dm_snapshot dm_raid raid456 async_raid6_recov
async_memcpy async_pq raid6_pq dm_mirror dm_region_hash dm_log_writes
dm_log_userspace dm_log dm_integrity async_xor async_tx xor dm_flakey
dm_era dm_delay dm_crypt dm_cache_smq dm_cache dm_persistent_data
libcrc32c dm_bufio dm_bio_prison dm_mod dax firewire_core crc_itu_t
sl811_hcd xhci_pci xhci_hcd usb_storage mpt3sas raid_class aic94xx
libsas
[1027541.964537]  lpfc qla2xxx megaraid_sas megaraid_mbox megaraid_mm
aacraid sx8 hpsa 3w_9xxx 3w_ 3w_sas mptsas scsi_transport_sas
mptfc scsi_transport_fc mptspi mptscsih mptbase imm parport sym53c8xx
initio arcmsr aic7xxx aic79xx scsi_transport_spi sr_mod cdrom sg
sd_mod pdc_adma sata_inic162x sata_mv ata_piix ahci libahci sata_qstor
sata_vsc sata_uli sata_sis sata_sx4 sata_nv sata_via sata_svw
sata_sil24 sata_sil sata_promise pata_via pata_jmicron pata_marvell
pata_sis pata_netcell pata_pdc202xx_old pata_atiixp pata_amd pata_ali
pata_it8213 pata_pcmcia pata_serverworks pata_oldpiix pata_artop
pata_it821x pata_hpt3x2n pata_hpt3x3 pata_hpt37x pata_hpt366
pata_cmd64x pata_sil680 pata_pdc2027x nvme nvme_core virtio_net
virtio_crypto crypto_engine virtio_mmio virtio_pci virtio_balloon
virtio_rng virtio_console
[1027541.964933]  virtio_blk virtio_scsi virtio_ring virtio
[1027541.964981] CPU: 2 PID: 11405 Comm: atop Not tainted 4.14.65-gentoo #1
[1027541.965013] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./To be filled by O.E.M., BIOS 080015  11/05/2009
[1027541.965072] task: 928755972400 task.stack: 9e8c02264000
[1027541.965114] RIP: 0010:format_decode+0x20/0x3d0
[1027541.965144] RSP: 0018:9e8c02267ba0 EFLAGS: 00010216
[1027541.965177] RAX: 0020 RBX: 928687ae307b RCX:
0014
[1027541.965212] RDX: 0014 RSI: b0428a44 RDI:
928687ae307b
[1027541.965245] RBP: b0428a44 R08: 6e72654b0a426b20 R09:
6953656761506c65
[1027541.965278] R10: 61506c656e72654b R11: 203a657a69536567 R12:
0f9d
[1027541.965327] R13: 9e8c02267c28 R14: b0428a44 R15:
b0428a58
[1027541.965363] FS:  7f8f05183680() GS:92875bb0()
knlGS:
[1027541.965398] CS:  0010 DS:  ES:  CR0: 80050033
[1027541.965430] CR2: b0428a44 CR3: 81e1e000 CR4:
06e0
[1027541.965463] Call Trace:
[1027541.965501]  vsnprintf+0x56/0x4d0
[1027541.965533]  ? vsnprintf+0xda/0x4d0
[1027541.965587]  ? seq_vprintf+0x30/0x50
[1027541.965619]  ? seq_printf+0x45/0x50
[1027541.965657]  ? show_smap.isra.34+0x19f/0x3e0
[1027541.965693]  ? smaps_hugetlb_range+0x120/0x120
[1027541.965728]  ? pagemap_pmd_range+0x640/0x640
[1027541.965768]  ? seq_read+0xed/0x3b0
[1027541.965800]  ? __vfs_read+0x25/0x130
[1027541.965832]  ? vfs_read+0x94/0x140
[1027541.965863]  ? SyS_read+0x46/0xa0
[1027541.965893]  ? do_syscall_64+0x6a/0x120
[1027541.965927]  ? entry_SYSCALL_64_after_hwframe+0x42/0xb7
[1027541.965965] Code: e8 a1 86 a2 ff 0f 0b eb c6 66 90 55 48 8d 2e 53
48 8d 1f 48 8d 64 24 f8 0f b6 06 48 89 3c 24 3c 01 74 4c 3c 02 0f 84
a2 01 00 00  06 00 0f b6 07 84 c0 0f 84 db 02 00 00 3c 25 0f 84 3b
03 00
[1027541.966170] RIP: format_decode+0x20/0x3

Re: linux-next: manual merge of the block tree with Linus' tree

2018-11-15 Thread Jens Axboe

On 11/15/18 7:19 PM, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the block tree got a conflict in:
> 
>   block/blk.h
> 
> between commit:
> 
>   1adfc5e4136f ("block: make sure discard bio is aligned with logical block 
> size")
> 
> from Linus' tree (precedes v4.20-rc2) and commit:
> 
>   079076b3416e ("block: remove deadline __deadline manipulation helpers")
> 
> from the block tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

Thanks Stephen, there's a few coming up. Come Sunday I'll pull in
-rc3 and resolve these. Not just for that, but also to ensure that
my -next branch has some important fixes from this series.

-- 
Jens Axboe

linux-next: manual merge of the block tree with Linus' tree

2018-11-15 Thread Stephen Rothwell

Hi all,

Today's linux-next merge of the block tree got a conflict in:

  block/blk.h

between commit:

  1adfc5e4136f ("block: make sure discard bio is aligned with logical block 
size")

from Linus' tree (precedes v4.20-rc2) and commit:

  079076b3416e ("block: remove deadline __deadline manipulation helpers")

from the block tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc block/blk.h
index 0089fefdf771,027a0ccc175e..
--- a/block/blk.h
+++ b/block/blk.h
@@@ -380,31 -233,6 +233,16 @@@ static inline void req_set_nomerge(stru
q->last_merge = NULL;
  }
  
- /*
-  * Steal a bit from this field for legacy IO path atomic IO marking. Note that
-  * setting the deadline clears the bottom bit, potentially clearing the
-  * completed bit. The user has to be OK with this (current ones are fine).
-  */
- static inline void blk_rq_set_deadline(struct request *rq, unsigned long time)
- {
-   rq->__deadline = time & ~0x1UL;
- }
- 
- static inline unsigned long blk_rq_deadline(struct request *rq)
- {
-   return rq->__deadline & ~0x1UL;
- }
- 
 +/*
 + * The max size one bio can handle is UINT_MAX becasue bvec_iter.bi_size
 + * is defined as 'unsigned int', meantime it has to aligned to with logical
 + * block size which is the minimum accepted unit by hardware.
 + */
 +static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
 +{
 +  return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
 +}
 +
  /*
   * Internal io_context interface
   */
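
For reference, the arithmetic in bio_allowed_max_sectors() can be checked in
user space. This is only a sketch: the helper names are hypothetical, the
logical block size is assumed to be a power of two, and unsigned int is
assumed to be 32 bits wide.

```c
#include <limits.h>

/* Round x down to a multiple of align (align must be a power of two). */
unsigned int round_down_pow2(unsigned int x, unsigned int align)
{
	return x & ~(align - 1);
}

/* Largest bio size in 512-byte sectors for a given logical block size. */
unsigned int allowed_max_sectors(unsigned int logical_block_size)
{
	return round_down_pow2(UINT_MAX, logical_block_size) >> 9;
}
```

Under those assumptions a 512-byte logical block size gives 8388607 sectors
and a 4096-byte one gives 8388600, so the limit always stays a multiple of
the logical block size.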



Re: [PATCH 1/2] perf vendor events: Add stepping in CPUID string for x86

2018-11-15 Thread Arnaldo Carvalho de Melo

Em Thu, Nov 15, 2018 at 04:01:46PM -0500, Liang, Kan escreveu:
> 
> 
> On 11/15/2018 3:44 PM, Jiri Olsa wrote:
> > On Wed, Nov 14, 2018 at 01:24:15PM -0800, kan.li...@linux.intel.com wrote:
> > > From: Kan Liang 
> > > 
> > > Perf tools cannot find the proper event list for Cascadelake server,
> > > because Cascadelake server and Skylake server have the same CPU model
> > > number, which perf tools use to find the event list.
> > > 
> > > The stepping for Skylake server is up to 4.
> > > The stepping for Cascadelake server starts from 5.
> > > The stepping can be used to distinguish between them.
> > > 
> > > The stepping is added in get_cpuid_str().
> > > The stepping information for Skylake server is updated in mapfile.csv.
> > > An x86-specific strcmp_cpuid_cmp() function is added to handle two CPUID
> > > formats in mapfile.csv, "vendor-family-model-stepping" and
> > > "vendor-family-model".
> > > - If a cpuid-regular-expression from mapfile.csv uses the new
> > >   stepping format, a cpuid-string generated on the machine must include
> > >   the stepping. Otherwise, it is a mismatch.
> > > - If the cpuid-regular-expression uses the old non-stepping format,
> > >   the stepping in the cpuid-string is ignored.
> > > 
> > > Scripts that use the environment string "PERF_CPUID" without a stepping
> > > on Skylake server will break. If so, users must fix their scripts.
> > > 
> > > Signed-off-by: Kan Liang 
> > 
> > Reviewed-by: Jiri Olsa 
> > 
> 
> Thanks Jirka,
> 
> Hi Arnaldo,
> 
> Are you OK with the patch?
> If yes, I will go ahead and clean up the *_cpuid_str() functions by moving
> them to header.c as promised. https://lkml.org/lkml/2018/11/15/929
> The new patch will be on top of this patch.

I'm travelling, will look at it soon, can't now, battery almost dead
:-\

- Arnaldo
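
The two matching rules quoted above can be sketched as follows. This is a
simplified illustration, not the patch's implementation: it uses exact
string comparison where the real mapfile entries are regular expressions,
and the function names and the sample strings in the usage note are
hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* Length of the first `fields` dash-separated fields of s (or the whole string). */
static size_t prefix_len(const char *s, int fields)
{
	size_t i;

	for (i = 0; s[i]; i++)
		if (s[i] == '-' && --fields == 0)
			break;
	return i;
}

/* Number of dash-separated fields in s. */
static int field_count(const char *s)
{
	int n = 1;

	for (; *s; s++)
		if (*s == '-')
			n++;
	return n;
}

bool cpuid_matches(const char *map_entry, const char *cpuid)
{
	/* New format: the entry carries a stepping, so the cpuid must too. */
	if (field_count(map_entry) >= 4)
		return strcmp(map_entry, cpuid) == 0;

	/* Old format: compare vendor-family-model only, ignore any stepping. */
	size_t n = prefix_len(cpuid, 3);

	return strlen(map_entry) == n && strncmp(map_entry, cpuid, n) == 0;
}
```

For example, an old-format entry "GenuineIntel-6-55" matches the string
"GenuineIntel-6-55-4", whereas a new-format entry "GenuineIntel-6-55-4"
does not match a machine string without a stepping.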

[PATCH v2] ASoC: rt5663: Add documentation for power supply support

2018-11-15 Thread Cheng-Yi Chiang

The rt5663 codec driver will support setting the CPVDD and AVDD power
supplies from the device tree.

Signed-off-by: Cheng-Yi Chiang 
---
 Moved power supply properties to required properties.

 Documentation/devicetree/bindings/sound/rt5663.txt | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/Documentation/devicetree/bindings/sound/rt5663.txt 
b/Documentation/devicetree/bindings/sound/rt5663.txt
index 23386446c63d6..2a55e91334082 100644
--- a/Documentation/devicetree/bindings/sound/rt5663.txt
+++ b/Documentation/devicetree/bindings/sound/rt5663.txt
@@ -10,6 +10,10 @@ Required properties:
 
 - interrupts : The CODEC's interrupt output.
 
+- avdd-supply: Power supply for AVDD, providing 1.8V.
+
+- cpvdd-supply: Power supply for CPVDD, providing 3.5V.
+
 Optional properties:
 
 - "realtek,dc_offset_l_manual"
@@ -51,4 +55,6 @@ rt5663: codec@12 {
compatible = "realtek,rt5663";
reg = <0x12>;
interrupts = <7 IRQ_TYPE_EDGE_FALLING>;
+   avdd-supply = <&pp1800_a_alc5662>;
+   cpvdd-supply = <&pp3500_a_alc5662>;
 };
-- 
2.19.1.1215.g8438c0b245-goog

[PATCH 1/1] Improve kernfs_notify() poll notification latency

2018-11-15 Thread Radu Rendec

kernfs_notify() does two notifications: poll and fsnotify. Originally,
both notifications were done from scheduled work context and all that
kernfs_notify() did was schedule the work.

This patch simply moves the poll notification from the scheduled work
handler to kernfs_notify(). The fsnotify notification still needs to be
done from scheduled work context because it can sleep (it needs to lock
a mutex).

If the poll notification is time critical (the notified thread needs to
wake as quickly as possible), it's better to do it from kernfs_notify()
directly. One example is calling sysfs_notify_dirent() from a hardware
interrupt handler to wake up a thread and handle the interrupt in user
space.

Signed-off-by: Radu Rendec 
---
 fs/kernfs/file.c | 23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index dbf5bc250bfd..f8d5021a652e 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -857,7 +857,6 @@ static __poll_t kernfs_fop_poll(struct file *filp, 
poll_table *wait)
 static void kernfs_notify_workfn(struct work_struct *work)
 {
struct kernfs_node *kn;
-   struct kernfs_open_node *on;
struct kernfs_super_info *info;
 repeat:
/* pop one off the notify_list */
@@ -871,17 +870,6 @@ static void kernfs_notify_workfn(struct work_struct *work)
kn->attr.notify_next = NULL;
spin_unlock_irq(&kernfs_notify_lock);
 
-   /* kick poll */
-   spin_lock_irq(&kernfs_open_node_lock);
-
-   on = kn->attr.open;
-   if (on) {
-   atomic_inc(&on->event);
-   wake_up_interruptible(&on->poll);
-   }
-
-   spin_unlock_irq(&kernfs_open_node_lock);
-
/* kick fsnotify */
mutex_lock(&kernfs_mutex);
 
@@ -934,10 +922,21 @@ void kernfs_notify(struct kernfs_node *kn)
 {
static DECLARE_WORK(kernfs_notify_work, kernfs_notify_workfn);
unsigned long flags;
+   struct kernfs_open_node *on;
 
if (WARN_ON(kernfs_type(kn) != KERNFS_FILE))
return;
 
+   /* kick poll immediately */
+   spin_lock_irqsave(&kernfs_open_node_lock, flags);
+   on = kn->attr.open;
+   if (on) {
+   atomic_inc(&on->event);
+   wake_up_interruptible(&on->poll);
+   }
+   spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
+
+   /* schedule work to kick fsnotify */
spin_lock_irqsave(&kernfs_notify_lock, flags);
if (!kn->attr.notify_next) {
kernfs_get(kn);
-- 
2.17.2

[PATCH 0/1] kernfs_notify() poll latency

2018-11-15 Thread Radu Rendec

Hi everyone,

I believe kernfs_notify() poll latency can be improved if the poll
notification is done from kernfs_notify() directly rather than scheduled
work context.

I am sure there are good reasons why the fsnotify notification must be
done from scheduled work context (an obvious one is that it needs to be
able to sleep). But I don't see any reason why the poll notification
could not be done from kernfs_notify(). If there is any, please point it
out - I would highly appreciate it.

I came across this while working on a project that uses the sysfs GPIO
interface to wake a (real time) user space process on GPIO interrupts. I
know that interface is deprecated, but I still believe it's a valid
scenario and may occur with other drivers as well (current or future).

The sysfs GPIO interface (drivers/gpio/gpiolib-sysfs.c) interrupt
handler relies on kernfs_notify() (actually sysfs_notify_dirent(), but
that's just an alias) to wake any thread that may be poll()ing on the
interrupt. It is important to wake the thread as quickly as possible and
going through the kernel worker to handle the scheduled work is much
slower. Since the kernel worker runs with normal priority, this can even
become a case of priority inversion. If a higher priority thread hogs
the CPU, it may delay the kernel worker and in turn the thread that
needs to be notified (which could be a real time thread).
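
For context, the user space side of this scenario looks roughly like the
sketch below: a thread blocks in poll() on a sysfs attribute and is woken
when the kernel calls sysfs_notify_dirent() on it. The function name is
hypothetical and the path passed in (for instance a GPIO "value" file) is an
assumption; error handling is kept minimal.

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait for a notification on a sysfs attribute.
 * Returns 1 on event, 0 on timeout, -1 on error.
 */
int wait_sysfs_event(const char *path, int timeout_ms)
{
	char buf[64];
	struct pollfd pfd;
	int ret;

	pfd.fd = open(path, O_RDONLY);
	if (pfd.fd < 0)
		return -1;

	/* An initial read consumes the current state before waiting. */
	if (read(pfd.fd, buf, sizeof(buf)) < 0) {
		close(pfd.fd);
		return -1;
	}

	/* sysfs notifications are reported as POLLPRI | POLLERR. */
	pfd.events = POLLPRI | POLLERR;
	pfd.revents = 0;
	ret = poll(&pfd, 1, timeout_ms);
	if (ret > 0) {
		/* Rewind and re-read to pick up the new value. */
		if (lseek(pfd.fd, 0, SEEK_SET) < 0 ||
		    read(pfd.fd, buf, sizeof(buf)) < 0)
			ret = -1;
		else
			ret = 1;
	}
	close(pfd.fd);
	return ret;
}
```

The latency discussed here is the time between the interrupt handler's
notify call and this poll() returning in the waiting thread.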

Best regards,
Radu Rendec


Radu Rendec (1):
  Improve kernfs_notify() poll notification latency

 fs/kernfs/file.c | 23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

-- 
2.17.2

Re: [PATCH v3 06/13] dt-bindings: irqchip: Introduce TISCI Interrupt router bindings

2018-11-15 Thread Lokesh Vutla

Hi Rob,

On 11/13/2018 11:43 AM, Lokesh Vutla wrote:
> Hi Rob,
> 
> On 12/11/18 11:30 PM, Rob Herring wrote:
>> On Tue, Nov 06, 2018 at 02:10:58PM +0530, Lokesh Vutla wrote:
>>> Add the DT binding documentation for Interrupt router driver.
>>>
>>> Signed-off-by: Lokesh Vutla 
>>> ---
>>> Changes since v2:
>>> - Dropped interrupt-parent from required properties description
>>> - Updated the interrupt cells to 4.
>>>
>>>   .../interrupt-controller/ti,sci-intr.txt  | 84 +++
>>>   MAINTAINERS   |  1 +
>>>   2 files changed, 85 insertions(+)
>>>   create mode 100644
>>> Documentation/devicetree/bindings/interrupt-controller/ti,sci-intr.txt
>>>
>>> diff --git
>>> a/Documentation/devicetree/bindings/interrupt-controller/ti,sci-intr.txt
>>> b/Documentation/devicetree/bindings/interrupt-controller/ti,sci-intr.txt
>>> new file mode 100644
>>> index ..06e69f8c812c
>>> --- /dev/null
>>> +++
>>> b/Documentation/devicetree/bindings/interrupt-controller/ti,sci-intr.txt
>>> @@ -0,0 +1,84 @@
>>> +Texas Instruments K3 Interrupt Router
>>> +=====================================
>>> +
>>> +The Interrupt Router (INTR) module provides a mechanism to mux M
>>> +interrupt inputs to N interrupt outputs, where all M inputs are
>>> selectable
>>> +to be driven per N output. There is one register per output
>>> (MUXCNTL_N) that
>>> +controls the selection.
>>> +
>>> +
>>> +                                 Interrupt Router
>>> +                             +----------------------+
>>> +                             |  Inputs     Outputs  |
>>> +        +-------+            | +------+             |
>>> +        | GPIO  |----------->| | irq0 |             |  Host IRQ
>>> +        +-------+            | +------+             |  controller
>>> +                             |    .        +-----+  |  +-------+
>>> +        +-------+            |    .        |  0  |  |->|  IRQ  |
>>> +        | INTA  |----------->|    .        +-----+  |  +-------+
>>> +        +-------+            |    .          .      |
>>> +                             | +------+      .      |
>>> +                             | | irqM |    +-----+  |
>>> +                             | +------+    |  N  |  |
>>> +                             |             +-----+  |
>>> +                             +----------------------+
>>> +
>>> +Configuration of these MUXCNTL_N registers is done by a system
>>> controller
>>> +(like the Device Memory and Security Controller on K3 AM654 SoC).
>>> System
>>> +controller will keep track of the used and unused registers within
>>> the Router.
>>> +Driver should request the system controller to get the range of GIC
>>> IRQs
>>> +assigned to the requesting hosts. It is the drivers responsibility
>>> to keep
>>> +track of Host IRQs.
>>> +
>>> +Communication between the host processor running an OS and the system
>>> +controller happens through a protocol called TI System Control
>>> Interface
>>> +(TISCI protocol). For more details refer:
>>> +Documentation/devicetree/bindings/arm/keystone/ti,sci.txt
>>> +
>>> +TISCI Interrupt Router Node:
>>> +
>>> +- compatible:    Must be "ti,sci-intr".
>>> +- interrupt-controller:    Identifies the node as an interrupt
>>> controller
>>> +- #interrupt-cells:    Specifies the number of cells needed to
>>> encode an
>>> +    interrupt source. The value should be 4.
>>> +    First cell should contain the TISCI device ID of source
>>> +    Second cell should contain the interrupt source offset
>>> +    within the device
>>> +    Third cell specifies the trigger type as defined
>>> +    in interrupts.txt in this directory.
>>> +    Fourth cell should be 1 if the irq is coming from
>>> +    interrupt aggregator else 0.
>>> +- ti,sci:    Phandle to TI-SCI compatible System controller node.
>>> +- ti,sci-dst-id:    TISCI device ID of the destination IRQ controller.
>>> +- ti,sci-rm-range-girq:    TISCI subtype id representing the host irqs assigned
>>> +    to this interrupt router.
>>
>> u32 or array?
> 
> it is u32.

Sorry, I was wrong here. There is one instance where more than one set
of GIC IRQ ranges is associated with this IP. I will change it to an
array in the next version.

Thanks and regards,
Lokesh

> 
>>
>>> +
>>> +For more details on TISCI IRQ resource management refer:
>>> +http://downloads.ti.com/tisci/esd/latest/2_tisci_msgs/rm/rm_irq.html
>>> +
>>> +Example:
>>> +
>>> +The following example demonstrates both interrupt router node and the consumer
>>> +node(main gpio) on the AM654 SoC:
>>> +
>>> +main_intr: interrupt-controller@1 {
>>
>> Unit-address is not valid here without a reg property.
> 
> Sure will fix it in next version.
> 
> Thanks and regards,
> Lokesh
> 
>>
>>> +    compatible = "ti,sci-intr";
>>> +    interrupt-controller;
>>> +    interrupt-parent = <&gic>;
>>> +    #interrupt-c

Re: [PATCH 5/8] kbuild: change if_changed_rule to accept multi-line recipe

2018-11-15 Thread Masahiro Yamada

On Thu, Nov 15, 2018 at 6:12 PM Rasmus Villemoes
 wrote:
>
> On 15/11/2018 09.27, Masahiro Yamada wrote:
> > GNU Make supports 'define' ... 'endef' directive, which can describe
> > a recipe that consists of multiple lines.
> >
> >   endef
> >
> > This does not actually exploit the benefits of 'define' ... 'endef'
> > form. All shell commands must be concatenated with '; \' so that it
> > looks like a single command from the Makefile point of view. '@' can
> > only appear before the first action.
> >
> > The root cause of this misfortune is the '@set -e;' in if_changed_rule.
> > It is easily solvable by moving '@set -e' to the 'cmd' macro.
> >
> > The combo of $(call echo-cmd,*) $(cmd_*) in rule_cc_o_c and rule_as_o_S
> > was replaced with $(call cmd,*). The trailing backslashes went away.
> >
> > Signed-off-by: Masahiro Yamada 
> > ---
> >
> >  define rule_cc_o_c
> > - $(call echo-cmd,checksrc) $(cmd_checksrc) \
> > - $(call cmd_and_fixdep,cc_o_c) \
> > - $(cmd_gen_ksymdeps)   \
> > - $(cmd_checkdoc)   \
> > - $(call echo-cmd,objtool) $(cmd_objtool)   \
> > - $(cmd_modversions_c)  \
> > - $(call echo-cmd,record_mcount) $(cmd_record_mcount)
> > + $(call cmd,checksrc)
> > + @$(call cmd_and_fixdep,cc_o_c)
> > + $(call cmd,gen_ksymdeps)
> > + $(call cmd,checkdoc)
> > + $(call cmd,objtool)
> > + $(call cmd,modversions_c)
> > + $(call cmd,record_mcount)
> >  endef
>
> Does this mean that Make now spawns a new shell for each of these commands,


Yes and No.

Basically, each line is run in an independent sub-shell,
but things are a little bit complex.

GNU Make, if possible, runs the command directly
instead of forking a shell process.

That is what I understood according to the following document:


http://wanderinghorse.net/computing/make/book/ManagingProjectsWithGNUMake-3.1.3.pdf

See "chapter 7: commands"
  -->8--
  Commands are essentially one-line shell scripts.
  In effect, make grabs each line and passes it to a subshell for execution.
  In fact, make can optimize this (relatively) expensive fork/exec algorithm
  if it can guarantee that omitting the shell will not change the behavior of
  the program. It checks this by scanning each command line for shell special
  characters, such as wildcard characters and i/o redirection. If none are
  found, make directly executes the command without passing it to a subshell.
  -->8--
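
To make the check described in that excerpt a bit more concrete, here is a
minimal C sketch of the idea. It is not GNU Make's actual code: the helper
name needs_shell() and the set of "special" characters are assumptions made
purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from GNU Make: report whether a recipe
 * line contains a character that only a shell can interpret. The
 * character set below is an illustrative assumption.
 */
static bool needs_shell(const char *line)
{
	static const char special[] = "#;\"'*?[]&|<>(){}$`\\~";

	assert(line != NULL);

	/* strpbrk() finds the first byte of line that occurs in special. */
	return strpbrk(line, special) != NULL;
}
```

Under this assumption a plain compiler invocation could be exec'ed directly,
while a line that contains ';' (as in "set -e; ...") would be handed to a
shell.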





> and if so, what's the performance impact? Or am I just
> misreading things? If this does change the semantics (one shell instance
> versus many), I think that's worth mentioning explicitly in the
> changelog, even if there's no measurable performance impact.


Last night, I checked 'perf stat' of x86 defconfig build,
which enables cmd_objtool.



Without this commit:


 Performance counter stats for 'make -j8' (10 runs):

 125.499068713 seconds time elapsed
  ( +-  0.10% )


With this commit:

 Performance counter stats for 'make -j8' (10 runs):

 125.618321667 seconds time elapsed
  ( +-  0.24% )



I did not see any noticeable performance regression.



-- 
Best Regards
Masahiro Yamada

Re: [PATCH v17 18/23] platform/x86: Intel SGX driver

2018-11-15 Thread Randy Dunlap

On 11/15/18 5:01 PM, Jarkko Sakkinen wrote:
> Intel Software Guard eXtensions (SGX) is a set of CPU instructions that
> can be used by applications to set aside private regions of code and
> data. The code outside the enclave is disallowed to access the memory
> inside the enclave by the CPU access control.
> 
> The SGX driver provides an ioctl API for loading and initializing enclaves.
> The address range for an enclave is reserved with mmap() and enclaves are
> destroyed with munmap(). Enclave construction, measurement and
> initialization are done with the provided ioctl API.
> 
> Signed-off-by: Jarkko Sakkinen 
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> Co-developed-by: Serge Ayoun 
> Signed-off-by: Serge Ayoun 
> Co-developed-by: Shay Katz-zamir 
> Signed-off-by: Shay Katz-zamir 
> Co-developed-by: Suresh Siddha 
> Signed-off-by: Suresh Siddha 
> ---

> diff --git a/arch/x86/include/uapi/asm/sgx.h b/arch/x86/include/uapi/asm/sgx.h
> new file mode 100644
> index ..aadf9c76e360
> --- /dev/null
> +++ b/arch/x86/include/uapi/asm/sgx.h
> @@ -0,0 +1,59 @@
> +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
> +/**
> + * Copyright(c) 2016-18 Intel Corporation.
> + */
> +#ifndef _UAPI_ASM_X86_SGX_H
> +#define _UAPI_ASM_X86_SGX_H
> +
> +#include 
> +#include 
> +
> +#define SGX_MAGIC 0xA4
> +
> +#define SGX_IOC_ENCLAVE_CREATE \
> + _IOW(SGX_MAGIC, 0x00, struct sgx_enclave_create)
> +#define SGX_IOC_ENCLAVE_ADD_PAGE \
> + _IOW(SGX_MAGIC, 0x01, struct sgx_enclave_add_page)
> +#define SGX_IOC_ENCLAVE_INIT \
> + _IOW(SGX_MAGIC, 0x02, struct sgx_enclave_init)
> +
> +/* IOCTL return values */
> +#define SGX_POWER_LOST_ENCLAVE   0x4000


Hi,
The ioctl magic number should be documented in
Documentation/ioctl/ioctl-number.txt.

ta.
-- 
~Randy

Re: [RFC PATCH] dt-bindings: add a jsonschema binding example

2018-11-15 Thread jonsm...@gmail.com

On Thu, Nov 15, 2018 at 6:42 PM Rob Herring  wrote:
>
> On Wed, Nov 14, 2018 at 1:39 PM jonsm...@gmail.com  wrote:
> >
> > On Fri, Apr 20, 2018 at 9:36 PM Rob Herring  wrote:
> > > I share the concern as I doubt most kernel developers know
> > > jsonschema. But then the alternative is us defining and writing our
> > > own thing which is C like 'cause that's what kernel developers
> > > understand. My hope is to simplify and restrict things enough that
> > > writing a binding doc is straightforward without being jsonschema
> > > experts. That was the intent of this patch without going into all the
> > > details behind it.
> >
> > When schemas were first discussed long, long ago the idea was to have
> > an arbitrator who controls the schema (like Grant/Rob) so there is no
> > need for general schema design knowledge in random kernel developers.
> >
> > First a developer should try and build their device tree using the
> > existing schema. Only if they find that impossible should they
> > propose schema changes. The schema arbitrator would then
> > look at those changes and work them into the existing schemas as
> > needed. Doing this via an arbitrator will ensure consistency in the
> > overall schema design while eliminating redundancy with slight
> > variations (like we have now).
> >
> > Another side effect of schemas is that as they evolve and enforce
> > commonality among driver implementation it will become possible to
> > turn those in-common pieces into driver libraries.
>
> If we replace 'schemas' everywhere above with 'bindings', then this
> pretty much describes the status quo today. Most device specific
> bindings are a collection of standard bindings. Occasionally, we have
> new common bindings. All the bindings get reviewed by me. The only
> real change here is submitters have to have some level of
> understanding of json-schema instead of just English (for writing free
> form text). I think it will continue to largely be following existing
> examples of other bindings.

What used to happen is that drivers would be written out of tree
without review of their bindings until mainline submission (if they
submit them at all).  With schema a driver writer who is working out
of tree can use the schema to validate their new device tree entries
before submitting them. That way they will know ahead of time if they
are making up something non-standard. It will also give them the heads
up that they can't just make up anything they want in the device tree
and that they are going to have to defend their design when asking for
the schema to be changed to support it. An example of where schema
would have been initially valuable is in the i2c bindings which
contain significant variation but the function is the same.

Maybe we are thinking about schema differently. I had envisioned
starting from a base generic schema that is capable of validating all
possible legal Linux device trees. This schema is stricter than
YAML syntax, but it obviously can't validate in detail.  Someone
working out of tree would always be able to validate against this
schema.

As this generic schema validates the device tree it will discover that
it can utilize more strict schema fragments. So by providing these
fragments you can validate to any desired level of conformance. The
end of that process is the json-schema bindings file. But if those
fragments are missing you can still validate, just not at a detailed
level.

A large set of schemas that work like this is used in ONVIF (security
cameras), a flavor of SOAP.
https://www.onvif.org/profiles/specifications/
These schemas use XML stylesheets to make them pretty; use view
source to see the actual schemas.

The ONVIF schemas define points where vendors are allowed to insert
arbitrary items (ANY elements) and then they will use a vendor
supplied schema to validate the fragment if one is available. If not
the generic schema is used to validate the basic structure of the
vendor fragments.

>
> Rob

-- 
Jon Smirl
jonsm...@gmail.com

Re: [PATCH v3] tpm: add support for partial reads

2018-11-15 Thread Jarkko Sakkinen

On Thu, Nov 15, 2018 at 04:26:33PM -0800, Tadeusz Struk wrote:
> On 11/15/18 3:31 PM, Jarkko Sakkinen wrote:
> > You could drop these memset() calls and also one from
> > tpm_timeout_work(). The call could be done once in the beginning of
> > tpm_common_write() instead of having three different call sites.
> > 
> 
> Don't we want to clean the buffer as the response is read?
> If we only memset it in write and someone sends just one
> command, there will be only one response.
> The data will sit in the buffer until the next command.
> There will be quite a big time window in which the data can leak.

Point taken.

> > Naming becomes a mess and the comment for data_pending variable is
> > incorrect as it is also used for synchronous operation.
> > 
> > Maybe add a preceding commit to rename it as
> > 
> > /* Holds the result of the last tpm_transmit() call. */
> > ssize_t transmit_result;
> 
> Agree, will do that.
> 
> > 
> > That is at least clear and obvious on what it contains.
> > 
> > The comment for partial_data is incorrect as the variable does not
> > contain any data.
> 
> Yes, I will change it.
> 
> > 
> > We could declare:
> > 
> > /* Holds the count how much of the response is still unread. */
> > size_t response_pending;
> > 
> > One more remark on your commit: there is no reason to use ssize_t as
> > the type as the value should never be a negative number.
> 
> Yes, in fact now the priv->data_pending can also be only positive.
> I'll change it and send a new version soon.

Right, you're correct. Then it should simply be called response_size
(and be size_t), as that is the most precise name for what it
contains.
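
For illustration only, the bookkeeping discussed above could look roughly like
the following userspace sketch. It is not the driver code: struct resp_state,
resp_read() and BUF_SIZE are made-up names, and the sketch assumes that the
part of the buffer already handed out is zeroed right away, as Tadeusz
suggested.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 64	/* illustrative; not the driver's buffer size */

/* Hypothetical userspace model of the state, not the driver's struct. */
struct resp_state {
	unsigned char buf[BUF_SIZE];
	size_t response_size;		/* total length of the last response */
	size_t response_pending;	/* bytes of it that are still unread */
};

/* Copy out up to len unread bytes and return how many were copied. */
static size_t resp_read(struct resp_state *st, unsigned char *out, size_t len)
{
	size_t off, n;

	assert(st->response_size <= BUF_SIZE);
	assert(st->response_pending <= st->response_size);

	off = st->response_size - st->response_pending;
	n = len < st->response_pending ? len : st->response_pending;

	memcpy(out, st->buf + off, n);
	/* Clear what was just handed out so it does not linger. */
	memset(st->buf + off, 0, n);

	st->response_pending -= n;
	if (st->response_pending == 0)
		st->response_size = 0;

	return n;
}
```

Both counters are size_t because neither can become negative; an error from
the transmit path would have to be reported separately.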

> Thanks,
> -- 
> Tadeusz

/Jarkko

Re: Memory hotplug softlock issue

2018-11-15 Thread Baoquan He

On 11/15/18 at 03:32pm, Michal Hocko wrote:
> On Thu 15-11-18 21:38:40, Baoquan He wrote:
> > On 11/15/18 at 02:19pm, Michal Hocko wrote:
> > > On Thu 15-11-18 21:12:11, Baoquan He wrote:
> > > > On 11/15/18 at 09:30am, Michal Hocko wrote:
> > > [...]
> > > > > It would be also good to find out whether this is fs specific. E.g. does
> > > > > it make any difference if you use a different one for your stress
> > > > > testing?
> > > > 
> > > > Created a ramdisk and put stress bin there, then run stress -m 200, now
> > > > seems it's stuck in libc-2.28.so migrating. And it's still xfs. So now xfs
> > > > is a big suspect. At bottom I paste numactl printing, you can see that it's
> > > > the last 4G.
> > > > 
> > > > Seems it's trying to migrate libc-2.28.so, but stress program keeps trying to
> > > > access and activate it.
> > > 
> > > Is this still with faultaround disabled? I have seen exactly same
> > > pattern in the bug I am working on. It was ext4 though.
> > 
> > After a long struggle, the second-to-last block, where libc-2.28.so is
> > located, is reclaimed. Now it comes to the last memory block, which is still
> > the stress program itself. A swap migration entry has been made and it is
> > trying to unmap; now it's looping there.
> > 
> > [  +0.004445] migrating pfn 190ff2bb0 failed 
> > [  +0.13] page:ea643fcaec00 count:203 mapcount:201 
> > mapping:888dfb268f48 index:0x0
> > [  +0.012809] shmem_aops 
> > [  +0.11] name:"stress" 
> > [  +0.002550] flags: 
> > 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked)
> > [  +0.010715] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 
> > 888dfb268f48
> > [  +0.007828] raw:   00cb00c8 
> > 888e72e92000
> > [  +0.007810] page->mem_cgroup:888e72e92000
> [...]
> > [  +0.004455] migrating pfn 190ff2bb0 failed 
> > [  +0.18] page:ea643fcaec00 count:203 mapcount:201 
> > mapping:888dfb268f48 index:0x0
> > [  +0.014392] shmem_aops 
> > [  +0.10] name:"stress" 
> > [  +0.002565] flags: 
> > 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked)
> > [  +0.010675] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 
> > 888dfb268f48
> > [  +0.007819] raw:   00cb00c8 
> > 888e72e92000
> > [  +0.007808] page->mem_cgroup:888e72e92000
> 
> OK, so this is tmpfs backed code of your stress test. This just tells us
> that this is not fs specific. Reference count is 2 more than the map
> count which is the expected state. So the reference count must have been
> elevated at the time when the migration was attempted. Shmem supports
> fault around so this might be still possible (assuming it is enabled).
> If not we really need to dig deeper. I will think of a debugging patch.

I disabled faultaround, rebooted and tested again. It is looping forever in
the last block again, on node2, and the page again belongs to the stress
program itself. The weird thing is that the refcount seems to have gone
crazy; it looks like a random number now. Something must be going wrong.

[  +0.058624] migrating pfn 80fd6fbe failed 
[  +0.03] page:ea203f5bef80 count:336 mapcount:201 
mapping:888e1c9357d8 index:0x2
[  +0.014122] shmem_aops 
[  +0.00] name:"stress" 
[  +0.002467] flags: 0x9fc008000e(referenced|uptodate|dirty|swapbacked)
[  +0.009511] raw: 009fc008000e c90e3d80 c90e3d80 
888e1c9357d8
[  +0.007743] raw: 0002  00cb00c8 
888e2233d000
[  +0.007740] page->mem_cgroup:888e2233d000
[  +0.038916] migrating pfn 80fd6fbe failed 
[  +0.03] page:ea203f5bef80 count:349 mapcount:201 
mapping:888e1c9357d8 index:0x2
[  +0.012453] shmem_aops 
[  +0.01] name:"stress" 
[  +0.002641] flags: 0x9fc008000e(referenced|uptodate|dirty|swapbacked)
[  +0.009501] raw: 009fc008000e c90e3d80 c90e3d80 
888e1c9357d8
[  +0.007746] raw: 0002  00cb00c8 
888e2233d000
[  +0.007740] page->mem_cgroup:888e2233d000
[  +0.061226] migrating pfn 80fd6fbe failed 
[  +0.04] page:ea203f5bef80 count:276 mapcount:201 
mapping:888e1c9357d8 index:0x2
[  +0.014129] shmem_aops 
[  +0.02] name:"stress" 
[  +0.003246] flags: 
0x9fc008008e(waiters|referenced|uptodate|dirty|swapbacked)
[  +0.010183] raw: 009fc008008e c90e3d80 c90e3d80 
888e1c9357d8
[  +0.007742] raw: 0002  00cb00c8 
888e2233d000
[  +0.007733] page->mem_cgroup:888e2233d000
[  +0.037305] migrating pfn 80fd6fbe failed 
[  +0.03] page:ea203f5bef80 count:304 mapcount:201 
mapping:888e1c9357d8 index:0x2
[  +0.012449] shmem_aops 
[  +0.02] name:"stress" 
[  +0.002469] flags: 0x9fc008000e(referenced|uptodate|dirty|swapbacked)
[  +0.009495] raw: 009fc008000e c90e3d80 c90e3d80 
888e1c9357d8
[  +0.007743] raw: 0002  00cb

Re: [PATCH] serial: 8250: Default SERIAL_OF_PLATFORM to SERIAL_8250

2018-11-15 Thread Guenter Roeck

On Thu, Nov 15, 2018 at 11:48:20AM -0800, Florian Fainelli wrote:
> 
> OK, would you mind testing this below? It seems to me that 8250_of.c is
> incompatible with arch/powerpc/kernel/legacy_serial.c and that is what
> is causing the issue here.
> 
> diff --git a/drivers/tty/serial/8250/Kconfig b/drivers/tty/serial/8250/Kconfig
> index d7737dca0e48..21cb14cbd34a 100644
> --- a/drivers/tty/serial/8250/Kconfig
> +++ b/drivers/tty/serial/8250/Kconfig
> @@ -483,7 +483,7 @@ config SERIAL_8250_PXA
> 
>  config SERIAL_OF_PLATFORM
> tristate "Devicetree based probing for 8250 ports"
> -   depends on SERIAL_8250 && OF
> +   depends on SERIAL_8250 && OF && !(PPC && PPC_UDBG_16550)
> default SERIAL_8250
> help
>   This option is used for all 8250 compatible serial ports that

44x/virtex5_defconfig has both PPC_UDBG_16550 and SERIAL_OF_PLATFORM enabled
and fails to boot (or display anything on the console) with this patch applied.

Guenter

Re: [PATCH] Revert "HID: uhid: use strlcpy() instead of strncpy()"

2018-11-15 Thread Kees Cook

On Thu, Nov 15, 2018 at 5:55 AM, David Herrmann  wrote:
> Hi
>
> On Thu, Nov 15, 2018 at 12:09 AM Kees Cook  wrote:
>> On Wed, Nov 14, 2018 at 9:40 AM, Laura Abbott  wrote:
> [...]
>> > Can we switch to strscpy instead? This will quiet gcc and avoid the
>> > issues with strlcpy.
>>
>> Yes please: it looks like these strings are expected to be NUL
>> terminated, so strscpy() without the "- 1" and min() logic would be
>> the correct solution here.
>
> "the correct solution"? To my knowledge the original code was correct
> as well. Am I missing something?

So, yes, no one should use strlcpy():
https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy

And while I think nothing was technically wrong with the strncpy()
usage in the original version, I think strncpy() should only be used
for __nonstring cases:
https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings

>
>>If @hid is already zero, then this would
>> just be:
>>
>>strscpy(hid->name, ev->u.create2.name, sizeof(hid->name));
>>strscpy(hid->phys, ev->u.create2.phys, sizeof(hid->phys));
>>strscpy(hid->uniq, ev->u.create2.uniq, sizeof(hid->uniq));
>>
>> If they are NOT NUL terminated, then keep using strncpy() but mark the
>> fields in the struct with the __nonstring attribute.
>
> They are supposed to be NUL terminated, but for compatibility reasons
> we allow them to be not. So I don't think your proposal is safe.

I was originally thinking only about the hid->* strings, so I was
confused by this answer (they appear to always be NUL-terminated).
Then I thought you meant that ev->u.create2.* strings may not be
terminated, but I stayed confused. :)

The original code was:

len = min(sizeof(hid->name), sizeof(ev->u.create2.name)) - 1;
strncpy(hid->name, ev->u.create2.name, len);

If sizeof(hid->name) is smaller than sizeof(ev->u.create2.name), it
made sure that hid->name kept a trailing NUL.

If sizeof(ev->u.create2.name) is smaller than sizeof(hid->name), it
made sure that the last byte of ev->u.create2.name was ignored, and by
definition, hid->name would be NUL-terminated.

So you're converting from a potentially unterminated string into a
terminated string... (ev->u.create2.name maybe needs to be marked
__nonstring?)

The most you can write is sizeof(dest) - 1 but you must not read more
than sizeof(source). So I see that if the destination is smaller than
the source, you cannot represent these conditions correctly to
strscpy(). (And, I would argue, you can't with strncpy() either.)

However, they're all exactly the same size, so none of this matters,
and I think strscpy() would be the most sensible. And maybe you could
enforce the size checking:

BUILD_BUG_ON(sizeof(hid->name) != sizeof(ev->u.create2.name));
strscpy(hid->name, ev->u.create2.name, sizeof(hid->name));

etc...
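
As a rough userspace illustration of the same-size case (not the kernel
implementation), the sketch below models the behaviour relied on above: the
destination is always NUL-terminated and at most `size` bytes of the source
are read. copy_terminated(), struct src_ev, struct dst_dev and NAME_LEN are
hypothetical names, _Static_assert stands in for BUILD_BUG_ON(), and the
helper returns -1 on truncation where the kernel's strscpy() returns -E2BIG.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAME_LEN 8	/* illustrative size only */

/* Hypothetical stand-ins for the event and device structures. */
struct src_ev  { char name[NAME_LEN]; };	/* may lack a trailing NUL */
struct dst_dev { char name[NAME_LEN]; };	/* must always be terminated */

/* Userspace analogue of the BUILD_BUG_ON() size check. */
_Static_assert(sizeof(((struct dst_dev *)0)->name) ==
	       sizeof(((struct src_ev *)0)->name),
	       "name fields must have the same size");

/*
 * Copy at most size - 1 bytes and always terminate dst.
 * Returns the number of bytes copied, or -1 if src did not fit.
 * Never reads more than size bytes from src.
 */
static long copy_terminated(char *dst, const char *src, size_t size)
{
	size_t i;

	assert(dst != NULL && src != NULL);

	if (size == 0)
		return -1;

	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';

	return src[i] == '\0' ? (long)i : -1;
}
```

With equally sized fields an unterminated source simply loses its last byte
and the caller can see the truncation from the return value.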

-- 
Kees Cook

[PATCH v17 23/23] selftests/x86: Add a selftest for SGX

2018-11-15 Thread Jarkko Sakkinen

Add a selftest for SGX. It is a trivial test where a simple enclave
copies one 64-bit word of memory between two memory locations given to
the enclave as arguments.

Signed-off-by: Jarkko Sakkinen 
---
 tools/testing/selftests/x86/Makefile  |  10 +
 tools/testing/selftests/x86/sgx/Makefile  |  47 ++
 tools/testing/selftests/x86/sgx/encl.c|  20 +
 tools/testing/selftests/x86/sgx/encl.lds  |  33 ++
 .../selftests/x86/sgx/encl_bootstrap.S|  94 
 tools/testing/selftests/x86/sgx/encl_piggy.S  |  16 +
 tools/testing/selftests/x86/sgx/encl_piggy.h  |  13 +
 .../testing/selftests/x86/sgx/sgx-selftest.c  | 149 ++
 tools/testing/selftests/x86/sgx/sgx_arch.h| 109 
 tools/testing/selftests/x86/sgx/sgx_call.S|  20 +
 tools/testing/selftests/x86/sgx/sgx_uapi.h| 100 
 tools/testing/selftests/x86/sgx/sgxsign.c | 503 ++
 .../testing/selftests/x86/sgx/signing_key.pem |  39 ++
 13 files changed, 1153 insertions(+)
 create mode 100644 tools/testing/selftests/x86/sgx/Makefile
 create mode 100644 tools/testing/selftests/x86/sgx/encl.c
 create mode 100644 tools/testing/selftests/x86/sgx/encl.lds
 create mode 100644 tools/testing/selftests/x86/sgx/encl_bootstrap.S
 create mode 100644 tools/testing/selftests/x86/sgx/encl_piggy.S
 create mode 100644 tools/testing/selftests/x86/sgx/encl_piggy.h
 create mode 100644 tools/testing/selftests/x86/sgx/sgx-selftest.c
 create mode 100644 tools/testing/selftests/x86/sgx/sgx_arch.h
 create mode 100644 tools/testing/selftests/x86/sgx/sgx_call.S
 create mode 100644 tools/testing/selftests/x86/sgx/sgx_uapi.h
 create mode 100644 tools/testing/selftests/x86/sgx/sgxsign.c
 create mode 100644 tools/testing/selftests/x86/sgx/signing_key.pem

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 186520198de7..4fc9a42f56ea 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -1,4 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
+
+SUBDIRS_64 := sgx
+
 all:
 
 include ../lib.mk
@@ -67,6 +70,13 @@ all_32: $(BINARIES_32)
 
 all_64: $(BINARIES_64)
 
+all_64: $(SUBDIRS_64)
+   @for DIR in $(SUBDIRS_64); do   \
+   BUILD_TARGET=$(OUTPUT)/$$DIR;   \
+   mkdir $$BUILD_TARGET  -p;   \
+   make OUTPUT=$$BUILD_TARGET -C $$DIR $@; \
+   done
+
 EXTRA_CLEAN := $(BINARIES_32) $(BINARIES_64)
 
 $(BINARIES_32): $(OUTPUT)/%_32: %.c
diff --git a/tools/testing/selftests/x86/sgx/Makefile b/tools/testing/selftests/x86/sgx/Makefile
new file mode 100644
index ..04004d244de4
--- /dev/null
+++ b/tools/testing/selftests/x86/sgx/Makefile
@@ -0,0 +1,47 @@
+top_srcdir = ../../../../..
+
+include ../../lib.mk
+
+HOST_CFLAGS := -Wall -Werror -g
+ENCL_CFLAGS := -Wall -Werror -static -nostdlib -nostartfiles -fPIC \
+  -fno-stack-protector -mrdrnd
+
+TEST_CUSTOM_PROGS := $(OUTPUT)/sgx-selftest
+all_64: $(TEST_CUSTOM_PROGS)
+
+$(TEST_CUSTOM_PROGS): $(OUTPUT)/sgx-selftest.o $(OUTPUT)/sgx_call.o \
+ $(OUTPUT)/encl_piggy.o
+   $(CC) $(HOST_CFLAGS) -o $@ $^
+
+$(OUTPUT)/sgx-selftest.o: sgx-selftest.c
+   $(CC) $(HOST_CFLAGS) -c $< -o $@
+
+$(OUTPUT)/sgx_call.o: sgx_call.S
+   $(CC) $(HOST_CFLAGS) -c $< -o $@
+
+$(OUTPUT)/encl_piggy.o: $(OUTPUT)/encl.bin $(OUTPUT)/encl.ss
+
+$(OUTPUT)/encl.bin: $(OUTPUT)/encl.elf $(OUTPUT)/sgxsign
+   objcopy --remove-section=.got.plt -O binary $< $@
+
+$(OUTPUT)/encl.elf: $(OUTPUT)/encl.o $(OUTPUT)/encl_bootstrap.o
+   $(CC) $(ENCL_CFLAGS) -T encl.lds -o $@ $^
+
+$(OUTPUT)/encl.o: encl.c
+   $(CC) $(ENCL_CFLAGS) -c $< -o $@
+
+$(OUTPUT)/encl_bootstrap.o: encl_bootstrap.S
+   $(CC) $(ENCL_CFLAGS) -c $< -o $@
+
+$(OUTPUT)/encl.ss: $(OUTPUT)/encl.bin  $(OUTPUT)/sgxsign
+   $(OUTPUT)/sgxsign signing_key.pem $(OUTPUT)/encl.bin $(OUTPUT)/encl.ss
+
+$(OUTPUT)/sgxsign: sgxsign.c
+   $(CC) -o $@ $< -lcrypto
+
+EXTRA_CLEAN := $(OUTPUT)/sgx-selftest $(OUTPUT)/sgx-selftest.o \
+  $(OUTPUT)/sgx_call.o $(OUTPUT)/encl.bin $(OUTPUT)/encl.ss \
+  $(OUTPUT)/encl.elf $(OUTPUT)/encl.o $(OUTPUT)/encl_bootstrap.o \
+  $(OUTPUT)/sgxsign
+
+.PHONY: clean
diff --git a/tools/testing/selftests/x86/sgx/encl.c b/tools/testing/selftests/x86/sgx/encl.c
new file mode 100644
index ..eb6aa318d3f1
--- /dev/null
+++ b/tools/testing/selftests/x86/sgx/encl.c
@@ -0,0 +1,20 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+// Copyright(c) 2016-18 Intel Corporation.
+
+#include 
+#include "sgx_arch.h"
+
+static void *memcpy(void *dest, const void *src, size_t n)
+{
+   size_t i;
+
+   for (i = 0; i < n; i++)
+   ((char *)dest)[i] = ((char *)src)[i];
+
+   return dest;
+}
+
+void encl_body(void *rdi, void *rsi)
+{
+   memcpy(rsi, rdi, 8);
+}
diff --git a/tools/testing/selftests/x86/sgx/encl.lds b/tools/testing/selftests/x86/sgx/encl.lds
new file mo

[PATCH v17 21/23] platform/x86: ptrace() support for the SGX driver

2018-11-15 Thread Jarkko Sakkinen

Add VMA callbacks for ptrace() that can be used with debug enclaves.
With debug enclaves, memory can be read and written a word at a time
by using the ENCLS(EDBGRD) and ENCLS(EDBGWR) leaf instructions.

Signed-off-by: Jarkko Sakkinen 
---
 drivers/platform/x86/intel_sgx/sgx_vma.c | 109 +++
 1 file changed, 109 insertions(+)

diff --git a/drivers/platform/x86/intel_sgx/sgx_vma.c b/drivers/platform/x86/intel_sgx/sgx_vma.c
index cc0993b4fd40..df604e4d0d0a 100644
--- a/drivers/platform/x86/intel_sgx/sgx_vma.c
+++ b/drivers/platform/x86/intel_sgx/sgx_vma.c
@@ -51,8 +51,117 @@ static int sgx_vma_fault(struct vm_fault *vmf)
return VM_FAULT_SIGBUS;
 }
 
+static int sgx_edbgrd(struct sgx_encl *encl, struct sgx_encl_page *page,
+ unsigned long addr, void *data)
+{
+   unsigned long offset;
+   int ret;
+
+   offset = addr & ~PAGE_MASK;
+
+   if ((page->desc & SGX_ENCL_PAGE_TCS) &&
+   offset > offsetof(struct sgx_tcs, gs_limit))
+   return -ECANCELED;
+
+   ret = __edbgrd(sgx_epc_addr(page->epc_page) + offset, data);
+   if (ret) {
+   sgx_dbg(encl, "EDBGRD returned %d\n", ret);
+   return encls_to_err(ret);
+   }
+
+   return 0;
+}
+
+static int sgx_edbgwr(struct sgx_encl *encl, struct sgx_encl_page *page,
+ unsigned long addr, void *data)
+{
+   unsigned long offset;
+   int ret;
+
+   offset = addr & ~PAGE_MASK;
+
+   /* Writing anything else than flags will cause #GP */
+   if ((page->desc & SGX_ENCL_PAGE_TCS) &&
+   offset != offsetof(struct sgx_tcs, flags))
+   return -ECANCELED;
+
+   ret = __edbgwr(sgx_epc_addr(page->epc_page) + offset, data);
+   if (ret) {
+   sgx_dbg(encl, "EDBGWR returned %d\n", ret);
+   return encls_to_err(ret);
+   }
+
+   return 0;
+}
+
+static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
+ void *buf, int len, int write)
+{
+   struct sgx_encl *encl = vma->vm_private_data;
+   struct sgx_encl_page *entry = NULL;
+   unsigned long align;
+   char data[sizeof(unsigned long)];
+   int offset;
+   int cnt;
+   int ret = 0;
+   int i;
+
+   /* If process was forked, VMA is still there but vm_private_data is set
+* to NULL.
+*/
+   if (!encl)
+   return -EFAULT;
+
+   if (!(encl->flags & SGX_ENCL_DEBUG) ||
+   !(encl->flags & SGX_ENCL_INITIALIZED) ||
+   (encl->flags & SGX_ENCL_DEAD))
+   return -EFAULT;
+
+   for (i = 0; i < len; i += cnt) {
+   if (!entry || !((addr + i) & (PAGE_SIZE - 1))) {
+   if (entry)
+   entry->desc &= ~SGX_ENCL_PAGE_RESERVED;
+
+   entry = sgx_fault_page(vma, (addr + i) & PAGE_MASK,
+  true);
+   if (IS_ERR(entry)) {
+   ret = PTR_ERR(entry);
+   entry = NULL;
+   break;
+   }
+   }
+
+   /* Locking is not needed because only immutable fields of the
+* page are accessed and page itself is reserved so that it
+* cannot be swapped out in the middle.
+*/
+
+   align = ALIGN_DOWN(addr + i, sizeof(unsigned long));
+   offset = (addr + i) & (sizeof(unsigned long) - 1);
+   cnt = sizeof(unsigned long) - offset;
+   cnt = min(cnt, len - i);
+
+   ret = sgx_edbgrd(encl, entry, align, data);
+   if (ret)
+   break;
+   if (write) {
+   memcpy(data + offset, buf + i, cnt);
+   ret = sgx_edbgwr(encl, entry, align, data);
+   if (ret)
+   break;
+   } else
+   memcpy(buf + i, data + offset, cnt);
+   }
+
+   if (entry)
+   entry->desc &= ~SGX_ENCL_PAGE_RESERVED;
+
+   return ret < 0 ? ret : i;
+}
+
 const struct vm_operations_struct sgx_vm_ops = {
.close = sgx_vma_close,
.open = sgx_vma_open,
.fault = sgx_vma_fault,
+   .access = sgx_vma_access,
 };
-- 
2.19.1
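
The word-at-a-time loop in sgx_vma_access() can be easier to follow with a
small userspace sketch of just the address arithmetic. This is only an
illustration under stated assumptions: split_access(), struct chunk and
WORD_SIZE are hypothetical names, and the word size is fixed at 8 bytes
(sizeof(unsigned long) on x86-64) rather than taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed word size: 8 bytes, i.e. sizeof(unsigned long) on x86-64. */
#define WORD_SIZE 8UL

struct chunk {
	unsigned long align;	/* word-aligned address that gets read/written */
	size_t offset;		/* first byte of interest inside that word */
	size_t cnt;		/* number of bytes taken from that word */
};

/* Split [addr, addr + len) into per-word pieces; returns the piece count. */
static size_t split_access(unsigned long addr, size_t len,
			   struct chunk *out, size_t max)
{
	size_t i = 0, n = 0;

	assert(out != NULL);

	while (i < len && n < max) {
		unsigned long cur = addr + i;
		size_t offset = cur & (WORD_SIZE - 1);
		size_t cnt = WORD_SIZE - offset;

		/* Do not run past the end of the requested range. */
		if (cnt > len - i)
			cnt = len - i;

		out[n].align = cur - offset;	/* same as ALIGN_DOWN(cur, 8) */
		out[n].offset = offset;
		out[n].cnt = cnt;

		i += cnt;
		n++;
	}

	return n;
}
```

For example, a 10-byte access at 0x1005 becomes two pieces: 3 bytes at offset
5 of the word at 0x1000, then 7 bytes at offset 0 of the word at 0x1008. For
a write, each piece is a read-modify-write of the whole word.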

[PATCH v17 20/23] x86/sgx: Add a simple swapper for the EPC memory manager

2018-11-15 Thread Jarkko Sakkinen

Wire up the EPC manager's reclaim flow to the SGX driver's swapping
functionality.  In the long term there will be multiple users of the
EPC manager, e.g. SGX driver and KVM, thus the interface between the
EPC manager and the driver is fairly genericized and decoupled.  But
to avoid adding unused infrastructure, do not add any indirection
between the EPC manager and the SGX driver.  This has the unfortunate
and odd side effect of preventing the SGX driver from being compiled
as a loadable module.  However, this should be a temporary situation
that is remedied when a second user of EPC is added, i.e. KVM.

The swapper thread ksgxswapd starts reclaiming pages when the number
of free EPC pages goes below %SGX_NR_LOW_PAGES and continues until it
reaches %SGX_NR_HIGH_PAGES.
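
As a simplified illustration of that low/high watermark behaviour (not the
code in this patch), the sketch below uses the constants from enum
sgx_swap_constants and assumes, purely for the sake of the example, that every
reclaim round frees exactly SGX_NR_TO_SCAN pages. The function names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values mirror enum sgx_swap_constants in the patch. */
#define NR_TO_SCAN	16UL
#define NR_LOW_PAGES	32UL
#define NR_HIGH_PAGES	64UL

/* The swapper has work to do only once free pages drop below the low mark. */
static bool should_reclaim(unsigned long nr_free)
{
	return nr_free < NR_LOW_PAGES;
}

/*
 * Hypothetical model: once woken, keep reclaiming until the high mark is
 * reached. Assumes each round frees exactly NR_TO_SCAN pages, which real
 * reclaim does not guarantee. Returns the resulting free page count.
 */
static unsigned long reclaim_until_high(unsigned long nr_free,
					unsigned int *rounds)
{
	assert(rounds != NULL);

	*rounds = 0;
	if (!should_reclaim(nr_free))
		return nr_free;

	while (nr_free < NR_HIGH_PAGES) {
		nr_free += NR_TO_SCAN;
		(*rounds)++;
	}

	return nr_free;
}
```

The gap between the two watermarks is what keeps the thread from being woken
again after every single allocation.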

Pages are reclaimed in LRU fashion from a global list. The consumers
take care of calling EBLOCK (block page from new accesses), ETRACK
(restart counting the entering hardware threads) and EWB (write page to
the regular memory) because executing these operations usually (if not
always) requires some subsystem-internal locking.

Signed-off-by: Jarkko Sakkinen 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Co-developed-by: Serge Ayoun 
Signed-off-by: Serge Ayoun 
Co-developed-by: Shay Katz-zamir 
Signed-off-by: Shay Katz-zamir 
---
 arch/x86/Kconfig  |   1 +
 arch/x86/include/asm/sgx.h|  10 +-
 arch/x86/kernel/cpu/intel_sgx.c   | 241 +-
 drivers/platform/x86/intel_sgx/sgx_encl.c |   5 +-
 .../platform/x86/intel_sgx/sgx_encl_page.c|   3 +-
 drivers/platform/x86/intel_sgx/sgx_fault.c|   3 +-
 drivers/platform/x86/intel_sgx/sgx_util.c |   2 +-
 7 files changed, 245 insertions(+), 20 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4c3a325351ce..5d38e30d9563 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1922,6 +1922,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
 config INTEL_SGX_CORE
bool "Intel SGX core functionality"
depends on X86_64 && CPU_SUP_INTEL
+   select INTEL_SGX
help
Intel Software Guard eXtensions (SGX) CPU feature that allows ring 3
applications to create enclaves: private regions of memory that are
diff --git a/arch/x86/include/asm/sgx.h b/arch/x86/include/asm/sgx.h
index 4cdfa1d22c6e..afe1291a17d5 100644
--- a/arch/x86/include/asm/sgx.h
+++ b/arch/x86/include/asm/sgx.h
@@ -12,6 +12,7 @@
 
 struct sgx_epc_page {
unsigned long desc;
+   void *owner;
struct list_head list;
 };
 
@@ -307,10 +308,17 @@ static inline int __emodt(struct sgx_secinfo *secinfo, void *addr)
return __encls_ret_2(SGX_EMODT, secinfo, addr);
 }
 
-struct sgx_epc_page *sgx_alloc_page(void);
+struct sgx_epc_page *sgx_alloc_page(void *owner, bool reclaim);
 int __sgx_free_page(struct sgx_epc_page *page);
 void sgx_free_page(struct sgx_epc_page *page);
 int sgx_einit(struct sgx_sigstruct *sigstruct, struct sgx_einittoken *token,
  struct sgx_epc_page *secs, u64 *lepubkeyhash);
+void sgx_page_reclaimable(struct sgx_epc_page *page);
+
+bool sgx_encl_page_get(struct sgx_epc_page *epc_page);
+void sgx_encl_page_put(struct sgx_epc_page *epc_page);
+bool sgx_encl_page_reclaim(struct sgx_epc_page *epc_page);
+void sgx_encl_page_block(struct sgx_epc_page *epc_page);
+void sgx_encl_page_write(struct sgx_epc_page *epc_page);
 
 #endif /* _ASM_X86_SGX_H */
diff --git a/arch/x86/kernel/cpu/intel_sgx.c b/arch/x86/kernel/cpu/intel_sgx.c
index 0e5fc8fc6b0d..929192d9557e 100644
--- a/arch/x86/kernel/cpu/intel_sgx.c
+++ b/arch/x86/kernel/cpu/intel_sgx.c
@@ -10,24 +10,141 @@
 #include 
 #include 
 
+/**
+ * enum sgx_swap_constants - the constants used by the swapping code
+ * %SGX_NR_TO_SCAN:the number of pages to scan in a single round
+ * %SGX_NR_LOW_PAGES:  the low watermark for ksgxswapd when it starts to swap
+ * pages.
+ * %SGX_NR_HIGH_PAGES: the high watermark for ksgxswapd when it stops swapping
+ * pages.
+ */
+enum sgx_swap_constants {
+   SGX_NR_TO_SCAN  = 16,
+   SGX_NR_LOW_PAGES= 32,
+   SGX_NR_HIGH_PAGES   = 64,
+};
+
 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 EXPORT_SYMBOL_GPL(sgx_epc_sections);
 
 static int sgx_nr_epc_sections;
+static LIST_HEAD(sgx_active_page_list);
+static DEFINE_SPINLOCK(sgx_active_page_list_lock);
+static struct task_struct *ksgxswapd_tsk;
+static DECLARE_WAIT_QUEUE_HEAD(ksgxswapd_waitq);
 
 /* A per-cpu cache for the last known values of IA32_SGXLEPUBKEYHASHx MSRs. */
 static DEFINE_PER_CPU(u64 [4], sgx_lepubkeyhash_cache);
 
 /**
- * sgx_alloc_page - Allocate an EPC page
- *
- * Try to grab a page from the free EPC page list.
+ * sgx_reclaim_pages - reclaim EPC pages from the consumers
  *
- * Return:
- *   a pointer to a &struct sgx_epc_page instance,
- *   -errno on error
+ * Takes a fixed chunk of pages

[PATCH v17 19/23] platform/x86: sgx: Add swapping functionality to the Intel SGX driver

2018-11-15 Thread Jarkko Sakkinen

Because the kernel is untrusted, swapping pages in/out of the Enclave
Page Cache (EPC) has specialized requirements:

* The kernel cannot directly access EPC memory, i.e. cannot copy data
  to/from the EPC.
* To evict a page from the EPC, the kernel must "prove" to hardware that
  there are no valid TLB entries for said page since a stale TLB entry would
  allow an attacker to bypass SGX access controls.
* When loading a page back into the EPC, hardware must be able to verify
  the integrity and freshness of the data.
* When loading an enclave page, e.g. regular pages and Thread Control
  Structures (TCS), hardware must be able to associate the page with a
  Secure Enclave Control Structure (SECS).

To satisfy the above requirements, the CPU provides dedicated ENCLS
functions to support paging data in/out of the EPC:

* EBLOCK:   Mark a page as blocked in the EPC Map (EPCM).  Attempting
to access a blocked page that misses the TLB will fault.
* ETRACK:   Activate blocking tracking.  Hardware verifies that all
translations for pages marked as "blocked" have been flushed
from the TLB.
* EPA:  Add version array page to the EPC.  As the name suggests, a
VA page is a 512-entry array of version numbers that are
used to uniquely identify pages evicted from the EPC.
* EWB:  Write back a page from EPC to memory, e.g. RAM.  Software
must supply a VA slot, memory to hold the Paging Crypto
Metadata (PCMD) of the page and, obviously, backing for the
evicted page.
* ELD{B,U}: Load a page in {un}blocked state from memory to EPC.  The
driver only uses the ELDU variant as there is no use case
for loading a page as "blocked" in a bare metal environment.
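
The VA page bookkeeping that EPA and EWB imply can be modelled in plain
C.  This is only a sketch: it assumes a 4 KiB VA page split into 8-byte
slots (which is what a 512-entry array works out to), and the
va_page_model type and va_slot_*() helpers are hypothetical names, not
the driver's API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VA_PAGE_SIZE   4096u
#define VA_SLOT_SIZE   8u                             /* one version number */
#define VA_SLOT_COUNT  (VA_PAGE_SIZE / VA_SLOT_SIZE)  /* 512 slots */

/* Hypothetical bookkeeping for one VA page: one "in use" bit per slot. */
struct va_page_model {
	uint64_t used[VA_SLOT_COUNT / 64];
};

void va_page_init(struct va_page_model *va)
{
	memset(va->used, 0, sizeof(va->used));
}

/* Return the byte offset of a free slot inside the VA page, or -1 if full. */
int va_slot_alloc(struct va_page_model *va)
{
	for (unsigned int slot = 0; slot < VA_SLOT_COUNT; slot++) {
		uint64_t bit = 1ull << (slot % 64);

		if (!(va->used[slot / 64] & bit)) {
			va->used[slot / 64] |= bit;
			return (int)(slot * VA_SLOT_SIZE);
		}
	}
	return -1;
}

/* Release the slot at @offset, e.g. after the page was loaded back. */
void va_slot_free(struct va_page_model *va, unsigned int offset)
{
	unsigned int slot = offset / VA_SLOT_SIZE;

	assert(offset % VA_SLOT_SIZE == 0 && slot < VA_SLOT_COUNT);
	va->used[slot / 64] &= ~(1ull << (slot % 64));
}
```

In this model every evicted page holds one slot offset for as long as
it lives outside the EPC, so a single VA page can back at most 512
evicted pages.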

To top things off, all of the above ENCLS functions are subject to
strict concurrency rules, e.g. many operations will #GP fault if two
or more operations attempt to access common pages/structures.

To put it succinctly, paging in/out of the EPC requires coordinating
with the SGX driver where all of an enclave's tracking resides.  But,
simply shoving all reclaim logic into the driver is not desirable as
doing so has unwanted long term implications:

* Oversubscribing EPC to KVM guests, i.e. virtualizing SGX in KVM and
  swapping a guest's EPC pages (without the guest's cooperation) needs
  the same high level flows for reclaim but has painfully different
  semantics in the details.
* Accounting EPC, i.e. adding an EPC cgroup controller, is desirable
  as EPC is effectively a specialized memory type and even more scarce
  than system memory.  Providing a single touchpoint for EPC accounting
  regardless of end consumer greatly simplifies the EPC controller.
* Allowing the userspace-facing driver to be built as a loadable module
  is desirable, e.g. for debug, testing and development.  The cgroup
  infrastructure does not support dependencies on loadable modules.
* Separating EPC swapping from the driver once it has been tightly
  coupled to the driver is non-trivial (speaking from experience).

So, although the SGX driver is currently the sole consumer of EPC,
encapsulate EPC swapping in the driver to minimize the dependencies
between the core SGX code and driver, and do so in a way that can be
extended to an abstracted interface with minimal effort.

To that end, add functions to swap EPC pages to the driver.  The user
of these functions will be the core SGX subsystem, which will be enabled
in a future patch.

* sgx_encl_page_{get,put}() - Attempt to pin/unpin (the owner of) an EPC
  page so that it can be operated on by a reclaimer.
* sgx_encl_page_reclaim()   - Mark a page as being reclaimed. The
  page is considered reclaimable if it hasn't been accessed recently and
  it isn't reserved by the driver for other use.
* sgx_encl_page_block() - EBLOCK an EPC page
* sgx_encl_page_write() - Evict an EPC page to the regular memory via
  EWB.  Activates ETRACK (via sgx_encl_track()) if necessary.
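
As a rough illustration of how a reclaimer might chain these helpers,
here is a userspace model.  The model_page fields and the model_*()
names are hypothetical, and the get -> reclaim -> block -> write -> put
ordering is an assumption drawn from the descriptions above, not a copy
of the core code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an enclave page and its owner's state. */
struct model_page {
	bool owner_alive;  /* pinning fails once the owner is gone */
	bool accessed;     /* "recently used" hint */
	bool reserved;     /* held by the driver for other use */
	bool reclaiming;
	bool blocked;
	bool evicted;
	int pins;
};

bool model_page_get(struct model_page *p)
{
	if (!p->owner_alive)
		return false;
	p->pins++;
	return true;
}

void model_page_put(struct model_page *p)
{
	assert(p->pins > 0);
	p->pins--;
}

/* Assumption: a recently accessed page gets a second chance. */
bool model_page_reclaim(struct model_page *p)
{
	if (p->accessed || p->reserved) {
		p->accessed = false;
		return false;
	}
	p->reclaiming = true;
	return true;
}

void model_page_block(struct model_page *p)
{
	assert(p->reclaiming);
	p->blocked = true;          /* models EBLOCK */
}

void model_page_write(struct model_page *p)
{
	assert(p->blocked);         /* EWB only after EBLOCK (and ETRACK) */
	p->evicted = true;
	p->reclaiming = false;
}

/* Try to evict one page; returns true if it was written out. */
bool model_reclaim_one(struct model_page *p)
{
	bool done = false;

	if (!model_page_get(p))
		return false;
	if (model_page_reclaim(p)) {
		model_page_block(p);
		model_page_write(p);
		done = true;
	}
	model_page_put(p);
	return done;
}
```

The pin taken by model_page_get() is dropped on every path, which is
the property a real reclaimer would also need so that a skipped page
does not keep its owner alive.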

Since we also need to be able to fault pages back into the EPC, add a
page fault handler to allocate an EPC page and ELDU a previously evicted
page.

Signed-off-by: Jarkko Sakkinen 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Co-developed-by: Serge Ayoun 
Signed-off-by: Serge Ayoun 
Co-developed-by: Shay Katz-zamir 
Signed-off-by: Shay Katz-zamir 
---
 drivers/platform/x86/intel_sgx/Makefile   |   2 +
 drivers/platform/x86/intel_sgx/sgx.h  |  32 +++
 drivers/platform/x86/intel_sgx/sgx_encl.c | 194 +-
 .../platform/x86/intel_sgx/sgx_encl_page.c| 179 
 drivers/platform/x86/intel_sgx/sgx_fault.c| 108 ++
 drivers/platform/x86/intel_sgx/sgx_util.c |  71 +++
 drivers/platform/x86/intel_sgx/sgx_vma.c  |  15 ++
 7 files changed, 600 insertions(+), 1 deletion(-)
 create mode 100644 drivers/platform/x86/intel_sgx/sgx_encl_page.c
 create mode 100644 drivers/platform

[PATCH v17 18/23] platform/x86: Intel SGX driver

2018-11-15 Thread Jarkko Sakkinen

Intel Software Guard eXtensions (SGX) is a set of CPU instructions that
can be used by applications to set aside private regions of code and
data. CPU access control prevents code outside the enclave from
accessing the memory inside the enclave.

The SGX driver provides an ioctl API for loading and initializing
enclaves. The address range for an enclave is reserved with mmap() and
the enclave is destroyed with munmap(). Enclave construction,
measurement and initialization are done with the provided ioctl API.
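
For orientation, a userspace sketch of how a request for the add-page
ioctl could be filled in.  The structures and ioctl numbers mirror the
uapi header in this patch, with uint64_t/uint16_t standing in for
__u64/__u16; model_add_page_req() and MODEL_MRMASK_ALL are hypothetical
helpers, and no device node is opened because its path is not part of
this description.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>

#define SGX_MAGIC 0xA4

struct sgx_enclave_create {
	uint64_t src;
};

struct sgx_enclave_add_page {
	uint64_t addr;
	uint64_t src;
	uint64_t secinfo;
	uint16_t mrmask;
} __attribute__((__packed__));

struct sgx_enclave_init {
	uint64_t addr;
	uint64_t sigstruct;
};

#define SGX_IOC_ENCLAVE_CREATE \
	_IOW(SGX_MAGIC, 0x00, struct sgx_enclave_create)
#define SGX_IOC_ENCLAVE_ADD_PAGE \
	_IOW(SGX_MAGIC, 0x01, struct sgx_enclave_add_page)
#define SGX_IOC_ENCLAVE_INIT \
	_IOW(SGX_MAGIC, 0x02, struct sgx_enclave_init)

/* A 4096-byte page has 16 chunks of 256 bytes: one mask bit per chunk. */
#define MODEL_MRMASK_ALL 0xFFFF

/* Build the argument for SGX_IOC_ENCLAVE_ADD_PAGE (hypothetical helper). */
struct sgx_enclave_add_page model_add_page_req(uint64_t addr, const void *src,
					       const void *secinfo,
					       uint16_t mrmask)
{
	struct sgx_enclave_add_page req;

	assert((addr & 0xfff) == 0);  /* target address must be page aligned */
	req.addr = addr;
	req.src = (uint64_t)(uintptr_t)src;
	req.secinfo = (uint64_t)(uintptr_t)secinfo;
	req.mrmask = mrmask;
	return req;
}
```

A caller would pass the returned structure by pointer to ioctl() on the
driver's file descriptor, once per page, between the create and init
requests.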

Signed-off-by: Jarkko Sakkinen 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Co-developed-by: Serge Ayoun 
Signed-off-by: Serge Ayoun 
Co-developed-by: Shay Katz-zamir 
Signed-off-by: Shay Katz-zamir 
Co-developed-by: Suresh Siddha 
Signed-off-by: Suresh Siddha 
---
 arch/x86/include/uapi/asm/sgx.h|  59 ++
 drivers/platform/x86/Kconfig   |   2 +
 drivers/platform/x86/Makefile  |   1 +
 drivers/platform/x86/intel_sgx/Kconfig |  20 +
 drivers/platform/x86/intel_sgx/Makefile|  12 +
 drivers/platform/x86/intel_sgx/sgx.h   | 180 +
 drivers/platform/x86/intel_sgx/sgx_encl.c  | 784 +
 drivers/platform/x86/intel_sgx/sgx_ioctl.c | 234 ++
 drivers/platform/x86/intel_sgx/sgx_main.c  | 267 +++
 drivers/platform/x86/intel_sgx/sgx_util.c  |  85 +++
 drivers/platform/x86/intel_sgx/sgx_vma.c   |  43 ++
 11 files changed, 1687 insertions(+)
 create mode 100644 arch/x86/include/uapi/asm/sgx.h
 create mode 100644 drivers/platform/x86/intel_sgx/Kconfig
 create mode 100644 drivers/platform/x86/intel_sgx/Makefile
 create mode 100644 drivers/platform/x86/intel_sgx/sgx.h
 create mode 100644 drivers/platform/x86/intel_sgx/sgx_encl.c
 create mode 100644 drivers/platform/x86/intel_sgx/sgx_ioctl.c
 create mode 100644 drivers/platform/x86/intel_sgx/sgx_main.c
 create mode 100644 drivers/platform/x86/intel_sgx/sgx_util.c
 create mode 100644 drivers/platform/x86/intel_sgx/sgx_vma.c

diff --git a/arch/x86/include/uapi/asm/sgx.h b/arch/x86/include/uapi/asm/sgx.h
new file mode 100644
index ..aadf9c76e360
--- /dev/null
+++ b/arch/x86/include/uapi/asm/sgx.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/**
+ * Copyright(c) 2016-18 Intel Corporation.
+ */
+#ifndef _UAPI_ASM_X86_SGX_H
+#define _UAPI_ASM_X86_SGX_H
+
+#include 
+#include 
+
+#define SGX_MAGIC 0xA4
+
+#define SGX_IOC_ENCLAVE_CREATE \
+   _IOW(SGX_MAGIC, 0x00, struct sgx_enclave_create)
+#define SGX_IOC_ENCLAVE_ADD_PAGE \
+   _IOW(SGX_MAGIC, 0x01, struct sgx_enclave_add_page)
+#define SGX_IOC_ENCLAVE_INIT \
+   _IOW(SGX_MAGIC, 0x02, struct sgx_enclave_init)
+
+/* IOCTL return values */
+#define SGX_POWER_LOST_ENCLAVE 0x4000
+
+/**
+ * struct sgx_enclave_create - parameter structure for the
+ * %SGX_IOC_ENCLAVE_CREATE ioctl
+ * @src:   address for the SECS page data
+ */
+struct sgx_enclave_create  {
+   __u64   src;
+};
+
+/**
+ * struct sgx_enclave_add_page - parameter structure for the
+ *   %SGX_IOC_ENCLAVE_ADD_PAGE ioctl
+ * @addr:  address within the ELRANGE
+ * @src:   address for the page data
+ * @secinfo:   address for the SECINFO data
+ * @mrmask:bitmask for the measured 256 byte chunks
+ */
+struct sgx_enclave_add_page {
+   __u64   addr;
+   __u64   src;
+   __u64   secinfo;
+   __u16   mrmask;
+} __attribute__((__packed__));
+
+
+/**
+ * struct sgx_enclave_init - parameter structure for the
+ *   %SGX_IOC_ENCLAVE_INIT ioctl
+ * @addr:  address within the ELRANGE
+ * @sigstruct: address for the SIGSTRUCT data
+ */
+struct sgx_enclave_init {
+   __u64   addr;
+   __u64   sigstruct;
+};
+
+#endif /* _UAPI_ASM_X86_SGX_H */
diff --git a/drivers/platform/x86/Kconfig b/drivers/platform/x86/Kconfig
index 54f6a40c75c6..e7c8d7898434 100644
--- a/drivers/platform/x86/Kconfig
+++ b/drivers/platform/x86/Kconfig
@@ -1288,6 +1288,8 @@ config INTEL_ATOMISP2_PM
  To compile this driver as a module, choose M here: the module
  will be called intel_atomisp2_pm.
 
+source "drivers/platform/x86/intel_sgx/Kconfig"
+
 endif # X86_PLATFORM_DEVICES
 
 config PMC_ATOM
diff --git a/drivers/platform/x86/Makefile b/drivers/platform/x86/Makefile
index 39ae94135406..a826ab3d7987 100644
--- a/drivers/platform/x86/Makefile
+++ b/drivers/platform/x86/Makefile
@@ -96,3 +96,4 @@ obj-$(CONFIG_INTEL_TURBO_MAX_3) += intel_turbo_max_3.o
 obj-$(CONFIG_INTEL_CHTDC_TI_PWRBTN)+= intel_chtdc_ti_pwrbtn.o
 obj-$(CONFIG_I2C_MULTI_INSTANTIATE)+= i2c-multi-instantiate.o
 obj-$(CONFIG_INTEL_ATOMISP2_PM)+= intel_atomisp2_pm.o
+obj-$(CONFIG_INTEL_SGX) += intel_sgx/
diff --git a/drivers/platform/x86/intel_sgx/Kconfig b/drivers/platform/x86/intel_sgx/Kconfig
new file mode 100644
index ..7d22d44acce9
--- /dev/null
+++ b/drivers/platform/x86/intel_sgx/Kconfig
@@ -0,0 +1,20 @@
+#

[PATCH v17 15/23] x86/sgx: Enumerate and track EPC sections

2018-11-15 Thread Jarkko Sakkinen

From: Sean Christopherson 

Enumerate Enclave Page Cache (EPC) sections via CPUID and add the data
structures necessary to track EPC pages so that they can be allocated,
freed and managed.  As a system may have multiple EPC sections, invoke
CPUID on SGX sub-leafs until an invalid leaf is encountered.

On NUMA systems, a node can have at most one bank. A bank can be part
of at most two nodes.  SGX supports both nodes with a single memory
controller and also sub-cluster nodes with several memory controllers
on a single die.

For simplicity, support a maximum of eight EPC sections.  Current
client hardware supports only a single section, while upcoming server
hardware will support at most eight sections.  Bounding the number of
sections also allows the section ID to be embedded along with a page's
offset in a single unsigned long, enabling easy retrieval of both the
VA and PA for a given page.
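
The packing described above can be sketched in userspace as follows.
This is a model under stated assumptions: 4 KiB pages, a three-bit
section field as the text suggests, and hypothetical epc_desc_*() and
epc_section_model names; the real masks are the ones in the diff below.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT    12
#define MODEL_PAGE_MASK     (~((1ul << MODEL_PAGE_SHIFT) - 1))
#define MODEL_SECTION_MASK  0x7ul  /* three bits: up to eight sections */

struct epc_section_model {
	unsigned long pa;  /* physical base of the section */
	uintptr_t va;      /* base of the section's kernel mapping */
};

/* Fold a page-aligned physical address and a section ID into one word. */
unsigned long epc_desc_pack(unsigned long page_pa, unsigned int section_id)
{
	assert((page_pa & ~MODEL_PAGE_MASK) == 0);
	assert(section_id <= MODEL_SECTION_MASK);
	return page_pa | section_id;
}

unsigned int epc_desc_section(unsigned long desc)
{
	return (unsigned int)(desc & MODEL_SECTION_MASK);
}

unsigned long epc_desc_pa(unsigned long desc)
{
	return desc & MODEL_PAGE_MASK;
}

/* VA = section VA base + (page PA - section PA base). */
uintptr_t epc_desc_va(const struct epc_section_model *sections,
		      unsigned long desc)
{
	const struct epc_section_model *s = &sections[epc_desc_section(desc)];

	return s->va + (epc_desc_pa(desc) - s->pa);
}
```

Because a page-aligned address always has its low 12 bits clear, the
section ID (and any flag bits) can live there without widening the
descriptor.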

Signed-off-by: Sean Christopherson 
Co-developed-by: Jarkko Sakkinen 
Signed-off-by: Jarkko Sakkinen 
Co-developed-by: Suresh Siddha 
Signed-off-by: Suresh Siddha 
Co-developed-by: Serge Ayoun 
Signed-off-by: Serge Ayoun 
---
 arch/x86/Kconfig|  17 
 arch/x86/include/asm/sgx.h  |  58 +
 arch/x86/kernel/cpu/Makefile|   1 +
 arch/x86/kernel/cpu/intel_sgx.c | 146 
 4 files changed, 222 insertions(+)
 create mode 100644 arch/x86/kernel/cpu/intel_sgx.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9d734f3c8234..4c3a325351ce 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1919,6 +1919,23 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
 
  If unsure, say y.
 
+config INTEL_SGX_CORE
+   bool "Intel SGX core functionality"
+   depends on X86_64 && CPU_SUP_INTEL
+   help
+   Intel Software Guard eXtensions (SGX) CPU feature that allows ring 3
+   applications to create enclaves: private regions of memory that are
+   architecturally protected from unauthorized access and/or modification.
+
+   This option enables kernel recognition of SGX, high-level management
+   of the Enclave Page Cache (EPC), tracking and writing of SGX Launch
+   Enclave Hash MSRs, and allows for virtualization of SGX via KVM. By
+   itself, this option does not provide SGX support to userspace.
+
+   For details, see Documentation/x86/intel_sgx.rst
+
+   If unsure, say N.
+
 config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/include/asm/sgx.h b/arch/x86/include/asm/sgx.h
index 3d5ba1d23dfb..efe3e213e582 100644
--- a/arch/x86/include/asm/sgx.h
+++ b/arch/x86/include/asm/sgx.h
@@ -2,9 +2,67 @@
 #ifndef _ASM_X86_SGX_H
 #define _ASM_X86_SGX_H
 
+#include 
+#include 
+#include 
+#include 
+#include 
 #include 
 #include 
 
+struct sgx_epc_page {
+   unsigned long desc;
+   struct list_head list;
+};
+
+/**
+ * struct sgx_epc_section
+ *
+ * The firmware can define multiple chunks of EPC in different areas of the
+ * physical memory, e.g. for the memory areas of each node. This structure is
+ * used to store EPC pages for one EPC section and virtual memory area where
+ * the pages have been mapped.
+ */
+struct sgx_epc_section {
+   unsigned long pa;
+   void *va;
+   struct sgx_epc_page **pages;
+   unsigned long free_cnt;
+   spinlock_t lock;
+};
+
+#define SGX_MAX_EPC_SECTIONS   8
+
+extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
+
+/**
+ * enum sgx_epc_page_desc - bits and masks for an EPC page's descriptor
+ * %SGX_EPC_SECTION_MASK:  SGX allows to have multiple EPC sections in the
+ * physical memory. The existing and near-future
+ * hardware defines at most eight sections, hence
+ * three bits to hold a section.
+ * %SGX_EPC_PAGE_RECLAIMABLE:  The page is reclaimable. Used when freeing
+ * a page to know that we also need to remove the
+ * page from the list of reclaimable pages.
+ */
+enum sgx_epc_page_desc {
+   SGX_EPC_SECTION_MASK= GENMASK_ULL(3, 0),
+   SGX_EPC_PAGE_RECLAIMABLE= BIT(4),
+   /* bits 12-63 are reserved for the physical page address of the page */
+};
+
+static inline struct sgx_epc_section *sgx_epc_section(struct sgx_epc_page *page)
+{
+   return &sgx_epc_sections[page->desc & SGX_EPC_SECTION_MASK];
+}
+
+static inline void *sgx_epc_addr(struct sgx_epc_page *page)
+{
+   struct sgx_epc_section *section = sgx_epc_section(page);
+
+   return section->va + (page->desc & PAGE_MASK) - section->pa;
+}
+
 /**
  * ENCLS_FAULT_FLAG - flag signifying an ENCLS return code is a trapnr
  *
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 1f5d2291c31e..b496c9360b88 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -39,6 +39,7 @@ obj-$(CONFIG_C

[PATCH v17 17/23] x86/sgx: Add sgx_einit() for initializing enclaves

2018-11-15 Thread Jarkko Sakkinen

From: Sean Christopherson 

Add a helper function to perform ENCLS[EINIT] with the correct LE
hash MSR values.  ENCLS[EINIT] initializes an enclave, verifying the
enclave's measurement and preparing it for execution, i.e. the enclave
cannot be run until it has been initialized.  The measurement aspect
of EINIT references the MSR_IA32_SGXLEPUBKEYHASH* MSRs: the CPU
compares the key (technically its hash) used to sign the enclave[1]
with the key hash stored in the MSRs, and will reject EINIT if the
keys do not match.

A per-cpu cache is used to avoid writing the MSRs as writing the MSRs
is extraordinarily expensive, e.g. 300-400 cycles per MSR.  Because
the cache may become stale, force update the MSRs and retry EINIT if
the first EINIT fails due to an "invalid token".  An invalid token
error does not necessarily mean the MSRs need to be updated, but the
cost of an unnecessary write is minimal relative to the cost of EINIT
itself.

[1] For EINIT's purposes, the effective signer of the enclave may be
the enclave's owner, or a separate Launch Enclave that has created
an EINIT token for the target enclave.  When using an EINIT token,
the key used to sign the token must match the MSRs in order for
EINIT to succeed.
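
The cache-then-retry behaviour can be modelled without touching real
MSRs.  In this sketch the cpu_model structure, the model_*() functions
and the MODEL_INVALID_TOKEN value are all hypothetical stand-ins; only
the control flow is meant to match the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_INVALID_TOKEN 16  /* placeholder error value for the model */

struct cpu_model {
	uint64_t msr[4];      /* what the "hardware" currently holds */
	uint64_t cache[4];    /* what software believes it last wrote */
	unsigned int writes;  /* number of simulated MSR writes */
};

/* Write only the MSRs whose cached value differs, unless @enforce is set. */
void model_update_hash(struct cpu_model *cpu, const uint64_t *hash,
		       bool enforce)
{
	for (int i = 0; i < 4; i++) {
		if (enforce || hash[i] != cpu->cache[i]) {
			cpu->msr[i] = hash[i];
			cpu->cache[i] = hash[i];
			cpu->writes++;
		}
	}
}

/* Stand-in for EINIT: succeeds only if the MSRs match the signer hash. */
static int model_einit_hw(const struct cpu_model *cpu, const uint64_t *hash)
{
	for (int i = 0; i < 4; i++)
		if (cpu->msr[i] != hash[i])
			return MODEL_INVALID_TOKEN;
	return 0;
}

/* Try with the cache first; on "invalid token" rewrite all MSRs and retry. */
int model_einit(struct cpu_model *cpu, const uint64_t *hash)
{
	int ret;

	assert(cpu && hash);
	model_update_hash(cpu, hash, false);
	ret = model_einit_hw(cpu, hash);
	if (ret == MODEL_INVALID_TOKEN) {
		model_update_hash(cpu, hash, true);
		ret = model_einit_hw(cpu, hash);
	}
	return ret;
}
```

With a warm cache the common case costs no MSR writes at all; a stale
cache costs one failed EINIT plus four writes.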

Signed-off-by: Sean Christopherson 
Co-developed-by: Jarkko Sakkinen 
Signed-off-by: Jarkko Sakkinen 
---
 arch/x86/include/asm/sgx.h  |  2 ++
 arch/x86/kernel/cpu/intel_sgx.c | 50 +
 2 files changed, 52 insertions(+)

diff --git a/arch/x86/include/asm/sgx.h b/arch/x86/include/asm/sgx.h
index 372fc378018b..4cdfa1d22c6e 100644
--- a/arch/x86/include/asm/sgx.h
+++ b/arch/x86/include/asm/sgx.h
@@ -310,5 +310,7 @@ static inline int __emodt(struct sgx_secinfo *secinfo, void *addr)
 struct sgx_epc_page *sgx_alloc_page(void);
 int __sgx_free_page(struct sgx_epc_page *page);
 void sgx_free_page(struct sgx_epc_page *page);
+int sgx_einit(struct sgx_sigstruct *sigstruct, struct sgx_einittoken *token,
+ struct sgx_epc_page *secs, u64 *lepubkeyhash);
 
 #endif /* _ASM_X86_SGX_H */
diff --git a/arch/x86/kernel/cpu/intel_sgx.c b/arch/x86/kernel/cpu/intel_sgx.c
index 59750a5df629..0e5fc8fc6b0d 100644
--- a/arch/x86/kernel/cpu/intel_sgx.c
+++ b/arch/x86/kernel/cpu/intel_sgx.c
@@ -15,6 +15,9 @@ EXPORT_SYMBOL_GPL(sgx_epc_sections);
 
 static int sgx_nr_epc_sections;
 
+/* A per-cpu cache for the last known values of IA32_SGXLEPUBKEYHASHx MSRs. */
+static DEFINE_PER_CPU(u64 [4], sgx_lepubkeyhash_cache);
+
 /**
  * sgx_alloc_page - Allocate an EPC page
  *
@@ -91,6 +94,53 @@ void sgx_free_page(struct sgx_epc_page *page)
 }
 EXPORT_SYMBOL_GPL(sgx_free_page);
 
+static void sgx_update_lepubkeyhash_msrs(u64 *lepubkeyhash, bool enforce)
+{
+   u64 __percpu *cache;
+   int i;
+
+   cache = per_cpu(sgx_lepubkeyhash_cache, smp_processor_id());
+   for (i = 0; i < 4; i++) {
+   if (enforce || (lepubkeyhash[i] != cache[i])) {
+   wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i, lepubkeyhash[i]);
+   cache[i] = lepubkeyhash[i];
+   }
+   }
+}
+
+/**
+ * sgx_einit - initialize an enclave
+ * @sigstruct: a pointer to a SIGSTRUCT
+ * @token: a pointer to an EINITTOKEN (optional)
+ * @secs:  a pointer to a SECS
+ * @lepubkeyhash:  the desired value for IA32_SGXLEPUBKEYHASHx MSRs
+ *
+ * Execute ENCLS[EINIT], writing the IA32_SGXLEPUBKEYHASHx MSRs according
+ * to @lepubkeyhash (if possible and necessary).
+ *
+ * Return:
+ *   0 on success,
+ *   -errno or SGX error on failure
+ */
+int sgx_einit(struct sgx_sigstruct *sigstruct, struct sgx_einittoken *token,
+ struct sgx_epc_page *secs, u64 *lepubkeyhash)
+{
+   int ret;
+
+   if (!boot_cpu_has(X86_FEATURE_SGX_LC))
+   return __einit(sigstruct, token, sgx_epc_addr(secs));
+
+   preempt_disable();
+   sgx_update_lepubkeyhash_msrs(lepubkeyhash, false);
+   ret = __einit(sigstruct, token, sgx_epc_addr(secs));
+   if (ret == SGX_INVALID_EINITTOKEN) {
+   sgx_update_lepubkeyhash_msrs(lepubkeyhash, true);
+   ret = __einit(sigstruct, token, sgx_epc_addr(secs));
+   }
+   preempt_enable();
+   return ret;
+}
+EXPORT_SYMBOL(sgx_einit);
 
 static __init void sgx_free_epc_section(struct sgx_epc_section *section)
 {
-- 
2.19.1

[PATCH v17 16/23] x86/sgx: Add functions to allocate and free EPC pages

2018-11-15 Thread Jarkko Sakkinen

At this time there is no support for reclaiming pages prior to the
owner explicitly freeing the page.  As for freeing pages, because
freeing a page is expected to succeed in the vast majority of cases
and because most call sites will not be equipped to handle failure,
provide a variant for freeing a page that warns on failure, e.g. due
to ENCLS[EREMOVE] failing.
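
The allocator amounts to a per-section stack of free pages.  A minimal
userspace model, with locking and EREMOVE left out and with
hypothetical page_model/section_model types, might look like this:

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_SECTION_PAGES 4

struct page_model {
	int id;
};

struct section_model {
	struct page_model *pages[MODEL_SECTION_PAGES];
	size_t free_cnt;  /* pages[0..free_cnt-1] are free */
};

/* Pop a page from the first section that has one; NULL if none is left. */
struct page_model *model_alloc_page(struct section_model *sections, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		struct section_model *s = &sections[i];

		if (s->free_cnt)
			return s->pages[--s->free_cnt];
	}
	return NULL;
}

/* Push a page back onto its section's free stack. */
void model_free_page(struct section_model *s, struct page_model *page)
{
	assert(s->free_cnt < MODEL_SECTION_PAGES);
	s->pages[s->free_cnt++] = page;
}
```

The model returns NULL on exhaustion for brevity, whereas the function
in this patch returns ERR_PTR(-ENOMEM).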

Signed-off-by: Jarkko Sakkinen 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/sgx.h  |  4 ++
 arch/x86/kernel/cpu/intel_sgx.c | 77 +
 2 files changed, 81 insertions(+)

diff --git a/arch/x86/include/asm/sgx.h b/arch/x86/include/asm/sgx.h
index efe3e213e582..372fc378018b 100644
--- a/arch/x86/include/asm/sgx.h
+++ b/arch/x86/include/asm/sgx.h
@@ -307,4 +307,8 @@ static inline int __emodt(struct sgx_secinfo *secinfo, void *addr)
return __encls_ret_2(SGX_EMODT, secinfo, addr);
 }
 
+struct sgx_epc_page *sgx_alloc_page(void);
+int __sgx_free_page(struct sgx_epc_page *page);
+void sgx_free_page(struct sgx_epc_page *page);
+
 #endif /* _ASM_X86_SGX_H */
diff --git a/arch/x86/kernel/cpu/intel_sgx.c b/arch/x86/kernel/cpu/intel_sgx.c
index bfdf907c5d94..59750a5df629 100644
--- a/arch/x86/kernel/cpu/intel_sgx.c
+++ b/arch/x86/kernel/cpu/intel_sgx.c
@@ -15,6 +15,83 @@ EXPORT_SYMBOL_GPL(sgx_epc_sections);
 
 static int sgx_nr_epc_sections;
 
+/**
+ * sgx_alloc_page - Allocate an EPC page
+ *
+ * Try to grab a page from the free EPC page list.
+ *
+ * Return:
+ *   a pointer to a &struct sgx_epc_page instance,
+ *   -errno on error
+ */
+struct sgx_epc_page *sgx_alloc_page(void)
+{
+   struct sgx_epc_section *section;
+   struct sgx_epc_page *page = NULL;
+   int i;
+
+   for (i = 0; i < sgx_nr_epc_sections; i++) {
+   section = &sgx_epc_sections[i];
+   spin_lock(&section->lock);
+   if (section->free_cnt) {
+   page = section->pages[section->free_cnt - 1];
+   section->free_cnt--;
+   }
+   spin_unlock(&section->lock);
+
+   if (page)
+   return page;
+   }
+
+   return ERR_PTR(-ENOMEM);
+}
+EXPORT_SYMBOL_GPL(sgx_alloc_page);
+
+/**
+ * __sgx_free_page - Free an EPC page
+ * @page:  pointer to a previously allocated EPC page
+ *
+ * EREMOVE an EPC page and insert it back to the list of free pages.
+ *
+ * Return:
+ *   0 on success
+ *   SGX error code if EREMOVE fails
+ */
+int __sgx_free_page(struct sgx_epc_page *page)
+{
+   struct sgx_epc_section *section = sgx_epc_section(page);
+   int ret;
+
+   ret = __eremove(sgx_epc_addr(page));
+   if (ret)
+   return ret;
+
+   spin_lock(&section->lock);
+   section->pages[section->free_cnt++] = page;
+   spin_unlock(&section->lock);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(__sgx_free_page);
+
+/**
+ * sgx_free_page - Free an EPC page and WARN on failure
+ * @page:  pointer to a previously allocated EPC page
+ *
+ * EREMOVE an EPC page and insert it back to the list of free pages, and WARN
+ * if EREMOVE fails.  For use when the call site cannot (or chooses not to)
+ * handle failure, i.e. the page is leaked on failure.
+ */
+void sgx_free_page(struct sgx_epc_page *page)
+{
+   int ret;
+
+   ret = __sgx_free_page(page);
+   WARN(ret > 0, "sgx: EREMOVE returned %d (0x%x)", ret, ret);
+}
+EXPORT_SYMBOL_GPL(sgx_free_page);
+
+
 static __init void sgx_free_epc_section(struct sgx_epc_section *section)
 {
int i;
-- 
2.19.1

[PATCH v17 14/23] x86/sgx: Add wrappers for ENCLS leaf functions

2018-11-15 Thread Jarkko Sakkinen

ENCLS is an umbrella instruction for a variety of cpl0 SGX functions.
The ENCLS function that is executed is specified in EAX, with each
function potentially having more leaf-specific operands beyond EAX.
ENCLS introduces its own (positive value) error codes that (some)
leafs use to return failure information in EAX.  Leafs that return
an error code also modify RFLAGS.  And finally, ENCLS generates
ENCLS-specific non-fatal #GPs and #PFs, i.e. a bug-free kernel may
encounter faults on ENCLS that must be handled gracefully.

Because of the complexity involved in encoding ENCLS and handling its
assortment of failure paths, executing any given leaf is not a simple
matter of emitting ENCLS.

To enable adding support for ENCLS leafs with minimal fuss, add a
two-layer macro system along with an encoding scheme to allow wrappers
to return trap numbers along with ENCLS-specific error codes.  The bottom
layer of the macro system splits between the leafs that return an
error code and those that do not.  The second layer generates the
correct input/output annotations based on the number of operands for
each leaf function.
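
To make the encoding scheme concrete, here is a small userspace model
of how a single int can carry a trap number, an SGX error code or a
negative errno.  It assumes the fault flag is bit 30, as the header
comment in this patch states; the model_*() names and the model_kind
enum are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_FAULT_FLAG (1 << 30)  /* bit 30 marks "this is a trap number" */

/* Tag a trap number (e.g. 13 for #GP, 14 for #PF) as an ENCLS fault. */
int model_encode_trap(int trapnr)
{
	assert(trapnr >= 0 && trapnr < MODEL_FAULT_FLAG);
	return trapnr | MODEL_FAULT_FLAG;
}

/* Negative errnos also have bit 30 set, hence the sign check. */
bool model_is_fault(int ret)
{
	return ret > 0 && (ret & MODEL_FAULT_FLAG);
}

int model_trapnr(int ret)
{
	return ret & ~MODEL_FAULT_FLAG;
}

enum model_kind { MODEL_OK, MODEL_FAULT, MODEL_SGX_ERROR, MODEL_ERRNO };

/* Sort a return value into one of the three error classes, or success. */
enum model_kind model_classify(int ret)
{
	if (ret == 0)
		return MODEL_OK;
	if (ret < 0)
		return MODEL_ERRNO;
	return model_is_fault(ret) ? MODEL_FAULT : MODEL_SGX_ERROR;
}
```

The scheme relies on SGX error codes being small positive values, so
they can never collide with the flag bit.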

Signed-off-by: Jarkko Sakkinen 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/sgx.h | 252 +
 1 file changed, 252 insertions(+)
 create mode 100644 arch/x86/include/asm/sgx.h

diff --git a/arch/x86/include/asm/sgx.h b/arch/x86/include/asm/sgx.h
new file mode 100644
index ..3d5ba1d23dfb
--- /dev/null
+++ b/arch/x86/include/asm/sgx.h
@@ -0,0 +1,252 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+#ifndef _ASM_X86_SGX_H
+#define _ASM_X86_SGX_H
+
+#include 
+#include 
+
+/**
+ * ENCLS_FAULT_FLAG - flag signifying an ENCLS return code is a trapnr
+ *
+ * ENCLS has its own (positive value) error codes and also generates
+ * ENCLS specific #GP and #PF faults.  And the ENCLS values get munged
+ * with system error codes as everything percolates back up the stack.
+ * Unfortunately (for us), we need to precisely identify each unique
+ * error code, e.g. the action taken if EWB fails varies based on the
+ * type of fault and on the exact SGX error code, i.e. we can't simply
+ * convert all faults to -EFAULT.
+ *
+ * To make all three error types coexist, we set bit 30 to identify an
+ * ENCLS fault.  Bit 31 (technically bits N:31) is used to differentiate
+ * between positive (faults and SGX error codes) and negative (system
+ * error codes) values.
+ */
+#define ENCLS_FAULT_FLAG 0x40000000
+
+/**
+ * Check for a fault by looking for a positive value with the fault
+ * flag set.  The positive value check is needed to filter out system
+ * error codes since negative values will have all higher order bits
+ * set, including ENCLS_FAULT_FLAG.
+ */
+#define IS_ENCLS_FAULT(r) ((int)(r) > 0 && ((r) & ENCLS_FAULT_FLAG))
+
+/**
+ * Retrieve the encoded trapnr from the specified return code.
+ */
+#define ENCLS_TRAPNR(r) ((r) & ~ENCLS_FAULT_FLAG)
+
+/**
+ * encls_to_err - translate an ENCLS fault or SGX code into a system error code
+ * @ret:   positive value return code
+ *
+ * Translate a positive return code, e.g. from ENCLS, into a system error
+ * code.  Primarily used by functions that cannot return a non-negative
+ * error code, e.g. kernel callbacks.
+ *
+ * Return:
+ * 0 on success,
+ * -errno on failure
+ */
+static inline int encls_to_err(int ret)
+{
+   if (IS_ENCLS_FAULT(ret))
+   return -EFAULT;
+
+   switch (ret) {
+   case SGX_UNMASKED_EVENT:
+   return -EINTR;
+   case SGX_INVALID_SIG_STRUCT:
+   case SGX_INVALID_ATTRIBUTE:
+   case SGX_INVALID_MEASUREMENT:
+   case SGX_INVALID_EINITTOKEN:
+   case SGX_INVALID_CPUSVN:
+   case SGX_INVALID_ISVSVN:
+   case SGX_INVALID_KEYNAME:
+   return -EINVAL;
+   case SGX_ENCLAVE_ACT:
+   case SGX_CHILD_PRESENT:
+   case SGX_ENTRYEPOCH_LOCKED:
+   case SGX_PREV_TRK_INCMPL:
+   case SGX_PAGE_NOT_MODIFIABLE:
+   case SGX_PAGE_NOT_DEBUGGABLE:
+   return -EBUSY;
+   default:
+   return -EIO;
+   };
+}
+
+/**
+ * __encls_ret_N - encode an ENCLS leaf that returns an error code in EAX
+ * @rax:   leaf number
+ * @inputs:asm inputs for the leaf
+ *
+ * Emit assembly for an ENCLS leaf that returns an error code, e.g. EREMOVE.
+ * And because SGX isn't complex enough as it is, leafs that return an error
+ * code also modify flags.
+ *
+ * Return:
+ * 0 on success,
+ * SGX error code on failure
+ */
+#define __encls_ret_N(rax, inputs...)  \
+   ({  \
+   int ret;\
+   asm volatile(   \
+   "1: .byte 0x0f, 0x01, 0xcf;\n\t"\
+   "2:\n"  \
+   ".section .fixup,\"

[PATCH v17 13/23] x86/msr: Add SGX Launch Control MSR definitions

2018-11-15 Thread Jarkko Sakkinen

From: Sean Christopherson 

Add a new IA32_FEATURE_CONTROL bit, SGX_LE_WR.  When set, SGX_LE_WR
allows software to write the SGXLEPUBKEYHASH MSRs (see below).  The
existence of the bit is enumerated by CPUID as X86_FEATURE_SGX_LC.
Like all other flags in IA32_FEATURE_CONTROL, the MSR must be locked
for SGX_LE_WR to take effect.

Add four MSRs, SGXLEPUBKEYHASH{0,1,2,3}, or in human readable form,
the SGX Launch Enclave Public Key Hash MSRs.  These MSRs correspond to
the key that is used by the CPU to determine whether or not to allow
software to enter an enclave.  When ENCLS[EINIT] is executed, which is
a prerequisite to entering the enclave, the CPU compares the key
(technically its hash) used to sign the enclave with the key hash
stored in the MSRs, and will reject EINIT if the keys do not match.

Enclaves can also be blessed by proxy, in which case a Launch Enclave
generates and signs an EINIT TOKEN.  If a valid token is provided,
ENCLS[EINIT] compares the signer of the token against the MSRs instead
of the signer of the enclave.  The SGXLEPUBKEYHASH MSRs only exist on
CPUs that support SGX Launch Control, enumerated by X86_FEATURE_SGX_LC.
CPUs without Launch Control use a hardcoded key for the ENCLS[EINIT]
checks.  An internal hardcoded key is also used as the reset value for
the hash MSRs when they exist.

As a final note, the SGXLEPUBKEYHASH MSRs can also be written by
pre-boot firmware prior to activating SGX (SGX activation is done by
setting bit 0 in MSR 0x7A).  Thus, firmware can lock the MSRs to a
non-Intel value by writing the MSRs and locking IA32_FEATURE_CONTROL
without setting SGX_LE_WR.
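
The interaction between the lock bit and SGX_LE_WR can be summarised
with a small model.  The bit positions are the ones defined in this
patch; the fc_model structure and model_*() helpers are hypothetical,
and a rejected write simply returns false here instead of faulting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FC_LOCKED      (1ull << 0)
#define FC_SGX_LE_WR   (1ull << 17)
#define FC_SGX_ENABLE  (1ull << 18)

/* Hypothetical snapshot of IA32_FEATURE_CONTROL and the four hash MSRs. */
struct fc_model {
	uint64_t feature_control;
	uint64_t hash[4];
};

/* SGX_LE_WR only takes effect once the MSR has been locked. */
bool model_le_hash_writable(const struct fc_model *m)
{
	return (m->feature_control & FC_LOCKED) &&
	       (m->feature_control & FC_SGX_LE_WR);
}

/* Model a post-boot write to one hash MSR; false means "rejected". */
bool model_os_write_hash(struct fc_model *m, unsigned int idx, uint64_t value)
{
	assert(idx < 4);
	if (!model_le_hash_writable(m))
		return false;
	m->hash[idx] = value;
	return true;
}
```

A platform whose firmware locked the register without SGX_LE_WR keeps
whatever hash the firmware programmed, which is the "locked to a
non-Intel value" case described above.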

Signed-off-by: Sean Christopherson 
Co-developed-by: Haim Cohen 
Signed-off-by: Haim Cohen 
Signed-off-by: Jarkko Sakkinen 
---
 arch/x86/include/asm/msr-index.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 082890bff490..9274179a445c 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -487,6 +487,7 @@
 #define FEATURE_CONTROL_LOCKED (1<<0)
 #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX   (1<<1)
 #define FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX  (1<<2)
+#define FEATURE_CONTROL_SGX_LE_WR  (1<<17)
 #define FEATURE_CONTROL_SGX_ENABLE (1<<18)
 #define FEATURE_CONTROL_LMCE   (1<<20)
 
@@ -500,6 +501,12 @@
 #define MSR_IA32_UCODE_WRITE   0x0079
 #define MSR_IA32_UCODE_REV 0x008b
 
+/* Intel SGX Launch Enclave Public Key Hash MSRs */
+#define MSR_IA32_SGXLEPUBKEYHASH0  0x008C
+#define MSR_IA32_SGXLEPUBKEYHASH1  0x008D
+#define MSR_IA32_SGXLEPUBKEYHASH2  0x008E
+#define MSR_IA32_SGXLEPUBKEYHASH3  0x008F
+
 #define MSR_IA32_SMM_MONITOR_CTL   0x009b
 #define MSR_IA32_SMBASE0x009e
 
-- 
2.19.1

[PATCH v17 12/23] x86/sgx: Add definitions for SGX's CPUID leaf and variable sub-leafs

2018-11-15 Thread Jarkko Sakkinen

SGX defines its own CPUID leaf, 0x12, along with a variable number of
sub-leafs.  Sub-leafs 0 and 1 are always available if SGX is supported
and enumerate various SGX features, e.g. instruction sets and enclave
capabilities.  Sub-leafs 2+ are variable, both in their existence and
in what they enumerate.  Bits 3:0 of EAX report the sub-leaf type,
with the remaining bits in EAX, EBX, ECX and EDX being type-specific.
Currently, the only known sub-leaf type enumerates an EPC section.  An
EPC section is simply a range of EPC memory available to software.
The "list" of variable SGX sub-leafs is NULL-terminated, i.e. software
is expected to query CPUID until an invalid sub-leaf is encountered.
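
The enumeration loop can be sketched against a fake CPUID table so it
runs on any machine.  Everything named model_* or MODEL_* here is
hypothetical, as is the table content; only the leaf number, the first
variable sub-leaf and the type field in EAX bits 3:0 come from the
description above.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SGX_CPUID            0x12
#define MODEL_FIRST_VARIABLE_LEAF  2
#define MODEL_TYPE_MASK            0xfu
#define MODEL_TYPE_INVALID         0x0u
#define MODEL_TYPE_EPC_SECTION     0x1u

struct cpuid_regs {
	uint32_t eax, ebx, ecx, edx;
};

/* Caller-supplied CPUID reader, so the loop can be fed test data. */
typedef struct cpuid_regs (*cpuid_fn)(uint32_t leaf, uint32_t subleaf);

/* Made-up sub-leafs 2..4: two EPC sections, then the terminator. */
static const struct cpuid_regs model_table[] = {
	{ .eax = 0x70200001u },
	{ .eax = 0x80000001u },
	{ .eax = 0 },
};

struct cpuid_regs model_fake_cpuid(uint32_t leaf, uint32_t subleaf)
{
	struct cpuid_regs none = {0};
	uint32_t idx = subleaf - MODEL_FIRST_VARIABLE_LEAF;

	if (leaf != MODEL_SGX_CPUID || subleaf < MODEL_FIRST_VARIABLE_LEAF ||
	    idx >= sizeof(model_table) / sizeof(model_table[0]))
		return none;
	return model_table[idx];
}

/* Walk up to @max_subleafs variable sub-leafs, stopping at an invalid one. */
unsigned int model_count_epc_sections(cpuid_fn cpuid, unsigned int max_subleafs)
{
	unsigned int found = 0;

	assert(cpuid);
	for (uint32_t i = 0; i < max_subleafs; i++) {
		struct cpuid_regs r = cpuid(MODEL_SGX_CPUID,
					    i + MODEL_FIRST_VARIABLE_LEAF);
		uint32_t type = r.eax & MODEL_TYPE_MASK;

		if (type == MODEL_TYPE_INVALID)
			break;
		if (type == MODEL_TYPE_EPC_SECTION)
			found++;
		/* Unknown sub-leaf types are skipped, not treated as errors. */
	}
	return found;
}
```

Bounding the walk by a caller-chosen maximum keeps the loop finite even
if a (buggy) CPUID source never reports an invalid sub-leaf.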

Signed-off-by: Jarkko Sakkinen 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/sgx_arch.h | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/include/asm/sgx_arch.h b/arch/x86/include/asm/sgx_arch.h
index d4c57154e6e6..188243e3eee1 100644
--- a/arch/x86/include/asm/sgx_arch.h
+++ b/arch/x86/include/asm/sgx_arch.h
@@ -11,6 +11,21 @@
 #include 
 #include 
 
+#define SGX_CPUID  0x12
+#define SGX_CPUID_FIRST_VARIABLE_SUB_LEAF  2
+
+/**
+ * enum sgx_sub_leaf_types - SGX CPUID variable sub-leaf types
+ * %SGX_CPUID_SUB_LEAF_INVALID: Indicates this sub-leaf is invalid.
+ * %SGX_CPUID_SUB_LEAF_EPC_SECTION:Sub-leaf enumerates an EPC section.
+ * %SGX_CPUID_SUB_LEAF_TYPE_MASK:  Mask for bits containing the type.
+ */
+enum sgx_sub_leaf_types {
+   SGX_CPUID_SUB_LEAF_INVALID  = 0x0,
+   SGX_CPUID_SUB_LEAF_EPC_SECTION  = 0x1,
+   SGX_CPUID_SUB_LEAF_TYPE_MASK= GENMASK(3, 0),
+};
+
 /**
  * enum sgx_encls_leaves - ENCLS leaf functions
  * %SGX_ECREATE:   Create an enclave.
-- 
2.19.1

[PATCH v17 10/23] x86/sgx: Add ENCLS architectural error codes

2018-11-15 Thread Jarkko Sakkinen

The SGX architecture defines an extensive set of error codes that are
used by ENCL{S,U,V} instructions to provide software with (somewhat)
precise error information.  Though they are architectural, define the
known error codes in a separate file from sgx_arch.h so that they can
be exposed to userspace.  For some ENCLS leafs, e.g. EINIT, returning
the exact error code on failure can enable userspace to make informed
decisions when an operation fails.

Signed-off-by: Jarkko Sakkinen 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/sgx_arch.h   |  2 +
 arch/x86/include/uapi/asm/sgx_errno.h | 91 +++
 2 files changed, 93 insertions(+)
 create mode 100644 arch/x86/include/uapi/asm/sgx_errno.h

diff --git a/arch/x86/include/asm/sgx_arch.h b/arch/x86/include/asm/sgx_arch.h
index e068db46835e..6cd572fa95fa 100644
--- a/arch/x86/include/asm/sgx_arch.h
+++ b/arch/x86/include/asm/sgx_arch.h
@@ -8,6 +8,8 @@
 #ifndef _ASM_X86_SGX_ARCH_H
 #define _ASM_X86_SGX_ARCH_H
 
+#include 
+
 /**
  * enum sgx_encls_leaves - ENCLS leaf functions
  * %SGX_ECREATE:   Create an enclave.
diff --git a/arch/x86/include/uapi/asm/sgx_errno.h b/arch/x86/include/uapi/asm/sgx_errno.h
new file mode 100644
index ..48b87aed58d7
--- /dev/null
+++ b/arch/x86/include/uapi/asm/sgx_errno.h
@@ -0,0 +1,91 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
+/*
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * Contains the architecturally defined error codes that are returned by SGX
+ * instructions, e.g. ENCLS, and may be propagated to userspace via errno.
+ */
+
+#ifndef _UAPI_ASM_X86_SGX_ERRNO_H
+#define _UAPI_ASM_X86_SGX_ERRNO_H
+
+/**
+ * enum sgx_return_codes - return codes for ENCLS, ENCLU and ENCLV
+ * %SGX_SUCCESS:   No error.
+ * %SGX_INVALID_SIG_STRUCT:SIGSTRUCT contains an invalid value.
+ * %SGX_INVALID_ATTRIBUTE: Enclave is not attempting to access a resource
+ * for which it is not authorized.
+ * %SGX_BLKSTATE:  EPC page is already blocked.
+ * %SGX_INVALID_MEASUREMENT:   SIGSTRUCT or EINITTOKEN contains an incorrect
+ * measurement.
+ * %SGX_NOTBLOCKABLE:  EPC page type is not one which can be blocked.
+ * %SGX_PG_INVLD:  EPC page is invalid (and cannot be blocked).
+ * %SGX_EPC_PAGE_CONFLICT: EPC page in use by another SGX instruction.
+ * %SGX_INVALID_SIGNATURE: Enclave's signature does not validate with
+ * public key enclosed in SIGSTRUCT.
+ * %SGX_MAC_COMPARE_FAIL:  MAC check failed when reloading EPC page.
+ * %SGX_PAGE_NOT_BLOCKED:  EPC page is not marked as blocked.
+ * %SGX_NOT_TRACKED:   ETRACK has not been completed on the EPC page.
+ * %SGX_VA_SLOT_OCCUPIED:  Version array slot contains a valid entry.
+ * %SGX_CHILD_PRESENT: Enclave has child pages present in the EPC.
+ * %SGX_ENCLAVE_ACT:   Logical processors are currently executing
+ * inside the enclave.
+ * %SGX_ENTRYEPOCH_LOCKED: SECS locked for EPOCH update, i.e. an ETRACK is
+ * currently executing on the SECS.
+ * %SGX_INVALID_EINITTOKEN:EINITTOKEN is invalid and enclave signer's
+ * public key does not match IA32_SGXLEPUBKEYHASH.
+ * %SGX_PREV_TRK_INCMPL:   All processors did not complete the previous
+ * tracking sequence.
+ * %SGX_PG_IS_SECS:Target EPC page is an SECS and cannot be
+ * blocked.
+ * %SGX_PAGE_ATTRIBUTES_MISMATCH:  Attributes of the EPC page do not match
+ * the expected values.
+ * %SGX_PAGE_NOT_MODIFIABLE:   EPC page cannot be modified because it is in
+ * the PENDING or MODIFIED state.
+ * %SGX_PAGE_NOT_DEBUGGABLE:   EPC page cannot be modified because it is in
+ * the PENDING or MODIFIED state.
+ * %SGX_INVALID_COUNTER:   {In,De}crementing a counter would cause it to
+ * {over,under}flow.
+ * %SGX_PG_NONEPC: Target page is not an EPC page.
+ * %SGX_TRACK_NOT_REQUIRED:Target page type does not require tracking.
+ * %SGX_INVALID_CPUSVN: Security version number reported by CPU is less
+ * than what is required by the enclave.
+ * %SGX_INVALID_ISVSVN: Security version number of enclave is less than
+ * what is required by the KEYREQUEST struct.
+ * %SGX_UNMASKED_EVENT: An unmasked event, e.g. INTR, was received
+ * while the instruction was executing.
+ * %SGX_INVALID_KEYNAME:   Requested key is not supported by hardware.
+ */
+enum sgx_return_codes {
+   SGX_SUCCESS = 0,
+   SGX_INVALID_SIG_STRUCT  =

[PATCH v17 11/23] x86/sgx: Add SGX1 and SGX2 architectural data structures

2018-11-15 Thread Jarkko Sakkinen

Define the data structures used by various ENCLS functions needed for
Linux to support all SGX1 and SGX2 ENCLS leaf functions.  This is not
an exhaustive representation of all SGX data structures as several are
only consumed by ENCLU (userspace), e.g. REPORT and KEYREQUEST, while
others are only consumed by future features, e.g. RDINFO.
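
As a sanity check on the SECS layout defined by this patch, the field and
reserved-area sizes add up to exactly one 4096-byte page.  The sketch
below is a userspace mirror of that layout using fixed-width types;
struct sgx_secs_model is a hypothetical name, and the GCC/Clang packed
attribute stands in for the kernel's __packed:

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the SECS layout from the patch (hypothetical name). */
struct sgx_secs_model {
	uint64_t size;
	uint64_t base;
	uint32_t ssa_frame_size;
	uint32_t miscselect;
	uint8_t  reserved1[24];		/* SGX_SECS_RESERVED1_SIZE */
	uint64_t attributes;
	uint64_t xfrm;
	uint32_t mrenclave[8];
	uint8_t  reserved2[32];		/* SGX_SECS_RESERVED2_SIZE */
	uint32_t mrsigner[8];
	uint8_t  reserved3[96];		/* SGX_SECS_RESERVED3_SIZE */
	uint16_t isvprodid;
	uint16_t isvsvn;
	uint8_t  reserved4[3836];	/* SGX_SECS_RESERVED4_SIZE */
} __attribute__((packed));

/* The whole structure must fill exactly one 4 KiB page. */
_Static_assert(sizeof(struct sgx_secs_model) == 4096,
	       "SECS must be exactly one page");
/* Spot-check a few field offsets implied by the sizes above. */
_Static_assert(offsetof(struct sgx_secs_model, attributes) == 48,
	       "attributes follows reserved1");
_Static_assert(offsetof(struct sgx_secs_model, isvprodid) == 256,
	       "isvprodid follows reserved3");
```

Compile-time assertions of this kind catch an accidental change to any of
the reserved sizes before the structure is ever handed to hardware.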

Signed-off-by: Jarkko Sakkinen 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/sgx_arch.h | 329 
 1 file changed, 329 insertions(+)

diff --git a/arch/x86/include/asm/sgx_arch.h b/arch/x86/include/asm/sgx_arch.h
index 6cd572fa95fa..d4c57154e6e6 100644
--- a/arch/x86/include/asm/sgx_arch.h
+++ b/arch/x86/include/asm/sgx_arch.h
@@ -8,6 +8,7 @@
 #ifndef _ASM_X86_SGX_ARCH_H
 #define _ASM_X86_SGX_ARCH_H
 
+#include 
 #include 
 
 /**
@@ -53,4 +54,332 @@ enum sgx_encls_leaves {
SGX_EMODT   = 0x0F,
 };
 
+#define SGX_MODULUS_SIZE 384
+
+/**
+ * enum sgx_miscselect - additional information to an SSA frame
+ * %SGX_MISC_EXINFO:   Report #PF or #GP to the SSA frame.
+ *
+ * Save State Area (SSA) is a stack inside the enclave used to store processor
+ * state when an exception or interrupt occurs. This enum defines additional
+ * information stored to an SSA frame.
+ */
+enum sgx_miscselect {
+   SGX_MISC_EXINFO = BIT(0),
+   SGX_MISC_RESERVED_MASK  = GENMASK_ULL(63, 1)
+};
+
+#define SGX_SSA_GPRS_SIZE  182
+#define SGX_SSA_MISC_EXINFO_SIZE   16
+
+/**
+ * enum sgx_attribute - the attributes field in &struct sgx_secs
+ * %SGX_ATTR_INIT: Enclave can be entered (is initialized).
+ * %SGX_ATTR_DEBUG:Allow ENCLS(EDBGRD) and ENCLS(EDBGWR).
+ * %SGX_ATTR_MODE64BIT: Indicates that this is a 64-bit enclave.
+ * %SGX_ATTR_PROVISIONKEY:  Allow the use of provisioning keys for remote
+ * attestation.
+ * %SGX_ATTR_EINITTOKENKEY: Allow the use of the token signing key that is
+ * used to sign cryptographic tokens that can be passed to
+ * EINIT as an authorization to run an enclave.
+ */
+enum sgx_attribute {
+   SGX_ATTR_INIT   = BIT(0),
+   SGX_ATTR_DEBUG  = BIT(1),
+   SGX_ATTR_MODE64BIT  = BIT(2),
+   SGX_ATTR_PROVISIONKEY   = BIT(4),
+   SGX_ATTR_EINITTOKENKEY  = BIT(5),
+   SGX_ATTR_RESERVED_MASK  = BIT_ULL(3) | GENMASK_ULL(63, 6)
+};
+
+#define SGX_SECS_RESERVED1_SIZE 24
+#define SGX_SECS_RESERVED2_SIZE 32
+#define SGX_SECS_RESERVED3_SIZE 96
+#define SGX_SECS_RESERVED4_SIZE 3836
+
+/**
+ * struct sgx_secs - SGX Enclave Control Structure (SECS)
+ * @size:  size of the address space
+ * @base:  base address of the address space
+ * @ssa_frame_size:size of an SSA frame
+ * @miscselect:additional information stored to an SSA frame
+ * @attributes:attributes for enclave
+ * @xfrm:  XSave-Feature Request Mask (subset of XCR0)
+ * @mrenclave: SHA256-hash of the enclave contents
+ * @mrsigner:  SHA256-hash of the public key used to sign the SIGSTRUCT
+ * @isvprodid: a user-defined value that is used in key derivation
+ * @isvsvn:a user-defined value that is used in key derivation
+ *
+ * SGX Enclave Control Structure (SECS) is a special enclave page that is not
+ * visible in the address space. In fact, this structure defines the address
+ * range and other global attributes for the enclave and it is the first EPC
+ * page created for any enclave. It is moved from a temporary buffer to an EPC
+ * by the means of ENCLS(ECREATE) leaf.
+ */
+struct sgx_secs {
+   u64 size;
+   u64 base;
+   u32 ssa_frame_size;
+   u32 miscselect;
+   u8  reserved1[SGX_SECS_RESERVED1_SIZE];
+   u64 attributes;
+   u64 xfrm;
+   u32 mrenclave[8];
+   u8  reserved2[SGX_SECS_RESERVED2_SIZE];
+   u32 mrsigner[8];
+   u8  reserved3[SGX_SECS_RESERVED3_SIZE];
+   u16 isvprodid;
+   u16 isvsvn;
+   u8  reserved4[SGX_SECS_RESERVED4_SIZE];
+} __packed;
+
+/**
+ * enum sgx_tcs_flags - execution flags for TCS
+ * %SGX_TCS_DBGOPTIN:  If enabled allows single-stepping and breakpoints
+ * inside an enclave. It is cleared by EADD but can
+ * be set later with EDBGWR.
+ */
+enum sgx_tcs_flags {
+   SGX_TCS_DBGOPTIN= 0x01,
+   SGX_TCS_RESERVED_MASK   = GENMASK_ULL(63, 1)
+};
+
+#define SGX_TCS_RESERVED_SIZE 4024
+
+/**
+ * struct sgx_tcs - Thread Control Structure (TCS)
+ * @state: used to mark an entered TCS
+ * @flags: execution flags (cleared by EADD)
+ * @ssa_offset:SSA stack offset relative to the enclave base
+ * @ssa_index: the current SSA frame index (cleared by EADD)
+ * @nr_ssa_frames: the number of frames in the SSA stack
+ * @entry_offset:  entry point offset relative to the

[PATCH v17 09/23] x86/sgx: Define SGX1 and SGX2 ENCLS leafs

2018-11-15 Thread Jarkko Sakkinen

ENCLS, a.k.a. Enclave System instruction, is an umbrella instruction
for a variety of privileged SGX functions.  The ENCLS function to be
executed is specified in EAX, a la GETSEC of SMX/TXT fame.  Leafs may
use additional registers for function-specific operands.  ENCLS also
introduces its own set of error codes that (some) leafs use to return
pass/fail information to software.  Leafs that return an error code
also modify RFLAGS.  And finally, ENCLS generates ENCLS-specific #GPs
and #PFs.

ENCLS leaf functions are organized under SGX sub-features, e.g. SGX1
defines the base ENCLS function set and SGX2 adds ENCLS functions to
enable dynamic EPC management.  At this time, only the SGX1 and SGX2
function sets are supported by Linux; the other published sets relate
to VMM EPC oversubscription, which is far out on the horizon.

Define the ENCLS leafs in a dedicated file as more architecturally
defined SGX constants and data structures will be introduced in short
order.

Signed-off-by: Jarkko Sakkinen 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/sgx_arch.h | 54 +
 1 file changed, 54 insertions(+)
 create mode 100644 arch/x86/include/asm/sgx_arch.h

diff --git a/arch/x86/include/asm/sgx_arch.h b/arch/x86/include/asm/sgx_arch.h
new file mode 100644
index ..e068db46835e
--- /dev/null
+++ b/arch/x86/include/asm/sgx_arch.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/**
+ * Copyright(c) 2016-18 Intel Corporation.
+ *
+ * Contains data structures defined by the SGX architecture.  Data structures
+ * defined by the Linux software stack should not be placed here.
+ */
+#ifndef _ASM_X86_SGX_ARCH_H
+#define _ASM_X86_SGX_ARCH_H
+
+/**
+ * enum sgx_encls_leaves - ENCLS leaf functions
+ * %SGX_ECREATE:   Create an enclave.
+ * %SGX_EADD:  Add a page to an uninitialized enclave.
+ * %SGX_EINIT: Initialize an enclave, i.e. launch an enclave.
+ * %SGX_EREMOVE:   Remove a page from an enclave.
+ * %SGX_EDBGRD: Read a word from an enclave (peek).
+ * %SGX_EDBGWR:Write a word to an enclave (poke).
+ * %SGX_EEXTEND:   Measure 256 bytes of an added enclave page.
+ * %SGX_ELDB:  Load a swapped page in blocked state.
+ * %SGX_ELDU:  Load a swapped page in unblocked state.
+ * %SGX_EBLOCK: Change page state to blocked, i.e. entering hardware
+ * threads cannot access it and create new TLB entries.
+ * %SGX_EPA:   Create a Version Array (VA) page used to store isvsvn
+ * number for a swapped EPC page.
+ * %SGX_EWB:   Swap an enclave page to the regular memory. Checks that
+ * all threads have exited that were in the previous
+ * shoot-down sequence.
+ * %SGX_ETRACK: Start a new shoot-down sequence. Used together with
+ * EBLOCK to make sure that a page is safe to swap.
+ * %SGX_EAUG:  Add a page to an initialized enclave.
+ * %SGX_EMODPR:Restrict an EPC page's permissions.
+ * %SGX_EMODT: Modify the page type of an EPC page.
+ */
+enum sgx_encls_leaves {
+   SGX_ECREATE = 0x00,
+   SGX_EADD= 0x01,
+   SGX_EINIT   = 0x02,
+   SGX_EREMOVE = 0x03,
+   SGX_EDBGRD  = 0x04,
+   SGX_EDBGWR  = 0x05,
+   SGX_EEXTEND = 0x06,
+   SGX_ELDB= 0x07,
+   SGX_ELDU= 0x08,
+   SGX_EBLOCK  = 0x09,
+   SGX_EPA = 0x0A,
+   SGX_EWB = 0x0B,
+   SGX_ETRACK  = 0x0C,
+   SGX_EAUG= 0x0D,
+   SGX_EMODPR  = 0x0E,
+   SGX_EMODT   = 0x0F,
+};
+
+#endif /* _ASM_X86_SGX_ARCH_H */
-- 
2.19.1

[PATCH v17 08/23] x86/mm: x86/sgx: Signal SIGSEGV for userspace #PFs w/ PF_SGX

2018-11-15 Thread Jarkko Sakkinen

From: Sean Christopherson 

The PF_SGX bit is set if and only if the #PF is detected by the SGX
Enclave Page Cache Map (EPCM).  The EPCM is a hardware-managed table
that enforces accesses to an enclave's EPC pages in addition to the
software-managed kernel page tables, i.e. the effective permissions
for an EPC page are a logical AND of the kernel's page tables and
the corresponding EPCM entry.
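
The "logical AND" above can be illustrated with a toy model.  This is
only a sketch: the permission bits, the enum names and classify_access()
are hypothetical and do not correspond to kernel or hardware encodings.
It assumes the ordering described in this message, i.e. the page tables
are checked first and the EPCM second:

```c
/* Hypothetical permission bits for the model (not hardware encodings). */
enum perm {
	PERM_R = 1 << 0,
	PERM_W = 1 << 1,
	PERM_X = 1 << 2,
};

enum access_result {
	ACCESS_OK,		/* both the page tables and the EPCM allow it */
	FAULT_PAGE_TABLE,	/* ordinary #PF, PF_SGX clear */
	FAULT_EPCM,		/* #PF with PF_SGX set */
};

/* Effective permissions are the intersection of the two sources. */
static inline unsigned int effective_perms(unsigned int pte, unsigned int epcm)
{
	return pte & epcm;
}

/*
 * The page tables are consulted first; the EPCM is only reached when the
 * page tables already allow the access.
 */
static inline enum access_result classify_access(unsigned int pte,
						  unsigned int epcm,
						  unsigned int want)
{
	if ((pte & want) != want)
		return FAULT_PAGE_TABLE;
	if ((epcm & want) != want)
		return FAULT_EPCM;
	return ACCESS_OK;
}
```

In this model a PF_SGX fault can only occur when the page tables are more
permissive than the EPCM for the requested access.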

The EPCM is consulted only after an access walks the kernel's page
tables, i.e.:

  a. the access was allowed by the kernel
  b. the kernel's tables have become less restrictive than the EPCM
  c. the kernel cannot fixup the cause of the fault

Notably, (b) implies that either the kernel has botched the EPC
mappings or the EPCM has been invalidated (see below).  Regardless of
why the fault occurred, userspace needs to be alerted so that it can
take appropriate action, e.g. restart the enclave.  This is reinforced
by (c) as the kernel doesn't really have any other reasonable option,
i.e. signalling SIGSEGV is actually the least severe action possible.

Although the primary purpose of the EPCM is to prevent a malicious or
compromised kernel from attacking an enclave, e.g. by modifying the
enclave's page tables, do not WARN on a #PF w/ PF_SGX set.  The SGX
architecture effectively allows the CPU to invalidate all EPCM entries
at will and requires that software be prepared to handle an EPCM fault
at any time.  The architecture defines this behavior because the EPCM
is encrypted with an ephemeral key that isn't exposed to software.  As
such, the EPCM entries cannot be preserved across transitions that
result in a new key being used, e.g. CPU power down as part of an S3
transition or when a VM is live migrated to a new physical system.

Cc: Andy Lutomirski 
Cc: Dave Hansen 
Signed-off-by: Sean Christopherson 
Signed-off-by: Jarkko Sakkinen 
---
 arch/x86/mm/fault.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 71d4b9d4d43f..eb8db2425b5b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1108,6 +1108,19 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
if (error_code & X86_PF_PK)
return 1;
 
+   /*
+* Access is blocked by the Enclave Page Cache Map (EPCM), i.e. the
+* access is allowed by the PTE but not the EPCM.  This usually happens
+* when the EPCM is yanked out from under us, e.g. by hardware after a
+* suspend/resume cycle.  In any case, software, i.e. the kernel, can't
+* fix the source of the fault as the EPCM can't be directly modified
+* by software.  Handle the fault as an access error in order to signal
+* userspace, e.g. so that userspace can rebuild their enclave(s), even
+* though userspace may not have actually violated access permissions.
+*/
+   if (unlikely(error_code & X86_PF_SGX))
+   return 1;
+
/*
 * Make sure to check the VMA so that we do not perform
 * faults just to hit a X86_PF_PK as soon as we fill in a
-- 
2.19.1

[PATCH v17 04/23] x86/msr: Add IA32_FEATURE_CONTROL.SGX_ENABLE definition

2018-11-15 Thread Jarkko Sakkinen

From: Sean Christopherson 

Add a new IA32_FEATURE_CONTROL bit, SGX_ENABLE, which must be set in
order to execute SGX instructions, i.e. ENCL{S,U,V}.  The existence of
the bit is enumerated by CPUID as X86_FEATURE_SGX.  Like all other
flags in IA32_FEATURE_CONTROL, the MSR must be locked for SGX_ENABLE
to take effect.

Signed-off-by: Sean Christopherson 
Signed-off-by: Jarkko Sakkinen 
---
 arch/x86/include/asm/msr-index.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 80f4a4f38c79..082890bff490 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -487,6 +487,7 @@
 #define FEATURE_CONTROL_LOCKED (1<<0)
 #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX   (1<<1)
 #define FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX  (1<<2)
+#define FEATURE_CONTROL_SGX_ENABLE (1<<18)
 #define FEATURE_CONTROL_LMCE   (1<<20)
 
 #define MSR_IA32_APICBASE  0x001b
-- 
2.19.1

[PATCH v17 06/23] x86/cpu/intel: Detect SGX support and update caps appropriately

2018-11-15 Thread Jarkko Sakkinen

From: Sean Christopherson 

Similar to other large Intel features such as VMX and TXT, SGX must be
explicitly enabled in IA32_FEATURE_CONTROL MSR to be truly usable.
Clear all SGX related capabilities if SGX is not fully enabled in
IA32_FEATURE_CONTROL or if the SGX1 instruction set isn't supported
(impossible on bare metal, theoretically possible in a VM if the VMM is
doing something weird).

Like SGX itself, SGX Launch Control must be explicitly enabled via a
flag in IA32_FEATURE_CONTROL. Clear the SGX_LC capability if Launch
Control is not fully enabled (or obviously if SGX itself is disabled).

Note that clearing X86_FEATURE_SGX_LC creates a bit of a conundrum
regarding the SGXLEPUBKEYHASH MSRs, as it may be desirable to read the
MSRs even if they are not writable, e.g. to query the configured key,
but clearing the capability leaves no breadcrumb for discerning whether
or not the MSRs exist.  But, such usage will be rare (KVM is the only
known case at this time) and not performance critical, so it's not
unreasonable to require the use of rdmsr_safe().  Clearing the cap bit
eliminates the need for an additional flag to track whether or not
Launch Control is truly enabled, which is what we care about the vast
majority of the time.
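
The resulting policy can be summarized as a pure function of the MSR
value.  This is a userspace sketch, not the kernel code: the CAP_* flags
and sgx_caps() are hypothetical, the bit positions are the ones defined
earlier in this series, and the model assumes the CPU enumerates both SGX
and SGX_LC to begin with:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as defined in msr-index.h by this series. */
#define FEATURE_CONTROL_LOCKED		(1ULL << 0)
#define FEATURE_CONTROL_SGX_LE_WR	(1ULL << 17)
#define FEATURE_CONTROL_SGX_ENABLE	(1ULL << 18)

/* Hypothetical result flags for the model (not kernel definitions). */
enum {
	CAP_SGX    = 1 << 0,
	CAP_SGX_LC = 1 << 1,
};

/*
 * Return the capabilities that survive detection for a given
 * IA32_FEATURE_CONTROL value, assuming CPUID reported SGX and SGX_LC.
 */
static inline unsigned int sgx_caps(uint64_t fc, bool has_sgx1)
{
	/* An unlocked MSR, a clear enable bit or no SGX1: no SGX at all. */
	if (!(fc & FEATURE_CONTROL_LOCKED) ||
	    !(fc & FEATURE_CONTROL_SGX_ENABLE) ||
	    !has_sgx1)
		return 0;

	/* SGX is usable, but the launch control MSRs are read-only. */
	if (!(fc & FEATURE_CONTROL_SGX_LE_WR))
		return CAP_SGX;

	return CAP_SGX | CAP_SGX_LC;
}
```

A caller that only cares whether Launch Control is writable can test
CAP_SGX_LC alone, which is the point of clearing the capability.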

Signed-off-by: Sean Christopherson 
Co-developed-by: Jarkko Sakkinen 
Signed-off-by: Jarkko Sakkinen 
---
 arch/x86/kernel/cpu/intel.c | 37 +
 1 file changed, 37 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index fc3c07fe7df5..8a20a193d399 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -596,6 +596,40 @@ static void detect_tme(struct cpuinfo_x86 *c)
c->x86_phys_bits -= keyid_bits;
 }
 
+static void detect_sgx(struct cpuinfo_x86 *c)
+{
+   unsigned long long fc;
+
+   rdmsrl(MSR_IA32_FEATURE_CONTROL, fc);
+   if (!(fc & FEATURE_CONTROL_LOCKED)) {
+   pr_err_once("sgx: IA32_FEATURE_CONTROL MSR is not locked\n");
+   goto out_unsupported;
+   }
+
+   if (!(fc & FEATURE_CONTROL_SGX_ENABLE)) {
+   pr_err_once("sgx: not enabled in IA32_FEATURE_CONTROL MSR\n");
+   goto out_unsupported;
+   }
+
+   if (!cpu_has(c, X86_FEATURE_SGX1)) {
+   pr_err_once("sgx: SGX1 instruction set not supported\n");
+   goto out_unsupported;
+   }
+
+   if (!(fc & FEATURE_CONTROL_SGX_LE_WR)) {
+   pr_info_once("sgx: launch control MSRs are not writable\n");
+   goto out_msrs_rdonly;
+   }
+
+   return;
+out_unsupported:
+   setup_clear_cpu_cap(X86_FEATURE_SGX);
+   setup_clear_cpu_cap(X86_FEATURE_SGX1);
+   setup_clear_cpu_cap(X86_FEATURE_SGX2);
+out_msrs_rdonly:
+   setup_clear_cpu_cap(X86_FEATURE_SGX_LC);
+}
+
 static void init_intel_energy_perf(struct cpuinfo_x86 *c)
 {
u64 epb;
@@ -763,6 +797,9 @@ static void init_intel(struct cpuinfo_x86 *c)
if (cpu_has(c, X86_FEATURE_TME))
detect_tme(c);
 
+   if (cpu_has(c, X86_FEATURE_SGX))
+   detect_sgx(c);
+
init_intel_energy_perf(c);
 
init_intel_misc_features(c);
-- 
2.19.1

[PATCH v17 07/23] x86/mm: x86/sgx: Add new 'PF_SGX' page fault error code bit

2018-11-15 Thread Jarkko Sakkinen

From: Sean Christopherson 

The SGX bit is set in the #PF error code if and only if the fault is
detected by the Enclave Page Cache Map (EPCM), a hardware-managed
table that enforces the paging permissions defined by the enclave,
e.g. to prevent the kernel from changing the permissions of an
enclave's page(s).

Cc: Dave Hansen 
Signed-off-by: Sean Christopherson 
Signed-off-by: Jarkko Sakkinen 
---
 arch/x86/include/asm/traps.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 3de69330e6c5..165c93dd700e 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -162,5 +162,6 @@ enum x86_pf_error_code {
X86_PF_RSVD =   1 << 3,
X86_PF_INSTR=   1 << 4,
X86_PF_PK   =   1 << 5,
+   X86_PF_SGX  =   1 << 15,
 };
 #endif /* _ASM_X86_TRAPS_H */
-- 
2.19.1

[PATCH v17 05/23] x86/cpufeatures: Add Intel-defined SGX_LC feature bit

2018-11-15 Thread Jarkko Sakkinen

From: Kai Huang 

X86_FEATURE_SGX_LC reflects whether or not the CPU supports SGX Launch
Control, i.e. enumerates the existence of IA32_FEATURE_CONTROL's
SGX_LE_WR bit and the IA32_SGXLEPUBKEYHASH MSRs.

Signed-off-by: Kai Huang 
Signed-off-by: Jarkko Sakkinen 
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index afdf5f2e13b5..97604c7fdfd2 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -343,6 +343,7 @@
 #define X86_FEATURE_CLDEMOTE   (16*32+25) /* CLDEMOTE instruction */
 #define X86_FEATURE_MOVDIRI    (16*32+27) /* MOVDIRI instruction */
 #define X86_FEATURE_MOVDIR64B  (16*32+28) /* MOVDIR64B instruction */
+#define X86_FEATURE_SGX_LC (16*32+30) /* Software Guard Extensions Launch Control */
 
 /* AMD-defined CPU features, CPUID level 0x80000007 (EBX), word 17 */
 #define X86_FEATURE_OVERFLOW_RECOV (17*32+ 0) /* MCA overflow recovery support */
-- 
2.19.1

[PATCH v17 02/23] x86/cpufeatures: Add Intel-defined SGX feature bit

2018-11-15 Thread Jarkko Sakkinen

From: Kai Huang 

X86_FEATURE_SGX reflects whether or not the CPU supports Intel's
Software Guard eXtensions (SGX).

Signed-off-by: Kai Huang 
Signed-off-by: Jarkko Sakkinen 
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 28c4a502b419..da7fed4939a3 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -236,6 +236,7 @@
 /* Intel-defined CPU features, CPUID level 0x0007:0 (EBX), word 9 */
 #define X86_FEATURE_FSGSBASE   ( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
 #define X86_FEATURE_TSC_ADJUST ( 9*32+ 1) /* TSC adjustment MSR 0x3B */
+#define X86_FEATURE_SGX    ( 9*32+ 2) /* Software Guard Extensions */
 #define X86_FEATURE_BMI1   ( 9*32+ 3) /* 1st group bit manipulation extensions */
 #define X86_FEATURE_HLE    ( 9*32+ 4) /* Hardware Lock Elision */
 #define X86_FEATURE_AVX2   ( 9*32+ 5) /* AVX2 instructions */
-- 
2.19.1

[PATCH v17 03/23] x86/cpufeatures: Add SGX sub-features (as Linux-defined bits)

2018-11-15 Thread Jarkko Sakkinen

From: Sean Christopherson 

CPUID_12_EAX is an Intel-defined feature bits leaf dedicated for SGX
that enumerates the SGX instruction sets that are supported by the
CPU, e.g. SGX1, SGX2, etc...  Because Linux currently only cares about
two bits (SGX1 and SGX2) and there are currently only four documented
bits in total, relocate the bits to Linux-defined word 8 to conserve
space.

But, keep the bit positions identical between the Intel-defined value
and the Linux-defined value, e.g. keep SGX1 at bit 0.  This allows KVM
to use its existing code for probing guest CPUID bits using Linux's
X86_FEATURE_* definitions.  To do so, shift around some existing bits
to effectively reserve bits 0-7 of word 8 for SGX sub-features.
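
To illustrate why the bit positions matter: an X86_FEATURE_* value is
encoded as word*32 + bit, so a feature kept at the same bit in the Linux
word and in CPUID.0x12:EAX can be tested against the raw register with no
translation table.  The helpers below are a hypothetical userspace sketch,
not KVM's actual code:

```c
#include <stdint.h>

/* Encodings from this patch: word 8, bits 0 and 1. */
#define X86_FEATURE_SGX1	(8*32 + 0)
#define X86_FEATURE_SGX2	(8*32 + 1)

/* Split a feature value into its capability word and bit position. */
static inline unsigned int feature_word(unsigned int feature)
{
	return feature / 32;
}

static inline unsigned int feature_bit(unsigned int feature)
{
	return feature % 32;
}

/*
 * Test a raw CPUID.0x12:EAX value using only the Linux-defined feature
 * number.  This works because the Linux bit equals the Intel bit.
 */
static inline int cpuid_12_eax_has(uint32_t eax, unsigned int feature)
{
	return (eax >> feature_bit(feature)) & 1;
}
```

Had SGX1 and SGX2 been placed at arbitrary free bits of word 8, a lookup
of this kind would need an explicit mapping between the two numberings.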

Signed-off-by: Sean Christopherson 
Signed-off-by: Jarkko Sakkinen 
---
 arch/x86/include/asm/cpufeatures.h   | 21 +++--
 arch/x86/kernel/cpu/scattered.c  |  2 ++
 tools/arch/x86/include/asm/cpufeatures.h | 21 +++--
 3 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index da7fed4939a3..afdf5f2e13b5 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -222,12 +222,21 @@
 #define X86_FEATURE_L1TF_PTEINV    ( 7*32+29) /* "" L1TF workaround PTE inversion */
 #define X86_FEATURE_IBRS_ENHANCED  ( 7*32+30) /* Enhanced IBRS */
 
-/* Virtualization flags: Linux defined, word 8 */
-#define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
-#define X86_FEATURE_VNMI   ( 8*32+ 1) /* Intel Virtual NMI */
-#define X86_FEATURE_FLEXPRIORITY   ( 8*32+ 2) /* Intel FlexPriority */
-#define X86_FEATURE_EPT    ( 8*32+ 3) /* Intel Extended Page Table */
-#define X86_FEATURE_VPID   ( 8*32+ 4) /* Intel Virtual Processor ID */
+/*
+ * Scattered Intel features: Linux defined, word 8.
+ *
+ * Note that the bit location of the SGX features is meaningful as KVM expects
+ * the Linux defined bit to match the Intel defined bit, e.g. X86_FEATURE_SGX1
+ * must remain at bit 0, SGX2 at bit 1, etc...
+ */
+#define X86_FEATURE_SGX1   ( 8*32+ 0) /* SGX1 leaf functions */
+#define X86_FEATURE_SGX2   ( 8*32+ 1) /* SGX2 leaf functions */
+
+#define X86_FEATURE_TPR_SHADOW ( 8*32+ 8) /* Intel TPR Shadow */
+#define X86_FEATURE_VNMI   ( 8*32+ 9) /* Intel Virtual NMI */
+#define X86_FEATURE_FLEXPRIORITY   ( 8*32+10) /* Intel FlexPriority */
+#define X86_FEATURE_EPT    ( 8*32+11) /* Intel Extended Page Table */
+#define X86_FEATURE_VPID   ( 8*32+12) /* Intel Virtual Processor ID */
 
 #define X86_FEATURE_VMMCALL    ( 8*32+15) /* Prefer VMMCALL to VMCALL */
 #define X86_FEATURE_XENPV  ( 8*32+16) /* "" Xen paravirtual guest */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 772c219b6889..f7f0970b8f89 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -26,6 +26,8 @@ static const struct cpuid_bit cpuid_bits[] = {
    { X86_FEATURE_CDP_L3,   CPUID_ECX,  2, 0x00000010, 1 },
    { X86_FEATURE_CDP_L2,   CPUID_ECX,  2, 0x00000010, 2 },
    { X86_FEATURE_MBA,  CPUID_EBX,  3, 0x00000010, 0 },
+   { X86_FEATURE_SGX1, CPUID_EAX,  0, 0x00000012, 0 },
+   { X86_FEATURE_SGX2, CPUID_EAX,  1, 0x00000012, 0 },
    { X86_FEATURE_HW_PSTATE,    CPUID_EDX,  7, 0x80000007, 0 },
    { X86_FEATURE_CPB,  CPUID_EDX,  9, 0x80000007, 0 },
    { X86_FEATURE_PROC_FEEDBACK,    CPUID_EDX, 11, 0x80000007, 0 },
diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h
index 89a048c2faec..9cc7628b9845 100644
--- a/tools/arch/x86/include/asm/cpufeatures.h
+++ b/tools/arch/x86/include/asm/cpufeatures.h
@@ -222,12 +222,21 @@
 #define X86_FEATURE_L1TF_PTEINV    ( 7*32+29) /* "" L1TF workaround PTE inversion */
 #define X86_FEATURE_IBRS_ENHANCED  ( 7*32+30) /* Enhanced IBRS */
 
-/* Virtualization flags: Linux defined, word 8 */
-#define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
-#define X86_FEATURE_VNMI   ( 8*32+ 1) /* Intel Virtual NMI */
-#define X86_FEATURE_FLEXPRIORITY   ( 8*32+ 2) /* Intel FlexPriority */
-#define X86_FEATURE_EPT    ( 8*32+ 3) /* Intel Extended Page Table */
-#define X86_FEATURE_VPID   ( 8*32+ 4) /* Intel Virtual Processor ID */
+/*
+ * Scattered Intel features: Linux defined, word 8.
+ *
+ * Note that the bit numbers of the SGX features are meaningful as KVM expects
+ * the Linux defined bit to match the Intel defined bit, e.g. X86_FEATURE_SGX1
+ * must remain at bit 0, SGX2 at bit 1, etc...
+ */
+#define X86_FEATURE_SGX1   ( 8*32+ 0) /* SGX1 leaf functions */
+#define X86_FEATURE_SGX2

[PATCH v17 01/23] x86/sgx: Update MAINTAINERS

2018-11-15 Thread Jarkko Sakkinen

Add the maintainer information for the SGX subsystem.

Signed-off-by: Jarkko Sakkinen 
---
 MAINTAINERS | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 0abecc528dac..aaf56b544858 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7676,6 +7676,13 @@ L:   linux-g...@vger.kernel.org
 S: Maintained
 F: drivers/gpio/gpio-intel-mid.c
 
+INTEL SGX
+M: Jarkko Sakkinen 
+L: linux-...@vger.kernel.org
+Q: https://patchwork.kernel.org/project/intel-sgx/list/
+F: drivers/platform/x86/intel_sgx/
+K: \bSGX_
+
 INVENSENSE MPU-3050 GYROSCOPE DRIVER
 M: Linus Walleij 
 L: linux-...@vger.kernel.org
-- 
2.19.1

Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks

2018-11-15 Thread Dave Hansen

On 11/15/18 4:21 PM, Li, Aubrey wrote:
> On 2018/11/15 23:40, Dave Hansen wrote:
>> On 11/14/18 3:00 PM, Aubrey Li wrote:
>>> AVX-512 component has 3 states, only Hi16_ZMM state causes notable
>>> frequency drop. Add per task Hi16_ZMM state tracking to context switch.
>>
>> Just curious, but is there any public documentation of this?  It seems
>> really odd to me that something using the same AVX-512 instructions on
>> some low-numbered registers would behave differently than the same
>> instructions on some high-numbered registers.  I'm not saying this is
>> wrong, but it's certainly counter-intuitive and I think that begs for
>> some more explanation.
> 
> Yes, Intel 64 and IA-32 Architectures software developer's Manual mentioned
> this in performance event CORE_POWER.LVL2_TURBO_LICENSE.
> 
> "Core cycles where the core was running with power delivery for license
> level 2 (introduced in Skylake Server microarchitecture). This includes
> high current AVX 512-bit instructions."
> 
> I translated license level 2 to frequency drop.

OK, but that talks about AVX 512 and not specifically about Hi16_ZMM's
impact which is what this patch measures.  Are the Hi16_ZMM intricacies
documented anywhere?

Re: [RFC PATCH 2/5] mm: lower the printk loglevel for __dump_page messages

2018-11-15 Thread Baoquan He

On 11/07/18 at 11:18am, Michal Hocko wrote:
> From: Michal Hocko 
> 
> __dump_page messages use KERN_EMERG resp. KERN_ALERT loglevel (this is
> the case since 2004). Most callers of this function are really detecting
> a critical page state and BUG right after. On the other hand the
> function is called also from contexts which just want to inform about
> the page state and those would rather not disrupt logs that much (e.g.
> some systems route these messages to the normal console).
> 
> Reduce the loglevel to KERN_WARNING to make dump_page easier to reuse
> for other contexts while those messages will still make it to the kernel
> log in most setups. Even if the loglevel setup filters warnings away
> those paths that are really critical already print the more targeted
> error or panic and that should make it to the kernel log.
> 
> Signed-off-by: Michal Hocko 
> ---
>  mm/debug.c | 18 +-
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/debug.c b/mm/debug.c
> index a33177bfc856..d18c5cea3320 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -54,7 +54,7 @@ void __dump_page(struct page *page, const char *reason)
>* dump_page() when detected.
>*/
>   if (page_poisoned) {
> - pr_emerg("page:%px is uninitialized and poisoned", page);
> + pr_warn("page:%px is uninitialized and poisoned", page);
>   goto hex_only;
>   }
>  
> @@ -65,27 +65,27 @@ void __dump_page(struct page *page, const char *reason)
>*/
>   mapcount = PageSlab(page) ? 0 : page_mapcount(page);
>  
> - pr_emerg("page:%px count:%d mapcount:%d mapping:%px index:%#lx",
> + pr_warn("page:%px count:%d mapcount:%d mapping:%px index:%#lx",
pr_warn("page:%px refcount:%d mapcount:%d mapping:%px index:%#lx",

Better print it as refcount since we have renamed it. 

> page, page_ref_count(page), mapcount,
> page->mapping, page_to_pgoff(page));
>   if (PageCompound(page))
>   pr_cont(" compound_mapcount: %d", compound_mapcount(page));
>   pr_cont("\n");
>   if (PageAnon(page))
> - pr_emerg("anon ");
> + pr_warn("anon ");
>   else if (PageKsm(page))
> - pr_emerg("ksm ");
> + pr_warn("ksm ");
>   else if (mapping) {
> - pr_emerg("%ps ", mapping->a_ops);
> + pr_warn("%ps ", mapping->a_ops);
>   if (mapping->host->i_dentry.first) {
>   struct dentry *dentry;
>   dentry = container_of(mapping->host->i_dentry.first, 
> struct dentry, d_u.d_alias);
> - pr_emerg("name:\"%*s\" ", dentry->d_name.len, 
> dentry->d_name.name);
> + pr_warn("name:\"%*s\" ", dentry->d_name.len, 
> dentry->d_name.name);
>   }
>   }
>   BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS + 1);
>  
> - pr_emerg("flags: %#lx(%pGp)\n", page->flags, &page->flags);
> + pr_warn("flags: %#lx(%pGp)\n", page->flags, &page->flags);
>  
>  hex_only:
>   print_hex_dump(KERN_ALERT, "raw: ", DUMP_PREFIX_NONE, 32,
> @@ -93,11 +93,11 @@ void __dump_page(struct page *page, const char *reason)
>   sizeof(struct page), false);
>  
>   if (reason)
> - pr_alert("page dumped because: %s\n", reason);
> + pr_warn("page dumped because: %s\n", reason);
>  
>  #ifdef CONFIG_MEMCG
>   if (!page_poisoned && page->mem_cgroup)
> - pr_alert("page->mem_cgroup:%px\n", page->mem_cgroup);
> + pr_warn("page->mem_cgroup:%px\n", page->mem_cgroup);
>  #endif
>  }
>  
> -- 
> 2.19.1
>
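
The loglevel argument in the changelog can be sketched in a few lines of C. This is only an illustration under the usual printk convention (a message is shown on the console when its numeric level is lower than the console loglevel); the helper name reaches_console() and the enum names are made up for this note and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Numeric values of the kernel log levels named in the patch. */
enum { LVL_EMERG = 0, LVL_ALERT = 1, LVL_WARNING = 4 };

/*
 * Hypothetical helper: a message reaches the console only when its
 * level is numerically lower than the console loglevel.
 */
static bool reaches_console(int msg_level, int console_loglevel)
{
	return msg_level < console_loglevel;
}
```

With a console loglevel of 4, an emergency-level dump is still shown while a warning-level dump is filtered from the console; with a loglevel of 7 both are shown.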

[PATCH] arm64: dts: rockchip: rk3399: Add xin32k clk

2018-11-15 Thread Derek Basehore

This adds the xin32k clock to the RK3399 CPU. Even though it's not
directly used, muxes will end up traversing the entire clk tree on
calls to determine_rate if it doesn't exist.

Signed-off-by: Derek Basehore 
---
 arch/arm64/boot/dts/rockchip/rk3399.dtsi | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/arm64/boot/dts/rockchip/rk3399.dtsi b/arch/arm64/boot/dts/rockchip/rk3399.dtsi
index 99e7f65c1779..6a32293982d0 100644
--- a/arch/arm64/boot/dts/rockchip/rk3399.dtsi
+++ b/arch/arm64/boot/dts/rockchip/rk3399.dtsi
@@ -191,6 +191,13 @@
#clock-cells = <0>;
};
 
+   xin32k: xin32k {
+   compatible = "fixed-clock";
+   clock-frequency = <32000>;
+   clock-output-names = "xin32k";
+   #clock-cells = <0>;
+   };
+
amba {
compatible = "simple-bus";
#address-cells = <2>;
-- 
2.19.1.1215.g8438c0b245-goog

Applied "regulator/of_get_regulator: add child path to find the regulator supplier" to the regulator tree

2018-11-15 Thread Mark Brown

The patch

   regulator/of_get_regulator: add child path to find the regulator supplier

has been applied to the regulator tree at

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator.git 

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.  

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

From fe06051dbf8abf5962d9258c4a863056bdfa6eae Mon Sep 17 00:00:00 2001
From: zoro 
Date: Wed, 14 Nov 2018 17:38:22 +0800
Subject: [PATCH] regulator/of_get_regulator: add child path to find the
 regulator supplier

When the supplier of the VIR_LDO1 regulator is its sibling node,
we can't find the supplier.

example code :
&vir_regulator {
ldo0_vir: ldo0-virtual {
regulator-compatible = "VIR_LDO0";
regulator-name= "VIR_LDO0";
regulator-min-microvolt = <100>;
regulator-max-microvolt = <200>;
};
ldo1_vir: ldo1-virtual {
regulator-compatible = "VIR_LDO1";
regulator-name= "VIR_LDO1";
regulator-min-microvolt = <100>;
regulator-max-microvolt = <300>;
ldo1-supply = <&ldo0_vir>;
};
...
}

So we add the child path to find the supplier.

Signed-off-by: zoro 
Signed-off-by: Mark Brown 
---
 drivers/regulator/core.c | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/drivers/regulator/core.c b/drivers/regulator/core.c
index 2c66b528aede..55207264c212 100644
--- a/drivers/regulator/core.c
+++ b/drivers/regulator/core.c
@@ -227,6 +227,37 @@ static void regulator_unlock_supply(struct regulator_dev *rdev)
}
 }
 
+/**
+ * of_get_child_regulator - get a child regulator device node
+ * based on supply name
+ * @parent: Parent device node
+ * @prop_name: Combination regulator supply name and "-supply"
+ *
+ * Traverse all child nodes.
+ * Extract the child regulator device node corresponding to the supply name.
+ * returns the device node corresponding to the regulator if found, else
+ * returns NULL.
+ */
+static struct device_node *of_get_child_regulator(struct device_node *parent,
+ const char *prop_name)
+{
+   struct device_node *regnode = NULL;
+   struct device_node *child = NULL;
+
+   for_each_child_of_node(parent, child) {
+   regnode = of_parse_phandle(child, prop_name, 0);
+
+   if (!regnode) {
+   regnode = of_get_child_regulator(child, prop_name);
+   if (regnode)
+   return regnode;
+   } else {
+   return regnode;
+   }
+   }
+   return NULL;
+}
+
 /**
  * of_get_regulator - get a regulator device node based on supply name
  * @dev: Device pointer for the consumer (of regulator) device
@@ -247,6 +278,10 @@ static struct device_node *of_get_regulator(struct device *dev, const char *supp
regnode = of_parse_phandle(dev->of_node, prop_name, 0);
 
if (!regnode) {
+   regnode = of_get_child_regulator(dev->of_node, prop_name);
+   if (regnode)
+   return regnode;
+
dev_dbg(dev, "Looking up %s property in node %pOF failed\n",
prop_name, dev->of_node);
return NULL;
-- 
2.19.1
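
The recursive child lookup added by this patch can be modelled outside the kernel as a depth-first search. The sketch below is a simplified stand-in, not the kernel code: struct node, its fields, and find_child_supply() are hypothetical, and each node carries at most one supply link instead of a named "-supply" property.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical stand-in for struct device_node: one optional supply link. */
struct node {
	const char *name;
	const struct node *supply;              /* models a "<name>-supply" phandle */
	const struct node *child[MAX_CHILDREN]; /* NULL-terminated unless full */
};

/* Depth-first search: return the first supply found below parent, or NULL. */
static const struct node *find_child_supply(const struct node *parent)
{
	for (size_t i = 0; i < MAX_CHILDREN && parent->child[i]; i++) {
		const struct node *c = parent->child[i];
		const struct node *found = c->supply ? c->supply : find_child_supply(c);

		if (found)
			return found;
	}
	return NULL;
}
```

As in the patch, the search returns the first match in traversal order, so a parent with several children that each name a supply resolves to the one belonging to the earliest child.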

Re: [PATCH] watchdog: core: suppress "watchdog did not stop" message

2018-11-15 Thread Tao Ren

On 11/15/18 4:19 PM, Guenter Roeck wrote:
> NACK. This message is displayed if/when the watchdog application
> exits without stopping the watchdog and/or without closing properly.
> This _is_ critical since it will reboot the system after the next
> timeout period.
> 
> If userspace triggers this message on purpose (eg by the mentioned
> script, which does not exit properly), userspace is at fault,
> not the kernel.
> 
> Guenter

Thank you for the quick response, Guenter. I see this message every time I
reboot my system, and when I searched for it on Google I also found posts
asking why it is printed at reboot. That's why I found it confusing.

Anyway, please ignore the patch, since the message is necessary.

Thanks,
Tao Ren
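
For reference, a userspace program can avoid this message by closing the device properly. The sketch below assumes a driver that supports the "magic close" convention from the kernel watchdog API (the character 'V' written just before close) and that nowayout is not set; the helper name watchdog_magic_close() is made up for this note.

```c
#include <assert.h>
#include <unistd.h>

/*
 * Write the magic character 'V' and close the descriptor.
 * Returns 0 on success, -1 if the write or the close failed.
 */
static int watchdog_magic_close(int fd)
{
	ssize_t n = write(fd, "V", 1);
	int rc = close(fd);

	return (n == 1 && rc == 0) ? 0 : -1;
}
```

In a real daemon fd would come from open("/dev/watchdog", O_WRONLY); a script that is killed at reboot never gets to this step, which is the case Guenter describes.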

Re: [PATCH v3 2/2] proc: add /proc/<pid>/arch_state

2018-11-15 Thread Li, Aubrey

On 2018/11/15 23:18, Dave Hansen wrote:
> On 11/14/18 3:00 PM, Aubrey Li wrote:
>> +void arch_thread_state(struct seq_file *m, struct task_struct *task)
>> +{
>> +/*
>> + * Report AVX-512 Hi16_ZMM registers usage
>> + */
>> +if (task->thread.fpu.hi16zmm_usage)
>> +seq_putc(m, '1');
>> +else
>> +seq_putc(m, '0');
>> +seq_putc(m, '\n');
>> +}
> 
> Am I reading this right that this file just dumps out a plain 0 or 1
> based on the internal kernel state?  BTW, there's no example of the
> output in the changelog, so it's rather hard to tell if my guess is
> right.  (Hint, hint).
> 
> If so, I'd really prefer you not do this.  We have /proc/$pid/stat to
> stand as a disaster in this regard.  It is essentially
> non-human-readable gibberish because it's impossible to tell what the
> values mean without a decoder ring.

Yes, I'm following the /proc/$pid/stat format, as I think this interface is
not for end users but for developers and user-space job schedulers. So
I guess this style might be okay.

> 
> If we go down this road, we need a file along the lines of
> /proc/$pid/status.

I checked /proc/$pid/status; it only holds information that is common to all
architectures. That's why I want to add a new interface for CPU-specific state.

> 
> But, either way, this is a new ABI that we need to consider carefully.
> It needs documentation.  For instance, will this really mean "Hi16_ZMM
> user" from now until the end of time?  Or, does it just mean "group me
> with other tasks that have this bit set"?
> 
I'm open to changing this interface. Let's wait and see if there are more
comments and suggestions.

Thanks,
-Aubrey

[PATCH] perf build: fix -lbfd feature check

2018-11-15 Thread Stanislav Fomichev

Current libbfd feature test unconditionally links against -liberty and -lz.
While it's required on some systems (e.g. opensuse), it's completely
unnecessary on the others, where only -lbfd is sufficient (debian).
This patch streamlines (and renames) the following feature checks:

feature-libbfd   - only link against -lbfd (debian),
   see commit 2cf9040714f3 ("perf tools: Fix bfd
   dependency libraries detection")
feature-libbfd-liberty   - link against -lbfd and -liberty
feature-libbfd-liberty-z - link against -lbfd, -liberty and -lz (opensuse),
   see commit 280e7c48c3b8 ("perf tools: fix BFD
   detection on opensuse")

(feature-liberty{,-z} were renamed to feature-libbfd-liberty{,-z}
for clarity)

The main motivation is to fix this feature test for bpftool which is
currently broken on debian (libbfd feature shows OFF, but we still
unconditionally link against -lbfd and it works).

Tested on debian with only -lbfd installed (without -liberty); I'd
appreciate if somebody on the other systems can test this new detection
method.
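
The fallback order described above (try -lbfd alone, then add -liberty, then add -lz) can be written down as a small selection function. This is a sketch of the decision only, not of the Makefile; struct bfd_features and bfd_extlibs() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical summary of the three link tests described above. */
struct bfd_features {
	bool libbfd;            /* -lbfd alone links */
	bool libbfd_liberty;    /* -lbfd -liberty links */
	bool libbfd_liberty_z;  /* -lbfd -liberty -lz links */
};

/* Return the smallest set of linker flags that worked, or "" if none did. */
static const char *bfd_extlibs(struct bfd_features f)
{
	if (f.libbfd)
		return "-lbfd";
	if (f.libbfd_liberty)
		return "-lbfd -liberty";
	if (f.libbfd_liberty_z)
		return "-lbfd -liberty -lz";
	return "";
}
```

The first test that links wins, so a system where plain -lbfd works never pulls in -liberty or -lz.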

Signed-off-by: Stanislav Fomichev 
---
 tools/build/Makefile.feature |  4 ++--
 tools/build/feature/Makefile | 10 
 tools/perf/Makefile.config   | 44 +++-
 3 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/tools/build/Makefile.feature b/tools/build/Makefile.feature
index f216b2f5c3d7..42a787856cd8 100644
--- a/tools/build/Makefile.feature
+++ b/tools/build/Makefile.feature
@@ -79,8 +79,8 @@ FEATURE_TESTS_EXTRA :=  \
  cplus-demangle \
  hello  \
  libbabeltrace  \
- liberty\
- liberty-z  \
+ libbfd-liberty \
+ libbfd-liberty-z   \
  libunwind-debug-frame  \
  libunwind-debug-frame-arm  \
  libunwind-debug-frame-aarch64  \
diff --git a/tools/build/feature/Makefile b/tools/build/feature/Makefile
index 0516259be70f..bf8a8ebcca1e 100644
--- a/tools/build/feature/Makefile
+++ b/tools/build/feature/Makefile
@@ -15,8 +15,8 @@ FILES=  \
  test-libbfd.bin\
  test-disassembler-four-args.bin\
  test-reallocarray.bin \
- test-liberty.bin   \
- test-liberty-z.bin \
+ test-libbfd-liberty.bin\
+ test-libbfd-liberty-z.bin  \
  test-cplus-demangle.bin\
  test-libelf.bin\
  test-libelf-getphdrnum.bin \
@@ -200,7 +200,7 @@ FLAGS_PERL_EMBED=$(PERL_EMBED_CCOPTS) $(PERL_EMBED_LDOPTS)
$(BUILD)
 
 $(OUTPUT)test-libbfd.bin:
-   $(BUILD) -DPACKAGE='"perf"' -lbfd -lz -liberty -ldl
+   $(BUILD) -DPACKAGE='"perf"' -lbfd -ldl
 
 $(OUTPUT)test-disassembler-four-args.bin:
$(BUILD) -DPACKAGE='"perf"' -lbfd -lopcodes
@@ -208,10 +208,10 @@ FLAGS_PERL_EMBED=$(PERL_EMBED_CCOPTS) $(PERL_EMBED_LDOPTS)
 $(OUTPUT)test-reallocarray.bin:
$(BUILD)
 
-$(OUTPUT)test-liberty.bin:
+$(OUTPUT)test-libbfd-liberty.bin:
$(CC) $(CFLAGS) -Wall -Werror -o $@ test-libbfd.c -DPACKAGE='"perf"' 
$(LDFLAGS) -lbfd -ldl -liberty
 
-$(OUTPUT)test-liberty-z.bin:
+$(OUTPUT)test-libbfd-liberty-z.bin:
$(CC) $(CFLAGS) -Wall -Werror -o $@ test-libbfd.c -DPACKAGE='"perf"' 
$(LDFLAGS) -lbfd -ldl -liberty -lz
 
 $(OUTPUT)test-cplus-demangle.bin:
diff --git a/tools/perf/Makefile.config b/tools/perf/Makefile.config
index e30d20fb482d..6287fa0ebd1d 100644
--- a/tools/perf/Makefile.config
+++ b/tools/perf/Makefile.config
@@ -686,18 +686,20 @@ endif
 
 ifeq ($(feature-libbfd), 1)
   EXTLIBS += -lbfd
+else
+  # we are on a system that requires -liberty and (maybe) -lz
+  # to link against -lbfd; test each case individually here
 
   # call all detections now so we get correct
   # status in VF output
-  $(call feature_check,liberty)
-  $(call feature_check,liberty-z)
-  $(call feature_check,cplus-demangle)
+  $(call feature_check,libbfd-liberty)
+  $(call feature_check,libbfd-liberty-z)
 
-  ifeq ($(feature-liberty), 1)
-EXTLIBS += -liberty
+  ifeq ($(feature-libbfd-liberty), 1)
+EXTLIBS += -lbfd -liberty
   else
-ifeq ($(feature-liberty-z), 1)
-  EXTLIBS += -liberty -lz
+ifeq ($(feature-libbfd-liberty-z), 1)
+  EXTLIBS += -lbfd -liberty -lz
 endif
   endif
 endif
@@ -707,24 +709,24 @@ ifdef NO_DEMANGLE
 else
   ifdef HAVE_CPLUS_DEMANGLE_SUPPORT
 EXTLIBS += -liberty
-CFLAGS += -DHAVE_CPLUS_DEMANGLE_SUPPORT
   else
-ifneq ($(feature-libbfd), 1)
-  ifneq ($(feature-liberty), 1)
-ifneq ($(feature-liberty-z), 1)
-  # we dont have neither HAVE_CPLUS_DEMANGLE_SUPPORT
-  # or any of

[PATCH 1/6] locking/mutex: Remove caller signal_pending branch predictions

2018-11-15 Thread Davidlohr Bueso

This is already done for us internally by the signal machinery.

Cc: pet...@infradead.org
Cc: mi...@kernel.org
Signed-off-by: Davidlohr Bueso 
---
 kernel/locking/mutex.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 3f8a35104285..db578783dd36 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -987,7 +987,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 * wait_lock. This ensures the lock cancellation is ordered
 * against mutex_unlock() and wake-ups do not go missing.
 */
-   if (unlikely(signal_pending_state(state, current))) {
+   if (signal_pending_state(state, current)) {
ret = -EINTR;
goto err;
}
-- 
2.16.4
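
The reasoning behind this series can be shown with a userspace sketch: when the helper already wraps its test in unlikely(), a second unlikely() at the call site adds nothing. The code below is an illustration only; struct task, its field, and wait_step() are hypothetical, and unlikely() is defined here with the compiler builtin (GCC or Clang assumed).

```c
#include <assert.h>
#include <stdbool.h>

/* Same shape as the kernel's unlikely(); needs GCC or Clang. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for a task with a "signal pending" flag. */
struct task {
	bool sigpending;
};

/* The helper itself carries the hint, so every caller inherits it. */
static inline int signal_pending(const struct task *t)
{
	return unlikely(t->sigpending);
}

/* Caller after the patch: no extra unlikely() around the call. */
static int wait_step(const struct task *t)
{
	if (signal_pending(t))
		return -1; /* stands in for -EINTR */
	return 0;
}
```

Behaviour is identical with or without the outer annotation; only the redundant hint at the call site goes away.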

[PATCH 4/6] mm: Remove caller signal_pending branch predictions

2018-11-15 Thread Davidlohr Bueso

This is already done for us internally by the signal machinery.

Signed-off-by: Davidlohr Bueso 
---
 mm/filemap.c | 2 +-
 mm/gup.c | 2 +-
 mm/hugetlb.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 81adec8ee02c..abd6c4591855 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1096,7 +1096,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
break;
}
 
-   if (unlikely(signal_pending_state(state, current))) {
+   if (signal_pending_state(state, current)) {
ret = -EINTR;
break;
}
diff --git a/mm/gup.c b/mm/gup.c
index f76e77a2d34b..391c71dde267 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -722,7 +722,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 * If we have a pending SIGKILL, don't keep faulting pages and
 * potentially allocating memory.
 */
-   if (unlikely(fatal_signal_pending(current))) {
+   if (fatal_signal_pending(current)) {
ret = -ERESTARTSYS;
goto out;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c007fb5fb8d5..da798bc2e948 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4184,7 +4184,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 * If we have a pending SIGKILL, don't keep faulting pages and
 * potentially allocating memory.
 */
-   if (unlikely(fatal_signal_pending(current))) {
+   if (fatal_signal_pending(current)) {
remainder = 0;
break;
}
-- 
2.16.4

[PATCH 3/6] arch/arc: Remove caller signal_pending branch predictions

2018-11-15 Thread Davidlohr Bueso

This is already done for us internally by the signal machinery.

Cc: vgu...@synopsys.com
Signed-off-by: Davidlohr Bueso 
---
 arch/arc/mm/fault.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index c9da6102eb4f..6b6df8f9f882 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -142,7 +142,7 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
fault = handle_mm_fault(vma, address, flags);
 
/* If Pagefault was interrupted by SIGKILL, exit page fault "early" */
-   if (unlikely(fatal_signal_pending(current))) {
+   if (fatal_signal_pending(current)) {
if ((fault & VM_FAULT_ERROR) && !(fault & VM_FAULT_RETRY))
up_read(&mm->mmap_sem);
if (user_mode(regs))
-- 
2.16.4

[PATCH 2/6] kernel/sched: Remove caller signal_pending branch predictions

2018-11-15 Thread Davidlohr Bueso

This is already done for us internally by the signal machinery.

Cc: pet...@infradead.org
Cc: mi...@kernel.org
Signed-off-by: Davidlohr Bueso 
---
 kernel/sched/core.c  | 2 +-
 kernel/sched/swait.c | 2 +-
 kernel/sched/wait.c  | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f12225f26b70..1972b4c63a1f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3416,7 +3416,7 @@ static void __sched notrace __schedule(bool preempt)
 
switch_count = &prev->nivcsw;
if (!preempt && prev->state) {
-   if (unlikely(signal_pending_state(prev->state, prev))) {
+   if (signal_pending_state(prev->state, prev)) {
prev->state = TASK_RUNNING;
} else {
deactivate_task(rq, prev, DEQUEUE_SLEEP | 
DEQUEUE_NOCLOCK);
diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
index 66b59ac77c22..e83a3f8449f6 100644
--- a/kernel/sched/swait.c
+++ b/kernel/sched/swait.c
@@ -93,7 +93,7 @@ long prepare_to_swait_event(struct swait_queue_head *q, struct swait_queue *wait
long ret = 0;
 
raw_spin_lock_irqsave(&q->lock, flags);
-   if (unlikely(signal_pending_state(state, current))) {
+   if (signal_pending_state(state, current)) {
/*
 * See prepare_to_wait_event(). TL;DR, subsequent swake_up_one()
 * must not see us.
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 5dd47f1103d1..6eb1f8efd221 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -264,7 +264,7 @@ long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_en
long ret = 0;
 
spin_lock_irqsave(&wq_head->lock, flags);
-   if (unlikely(signal_pending_state(state, current))) {
+   if (signal_pending_state(state, current)) {
/*
 * Exclusive waiter must not fail if it was selected by wakeup,
 * it should "consume" the condition we were waiting for.
-- 
2.16.4

[PATCH 5/6] drivers/i2c: Remove caller signal_pending branch predictions

2018-11-15 Thread Davidlohr Bueso

This is already done for us internally by the signal machinery.

Cc: linux-...@vger.kernel.org
Cc: p...@axentia.se
Signed-off-by: Davidlohr Bueso 
---
 drivers/i2c/busses/i2c-ibm_iic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/i2c/busses/i2c-ibm_iic.c b/drivers/i2c/busses/i2c-ibm_iic.c
index 6f6e1dfe7cce..d78023d42a35 100644
--- a/drivers/i2c/busses/i2c-ibm_iic.c
+++ b/drivers/i2c/busses/i2c-ibm_iic.c
@@ -437,7 +437,7 @@ static int iic_wait_for_tc(struct ibm_iic_private* dev){
break;
}
 
-   if (unlikely(signal_pending(current))){
+   if (signal_pending(current)){
DBG("%d: poll interrupted\n", dev->idx);
ret = -ERESTARTSYS;
break;
-- 
2.16.4
