date:20180426

[RFC PATCH 3/9] arm64: mm: migrate: add pmd swap entry to support thp migration.

2018-04-26 Thread Zi Yan

From: Zi Yan 

Signed-off-by: Zi Yan 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Steve Capper 
Cc: Marc Zyngier 
Cc: Kristina Martsenko 
Cc: Dan Williams 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux...@kvack.org
---
 arch/arm64/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7e2c27e63cd8..1cdc9d3db2c7 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -742,7 +742,9 @@ extern pgd_t tramp_pg_dir[PTRS_PER_PGD];
 #define __swp_entry(type,offset) ((swp_entry_t) { ((type) << __SWP_TYPE_SHIFT) | ((offset) << __SWP_OFFSET_SHIFT) })
 
 #define __pte_to_swp_entry(pte)((swp_entry_t) { pte_val(pte) })
+#define __pmd_to_swp_entry(pmd)((swp_entry_t) { pmd_val(pmd) })
 #define __swp_entry_to_pte(swp)((pte_t) { (swp).val })
+#define __swp_entry_to_pmd(swp)((pmd_t) { (swp).val })
 
 /*
  * Ensure that there are not more swap files than can be encoded in the kernel
-- 
2.17.0

Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-26 Thread Mikulas Patocka



On Thu, 26 Apr 2018, Michal Hocko wrote:

> On Wed 25-04-18 18:42:57, Mikulas Patocka wrote:
> > 
> > 
> > On Wed, 25 Apr 2018, James Bottomley wrote:
> [...]
> > > Kconfig proliferation, conversely, is a bit of a nightmare from both
> > > the user and the tester's point of view, so we're trying to avoid it
> > > unless absolutely necessary.
> > > 
> > > James
> > 
> > I already offered that we don't need to introduce a new kernel option and 
> > we can bind this feature to any other kernel option, that is enabled in 
> > the debug kernel, for example CONFIG_DEBUG_SG. Michal said no and he said 
> > that he wants a new kernel option instead.
> 
> Just for the record. I didn't say I _want_ a config option. Do not
> misinterpret my words. I've said that a config option would be
> acceptable if there is no way to deliver the functionality via kernel
> package automatically. You haven't provided any argument that would
> explain why the kernel package cannot add a boot option. Maybe there are
> some but I do not see them right now.

AFAIK Grub doesn't load per-kernel options from a per-kernel file. Even if 
we hacked grub scripts to add this option, other distributions won't.

Mikulas

Re: [patch V2 4/7] LICENSES: Add CDDL-1.0 license text

2018-04-26 Thread Kate Stewart

On Thu, Apr 26, 2018 at 2:01 AM, Greg Kroah-Hartman wrote:
> On Wed, Apr 25, 2018 at 10:30:24PM +0200, Thomas Gleixner wrote:
>> Add the full text of the CDDL-1.0 to the kernel tree.  It was copied directly
>> from:
>>
>>https://spdx.org/licenses/CDDL-1.0.html#licenseText
>>
>> Signed-off-by: Thomas Gleixner 
>
> Reviewed-by: Greg Kroah-Hartman 

Reviewed-by: Kate Stewart

[RFC PATCH 2/9] arm: mm: migrate: add pmd swap entry to support thp migration.

2018-04-26 Thread Zi Yan

From: Zi Yan 

Signed-off-by: Zi Yan 
Cc: Russell King 
Cc: Christoffer Dall 
Cc: Marc Zyngier 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux...@kvack.org
---
 arch/arm/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index a757401129f9..d4b35514e96a 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -347,7 +347,9 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 #define __swp_entry(type,offset) ((swp_entry_t) { ((type) << __SWP_TYPE_SHIFT) | ((offset) << __SWP_OFFSET_SHIFT) })
 
 #define __pte_to_swp_entry(pte)((swp_entry_t) { pte_val(pte) })
+#define __pmd_to_swp_entry(pmd)((swp_entry_t) { pmd_val(pmd) })
 #define __swp_entry_to_pte(swp)((pte_t) { (swp).val })
+#define __swp_entry_to_pmd(swp)((pmd_t) { (swp).val })
 
 /*
  * It is an error for the kernel to have more swap files than we can
-- 
2.17.0

[RFC PATCH 1/9] arc: mm: migrate: add pmd swap entry to support thp migration.

2018-04-26 Thread Zi Yan

From: Zi Yan 

Signed-off-by: Zi Yan 
Cc: Vineet Gupta 
Cc: linux-snps-...@lists.infradead.org
Cc: linux...@kvack.org
---
 arch/arc/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arc/include/asm/pgtable.h b/arch/arc/include/asm/pgtable.h
index 08fe33830d4b..246934105e61 100644
--- a/arch/arc/include/asm/pgtable.h
+++ b/arch/arc/include/asm/pgtable.h
@@ -383,7 +383,9 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned long address,
 
 /* NOPs, to keep generic kernel happy */
 #define __pte_to_swp_entry(pte)((swp_entry_t) { pte_val(pte) })
+#define __pmd_to_swp_entry(pmd)((swp_entry_t) { pmd_val(pmd) })
 #define __swp_entry_to_pte(x)  ((pte_t) { (x).val })
+#define __swp_entry_to_pmd(x)  ((pmd_t) { (x).val })
 
 #define kern_addr_valid(addr)  (1)
 
-- 
2.17.0

[RFC PATCH 6/9] powerpc: mm: migrate: add pmd swap entry to support thp migration.

2018-04-26 Thread Zi Yan

From: Zi Yan 

pmd swap soft dirty support is added, too.

Signed-off-by: Zi Yan 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Aneesh Kumar K.V" 
Cc: Ram Pai 
Cc: Balbir Singh 
Cc: Naoya Horiguchi 
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux...@kvack.org
---
 arch/powerpc/include/asm/book3s/32/pgtable.h |  2 ++
 arch/powerpc/include/asm/book3s/64/pgtable.h | 17 +
 arch/powerpc/include/asm/nohash/32/pgtable.h |  2 ++
 arch/powerpc/include/asm/nohash/64/pgtable.h |  2 ++
 4 files changed, 23 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h
index c615abdce119..866b67a8abf0 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -294,7 +294,9 @@ static inline void __ptep_set_access_flags(struct mm_struct *mm,
 #define __swp_offset(entry)((entry).val >> 5)
 #define __swp_entry(type, offset)  ((swp_entry_t) { (type) | ((offset) << 5) })
 #define __pte_to_swp_entry(pte)((swp_entry_t) { pte_val(pte) >> 3 })
+#define __pmd_to_swp_entry(pmd)((swp_entry_t) { pmd_val(pmd) >> 3 })
 #define __swp_entry_to_pte(x)  ((pte_t) { (x).val << 3 })
+#define __swp_entry_to_pmd(x)  ((pmd_t) { (x).val << 3 })
 
 int map_kernel_page(unsigned long va, phys_addr_t pa, int flags);
 
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index a6b9f1d74600..6b3c6492071d 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -726,7 +726,9 @@ static inline bool pte_user(pte_t pte)
  * Clear bits not found in swap entries here.
  */
 #define __pte_to_swp_entry(pte)((swp_entry_t) { pte_val((pte)) & ~_PAGE_PTE })
+#define __pmd_to_swp_entry(pmd)((swp_entry_t) { pmd_val((pmd)) & ~_PAGE_PTE })
 #define __swp_entry_to_pte(x)  __pte((x).val | _PAGE_PTE)
+#define __swp_entry_to_pmd(x)  __pmd((x).val | _PAGE_PTE)
 
 #ifdef CONFIG_MEM_SOFT_DIRTY
 #define _PAGE_SWP_SOFT_DIRTY   (1UL << (SWP_TYPE_BITS + _PAGE_BIT_SWAP_TYPE))
@@ -749,6 +751,21 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
 {
return __pte(pte_val(pte) & ~_PAGE_SWP_SOFT_DIRTY);
 }
+
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+   return __pmd(pmd_val(pmd) | _PAGE_SWP_SOFT_DIRTY);
+}
+
+static inline bool pmd_swp_soft_dirty(pmd_t pmd)
+{
+   return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_SWP_SOFT_DIRTY));
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+   return __pmd(pmd_val(pmd) & ~_PAGE_SWP_SOFT_DIRTY);
+}
 #endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
 
 static inline bool check_pte_access(unsigned long access, unsigned long ptev)
diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
index 03bbd1149530..f6b0534a02d4 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -337,7 +337,9 @@ static inline void __ptep_set_access_flags(struct mm_struct *mm,
 #define __swp_offset(entry)((entry).val >> 5)
 #define __swp_entry(type, offset)  ((swp_entry_t) { (type) | ((offset) << 5) })
 #define __pte_to_swp_entry(pte)((swp_entry_t) { pte_val(pte) >> 3 })
+#define __pmd_to_swp_entry(pmd)((swp_entry_t) { pmd_val(pmd) >> 3 })
 #define __swp_entry_to_pte(x)  ((pte_t) { (x).val << 3 })
+#define __swp_entry_to_pmd(x)  ((pmd_t) { (x).val << 3 })
 
 int map_kernel_page(unsigned long va, phys_addr_t pa, int flags);
 
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h b/arch/powerpc/include/asm/nohash/64/pgtable.h
index 5c5f75d005ad..5790763c07df 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
@@ -342,7 +342,9 @@ static inline void __ptep_set_access_flags(struct mm_struct *mm,
| ((offset) << PTE_RPN_SHIFT) })
 
 #define __pte_to_swp_entry(pte)((swp_entry_t) { pte_val((pte)) })
+#define __pmd_to_swp_entry(pmd)((swp_entry_t) { pmd_val((pmd)) })
 #define __swp_entry_to_pte(x)  __pte((x).val)
+#define __swp_entry_to_pmd(x)  __pmd((x).val)
 
 extern int map_kernel_page(unsigned long ea, unsigned long pa,
   unsigned long flags);
-- 
2.17.0

[RFC PATCH 0/9] Enable THP migration for all possible architectures

2018-04-26 Thread Zi Yan

From: Zi Yan 

Hi all,

THP migration is currently enabled only on x86_64, via a special
ARCH_ENABLE_THP_MIGRATION macro. This patchset enables THP migration for
all architectures that use transparent hugepages, so that the special macro
can be dropped. Instead, THP migration is enabled/disabled via
/sys/kernel/mm/transparent_hugepage/enable_thp_migration.

I grepped for TRANSPARENT_HUGEPAGE in the arch folder and found 9 architectures
that support transparent hugepages. I mechanically added __pmd_to_swp_entry() and
__swp_entry_to_pmd() based on existing __pte_to_swp_entry() and
__swp_entry_to_pte() for all these architectures, except tile which is going to
be dropped.
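
As a rough illustration of what these per-architecture helpers do, here is a
minimal user-space sketch. Everything in it is an assumption made for the
example: the types are simplified stand-ins, the shift values are invented,
and the helper names drop the kernel's double-underscore prefix. It is not
the kernel code; it only shows that a swap entry survives a round trip
through a pmd-sized value when both share the same layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
typedef struct { uint64_t val; } swp_entry_t;
typedef struct { uint64_t pmd; } pmd_t;

/* Assumed bit layout, chosen only for illustration. */
#define SWP_TYPE_SHIFT   2
#define SWP_TYPE_BITS    6
#define SWP_OFFSET_SHIFT (SWP_TYPE_SHIFT + SWP_TYPE_BITS)
#define SWP_TYPE_MASK    ((1ULL << SWP_TYPE_BITS) - 1)

/* Pack a swap type and offset into one arch-style swap entry. */
static inline swp_entry_t swp_entry(uint64_t type, uint64_t offset)
{
	return (swp_entry_t){ (type << SWP_TYPE_SHIFT) |
			      (offset << SWP_OFFSET_SHIFT) };
}

static inline uint64_t swp_type(swp_entry_t e)
{
	return (e.val >> SWP_TYPE_SHIFT) & SWP_TYPE_MASK;
}

static inline uint64_t swp_offset(swp_entry_t e)
{
	return e.val >> SWP_OFFSET_SHIFT;
}

/* The pmd conversions mirror the pte ones: reinterpret the raw value. */
static inline pmd_t swp_entry_to_pmd(swp_entry_t e)
{
	return (pmd_t){ e.val };
}

static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
{
	return (swp_entry_t){ pmd.pmd };
}
```

Architectures whose pte conversion shifts or masks bits (for example the
powerpc variants in this series) would apply the same shift or mask in the
pmd variant, which is the "mechanical" part of the change.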

I have successfully compiled all these architectures, but have NOT tested them
due to lack of real hardware. I would appreciate it if the maintainers of
these architectures could run a quick test with the code from
https://github.com/x-y-z/thp-migration-bench . Please apply patch 9 as well
to enable THP migration.

With THP migration enabled, migrating a 2MB THP on x86_64 machines takes only
1/3 of the time needed to migrate the equivalent 512 4KB pages.

Hi Naoya, I also added soft dirty support for powerpc and s390. It would be
great if you could take a look at patches 6 and 7.

Feel free to give comments. Thanks.

Cc: linux...@kvack.org
Cc: Vineet Gupta 
Cc: linux-snps-...@lists.infradead.org
Cc: Russell King 
Cc: Christoffer Dall 
Cc: Marc Zyngier 
Cc: linux-arm-ker...@lists.infradead.org
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Steve Capper 
Cc: Kristina Martsenko 
Cc: Dan Williams 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "Kirill A. Shutemov" 
Cc: x...@kernel.org
Cc: Ralf Baechle 
Cc: James Hogan 
Cc: Michal Hocko 
Cc: linux-m...@linux-mips.org
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Aneesh Kumar K.V" 
Cc: Ram Pai 
Cc: Balbir Singh 
Cc: Naoya Horiguchi 
Cc: linuxppc-...@lists.ozlabs.org
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Janosch Frank 
Cc: linux-s...@vger.kernel.org
Cc: "David S. Miller" 
Cc: sparcli...@vger.kernel.org
Cc: "Huang, Ying" 


Zi Yan (9):
  arc: mm: migrate: add pmd swap entry to support thp migration.
  arm: mm: migrate: add pmd swap entry to support thp migration.
  arm64: mm: migrate: add pmd swap entry to support thp migration.
  i386: mm: migrate: add pmd swap entry to support thp migration.
  mips: mm: migrate: add pmd swap entry to support thp migration.
  powerpc: mm: migrate: add pmd swap entry to support thp migration.
  s390: mm: migrate: add pmd swap entry to support thp migration.
  sparc: mm: migrate: add pmd swap entry to support thp migration.
  mm: migrate: enable thp migration for all possible architectures.

 arch/arc/include/asm/pgtable.h   |  2 ++
 arch/arm/include/asm/pgtable.h   |  2 ++
 arch/arm64/include/asm/pgtable.h |  2 ++
 arch/mips/include/asm/pgtable-64.h   |  2 ++
 arch/powerpc/include/asm/book3s/32/pgtable.h |  2 ++
 arch/powerpc/include/asm/book3s/64/pgtable.h | 17 
 arch/powerpc/include/asm/nohash/32/pgtable.h |  2 ++
 arch/powerpc/include/asm/nohash/64/pgtable.h |  2 ++
 arch/s390/include/asm/pgtable.h  |  5 
 arch/sparc/include/asm/pgtable_32.h  |  2 ++
 arch/sparc/include/asm/pgtable_64.h  |  2 ++
 arch/x86/Kconfig |  4 ---
 arch/x86/include/asm/pgtable-2level.h|  2 ++
 arch/x86/include/asm/pgtable-3level.h|  2 ++
 arch/x86/include/asm/pgtable.h   |  2 --
 fs/proc/task_mmu.c   |  2 --
 include/asm-generic/pgtable.h| 21 ++-
 include/linux/huge_mm.h  |  9 +++
 include/linux/swapops.h  |  4 +--
 mm/Kconfig   |  3 ---
 mm/huge_memory.c | 27 +---
 mm/migrate.c |  6 ++---
 mm/rmap.c|  5 ++--
 23 files changed, 73 insertions(+), 54 deletions(-)

-- 
2.17.0

[RFC PATCH 7/9] s390: mm: migrate: add pmd swap entry to support thp migration.

2018-04-26 Thread Zi Yan

From: Zi Yan 

pmd swap soft dirty support is added, too.

Signed-off-by: Zi Yan 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Janosch Frank 
Cc: Naoya Horiguchi 
Cc: linux-s...@vger.kernel.org
Cc: linux...@kvack.org
---
 arch/s390/include/asm/pgtable.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 2d24d33bf188..215fbb34203e 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -798,18 +798,21 @@ static inline int pmd_soft_dirty(pmd_t pmd)
 {
return pmd_val(pmd) & _SEGMENT_ENTRY_SOFT_DIRTY;
 }
+#define pmd_swp_soft_dirty pmd_soft_dirty
 
 static inline pmd_t pmd_mksoft_dirty(pmd_t pmd)
 {
pmd_val(pmd) |= _SEGMENT_ENTRY_SOFT_DIRTY;
return pmd;
 }
+#define pmd_swp_mksoft_dirty pmd_mksoft_dirty
 
 static inline pmd_t pmd_clear_soft_dirty(pmd_t pmd)
 {
pmd_val(pmd) &= ~_SEGMENT_ENTRY_SOFT_DIRTY;
return pmd;
 }
+#define pmd_swp_clear_soft_dirty pmd_clear_soft_dirty
 
 /*
  * query functions pte_write/pte_dirty/pte_young only work if
@@ -1594,7 +1597,9 @@ static inline swp_entry_t __swp_entry(unsigned long type, unsigned long offset)
 }
 
 #define __pte_to_swp_entry(pte)((swp_entry_t) { pte_val(pte) })
+#define __pmd_to_swp_entry(pmd)((swp_entry_t) { pmd_val(pmd) })
 #define __swp_entry_to_pte(x)  ((pte_t) { (x).val })
+#define __swp_entry_to_pmd(x)  ((pmd_t) { (x).val })
 
 #define kern_addr_valid(addr)   (1)
 
-- 
2.17.0

[RFC PATCH 4/9] i386: mm: migrate: add pmd swap entry to support thp migration.

2018-04-26 Thread Zi Yan

From: Zi Yan 

Signed-off-by: Zi Yan 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "Kirill A. Shutemov" 
Cc: x...@kernel.org
Cc: linux...@kvack.org
---
 arch/x86/include/asm/pgtable-2level.h | 2 ++
 arch/x86/include/asm/pgtable-3level.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
index 685ffe8a0eaf..fba4722ec2c2 100644
--- a/arch/x86/include/asm/pgtable-2level.h
+++ b/arch/x86/include/asm/pgtable-2level.h
@@ -93,6 +93,8 @@ static inline unsigned long pte_bitop(unsigned long value, unsigned int rightshi
 ((type) << (_PAGE_BIT_PRESENT + 1)) \
 | ((offset) << SWP_OFFSET_SHIFT) })
 #define __pte_to_swp_entry(pte)((swp_entry_t) { (pte).pte_low })
+#define __pmd_to_swp_entry(pmd)((swp_entry_t) { native_pmd_val(pmd) })
 #define __swp_entry_to_pte(x)  ((pte_t) { .pte = (x).val })
+#define __swp_entry_to_pmd(x)  (native_make_pmd((x).val))
 
 #endif /* _ASM_X86_PGTABLE_2LEVEL_H */
diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index f24df59c40b2..9b7e3c74fbc0 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -246,7 +246,9 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
 #define __swp_offset(x)((x).val >> 5)
 #define __swp_entry(type, offset)  ((swp_entry_t){(type) | (offset) << 5})
 #define __pte_to_swp_entry(pte)((swp_entry_t){ (pte).pte_high })
+#define __pmd_to_swp_entry(pmd)((swp_entry_t){ (pmd).pmd_high })
 #define __swp_entry_to_pte(x)  ((pte_t){ { .pte_high = (x).val } })
+#define __swp_entry_to_pmd(x)  ((pmd_t){ { .pmd_high = (x).val } })
 
 #define gup_get_pte gup_get_pte
 /*
-- 
2.17.0

[RFC PATCH 5/9] mips: mm: migrate: add pmd swap entry to support thp migration.

2018-04-26 Thread Zi Yan

From: Zi Yan 

Signed-off-by: Zi Yan 
Cc: Ralf Baechle 
Cc: James Hogan 
Cc: Michal Hocko 
Cc: Ingo Molnar 
Cc: Andrew Morton 
Cc: linux-m...@linux-mips.org
Cc: linux...@kvack.org
---
 arch/mips/include/asm/pgtable-64.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h
index 0036ea0c7173..ec72e5b12965 100644
--- a/arch/mips/include/asm/pgtable-64.h
+++ b/arch/mips/include/asm/pgtable-64.h
@@ -366,6 +366,8 @@ static inline pte_t mk_swap_pte(unsigned long type, unsigned long offset)
 #define __swp_offset(x)((x).val >> 24)
 #define __swp_entry(type, offset) ((swp_entry_t) { pte_val(mk_swap_pte((type), (offset))) })
 #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
+#define __pmd_to_swp_entry(pmd) ((swp_entry_t) { pmd_val(pmd) })
 #define __swp_entry_to_pte(x)  ((pte_t) { (x).val })
+#define __swp_entry_to_pmd(x)  ((pmd_t) { (x).val })
 
 #endif /* _ASM_PGTABLE_64_H */
-- 
2.17.0

Re: [patch V2 3/7] LICENSES: Add Apache 2.0 license text

2018-04-26 Thread Kate Stewart

On Thu, Apr 26, 2018 at 2:00 AM, Greg Kroah-Hartman wrote:
> On Wed, Apr 25, 2018 at 10:30:23PM +0200, Thomas Gleixner wrote:
>> Add the full text of the Apache License version 2 to the kernel tree.  It
>> was copied directly from:
>>
>>https://spdx.org/licenses/Apache-2.0.html#licenseText
>>
>> Signed-off-by: Thomas Gleixner 
>
> Reviewed-by: Greg Kroah-Hartman 

Reviewed-by: Kate Stewart

[RFC PATCH 8/9] sparc: mm: migrate: add pmd swap entry to support thp migration.

2018-04-26 Thread Zi Yan

From: Zi Yan 

Signed-off-by: Zi Yan 
Cc: "David S. Miller" 
Cc: sparcli...@vger.kernel.org
Cc: linux...@kvack.org
---
 arch/sparc/include/asm/pgtable_32.h | 2 ++
 arch/sparc/include/asm/pgtable_64.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h
index 4eebed6c6781..293bf9f8f949 100644
--- a/arch/sparc/include/asm/pgtable_32.h
+++ b/arch/sparc/include/asm/pgtable_32.h
@@ -367,7 +367,9 @@ static inline swp_entry_t __swp_entry(unsigned long type, unsigned long offset)
 }
 
 #define __pte_to_swp_entry(pte)((swp_entry_t) { pte_val(pte) })
+#define __pmd_to_swp_entry(pmd)((swp_entry_t) { pmd_val(pmd) })
 #define __swp_entry_to_pte(x)  ((pte_t) { (x).val })
+#define __swp_entry_to_pmd(x)  ((pmd_t) { (x).val })
 
 static inline unsigned long
 __get_phys (unsigned long addr)
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 339920fdf9ed..2811aef4a636 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -1031,7 +1031,9 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
  ((long)(offset) << (PAGE_SHIFT + 8UL))) \
  } )
 #define __pte_to_swp_entry(pte)((swp_entry_t) { pte_val(pte) })
+#define __pmd_to_swp_entry(pmd)((swp_entry_t) { pmd_val(pmd) })
 #define __swp_entry_to_pte(x)  ((pte_t) { (x).val })
+#define __swp_entry_to_pmd(x)  ((pmd_t) { (x).val })
 
 int page_in_phys_avail(unsigned long paddr);
 
-- 
2.17.0

Re: [PATCH 0/4] ALSA: Fix year 2038 issue for sound subsystem, alternative

2018-04-26 Thread Arnd Bergmann

On Thu, Apr 26, 2018 at 3:30 PM, Jaroslav Kysela  wrote:
> On 26.4.2018 at 14:44, Arnd Bergmann wrote:
>> I've tried the suggestion from Jaroslav, doing a minimal change to the
>> UAPI headers to keep the existing binary interface. As he predicted,
>> this is a much simpler set of kernel changes, but we will pay for that
>> with added complexity in alsa-lib.
>>
>> The first two patches in this series are taken from Baolin's patch
>> set, with a small bugfix folded in to avoid a compile-time regression.
>>
>> The other two patches are to redefine the UAPI and to deprecate
>> the support for CLOCK_REALTIME time stamps, which we can no longer
>> allow with user space that we expect to survive beyond 2038.
>>
>> Overall, I'd still be happier with Baolin's approach since it allows
>> us to keep compatibility with CLOCK_REALTIME users and requires
>> fewer changes in user space, but this would work as well.
>
> Hi Arnd,
>
>   Thanks for your work. I proposed a bit different implementation. Example:
>
> struct snd_example {
>   struct snd_native_timespec tstamp;
>   
>   u64 tstamp_sec64; /* use the reserved[] array for this */
> };
>
>   So tstamp contains the current 32-bit tv_sec/tv_nsec and the full
> 64-bit value is in tstamp_sec64. In this way, we can transfer any type
> of the timespec64 values and it's backward compatible to retain the
> binary compatibility. The protocol versions should be increased to let
> the userspace know about the new 64-bit fields.

Right, I went in a slightly different way since the intention was to keep
the interface simple. I think we can either force the use of monotonic
times or extend it to 64-bit CLOCK_REALTIME stamps, but the
monotonic stamps seem much better for multiple reasons (i.e. skipping)
if you want to avoid introducing new ioctls.
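
For readers following the thread, the two-field layout proposed in the quoted
text could be sketched roughly as below. This is only an illustration under
stated assumptions: `struct example_timespec32` stands in for the existing
32-bit stamp, the `has_sec64` flag stands in for whatever protocol-version
check user space would perform, and `example_tstamp_seconds()` is a
hypothetical helper, not an alsa-lib or kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the existing 32-bit timestamp. */
struct example_timespec32 {
	int32_t tv_sec;
	int32_t tv_nsec;
};

/* Sketch of the proposed structure: old stamp plus full 64-bit seconds. */
struct snd_example {
	struct example_timespec32 tstamp; /* unchanged binary layout */
	uint64_t tstamp_sec64;            /* would live in reserved[] space */
};

/*
 * Pick the seconds value a consumer would use. has_sec64 models a
 * protocol-version check (an assumption of this sketch): new kernels
 * fill tstamp_sec64, old ones leave only the 32-bit field valid.
 */
static int64_t example_tstamp_seconds(const struct snd_example *ex,
				      int has_sec64)
{
	if (has_sec64)
		return (int64_t)ex->tstamp_sec64;
	return (int64_t)ex->tstamp.tv_sec;
}
```

The branch in the helper is the kind of extra handling that every consumer of
such a structure would need, which is the added complexity discussed next.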

The added complexity of having two timestamps in a single structure
means we don't end up with much simpler code than what Baolin
proposed, which mostly just moves the existing compat_ioctl()
to the native 32-bit handler but does not add anything new that requires
library changes.

His tread patch and my mmap patch both do add some complexity
but then we also need some of that with your suggestions for
tread.

>   The timer read protocol must be updated, because the stream will
> change, so I am fine to add new ioctl (like originally proposed).

With forced monotonic times, we can skip that update and keep
using the existing stream format.

>   The alsa-lib defines timespec only if posix defines are not set so
> glibc's time.h does not define the timespec structure - it may be improved.

Yes, we definitely need to improve that, since any application that
relies on the timespec definition to come from alsa would otherwise
get a structure with a 64-bit tv_sec but incorrect padding on tv_nsec
(no padding on i386, padding on the wrong side for big-endian
architectures).

One way out would be to define snd_timestamp_t and
snd_htimestamp_t in terms of snd_monotonic_timestamp
from the kernel header and let it still have the traditional layout
even for applications built with 64-bit time_t.

The downside is again that applications may break when they
cast between snd_htimestamp_t and timespec pointers.

 Arnd

[RFC PATCH 9/9] mm: migrate: enable thp migration for all possible architectures.

2018-04-26 Thread Zi Yan

From: Zi Yan 

Remove CONFIG_ARCH_ENABLE_THP_MIGRATION. thp migration is enabled along
with transparent hugepage and can be toggled via
/sys/kernel/mm/transparent_hugepage/enable_thp_migration.

Signed-off-by: Zi Yan 
Cc: linux...@kvack.org
Cc: Vineet Gupta 
Cc: linux-snps-...@lists.infradead.org
Cc: Russell King 
Cc: Christoffer Dall 
Cc: Marc Zyngier 
Cc: linux-arm-ker...@lists.infradead.org
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Steve Capper 
Cc: Kristina Martsenko 
Cc: Dan Williams 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "Kirill A. Shutemov" 
Cc: x...@kernel.org
Cc: Ralf Baechle 
Cc: James Hogan 
Cc: Michal Hocko 
Cc: linux-m...@linux-mips.org
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Aneesh Kumar K.V" 
Cc: Ram Pai 
Cc: Balbir Singh 
Cc: Naoya Horiguchi 
Cc: linuxppc-...@lists.ozlabs.org
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Janosch Frank 
Cc: linux-s...@vger.kernel.org
Cc: "David S. Miller" 
Cc: sparcli...@vger.kernel.org
Cc: "Huang, Ying" 
---
 arch/x86/Kconfig   |  4 
 arch/x86/include/asm/pgtable.h |  2 --
 fs/proc/task_mmu.c |  2 --
 include/asm-generic/pgtable.h  | 21 ++---
 include/linux/huge_mm.h|  9 -
 include/linux/swapops.h|  4 +---
 mm/Kconfig |  3 ---
 mm/huge_memory.c   | 27 ++-
 mm/migrate.c   |  6 ++
 mm/rmap.c  |  5 ++---
 10 files changed, 29 insertions(+), 54 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0fa71a78ec99..e73954e3eef7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2423,10 +2423,6 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
def_bool y
depends on X86_64 && HUGETLB_PAGE && MIGRATION
 
-config ARCH_ENABLE_THP_MIGRATION
-   def_bool y
-   depends on X86_64 && TRANSPARENT_HUGEPAGE
-
 menu "Power management and ACPI options"
 
 config ARCH_HIBERNATION_HEADER
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b444d83cfc95..f9f54d9b39e3 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1193,7 +1193,6 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
 }
 
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
 {
return pmd_set_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
@@ -1209,7 +1208,6 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
return pmd_clear_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
 }
 #endif
-#endif
 
 #define PKRU_AD_BIT 0x1
 #define PKRU_WD_BIT 0x2
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index dd1b2aeb01e8..07a2f028d29a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1326,7 +1326,6 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
frame = pmd_pfn(pmd) +
((addr & ~PMD_MASK) >> PAGE_SHIFT);
}
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
else if (is_swap_pmd(pmd)) {
swp_entry_t entry = pmd_to_swp_entry(pmd);
unsigned long offset = swp_offset(entry);
@@ -1340,7 +1339,6 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
VM_BUG_ON(!is_pmd_migration_entry(pmd));
page = migration_entry_to_page(entry);
}
-#endif
 
if (page && page_mapcount(page) == 1)
flags |= PM_MMAP_EXCLUSIVE;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index f59639afaa39..9dacdd203131 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -674,24 +674,7 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
 #define arch_start_context_switch(prev)do {} while (0)
 #endif
 
-#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
-#ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
-static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
-{
-   return pmd;
-}
-
-static inline int pmd_swp_soft_dirty(pmd_t pmd)
-{
-   return 0;
-}
-
-static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
-{
-   return pmd;
-}
-#endif
-#else /* !CONFIG_HAVE_ARCH_SOFT_DIRTY */
+#ifndef CONFIG_HAVE_ARCH_SOFT_DIRTY
 static inline int pte_soft_dirty(pte_t pte)
 {
return 0;
@@ -946,7 +929,7 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
 * redundant with !pmd_present().
 */
if (pmd_none(pmdval) || pmd_trans_huge(pmdval) ||
-   (IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION) && !pmd_present(pmdval)))
+   (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && !pmd_present(pmdval)))
return 1;
if (unlikely(pmd_bad(pmdval))) {
pmd_clear_bad(pmd);
diff --git a/include/linux/huge_mm.h b/include/

Re: [PATCH V2 1/2] sched/fair: Rearrange select_task_rq_fair() to optimize it

2018-04-26 Thread Valentin Schneider

Hi,

LGTM. Tiny inline comment but TBH might not be worth it.

FWIW: Reviewed-by: Valentin Schneider 

On 26/04/18 11:30, Viresh Kumar wrote:
> Rearrange select_task_rq_fair() a bit to avoid executing some
> conditional statements in a few specific code paths. That gets rid of the
> goto as well.
> 
> This shouldn't result in any functional changes.
> 
> Signed-off-by: Viresh Kumar 
> Tested-by: Rohit Jain 
> 
> ---
> V1->V2:
> - Optimize a bit more and get rid of affine_sd variable (Valentin)
> - Add unlikely while checking for non-NULL sd and add fast/slow path
>   comments (Joel)
> - Add tested-by from Rohit.
> 
>  kernel/sched/fair.c | 37 -
>  1 file changed, 16 insertions(+), 21 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 54dc31e7ab9b..84fc74ddbd4b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6613,7 +6613,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>  static int
>  select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
>  {
> - struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
> + struct sched_domain *tmp, *sd = NULL;
>   int cpu = smp_processor_id();
>   int new_cpu = prev_cpu;
>   int want_affine = 0;
> @@ -6636,7 +6636,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>*/
>   if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
>   cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> - affine_sd = tmp;
> + if (cpu != prev_cpu)
> + new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);

This cpu != prev_cpu check could be folded into wake_affine() to make this
look a little neater, but we might want to keep things as is to avoid a
function call (it's only ever called once so it might get inlined, but AFAIK
that's not guaranteed).

> +
> + sd = NULL; /* Prefer wake_affine over balance flags */
>   break;
>   }
>  
> @@ -6646,33 +6649,25 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>   break;
>   }
>  
> - if (affine_sd) {
> - sd = NULL; /* Prefer wake_affine over balance flags */
> - if (cpu == prev_cpu)
> - goto pick_cpu;
> -
> - new_cpu = wake_affine(affine_sd, p, cpu, prev_cpu, sync);
> - }
> + if (unlikely(sd)) {
> + /* Slow path */
>  
> - if (sd && !(sd_flag & SD_BALANCE_FORK)) {
>   /*
>* We're going to need the task's util for capacity_spare_wake
>* in find_idlest_group. Sync it up to prev_cpu's
>* last_update_time.
>*/
> - sync_entity_load_avg(&p->se);
> - }
> -
> - if (!sd) {
> -pick_cpu:
> - if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
> - new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> + if (!(sd_flag & SD_BALANCE_FORK))
> + sync_entity_load_avg(&p->se);
>  
> - if (want_affine)
> - current->recent_used_cpu = cpu;
> - }
> - } else {
>   new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
> + } else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
> + /* Fast path */
> +
> + new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> +
> + if (want_affine)
> + current->recent_used_cpu = cpu;
>   }
>   rcu_read_unlock();
>  
>

Re: [PATCH v3] PCI / PM: Always check PME wakeup capability for runtime wakeup support

2018-04-26 Thread Rafael J. Wysocki

On Thursday, April 26, 2018 3:55:45 PM CEST Bjorn Helgaas wrote:
> On Fri, Apr 13, 2018 at 09:29:56AM +0200, Rafael J. Wysocki wrote:
> > On Friday, April 13, 2018 8:58:11 AM CEST Kai Heng Feng wrote:
> > > Hi Bjorn and Rafael,
> > > 
> > > > > On Apr 1, 2018, at 12:40 AM, Kai-Heng Feng wrote:
> > > >
> > > > USB controller ASM1042 stops working after commit de3ef1eb1cd0 ("PM /
> > > > core: Drop run_wake flag from struct dev_pm_info").
> > > >
> > > > The device in question is not power managed by platform firmware,
> > > > furthermore, it only supports PME# from D3cold:
> > > > Capabilities: [78] Power Management version 3
> > > >Flags: PMEClk- DSI- D1- D2- AuxCurrent=55mA 
> > > > PME(D0-,D1-,D2-,D3hot-,D3cold+)
> > > >Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > > >
> > > > Before commit de3ef1eb1cd0, the device never gets runtime suspended.
> > > > After that commit, the device gets runtime suspended, so it does not
> > > > respond to any PME#.
> 
> Apologies for my lack of PM expertise.  I don't think the device would
> *respond* to PME#, would it?  I would think the device would
> potentially *generate* a PME#.

Right.

> And I guess since this device can generate PME# only from D3cold, the
> implication is that runtime suspending the device may put it into D1,
> D2, or D3hot, but not D3cold?  Is that an axiom of the runtime suspend
> design?

No, it isn't.

Runtime PM is expected to only put devices into D-states from where they
can generate PME.

Before the problematic change it would just hold the device in question in D0,
but after that change the device will be suspended (in which case it will end
up in D3hot which is incorrect).

> > > > usb_hcd_pci_probe() mandatorily calls device_wakeup_enable(), hence
> > > > device_can_wakeup() in pci_dev_run_wake() always returns true.
> 
> I think "mandatorily" means "always" or "unconditionally", right?
> 
> > > > So pci_dev_run_wake() needs to check PME wakeup capability as its first
> > > > condition.
> > > >
> > > > In addition, change wakeup flag passed to pci_target_state() from false
> > > > to true, because we want to find the deepest state that the device can
> > > > still generate PME#.
> 
> Is this a separate bug fix?  I don't understand how it fits in here
> because the wakeup flag means "Whether or not wakeup functionality
> will be enabled for the device", and you're not changing anything
> about whether wakeup functionality will be enabled.

For runtime PM the "wakeup" argument of pci_target_state() should always be
"true", so technically this may be regarded as a separate issue, but this
change is needed as a functional fix for the device in question along with
the reordering.

Since technically there is a state from which the device can signal PME,
device_can_wakeup() returns "true" for it, but this isn't sufficient for
pci_dev_run_wake() to return "true" (because that state is D3cold and
the platform cannot power-manage the device, so the device cannot be put
into D3cold directly).  That's the first thing that needs to be changed.

On top of that, we need to look for a state from which the device can
generate PME.
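
To make the two checks above concrete, here is a minimal standalone sketch in C. It is not the drivers/pci/pci.c code: struct fake_dev, target_state(), can_runtime_wake() and the bit layout of pme_support are assumptions made for illustration only, and the platform handling is reduced to a single flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical D-state numbering, used only in this sketch. */
enum dstate { D0 = 0, D1, D2, D3HOT, D3COLD };

struct fake_dev {
	unsigned int pme_support;	/* bit n set: PME# possible from state n */
	bool platform_pm;		/* platform can power-manage the device */
};

/*
 * Pick the deepest state the device can be put into. With wakeup set,
 * walk towards D0 until a state that can generate PME# is found.
 */
static enum dstate target_state(const struct fake_dev *dev, bool wakeup)
{
	int s;

	assert(dev);
	/* Without platform support, D3cold cannot be entered directly. */
	s = dev->platform_pm ? D3COLD : D3HOT;

	if (wakeup)
		while (s > D0 && !(dev->pme_support & (1u << s)))
			s--;
	return (enum dstate)s;
}

/* First check PME capability at all, then from the target state. */
static bool can_runtime_wake(const struct fake_dev *dev)
{
	assert(dev);
	if (!dev->pme_support)
		return false;
	return dev->pme_support & (1u << target_state(dev, true));
}
```

For a device like the one in this report (PME# from D3cold only, no platform power management), target_state() with wakeup set falls all the way back to D0 and can_runtime_wake() returns false, so the device would stay in D0 instead of ending up in D3hot.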

> > > > > Fixes: de3ef1eb1cd0 ("PM / core: Drop run_wake flag from struct dev_pm_info")
> > > > Cc: sta...@vger.kernel.org # 4.13+
> > > > Signed-off-by: Kai-Heng Feng 
> > > > ---
> > > > v3: State the reason why the wakeup flag gets changed.
> > > >
> > > > v2: Explicitly check dev->pme_support.
> > > 
> > > If this patch is good enough, I am hoping it can get merged in v4.17.
> > 
> > OK
> > 
> > Bjorn, if you want to take this:
> > 
> >  Reviewed-by: Rafael J. Wysocki 
> > 
> > Otherwise please let me know and I'll queue it up.
> 
> de3ef1eb1cd0 went through your tree, so I think this fix should go
> through your tree, too.
> 
> Acked-by: Bjorn Helgaas 

OK

> Not directly related to this patch, but I think these comments in
> pci_target_state() are slightly misleading:
> 
>* Call the platform to choose the target state of the device
>* and enable wake-up from this state if supported.
> 
>* Find the deepest state from which the device can generate
>* wake-up events, make it the target state and enable device
>* to generate PME#.
> 
> AFAICT, pci_target_state() does not actually "enable wake-up from this
> state" or "enable device to generate PME#".

Right, the comments appear to be stale, I'll send a patch to update them.

Thanks,
Rafael

Re: [PATCHv2 1/3] dt-bindings: misc: achc: Make ezport distinguishable

2018-04-26 Thread Sebastian Reichel

Hi,

On Fri, Apr 13, 2018 at 10:33:05AM -0500, Rob Herring wrote:
> On Mon, Apr 9, 2018 at 4:13 PM, Sebastian Reichel wrote:
> > Hi,
> >
> > On Mon, Apr 09, 2018 at 01:57:27PM -0500, Rob Herring wrote:
> >> On Tue, Mar 27, 2018 at 03:52:57PM +0200, Sebastian Reichel wrote:
> >> > This updates the GE ACHC binding, so that different compatible
> >> > strings are used for the programming interface, which is the
> >> > ezport interface from NXP MK20FN1M0VMD12 and the microcontroller's
> >> > normal SPI interface.
> >> >
> >> > Signed-off-by: Sebastian Reichel 
> >> > ---
> >> >  Documentation/devicetree/bindings/misc/ge-achc.txt | 19 
> >> > ---
> >> >  1 file changed, 16 insertions(+), 3 deletions(-)
> >> >
> >> > diff --git a/Documentation/devicetree/bindings/misc/ge-achc.txt 
> >> > b/Documentation/devicetree/bindings/misc/ge-achc.txt
> >> > index 77df94d7a32f..6c6bd6568504 100644
> >> > --- a/Documentation/devicetree/bindings/misc/ge-achc.txt
> >> > +++ b/Documentation/devicetree/bindings/misc/ge-achc.txt
> >> > @@ -7,7 +7,13 @@ Note: This device does not expose the peripherals as 
> >> > USB devices.
> >> >
> >> >  Required properties:
> >> >
> >> > -- compatible : Should be "ge,achc"
> >> > +- compatible : Should be
> >> > +  "ge,achc" (normal interface)
> >> > +  "ge,achc-ezport" (flashing interface)
> >> > +
> >> > +Required properties (flashing interface only):
> >> > +
> >> > +- reset-gpios: GPIO Specifier for the reset GPIO
> >>
> >> Does the reset only affect the flashing interface and are the data pins
> >> shared? If not for both, then I think the correct thing to do here is
> >> just extend reg to support multiple values to represent multiple chip
> >> selects.
> >
> > reset affects the whole chip and the same spi data/clock pins are
> > being used, so extending reg should work. The flashing cannot happen
> > with the same speed, though. I'm currently encoding this by using
> > different "spi-max-frequency" properties. I suppose I could limit
> > it in the driver instead. I tried to come up with an example for
> > your suggestion. Is this what you had in mind?
> 
> If the max frequency is the device max, then that should be in the
> driver. spi-max-frequency should really only be needed if the
> frequency is less than the max of either the host or device.

Right.
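
A small sketch of that rule in C, just to spell it out. The helper name spi_effective_hz() and the convention that 0 means "no limit given" are assumptions for this example and are not part of the SPI core API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Effective bus clock: the lower of the host and device maximum,
 * further reduced by an optional board-level limit (0 = not set),
 * which is the role spi-max-frequency plays in the device tree.
 */
static uint32_t spi_effective_hz(uint32_t host_max, uint32_t dev_max,
				 uint32_t dt_max)
{
	uint32_t hz = host_max < dev_max ? host_max : dev_max;

	assert(host_max && dev_max);
	if (dt_max && dt_max < hz)
		hz = dt_max;
	return hz;
}
```

With the device maximum known to the driver, the property only changes the result when it is lower than both other limits, which matches the comment above.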

> > &spi_controller {
> > achc@multiple {
> 
> @0
> 
> unit addresses are the 1st address.

Ok.

> > /* 0 = flashing interface, 1 = normal interface */
> > reg = <0>, <1>;
> 
> You may want to put the normal interface first as that is the primary
> interface and would still work assuming the OS ignored extra entries.
> 
> > compatible = "ge,achc";
> > reset-gpios = <&gpio42 23 ACTIVE_LOW>;
> > spi-max-frequency = <42>; /* max speed for normal operation */
> 
> 42 Hz?

This was just a random number as example.

I had a look at implementing this and the Linux SPI core does not
expect any devices with multiple chip selects. I can try to
implement it, but I think it makes sense to gather some feedback
from Mark first (added to Cc).

@Mark
Patch Series: https://lwn.net/Articles/750177/
Binding Discussion: https://patchwork.kernel.org/patch/10310109/

-- Sebastian



Re: [RFC PATCH 1/2] ACPI / PNP: Don't add "enumeration_by_parent" devices

2018-04-26 Thread Mika Westerberg

On Thu, Apr 26, 2018 at 03:23:17PM +0100, John Garry wrote:
> Not that I know about. Can you describe this method? I guess I also don't
> need to set the mfd_cell pnpid either for this special case device.

There is some documentation in "MFD devices" chapter of
Documentation/acpi/enumeration.txt at least.

Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-26 Thread Niklas Söderlund

Hi Vincent,

Thanks for all your help.

On 2018-04-26 12:31:33 +0200, Vincent Guittot wrote:
> Hi Niklas,
> 
> Le Thursday 26 Apr 2018 à 00:56:03 (+0200), Niklas Söderlund a écrit :
> > Hi Vincent,
> > 
> > > Here are the results, sorry for the delay.
> > 
> > On 2018-04-23 11:54:20 +0200, Vincent Guittot wrote:
> > 
> > [snip]
> > 
> > > 
> > > Thanks for the report. Can you re run with the following trace-cmd 
> > > sequence ? My previous sequence disables ftrace events
> > > 
> > > trace-cmd reset > /dev/null
> > > trace-cmd start -b 4 -p function -l dump_backtrace:traceoff -e sched 
> > > -e cpu_idle -e cpu_frequency -e timer -e ipi -e irq -e printk
> > > trace-cmd start -b 4 -p function -l dump_backtrace -e sched -e 
> > > cpu_idle -e cpu_frequency -e timer -e ipi -e irq -e printk
> > > 
> > > I have updated the patch and added traces to check that scheduler returns 
> > > from idle_balance function and doesn't stay stuck
> > 
> > > Once more I applied the change below on top of c18bb396d3d261eb ("Merge 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").
> > > 
> > > This time the result of 'trace-cmd report' is so large I do not include 
> > > it here, but I attach the trace.dat file. Not sure why but the timing of 
> > > sending the NMI to the backtrace print is different (but content the 
> > > same AFAIK) so on the off chance it can help figure this out:
> > 
> 
> Thanks for the trace, I have been able to catch a problem with it.
> Could you test the patch below to confirm that the problem is solved ?
> The patch applies on top of
> c18bb396d3d261eb ("Merge 
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")

I can confirm that with the patch below I can no longer reproduce the 
problem. Thanks!
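
For anyone skimming the thread, the effect of moving the "out" label in the quoted patch can be sketched in standalone C. This models only the early-exit path of idle_balance(); struct fake_rq, idle_balance_sketch() and the "fixed" switch are made up for the illustration and do not exist in kernel/sched/fair.c.

```c
#include <assert.h>
#include <stdbool.h>

struct fake_rq {
	int nr_running;		/* stands in for this_rq->cfs.h_nr_running */
};

/* Returns non-zero when the CPU must not go idle. */
static int idle_balance_sketch(struct fake_rq *rq, int enqueued_meanwhile,
			       bool fixed)
{
	int pulled_task = 0;

	assert(rq);
	/*
	 * Early-exit path: the rq lock was dropped while updating the
	 * blocked load, so another CPU may have enqueued a task here.
	 */
	rq->nr_running += enqueued_meanwhile;
	if (fixed)
		goto out;	/* new label position */
	goto out_old;		/* old label position */

out:
	/* With the fix, this check is on the common path. */
	if (rq->nr_running && !pulled_task)
		pulled_task = 1;
out_old:
	return pulled_task;
}
```

With the old label position the function reports that nothing is runnable even though a task was enqueued in the meantime. With the new position the check is always reached.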

> 
> From: Vincent Guittot 
> Date: Thu, 26 Apr 2018 12:19:32 +0200
> Subject: [PATCH] sched/fair: fix the update of blocked load when newly idle
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> With commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle"),
> we release the rq->lock when updating blocked load of idle CPUs. This opens
> a time window during which another CPU can add a task to this CPU's cfs_rq.
> The check for newly added task of idle_balance() is not in the common path.
> Move the out label to include this check.
> 
> Fixes: 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
> Reported-by: Heiner Kallweit 
> Reported-by: Niklas Söderlund 
> Signed-off-by: Vincent Guittot 
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0951d1c..15a9f5e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9847,6 +9847,7 @@ static int idle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>   if (curr_cost > this_rq->max_idle_balance_cost)
>   this_rq->max_idle_balance_cost = curr_cost;
>  
> +out:
>   /*
>* While browsing the domains, we released the rq lock, a task could
>* have been enqueued in the meantime. Since we're not going idle,
> @@ -9855,7 +9856,6 @@ static int idle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>   if (this_rq->cfs.h_nr_running && !pulled_task)
>   pulled_task = 1;
>  
> -out:
>   /* Move the next balance forward */
>   if (time_after(this_rq->next_balance, next_balance))
>   this_rq->next_balance = next_balance;
> -- 
> 2.7.4
> 
> 
> 
> [snip]
> 

-- 
Regards,
Niklas Söderlund

[GIT PULL] Power management fixes for v4.17-rc3

2018-04-26 Thread Rafael J. Wysocki

Hi Linus,

Please pull from the tag

 git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
 pm-4.17-rc3

with top-most commit e140c4af1b63125dff629e8339793390201e2470

 Merge branches 'acpi-pm' and 'pm-cpufreq'

on top of commit 6d08b06e67cd117f6992c46611dfb4ce267cd71e

 Linux 4.17-rc2

to receive power management fixes for 4.17-rc3.

These are a Low Power S0 Idle quirk, a hibernation handling fix for
the PCI bus type and a brcmstb-avs-cpufreq driver fixup removing
development debug code from it.

Specifics:

 - Blacklist the Low Power S0 Idle _DSM on ThinkPad X1 Tablet(2016)
   where it causes issues and make it use ACPI S3 which works instead
   of the non-working suspend-to-idle by default (Chen Yu).

 - Fix the handling of hibernation in the PCI core for devices with
   the DPM_FLAG_SMART_SUSPEND flag set to fix a regression affecting
   intel-lpss I2C devices (Mika Westerberg).

 - Drop development debug code from the brcmstb-avs-cpufreq driver
   (Markus Mayer).

Thanks!


---

Chen Yu (1):
  ACPI / PM: Blacklist Low Power S0 Idle _DSM for ThinkPad X1 Tablet(2016)

Markus Mayer (1):
  cpufreq: brcmstb-avs-cpufreq: remove development debug support

Mika Westerberg (1):
  PCI / PM: Do not clear state_saved in pci_pm_freeze() when smart
suspend is set

---

 drivers/acpi/sleep.c  |  13 ++
 drivers/cpufreq/Kconfig.arm   |  10 --
 drivers/cpufreq/brcmstb-avs-cpufreq.c | 323 +-
 drivers/pci/pci-driver.c  |   5 +-
 4 files changed, 17 insertions(+), 334 deletions(-)

Re: [RFC PATCH 09/35] ovl: stack file ops

2018-04-26 Thread Miklos Szeredi

On Thu, Apr 26, 2018 at 4:13 PM, Vivek Goyal  wrote:
> On Thu, Apr 12, 2018 at 05:08:00PM +0200, Miklos Szeredi wrote:
>
> [..]
>> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
>> new file mode 100644
>> index ..a0b606885c41
>> --- /dev/null
>> +++ b/fs/overlayfs/file.c
>> @@ -0,0 +1,76 @@
>> +/*
>> + * Copyright (C) 2017 Red Hat, Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published 
>> by
>> + * the Free Software Foundation.
>> + */
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include "overlayfs.h"
>> +
>> +static struct file *ovl_open_realfile(const struct file *file)
>> +{
>> + struct inode *inode = file_inode(file);
>> + struct inode *upperinode = ovl_inode_upper(inode);
>> + struct inode *realinode = upperinode ?: ovl_inode_lower(inode);
>> + struct file *realfile;
>> + const struct cred *old_cred;
>> +
>> + old_cred = ovl_override_creds(inode->i_sb);
>> + realfile = path_open(&file->f_path, file->f_flags | O_NOATIME,
>> +  realinode, current_cred(), false);
>> + revert_creds(old_cred);
>> +
>> + pr_debug("open(%p[%pD2/%c], 0%o) -> (%p, 0%o)\n",
>> +  file, file, upperinode ? 'u' : 'l', file->f_flags,
>> +  realfile, IS_ERR(realfile) ? 0 : realfile->f_flags);
>> +
>> + return realfile;
>> +}
>> +
>> +static int ovl_open(struct inode *inode, struct file *file)
>> +{
>> + struct dentry *dentry = file_dentry(file);
>
> Hi Miklos,
>
> There is one thing I can't wrap my head around, so I better ask.
>
> file_dentry() will call ovl_d_real() and try to find dentry based on
> inode installed in f->f_inode. If ovl_d_real() can't find inode dentry
> matching the passed in inode, it warns.
>
> Assume, I have a stacked overlay configuration. Let me call top level
> overlay layer ovl1 and lower level overlay layer ovl2. Say I open a
> file foo.txt. Now ovl_open() in ovl1 decides that realinode is a lower
> inode and installs that inode f->f_inode of realfile. (This should be
> ovl2 layer inode, let me call it ovl2_inode). Now ovl_open() of ovl2 layer
> will be called and it will call file_dentry() and will look for dentry
> corresponding to ovl2_inode. I am wondering what if a copy up of foo.txt
> was triggered in ovl1 and by the time we called ovl_d_real(dentry,
> ovl2_inode), it will start comparing with inode of ovl1_upper and never
> find ovl2_inode.

Okay, so we've modified ovl_d_real() to allow returning the overlay
dentry itself.  This is important: when we fail to match ovl1_upper
with ovl2_inode, we'll go on to get ovl2_dentry and call d_real()
recursively.  That recursive call should match the inode, return it to
outer ovl_d_real(), which again will match the inode and return
without warning.

> IOW, I am not able to figure out how do we protect against copy up races
> when ovl_open() calls file_dentry().

Racing with a copy up cannot matter, since we'll continue looking for
the inode in the layers and stacks below, regardless of whether we
checked the upper dentry or not.
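
A simplified model of that lookup in C, in case it helps. This is not the ovl_d_real() implementation: struct node, d_real_sketch() and the use of plain inode numbers are assumptions made for the sketch. A node with an upper or lower pointer stands for an overlay dentry, a node without them for a real one.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	int ino;			/* inode this dentry refers to */
	const struct node *upper;	/* upper real dentry, if any */
	const struct node *lower;	/* lower dentry, may be an overlay */
};

/*
 * Return the dentry matching ino, looking at the dentry itself, then
 * at the upper, then recursing into the lower layer. NULL means no
 * match was found (the case that would trigger the warning).
 */
static const struct node *d_real_sketch(const struct node *d, int ino)
{
	const struct node *r;

	if (!d)
		return NULL;
	if (d->ino == ino)
		return d;	/* overlay dentry itself or a real dentry */
	r = d_real_sketch(d->upper, ino);
	if (r)
		return r;
	/* A mismatch on upper does not stop the search. */
	return d_real_sketch(d->lower, ino);
}
```

In the ovl1/ovl2 scenario, a copy up only adds an upper node to ovl1. The search for ovl2_inode still continues into the lower layer and matches there.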

Does that make it clearer?

Thanks,
Miklos

[GIT PULL] ACPI fixes for v4.17-rc3

2018-04-26 Thread Rafael J. Wysocki

Hi Linus,

Please pull from the tag

 git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
 acpi-4.17-rc3

with top-most commit bd6dff55de7acb2e5065e69706400c41b1bd0521

 Merge branches 'acpi-watchdog', 'acpi-button' and 'acpi-video'

on top of commit 6d08b06e67cd117f6992c46611dfb4ce267cd71e

 Linux 4.17-rc2

to receive ACPI fixes for 4.17-rc3.

These are two watchdog-related fixes, a fix for a backlight regression
from the 4.16 cycle that unfortunately was propagated to -stable and
a button module modification to prevent graphics driver modules from
failing to load due to unmet dependencies if ACPI is disabled from
the kernel command line.

Specifics:

 - Change the ACPI subsystem initialization ordering to initialize
   the WDAT watchdog before reserving PNP motherboard resources so
   as to allow the watchdog to allocate its resources before the PNP
   code gets to them and prevents it from working correctly (Mika
   Westerberg).

 - Add a quirk for Lenovo Z50-70 to use the iTCO watchdog instead of
   the WDAT one which conflicts with the RTC on that platform (Mika
   Westerberg).

 - Avoid breaking backlight handling on Dell XPS 13 2013 model by
   allowing laptops to use the ACPI backlight by default even if they
   are Windows 8-ready in principle (Hans de Goede).

 - Prevent the ACPI button module from failing to load if ACPI is
   disabled via the kernel command line so as to allow graphics
   driver modules depending on the ACPI button module to load in
   that case (Ard Biesheuvel).

Thanks!


---

Ard Biesheuvel (1):
  ACPI / button: make module loadable when booted in non-ACPI mode

Hans de Goede (1):
  ACPI / video: Only default only_lcd to true on Win8-ready _desktops_

Mika Westerberg (2):
  ACPI / scan: Initialize watchdog before PNP
  ACPI / watchdog: Prefer iTCO_wdt on Lenovo Z50-70

---

 drivers/acpi/acpi_video.c| 27 ++--
 drivers/acpi/acpi_watchdog.c | 59 
 drivers/acpi/button.c| 24 +-
 drivers/acpi/scan.c  |  2 +-
 4 files changed, 98 insertions(+), 14 deletions(-)

Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-26 Thread James Bottomley

On Thu, 2018-04-26 at 10:28 -0400, Mikulas Patocka wrote:
> 
> On Thu, 26 Apr 2018, Michal Hocko wrote:
> 
> > On Wed 25-04-18 18:42:57, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Wed, 25 Apr 2018, James Bottomley wrote:
> > [...]
> > > > > Kconfig proliferation, conversely, is a bit of a nightmare from both
> > > > > the user and the tester's point of view, so we're trying to avoid it
> > > > > unless absolutely necessary.
> > > > > 
> > > > > James
> > > > 
> > > > I already offered that we don't need to introduce a new kernel option and 
> > > > we can bind this feature to any other kernel option, that is enabled in 
> > > > the debug kernel, for example CONFIG_DEBUG_SG. Michal said no and he said 
> > > > that he wants a new kernel option instead.
> > > 
> > > Just for the record. I didn't say I _want_ a config option. Do not
> > > misinterpret my words. I've said that a config option would be
> > > acceptable if there is no way to deliver the functionality via kernel
> > > package automatically. You haven't provided any argument that would
> > > explain why the kernel package cannot add a boot option. Maybe there are
> > > some but I do not see them right now.
> 
> AFAIK Grub doesn't load per-kernel options from a per-kernel file.
> Even if we hacked grub scripts to add this option, other
> distributions won't.

Perhaps find out beforehand instead of insisting on an approach without
knowing.  On openSUSE the grub config is built from the files in
/etc/grub.d/ so any package can add a kernel option (and various
conditions around activating it) simply by adding a new file.  The
config files are quite sophisticated, so you can add what looks to be a
new kernel, but is really an existing kernel with different options
this way.

James

Re: [PATCH] MAINTAINERS: Remove myself as maintainer

2018-04-26 Thread Arnd Bergmann

On Tue, Apr 17, 2018 at 2:53 PM, Niklas Cassel  wrote:
> From: Niklas Cassel 
>
> I am leaving Axis, so this address will bounce in the not too
> distant future.
>
> Fortunately, I will still be working with the community.
>
> Signed-off-by: Niklas Cassel 
> ---
> Hello arm-soc, could you please pick this up?
> The email address has started to bounce.

Applied to fixes now, thanks

  Arnd

[PATCH v3 net-next 0/2] tcp: mmap: rework zerocopy receive

2018-04-26 Thread Eric Dumazet

syzbot reported a lockdep issue caused by tcp mmap() support.

I implemented Andy Lutomirski's nice suggestions to resolve the
issue and increase scalability as well.

First patch is adding a new getsockopt() operation and changes mmap()
behavior.

Second patch changes tcp_mmap reference program.

v3: change TCP_ZEROCOPY_RECEIVE to be a getsockopt() option
instead of setsockopt(), feedback from Ka-Cheong Poon

v2: Added a missing page align of zc->length in tcp_zerocopy_receive()
Properly clear zc->recv_skip_hint in case user request was completed.

Eric Dumazet (2):
  tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive
  selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE

 include/uapi/linux/tcp.h   |   8 ++
 net/ipv4/tcp.c | 192 +
 tools/testing/selftests/net/tcp_mmap.c |  64 +
 3 files changed, 146 insertions(+), 118 deletions(-)

-- 
2.17.0.484.g0c8726318c-goog

[PATCH v3 net-next 2/2] selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE

2018-04-26 Thread Eric Dumazet

After prior kernel change, mmap() on TCP socket only reserves VMA.

We have to use getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...)
to perform the transfer of pages from skbs in TCP receive queue into such VMA.

struct tcp_zerocopy_receive {
__u64 address;  /* in: address of mapping */
__u32 length;   /* in/out: number of bytes to map/mapped */
__u32 recv_skip_hint;   /* out: amount of bytes to skip */
};

After a successful getsockopt(...TCP_ZEROCOPY_RECEIVE...), @length contains
number of bytes that were mapped, and @recv_skip_hint contains number of bytes
that should be read using conventional read()/recv()/recvmsg() system calls,
to skip a sequence of bytes that can not be mapped, because not properly page
aligned.
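
As a rough userspace illustration of one such call (a sketch, not the selftest code): struct zc_args is a local mirror of the struct shown above, and zc_receive_once() and zc_bytes_consumed() are hypothetical helper names. The option value 35 is taken from the uapi hunk in patch 1/2 and is only defined here if the libc headers do not provide it.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef TCP_ZEROCOPY_RECEIVE
#define TCP_ZEROCOPY_RECEIVE 35	/* value from the uapi hunk in patch 1/2 */
#endif

/* Local mirror of struct tcp_zerocopy_receive as described above. */
struct zc_args {
	uint64_t address;		/* in: address of mapping */
	uint32_t length;		/* in/out: bytes to map / mapped */
	uint32_t recv_skip_hint;	/* out: bytes to read() instead */
};

/* One zerocopy attempt on fd; addr must come from mmap() on that socket. */
static int zc_receive_once(int fd, void *addr, uint32_t chunk,
			   struct zc_args *zc)
{
	socklen_t len = sizeof(*zc);

	assert(zc);
	zc->address = (uint64_t)(uintptr_t)addr;
	zc->length = chunk;
	zc->recv_skip_hint = 0;
	return getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, zc, &len);
}

/* Bytes consumed by one round: mapped bytes plus what read() returned. */
static uint64_t zc_bytes_consumed(const struct zc_args *zc, long read_ret)
{
	uint64_t total;

	assert(zc);
	total = zc->length;
	if (zc->recv_skip_hint && read_ret > 0)
		total += (uint64_t)read_ret;
	return total;
}
```

After a successful call, the caller would consume zc.length bytes at addr and then read() up to zc.recv_skip_hint bytes into an ordinary buffer, which is the loop the tcp_mmap change below implements.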

Signed-off-by: Eric Dumazet 
Cc: Andy Lutomirski 
Cc: Soheil Hassas Yeganeh 
---
 tools/testing/selftests/net/tcp_mmap.c | 64 +++---
 1 file changed, 37 insertions(+), 27 deletions(-)

diff --git a/tools/testing/selftests/net/tcp_mmap.c 
b/tools/testing/selftests/net/tcp_mmap.c
index 
dea342fe6f4e88b5709d2ac37b2fc9a2a320bf44..77f762780199ff1f69f9f6b3f18e72deddb69f5e
 100644
--- a/tools/testing/selftests/net/tcp_mmap.c
+++ b/tools/testing/selftests/net/tcp_mmap.c
@@ -76,9 +76,10 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
+#include 
+#include 
 
 #ifndef MSG_ZEROCOPY
 #define MSG_ZEROCOPY0x400
@@ -134,11 +135,12 @@ void hash_zone(void *zone, unsigned int length)
 void *child_thread(void *arg)
 {
unsigned long total_mmap = 0, total = 0;
+   struct tcp_zerocopy_receive zc;
unsigned long delta_usec;
int flags = MAP_SHARED;
struct timeval t0, t1;
char *buffer = NULL;
-   void *oaddr = NULL;
+   void *addr = NULL;
double throughput;
struct rusage ru;
int lu, fd;
@@ -153,41 +155,46 @@ void *child_thread(void *arg)
perror("malloc");
goto error;
}
+   if (zflg) {
+   addr = mmap(NULL, chunk_size, PROT_READ, flags, fd, 0);
+   if (addr == (void *)-1)
+   zflg = 0;
+   }
while (1) {
struct pollfd pfd = { .fd = fd, .events = POLLIN, };
int sub;
 
poll(&pfd, 1, 1);
if (zflg) {
-   void *naddr;
+   socklen_t zc_len = sizeof(zc);
+   int res;
 
-   naddr = mmap(oaddr, chunk_size, PROT_READ, flags, fd, 
0);
-   if (naddr == (void *)-1) {
-   if (errno == EAGAIN) {
-   /* That is if SO_RCVLOWAT is buggy */
-   usleep(1000);
-   continue;
-   }
-   if (errno == EINVAL) {
-   flags = MAP_SHARED;
-   oaddr = NULL;
-   goto fallback;
-   }
-   if (errno != EIO)
-   perror("mmap()");
+   zc.address = (__u64)addr;
+   zc.length = chunk_size;
+   zc.recv_skip_hint = 0;
+   res = getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE,
+&zc, &zc_len);
+   if (res == -1)
break;
+
+   if (zc.length) {
+   assert(zc.length <= chunk_size);
+   total_mmap += zc.length;
+   if (xflg)
+   hash_zone(addr, zc.length);
+   total += zc.length;
}
-   total_mmap += chunk_size;
-   if (xflg)
-   hash_zone(naddr, chunk_size);
-   total += chunk_size;
-   if (!keepflag) {
-   flags |= MAP_FIXED;
-   oaddr = naddr;
+   if (zc.recv_skip_hint) {
+   assert(zc.recv_skip_hint <= chunk_size);
+   lu = read(fd, buffer, zc.recv_skip_hint);
+   if (lu > 0) {
+   if (xflg)
+   hash_zone(buffer, lu);
+   total += lu;
+   }
}
continue;
}
-fallback:
sub = 0;
while (sub < chunk_size) {
lu = read(fd, buffer + sub, chunk_size - sub);
@@ -228,6 +235,8 @@ v

[PATCH v3 net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive

2018-04-26 Thread Eric Dumazet

When adding tcp mmap() implementation, I forgot that socket lock
had to be taken before current->mm->mmap_sem. syzbot eventually caught
the bug.

Since we can not lock the socket in tcp mmap() handler we have to
split the operation in two phases.

1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
  This operation does not involve any TCP locking.

2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
 the transfer of pages from skbs to one VMA.
  This operation only uses down_read(&current->mm->mmap_sem) after
  holding TCP lock, thus solving the lockdep issue.

This new implementation was suggested by Andy Lutomirski with great details.

Benefits are :

- Better scalability, in case multiple threads reuse VMAs
   (without mmap()/munmap() calls) since mmap_sem won't be write locked.

- Better error recovery.
   The previous mmap() model had to provide the expected size of the
   mapping. If for some reason one part could not be mapped (partial MSS),
   the whole operation had to be aborted.
   With the tcp_zerocopy_receive struct, kernel can report how
   many bytes were successfully mapped, and how many bytes should
   be read to skip the problematic sequence.

- No more memory allocation to hold an array of page pointers.
  16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/

- skbs are freed while mmap_sem has been released
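
The 32 KB figure above follows from simple arithmetic, assuming 4 KB pages and 8-byte pointers (both are assumptions of this sketch, as is the helper name page_array_bytes()):

```c
#include <assert.h>
#include <stddef.h>

/* Size of an array holding one pointer per page of the mapping. */
static size_t page_array_bytes(size_t mapping_bytes, size_t page_size,
			       size_t ptr_size)
{
	size_t nr_pages;

	assert(page_size);
	nr_pages = mapping_bytes / page_size;
	return nr_pages * ptr_size;
}
```

A 16 MB mapping is 4096 pages of 4 KB, and 4096 pointers of 8 bytes each take 32768 bytes.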

Following patch makes the change in tcp_mmap tool to demonstrate
one possible use of mmap() and getsockopt(... TCP_ZEROCOPY_RECEIVE ...)

Note that memcg might require additional changes.

Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
Suggested-by: Andy Lutomirski 
Cc: linux...@kvack.org
Cc: Soheil Hassas Yeganeh 
---
 include/uapi/linux/tcp.h |   8 ++
 net/ipv4/tcp.c   | 192 ---
 2 files changed, 109 insertions(+), 91 deletions(-)

diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 
379b08700a542d49bbce9b4b49b17879d00b69bb..e9e8373b34b9ddc735329341b91f455bf5c0b17c
 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -122,6 +122,7 @@ enum {
 #define TCP_MD5SIG_EXT 32  /* TCP MD5 Signature with extensions */
 #define TCP_FASTOPEN_KEY   33  /* Set the key for Fast Open (cookie) */
 #define TCP_FASTOPEN_NO_COOKIE 34  /* Enable TFO without a TFO cookie */
+#define TCP_ZEROCOPY_RECEIVE   35
 
 struct tcp_repair_opt {
__u32   opt_code;
@@ -276,4 +277,11 @@ struct tcp_diag_md5sig {
__u8tcpm_key[TCP_MD5SIG_MAXKEYLEN];
 };
 
+/* setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) */
+
+struct tcp_zerocopy_receive {
+   __u64 address;  /* in: address of mapping */
+   __u32 length;   /* in/out: number of bytes to map/mapped */
+   __u32 recv_skip_hint;   /* out: amount of bytes to skip */
+};
 #endif /* _UAPI_LINUX_TCP_H */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 
dfd090ea54ad47112fc23c61180b5bf8edd2c736..c10c4a41ad39d6f8ae472882b243c2b70c915546
 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1726,118 +1726,111 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
 }
 EXPORT_SYMBOL(tcp_set_rcvlowat);
 
-/* When user wants to mmap X pages, we first need to perform the mapping
- * before freeing any skbs in receive queue, otherwise user would be unable
- * to fallback to standard recvmsg(). This happens if some data in the
- * requested block is not exactly fitting in a page.
- *
- * We only support order-0 pages for the moment.
- * mmap() on TCP is very strict, there is no point
- * trying to accommodate with pathological layouts.
- */
+static const struct vm_operations_struct tcp_vm_ops = {
+};
+
 int tcp_mmap(struct file *file, struct socket *sock,
 struct vm_area_struct *vma)
 {
-   unsigned long size = vma->vm_end - vma->vm_start;
-   unsigned int nr_pages = size >> PAGE_SHIFT;
-   struct page **pages_array = NULL;
-   u32 seq, len, offset, nr = 0;
-   struct sock *sk = sock->sk;
-   const skb_frag_t *frags;
+   if (vma->vm_flags & (VM_WRITE | VM_EXEC))
+   return -EPERM;
+   vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+
+   /* Instruct vm_insert_page() to not down_read(mmap_sem) */
+   vma->vm_flags |= VM_MIXEDMAP;
+
+   vma->vm_ops = &tcp_vm_ops;
+   return 0;
+}
+EXPORT_SYMBOL(tcp_mmap);
+
+static int tcp_zerocopy_receive(struct sock *sk,
+   struct tcp_zerocopy_receive *zc)
+{
+   unsigned long address = (unsigned long)zc->address;
+   const skb_frag_t *frags = NULL;
+   u32 length = 0, seq, offset;
+   struct vm_area_struct *vma;
+   struct sk_buff *skb = NULL;
struct tcp_sock *tp;
-   struct sk_buff *skb;
int ret;
 
-   if (vma->vm_pgoff || !nr_pages)
+   if (address & (PAGE_SIZE - 1) || address != zc->address)
return -EINVAL;
 
-

Re: [PATCH v3 07/12] KVM: arm/arm64: Adapt vgic_v3_check_base to multiple rdist regions

2018-04-26 Thread Auger Eric

Hi Christoffer,

On 04/26/2018 12:06 PM, Christoffer Dall wrote:
> On Thu, Apr 26, 2018 at 10:29:35AM +0200, Auger Eric wrote:
>> Hi Christoffer,
>> On 04/24/2018 11:07 PM, Christoffer Dall wrote:
>>> On Fri, Apr 13, 2018 at 10:20:53AM +0200, Eric Auger wrote:
>>>> We introduce a new helper to check there is no overlap between
>>>> dist region (if set) and registered rdist regions. This both
>>>> handles the case of legacy single rdist region (implicitly sized
>>>> with the number of online vcpus) and the new case of multiple
>>>> explicitly sized rdist regions.
>>>
>>> I don't understand this change, really.  Is this just a cleanup, or
>>> changing some functionality (why?).
>>>
>>> I think this could have come with the introduction of
>>> vgic_v3_rdist_overlap() before patch 6, and then patch 6 could have been
>>> simplified (hopefully) to just call this "check that nothing in the
>>> world ever collides with itself" function.
>> I have merged this patch and vgic_v3_rd_region_size +
>> vgic_v3_rdist_overlap and put it before this patch.
>>
>> Also I reworked the commit message which was unclear I acknowledge.
>>
>> With respect to using the adapted  vgic_v3_check_base() in
>> vgic_v3_insert_redist_region(), it is less obvious to me.
>>
>> In vgic_v3_insert_redist_region we do the checks *before* inserting the
>> new rdist region in the list of redist regions. While
>> vgic_v3_check_base() does the checks on already registered rdist and
>> dist regions. So I would be tempted to leave
>> vgic_v3_insert_redist_region() implementation as it is.
>>
> 
> ok, but do see my suggestion there to factor out the check, which should
> make that function slightly easier to read.
> 
> Then perhaps you can use that function from vgic_v3_check_base to check
> that each rdist doesn't overlap with the distributor?

I introduced the suggested additional helper, vgic_dist_overlap, to
check a region does not overlap with the distributor region and used it
in vgic_v3_insert_redist_region.

However in  vgic_v3_check_base I do not need it as I already use
vgic_v3_rdist_overlap() which does the job, ie. check the dist against
all registered redists.
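
To show the kind of check meant here, a minimal standalone sketch in C. struct region, regions_overlap() and dist_overlaps_any_rdist() are illustrative names, not the vgic helpers from the series, and address wrap-around is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct region {
	uint64_t base;
	uint64_t size;
};

/* Two half-open ranges [base, base + size) overlap. */
static bool regions_overlap(const struct region *a, const struct region *b)
{
	assert(a && b);
	return a->base < b->base + b->size && b->base < a->base + a->size;
}

/* Check the distributor against every registered redistributor region. */
static bool dist_overlaps_any_rdist(const struct region *dist,
				    const struct region *rdists, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (regions_overlap(dist, &rdists[i]))
			return true;
	return false;
}
```

The same pairwise test serves both callers: one new region against the distributor at insertion time, and the distributor against the whole list at check_base time.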

Thanks

Eric
> 
> Thanks,
> -Christoffer
>

Re: [PATCH v2] HISI LPC: Add Kconfig MFD_CORE dependency

2018-04-26 Thread Arnd Bergmann

On Thu, Apr 19, 2018 at 4:14 PM, John Garry  wrote:
> For ACPI support of the HiSilicon LPC driver we depend
> on MFD_CORE config.
>
> Currently the HiSi LPC Kconfig entry does not define this
> dependency, so add it.
>
> The reason for depending on MFD_CORE in the driver is
> that we model the LPC host as an MFD, in that a platform
> device will be created for each device on the bus.
>
> We do this as we need to modify the resources of these
> derived platform devices, something which we should not
> do to the original devices created in the ACPI scan.
> Details in e0aa1563f894 ("HISI LPC: Add ACPI support").
>
> Fixes: e0aa1563f894 ("HISI LPC: Add ACPI support")
> Reported-and-tested-by: Tan Xiaojun 
> Signed-off-by: John Garry 

Applied to fixes.

  Arnd

Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-26 Thread Mikulas Patocka



On Wed, 25 Apr 2018, James Bottomley wrote:

> > BTW. even developers who compile their own kernel should have this
> > enabled by a CONFIG option - because if the developer sees the option
> > when browsing through menuconfig, he may enable it. If he doesn't see
> > the option, he won't even know that such an option exists.
> 
> I may be an atypical developer but I'd rather have a root canal than
> browse through menuconfig options.  The way to get people to learn
> about new debugging options is to blog about it (or write an lwn.net
> article) which google will find the next time I ask it how I debug XXX.
>  Google (probably as a service to humanity) rarely turns up Kconfig
> options in response to a query.

From my point of view, this feature should be as little disruptive to the 
developer as possible. It should work automatically behind the scenes 
without the developer or the tester even knowing that it is working. From 
this point of view, binding it to CONFIG_DEBUG_SG (or any other commonly 
used debugging option) would be ideal, because driver developers already 
enable CONFIG_DEBUG_SG, so they'll get this kvmalloc test for free.

From your point of view, you should introduce a sysfs file and a kernel 
parameter that no one knows about - and then start blogging about it - to 
let people know. Why would you bother people with this knowledge? They'll 
forget about it anyway and won't turn it on.

Mikulas

Re: [Intel-gfx] 4.17-rc2: Could not determine valid watermarks for inherited state

2018-04-26 Thread Ville Syrjälä

On Thu, Apr 26, 2018 at 10:27:19AM -0400, Dave Jones wrote:
> [1.176131] [drm:i9xx_get_initial_plane_config] pipe A/primary A with fb: 
> size=800x600@32, offset=0, pitch 3200, size 0x1d4c00
> [1.176161] [drm:i915_gem_object_create_stolen_for_preallocated] creating 
> preallocated stolen object: stolen_offset=0x, 
> gtt_offset=0x, size=0x001d5000
> [1.176312] [drm:intel_alloc_initial_plane_obj.isra.127] initial plane fb 
> obj (ptrval)
> [1.176351] [drm:intel_modeset_init] pipe A active planes 0x1
> [1.176456] [drm:drm_atomic_helper_check_plane_state] Plane must cover 
> entire CRTC
> [1.176481] [drm:drm_rect_debug_print] dst: 800x600+0+0
> [1.176494] [drm:drm_rect_debug_print] clip: 1366x768+0+0

OK, so that's the problem right there. The fb we took over from the
BIOS was 800x600, but now we're trying to set up a 1366x768 mode.

We seem to be missing checks to make sure the initial fb is actually
big enough for the mode we're currently using :(
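
A minimal sketch of the sort of check that appears to be missing, written as plain C rather than driver code. The type and function names are hypothetical; the real driver works on DRM framebuffer and mode structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical plain-C stand-in; the real driver uses DRM structures. */
struct size2d {
	unsigned int width;
	unsigned int height;
};

/* True if a framebuffer of size 'fb' can back a full-screen plane for 'mode'. */
static bool fb_covers_mode(struct size2d fb, struct size2d mode)
{
	assert(mode.width > 0 && mode.height > 0);
	return fb.width >= mode.width && fb.height >= mode.height;
}
```

With the sizes from the log (800x600 fb, 1366x768 mode) this returns false, which is the case where the inherited fb should presumably not be reused.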

-- 
Ville Syrjälä
Intel

Re: [PATCH v2 net-next 0/2] tcp: mmap: rework zerocopy receive

2018-04-26 Thread Eric Dumazet



On 04/25/2018 06:20 PM, Soheil Hassas Yeganeh wrote:
> 
> Acked-by: Soheil Hassas Yeganeh 
> 
>

Thanks Soheil for reviewing.

I have changed setsockopt() to getsockopt(), so I chose not to carry your Acked-by.

Please add it back if you agree, thanks !

Re: [RFC PATCH 09/35] ovl: stack file ops

2018-04-26 Thread Vivek Goyal

On Thu, Apr 26, 2018 at 04:43:53PM +0200, Miklos Szeredi wrote:
> On Thu, Apr 26, 2018 at 4:13 PM, Vivek Goyal  wrote:
> > On Thu, Apr 12, 2018 at 05:08:00PM +0200, Miklos Szeredi wrote:
> >
> > [..]
> >> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> >> new file mode 100644
> >> index ..a0b606885c41
> >> --- /dev/null
> >> +++ b/fs/overlayfs/file.c
> >> @@ -0,0 +1,76 @@
> >> +/*
> >> + * Copyright (C) 2017 Red Hat, Inc.
> >> + *
> >> + * This program is free software; you can redistribute it and/or modify it
> >> + * under the terms of the GNU General Public License version 2 as 
> >> published by
> >> + * the Free Software Foundation.
> >> + */
> >> +
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include "overlayfs.h"
> >> +
> >> +static struct file *ovl_open_realfile(const struct file *file)
> >> +{
> >> + struct inode *inode = file_inode(file);
> >> + struct inode *upperinode = ovl_inode_upper(inode);
> >> + struct inode *realinode = upperinode ?: ovl_inode_lower(inode);
> >> + struct file *realfile;
> >> + const struct cred *old_cred;
> >> +
> >> + old_cred = ovl_override_creds(inode->i_sb);
> >> + realfile = path_open(&file->f_path, file->f_flags | O_NOATIME,
> >> +  realinode, current_cred(), false);
> >> + revert_creds(old_cred);
> >> +
> >> + pr_debug("open(%p[%pD2/%c], 0%o) -> (%p, 0%o)\n",
> >> +  file, file, upperinode ? 'u' : 'l', file->f_flags,
> >> +  realfile, IS_ERR(realfile) ? 0 : realfile->f_flags);
> >> +
> >> + return realfile;
> >> +}
> >> +
> >> +static int ovl_open(struct inode *inode, struct file *file)
> >> +{
> >> + struct dentry *dentry = file_dentry(file);
> >
> > Hi Miklos,
> >
> > There is one thing I can't wrap my head around, so I better ask.
> >
> > file_dentry() will call ovl_d_real() and try to find dentry based on
> > inode installed in f->f_inode. If ovl_d_real() can't find inode dentry
> > matching the passed in inode, it warns.
> >
> > Assume, I have a stacked overlay configuration. Let me call top level
> > overlay layer ovl1 and lower level overlay layer ovl2. Say I open a
> > file foo.txt. Now ovl_open() in ovl1 decides that realinode is a lower
> > inode and installs that inode f->f_inode of realfile. (This should be
> > ovl2 layer inode, let me call it ovl2_inode). Now ovl_open() of ovl2 layer
> > will be called and it will call file_dentry() and will look for dentry
> > corresponding to ovl2_inode. I am wondering what if a copy up of foo.txt
> > was triggered in ovl1 and by the time we called ovl_d_real(dentry,
> > ovl2_inode), it will start comparing with inode of ovl1_upper and never
> > find ovl2_inode.
> 
> Okay, so we've modified ovl_d_real() to allow returning the overlay
> dentry itself.  This is important: when we fail to match ovl1_upper
> with ovl2_inode, we'll go on to get ovl2_dentry and call d_real()
> recursively.  That recursive call should match the inode, return it to
> outer ovl_d_real(), which again will match the inode and return
> without warning.

So the current code does the following.

ovl_d_real() {
...
...

real = ovl_dentry_real(dentry);
if (inode == d_inode(real))
return real;

/* Handle recursion */
if (unlikely(real->d_flags & DCACHE_OP_REAL))
return real->d_op->d_real(real, inode);
}

If the file got copied up in ovl1, then "real" will be the ovl1_upper dentry.
Since upper is a regular fs (only the ovl1 lower is an overlay), it should not
have DCACHE_OP_REAL set. That means we will not recurse further, will not
find the ovl2 dentry matching ovl2_inode, and will print a warning and return
the ovl1 dentry.

What am I missing?
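
To make the scenario concrete, here is a toy userspace model of the snippet as quoted above. All types and names are made up for illustration, and it models only the lines shown, so it says nothing about what the modified ovl_d_real() in the patch series does beyond them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kernel objects; only the fields the lookup needs. */
struct toy_inode { int id; };

struct toy_dentry {
	struct toy_inode *inode;
	struct toy_dentry *real;	/* upper if copied up, else lower */
	bool has_d_real;		/* models DCACHE_OP_REAL */
};

/*
 * Mirrors the quoted snippet: match the real dentry's inode, recurse only
 * if the real dentry itself has a d_real op, otherwise give up.
 * Returns NULL where the snippet would fall through to the warning.
 */
static struct toy_dentry *toy_d_real(struct toy_dentry *dentry,
				     struct toy_inode *inode)
{
	struct toy_dentry *real = dentry->real;

	assert(real != NULL);
	if (inode == real->inode)
		return real;
	if (real->has_d_real)
		return toy_d_real(real, inode);
	return NULL;
}
```

In this model, pointing the ovl1 dentry's real at a plain upper dentry (has_d_real false) makes a lookup for the ovl2 inode return NULL, which is the case being asked about.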

Vivek

> 
> > IOW, I am not able to figure out how we protect against copy up races
> > when ovl_open() calls file_dentry().
> 
> Racing with a copy up cannot matter, since we'll continue looking for
> the inode in the layers and stacks below, regardless of whether we
> checked the upper dentry or not.
> 
> Does that make it clearer?
> 
> Thanks,
> Miklos

Re: [PATCH v3 net-next 2/2] selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE

2018-04-26 Thread Soheil Hassas Yeganeh

On Thu, Apr 26, 2018 at 10:50 AM, Eric Dumazet  wrote:
> After prior kernel change, mmap() on TCP socket only reserves VMA.
>
> We have to use getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...)
> to perform the transfer of pages from skbs in TCP receive queue into such 
> VMA.
>
> struct tcp_zerocopy_receive {
> __u64 address;  /* in: address of mapping */
> __u32 length;   /* in/out: number of bytes to map/mapped */
> __u32 recv_skip_hint;   /* out: amount of bytes to skip */
> };
>
> After a successful getsockopt(...TCP_ZEROCOPY_RECEIVE...), @length contains
> number of bytes that were mapped, and @recv_skip_hint contains number of bytes
> that should be read using conventional read()/recv()/recvmsg() system calls,
> to skip a sequence of bytes that can not be mapped, because not properly page
> aligned.
>
> Signed-off-by: Eric Dumazet 
> Cc: Andy Lutomirski 
> Cc: Soheil Hassas Yeganeh 

Acked-by: Soheil Hassas Yeganeh 

Thank you, again!
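
For reference, a minimal sketch of one receive step as described in the commit message. It uses a local mirror of the struct quoted above and the option value 35 from patch 1/2; the helper name zc_receive_once and the struct name zc_args are made up here, and the mapping at addr is assumed to come from an earlier mmap() on the socket.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef TCP_ZEROCOPY_RECEIVE
#define TCP_ZEROCOPY_RECEIVE 35	/* value taken from patch 1/2 */
#endif

/* Local mirror of struct tcp_zerocopy_receive as quoted above. */
struct zc_args {
	uint64_t address;		/* in: address of mapping */
	uint32_t length;		/* in/out: bytes to map/mapped */
	uint32_t recv_skip_hint;	/* out: bytes to read the normal way */
};

/*
 * One zerocopy step: ask the kernel to map up to 'chunk' bytes at 'addr'.
 * Returns 0 on success, -1 on error with errno set by getsockopt().
 */
static int zc_receive_once(int fd, void *addr, uint32_t chunk,
			   struct zc_args *zc)
{
	socklen_t len = sizeof(*zc);

	zc->address = (uint64_t)(uintptr_t)addr;
	zc->length = chunk;
	zc->recv_skip_hint = 0;
	if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, zc, &len) == -1)
		return -1;
	assert(zc->length <= chunk);
	return 0;
}
```

After a successful call, the caller would consume zc->length bytes from the mapping and then read() zc->recv_skip_hint bytes into an ordinary buffer, as the selftest in this patch does.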

> ---
>  tools/testing/selftests/net/tcp_mmap.c | 64 +++---
>  1 file changed, 37 insertions(+), 27 deletions(-)
>
> diff --git a/tools/testing/selftests/net/tcp_mmap.c 
> b/tools/testing/selftests/net/tcp_mmap.c
> index 
> dea342fe6f4e88b5709d2ac37b2fc9a2a320bf44..77f762780199ff1f69f9f6b3f18e72deddb69f5e
>  100644
> --- a/tools/testing/selftests/net/tcp_mmap.c
> +++ b/tools/testing/selftests/net/tcp_mmap.c
> @@ -76,9 +76,10 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
> +#include 
> +#include 
>
>  #ifndef MSG_ZEROCOPY
>  #define MSG_ZEROCOPY0x400
> @@ -134,11 +135,12 @@ void hash_zone(void *zone, unsigned int length)
>  void *child_thread(void *arg)
>  {
> unsigned long total_mmap = 0, total = 0;
> +   struct tcp_zerocopy_receive zc;
> unsigned long delta_usec;
> int flags = MAP_SHARED;
> struct timeval t0, t1;
> char *buffer = NULL;
> -   void *oaddr = NULL;
> +   void *addr = NULL;
> double throughput;
> struct rusage ru;
> int lu, fd;
> @@ -153,41 +155,46 @@ void *child_thread(void *arg)
> perror("malloc");
> goto error;
> }
> +   if (zflg) {
> +   addr = mmap(NULL, chunk_size, PROT_READ, flags, fd, 0);
> +   if (addr == (void *)-1)
> +   zflg = 0;
> +   }
> while (1) {
> struct pollfd pfd = { .fd = fd, .events = POLLIN, };
> int sub;
>
> poll(&pfd, 1, 1);
> if (zflg) {
> -   void *naddr;
> +   socklen_t zc_len = sizeof(zc);
> +   int res;
>
> -   naddr = mmap(oaddr, chunk_size, PROT_READ, flags, fd, 
> 0);
> -   if (naddr == (void *)-1) {
> -   if (errno == EAGAIN) {
> -   /* That is if SO_RCVLOWAT is buggy */
> -   usleep(1000);
> -   continue;
> -   }
> -   if (errno == EINVAL) {
> -   flags = MAP_SHARED;
> -   oaddr = NULL;
> -   goto fallback;
> -   }
> -   if (errno != EIO)
> -   perror("mmap()");
> +   zc.address = (__u64)addr;
> +   zc.length = chunk_size;
> +   zc.recv_skip_hint = 0;
> +   res = getsockopt(fd, IPPROTO_TCP, 
> TCP_ZEROCOPY_RECEIVE,
> +&zc, &zc_len);
> +   if (res == -1)
> break;
> +
> +   if (zc.length) {
> +   assert(zc.length <= chunk_size);
> +   total_mmap += zc.length;
> +   if (xflg)
> +   hash_zone(addr, zc.length);
> +   total += zc.length;
> }
> -   total_mmap += chunk_size;
> -   if (xflg)
> -   hash_zone(naddr, chunk_size);
> -   total += chunk_size;
> -   if (!keepflag) {
> -   flags |= MAP_FIXED;
> -   oaddr = naddr;
> +   if (zc.recv_skip_hint) {
> +   assert(zc.recv_skip_hint <= chunk_size);
> +   lu = read(fd, buffer, zc.recv_skip_hint);
> +   if (lu > 0) {
> +   if (xflg)
> +   hash_zone(buffer,

Re: [PATCH v3 net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive

2018-04-26 Thread Soheil Hassas Yeganeh

On Thu, Apr 26, 2018 at 10:50 AM, Eric Dumazet  wrote:
> When adding tcp mmap() implementation, I forgot that socket lock
> had to be taken before current->mm->mmap_sem. syzbot eventually caught
> the bug.
>
> Since we can not lock the socket in tcp mmap() handler we have to
> split the operation in two phases.
>
> 1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
>   This operation does not involve any TCP locking.
>
> 2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
>  the transfer of pages from skbs to one VMA.
>   This operation only uses down_read(&current->mm->mmap_sem) after
>   holding TCP lock, thus solving the lockdep issue.
>
> This new implementation was suggested by Andy Lutomirski with great details.
>
> Benefits are :
>
> - Better scalability, in case multiple threads reuse VMAs
>(without mmap()/munmap() calls) since mmap_sem won't be write locked.
>
> - Better error recovery.
>The previous mmap() model had to provide the expected size of the
>mapping. If for some reason one part could not be mapped (partial MSS),
>the whole operation had to be aborted.
>With the tcp_zerocopy_receive struct, kernel can report how
>many bytes were successfully mapped, and how many bytes should
>be read to skip the problematic sequence.
>
> - No more memory allocation to hold an array of page pointers.
>   16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/
>
> - skbs are freed while mmap_sem has been released
>
> Following patch makes the change in tcp_mmap tool to demonstrate
> one possible use of mmap() and getsockopt(... TCP_ZEROCOPY_RECEIVE ...)
>
> Note that memcg might require additional changes.
>
> Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive")
> Signed-off-by: Eric Dumazet 
> Reported-by: syzbot 
> Suggested-by: Andy Lutomirski 
> Cc: linux...@kvack.org
> Cc: Soheil Hassas Yeganeh 

Acked-by: Soheil Hassas Yeganeh 

getsockopt() is indeed a better choice. Thanks!
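
As a quick check of the "16 MB mappings needed 32 KB" figure in the commit message, here is the arithmetic as a small sketch. It assumes 4 KB pages and 8-byte pointers, which the mail does not state; the function name is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for an array holding one pointer per page of a mapping. */
static size_t page_ptr_array_bytes(size_t mapping_bytes, size_t page_size,
				   size_t ptr_size)
{
	assert(page_size > 0 && mapping_bytes % page_size == 0);
	return (mapping_bytes / page_size) * ptr_size;
}
```

Under those assumptions, 16 MB / 4 KB gives 4096 pages, and 4096 pointers of 8 bytes take 32768 bytes.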

> ---
>  include/uapi/linux/tcp.h |   8 ++
>  net/ipv4/tcp.c   | 192 ---
>  2 files changed, 109 insertions(+), 91 deletions(-)
>
> diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> index 
> 379b08700a542d49bbce9b4b49b17879d00b69bb..e9e8373b34b9ddc735329341b91f455bf5c0b17c
>  100644
> --- a/include/uapi/linux/tcp.h
> +++ b/include/uapi/linux/tcp.h
> @@ -122,6 +122,7 @@ enum {
>  #define TCP_MD5SIG_EXT 32  /* TCP MD5 Signature with extensions 
> */
>  #define TCP_FASTOPEN_KEY   33  /* Set the key for Fast Open (cookie) 
> */
>  #define TCP_FASTOPEN_NO_COOKIE 34  /* Enable TFO without a TFO cookie */
> +#define TCP_ZEROCOPY_RECEIVE   35
>
>  struct tcp_repair_opt {
> __u32   opt_code;
> @@ -276,4 +277,11 @@ struct tcp_diag_md5sig {
> __u8tcpm_key[TCP_MD5SIG_MAXKEYLEN];
>  };
>
> +/* setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) */
> +
> +struct tcp_zerocopy_receive {
> +   __u64 address;  /* in: address of mapping */
> +   __u32 length;   /* in/out: number of bytes to map/mapped */
> +   __u32 recv_skip_hint;   /* out: amount of bytes to skip */
> +};
>  #endif /* _UAPI_LINUX_TCP_H */
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 
> dfd090ea54ad47112fc23c61180b5bf8edd2c736..c10c4a41ad39d6f8ae472882b243c2b70c915546
>  100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1726,118 +1726,111 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
>  }
>  EXPORT_SYMBOL(tcp_set_rcvlowat);
>
> -/* When user wants to mmap X pages, we first need to perform the mapping
> - * before freeing any skbs in receive queue, otherwise user would be unable
> - * to fallback to standard recvmsg(). This happens if some data in the
> - * requested block is not exactly fitting in a page.
> - *
> - * We only support order-0 pages for the moment.
> - * mmap() on TCP is very strict, there is no point
> - * trying to accommodate with pathological layouts.
> - */
> +static const struct vm_operations_struct tcp_vm_ops = {
> +};
> +
>  int tcp_mmap(struct file *file, struct socket *sock,
>  struct vm_area_struct *vma)
>  {
> -   unsigned long size = vma->vm_end - vma->vm_start;
> -   unsigned int nr_pages = size >> PAGE_SHIFT;
> -   struct page **pages_array = NULL;
> -   u32 seq, len, offset, nr = 0;
> -   struct sock *sk = sock->sk;
> -   const skb_frag_t *frags;
> +   if (vma->vm_flags & (VM_WRITE | VM_EXEC))
> +   return -EPERM;
> +   vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
> +
> +   /* Instruct vm_insert_page() to not down_read(mmap_sem) */
> +   vma->vm_flags |= VM_MIXEDMAP;
> +
> +   vma->vm_ops = &tcp_vm_ops;
> +   return 0;
> +}
> +EXPORT_SYMBOL(tcp_mmap);
> +
> +static int tcp_zerocopy_receive(struct sock *sk,
> +   struct tcp_zerocopy_receive *zc)
> +{
> +   unsigned long address = (unsigned long)zc->a

NVMe Poll CQ on timeout

2018-04-26 Thread Bharat Kumar Gogada

Hi,

We are testing NVMe cards on an ARM64 platform; the card uses legacy interrupts.
Intermittently we hit the following case in drivers/nvme/host/pci.c:
   /*
 * Did we miss an interrupt?
 */
if (__nvme_poll(nvmeq, req->tag)) {
dev_warn(dev->ctrl.device,
 "I/O %d QID %d timeout, completion polled\n",
 req->tag, nvmeq->qid);
return BLK_EH_HANDLED;
}

Can anyone tell me when nvme_timeout gets invoked?
What does "Did we miss an interrupt?" mean? Does it mean the host missed
servicing an interrupt raised by the EP card?

Regards,
Bharat

[PATCH V2 1/3] usb: xhci: tegra: Prepare for adding runtime PM support

2018-04-26 Thread Jon Hunter

When adding runtime PM support to the Tegra XHCI driver, it is desirable
to move the function calls to enable the clocks, regulators and PHY from
the tegra_xusb_probe into the runtime PM handlers. Currently, the
clocks, regulators and PHY are all enabled before we call
usb_create_hcd() in tegra_xusb_probe(), however, we cannot call
pm_runtime_get_sync() at this point because the platform device data is
not yet initialised. Fortunately, the function usb_create_hcd() can be
called before we enable the clocks, regulators and PHY, so prepare
for adding runtime PM support by moving the call to usb_create_hcd()
before we enable the hardware.

Signed-off-by: Jon Hunter 
---

Changes since V1:
- None

 drivers/usb/host/xhci-tegra.c | 34 +-
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/drivers/usb/host/xhci-tegra.c b/drivers/usb/host/xhci-tegra.c
index 2c076ea80522..02b0b24faa58 100644
--- a/drivers/usb/host/xhci-tegra.c
+++ b/drivers/usb/host/xhci-tegra.c
@@ -1054,10 +1054,23 @@ static int tegra_xusb_probe(struct platform_device 
*pdev)
}
}
 
+   tegra->hcd = usb_create_hcd(&tegra_xhci_hc_driver, &pdev->dev,
+   dev_name(&pdev->dev));
+   if (!tegra->hcd) {
+   err = -ENOMEM;
+   goto put_padctl;
+   }
+
+   /*
+* This must happen after usb_create_hcd(), because usb_create_hcd()
+* will overwrite the drvdata of the device with the hcd it creates.
+*/
+   platform_set_drvdata(pdev, tegra);
+
err = tegra_xusb_clk_enable(tegra);
if (err) {
dev_err(&pdev->dev, "failed to enable clocks: %d\n", err);
-   goto put_padctl;
+   goto put_usb2;
}
 
err = regulator_bulk_enable(tegra->soc->num_supplies, tegra->supplies);
@@ -1080,19 +1093,6 @@ static int tegra_xusb_probe(struct platform_device *pdev)
goto disable_phy;
}
 
-   tegra->hcd = usb_create_hcd(&tegra_xhci_hc_driver, &pdev->dev,
-   dev_name(&pdev->dev));
-   if (!tegra->hcd) {
-   err = -ENOMEM;
-   goto disable_phy;
-   }
-
-   /*
-* This must happen after usb_create_hcd(), because usb_create_hcd()
-* will overwrite the drvdata of the device with the hcd it creates.
-*/
-   platform_set_drvdata(pdev, tegra);
-
tegra->hcd->regs = tegra->regs;
tegra->hcd->rsrc_start = regs->start;
tegra->hcd->rsrc_len = resource_size(regs);
@@ -1100,7 +1100,7 @@ static int tegra_xusb_probe(struct platform_device *pdev)
err = usb_add_hcd(tegra->hcd, tegra->xhci_irq, IRQF_SHARED);
if (err < 0) {
dev_err(&pdev->dev, "failed to add USB HCD: %d\n", err);
-   goto put_usb2;
+   goto disable_phy;
}
 
device_wakeup_enable(tegra->hcd->self.controller);
@@ -1155,14 +1155,14 @@ static int tegra_xusb_probe(struct platform_device 
*pdev)
usb_put_hcd(xhci->shared_hcd);
 remove_usb2:
usb_remove_hcd(tegra->hcd);
-put_usb2:
-   usb_put_hcd(tegra->hcd);
 disable_phy:
tegra_xusb_phy_disable(tegra);
 disable_regulator:
regulator_bulk_disable(tegra->soc->num_supplies, tegra->supplies);
 disable_clk:
tegra_xusb_clk_disable(tegra);
+put_usb2:
+   usb_put_hcd(tegra->hcd);
 put_padctl:
tegra_xusb_padctl_put(tegra->padctl);
return err;
-- 
2.7.4
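
The label reshuffling in the patch above follows the usual rule that error labels unwind in the reverse order of setup. A toy userspace sketch of that rule, with made-up step names (this is not the driver code):

```c
#include <assert.h>
#include <string.h>

static char trace[64];	/* records the order of setup and teardown steps */

/* Append 'name' to the trace; return -1 if this step is asked to fail. */
static int step(const char *name, int fail)
{
	assert(strlen(trace) + strlen(name) < sizeof(trace));
	strcat(trace, name);
	return fail ? -1 : 0;
}

/*
 * Toy probe: 'h' = create hcd, 'c' = clocks, 'r' = regulators, 'p' = PHY.
 * Upper-case letters are the matching teardown steps.
 * 'fail_at' names the setup step that should fail (0 for none).
 */
static int toy_probe(char fail_at)
{
	int err;

	trace[0] = '\0';
	err = step("h", fail_at == 'h');
	if (err)
		goto out;
	err = step("c", fail_at == 'c');
	if (err)
		goto put_hcd;
	err = step("r", fail_at == 'r');
	if (err)
		goto disable_clk;
	err = step("p", fail_at == 'p');
	if (err)
		goto disable_regulator;
	return 0;

disable_regulator:
	step("R", 0);
disable_clk:
	step("C", 0);
put_hcd:
	step("H", 0);
out:
	return err;
}
```

Because the hcd is now created first, its put label has to sit below the clock, regulator and PHY labels, which is what moving put_usb2 achieves in the diff.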

[PATCH V2 2/3] usb: xhci: tegra: Add runtime PM support

2018-04-26 Thread Jon Hunter

Add runtime PM support to the Tegra XHCI driver and move the function
calls to enable/disable the clocks, regulators and PHY into the runtime
PM callbacks.

Signed-off-by: Jon Hunter 
---

Changes since V1:
- Re-worked change to handle case where runtime PM is disabled.

 drivers/usb/host/xhci-tegra.c | 89 ++-
 1 file changed, 63 insertions(+), 26 deletions(-)

diff --git a/drivers/usb/host/xhci-tegra.c b/drivers/usb/host/xhci-tegra.c
index 02b0b24faa58..85f2381883ad 100644
--- a/drivers/usb/host/xhci-tegra.c
+++ b/drivers/usb/host/xhci-tegra.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -761,6 +762,50 @@ static void tegra_xusb_phy_disable(struct tegra_xusb 
*tegra)
}
 }
 
+static int tegra_xusb_runtime_suspend(struct device *dev)
+{
+   struct tegra_xusb *tegra = dev_get_drvdata(dev);
+
+   tegra_xusb_phy_disable(tegra);
+   regulator_bulk_disable(tegra->soc->num_supplies, tegra->supplies);
+   tegra_xusb_clk_disable(tegra);
+
+   return 0;
+}
+
+static int tegra_xusb_runtime_resume(struct device *dev)
+{
+   struct tegra_xusb *tegra = dev_get_drvdata(dev);
+   int err;
+
+   err = tegra_xusb_clk_enable(tegra);
+   if (err) {
+   dev_err(dev, "failed to enable clocks: %d\n", err);
+   return err;
+   }
+
+   err = regulator_bulk_enable(tegra->soc->num_supplies, tegra->supplies);
+   if (err) {
+   dev_err(dev, "failed to enable regulators: %d\n", err);
+   goto disable_clk;
+   }
+
+   err = tegra_xusb_phy_enable(tegra);
+   if (err < 0) {
+   dev_err(dev, "failed to enable PHYs: %d\n", err);
+   goto disable_regulator;
+   }
+
+   return 0;
+
+disable_regulator:
+   regulator_bulk_disable(tegra->soc->num_supplies, tegra->supplies);
+disable_clk:
+   tegra_xusb_clk_disable(tegra);
+   return err;
+}
+
+
 static int tegra_xusb_load_firmware(struct tegra_xusb *tegra)
 {
unsigned int code_tag_blocks, code_size_blocks, code_blocks;
@@ -1067,22 +1112,15 @@ static int tegra_xusb_probe(struct platform_device 
*pdev)
 */
platform_set_drvdata(pdev, tegra);
 
-   err = tegra_xusb_clk_enable(tegra);
-   if (err) {
-   dev_err(&pdev->dev, "failed to enable clocks: %d\n", err);
-   goto put_usb2;
-   }
+   pm_runtime_enable(&pdev->dev);
+   if (pm_runtime_enabled(&pdev->dev))
+   err = pm_runtime_get_sync(&pdev->dev);
+   else
+   err = tegra_xusb_runtime_resume(&pdev->dev);
 
-   err = regulator_bulk_enable(tegra->soc->num_supplies, tegra->supplies);
-   if (err) {
-   dev_err(&pdev->dev, "failed to enable regulators: %d\n", err);
-   goto disable_clk;
-   }
-
-   err = tegra_xusb_phy_enable(tegra);
if (err < 0) {
-   dev_err(&pdev->dev, "failed to enable PHYs: %d\n", err);
-   goto disable_regulator;
+   dev_err(&pdev->dev, "failed to enable device: %d\n", err);
+   goto disable_rpm;
}
 
tegra_xusb_ipfs_config(tegra, regs);
@@ -1090,7 +1128,7 @@ static int tegra_xusb_probe(struct platform_device *pdev)
err = tegra_xusb_load_firmware(tegra);
if (err < 0) {
dev_err(&pdev->dev, "failed to load firmware: %d\n", err);
-   goto disable_phy;
+   goto put_rpm;
}
 
tegra->hcd->regs = tegra->regs;
@@ -1100,7 +1138,7 @@ static int tegra_xusb_probe(struct platform_device *pdev)
err = usb_add_hcd(tegra->hcd, tegra->xhci_irq, IRQF_SHARED);
if (err < 0) {
dev_err(&pdev->dev, "failed to add USB HCD: %d\n", err);
-   goto disable_phy;
+   goto put_rpm;
}
 
device_wakeup_enable(tegra->hcd->self.controller);
@@ -1155,13 +1193,11 @@ static int tegra_xusb_probe(struct platform_device 
*pdev)
usb_put_hcd(xhci->shared_hcd);
 remove_usb2:
usb_remove_hcd(tegra->hcd);
-disable_phy:
-   tegra_xusb_phy_disable(tegra);
-disable_regulator:
-   regulator_bulk_disable(tegra->soc->num_supplies, tegra->supplies);
-disable_clk:
-   tegra_xusb_clk_disable(tegra);
-put_usb2:
+put_rpm:
+   if (!pm_runtime_status_suspended(&pdev->dev))
+   tegra_xusb_runtime_suspend(&pdev->dev);
+disable_rpm:
+   pm_runtime_disable(&pdev->dev);
usb_put_hcd(tegra->hcd);
 put_padctl:
tegra_xusb_padctl_put(tegra->padctl);
@@ -1181,9 +1217,8 @@ static int tegra_xusb_remove(struct platform_device *pdev)
dma_free_coherent(&pdev->dev, tegra->fw.size, tegra->fw.virt,
  tegra->fw.phys);
 
-   tegra_xusb_phy_disable(tegra);
-   regulator_bulk_disable(tegra->soc->num_supplies, tegra->supplies);
-   tegra_xusb_clk_disable(tegra);
+   pm_runtime_put_sync(&pdev->dev);
+   pm_
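
The "runtime PM is disabled" case mentioned in the changelog can be sketched with a toy model. None of these names are kernel APIs, and the model simply assumes that a get_sync-style call fails with -EACCES when runtime PM is not enabled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy device; none of these names are kernel APIs. */
struct toy_dev {
	bool rpm_enabled;	/* models pm_runtime_enabled() */
	int usage;		/* models the runtime PM usage count */
	bool powered;		/* clocks, regulators and PHY are on */
};

/* Models the driver's runtime-resume callback. */
static int toy_runtime_resume(struct toy_dev *d)
{
	d->powered = true;
	return 0;
}

/* Models a get_sync call; assumed to fail when runtime PM is disabled. */
static int toy_get_sync(struct toy_dev *d)
{
	if (!d->rpm_enabled)
		return -EACCES;
	d->usage++;
	return d->powered ? 0 : toy_runtime_resume(d);
}

/* Power up through runtime PM if available, directly otherwise. */
static int toy_power_on(struct toy_dev *d)
{
	assert(d != NULL);
	if (d->rpm_enabled)
		return toy_get_sync(d);
	return toy_runtime_resume(d);
}
```

Either way the device ends up powered; only the enabled path takes a usage-count reference that has to be dropped again on remove.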

[PATCH V2 3/3] usb: xhci: tegra: Add support for managing powergates

2018-04-26 Thread Jon Hunter

The Tegra XHCI controller requires that the XUSBA (for superspeed) and
XUSBC (for host) power-domains are enabled. Commit 8df127456f29
("soc/tegra: pmc: Enable XUSB partitions on boot") was added to force
on these power-domains if the XHCI driver is enabled while proper
power-domain support is added, to ensure the device did not hang on
boot. However, rather than forcing on these power-domains in the PMC
driver we can use the legacy Tegra powergate APIs to turn on these
power-domains during the probe of the Tegra XHCI driver.

In the near future we plan to move the Tegra XHCI driver to use the
generic PM domain framework for power-domains, so to prepare for
this only use the legacy Tegra powergate API if there is no PM
domain associated with the device (i.e. dev.pm_domain is NULL). Please
note that in the future the superspeed and host resets will be handled
by the generic PM domain provider, so these are only
needed in the case where there is no generic PM domain.

Signed-off-by: Jon Hunter 
---

Changes since V1:
- None

 drivers/usb/host/xhci-tegra.c | 68 +++
 1 file changed, 49 insertions(+), 19 deletions(-)

diff --git a/drivers/usb/host/xhci-tegra.c b/drivers/usb/host/xhci-tegra.c
index 85f2381883ad..862f85f4c8bb 100644
--- a/drivers/usb/host/xhci-tegra.c
+++ b/drivers/usb/host/xhci-tegra.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "xhci.h"
 
@@ -975,20 +976,6 @@ static int tegra_xusb_probe(struct platform_device *pdev)
if (IS_ERR(tegra->padctl))
return PTR_ERR(tegra->padctl);
 
-   tegra->host_rst = devm_reset_control_get(&pdev->dev, "xusb_host");
-   if (IS_ERR(tegra->host_rst)) {
-   err = PTR_ERR(tegra->host_rst);
-   dev_err(&pdev->dev, "failed to get xusb_host reset: %d\n", err);
-   goto put_padctl;
-   }
-
-   tegra->ss_rst = devm_reset_control_get(&pdev->dev, "xusb_ss");
-   if (IS_ERR(tegra->ss_rst)) {
-   err = PTR_ERR(tegra->ss_rst);
-   dev_err(&pdev->dev, "failed to get xusb_ss reset: %d\n", err);
-   goto put_padctl;
-   }
-
tegra->host_clk = devm_clk_get(&pdev->dev, "xusb_host");
if (IS_ERR(tegra->host_clk)) {
err = PTR_ERR(tegra->host_clk);
@@ -1052,11 +1039,48 @@ static int tegra_xusb_probe(struct platform_device 
*pdev)
goto put_padctl;
}
 
+   if (!pdev->dev.pm_domain) {
+   tegra->host_rst = devm_reset_control_get(&pdev->dev,
+"xusb_host");
+   if (IS_ERR(tegra->host_rst)) {
+   err = PTR_ERR(tegra->host_rst);
+   dev_err(&pdev->dev,
+   "failed to get xusb_host reset: %d\n", err);
+   goto put_padctl;
+   }
+
+   tegra->ss_rst = devm_reset_control_get(&pdev->dev, "xusb_ss");
+   if (IS_ERR(tegra->ss_rst)) {
+   err = PTR_ERR(tegra->ss_rst);
+   dev_err(&pdev->dev, "failed to get xusb_ss reset: %d\n",
+   err);
+   goto put_padctl;
+   }
+
+   err = tegra_powergate_sequence_power_up(TEGRA_POWERGATE_XUSBA,
+   tegra->ss_clk,
+   tegra->ss_rst);
+   if (err) {
+   dev_err(&pdev->dev,
+   "failed to enable XUSBA domain: %d\n", err);
+   goto put_padctl;
+   }
+
+   err = tegra_powergate_sequence_power_up(TEGRA_POWERGATE_XUSBC,
+   tegra->host_clk,
+   tegra->host_rst);
+   if (err) {
+   dev_err(&pdev->dev,
+   "failed to enable XUSBC domain: %d\n", err);
+   goto disable_xusba;
+   }
+   }
+
tegra->supplies = devm_kcalloc(&pdev->dev, tegra->soc->num_supplies,
   sizeof(*tegra->supplies), GFP_KERNEL);
if (!tegra->supplies) {
err = -ENOMEM;
-   goto put_padctl;
+   goto disable_xusbc;
}
 
for (i = 0; i < tegra->soc->num_supplies; i++)
@@ -1066,7 +1090,7 @@ static int tegra_xusb_probe(struct platform_device *pdev)
  tegra->supplies);
if (err) {
dev_err(&pdev->dev, "failed to get regulators: %d\n", err);
-   goto put_padctl;
+   goto disable_xusbc;
}
 
for (i = 0; i < tegra->soc->num_types; i++)
@@ -1076,7 +1100,7 @@ static int tegra_xusb_probe(struct platform_device *pdev)

Re: [PATCH 1/4] exit: Move read_unlock() up in mm_update_next_owner()

2018-04-26 Thread Oleg Nesterov

On 04/26, Kirill Tkhai wrote:
>
> @@ -464,18 +464,15 @@ void mm_update_next_owner(struct mm_struct *mm)
>   return;
>  
>  assign_new_owner:
> - BUG_ON(c == p);
>   get_task_struct(c);
> + read_unlock(&tasklist_lock);
> + BUG_ON(c == p);
> +
>   /*
>* The task_lock protects c->mm from changing.
>* We always want mm->owner->mm == mm
>*/
>   task_lock(c);
> - /*
> -  * Delay read_unlock() till we have the task_lock()
> -  * to ensure that c does not slip away underneath us
> -  */
> - read_unlock(&tasklist_lock);

I think this is correct, but...

Firstly, I agree with Michal, it would be nice to kill mm_update_next_owner()
altogether.

If this is not possible I agree, it needs cleanups and we can change it to
avoid tasklist (although your 4/4 looks overcomplicated to me at first glance).

But in this case I think that whatever we do we should start with something like
the patch below. I wrote it 3 years ago but it still applies.

Oleg.

Subject: [PATCH 1/3] memcg: introduce assign_new_owner()

The code under "assign_new_owner" looks very ugly and suboptimal.

We do not really need get_task_struct/put_task_struct(), we can
simply recheck/change mm->owner under tasklist_lock. And we do not
want to restart from the very beginning if ->mm was changed by the
time we take task_lock(), we can simply continue (if we do not drop
tasklist_lock).

Just move this code into the new simple helper, assign_new_owner().

Signed-off-by: Oleg Nesterov 
---
 kernel/exit.c |   56 ++--
 1 files changed, 26 insertions(+), 30 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 22fcc05..4d446ab 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -293,6 +293,23 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct 
task_struct *parent)
 }
 
 #ifdef CONFIG_MEMCG
+static bool assign_new_owner(struct mm_struct *mm, struct task_struct *c)
+{
+   bool ret = false;
+
+   if (c->mm != mm)
+   return ret;
+
+   task_lock(c); /* protects c->mm from changing */
+   if (c->mm == mm) {
+   mm->owner = c;
+   ret = true;
+   }
+   task_unlock(c);
+
+   return ret;
+}
+
 /*
  * A task is exiting.   If it owned this mm, find a new owner for the mm.
  */
@@ -300,7 +317,6 @@ void mm_update_next_owner(struct mm_struct *mm)
 {
struct task_struct *c, *g, *p = current;
 
-retry:
/*
 * If the exiting or execing task is not the owner, it's
 * someone else's problem.
@@ -322,16 +338,16 @@ retry:
 * Search in the children
 */
list_for_each_entry(c, &p->children, sibling) {
-   if (c->mm == mm)
-   goto assign_new_owner;
+   if (assign_new_owner(mm, c))
+   goto done;
}
 
/*
 * Search in the siblings
 */
list_for_each_entry(c, &p->real_parent->children, sibling) {
-   if (c->mm == mm)
-   goto assign_new_owner;
+   if (assign_new_owner(mm, c))
+   goto done;
}
 
/*
@@ -341,42 +357,22 @@ retry:
if (g->flags & PF_KTHREAD)
continue;
for_each_thread(g, c) {
-   if (c->mm == mm)
-   goto assign_new_owner;
+   if (assign_new_owner(mm, c))
+   goto done;
if (c->mm)
break;
}
}
-   read_unlock(&tasklist_lock);
+
/*
 * We found no owner yet mm_users > 1: this implies that we are
 * most likely racing with swapoff (try_to_unuse()) or /proc or
 * ptrace or page migration (get_task_mm()).  Mark owner as NULL.
 */
mm->owner = NULL;
-   return;
-
-assign_new_owner:
-   BUG_ON(c == p);
-   get_task_struct(c);
-   /*
-* The task_lock protects c->mm from changing.
-* We always want mm->owner->mm == mm
-*/
-   task_lock(c);
-   /*
-* Delay read_unlock() till we have the task_lock()
-* to ensure that c does not slip away underneath us
-*/
+done:
read_unlock(&tasklist_lock);
-   if (c->mm != mm) {
-   task_unlock(c);
-   put_task_struct(c);
-   goto retry;
-   }
-   mm->owner = c;
-   task_unlock(c);
-   put_task_struct(c);
+   return;
 }
 #endif /* CONFIG_MEMCG */
 
-- 
1.5.5.1
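
The check, lock, recheck shape of assign_new_owner() in the patch above can be tried out in userspace with a toy model. The types and the spinlock are stand-ins invented for this sketch, and it only illustrates the control flow in a single thread; it does not model tasklist_lock or the kernel memory model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_mm;

/* Toy task: a spinlock standing in for task_lock(), plus its mm pointer. */
struct toy_task {
	atomic_flag lock;
	struct toy_mm *mm;
};

struct toy_mm {
	struct toy_task *owner;
};

static void toy_task_lock(struct toy_task *t)
{
	while (atomic_flag_test_and_set(&t->lock))
		;	/* spin until the flag was previously clear */
}

static void toy_task_unlock(struct toy_task *t)
{
	atomic_flag_clear(&t->lock);
}

/* Cheap unlocked check first, then recheck under the lock before assigning. */
static bool toy_assign_new_owner(struct toy_mm *mm, struct toy_task *c)
{
	bool ret = false;

	assert(mm != NULL && c != NULL);
	if (c->mm != mm)
		return ret;

	toy_task_lock(c);
	if (c->mm == mm) {
		mm->owner = c;
		ret = true;
	}
	toy_task_unlock(c);
	return ret;
}
```

A caller walking a list of candidates can stop at the first task for which this returns true and simply move on to the next one otherwise, which is why the retry label is no longer needed.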

Re: [RFC PATCH 09/35] ovl: stack file ops

2018-04-26 Thread Miklos Szeredi

On Thu, Apr 26, 2018 at 4:56 PM, Vivek Goyal  wrote:
> On Thu, Apr 26, 2018 at 04:43:53PM +0200, Miklos Szeredi wrote:
>> On Thu, Apr 26, 2018 at 4:13 PM, Vivek Goyal  wrote:
>> > On Thu, Apr 12, 2018 at 05:08:00PM +0200, Miklos Szeredi wrote:
>> >
>> > [..]
>> >> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
>> >> new file mode 100644
>> >> index ..a0b606885c41
>> >> --- /dev/null
>> >> +++ b/fs/overlayfs/file.c
>> >> @@ -0,0 +1,76 @@
>> >> +/*
>> >> + * Copyright (C) 2017 Red Hat, Inc.
>> >> + *
>> >> + * This program is free software; you can redistribute it and/or modify it
>> >> + * under the terms of the GNU General Public License version 2 as published by
>> >> + * the Free Software Foundation.
>> >> + */
>> >> +
>> >> +#include 
>> >> +#include 
>> >> +#include 
>> >> +#include "overlayfs.h"
>> >> +
>> >> +static struct file *ovl_open_realfile(const struct file *file)
>> >> +{
>> >> + struct inode *inode = file_inode(file);
>> >> + struct inode *upperinode = ovl_inode_upper(inode);
>> >> + struct inode *realinode = upperinode ?: ovl_inode_lower(inode);
>> >> + struct file *realfile;
>> >> + const struct cred *old_cred;
>> >> +
>> >> + old_cred = ovl_override_creds(inode->i_sb);
>> >> + realfile = path_open(&file->f_path, file->f_flags | O_NOATIME,
>> >> +  realinode, current_cred(), false);
>> >> + revert_creds(old_cred);
>> >> +
>> >> + pr_debug("open(%p[%pD2/%c], 0%o) -> (%p, 0%o)\n",
>> >> +  file, file, upperinode ? 'u' : 'l', file->f_flags,
>> >> +  realfile, IS_ERR(realfile) ? 0 : realfile->f_flags);
>> >> +
>> >> + return realfile;
>> >> +}
>> >> +
>> >> +static int ovl_open(struct inode *inode, struct file *file)
>> >> +{
>> >> + struct dentry *dentry = file_dentry(file);
>> >
>> > Hi Miklos,
>> >
>> > There is one thing I can't wrap my head around, so I better ask.
>> >
>> > file_dentry() will call ovl_d_real() and try to find dentry based on
>> > inode installed in f->f_inode. If ovl_d_real() can't find inode dentry
>> > matching the passed in inode, it warns.
>> >
>> > Assume, I have a stacked overlay configuration. Let me call top level
>> > overlay layer ovl1 and lower level overlay layer ovl2. Say I open a
>> > file foo.txt. Now ovl_open() in ovl1 decides that realinode is a lower
>> > inode and installs that inode f->f_inode of realfile. (This should be
>> > ovl2 layer inode, let me call it ovl2_inode). Now ovl_open() of ovl2 layer
>> > will be called and it will call file_dentry() and will look for dentry
>> > corresponding to ovl2_inode. I am wondering what if a copy up of foo.txt
>> > was triggered in ovl1 and by the time we called ovl_d_real(dentry,
>> > ovl2_inode), it will start comparing with inode of ovl1_upper and never
>> > find ovl2_inode.
>>
>> Okay, so we've modified ovl_d_real() to allow returning the overlay
>> dentry itself.  This is important: when we fail to match ovl1_upper
>> with ovl2_inode, we'll go on to get ovl2_dentry and call d_real()
>> recursively.  That recursive call should match the inode, return it to
>> outer ovl_d_real(), which again will match the inode and return
>> without warning.
>
> So current code does following.
>
> ovl_d_real() {
> ...
> ...
>
> real = ovl_dentry_real(dentry);
> if (inode == d_inode(real))
> return real;
>
> /* Handle recursion */
> if (unlikely(real->d_flags & DCACHE_OP_REAL))
> return real->d_op->d_real(real, inode);
> }
>
> If file got copied up in ovl1, then "real" will be ovl1_upper dentry. And
> upper is regular fs (only ovl1 lower is overlay), then it should not have
> DCACHE_OP_REAL set and that means we will not recurse further and not
> find ovl2 dentry matching ovl2_inode and print warning and return
> ovl1 dentry.
>
> What am I missing.

Ah,  that's indeed buggy.  The bug is in "[RFC PATCH 34/35] vfs:
simplify d_op->d_real()".

I've already reverted that (due to d_real_inode() acquiring a new
user) and the old code should be good (AFAICS).
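
To make the scenario above concrete, here is a minimal userspace sketch of the two lookup strategies being compared. It is only a model under stated assumptions: `struct model_dentry`, `model_d_real_single()` and `model_d_real_both()` are hypothetical names invented for illustration, inodes are plain integers, and none of this is the actual overlayfs code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a dentry; not the kernel structure. */
struct model_dentry {
	int inode;                         /* inode number backing this dentry */
	const struct model_dentry *upper;  /* set on an overlay dentry after copy-up */
	const struct model_dentry *lower;  /* set on an overlay dentry */
};

static int model_is_overlay(const struct model_dentry *d)
{
	return d->upper != NULL || d->lower != NULL;
}

/*
 * Variant quoted in the thread: compare against one "real" dentry (upper if
 * present, else lower) and recurse only if that dentry is itself an overlay.
 */
static const struct model_dentry *
model_d_real_single(const struct model_dentry *d, int inode)
{
	const struct model_dentry *real;

	assert(d != NULL);
	if (!model_is_overlay(d))
		return d;
	real = d->upper ? d->upper : d->lower;
	if (real->inode == inode)
		return real;
	if (model_is_overlay(real))
		return model_d_real_single(real, inode);
	return d;	/* no match: the kernel would warn here */
}

/*
 * Variant that may return the overlay dentry itself, and that falls through
 * to the lower layer when the upper inode does not match.
 */
static const struct model_dentry *
model_d_real_both(const struct model_dentry *d, int inode)
{
	const struct model_dentry *real;

	assert(d != NULL);
	if (d->inode == inode || !model_is_overlay(d))
		return d;
	if (d->upper && d->upper->inode == inode)
		return d->upper;
	if (d->lower) {
		real = model_d_real_both(d->lower, inode);
		if (real->inode == inode)
			return real;
	}
	return d;	/* no match: the kernel would warn here */
}
```

With ovl1 stacked on ovl2 and foo.txt already copied up in ovl1, a lookup for ovl2_inode through the first variant stops at the ovl1 upper dentry and falls back to the ovl1 dentry, while the second variant descends into the lower layer and returns the ovl2 dentry.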

Thanks,
Miklos

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

2018-04-26 Thread Mathieu Desnoyers

- On Apr 25, 2018, at 6:51 PM, rostedt rost...@goodmis.org wrote:

> On Wed, 25 Apr 2018 17:40:56 -0400 (EDT)
> Mathieu Desnoyers  wrote:
> 
>> One problem with your approach is that you can have multiple callers
>> for the same tracepoint name, where some could be non-preemptible and
>> others blocking. Also, there is then no clear way for the callback
>> registration API to enforce whether the callback expects the tracepoint
>> to be blocking or non-preemptible. This can introduce hard to diagnose
>> issues in a kernel without debug options enabled.
> 
> I agree that it should not be tied to an implementation name. But
> "blocking" is confusing. I would say "can_sleep" or some such name that
> states that the trace point caller is indeed something that can sleep.

"trace_*event*_{can,might,may}_sleep" are all acceptable candidates for
me.

> 
>> 
>> Regarding the name, I'm OK with having something along the lines of
>> trace_*event*_blocking or such. Please don't use "srcu" or other naming
>> that is explicitly tied to the underlying mechanism used internally
>> however: what we want to convey is that this specific tracepoint probe
>> can be preempted and block. The underlying implementation could move to
>> a different RCU flavor brand in the future, and it should not impact
>> users of the tracepoint APIs.
>> 
>> In order to ensure that probes that may block only register themselves
>> to tracepoints that allow blocking, we should introduce new tracepoint
>> declaration/definition *and* registration APIs also contain the
>> "BLOCKING/blocking" keywords (or such), so we can ensure that a
>> tracepoint probe being registered to a "blocking" tracepoint is indeed
>> allowed to block.
> 
> I really don't want to add more declarations/definitions, as we
> already have too many as is, with different meanings, and the number
> of incarnations grows as n!.
> 
> I'd say we just stick with a trace_*event*_can_sleep() call, and make
> sure that if that is used that no trace_*event*() call is also used, and
> enforce this with linker or compiler tricks.

My main concern is not about having trace_*event*_can_sleep() mixed
with trace_*event*() calls. It's more about having a registration API that
lets modules registering probes that may need to sleep declare it explicitly,
and that enforces that a tracepoint never connects a probe that may need to
sleep to an instrumentation site which cannot sleep.

I'm unsure what's the best way to achieve this goal though. We could possibly
extend the tracepoint_probe_register_* APIs to introduce e.g.
tracepoint_probe_register_prio_flags() and provide a TRACEPOINT_PROBE_CAN_SLEEP
flag as a parameter upon registration. If this flag is provided, then we could
figure out a way to iterate over all callers, and ensure they are all
"can_sleep" callers.
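
As a rough illustration of the registration-time check, here is a userspace sketch. Everything in it is an assumption made for the example: the `model_*` types, the `MODEL_PROBE_CAN_SLEEP` flag value and `model_probe_register_flags()` are hypothetical and do not correspond to any existing tracepoint API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_PROBE_CAN_SLEEP 0x1u	/* hypothetical registration flag */

/* One instrumentation site that fires the tracepoint. */
struct model_call_site {
	int can_sleep;	/* non-zero if this caller is allowed to sleep */
};

/* Hypothetical tracepoint: its known call sites and a probe counter. */
struct model_tracepoint {
	const struct model_call_site *sites;
	size_t nr_sites;
	size_t nr_probes;
};

/*
 * Register a probe. A probe that declares it may sleep is refused unless
 * every call site of the tracepoint is a "can_sleep" caller.
 */
static int model_probe_register_flags(struct model_tracepoint *tp,
				      unsigned int flags)
{
	size_t i;

	assert(tp != NULL);
	if (flags & MODEL_PROBE_CAN_SLEEP) {
		for (i = 0; i < tp->nr_sites; i++)
			if (!tp->sites[i].can_sleep)
				return -EINVAL;
	}
	tp->nr_probes++;
	return 0;
}
```

The sketch assumes the set of callers is known at registration time, which is exactly the part that still needs a mechanism.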

Thoughts ?

Thanks,

Mathieu



> 
> -- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-26 Thread Mikulas Patocka



On Thu, 26 Apr 2018, James Bottomley wrote:

> On Thu, 2018-04-26 at 10:28 -0400, Mikulas Patocka wrote:
> > 
> > On Thu, 26 Apr 2018, Michal Hocko wrote:
> > 
> > > On Wed 25-04-18 18:42:57, Mikulas Patocka wrote:
> > > > 
> > > > 
> > > > On Wed, 25 Apr 2018, James Bottomley wrote:
> > > [...]
> > > > > Kconfig proliferation, conversely, is a bit of a nightmare from
> > both
> > > > > the user and the tester's point of view, so we're trying to
> > avoid it
> > > > > unless absolutely necessary.
> > > > > 
> > > > > James
> > > > 
> > > > I already offered that we don't need to introduce a new kernel
> > option and 
> > > > we can bind this feature to any other kernel option, that is
> > enabled in 
> > > > the debug kernel, for example CONFIG_DEBUG_SG. Michal said no and
> > he said 
> > > > that he wants a new kernel option instead.
> > > 
> > > Just for the record. I didn't say I _want_ a config option. Do not
> > > misinterpret my words. I've said that a config option would be
> > > acceptable if there is no way to deliver the functionality via
> > kernel
> > > package automatically. You haven't provided any argument that would
> > > explain why the kernel package cannot add a boot option. Maybe
> > there are
> > > some but I do not see them right now.
> > 
> > AFAIK Grub doesn't load per-kernel options from a per-kernel file.
> > Even if we hacked grub scripts to add this option, other
> > distributions won't.
> 
> Perhaps find out beforehand instead of insisting on an approach without
> knowing.  On openSUSE the grub config is built from the files in
> /etc/grub.d/ so any package can add a kernel option (and various
> conditions around activating it) simply by adding a new file.

And then, different versions of the debug kernel will clash when 
attempting to create the same file.

And what about other distributions? What about people who build the RHEL kernel 
from source with "make"?

The problem with this approach is that you are trying to bother more and more 
people with this little silly feature.

> The config files are quite sophisticated, so you can add what looks to 
> be a new kernel, but is really an existing kernel with different options 
> this way.
> 
> James

Mikulas

[PATCH 1/2] drm/ttm: Add TTM_PAGE_FLAG_TRANSHUGE

2018-04-26 Thread Michel Dänzer

From: Michel Dänzer 

When it's set, TTM tries to allocate huge pages if possible. Drivers
which can take advantage of huge pages should set it.

Drivers not setting this flag no longer incur any overhead related to
allocating or freeing huge pages.
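
For illustration only, the flag test the hunks below rely on can be modelled in userspace as follows. The `MODEL_*` bit values and `model_use_huge_pool()` are assumptions made for this sketch, not the values from ttm_tt.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit positions; the real values are defined in ttm_tt.h. */
#define MODEL_FLAG_DMA32	(1u << 0)
#define MODEL_FLAG_TRANSHUGE	(1u << 1)

/*
 * Huge pages are attempted only when TRANSHUGE is set and DMA32 is clear:
 * mask out both bits and require that exactly TRANSHUGE remains.
 */
static bool model_use_huge_pool(unsigned int flags)
{
	return (flags & (MODEL_FLAG_DMA32 | MODEL_FLAG_TRANSHUGE)) ==
	       MODEL_FLAG_TRANSHUGE;
}

/* Walk through all four combinations of the two flags. */
static void model_self_check(void)
{
	assert(!model_use_huge_pool(0));
	assert(!model_use_huge_pool(MODEL_FLAG_DMA32));
	assert(model_use_huge_pool(MODEL_FLAG_TRANSHUGE));
	assert(!model_use_huge_pool(MODEL_FLAG_DMA32 | MODEL_FLAG_TRANSHUGE));
}
```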

Cc: sta...@vger.kernel.org
Signed-off-by: Michel Dänzer 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c  |  2 +-
 drivers/gpu/drm/ttm/ttm_page_alloc.c | 14 ++
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |  8 +---
 include/drm/ttm/ttm_tt.h |  1 +
 4 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index dfd22db13fb1..e03e9e361e2a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -988,7 +988,7 @@ static struct ttm_tt *amdgpu_ttm_tt_create(struct 
ttm_buffer_object *bo,
return NULL;
}
gtt->ttm.ttm.func = &amdgpu_backend_func;
-   if (ttm_sg_tt_init(&gtt->ttm, bo, page_flags)) {
+   if (ttm_sg_tt_init(&gtt->ttm, bo, page_flags | TTM_PAGE_FLAG_TRANSHUGE)) {
kfree(gtt);
return NULL;
}
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c 
b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index f0481b7b60c5..2ce91272b111 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -760,7 +760,7 @@ static void ttm_put_pages(struct page **pages, unsigned 
npages, int flags,
 {
struct ttm_page_pool *pool = ttm_get_pool(flags, false, cstate);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-   struct ttm_page_pool *huge = ttm_get_pool(flags, true, cstate);
+   struct ttm_page_pool *huge = NULL;
 #endif
unsigned long irq_flags;
unsigned i;
@@ -780,7 +780,8 @@ static void ttm_put_pages(struct page **pages, unsigned 
npages, int flags,
}
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-   if (!(flags & TTM_PAGE_FLAG_DMA32)) {
+   if ((flags & (TTM_PAGE_FLAG_DMA32 | TTM_PAGE_FLAG_TRANSHUGE)) ==
+   TTM_PAGE_FLAG_TRANSHUGE) {
for (j = 0; j < HPAGE_PMD_NR; ++j)
if (p++ != pages[i + j])
break;
@@ -805,6 +806,8 @@ static void ttm_put_pages(struct page **pages, unsigned 
npages, int flags,
 
i = 0;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+   if (flags & TTM_PAGE_FLAG_TRANSHUGE)
+   huge = ttm_get_pool(flags, true, cstate);
if (huge) {
unsigned max_size, n2free;
 
@@ -877,7 +880,7 @@ static int ttm_get_pages(struct page **pages, unsigned 
npages, int flags,
 {
struct ttm_page_pool *pool = ttm_get_pool(flags, false, cstate);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-   struct ttm_page_pool *huge = ttm_get_pool(flags, true, cstate);
+   struct ttm_page_pool *huge = NULL;
 #endif
struct list_head plist;
struct page *p = NULL;
@@ -906,7 +909,8 @@ static int ttm_get_pages(struct page **pages, unsigned 
npages, int flags,
 
i = 0;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-   if (!(gfp_flags & GFP_DMA32)) {
+   if ((flags & (TTM_PAGE_FLAG_DMA32 | TTM_PAGE_FLAG_TRANSHUGE)) ==
+   TTM_PAGE_FLAG_TRANSHUGE) {
while (npages >= HPAGE_PMD_NR) {
gfp_t huge_flags = gfp_flags;
 
@@ -946,6 +950,8 @@ static int ttm_get_pages(struct page **pages, unsigned 
npages, int flags,
count = 0;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+   if (flags & TTM_PAGE_FLAG_TRANSHUGE)
+   huge = ttm_get_pool(flags, true, cstate);
if (huge && npages >= HPAGE_PMD_NR) {
INIT_LIST_HEAD(&plist);
ttm_page_pool_get_pages(huge, &plist, flags, cstate,
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c 
b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index 8a25d1974385..291b04213ec5 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -949,7 +949,8 @@ int ttm_dma_populate(struct ttm_dma_tt *ttm_dma, struct 
device *dev,
type = ttm_to_type(ttm->page_flags, ttm->caching_state);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-   if (ttm->page_flags & TTM_PAGE_FLAG_DMA32)
+   if ((ttm->page_flags & (TTM_PAGE_FLAG_DMA32 | TTM_PAGE_FLAG_TRANSHUGE))
+   != TTM_PAGE_FLAG_TRANSHUGE)
goto skip_huge;
 
pool = ttm_dma_find_pool(dev, type | IS_HUGE);
@@ -1035,7 +1036,7 @@ void ttm_dma_unpopulate(struct ttm_dma_tt *ttm_dma, 
struct device *dev)
 {
struct ttm_tt *ttm = &ttm_dma->ttm;
struct ttm_mem_global *mem_glob = ttm->bdev->glob->mem_glob;
-   struct dma_pool *pool;
+   struct dma_pool *pool = NULL;
struct dma_page *d_page, *next;
enum pool_type type;
bool is_cached = false;
@@ -1045,7 +10

[PATCH 2/2] drm/ttm: Use GFP_TRANSHUGE_LIGHT for allocating huge pages

2018-04-26 Thread Michel Dänzer

From: Michel Dänzer 

GFP_TRANSHUGE tries very hard to allocate huge pages, which can result
in long delays with high memory pressure. I have observed firefox
freezing for up to around a minute due to this while restic was taking
a full system backup.

Since we don't really need huge pages, use GFP_TRANSHUGE_LIGHT |
__GFP_NORETRY instead, in order to fail quickly when there are no huge
pages available.

Set __GFP_KSWAPD_RECLAIM as well, in order for huge pages to be freed
up in the background if necessary.

With these changes, I'm no longer seeing freezes during a restic backup.
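
A small userspace model of the flag arithmetic, for illustration. It assumes GFP_TRANSHUGE is GFP_TRANSHUGE_LIGHT plus __GFP_DIRECT_RECLAIM, as in include/linux/gfp.h at the time of writing; the `M_*` bit values and both helper functions are made up for this sketch.

```c
#include <assert.h>

/* Illustrative bit values only; the real ones live in include/linux/gfp.h. */
#define M_DIRECT_RECLAIM	(1u << 0)
#define M_KSWAPD_RECLAIM	(1u << 1)
#define M_NORETRY		(1u << 2)
#define M_MOVABLE		(1u << 3)
#define M_COMP			(1u << 4)
#define M_OTHER			(1u << 5)	/* stands in for the remaining bits */

#define M_TRANSHUGE_LIGHT	(M_MOVABLE | M_COMP | M_OTHER)
#define M_TRANSHUGE		(M_TRANSHUGE_LIGHT | M_DIRECT_RECLAIM)

/* Flags used for huge allocations before this patch. */
static unsigned int model_huge_flags_old(unsigned int base)
{
	return (base | M_TRANSHUGE) & ~(M_MOVABLE | M_COMP);
}

/* Flags used for huge allocations with this patch applied. */
static unsigned int model_huge_flags_new(unsigned int base)
{
	return (base | M_TRANSHUGE_LIGHT | M_NORETRY | M_KSWAPD_RECLAIM) &
	       ~(M_MOVABLE | M_COMP);
}

/* The patch itself no longer adds direct reclaim to the huge flags. */
static void model_self_check(void)
{
	assert(model_huge_flags_old(0) & M_DIRECT_RECLAIM);
	assert(!(model_huge_flags_new(0) & M_DIRECT_RECLAIM));
}
```

If the caller's base flags already allow direct reclaim, that bit stays set in the model; in that case the quick failure comes from __GFP_NORETRY.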

Cc: sta...@vger.kernel.org
Signed-off-by: Michel Dänzer 
---
 drivers/gpu/drm/ttm/ttm_page_alloc.c | 11 ---
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |  3 ++-
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c 
b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index 2ce91272b111..6794f15461d9 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -914,7 +914,8 @@ static int ttm_get_pages(struct page **pages, unsigned 
npages, int flags,
while (npages >= HPAGE_PMD_NR) {
gfp_t huge_flags = gfp_flags;
 
-   huge_flags |= GFP_TRANSHUGE;
+   huge_flags |= GFP_TRANSHUGE_LIGHT | __GFP_NORETRY |
+   __GFP_KSWAPD_RECLAIM;
huge_flags &= ~__GFP_MOVABLE;
huge_flags &= ~__GFP_COMP;
p = alloc_pages(huge_flags, HPAGE_PMD_ORDER);
@@ -1033,11 +1034,15 @@ int ttm_page_alloc_init(struct ttm_mem_global *glob, 
unsigned max_pages)
  GFP_USER | GFP_DMA32, "uc dma", 0);
 
ttm_page_pool_init_locked(&_manager->wc_pool_huge,
- GFP_TRANSHUGE & ~(__GFP_MOVABLE | __GFP_COMP),
+ (GFP_TRANSHUGE_LIGHT | __GFP_NORETRY |
+  __GFP_KSWAPD_RECLAIM) &
+ ~(__GFP_MOVABLE | __GFP_COMP),
  "wc huge", order);
 
ttm_page_pool_init_locked(&_manager->uc_pool_huge,
- GFP_TRANSHUGE & ~(__GFP_MOVABLE | __GFP_COMP)
+ (GFP_TRANSHUGE_LIGHT | __GFP_NORETRY |
+  __GFP_KSWAPD_RECLAIM) &
+ ~(__GFP_MOVABLE | __GFP_COMP)
  , "uc huge", order);
 
_manager->options.max_size = max_pages;
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c 
b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index 291b04213ec5..5a551158c289 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -910,7 +910,8 @@ static gfp_t ttm_dma_pool_gfp_flags(struct ttm_dma_tt 
*ttm_dma, bool huge)
gfp_flags |= __GFP_ZERO;
 
if (huge) {
-   gfp_flags |= GFP_TRANSHUGE;
+   gfp_flags |= GFP_TRANSHUGE_LIGHT | __GFP_NORETRY |
+   __GFP_KSWAPD_RECLAIM;
gfp_flags &= ~__GFP_MOVABLE;
gfp_flags &= ~__GFP_COMP;
}
-- 
2.17.0

Re: [PATCH v4 6/8] drm/panel: Add Ilitek ILI9881c panel driver

2018-04-26 Thread Thierry Reding

On Wed, Apr 04, 2018 at 11:57:14AM +0200, Maxime Ripard wrote:
> The LHR050H41 panel is the panel shipped with the BananaPi M2-Magic, and is
> based on the Ilitek ILI9881c Controller. Add a driver for it, modelled
> after the other Ilitek controller drivers.
> 
> Signed-off-by: Maxime Ripard 
> ---
>  drivers/gpu/drm/panel/Kconfig |   9 +-
>  drivers/gpu/drm/panel/Makefile|   1 +-
>  drivers/gpu/drm/panel/panel-ilitek-ili9881c.c | 489 +++-
>  3 files changed, 499 insertions(+)
>  create mode 100644 drivers/gpu/drm/panel/panel-ilitek-ili9881c.c
> 
> diff --git a/drivers/gpu/drm/panel/Kconfig b/drivers/gpu/drm/panel/Kconfig
> index 25682ff3449a..6020c30a33b3 100644
> --- a/drivers/gpu/drm/panel/Kconfig
> +++ b/drivers/gpu/drm/panel/Kconfig
> @@ -46,6 +46,15 @@ config DRM_PANEL_ILITEK_IL9322
> Say Y here if you want to enable support for Ilitek IL9322
> QVGA (320x240) RGB, YUV and ITU-T BT.656 panels.
>  
> +config DRM_PANEL_ILITEK_ILI9881C
> + tristate "Ilitek ILI9881C-based panels"
> + depends on OF
> + depends on DRM_MIPI_DSI
> + depends on BACKLIGHT_CLASS_DEVICE
> + help
> +   Say Y if you want to enable support for panels based on the
> +   Ilitek ILI9881c controller.
> +
>  config DRM_PANEL_INNOLUX_P079ZCA
>   tristate "Innolux P079ZCA panel"
>   depends on OF
> diff --git a/drivers/gpu/drm/panel/Makefile b/drivers/gpu/drm/panel/Makefile
> index f26efc11d746..5ccaaa9d13af 100644
> --- a/drivers/gpu/drm/panel/Makefile
> +++ b/drivers/gpu/drm/panel/Makefile
> @@ -3,6 +3,7 @@ obj-$(CONFIG_DRM_PANEL_ARM_VERSATILE) += panel-arm-versatile.o
>  obj-$(CONFIG_DRM_PANEL_LVDS) += panel-lvds.o
>  obj-$(CONFIG_DRM_PANEL_SIMPLE) += panel-simple.o
>  obj-$(CONFIG_DRM_PANEL_ILITEK_IL9322) += panel-ilitek-ili9322.o
> +obj-$(CONFIG_DRM_PANEL_ILITEK_ILI9881C) += panel-ilitek-ili9881c.o
>  obj-$(CONFIG_DRM_PANEL_INNOLUX_P079ZCA) += panel-innolux-p079zca.o
>  obj-$(CONFIG_DRM_PANEL_JDI_LT070ME05000) += panel-jdi-lt070me05000.o
>  obj-$(CONFIG_DRM_PANEL_LG_LG4573) += panel-lg-lg4573.o
> diff --git a/drivers/gpu/drm/panel/panel-ilitek-ili9881c.c 
> b/drivers/gpu/drm/panel/panel-ilitek-ili9881c.c
> new file mode 100644
> index ..8992a6431c30
> --- /dev/null
> +++ b/drivers/gpu/drm/panel/panel-ilitek-ili9881c.c
> @@ -0,0 +1,489 @@
> +// SPDX-License-Identifier: GPL-2.0+

This isn't a valid SPDX license specifier. The module license is GPL v2,
so the corresponding specifier would be: GPL-2.0-only.

> +/*
> + * Copyright (C) 2017, Free Electrons

-2018?

> + * Author: Maxime Ripard 

No need for this, it's already in MODULE_AUTHOR.

> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +struct ili9881c {
> + struct drm_panel panel;
> + struct mipi_dsi_device *dsi;
> +
> + struct backlight_device *backlight;
> + struct gpio_desc *power;
> + struct gpio_desc *reset;
> +};
> +
> +enum ili9881c_op {
> + ILI9881C_SWITCH_PAGE,
> + ILI9881C_COMMAND,
> +};
> +
> +struct ili9881c_instr {
> + enum ili9881c_op op;
> +
> + union arg {
> + struct cmd {
> + u8  cmd;
> + u8  data;
> + } cmd;
> + u8  page;
> + } arg;
> +};
> +
> +#define ILI9881C_SWITCH_PAGE_INSTR(_page)\
> + {   \
> + .op = ILI9881C_SWITCH_PAGE, \
> + .arg = {\
> + .page = (_page),\
> + },  \
> + }
> +
> +#define ILI9881C_COMMAND_INSTR(_cmd, _data)  \
> + {   \
> + .op = ILI9881C_COMMAND, \
> + .arg = {\
> + .cmd = {\
> + .cmd = (_cmd),  \
> + .data = (_data),\
> + },  \
> + },  \
> + }
> +
> +static struct ili9881c_instr ili9881c_init[] = {

These are never modified, so they could be const, right?

> + ILI9881C_SWITCH_PAGE_INSTR(3),
> + ILI9881C_COMMAND_INSTR(0x01, 0x00),
> + ILI9881C_COMMAND_INSTR(0x02, 0x00),
> + ILI9881C_COMMAND_INSTR(0x03, 0x73),
> + ILI9881C_COMMAND_INSTR(0x04, 0x03),
> + ILI9881C_COMMAND_INSTR(0x05, 0x00),
> + ILI9881C_COMMAND_INSTR(0x06, 0x06),
> + ILI9881C_COMMAND_INSTR(0x07, 0x06),
> + ILI9881C_COMMAND_INSTR(0x08, 0x00),
> + ILI9881C_COMMAND_INSTR(0x09, 0x18),
> + ILI9881C_COMMAND_INSTR(0x0a, 0x04),
> + ILI9881C_COMMAND_INSTR(0x0b, 0x00),
> + ILI9881C_COMMAND_INSTR(0x0c, 0x02),
> + ILI9881C_COMMAN

Re: issues with suspend on Dell XPS 13 2-in-1

2018-04-26 Thread Pandruvada, Srinivas

On Thu, 2018-04-26 at 07:42 -0500, Dennis Gilmore wrote:
> Hi Srinivas,
> 
> On Thu, 26-04-2018 at 05:34 +, Pandruvada, Srinivas wrote:
> > Hi Dennis,
> > 
> > On Wed, 2018-04-25 at 22:06 -0500, Dennis Gilmore wrote:
> > > Hi Srinivas,
> > > 
> > > Yes I have latest bios, I have version 1.3.1 that was released on
> > > 18th
> > > of  Feb.
> > 
> > Can you try these commands and repeat the test?
> > 
> > # cd /sys/kernel/debug/pmc_core/
> > # for i in {0..32}; do echo $i > ltr_ignore; done
> 
> # for i in {0..32}; do echo $i > ltr_ignore; done
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> -bash: ltr_ignore: Operación no permitida
> 
> should I go ahead and run the test even though writing to ltr_ignore
> failed?
Strange. Do you have any issue with the permissions?
# cd /sys/kernel/debug/pmc_core/
# ls -l

Do you have rw permissions?

Thanks,
Srinivas

> 
> Dennis
> 
> > Thanks,
> > Srinivas
> > 
> > > 
> > > Dennis
> > > 
> > > On Thu, 26-04-2018 at 02:13 +, Pandruvada, Srinivas
> > > wrote:
> > > > 
> > > > I see around 43% PC10 residency with power drop of 0.7W.
> > > > Do you have the latest BIOS of Dell 9365?
> > > > 
> > > > 
> > > > Thanks,
> > > > Srinivas
> > > > 
> > > > On Fri, 2018-04-20 at 08:36 -0500, Dennis Gilmore wrote:
> > > > > Here is the full output 
> > > > > 
> > > > > # turbostat
> > > > > turbostat version 17.06.23 - Len Brown 
> > > > > CPUID(0): GenuineIntel 22 CPUID levels; family:model:stepping
> > > > > 0x6:8e:9 (6:142:9)
> > > > > CPUID(1): SSE3 MONITOR SMX EIST TM2 TSC MSR ACPI-TM TM
> > > > > CPUID(6): APERF, TURBO, DTS, PTM, HWP, HWPnotify, HWPwindow,
> > > > > HWPepp, No-HWPpkg, EPB
> > > > > cpu0: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST No-MWAIT
> > > > > PREFETCH
> > > > > TURBO)
> > > > > CPUID(7): SGX
> > > > > cpu0: MSR_IA32_FEATURE_CONTROL: 0xff07 (Locked )
> > > > > CPUID(0x15): eax_crystal: 2 ebx_tsc: 134 ecx_crystal_hz: 0
> > > > > TSC: 1608 MHz (2400 Hz * 134 / 2 / 100)
> > > > > CPUID(0x16): base_mhz: 1600 max_mhz: 3600 bus_mhz: 100
> > > > > cpu0: MSR_MISC_PWR_MGMT: 0x00401cc0 (ENable-EIST_Coordination
> > > > > DISable-EPB DISable-OOB)
> > > > > RAPL: 58254 sec. Joule Counter Range, at 4 Watts
> > > > > cpu0: MSR_PLATFORM_INFO: 0x804043df1011000
> > > > > 4 * 100.0 = 400.0 MHz max efficiency frequency
> > > > > 16 * 100.0 = 1600.0 MHz base frequency
> > > > > cpu0: MSR_IA32_POWER_CTL: 0x0024005d (C1E auto-promotion:
> > > > > DISabled)
> > > > > cpu0: MSR_TURBO_RATIO_LIMIT: 0x2224
> > > > > 34 * 100.0 = 3400.0 MHz max turbo 4 active cores
> > > > > 34 * 100.0 = 3400.0 MHz max turbo 3 active cores
> > > > > 34 * 100.0 = 3400.0 MHz max turbo 2 active cores
> > > > > 36 * 100.0 = 3600.0 MHz max turbo 1 active cores
> > > > > cpu0: MSR_CONFIG_TDP_NOMINAL: 0x000d (base_ratio=13)
> > > > > cpu0: MSR_CONFIG_TDP_LEVEL_1: 0x0006001c (PKG_MIN_PWR_LVL1=0
> > > > > PKG_MAX_PWR_LVL1=0 LVL1_RATIO=6 PKG_TDP_LVL1=28)
> > > > > cpu0: MSR_CONFIG_TDP_LEVEL_2: 0x00100038 (PKG_MIN_PWR_LVL2=0
> > > > > PKG_MAX_PWR_LVL2=0 LVL2_RATIO=16 PKG_TDP_LVL2=56)
> > > > > cpu0: MSR_CONFIG_TDP_CONTROL: 0x ( lock=0)
> > > > > cpu0: MSR_TURBO_ACTIVATION_RATIO: 0x000c
> > > > > (MAX_NON_TURBO_RATIO=12 lock=0)
> > > > > cpu0: MSR_PKG_CST_CONFIG_CONTROL: 0x1e008008 (UNdemote-C3,
> > > > > UNdemote-C1, demote-C3, demote-C1, locked: pkg-cstate-
> > > > > limit=8:
> > > > > unlimited)
> > > > > cpu0: POLL: CPUIDLE CORE POLL IDLE
> > > > > cpu0: C1: MWAIT 0x00
> > > > > cpu0: C1E: MWAIT 0x01
> > > > > cpu0: C3: MWAIT 0x10
> > > > > cpu0: C6: MWAIT 0x20
> > > > > cpu0: C7s: MWAIT 0x33
> > > > > cpu0: C8: MWAIT 0x

Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

2018-04-26 Thread Oleg Nesterov

On 04/26, Kirill Tkhai wrote:
>
> We can rework this simply by adding a list of tasks to mm.

Perhaps, but then I think this list should not depend on mm->owner.

I mean, mm->list_of_group_leaders_which_use_this_mm can be used by coredump
and oom-killer at least. But this is not that simple...

Oleg.

Re: [PATCH 2/5] ide: kill ide_toggle_bounce

2018-04-26 Thread Jens Axboe

On 4/26/18 1:20 AM, Christoph Hellwig wrote:
> On Tue, Apr 24, 2018 at 08:02:56PM -0600, Jens Axboe wrote:
>> On 4/24/18 12:16 PM, Christoph Hellwig wrote:
>>> ide_toggle_bounce did select various strange block bounce limits, including
>>> not bouncing at all as soon as an iommu is present in the system.  Given
>>> that the dma_map routines now handle any required bounce buffering except
>>> for ISA DMA, and the ide code already must handle either ISA DMA or highmem
>>> at least for iommu equipped systems we can get rid of the block layer
>>> bounce limit setting entirely.
>>
>> Pretty sure I was the one to add this code, when highmem page IO was
>> enabled about two decades ago...
>>
>> Outside of DMA, the issue was that the PIO code could not handle
>> highmem. That's not the case anymore, so this should be fine.
> 
> Yes, that is the rationale.  Any chance you could look over the
> other patches as well?  Except for the networking one for which I'd
> really like to see a review from Dave all the users of the interface
> are block related.

You can add my reviewed-by to 1-3, and 5. Looks good to me.

-- 
Jens Axboe

Re: [RFC PATCH 09/35] ovl: stack file ops

2018-04-26 Thread Vivek Goyal

On Thu, Apr 26, 2018 at 05:01:37PM +0200, Miklos Szeredi wrote:
> On Thu, Apr 26, 2018 at 4:56 PM, Vivek Goyal  wrote:
> > On Thu, Apr 26, 2018 at 04:43:53PM +0200, Miklos Szeredi wrote:
> >> On Thu, Apr 26, 2018 at 4:13 PM, Vivek Goyal  wrote:
> >> > On Thu, Apr 12, 2018 at 05:08:00PM +0200, Miklos Szeredi wrote:
> >> >
> >> > [..]
> >> >> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> >> >> new file mode 100644
> >> >> index ..a0b606885c41
> >> >> --- /dev/null
> >> >> +++ b/fs/overlayfs/file.c
> >> >> @@ -0,0 +1,76 @@
> >> >> +/*
> >> >> + * Copyright (C) 2017 Red Hat, Inc.
> >> >> + *
> >> >> + * This program is free software; you can redistribute it and/or modify it
> >> >> + * under the terms of the GNU General Public License version 2 as published by
> >> >> + * the Free Software Foundation.
> >> >> + */
> >> >> +
> >> >> +#include 
> >> >> +#include 
> >> >> +#include 
> >> >> +#include "overlayfs.h"
> >> >> +
> >> >> +static struct file *ovl_open_realfile(const struct file *file)
> >> >> +{
> >> >> + struct inode *inode = file_inode(file);
> >> >> + struct inode *upperinode = ovl_inode_upper(inode);
> >> >> + struct inode *realinode = upperinode ?: ovl_inode_lower(inode);
> >> >> + struct file *realfile;
> >> >> + const struct cred *old_cred;
> >> >> +
> >> >> + old_cred = ovl_override_creds(inode->i_sb);
> >> >> + realfile = path_open(&file->f_path, file->f_flags | O_NOATIME,
> >> >> +  realinode, current_cred(), false);
> >> >> + revert_creds(old_cred);
> >> >> +
> >> >> + pr_debug("open(%p[%pD2/%c], 0%o) -> (%p, 0%o)\n",
> >> >> +  file, file, upperinode ? 'u' : 'l', file->f_flags,
> >> >> +  realfile, IS_ERR(realfile) ? 0 : realfile->f_flags);
> >> >> +
> >> >> + return realfile;
> >> >> +}
> >> >> +
> >> >> +static int ovl_open(struct inode *inode, struct file *file)
> >> >> +{
> >> >> + struct dentry *dentry = file_dentry(file);
> >> >
> >> > Hi Miklos,
> >> >
> >> > There is one thing I can't wrap my head around, so I better ask.
> >> >
> >> > file_dentry() will call ovl_d_real() and try to find dentry based on
> >> > inode installed in f->f_inode. If ovl_d_real() can't find inode dentry
> >> > matching the passed in inode, it warns.
> >> >
> >> > Assume, I have a stacked overlay configuration. Let me call top level
> >> > overlay layer ovl1 and lower level overlay layer ovl2. Say I open a
> >> > file foo.txt. Now ovl_open() in ovl1 decides that realinode is a lower
> >> > inode and installs that inode f->f_inode of realfile. (This should be
> >> > ovl2 layer inode, let me call it ovl2_inode). Now ovl_open() of ovl2 
> >> > layer
> >> > will be called and it will call file_dentry() and will look for dentry
> >> > corresponding to ovl2_inode. I am wondering what if a copy up of foo.txt
> >> > was triggered in ovl1 and by the time we called ovl_d_real(dentry,
> >> > ovl2_inode), it will start comparing with inode of ovl1_upper and never
> >> > find ovl2_inode.
> >>
> >> Okay, so we've modified ovl_d_real() to allow returning the overlay
> >> dentry itself.  This is important: when we fail to match ovl1_upper
> >> with ovl2_inode, we'll go on to get ovl2_dentry and call d_real()
> >> recursively.  That recursive call should match the inode, return it to
> >> outer ovl_d_real(), which again will match the inode and return
> >> without warning.
> >
> > So current code does following.
> >
> > ovl_d_real() {
> > ...
> > ...
> >
> > real = ovl_dentry_real(dentry);
> > if (inode == d_inode(real))
> > return real;
> >
> > /* Handle recursion */
> > if (unlikely(real->d_flags & DCACHE_OP_REAL))
> > return real->d_op->d_real(real, inode);
> > }
> >
> > If file got copied up in ovl1, then "real" will be ovl1_upper dentry. And
> > upper is regular fs (only ovl1 lower is overlay), then it should not have
> > DCACHE_OP_REAL set and that means we will not recurse further and not
> > find ovl2 dentry matching ovl2_inode and print warning and return
> > ovl1 dentry.
> >
> > What am I missing.
> 
> Ah,  that's indeed buggy.  The bug is in "[RFC PATCH 34/35] vfs:
> simplify d_op->d_real()".
> 
> I've already reverted that (due to d_real_inode() acquiring a new
> user) and the old code should be good (AFAICS).

Aha, cool. thanks. While I am at it, let me just ask one more stupid
question.

I am wondering while opening the underlying realfile, why do we pass
in the path/dentry of ovl layer (and not underlying real layer).

realfile = path_open(&file->f_path, file->f_flags | O_NOATIME,
 realinode, current_cred(), false);

This forces us to do file_dentry() in ovl_open() later to map top level
dentry to underlying dentry.

We know the realinode and should be able to figure out the real dentry. Can't we
construct path from underlying dentry and mount point and use that
to open underlyi

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

2018-04-26 Thread Mathieu Desnoyers

- On Apr 25, 2018, at 7:13 PM, Joel Fernandes joe...@google.com wrote:

> Hi Mathieu,
> 
> On Wed, Apr 25, 2018 at 2:40 PM, Mathieu Desnoyers
>  wrote:
>> - On Apr 25, 2018, at 5:27 PM, Joel Fernandes joe...@google.com wrote:
>>
>>> On Tue, Apr 24, 2018 at 9:20 PM, Paul E. McKenney
>>>  wrote:
>>> [..]
> >
> > Sounds good, thanks.
> >
> > Also I found the reason for my boot issue. It was because the
> > init_srcu_struct in the prototype was being done in an initcall.
> > Instead if I do it in start_kernel before the tracepoint is used, it
> > fixes it (although I don't know if this is dangerous to do like this
> > but I can get it to boot at least.. Let me know if this isn't the
> > right way to do it, or if something else could go wrong)
> >
> > diff --git a/init/main.c b/init/main.c
> > index 34823072ef9e..ecc88319c6da 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -631,6 +631,7 @@ asmlinkage __visible void __init start_kernel(void)
> > WARN(!irqs_disabled(), "Interrupts were enabled early\n");
> > early_boot_irqs_disabled = false;
> >
> > +   init_srcu_struct(&tracepoint_srcu);
> > lockdep_init_early();
> >
> > local_irq_enable();
> > --
> >
> > I benchmarked it and the performance also looks quite good compared
> > to the rcu tracepoint version.
> >
> > If you, Paul and others think doing the init_srcu_struct like this
> > should be Ok, then I can try to work more on your srcu prototype and
> > roll into my series and post them in the next RFC series (or let me
> > know if you wanted to work your srcu stuff in a separate series..).
>
> That is definitely not what I was expecting, but let's see if it works
> anyway...  ;-)
>
> But first, I was instead expecting something like this:
>
> DEFINE_SRCU(tracepoint_srcu);
>
> With this approach, some of the initialization happens at compile time
> and the rest happens at the first call_srcu().
>
> This will work -only- if the first call_srcu() doesn't happen until after
> workqueue_init_early() has been invoked.  Which I believe must have been
> the case in your testing, because otherwise it looks like __call_srcu()
> would have complained bitterly.
>
> On the other hand, if you need to invoke call_srcu() before the call
> to workqueue_init_early(), then you need the patch that I am beating
> into shape.  Plus you would need to use DEFINE_SRCU() and to avoid
> invoking init_srcu_struct().

 And here is the patch.  I do not intend to send it upstream unless it
 actually proves necessary, and it appears that current SRCU does what
 you need.

 You would only need this patch if you wanted to invoke call_srcu()
 before workqueue_init_early() was called, which does not seem likely.
>>>
>>> Cool. So I was chatting with Paul and just to update everyone as well,
>>> I tried the DEFINE_SRCU instead of the late init_srcu_struct call and
>>> can make it past boot too (thanks Paul!). Also I don't see a reason we
>>> need the RCU callback to execute early and it's fine if it runs later.
>>>
>>> Also, I was thinking of introducing a separate trace_*event*_srcu API
>>> as a replacement to the _rcuidle API. Then I can make use of it for my
>>> tracepoints, and then later can use it for the other tracepoints
>>> needing _rcuidle. After that we can finally get rid of the _rcuidle
>>> API if there are no other users of it. This is just a rough plan, but
>>> let me know if there's any issue with this plan that you can think
>>> of.
>>> IMO, I believe it's simpler if the caller worries about whether it can
>>> tolerate if tracepoint probes can block or not, than making it a
>>> property of the tracepoint. That would also simplify the patch to
>>> introduce srcu and keep the tracepoint creation API simple and less
>>> confusing, but let me know if I'm missing something about this.
>>
>> One problem with your approach is that you can have multiple callers
>> for the same tracepoint name, where some could be non-preemptible and
>> others blocking. Also, there is then no clear way for the callback
> 
> Shouldn't it be the responsibility of the caller to make sure it calls
> the correct API? So if you're wanting to allow probes to block, then you'd
> call trace*blocking, if not then you don't. So the caller side can
> just always do the right thing. That's a caller side issue.

The issue there is that tracepoint.c has APIs both for instrumentation
and for registration of probe providers (callbacks). I want tracepoint.c
to provide guarantees that it won't connect incompatible probes and
callsites together.

> 
>>
>> Regarding the name, I'm OK with having something along the lines of
>> trace_*event*_blocking or such. Please don't use "srcu" or other naming
>> that is explicitly tied to the underlying mechanism used

[PATCH] Revert "x86/mm: implement free pmd/pte page interfaces"

2018-04-26 Thread Joerg Roedel

From: Joerg Roedel 

This reverts commit 28ee90fe6048fa7b7ceaeb8831c0e4e454a4cf89.

This commit is broken for x86, as it unmaps the PTE and PMD
pages and immediately frees them without doing a TLB flush.

Further this lacks synchronization with other page-tables in
the system when the PMD pages are not shared between
mm_structs.

On x86-32 with PAE and PTI patches on-top this patch
triggers the BUG_ON in vmalloc_sync_one() because the kernel
and the process page-table were not synchronized.

Signed-off-by: Joerg Roedel 
---
 arch/x86/mm/pgtable.c | 28 ++--
 1 file changed, 2 insertions(+), 26 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index ae98d4c5e32a..fd02a537a80f 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -787,22 +787,7 @@ int pmd_clear_huge(pmd_t *pmd)
  */
 int pud_free_pmd_page(pud_t *pud)
 {
-   pmd_t *pmd;
-   int i;
-
-   if (pud_none(*pud))
-   return 1;
-
-   pmd = (pmd_t *)pud_page_vaddr(*pud);
-
-   for (i = 0; i < PTRS_PER_PMD; i++)
-   if (!pmd_free_pte_page(&pmd[i]))
-   return 0;
-
-   pud_clear(pud);
-   free_page((unsigned long)pmd);
-
-   return 1;
+   return pud_none(*pud);
 }
 
 /**
@@ -814,15 +799,6 @@ int pud_free_pmd_page(pud_t *pud)
  */
 int pmd_free_pte_page(pmd_t *pmd)
 {
-   pte_t *pte;
-
-   if (pmd_none(*pmd))
-   return 1;
-
-   pte = (pte_t *)pmd_page_vaddr(*pmd);
-   pmd_clear(pmd);
-   free_page((unsigned long)pte);
-
-   return 1;
+   return pmd_none(*pmd);
 }
 #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
-- 
2.13.6

Re: [PATCH 8/8] ALSA: add new 32-bit layout for snd_pcm_mmap_status/control

2018-04-26 Thread Arnd Bergmann

On Tue, Apr 24, 2018 at 2:06 PM, Baolin Wang  wrote:

> -struct snd_pcm_mmap_status {
> +/*
> + * For mmap operations, we need the 64-bit layout, both for compat mode,
> + * and for y2038 compatibility. For 64-bit applications, the two definitions
> + * are identical, so we keep the traditional version.
> + */
> +#ifdef __SND_STRUCT_TIME64
> +#define __snd_pcm_mmap_status64snd_pcm_mmap_status
> +#define __snd_pcm_mmap_control64   snd_pcm_mmap_control
> +#define __snd_pcm_sync_ptr64   snd_pcm_sync_ptr
> +#else
> +#define __snd_pcm_mmap_status  snd_pcm_mmap_status
> +#define __snd_pcm_mmap_control snd_pcm_mmap_control
> +#define __snd_pcm_sync_ptr snd_pcm_sync_ptr
> +#endif
> +
> +struct __snd_pcm_mmap_status {
> snd_pcm_state_t state;  /* RO: state - SNDRV_PCM_STATE_ */
> int pad1;   /* Needed for 64 bit alignment */
> snd_pcm_uframes_t hw_ptr;   /* RO: hw ptr (0...boundary-1) */

One more thing here: this definition is slightly suboptimal because in
an alsa-lib that gets built with 64-bit time_t, we end up with an unusable
__snd_pcm_mmap_status structure (__snd_pcm_mmap_status64 works
fine, and that would be the normal thing to use). Just in case we want
to be able to build an alsa-lib that is capable of running both on
new kernels with 64-bit time_t interfaces exposed to applications and
also on old kernels that don't have the new ioctls, we probably want
another fixup merged in:

diff --git a/include/uapi/sound/asound.h b/include/uapi/sound/asound.h
index 18fbdcb2c7b6..638c717d3eb9 100644
--- a/include/uapi/sound/asound.h
+++ b/include/uapi/sound/asound.h
@@ -496,19 +496,30 @@ struct snd_pcm_status {
 #define __snd_pcm_mmap_status64snd_pcm_mmap_status
 #define __snd_pcm_mmap_control64   snd_pcm_mmap_control
 #define __snd_pcm_sync_ptr64   snd_pcm_sync_ptr
+#define __snd_timespec64   timespec
+struct __snd_timespec {
+   __s32 tv_sec;
+   __s32 tv_nsec;
+};
 #else
 #define __snd_pcm_mmap_status  snd_pcm_mmap_status
 #define __snd_pcm_mmap_control snd_pcm_mmap_control
 #define __snd_pcm_sync_ptr snd_pcm_sync_ptr
+#define __snd_timespec timespec
+struct __snd_timespec64 {
+   __s64 tv_sec;
+   __s64 tv_nsec;
+};
+
 #endif

 struct __snd_pcm_mmap_status {
snd_pcm_state_t state;  /* RO: state - SNDRV_PCM_STATE_ */
int pad1;   /* Needed for 64 bit alignment */
snd_pcm_uframes_t hw_ptr;   /* RO: hw ptr (0...boundary-1) */
-   struct timespec tstamp; /* Timestamp */
+   struct __snd_timespec tstamp;   /* Timestamp */
snd_pcm_state_t suspended_state; /* RO: suspended stream state */
-   struct timespec audio_tstamp;   /* from sample counter or wall clock */
+   struct __snd_timespec audio_tstamp; /* from sample counter or wall clock */
 };

 struct __snd_pcm_mmap_control {
@@ -532,11 +543,6 @@ struct __snd_pcm_sync_ptr {
} c;
 };

-struct __snd_timespec64 {
-   __s64 tv_sec;
-   __s64 tv_nsec;
-};
-
 #if defined(__BYTE_ORDER) ? __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
 typedef char __pad_before_uframe[sizeof(__u64) - sizeof(snd_pcm_uframes_t)];
 typedef char __pad_after_uframe[0];

With this change, alsa-lib can either access whichever structure
matches the glibc 'timespec' definition, or it can ask for
struct __snd_pcm_mmap_status and struct __snd_pcm_mmap_status64
explicitly, regardless of the time_t definition.

 Arnd

Re: [Intel-gfx] 4.17-rc2: Could not determine valid watermarks for inherited state

2018-04-26 Thread Ville Syrjälä

On Thu, Apr 26, 2018 at 05:56:14PM +0300, Ville Syrjälä wrote:
> On Thu, Apr 26, 2018 at 10:27:19AM -0400, Dave Jones wrote:
> > [1.176131] [drm:i9xx_get_initial_plane_config] pipe A/primary A with 
> > fb: size=800x600@32, offset=0, pitch 3200, size 0x1d4c00
> > [1.176161] [drm:i915_gem_object_create_stolen_for_preallocated] 
> > creating preallocated stolen object: stolen_offset=0x, 
> > gtt_offset=0x, size=0x001d5000
> > [1.176312] [drm:intel_alloc_initial_plane_obj.isra.127] initial plane 
> > fb obj (ptrval)
> > [1.176351] [drm:intel_modeset_init] pipe A active planes 0x1
> > [1.176456] [drm:drm_atomic_helper_check_plane_state] Plane must cover 
> > entire CRTC
> > [1.176481] [drm:drm_rect_debug_print] dst: 800x600+0+0
> > [1.176494] [drm:drm_rect_debug_print] clip: 1366x768+0+0
> 
> OK, so that's the problem right there. The fb we took over from the
> BIOS was 800x600, but now we're trying to set up a 1366x768 mode.
> 
> We seem to be missing checks to make sure the initial fb is actually
> big enough for the mode we're currently using :(

Actually we do read out the pipe src size as 800x600 initially, which
makes sense. And we even stuff that into the mode.h/vdisplay, so up to
that point everything is pretty much correct. It goes wrong when
intel_modeset_readout_hw_state() calls intel_mode_from_pipe_config()
as that will override the h/vdisplay with the actual crtc timings
instead of the pipe src size.

So I suppose we should be able to just add the sanity checks for the
fb vs. h/vdisplay, and at least we should get past this error. A
slightly bigger mystery is what will happen later when our pipe src
size doesn't actually agree with the h/vdisplay. The first modeset
will correct it, but we might want some kind of extra sanitize step
for fastboot type of stuff.

-- 
Ville Syrjälä
Intel

Re: [PATCH] f2fs: clear page_error for all the writebacking pages

2018-04-26 Thread Jaegeuk Kim

On 04/24, Chao Yu wrote:
> Hi Jaegeuk,
> 
> On 2018/4/24 6:49, Jaegeuk Kim wrote:
> > This patch clears the page_error bit if the page is going to be written back.
> 
> This patch is similar to previous patch ("f2fs: clear PageError on 
> writepage"),
> only coverage is different, could you merge them?

Yeah, done.

> 
> Thanks,
> 
> > 
> > Signed-off-by: Jaegeuk Kim 
> > ---
> >  fs/f2fs/gc.c  | 1 +
> >  fs/f2fs/inline.c  | 1 +
> >  fs/f2fs/node.c| 1 +
> >  fs/f2fs/segment.c | 1 +
> >  4 files changed, 4 insertions(+)
> > 
> > diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> > index 70418b34c5f6..a7de8b3431a9 100644
> > --- a/fs/f2fs/gc.c
> > +++ b/fs/f2fs/gc.c
> > @@ -693,6 +693,7 @@ static void move_data_block(struct inode *inode, 
> > block_t bidx,
> > dec_page_count(fio.sbi, F2FS_DIRTY_META);
> >  
> > set_page_writeback(fio.encrypted_page);
> > +   ClearPageError(page);
> >  
> > /* allocate block address */
> > f2fs_wait_on_page_writeback(dn.node_page, NODE, true);
> > diff --git a/fs/f2fs/inline.c b/fs/f2fs/inline.c
> > index b50e36351280..1eaa2049eafa 100644
> > --- a/fs/f2fs/inline.c
> > +++ b/fs/f2fs/inline.c
> > @@ -139,6 +139,7 @@ int f2fs_convert_inline_page(struct dnode_of_data *dn, 
> > struct page *page)
> >  
> > /* write data page to try to make data consistent */
> > set_page_writeback(page);
> > +   ClearPageError(page);
> > fio.old_blkaddr = dn->data_blkaddr;
> > set_inode_flag(dn->inode, FI_HOT_DATA);
> > write_data_page(dn, &fio);
> > diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
> > index ae83ca9d2d31..3a3d38b3e9ec 100644
> > --- a/fs/f2fs/node.c
> > +++ b/fs/f2fs/node.c
> > @@ -1394,6 +1394,7 @@ static int __write_node_page(struct page *page, bool 
> > atomic, bool *submitted,
> > fio.op_flags |= REQ_PREFLUSH | REQ_FUA;
> >  
> > set_page_writeback(page);
> > +   ClearPageError(page);
> > fio.old_blkaddr = ni.blk_addr;
> > write_node_page(nid, &fio);
> > set_node_addr(sbi, &ni, fio.new_blkaddr, is_fsync_dnode(page));
> > diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
> > index 3325b9959119..a99c30b4eec9 100644
> > --- a/fs/f2fs/segment.c
> > +++ b/fs/f2fs/segment.c
> > @@ -2758,6 +2758,7 @@ void write_meta_page(struct f2fs_sb_info *sbi, struct 
> > page *page,
> > fio.op_flags &= ~REQ_META;
> >  
> > set_page_writeback(page);
> > +   ClearPageError(page);
> > f2fs_submit_page_write(&fio);
> >  
> > f2fs_update_iostat(sbi, io_type, F2FS_BLKSIZE);
> >

Re: Linux messages full of `random: get_random_u32 called from`

2018-04-26 Thread Sultan Alsawaf

> Hmm, can you let the boot hang for a while?  It should continue after
> a few minutes if you wait long enough, but wait a minute or two, then
> give it entropy so the boot can continue.  Then can you use
> "systemd-analyze blame" or "systemd-analyze critical-chain" and we
> can see what process was trying to get randomness during the boot
> startup and blocking waiting for the CRNG to be fully initialized.
>
>- Ted

systemd-analyze blame: https://hastebin.com/ikipavevew.css
systemd-analyze critical-chain: https://hastebin.com/odoyuqeges.pl
dmesg: https://hastebin.com/waracebeja.vbs

Re: [PATCH v3 0/6] scsi: handle special return codes for ABORTED COMMAND

2018-04-26 Thread Martin Wilck

On Fri, 2018-04-20 at 19:15 -0400, Martin K. Petersen wrote:
> 
> Much better, thanks for reworking this. Applied to 4.18/scsi-queue.

Thank you!

By the way, I've been wondering whether declaring blist_flags_t
__bitwise was a wise decision. blist_flags_t is kernel-internal, thus
endianness doesn't matter. OTOH, using __bitwise requires explicit
casts in many places, which may suppress warnings about integer size
mismatch and made me overlook some places where I had to change
"unsigned long" to "unsigned long long" in the first place
(in the submitted and applied version I think I caught them all).

Regards,
Martin

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

2018-04-26 Thread Joel Fernandes

On Thu, Apr 26, 2018 at 8:13 AM, Mathieu Desnoyers
 wrote:
[...]
>>> Regarding the name, I'm OK with having something along the lines of
>>> trace_*event*_blocking or such. Please don't use "srcu" or other naming
>>> that is explicitly tied to the underlying mechanism used internally
>>> however: what we want to convey is that this specific tracepoint probe
>>
>> Problem is that _blocking isn't the right word either. In my IRQ trace
>> point case, it will look something like this then:
>>
>> local_irq_disable();
>> // IRQs are now off.
>> trace_irq_disable_blocking(..);
>>
>> This wouldn't make sense. What we really want is to use the SRCU
>> implementation so that its low overhead...
>>
>> So it would be something like:
>>
>> local_irq_disable();
>> // IRQs are now off.
>> trace_irq_disable_srcu(..);
>>
>> I'm also OK if, as Paul was saying in his last email, that just for
>> _rcuidle, we use SRCU so that we don't have to do the rcu_enter_irq
>> stuff. Or we kill the _rcuidle API completely and use _srcu for those
>> users instead. We already have 1 implementation specific name anyway
>> (rcuidle), we're just replacing it with another one. If in the future,
>> if we want to change that name we can always do so (Also if you will,
>> correcting the existing already bad naming is a different problem and
>> we're not making it any worse tbh).
>
> Using SRCU rather than the sched-rcu tracepoint synchronization in your
> use-case is caused by a limitation of sched-rcu: it cannot be efficiently
> used within idle code. So you don't care about the "can_sleep" property
> of SRCU. You could even mix SRCU and sched-rcu callsites for the same
> probe name, and it would be perfectly valid.
>
> So even though both "can_sleep" and "rcuidle" caller variants would end
> up using SRCU under the hood, each can have its own caller API, e.g.:
>
> * trace_() -> only non-sleeping probes can register to those.
>  Uses sched-rcu under the hood.
>
> * trace__can_sleep() -> both sleeping and non-sleeping probes can
>register to those. Uses SRCU under the hood.
>
> * trace__rcuidle() -> only non-sleeping probes can register to those,
>  uses SRCU under the hood.
>

Cool, sounds good to me!
For starters I was thinking of changing the _rcuidle underlying
implementation as you pointed out. This should be simple enough and needs
no further additional APIs. I'll work on a patch along these lines and
send it out soon. Also would love to work on your sleeping callback
case in the future as well in case you wanted to spend your cycles
working on other things.

thanks,

- Joel

Re: [PATCH 4/4] exit: Lockless iteration over task list in mm_update_next_owner()

2018-04-26 Thread Peter Zijlstra

On Thu, Apr 26, 2018 at 04:52:39PM +0300, Kirill Tkhai wrote:
> In the patch I used the logic, that the below code:
> 
>   x = A;
>   spin_lock();
>   spin_unlock();
>   spin_lock();
>   spin_unlock();
>   y = B;
> 
> cannot reorder more than:
> 
>   spin_lock();
>   x = A;  <- this can't become visible later than spin_unlock()
>   spin_unlock();
>   spin_lock();
>   y = B;  <- this can't become visible earlier than spin_lock()
>   spin_unlock();
>   spin_unlock();
> 
> Is there a problem?

The two stores will be ordered, but only at the strength of an
smp_wmb(). The above construct does not imply smp_mb(). The difference
is observable on real hardware (Power).

Re: [RFC PATCH 09/35] ovl: stack file ops

2018-04-26 Thread Miklos Szeredi

On Thu, Apr 26, 2018 at 5:13 PM, Vivek Goyal  wrote:

> Aha, cool. thanks. While I am at it, let me just ask one more stupid
> question.
>
> I am wondering while opening the underlying realfile, why do we pass
> in the path/dentry of ovl layer (and not underlying real layer).
>
> realfile = path_open(&file->f_path, file->f_flags | O_NOATIME,
>  realinode, current_cred(), false);
>
> This forces us to do file_dentry() in ovl_open() later to map top level
> dentry to underlying dentry.
>
> We know the realinode and should be able to figure out the real dentry. Can't we
> construct path from underlying dentry and mount point and use that
> to open underlying real file.  I am sure there is some reason for doing
> this way, just trying to understand it.

The logical thing would be to just use the real path (as returned by
ovl_path_real()).

The reason we don't do that is because mmap stores the real file in
vma->vm_file and vm_file->f_path is used in various places (e.g.
/proc/PID/maps).

We could have a separate realfile for mmap, but that would be
additional complexity and memory use, so I don't think it makes sense.

Thanks,
Miklos

Re: [PATCH 4/4] exit: Lockless iteration over task list in mm_update_next_owner()

2018-04-26 Thread Peter Zijlstra

On Thu, Apr 26, 2018 at 04:52:39PM +0300, Kirill Tkhai wrote:
> >>
> >> 1)for_each_process(g)copy_process()
> >>p->mm = mm
> >> smp_rmb(); smp_wmb() implied by alloc_pid()
> >> if (g->flags & PF_KTHREAD) list_add_tail_rcu(&p->tasks, 
> >> &init_task.tasks)
> >>
> >> 2)for_each_thread(g, c)  copy_process()
> >>p->mm = mm
> >> smp_rmb(); smp_wmb() implied by alloc_pid()
> >> tmp = READ_ONCE(c->mm) list_add_tail_rcu(&p->thread_node, ...)

For these two; what's the purpose of the smp_rmb()? which loads are
ordered?

Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-26 Thread Mikulas Patocka



On Thu, 26 Apr 2018, Mikulas Patocka wrote:

> 
> 
> On Wed, 25 Apr 2018, James Bottomley wrote:
> 
> > > BTW. even developers who compile their own kernel should have this
> > > enabled by a CONFIG option - because if the developer sees the option
> > > when browsing through menuconfig, he may enable it. If he doesn't see
> > > the option, he won't even know that such an option exists.
> > 
> > I may be an atypical developer but I'd rather have a root canal than
> > browse through menuconfig options.  The way to get people to learn
> > about new debugging options is to blog about it (or write an lwn.net
> > article) which google will find the next time I ask it how I debug XXX.
> >  Google (probably as a service to humanity) rarely turns up Kconfig
> > options in response to a query.
> 
> From my point of view, this feature should be as little disruptive to the 
> developer as possible. It should work automatically behind the scenes 
> without the developer or the tester even knowing that it is working. From 
> this point of view, binding it to CONFIG_DEBUG_SG (or any other commonly 
> used debugging option) would be ideal, because driver developers already 
> enable CONFIG_DEBUG_SG, so they'll get this kvmalloc test for free.
> 
> From your point of view, you should introduce a sysfs file and a kernel 
> parameter that no one knows about - and then start blogging about it - to 
> let people know. Why would you bother people with this knowledge? They'll 
> forget about it anyway and won't turn it on.

BTW. try to think about - how many total lines of code should this feature 
consume in the whole Linux ecosystem?

I made a 10-line patch. I got pushback.

I remade it to a 53-line patch. And you try to suggest that 53 lines is 
not enough and we must also change kernel packaging scripts in distro 
kernels, because the kernel just cannot enable this feature on its own.

If we hack kernel packaging scripts in most distros, how many lines of 
code would that be? What's your target?

Mikulas

Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-26 Thread James Bottomley

On Thu, 2018-04-26 at 11:05 -0400, Mikulas Patocka wrote:
> 
> On Thu, 26 Apr 2018, James Bottomley wrote:
[...]
> > Perhaps find out beforehand instead of insisting on an approach without
> > knowing.  On openSUSE the grub config is built from the files in
> > /etc/grub.d/ so any package can add a kernel option (and various
> > conditions around activating it) simply by adding a new file.
> 
> And then, different versions of the debug kernel will clash when 
> attempting to create the same file.

Don't be silly ... there are many ways of coping with that in rpm/dpkg.
 However, I take it the fact you're now trying to get me to explain
them means you take the point that a kernel dynamic option can be
activated in a variety of easy ways in a distribution including through
the boot menu; so if you want this to appear in the boot menu you don't
need a Kconfig option to achieve it.

> And what about other distributions? What about people who build the RHEL
> kernel from source with "make"?

Well, if you build your own kernel and we have a dynamic option, it
will "just work" without you having to muck about trying to re-Kconfig
it, so I'd see that as a win.

> The problem with this approach is that you are trying to bother more and
> more people with this little silly feature.

So you're shifting your argument from "I have to do it as a Kconfig
option because the distros require it" to "distributions will build
separate kernel packages for this, but won't do enabling in a non
kernel package"?  To be honest, I think the argument is nuts but I
don't really care.  From my point of view it's usually me explaining to
people how to debug stuff and "you have to build your own kernel with
this Kconfig option" compared to "add this to the kernel command line
and reboot" is much more effort for the debugger.

James

Re: [Intel-gfx] 4.17-rc2: Could not determine valid watermarks for inherited state

2018-04-26 Thread Ville Syrjälä

On Thu, Apr 26, 2018 at 06:16:41PM +0300, Ville Syrjälä wrote:
> On Thu, Apr 26, 2018 at 05:56:14PM +0300, Ville Syrjälä wrote:
> > On Thu, Apr 26, 2018 at 10:27:19AM -0400, Dave Jones wrote:
> > > [1.176131] [drm:i9xx_get_initial_plane_config] pipe A/primary A with 
> > > fb: size=800x600@32, offset=0, pitch 3200, size 0x1d4c00
> > > [1.176161] [drm:i915_gem_object_create_stolen_for_preallocated] 
> > > creating preallocated stolen object: stolen_offset=0x, 
> > > gtt_offset=0x, size=0x001d5000
> > > [1.176312] [drm:intel_alloc_initial_plane_obj.isra.127] initial plane 
> > > fb obj (ptrval)
> > > [1.176351] [drm:intel_modeset_init] pipe A active planes 0x1
> > > [1.176456] [drm:drm_atomic_helper_check_plane_state] Plane must cover 
> > > entire CRTC
> > > [1.176481] [drm:drm_rect_debug_print] dst: 800x600+0+0
> > > [1.176494] [drm:drm_rect_debug_print] clip: 1366x768+0+0
> > 
> > OK, so that's the problem right there. The fb we took over from the
> > BIOS was 800x600, but now we're trying to set up a 1366x768 mode.
> > 
> > We seem to be missing checks to make sure the initial fb is actually
> > big enough for the mode we're currently using :(
> 
> Actually we do read out the pipe src size as 800x600 initially, which
> makes sense. And we even stuff that into the mode.h/vdisplay, so up to
> that point everything is pretty much correct. It goes wrong when
> intel_modeset_readout_hw_state() calls intel_mode_from_pipe_config()
> as that will override the h/vdisplay with the actual crtc timings
> instead of the pipe src size.
> 
> So I suppose we should be able to just add the sanity checks for the
> fb vs. h/vdisplay, and at least we should get past this error. A
> slightly bigger mystery is what will happen later when our pipe src
> size doesn't actually agree with the h/vdisplay. The first modeset
> will correct it, but we might want some kind of extra sanitize step
> for fastboot type of stuff.

Hmm. Or maybe we should just stick to the pipe src size.

I'm curious whether this fixes the problem?

diff --git a/drivers/gpu/drm/i915/intel_display.c 
b/drivers/gpu/drm/i915/intel_display.c
index 0f8c7389e87d..30824beedef7 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -15284,6 +15284,8 @@ static void intel_modeset_readout_hw_state(struct 
drm_device *dev)
memset(&crtc->base.mode, 0, sizeof(crtc->base.mode));
if (crtc_state->base.active) {
intel_mode_from_pipe_config(&crtc->base.mode, 
crtc_state);
+   crtc->base.mode.hdisplay = crtc_state->pipe_src_w;
+   crtc->base.mode.vdisplay = crtc_state->pipe_src_h;

intel_mode_from_pipe_config(&crtc_state->base.adjusted_mode, crtc_state);
WARN_ON(drm_atomic_set_mode_for_crtc(crtc->base.state, 
&crtc->base.mode));

-- 
Ville Syrjälä
Intel

Re: [f2fs-dev] [PATCH 1/5] f2fs: give message and set need_fsck given broken node id

2018-04-26 Thread Jaegeuk Kim

On 04/25, Chao Yu wrote:
> On 2018/4/25 13:46, Jaegeuk Kim wrote:
> > syzbot hit the following crash on upstream commit
> > 83beed7b2b26f232d782127792dd0cd4362fdc41 (Fri Apr 20 17:56:32 2018 +)
> > Merge branch 'fixes' of 
> > git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal
> > syzbot dashboard link: 
> > https://syzkaller.appspot.com/bug?extid=d154ec99402c6f628887
> > 
> > C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5414336294027264
> > syzkaller reproducer: 
> > https://syzkaller.appspot.com/x/repro.syz?id=5471683234234368
> > Raw console output: 
> > https://syzkaller.appspot.com/x/log.txt?id=5436660795834368
> > Kernel config: 
> > https://syzkaller.appspot.com/x/.config?id=1808800213120130118
> > compiler: gcc (GCC) 8.0.1 20180413 (experimental)
> > 
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+d154ec99402c6f628...@syzkaller.appspotmail.com
> > It will help syzbot understand when the bug is fixed. See footer for 
> > details.
> > If you forward the report, please keep this part and the footer.
> > 
> > F2FS-fs (loop0): Magic Mismatch, valid(0xf2f52010) - read(0x0)
> > F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
> > F2FS-fs (loop0): invalid crc value
> > [ cut here ]
> > kernel BUG at fs/f2fs/node.c:1185!
> > invalid opcode:  [#1] SMP KASAN
> > Dumping ftrace buffer:
> >(ftrace buffer empty)
> > Modules linked in:
> > CPU: 1 PID: 4549 Comm: syzkaller704305 Not tainted 4.17.0-rc1+ #10
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > Google 01/01/2011
> > RIP: 0010:__get_node_page+0xb68/0x16e0 fs/f2fs/node.c:1185
> > RSP: 0018:8801d960e820 EFLAGS: 00010293
> > RAX: 8801d88205c0 RBX: 0003 RCX: 82f6cc06
> > RDX:  RSI: 82f6d5e8 RDI: 0004
> > RBP: 8801d960ec30 R08: 8801d88205c0 R09: ed003b5e46c2
> > R10: 0003 R11: 0003 R12: 8801a86e00c0
> > R13: 0001 R14: 8801a86e0530 R15: 8801d9745240
> > FS:  0072c880() GS:8801daf0() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 7f3d403209b8 CR3: 0001d8f3f000 CR4: 001406e0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe0ff0 DR7: 0400
> > Call Trace:
> >  get_node_page fs/f2fs/node.c:1237 [inline]
> >  truncate_xattr_node+0x152/0x2e0 fs/f2fs/node.c:1014
> >  remove_inode_page+0x200/0xaf0 fs/f2fs/node.c:1039
> >  f2fs_evict_inode+0xe86/0x1710 fs/f2fs/inode.c:547
> >  evict+0x4a6/0x960 fs/inode.c:557
> >  iput_final fs/inode.c:1519 [inline]
> >  iput+0x62d/0xa80 fs/inode.c:1545
> >  f2fs_fill_super+0x5f4e/0x7bf0 fs/f2fs/super.c:2849
> >  mount_bdev+0x30c/0x3e0 fs/super.c:1164
> >  f2fs_mount+0x34/0x40 fs/f2fs/super.c:3020
> >  mount_fs+0xae/0x328 fs/super.c:1267
> >  vfs_kern_mount.part.34+0xd4/0x4d0 fs/namespace.c:1037
> >  vfs_kern_mount fs/namespace.c:1027 [inline]
> >  do_new_mount fs/namespace.c:2518 [inline]
> >  do_mount+0x564/0x3070 fs/namespace.c:2848
> >  ksys_mount+0x12d/0x140 fs/namespace.c:3064
> >  __do_sys_mount fs/namespace.c:3078 [inline]
> >  __se_sys_mount fs/namespace.c:3075 [inline]
> >  __x64_sys_mount+0xbe/0x150 fs/namespace.c:3075
> >  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
> >  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > RIP: 0033:0x443dea
> > RSP: 002b:7ffcc7882368 EFLAGS: 0297 ORIG_RAX: 00a5
> > RAX: ffda RBX: 2c00 RCX: 00443dea
> > RDX: 2000 RSI: 2100 RDI: 7ffcc7882370
> > RBP: 0003 R08: 20016a00 R09: 000a
> > R10:  R11: 0297 R12: 0004
> > R13: 00402ce0 R14:  R15: 
> > RIP: __get_node_page+0xb68/0x16e0 fs/f2fs/node.c:1185 RSP: 8801d960e820
> > ---[ end trace 4edbeb71f002bb76 ]---
> > 
> > Reported-and-tested-by: 
> > syzbot+d154ec99402c6f628...@syzkaller.appspotmail.com
> > Signed-off-by: Jaegeuk Kim 
> > ---
> >  fs/f2fs/f2fs.h  | 13 +
> >  fs/f2fs/inode.c | 13 ++---
> >  fs/f2fs/node.c  | 23 +--
> >  3 files changed, 28 insertions(+), 21 deletions(-)
> > 
> > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> > index 8f3ad9662d13..d26aae5bf00d 100644
> > --- a/fs/f2fs/f2fs.h
> > +++ b/fs/f2fs/f2fs.h
> > @@ -1583,18 +1583,6 @@ static inline bool __exist_node_summaries(struct 
> > f2fs_sb_info *sbi)
> > is_set_ckpt_flags(sbi, CP_FASTBOOT_FLAG));
> >  }
> >  
> > -/*
> > - * Check whether the given nid is within node id range.
> > - */
> > -static inline int check_nid_range(struct f2fs_sb_info *sbi, nid_t nid)
> > -{
> > -   if (unlikely(nid < F2FS_ROOT_INO(sbi)))
> > -   return -EINVAL;
> > -   if (unlikely(nid >= NM

Re: [f2fs-dev] [PATCH 4/5] f2fs: sanity check for total valid blocks

2018-04-26 Thread Jaegeuk Kim

On 04/25, Chao Yu wrote:
> Hi Jaegeuk,
> 
> This patch makes generic/008 fail, because in the fallocate case the total valid
> block count cannot be calculated by summing the valid_blocks of all SIT
> entries.

Yeah, I got that too, and I've been testing a change that checks valid_node_count
instead, which works for the syzbot case as well.

> 
> Thanks,
> 
> On 2018/4/25 13:46, Jaegeuk Kim wrote:
> > This patch enhances sanity check for SIT entries.
> > 
> > syzbot hit the following crash on upstream commit
> > 83beed7b2b26f232d782127792dd0cd4362fdc41 (Fri Apr 20 17:56:32 2018 +)
> > Merge branch 'fixes' of 
> > git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal
> > syzbot dashboard link: 
> > https://syzkaller.appspot.com/bug?extid=bf9253040425feb155ad
> > 
> > syzkaller reproducer: 
> > https://syzkaller.appspot.com/x/repro.syz?id=5692130282438656
> > Raw console output: 
> > https://syzkaller.appspot.com/x/log.txt?id=5095924598571008
> > Kernel config: 
> > https://syzkaller.appspot.com/x/.config?id=1808800213120130118
> > compiler: gcc (GCC) 8.0.1 20180413 (experimental)
> > 
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+bf9253040425feb15...@syzkaller.appspotmail.com
> > It will help syzbot understand when the bug is fixed. See footer for 
> > details.
> > If you forward the report, please keep this part and the footer.
> > 
> > F2FS-fs (loop0): invalid crc value
> > F2FS-fs (loop0): Try to recover 1th superblock, ret: 0
> > F2FS-fs (loop0): Mounted with checkpoint version = d
> > F2FS-fs (loop0): Bitmap was wrongly cleared, blk:9740
> > [ cut here ]
> > kernel BUG at fs/f2fs/segment.c:1884!
> > invalid opcode:  [#1] SMP KASAN
> > Dumping ftrace buffer:
> >(ftrace buffer empty)
> > Modules linked in:
> > CPU: 1 PID: 4508 Comm: syz-executor0 Not tainted 4.17.0-rc1+ #10
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > Google 01/01/2011
> > RIP: 0010:update_sit_entry+0x1215/0x1590 fs/f2fs/segment.c:1882
> > RSP: 0018:8801af526708 EFLAGS: 00010282
> > RAX: ed0035ea4cc0 RBX: 8801ad454f90 RCX: 
> > RDX:  RSI: 82eeb87e RDI: ed0035ea4cb6
> > RBP: 8801af526760 R08: 8801ad4a2480 R09: ed003b5e4f90
> > R10: ed003b5e4f90 R11: 8801daf27c87 R12: 8801adb8d380
> > R13: 0001 R14: 0008 R15: 
> > FS:  014af940() GS:8801daf0() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 7f06bc223000 CR3: 0001adb02000 CR4: 001406e0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe0ff0 DR7: 0400
> > Call Trace:
> >  allocate_data_block+0x66f/0x2050 fs/f2fs/segment.c:2663
> >  do_write_page+0x105/0x1b0 fs/f2fs/segment.c:2727
> >  write_node_page+0x129/0x350 fs/f2fs/segment.c:2770
> >  __write_node_page+0x7da/0x1370 fs/f2fs/node.c:1398
> >  sync_node_pages+0x18cf/0x1eb0 fs/f2fs/node.c:1652
> >  block_operations+0x429/0xa60 fs/f2fs/checkpoint.c:1088
> >  write_checkpoint+0x3ba/0x5380 fs/f2fs/checkpoint.c:1405
> >  f2fs_sync_fs+0x2fb/0x6a0 fs/f2fs/super.c:1077
> >  __sync_filesystem fs/sync.c:39 [inline]
> >  sync_filesystem+0x265/0x310 fs/sync.c:67
> >  generic_shutdown_super+0xd7/0x520 fs/super.c:429
> >  kill_block_super+0xa4/0x100 fs/super.c:1191
> >  kill_f2fs_super+0x9f/0xd0 fs/f2fs/super.c:3030
> >  deactivate_locked_super+0x97/0x100 fs/super.c:316
> >  deactivate_super+0x188/0x1b0 fs/super.c:347
> >  cleanup_mnt+0xbf/0x160 fs/namespace.c:1174
> >  __cleanup_mnt+0x16/0x20 fs/namespace.c:1181
> >  task_work_run+0x1e4/0x290 kernel/task_work.c:113
> >  tracehook_notify_resume include/linux/tracehook.h:191 [inline]
> >  exit_to_usermode_loop+0x2bd/0x310 arch/x86/entry/common.c:166
> >  prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
> >  syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
> >  do_syscall_64+0x6ac/0x800 arch/x86/entry/common.c:290
> >  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > RIP: 0033:0x457d97
> > RSP: 002b:7ffd46f9c8e8 EFLAGS: 0246 ORIG_RAX: 00a6
> > RAX:  RBX:  RCX: 00457d97
> > RDX: 014b09a3 RSI: 0002 RDI: 7ffd46f9da50
> > RBP: 7ffd46f9da50 R08:  R09: 0009
> > R10: 0005 R11: 0246 R12: 014b0940
> > R13:  R14: 0002 R15: 658e
> > RIP: update_sit_entry+0x1215/0x1590 fs/f2fs/segment.c:1882 RSP: 
> > 8801af526708
> > ---[ end trace f498328bb02610a2 ]---
> > 
> > Reported-and-tested-by: 
> > syzbot+bf9253040425feb15...@syzkaller.appspotmail.com
> > Reported-and-tested-by: 
> > syzbot+7d6d31d3bc702f566...@syzkaller.appspotmail.com
> > Reported-and-tested-by: 
> > syzbot+0a725420475916460...@syzkaller.appspotmail.com
>

Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-26 Thread Vincent Guittot

Hi Niklas,

>> Thanks for the trace, I have been able to catch a problem with it.
>> Could you test the patch below to confirm that the problem is solved ?
>> The patch applies on top of
>> c18bb396d3d261eb ("Merge 
>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
>
> I can confirm that with the patch below I can no longer reproduce the
> problem. Thanks!

Thanks for testing
Do you mind if I add
Tested-by: Niklas Söderlund 

Peter, Ingo,
Do you want me to re-send the patch with all tags, or will you take
this version?

Regards,
Vincent

>
>>
>> From: Vincent Guittot 
>> Date: Thu, 26 Apr 2018 12:19:32 +0200
>> Subject: [PATCH] sched/fair: fix the update of blocked load when newly idle
>> MIME-Version: 1.0
>> Content-Type: text/plain; charset=UTF-8
>> Content-Transfer-Encoding: 8bit
>>
>> With commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle"),
>> we release the rq->lock when updating blocked load of idle CPUs. This opens
>> a time window during which another CPU can add a task to this CPU's cfs_rq.
>> The check for a newly added task in idle_balance() is not in the common path.
>> Move the out label to include this check.
>>
>> Fixes: 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
>> Reported-by: Heiner Kallweit 
>> Reported-by: Niklas Söderlund 
>> Signed-off-by: Vincent Guittot 
>> ---
>>  kernel/sched/fair.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0951d1c..15a9f5e 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9847,6 +9847,7 @@ static int idle_balance(struct rq *this_rq, struct 
>> rq_flags *rf)
>>   if (curr_cost > this_rq->max_idle_balance_cost)
>>   this_rq->max_idle_balance_cost = curr_cost;
>>
>> +out:
>>   /*
>>* While browsing the domains, we released the rq lock, a task could
>>* have been enqueued in the meantime. Since we're not going idle,
>> @@ -9855,7 +9856,6 @@ static int idle_balance(struct rq *this_rq, struct 
>> rq_flags *rf)
>>   if (this_rq->cfs.h_nr_running && !pulled_task)
>>   pulled_task = 1;
>>
>> -out:
>>   /* Move the next balance forward */
>>   if (time_after(this_rq->next_balance, next_balance))
>>   this_rq->next_balance = next_balance;
>> --
>> 2.7.4
>>
>>
>>
>> [snip]
>>
>
> --
> Regards,
> Niklas Söderlund

Re: [PATCH] Revert "x86/mm: implement free pmd/pte page interfaces"

2018-04-26 Thread Greg KH

On Thu, Apr 26, 2018 at 05:14:07PM +0200, Joerg Roedel wrote:
> From: Joerg Roedel 
> 
> This reverts commit 28ee90fe6048fa7b7ceaeb8831c0e4e454a4cf89.
> 
> This commit is broken for x86, as it unmaps the PTE and PMD
> pages and immediately frees them without doing a TLB flush.
> 
> Further this lacks synchronization with other page-tables in
> the system when the PMD pages are not shared between
> mm_structs.
> 
> On x86-32 with PAE and the PTI patches on top, this patch
> triggers the BUG_ON in vmalloc_sync_one() because the kernel
> and the process page-table were not synchronized.
> 
> Signed-off-by: Joerg Roedel 
> ---
>  arch/x86/mm/pgtable.c | 28 ++--
>  1 file changed, 2 insertions(+), 26 deletions(-)



This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

Re: [f2fs-dev] [PATCH 4/5 v2] f2fs: sanity check for total valid blocks

2018-04-26 Thread Jaegeuk Kim

This patch enhances sanity check for SIT entries.

syzbot hit the following crash on upstream commit
83beed7b2b26f232d782127792dd0cd4362fdc41 (Fri Apr 20 17:56:32 2018 +)
Merge branch 'fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal
syzbot dashboard link: 
https://syzkaller.appspot.com/bug?extid=bf9253040425feb155ad

syzkaller reproducer: 
https://syzkaller.appspot.com/x/repro.syz?id=5692130282438656
Raw console output: https://syzkaller.appspot.com/x/log.txt?id=5095924598571008
Kernel config: https://syzkaller.appspot.com/x/.config?id=1808800213120130118
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+bf9253040425feb15...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for details.
If you forward the report, please keep this part and the footer.

F2FS-fs (loop0): invalid crc value
F2FS-fs (loop0): Try to recover 1th superblock, ret: 0
F2FS-fs (loop0): Mounted with checkpoint version = d
F2FS-fs (loop0): Bitmap was wrongly cleared, blk:9740
[ cut here ]
kernel BUG at fs/f2fs/segment.c:1884!
invalid opcode:  [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 4508 Comm: syz-executor0 Not tainted 4.17.0-rc1+ #10
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
RIP: 0010:update_sit_entry+0x1215/0x1590 fs/f2fs/segment.c:1882
RSP: 0018:8801af526708 EFLAGS: 00010282
RAX: ed0035ea4cc0 RBX: 8801ad454f90 RCX: 
RDX:  RSI: 82eeb87e RDI: ed0035ea4cb6
RBP: 8801af526760 R08: 8801ad4a2480 R09: ed003b5e4f90
R10: ed003b5e4f90 R11: 8801daf27c87 R12: 8801adb8d380
R13: 0001 R14: 0008 R15: 
FS:  014af940() GS:8801daf0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f06bc223000 CR3: 0001adb02000 CR4: 001406e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 allocate_data_block+0x66f/0x2050 fs/f2fs/segment.c:2663
 do_write_page+0x105/0x1b0 fs/f2fs/segment.c:2727
 write_node_page+0x129/0x350 fs/f2fs/segment.c:2770
 __write_node_page+0x7da/0x1370 fs/f2fs/node.c:1398
 sync_node_pages+0x18cf/0x1eb0 fs/f2fs/node.c:1652
 block_operations+0x429/0xa60 fs/f2fs/checkpoint.c:1088
 write_checkpoint+0x3ba/0x5380 fs/f2fs/checkpoint.c:1405
 f2fs_sync_fs+0x2fb/0x6a0 fs/f2fs/super.c:1077
 __sync_filesystem fs/sync.c:39 [inline]
 sync_filesystem+0x265/0x310 fs/sync.c:67
 generic_shutdown_super+0xd7/0x520 fs/super.c:429
 kill_block_super+0xa4/0x100 fs/super.c:1191
 kill_f2fs_super+0x9f/0xd0 fs/f2fs/super.c:3030
 deactivate_locked_super+0x97/0x100 fs/super.c:316
 deactivate_super+0x188/0x1b0 fs/super.c:347
 cleanup_mnt+0xbf/0x160 fs/namespace.c:1174
 __cleanup_mnt+0x16/0x20 fs/namespace.c:1181
 task_work_run+0x1e4/0x290 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:191 [inline]
 exit_to_usermode_loop+0x2bd/0x310 arch/x86/entry/common.c:166
 prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
 syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
 do_syscall_64+0x6ac/0x800 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457d97
RSP: 002b:7ffd46f9c8e8 EFLAGS: 0246 ORIG_RAX: 00a6
RAX:  RBX:  RCX: 00457d97
RDX: 014b09a3 RSI: 0002 RDI: 7ffd46f9da50
RBP: 7ffd46f9da50 R08:  R09: 0009
R10: 0005 R11: 0246 R12: 014b0940
R13:  R14: 0002 R15: 658e
RIP: update_sit_entry+0x1215/0x1590 fs/f2fs/segment.c:1882 RSP: 8801af526708
---[ end trace f498328bb02610a2 ]---

Reported-and-tested-by: syzbot+bf9253040425feb15...@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+7d6d31d3bc702f566...@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+0a725420475916460...@syzkaller.appspotmail.com
Signed-off-by: Jaegeuk Kim 
---

Change log from v1:
 - check valid node count

 fs/f2fs/segment.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 20250b88bf51..c60f87822e9c 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -3612,6 +3612,7 @@ static int build_sit_entries(struct f2fs_sb_info *sbi)
unsigned int i, start, end;
unsigned int readed, start_blk = 0;
int err = 0;
+   block_t total_node_blocks = 0;
 
do {
readed = ra_meta_pages(sbi, start_blk, BIO_MAX_PAGES,
@@ -3634,6 +3635,8 @@ static int build_sit_entries(struct f2fs_sb_info *sbi)
if (err)
return err

[PATCH RFC PoC 0/2] platform: different approach to early platform drivers

2018-04-26 Thread Bartosz Golaszewski

From: Bartosz Golaszewski 

This is a follow-up to my series[1], the aim of which was to introduce device tree
support for early platform devices.

It was received rather negatively. Aside from using device tree to pass
implementation specific details to the system, two important concerns were
raised: no probe deferral support and the fact that currently the early devices
never get converted to actual platform drivers. This series is a
proof-of-concept that's trying to address those issues.

The only user of the current version of early platform drivers is the SuperH
architecture. If this series eventually gets merged, we could simply replace
the other solution.

The current implementation of early platform drivers is pretty much a hack
built on top of the early_param mechanism. The devices only look like platform
devices and use the same structures but never actually get registered with the
driver model.

The idea behind this series is to allow users to probe platform devices very
early in the boot sequence and then switch the early devices to actual platform
devices and the early drivers to platform drivers during device_initcall.

If any of the early registration functions is called after device_initcall,
we'll just end up calling the normal platform_driver/device_register routines.

The user can specify whether the device should be probed a second time after
this conversion, and the driver code can then check whether it is being probed
early or normally.

We also support a simple version of probe deferral: initially each newly
registered device is added to the head of the early devices list. If it matches
an early driver, it will be probed. If probe returns -EPROBE_DEFER, it will be
moved to the tail of the list and reprobed the next time we match a device
but after all other devices.
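
To make the deferral rule concrete, here is a small user-space sketch of the
list handling described above. It is only an illustration, not the code from
the patch: all sk_* names are made up for this example, the list is a plain
hand-rolled doubly linked list rather than the kernel's list_head, and driver
matching is left out so that every unbound device is simply probed once per
pass.

```c
#include <stddef.h>

#define SK_EPROBE_DEFER 517	/* numeric value of the kernel's EPROBE_DEFER */

/* Hypothetical stand-in for an early device; not a kernel structure. */
struct sk_dev {
	const char *name;
	int bound;		/* set once probe has succeeded */
	int attempts;		/* number of probe calls so far */
	int (*probe)(struct sk_dev *dev);
	struct sk_dev *prev, *next;
};

struct sk_list {
	struct sk_dev *head, *tail;
};

static void sk_unlink(struct sk_list *l, struct sk_dev *d)
{
	if (d->prev)
		d->prev->next = d->next;
	else
		l->head = d->next;
	if (d->next)
		d->next->prev = d->prev;
	else
		l->tail = d->prev;
	d->prev = d->next = NULL;
}

/* New devices go to the head of the list. */
static void sk_add_head(struct sk_list *l, struct sk_dev *d)
{
	d->prev = NULL;
	d->next = l->head;
	if (l->head)
		l->head->prev = d;
	else
		l->tail = d;
	l->head = d;
}

static void sk_add_tail(struct sk_list *l, struct sk_dev *d)
{
	d->next = NULL;
	d->prev = l->tail;
	if (l->tail)
		l->tail->next = d;
	else
		l->head = d;
	l->tail = d;
}

/*
 * One pass: probe each unbound device once. A device that defers is moved
 * to the tail so that the next pass sees it after all other devices. The
 * pass stops at the element that was last when it started, so a device
 * that keeps deferring cannot make the loop spin.
 */
static void sk_probe_all(struct sk_list *l)
{
	struct sk_dev *d = l->head, *next, *last = l->tail;
	int was_last, rv;

	while (d) {
		next = d->next;
		was_last = (d == last);
		if (!d->bound) {
			rv = d->probe(d);
			if (rv == -SK_EPROBE_DEFER) {
				sk_unlink(l, d);
				sk_add_tail(l, d);
			} else if (rv == 0) {
				d->bound = 1;
			}
		}
		if (was_last)
			break;
		d = next;
	}
}

/* Example probe callbacks used to exercise the sketch. */
static int sk_probe_ok(struct sk_dev *d)
{
	d->attempts++;
	return 0;
}

static int sk_probe_defer_once(struct sk_dev *d)
{
	return d->attempts++ == 0 ? -SK_EPROBE_DEFER : 0;
}
```

With two devices registered, one using sk_probe_defer_once and one using
sk_probe_ok, the first pass binds only the second device and leaves the
deferring one at the tail; a second pass then binds it.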

This implementation has certain shortcomings that will be addressed if the
feedback is at least somewhat positive. For instance: the driver registration
happens in early_initcall(). This may be too late for certain clocksource or
clk drivers. The solution for that would be defining a new section in which the
init callbacks of the drivers would reside and let the architecture call the
actual registration function whenever it's needed.

I also need to figure out if any locking is needed.

We don't support DT in this series either. The proposed approach would be to
walk over the device nodes early in the boot sequence, allocate and probe
the matching early devices, and then register them with the driver model later.

[1] https://lkml.org/lkml/2018/4/24/937

Bartosz Golaszewski (2):
  earlydev: implement a new way to probe platform devices early
  misc: implement a dummy early platform driver

 drivers/base/Kconfig|   3 +
 drivers/base/Makefile   |   1 +
 drivers/base/earlydev.c | 175 
 drivers/base/platform.c |  11 ++
 drivers/misc/Kconfig|   8 ++
 drivers/misc/Makefile   |   1 +
 drivers/misc/dummy-early.c  |  46 +
 include/linux/earlydev.h|  63 
 include/linux/platform_device.h |   4 +
 9 files changed, 312 insertions(+)
 create mode 100644 drivers/base/earlydev.c
 create mode 100644 drivers/misc/dummy-early.c
 create mode 100644 include/linux/earlydev.h

-- 
2.17.0

[PATCH RFC PoC 1/2] earlydev: implement a new way to probe platform devices early

2018-04-26 Thread Bartosz Golaszewski

From: Bartosz Golaszewski 

The current implementation of early platform drivers is pretty much a
hack built on top of the early_param mechanism. The devices only look
like platform devices and use the same structures but never actually
get registered with the driver model.

The idea behind this series is to allow users to probe platform devices
very early in the boot sequence and then switch the early devices to
actual platform devices and the early drivers to platform drivers
during device_initcall.

Signed-off-by: Bartosz Golaszewski 
---
 drivers/base/Kconfig|   3 +
 drivers/base/Makefile   |   1 +
 drivers/base/earlydev.c | 175 
 drivers/base/platform.c |  11 ++
 include/linux/earlydev.h|  63 
 include/linux/platform_device.h |   4 +
 6 files changed, 257 insertions(+)
 create mode 100644 drivers/base/earlydev.c
 create mode 100644 include/linux/earlydev.h

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 29b0eb452b3a..e05f96d626b3 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -170,6 +170,9 @@ config DEV_COREDUMP
default y if WANT_DEV_COREDUMP
depends on ALLOW_DEV_COREDUMP
 
+config EARLYDEV
+   def_bool n
+
 config DEBUG_DRIVER
bool "Driver Core verbose debug messages"
depends on DEBUG_KERNEL
diff --git a/drivers/base/Makefile b/drivers/base/Makefile
index b074f242a435..ec47f86eac44 100644
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -7,6 +7,7 @@ obj-y   := component.o core.o bus.o dd.o 
syscore.o \
   attribute_container.o transport_class.o \
   topology.o container.o property.o cacheinfo.o \
   devcon.o
+obj-$(CONFIG_EARLYDEV) += earlydev.o
 obj-$(CONFIG_DEVTMPFS) += devtmpfs.o
 obj-$(CONFIG_DMA_CMA) += dma-contiguous.o
 obj-y  += power/
diff --git a/drivers/base/earlydev.c b/drivers/base/earlydev.c
new file mode 100644
index ..3da9e81031d2
--- /dev/null
+++ b/drivers/base/earlydev.c
@@ -0,0 +1,175 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018 Texas Instruments
+ * Author: Bartosz Golaszewski 
+ */
+
+#include 
+#include 
+
+#include "base.h"
+
+static bool early_done;
+static LIST_HEAD(early_drvs);
+static LIST_HEAD(early_devs);
+
+static void earlydev_pdev_set_name(struct platform_device *pdev)
+{
+   if (pdev->dev.init_name)
+   return;
+
+   if (!slab_is_available()) {
+   pr_warn("slab unavailable - not assigning name to early 
device\n");
+   return;
+   }
+
+   switch (pdev->id) {
+   case PLATFORM_DEVID_NONE:
+   pdev->dev.init_name = kasprintf(GFP_KERNEL, "%s", pdev->name);
+   break;
+   case PLATFORM_DEVID_AUTO:
+   pr_warn("auto device ID not supported in early devices\n");
+   break;
+   default:
+   pdev->dev.init_name = kasprintf(GFP_KERNEL, "%s.%d",
+   pdev->name, pdev->id);
+   break;
+   }
+
+   if (!pdev->dev.init_name)
+   pr_warn("error allocating the early device name\n");
+}
+
+static void earlydev_probe_devices(void)
+{
+   struct earlydev_driver *edrv, *ndrv;
+   struct earlydev_device *edev, *ndev;
+   int rv;
+
+   list_for_each_entry_safe(edev, ndev, &early_devs, list) {
+   if (edev->bound_to)
+   continue;
+
+   list_for_each_entry_safe(edrv, ndrv, &early_drvs, list) {
+   if (strcmp(edrv->plat_drv.driver.name,
+  edev->pdev.name) != 0)
+   continue;
+
+   earlydev_pdev_set_name(&edev->pdev);
+   rv = edrv->plat_drv.probe(&edev->pdev);
+   if (rv) {
+   if (rv == -EPROBE_DEFER) {
+   /*
+* Move the device to the end of the
+* list so that it'll be reprobed next
+* time after all new devices.
+*/
+   list_move_tail(&edev->list,
+  &early_devs);
+   continue;
+   }
+
+   pr_err("error probing early device: %d\n", rv);
+   continue;
+   }
+
+   edev->bound_to = edrv;
+   edev->pdev.early_probed = true;
+   }
+   }
+}
+
+bool earlydev_probing_early(void)
+{
+   return !early_done;
+}
+
+bool earlydev_probe_late(struct platform_device *pdev)
+{
+   struct earlydev_device

Re: [PATCH 4/4] exit: Lockless iteration over task list in mm_update_next_owner()

2018-04-26 Thread Andrea Parri

On Thu, Apr 26, 2018 at 04:52:39PM +0300, Kirill Tkhai wrote:
> On 26.04.2018 15:35, Andrea Parri wrote:

[...]

> > 
> > Mmh, it's possible that I am misunderstanding this statement but it does
> > not seem quite correct to me; a counter-example would be provided by the
> > test at "tools/memory-model/litmus-tests/SB+mbonceonces.litmus" (replace
> > either of the smp_mb() with the sequence:
> > 
> >spin_lock(s); spin_unlock(s); spin_lock(s); spin_unlock(s); ).
> > 
> > BTW, your commit message suggests that your case would work with "imply
> > an smp_wmb()".  This implication should hold "w.r.t. current implementa-
> > tions".  We (LKMM people) discussed changes to the LKMM to make it hold
> > in LKMM but such changes are still in our TODO list as of today...
> 
> I'm not familiar with LKMM, so the test you referenced is not clear to me.

The test could be concisely described by:

   {initially: x=y=0; }

   Thread0  Thread1

   x = 1;   y = 1;
   MB   MB
   r0 = y;  r1 = x;

   Can r0,r1 be both 0 after joining?

The answer to the question is -No-; however, if you replaced any of the
MB with the locking sequence described above, then the answer is -Yes-:
full fences on both sides are required to forbid that state and this is
something that the locking sequences won't be able to provide (think of
the implementation of these primitives for powerpc, for example).
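
For readers who prefer code to the table above, the MB variant of the test can
be approximated in user space with C11 atomics. This is only an analogy under
stated assumptions: a seq_cst fence is used as a stand-in for smp_mb(), the
sb_* names are invented for this sketch, and it says nothing about how the
kernel's spinlocks behave. It is not a replacement for running the litmus test
with the tools in "tools/memory-model/".

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* x, y, r0, r1 mirror the names used in the test description. */
static atomic_int x, y;
static int r0, r1;

static void *sb_thread0(void *unused)
{
	(void)unused;
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	/* Assumption: a seq_cst fence stands in for the kernel's smp_mb(). */
	atomic_thread_fence(memory_order_seq_cst);
	r0 = atomic_load_explicit(&y, memory_order_relaxed);
	return NULL;
}

static void *sb_thread1(void *unused)
{
	(void)unused;
	atomic_store_explicit(&y, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	r1 = atomic_load_explicit(&x, memory_order_relaxed);
	return NULL;
}

/*
 * Run the test once. Returns 1 if r0 == 0 and r1 == 0 was observed,
 * 0 otherwise, and -1 if a thread could not be created.
 */
static int sb_run_once(void)
{
	pthread_t t0, t1;

	atomic_store(&x, 0);
	atomic_store(&y, 0);
	if (pthread_create(&t0, NULL, sb_thread0, NULL))
		return -1;
	if (pthread_create(&t1, NULL, sb_thread1, NULL)) {
		pthread_join(t0, NULL);
		return -1;
	}
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	return r0 == 0 && r1 == 0;
}
```

With full fences on both sides, sb_run_once() should never report the
r0 == r1 == 0 outcome, which matches the -No- answer above.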

> Does LKMM show the real hardware behavior? Or does it cover only most
> cases, with work still in progress?

Very roughly speaking, LKMM is an "envelope" of the underlying hardware
memory models/architectures supported by the Linux kernel which in turn
may not coincide with the observable behavior on a given implementation
/processor of that architecture.  Also, LKMM doesn't aim to be a "tight"
envelope.  I'd refer to the documentation within "tools/memory-model/";
please let me know if I can provide further info.

> 
> In the patch I used the logic, that the below code:
> 
>   x = A;
>   spin_lock();
>   spin_unlock();
>   spin_lock();
>   spin_unlock();
>   y = B;
> 
> cannot reorder more than:
> 
>   spin_lock();
>   x = A;  <- this can't become visible later than spin_unlock()
>   spin_unlock();
>   spin_lock();
>   y = B;  <- this can't become visible earlier, than spin_lock()
>   spin_unlock();
> 
> Is there a problem?

As mentioned in the previous email, if smp_wmb() is what you're looking
for then this should be fine (considering current implementations; LKMM
will likely be there soon...).

BTW, the behavior in question has been recently discussed on the list;
c.f., for example, the test "unlock-lock-write-ordering" described in:

http://lkml.kernel.org/r/1519301990-11766-1-git-send-email-parri.and...@gmail.com

as well as

  0123f4d76ca63b7b895f40089be0ce4809e392d8
  ("riscv/spinlock: Strengthen implementations with fences")

  Andrea

> 
> Kirill

[PATCH RFC PoC 2/2] misc: implement a dummy early platform driver

2018-04-26 Thread Bartosz Golaszewski

From: Bartosz Golaszewski 

Implement a very simple early platform driver. Its purpose is to show
how such drivers can be registered and to emit a message when probed.

It can then be added to the device tree or machine code to verify that
the early platform devices work as expected.

Signed-off-by: Bartosz Golaszewski 
---
 drivers/misc/Kconfig   |  8 +++
 drivers/misc/Makefile  |  1 +
 drivers/misc/dummy-early.c | 46 ++
 3 files changed, 55 insertions(+)
 create mode 100644 drivers/misc/dummy-early.c

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 5d713008749b..99cde8aefdb0 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -91,6 +91,14 @@ config DUMMY_IRQ
  The sole purpose of this module is to help with debugging of systems 
on
  which spurious IRQs would happen on disabled IRQ vector.
 
+config DUMMY_EARLY
+   bool "Dummy early platform driver"
+   select EARLYDEV
+   default n
+   help
+ This module's only function is to register itself with the early
+ platform device framework and be probed early in the boot process.
+
 config IBM_ASM
tristate "Device driver for IBM RSA service processor"
depends on X86 && PCI && INPUT
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index 20be70c3f118..84ad0225eb14 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_INTEL_MID_PTI)   += pti.o
 obj-$(CONFIG_ATMEL_SSC)+= atmel-ssc.o
 obj-$(CONFIG_ATMEL_TCLIB)  += atmel_tclib.o
 obj-$(CONFIG_DUMMY_IRQ)+= dummy-irq.o
+obj-$(CONFIG_DUMMY_EARLY)  += dummy-early.o
 obj-$(CONFIG_ICS932S401)   += ics932s401.o
 obj-$(CONFIG_LKDTM)+= lkdtm/
 obj-$(CONFIG_TIFM_CORE)+= tifm_core.o
diff --git a/drivers/misc/dummy-early.c b/drivers/misc/dummy-early.c
new file mode 100644
index ..f00fb1fbd5fa
--- /dev/null
+++ b/drivers/misc/dummy-early.c
@@ -0,0 +1,46 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018 Texas Instruments
+ *
+ * Author:
+ *   Bartosz Golaszewski 
+ *
+ * Dummy testing driver whose only purpose is to be registered and probed
+ * using the early platform device mechanism.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static int dummy_early_probe(struct platform_device *pdev)
+{
+   if (earlydev_probing_early())
+   dev_notice(&pdev->dev, "dummy-early driver probed early!\n");
+   else
+   dev_notice(&pdev->dev, "dummy-early driver probed late!\n");
+
+   return 0;
+}
+
+static const struct of_device_id dummy_early_of_match[] = {
+   { .compatible = "none,dummy-early", },
+   { },
+};
+
+static struct earlydev_driver dummy_early_driver = {
+   .plat_drv = {
+   .probe = dummy_early_probe,
+   .driver = {
+   .name = "dummy-early",
+   .of_match_table = dummy_early_of_match,
+   },
+   }
+};
+earlydev_platform_driver(dummy_early_driver);
+
+MODULE_AUTHOR("Bartosz Golaszewski ");
+MODULE_DESCRIPTION("Dummy early platform device driver");
+MODULE_LICENSE("GPL v2");
-- 
2.17.0

Re: [PATCH] Revert "x86/mm: implement free pmd/pte page interfaces"

2018-04-26 Thread Joerg Roedel

On Thu, Apr 26, 2018 at 05:27:12PM +0200, Greg KH wrote:
> 
> 
> This is not the correct way to submit patches for inclusion in the
> stable kernel tree.  Please read:
> https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> for how to do this properly.
> 
> 

That is fine, as this is an upstream-first submission. When this commit
gets accepted it can also be applied to stable, given the original
commit was applied there too (didn't check that).

I just cc'ed stable because it was cc'ed on the original patch.

Thanks,

Joerg

Re: [RFC 01/10] PCI: dwc: Add MSI-X callbacks handler

2018-04-26 Thread Gustavo Pimentel


Hi Kishon,

On 24/04/2018 12:24, Kishon Vijay Abraham I wrote:
> Hi,
> 
> On Tuesday 24 April 2018 03:06 PM, Gustavo Pimentel wrote:
>> Hi Kishon,
>>
>> On 24/04/2018 08:07, Kishon Vijay Abraham I wrote:
>>> Hi,
>>>
>>> On Monday 23 April 2018 03:06 PM, Gustavo Pimentel wrote:
 Hi Kishon,

 On 16/04/2018 10:29, Kishon Vijay Abraham I wrote:
> Hi Gustavo,
>
> On Tuesday 10 April 2018 10:44 PM, Gustavo Pimentel wrote:
>> Changes the pcie_raise_irq function signature, namely the interrupt_num
>> variable type from u8 to u16 to accommodate the MSI-X maximum interrupts
>> of 2048.
>>
>> Implements a PCIe config space capability iterator function to search and
>> save the MSI and MSI-X pointers. With this method the code becomes more
>> generic and flexible.
>>
>> Implements MSI-X set/get functions for sysfs interface in order to change
>> the EP entries number.
>>
>> Implements EP MSI-X interface for triggering interruptions.
>>
>> Signed-off-by: Gustavo Pimentel 
>> ---
>>  drivers/pci/dwc/pci-dra7xx.c   |   2 +-
>>  drivers/pci/dwc/pcie-artpec6.c |   2 +-
>>  drivers/pci/dwc/pcie-designware-ep.c   | 145 
>> -
>>  drivers/pci/dwc/pcie-designware-plat.c |   6 +-
>>  drivers/pci/dwc/pcie-designware.h  |  23 +-
>>  5 files changed, 173 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/pci/dwc/pci-dra7xx.c b/drivers/pci/dwc/pci-dra7xx.c
>> index ed8558d..5265725 100644
>> --- a/drivers/pci/dwc/pci-dra7xx.c
>> +++ b/drivers/pci/dwc/pci-dra7xx.c
>> @@ -369,7 +369,7 @@ static void dra7xx_pcie_raise_msi_irq(struct 
>> dra7xx_pcie *dra7xx,
>>  }
>>  
>>  static int dra7xx_pcie_raise_irq(struct dw_pcie_ep *ep, u8 func_no,
>> - enum pci_epc_irq_type type, u8 
>> interrupt_num)
>> + enum pci_epc_irq_type type, u16 
>> interrupt_num)
>>  {
>>  struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>>  struct dra7xx_pcie *dra7xx = to_dra7xx_pcie(pci);
>> diff --git a/drivers/pci/dwc/pcie-artpec6.c 
>> b/drivers/pci/dwc/pcie-artpec6.c
>> index e66cede..96dc259 100644
>> --- a/drivers/pci/dwc/pcie-artpec6.c
>> +++ b/drivers/pci/dwc/pcie-artpec6.c
>> @@ -428,7 +428,7 @@ static void artpec6_pcie_ep_init(struct dw_pcie_ep 
>> *ep)
>>  }
>>  
>>  static int artpec6_pcie_raise_irq(struct dw_pcie_ep *ep, u8 func_no,
>> -  enum pci_epc_irq_type type, u8 
>> interrupt_num)
>> +  enum pci_epc_irq_type type, u16 
>> interrupt_num)
>>  {
>>  struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>>  
>> diff --git a/drivers/pci/dwc/pcie-designware-ep.c 
>> b/drivers/pci/dwc/pcie-designware-ep.c
>> index 15b22a6..874d4c2 100644
>> --- a/drivers/pci/dwc/pcie-designware-ep.c
>> +++ b/drivers/pci/dwc/pcie-designware-ep.c
>> @@ -40,6 +40,44 @@ void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum 
>> pci_barno bar)
>>  __dw_pcie_ep_reset_bar(pci, bar, 0);
>>  }
>>  
>> +void dw_pcie_ep_find_cap_addr(struct dw_pcie_ep *ep)
>> +{
>
> This should be implemented in a generic way similar to 
> pci_find_capability().
> It'll be useful when we try to implement other capabilities as well.

 Hum, what you suggest? Something implemented on the pci-epf-core?
>>>
>>> yeah, Initially thought it could be implemented as a helper function in
>>> pci-epc-core so that both designware and cadence can use it.
>>
>> That would be nice, however I couldn't find out how to access the config 
>> space,
>> through the pci_epf or pci_epc structs.
> 
> It's just a helper function so it can directly take the base address of the
> configuration space as argument (in our case, it should be dbi_base).

I don't think it would bring much benefit within this particular scope for the
time being. In any case, this could be improved later.
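
For reference, a helper of the kind Kishon describes, taking only the base
address of the configuration space, could look roughly like the sketch below.
It is a user-space illustration over a plain byte buffer, not code from this
series: the sk_* names are invented here, the constants are local copies of
the standard PCI offsets and capability IDs, and the buffer is assumed to
cover the full 256-byte configuration space.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_PCI_CAPABILITY_LIST	0x34	/* offset of the first capability pointer */
#define SK_PCI_CAP_ID_MSI	0x05
#define SK_PCI_CAP_ID_MSIX	0x11
#define SK_PCI_CFG_SIZE		256

/*
 * Walk the capability list in a 256-byte configuration space image and
 * return the offset of the first capability with ID 'cap', or 0 if it is
 * not present. 'cfg' must point to at least SK_PCI_CFG_SIZE bytes.
 */
static uint8_t sk_find_capability(const uint8_t *cfg, uint8_t cap)
{
	uint8_t pos = cfg[SK_PCI_CAPABILITY_LIST] & 0xfc;
	int ttl = 48;	/* bound the walk in case the list loops */

	/* Capabilities live above the 64-byte standard header. */
	while (pos >= 0x40 && ttl--) {
		uint8_t id = cfg[pos];

		if (id == 0xff)
			break;
		if (id == cap)
			return pos;
		pos = cfg[pos + 1] & 0xfc;	/* follow the next pointer */
	}
	return 0;
}
```

The ID is compared before the next pointer is followed, so a capability that
terminates the list (next pointer of zero) is still found.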

Regards,
Gustavo

> 
> Thanks
> Kishon
> 
>>
>> So, I reworked the functions like this:
>>
>> (on pcie-designware-ep.c)
>>
>> u8 __dw_pcie_ep_find_next_cap(struct dw_pcie *pci, u8 cap_ptr,
>>   u8 cap)
>> {
>> u8 cap_id, next_cap_ptr;
>> u16 reg;
>>
>> reg = dw_pcie_readw_dbi(pci, cap_ptr);
>> next_cap_ptr = (reg & 0xff00) >> 8;
>> cap_id = (reg & 0x00ff);
>>
>> if (!next_cap_ptr || cap_id > PCI_CAP_ID_MAX)
>> return 0;
>>
>> if (cap_id == cap)
>> return cap_ptr;
>>
>> return __dw_pcie_ep_find_next_cap(pci, next_cap_ptr, cap);
>> }
>>
>> u8 dw_pcie_ep_find_capability(struct dw_pcie *pci, u8 cap)
>> {
>> u8 next_cap_ptr;
>> u16 reg;
>>
>> reg = dw_pcie_readw_dbi(pci, PCI_CAPABILITY_LIST);
>

Re: [PATCH] perf test: Adapt test case record+probe_libc_inet_pton.sh for s390

2018-04-26 Thread Martin Vuille




On 04/26/18 04:09, Thomas-Mich Richter wrote:

was different. With your patch it changed from /usr/lib64/libc.so (old) to
/usr/lib/debug/lib64/libc-2.26.so.debug (new)


Thomas,

Can you tell me what 'file' reports for the old and new files?

Regards,
MV

Re: [f2fs-dev] [PATCH 5/5] f2fs: enforce fsync_mode=strict for renamed directory

2018-04-26 Thread Jaegeuk Kim

On 04/25, Chao Yu wrote:
> On 2018/4/25 13:46, Jaegeuk Kim wrote:
> > This gives the user an option to be able to recover B/foo in the case
> > below.
> > 
> > mkdir A
> > sync()
> > rename(A, B)
> > creat (B/foo)
> > fsync (B/foo)
> > ---crash---
> 
> That makes sense. IMO, it would be better to cover the cross-rename case as well?

file_lost_pino(old_inode) seems to cover that.
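
For reference, the system-call sequence from the quoted commit message can be written as a small user-space program. This is only a sketch of the sequence; it cannot simulate the crash itself, and whether B/foo survives a power cut depends on the filesystem and its fsync_mode. The helper names make_scratch_dir() and rename_then_fsync() are hypothetical.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* for mkdtemp(), sync() and fsync() declarations */
#endif

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create a private scratch directory; 'tmpl' must end in "XXXXXX". */
static int make_scratch_dir(char *tmpl)
{
	return mkdtemp(tmpl) ? 0 : -1;
}

/*
 * Run mkdir A; sync; rename(A, B); creat(B/foo); fsync(B/foo) under 'base'.
 * Returns 0 on success and -1 on the first failing call.
 */
static int rename_then_fsync(const char *base)
{
	char a[512], b[512], foo[512];
	int fd;

	assert(base);
	if (snprintf(a, sizeof(a), "%s/A", base) >= (int)sizeof(a) ||
	    snprintf(b, sizeof(b), "%s/B", base) >= (int)sizeof(b) ||
	    snprintf(foo, sizeof(foo), "%s/B/foo", base) >= (int)sizeof(foo))
		return -1;

	if (mkdir(a, 0755) != 0)
		return -1;
	sync();				/* A is now persistent */
	if (rename(a, b) != 0)
		return -1;
	fd = open(foo, O_CREAT | O_WRONLY, 0644);	/* creat(B/foo) */
	if (fd < 0)
		return -1;
	if (fsync(fd) != 0) {		/* the point after which foo should be recoverable */
		close(fd);
		return -1;
	}
	return close(fd);
}
```

A crash-consistency test would cut power right after the fsync() call returns and then check that B/foo exists after recovery.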

> 
> Thanks,
> 
> > 
> > Suggested-by: Velayudhan Pillai
> > Signed-off-by: Jaegeuk Kim 
> > ---
> >  fs/f2fs/namei.c | 5 -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
> > index b5f404674cad..fef6e3ab2135 100644
> > --- a/fs/f2fs/namei.c
> > +++ b/fs/f2fs/namei.c
> > @@ -973,8 +973,11 @@ static int f2fs_rename(struct inode *old_dir, struct 
> > dentry *old_dentry,
> > f2fs_put_page(old_dir_page, 0);
> > f2fs_i_links_write(old_dir, false);
> > }
> > -   if (F2FS_OPTION(sbi).fsync_mode == FSYNC_MODE_STRICT)
> > +   if (F2FS_OPTION(sbi).fsync_mode == FSYNC_MODE_STRICT) {
> > add_ino_entry(sbi, new_dir->i_ino, TRANS_DIR_INO);
> > +   if (S_ISDIR(old_inode->i_mode))
> > +   add_ino_entry(sbi, old_inode->i_ino, TRANS_DIR_INO);
> > +   }
> >  
> > f2fs_unlock_op(sbi);
> >  
> >

Re: [RFC 06/10] misc: pci_endpoint_test: Add MSI-X support

2018-04-26 Thread Gustavo Pimentel

Hi Kishon,

On 24/04/2018 12:43, Kishon Vijay Abraham I wrote:
> Hi,
> 
> On Tuesday 24 April 2018 04:27 PM, Gustavo Pimentel wrote:
>> Hi Kishon,
>>
>> On 24/04/2018 08:19, Kishon Vijay Abraham I wrote:
>>> Hi,
>>>
>>> On Tuesday 17 April 2018 11:08 PM, Gustavo Pimentel wrote:
>>>> Hi Kishon,
>>>>
>>>> On 17/04/2018 11:33, Kishon Vijay Abraham I wrote:
>>>>> Hi,
>>>>>
>>>>> On Tuesday 10 April 2018 10:44 PM, Gustavo Pimentel wrote:
>>>>>> Adds the MSI-X support and updates driver documentation accordingly.
>>>>>>
>>>>>> Changes the driver parameter in order to allow the interruption type
>>>>>> selection.
>>>>>>
>>>>>> Signed-off-by: Gustavo Pimentel
>>>>>> ---
>>>>>>  Documentation/misc-devices/pci-endpoint-test.txt |   3 +
>>>>>>  drivers/misc/pci_endpoint_test.c | 102
>>>>>> +--
>>>>>>  2 files changed, 79 insertions(+), 26 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/misc-devices/pci-endpoint-test.txt
>>>>>> b/Documentation/misc-devices/pci-endpoint-test.txt
>>>>>> index 4ebc359..fdfa0f6 100644
>>>>>> --- a/Documentation/misc-devices/pci-endpoint-test.txt
>>>>>> +++ b/Documentation/misc-devices/pci-endpoint-test.txt
>>>>>> @@ -10,6 +10,7 @@ The PCI driver for the test device performs the
>>>>>> following tests
>>>>>>  *) verifying addresses programmed in BAR
>>>>>>  *) raise legacy IRQ
>>>>>>  *) raise MSI IRQ
>>>>>> +*) raise MSI-X IRQ
>>>>>>  *) read data
>>>>>>  *) write data
>>>>>>  *) copy data
>>>>>> @@ -25,6 +26,8 @@ ioctl
>>>>>>   PCITEST_LEGACY_IRQ: Tests legacy IRQ
>>>>>>   PCITEST_MSI: Tests message signalled interrupts. The MSI number
>>>>>>to be tested should be passed as argument.
>>>>>> + PCITEST_MSIX: Tests message signalled interrupts. The MSI-X number
>>>>>> +  to be tested should be passed as argument.
>>>>>>   PCITEST_WRITE: Perform write tests. The size of the buffer should be
>>>>>> passed
>>>>>>  as argument.
>>>>>>   PCITEST_READ: Perform read tests. The size of the buffer should be
>>>>>> passed
>>>>>> diff --git a/drivers/misc/pci_endpoint_test.c
>>>>>> b/drivers/misc/pci_endpoint_test.c
>>>>>> index 37db0fc..a7d9354 100644
>>>>>> --- a/drivers/misc/pci_endpoint_test.c
>>>>>> +++ b/drivers/misc/pci_endpoint_test.c
>>>>>> @@ -42,11 +42,16 @@
>>>>>>  #define PCI_ENDPOINT_TEST_COMMAND   0x4
>>>>>>  #define COMMAND_RAISE_LEGACY_IRQ    BIT(0)
>>>>>>  #define COMMAND_RAISE_MSI_IRQ   BIT(1)
>>>>>> -#define MSI_NUMBER_SHIFT        2
>>>>>> -/* 6 bits for MSI number */
>>>>>> -#define COMMAND_READ            BIT(8)
>>>>>> -#define COMMAND_WRITE   BIT(9)
>>>>>> -#define COMMAND_COPY            BIT(10)
>>>>>> +#define COMMAND_RAISE_MSIX_IRQ  BIT(2)
>>>>>> +#define IRQ_TYPE_SHIFT  3
>>>>>> +#define IRQ_TYPE_LEGACY 0
>>>>>> +#define IRQ_TYPE_MSI    1
>>>>>> +#define IRQ_TYPE_MSIX   2
>>>>>> +#define MSI_NUMBER_SHIFT        5
>>>>>
>>>>> Now that you are anyways fixing this, add a new register entry for MSI
>>>>> numbers.
>>>>> Let's not keep COMMAND and MSI's together.
>>>>
>>>> What you suggest?
>>>
>>> #define PCI_ENDPOINT_TEST_COMMAND           0x4
>>> #define COMMAND_RAISE_LEGACY_IRQ            BIT(0)
>>> #define COMMAND_RAISE_MSI_IRQ               BIT(1)
>>> #define COMMAND_RAISE_MSIX_IRQ              BIT(2)
>>> #define COMMAND_READ                        BIT(3)
>>> #define COMMAND_WRITE                       BIT(4)
>>> #define COMMAND_COPY                        BIT(5)
>>>
>>> #define PCI_ENDPOINT_TEST_STATUS            0x8
>>> #define STATUS_READ_SUCCESS                 BIT(0)
>>> #define STATUS_READ_FAIL                    BIT(1)
>>> #define STATUS_WRITE_SUCCESS                BIT(2)
>>> #define STATUS_WRITE_FAIL                   BIT(3)
>>> #define STATUS_COPY_SUCCESS                 BIT(4)
>>> #define STATUS_COPY_FAIL                    BIT(5)
>>> #define STATUS_IRQ_RAISED                   BIT(6)
>>> #define STATUS_SRC_ADDR_INVALID             BIT(7)
>>> #define STATUS_DST_ADDR_INVALID             BIT(8)
>>>
>>> #define PCI_ENDPOINT_TEST_LOWER_SRC_ADDR    0xc
>>> #define PCI_ENDPOINT_TEST_UPPER_SRC_ADDR    0x10
>>>
>>> #define PCI_ENDPOINT_TEST_LOWER_DST_ADDR    0x14
>>> #define PCI_ENDPOINT_TEST_UPPER_DST_ADDR    0x18
>>>
>>> #define PCI_ENDPOINT_TEST_SIZE              0x1c
>>> #define PCI_ENDPOINT_TEST_CHECKSUM          0x20
>>>
>>> #define PCI_ENDPOINT_TEST_MSI_NUMBER        0x24
>>
>> Ok. I will do it.
>>
>>>
>>> We should try not to modify either the existing register offsets or the bit
>>> fields within these registers in the future as EP and RC will be running on
>>> different systems and it is possible one of them might not have the updated
>>> kernel.
>>
>> I totally agree.
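
To illustrate the layout proposed above, here is a small user-space sketch in which the interrupt number goes into its own register and the COMMAND register carries only command bits. The register offsets and command bits are the ones from the quoted proposal; everything else (the regs array standing in for the BAR, test_writel(), raise_irq(), enum irq_kind, TEST_REG_COUNT) is hypothetical and only meant to show why the two fields no longer need shifting and masking within one word.

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n)	(1u << (n))

/* Offsets and bits as proposed in the quoted reply. */
#define PCI_ENDPOINT_TEST_COMMAND     0x4
#define COMMAND_RAISE_LEGACY_IRQ      BIT(0)
#define COMMAND_RAISE_MSI_IRQ         BIT(1)
#define COMMAND_RAISE_MSIX_IRQ        BIT(2)
#define PCI_ENDPOINT_TEST_MSI_NUMBER  0x24

#define TEST_REG_COUNT 16	/* hypothetical: covers offsets 0x0..0x3c */

enum irq_kind { IRQ_KIND_LEGACY, IRQ_KIND_MSI, IRQ_KIND_MSIX };	/* hypothetical */

/* Stand-in for a 32-bit MMIO write into the test BAR. */
static void test_writel(uint32_t *regs, uint32_t offset, uint32_t value)
{
	assert(offset % 4 == 0 && offset / 4 < TEST_REG_COUNT);
	regs[offset / 4] = value;
}

/* Program the interrupt number first, then issue the command. */
static int raise_irq(uint32_t *regs, enum irq_kind kind, uint32_t number)
{
	uint32_t cmd;

	switch (kind) {
	case IRQ_KIND_LEGACY:
		cmd = COMMAND_RAISE_LEGACY_IRQ;
		break;
	case IRQ_KIND_MSI:
		cmd = COMMAND_RAISE_MSI_IRQ;
		break;
	case IRQ_KIND_MSIX:
		cmd = COMMAND_RAISE_MSIX_IRQ;
		break;
	default:
		return -1;
	}

	test_writel(regs, PCI_ENDPOINT_TEST_MSI_NUMBER, number);
	test_writel(regs, PCI_ENDPOINT_TEST_COMMAND, cmd);
	return 0;
}
```

Because the number has a register of its own, widening it later does not move any command bit, which fits the concern about EP and RC running different kernel versions.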
>>

>>>>>> +/* 12 bits for MSI number */
>>>>>> +#define COMMAND_READ            BIT(17)
>>>>>> +#define COMMAND_WRITE   BIT(18)
>>>>>> +#define COMMAND_COPY

Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-26 Thread Niklas Söderlund

Hi Vincent,

On 2018-04-26 17:27:24 +0200, Vincent Guittot wrote:
> Hi Niklas,
> 
> >> Thanks for the trace, I have been able to catch a problem with it.
> >> Could you test the patch below to confirm that the problem is solved?
> >> The patch applies on top of
> >> c18bb396d3d261eb ("Merge 
> >> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
> >
> > I can confirm that with the patch below I can no longer produce the
> > problem. Thanks!
> 
> Thanks for testing
> Do you mind if I add
> Tested-by: Niklas Söderlund 

Please do.

> 
> Peter, Ingo,
> Do you want me to re-send the patch with all tags, or will you take
> this version?
> 
> Regards,
> Vincent
> 
> >
> >>
> >> From: Vincent Guittot 
> >> Date: Thu, 26 Apr 2018 12:19:32 +0200
> >> Subject: [PATCH] sched/fair: fix the update of blocked load when newly idle
> >> MIME-Version: 1.0
> >> Content-Type: text/plain; charset=UTF-8
> >> Content-Transfer-Encoding: 8bit
> >>
> >> With commit 31e77c93e432 ("sched/fair: Update blocked load when newly 
> >> idle"),
> >> we release the rq->lock when updating blocked load of idle CPUs. This opens
> >> a time window during which another CPU can add a task to this CPU's cfs_rq.
> >> The check for newly added task of idle_balance() is not in the common path.
> >> Move the out label to include this check.
> >>
> >> Fixes: 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
> >> Reported-by: Heiner Kallweit 
> >> Reported-by: Niklas Söderlund 
> >> Signed-off-by: Vincent Guittot 
> >> ---
> >>  kernel/sched/fair.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 0951d1c..15a9f5e 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -9847,6 +9847,7 @@ static int idle_balance(struct rq *this_rq, struct 
> >> rq_flags *rf)
> >>   if (curr_cost > this_rq->max_idle_balance_cost)
> >>   this_rq->max_idle_balance_cost = curr_cost;
> >>
> >> +out:
> >>   /*
> >>* While browsing the domains, we released the rq lock, a task could
> >>* have been enqueued in the meantime. Since we're not going idle,
> >> @@ -9855,7 +9856,6 @@ static int idle_balance(struct rq *this_rq, struct 
> >> rq_flags *rf)
> >>   if (this_rq->cfs.h_nr_running && !pulled_task)
> >>   pulled_task = 1;
> >>
> >> -out:
> >>   /* Move the next balance forward */
> >>   if (time_after(this_rq->next_balance, next_balance))
> >>   this_rq->next_balance = next_balance;
> >> --
> >> 2.7.4
> >>
> >>
> >>
> >> [snip]
> >>
> >
> > --
> > Regards,
> > Niklas Söderlund

-- 
Regards,
Niklas Söderlund
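
The effect of moving the "out:" label in the quoted patch can be shown with a small user-space model. This is a sketch under simplifying assumptions, not the scheduler code: struct rq_model and both function names are hypothetical, the domain walk is elided, and the early_exit flag stands for the path that jumps to the label after the rq lock has been dropped and re-taken.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few run-queue fields this sketch needs. */
struct rq_model {
	int h_nr_running;	/* tasks enqueued, possibly while the lock was dropped */
	bool early_exit;	/* true when the function bails out before the domain walk */
};

/* Label placed after the check: the early-exit path never sees new tasks. */
static int idle_balance_before(const struct rq_model *rq)
{
	int pulled_task = 0;

	assert(rq);
	if (rq->early_exit)
		goto out;

	/* ... domain walk elided ... */

	if (rq->h_nr_running && !pulled_task)
		pulled_task = 1;
out:
	return pulled_task;
}

/* Label placed before the check: every path notices a newly enqueued task. */
static int idle_balance_after(const struct rq_model *rq)
{
	int pulled_task = 0;

	assert(rq);
	if (rq->early_exit)
		goto out;

	/* ... domain walk elided ... */
out:
	if (rq->h_nr_running && !pulled_task)
		pulled_task = 1;
	return pulled_task;
}
```

In this model a run queue that gained a task during the window and took the early exit reports 0 with the old label position and 1 with the new one, so the CPU would not go idle with work queued.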

Re: issues with suspend on Dell XPS 13 2-in-1

2018-04-26 Thread Dennis Gilmore

On Thu, 26-04-2018 at 15:09 +, Pandruvada, Srinivas wrote:
> On Thu, 2018-04-26 at 07:42 -0500, Dennis Gilmore wrote:
> > Hi Srinivas,
> > 
> > On Thu, 26-04-2018 at 05:34 +, Pandruvada, Srinivas wrote:
> > > Hi Dennis,
> > > 
> > > On Wed, 2018-04-25 at 22:06 -0500, Dennis Gilmore wrote:
> > > > Hi Srinivas,
> > > > 
> > > > Yes, I have the latest BIOS: version 1.3.1, released on the
> > > > 18th of Feb.
> > > 
> > > Can you try these commands and repeat the test?
> > > 
> > > # cd /sys/kernel/debug/pmc_core/
> > > # for i in {0..32}; do echo $i > ltr_ignore; done
> > 
> > # for i in {0..32}; do echo $i > ltr_ignore; done
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > -bash: ltr_ignore: Operación no permitida
> > 
> > Should I go ahead and run the test even though writing to ltr_ignore
> > failed?
> 
> Strange. Do you have any issue with the permissions?
> # cd /sys/kernel/debug/pmc_core/
> # ls -l

# ls -l 
total 0
-rw-r--r--. 1 root root 0 Apr 24 17:27 ltr_ignore
-r--r--r--. 1 root root 0 Apr 24 17:27 mphy_core_lanes_power_gating_status
-r--r--r--. 1 root root 0 Apr 24 17:27 pch_ip_power_gating_status
-r--r--r--. 1 root root 0 Apr 24 17:27 pll_status
-r--r--r--. 1 root root 0 Apr 24 17:27 slp_s0_residency_usec

I have rw permissions; SELinux is currently in permissive mode.

Thanks

Dennis

> Do you have rw permissions?
> 
> Thanks,
> Srinivas
> 
> > 
> > Dennis
> > 
> > > Thanks,
> > > Srinivas
> > > 
> > > > 
> > > > Dennis
> > > > 
> > > > On Thu, 26-04-2018 at 02:13 +, Pandruvada, Srinivas wrote:
> > > > > 
> > > > > I see around 43% PC10 residency with power drop of 0.7W.
> > > > > Do you have the latest BIOS of Dell 9365?
> > > > > 
> > > > > 
> > > > > Thanks,
> > > > > Srinivas
> > > > > 
> > > > > On Fri, 2018-04-20 at 08:36 -0500, Dennis Gilmore wrote:
> > > > > > Here is the full output 
> > > > > > 
> > > > > > # turbostat
> > > > > > turbostat version 17.06.23 - Len Brown 
> > > > > > CPUID(0): GenuineIntel 22 CPUID levels;
> > > > > > family:model:stepping
> > > > > > 0x6:8e:9 (6:142:9)
> > > > > > CPUID(1): SSE3 MONITOR SMX EIST TM2 TSC MSR ACPI-TM TM
> > > > > > CPUID(6): APERF, TURBO, DTS, PTM, HWP, HWPnotify,
> > > > > > HWPwindow,
> > > > > > HWPepp, No-HWPpkg, EPB
> > > > > > cpu0: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST No-MWAIT
> > > > > > PREFETCH
> > > > > > TURBO)
> > > > > > CPUID(7): SGX
> > > > > > cpu0: MSR_IA32_FEATURE_CONTROL: 0xff07 (Locked )
> > > > > > CPUID(0x15): eax_crystal: 2 ebx_tsc: 134 ecx_crystal_hz: 0
> > > > > > TSC: 1608 MHz (2400 Hz * 134 / 2 / 100)
> > > > > > CPUID(0x16): base_mhz: 1600 max_mhz: 3600 bus_mhz: 100
> > > > > > cpu0: MSR_MISC_PWR_MGMT: 0x00401cc0 (ENable-
> > > > > > EIST_Coordination
> > > > > > DISable-EPB DISable-OOB)
> > > > > > RAPL: 58254 sec. Joule Counter Range, at 4 Watts
> > > > > > cpu0: MSR_PLATFORM_INFO: 0x804043df1011000
> > > > > > 4 * 100.0 = 400.0 MHz max efficiency frequency
> > > > > > 16 * 100.0 = 1600.0 MHz base frequency
> > > > > > cpu0: MSR_IA32_POWER_CTL: 0x0024005d (C1E auto-promotion:
> > > > > > DISabled)
> > > > > > cpu0: MSR_TURBO_RATIO_LIMIT: 0x2224
> > > > > > 34 * 100.0 = 3400.0 MHz max turbo 4 active cores
> > > > > > 34 * 100.0 = 3400.0 MHz max turbo 3 active cores
> > > > > > 34 * 100.0 = 3400.0 MHz max turbo 2 active cores
> > > > > > 36 * 100.0 = 3600.0 MHz max turbo 1 active cores
> > > > > > cpu0: MSR_CONFIG_TDP_NOMINAL: 0x000d (base_ratio=13)
> > > > > > cpu0: MSR_CONFIG_TDP_LEVEL_1: 0x0006001c
> > > > > > (PKG_M

Re: [PATCH 4/7] aio: remove the extra get_file/fput pair in io_submit_one

2018-04-26 Thread Darrick J. Wong

On Sun, Apr 15, 2018 at 05:01:05PM +0200, Christoph Hellwig wrote:
> If we release the lockdep write protection token before calling into
> ->write_iter and thus never access the file pointer after an -EIOCBQUEUED
> return from ->write_iter or ->read_iter we don't need this extra
> reference.

Hmm, subtleties lurk to this unfamiliar reader...

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/aio.c | 11 +--
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index 18507743757a..d7be32cdd1db 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1515,16 +1515,17 @@ static ssize_t aio_write(struct kiocb *req, struct 
> iocb *iocb, bool vectored,
>   return ret;
>   ret = rw_verify_area(WRITE, file, &req->ki_pos, iov_iter_count(&iter));
>   if (!ret) {
> - req->ki_flags |= IOCB_WRITE;
> - file_start_write(file);
> - ret = aio_ret(req, call_write_iter(file, req, &iter));
>   /*
>* We release freeze protection in aio_complete().  Fool lockdep
>* by telling it the lock got released so that it doesn't
>* complain about held lock when we return to userspace.
>*/
> - if (S_ISREG(file_inode(file)->i_mode))
> + if (S_ISREG(file_inode(file)->i_mode)) {
> + __sb_start_write(file_inode(file)->i_sb, 
> SB_FREEZE_WRITE, true);

It took me a while to figure out that this ^^^ is the same as the
file_start_write call that you remove above, can you please update the
comment to note that we take freeze protection for the file before
screwing with lockdep? e.g.,

/*
 * Open-code file_start_write here to grab freeze protection, which will
 * be released by another thread in aio_complete().  Fool lockdep by
 * telling it the lock got released so that it doesn't complain about
 * held lock when we return to userspace.
 */

>   __sb_writers_release(file_inode(file)->i_sb, 
> SB_FREEZE_WRITE);
> + }
> + req->ki_flags |= IOCB_WRITE;
> + ret = aio_ret(req, call_write_iter(file, req, &iter));
>   }
>   kfree(iovec);
>   return ret;
> @@ -1599,7 +1600,6 @@ static int io_submit_one(struct kioctx *ctx, struct 
> iocb __user *user_iocb,
>   req->ki_user_iocb = user_iocb;
>   req->ki_user_data = iocb->aio_data;
>  
> - get_file(file);

Here we have a reference to *file, but...

>   switch (iocb->aio_lio_opcode) {
>   case IOCB_CMD_PREAD:
>   ret = aio_read(&req->common, iocb, false, compat);
> @@ -1618,7 +1618,6 @@ static int io_submit_one(struct kioctx *ctx, struct 
> iocb __user *user_iocb,
>   ret = -EINVAL;
>   break;
>   }
> - fput(file);

...by the time we get to here the reference may have gone away, but
you'd have to dig through aio_{read,write} -> call_{r,w}_iter ->
{r,w}_iter in order to figure out that the reference isn't valid
anymore on a EIOCBQUEUED return.

That's a little subtle, can you add a comment about that?

/*
 * If ret is EIOCBQUEUED here, the ->read_iter/->write_iter dropped the
 * reference on *file.  We don't ourselves ensure a reference to the
 * file, so we must be careful about that here and in the subfunctions.
 */

--D

>  
>   if (ret && ret != -EIOCBQUEUED)
>   goto out_put_req;
> -- 
> 2.17.0
>
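
The reference-lifetime rule discussed above can be modelled in a few lines of user-space C. This is a hedged sketch, not the aio code: struct file_model, file_put(), queue_write(), submit_one() and RET_QUEUED are all hypothetical, and "freeing" is modelled by setting a flag so the outcome can be inspected. The point is only that the submitter reads whatever it needs from the file before handing the request off, because the completion side may drop the last reference before the call returns.

```c
#include <assert.h>
#include <stdbool.h>

#define RET_QUEUED (-1)	/* arbitrary marker standing in for -EIOCBQUEUED */

struct file_model {
	int refs;
	bool released;
	int flags;
};

static void file_put(struct file_model *f)
{
	assert(f->refs > 0);
	if (--f->refs == 0)
		f->released = true;	/* a real kernel would free the object here */
}

/*
 * Models ->write_iter queuing the I/O. The completion side owns the only
 * reference and may drop it before this function returns.
 */
static int queue_write(struct file_model *f, bool completes_early)
{
	if (completes_early)
		file_put(f);
	return RET_QUEUED;
}

/* Everything that needs *f happens before the request is handed off. */
static int submit_one(struct file_model *f, bool completes_early, int *flags_seen)
{
	int ret;

	*flags_seen = f->flags;	/* read while the reference is certainly valid */
	ret = queue_write(f, completes_early);
	/* From here on, *f must not be touched if ret == RET_QUEUED. */
	return ret;
}
```

In the model the object lives on the caller's stack, so a test can look at the released flag afterwards; in the kernel the same access would be a use-after-free.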

[PATCH] mm: memory_hotplug: use put_device() if device_register fail

2018-04-26 Thread Arvind Yadav

If device_register() returns an error, always use put_device()
to give up the initialized reference and release the allocated memory.

Signed-off-by: Arvind Yadav 
---
 drivers/base/memory.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index bffe861..f5e5601 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -649,13 +649,19 @@ static const struct attribute_group 
*memory_memblk_attr_groups[] = {
 static
 int register_memory(struct memory_block *memory)
 {
+   int ret;
+
memory->dev.bus = &memory_subsys;
memory->dev.id = memory->start_section_nr / sections_per_block;
memory->dev.release = memory_block_release;
memory->dev.groups = memory_memblk_attr_groups;
memory->dev.offline = memory->state == MEM_OFFLINE;
 
-   return device_register(&memory->dev);
+   ret = device_register(&memory->dev);
+   if (ret)
+   put_device(&memory->dev);
+
+   return ret;
 }
 
 static int init_memory_block(struct memory_block **memory,
-- 
2.7.4
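
The rule this patch applies can be modelled in user space as follows. This is a simplified sketch with hypothetical names (struct dev_model, dev_register(), dev_put(), register_memory_model()); it assumes, as the device_register() kernel-doc states, that the reference taken during initialization is still held when registration fails, so the caller has to drop it for the release callback to run.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a refcounted device. */
struct dev_model {
	int refs;
	bool released;		/* set when the release callback would run */
	bool fail_register;	/* test knob: make registration fail */
};

/* Like device_initialize(): the caller starts out holding one reference. */
static void dev_initialize(struct dev_model *d)
{
	d->refs = 1;
	d->released = false;
}

/* Like put_device(): the last put triggers the release callback. */
static void dev_put(struct dev_model *d)
{
	assert(d->refs > 0);
	if (--d->refs == 0)
		d->released = true;
}

/* Like device_register(): initialize, then add; may fail after initializing. */
static int dev_register(struct dev_model *d)
{
	dev_initialize(d);
	return d->fail_register ? -1 : 0;
}

/* Mirrors the patched register_memory(): drop the reference on failure. */
static int register_memory_model(struct dev_model *d)
{
	int ret = dev_register(d);

	if (ret)
		dev_put(d);
	return ret;
}
```

Without the dev_put() call on the error path, refs would stay at 1 and the release callback would never run, which is the leak the patch addresses.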

Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-26 Thread Mikulas Patocka



On Thu, 26 Apr 2018, James Bottomley wrote:

> On Thu, 2018-04-26 at 11:05 -0400, Mikulas Patocka wrote:
> > 
> > On Thu, 26 Apr 2018, James Bottomley wrote:
> [...]
> > > Perhaps find out beforehand instead of insisting on an approach
> > without
> > > knowing.  On openSUSE the grub config is built from the files in
> > > /etc/grub.d/ so any package can add a kernel option (and various
> > > conditions around activating it) simply by adding a new file.
> > 
> > And then, different versions of the debug kernel will clash when 
> > attempting to create the same file.
> 
> Don't be silly ... there are many ways of coping with that in rpm/dpkg.

I know you can deal with it - but how many lines of code will that 
consume? Multiplied by the total number of rpm-based distros.

Mikulas

Re: [Qemu-devel] [RFC v2 1/2] virtio: add pmem driver

2018-04-26 Thread Pankaj Gupta


> > This patch adds virtio-pmem driver for KVM
> > guest.
> > 
> > Guest reads the persistent memory range
> > information from Qemu over VIRTIO and registers
> > it on nvdimm_bus. It also creates a nd_region
> > object with the persistent memory range
> > information so that existing 'nvdimm/pmem'
> > driver can reserve this into system memory map.
> > This way 'virtio-pmem' driver uses existing
> > functionality of pmem driver to register persistent
> > memory compatible for DAX capable filesystems.
> > 
> > This also provides function to perform guest flush
> > over VIRTIO from 'pmem' driver when userspace
> > performs flush on DAX memory range.
> > 
> > Signed-off-by: Pankaj Gupta 
> > ---
> >  drivers/virtio/Kconfig   |  12 
> >  drivers/virtio/Makefile  |   1 +
> >  drivers/virtio/virtio_pmem.c | 118
> >  +++
> >  include/linux/libnvdimm.h|   4 ++
> >  include/uapi/linux/virtio_ids.h  |   1 +
> >  include/uapi/linux/virtio_pmem.h |  58 +++
> >  6 files changed, 194 insertions(+)
> >  create mode 100644 drivers/virtio/virtio_pmem.c
> >  create mode 100644 include/uapi/linux/virtio_pmem.h
> > 
> > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > index 3589764..879335d 100644
> > --- a/drivers/virtio/Kconfig
> > +++ b/drivers/virtio/Kconfig
> > @@ -42,6 +42,18 @@ config VIRTIO_PCI_LEGACY
> >  
> >   If unsure, say Y.
> >  
> > +config VIRTIO_PMEM
> > +   tristate "Virtio pmem driver"
> > +   depends on VIRTIO
> > +   help
> > +This driver adds persistent memory range to nd_region and registers
> > +with nvdimm bus. NVDIMM 'pmem' driver later allocates a persistent
> > +memory range on the memory information added by this driver. In 
> > addition
> > +to this, 'virtio-pmem' driver also provides a paravirt flushing
> > interface
> > +from guest to host.
> > +
> > +If unsure, say M.
> > +
> >  config VIRTIO_BALLOON
> > tristate "Virtio balloon driver"
> > depends on VIRTIO
> > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> > index 3a2b5c5..cbe91c6 100644
> > --- a/drivers/virtio/Makefile
> > +++ b/drivers/virtio/Makefile
> > @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> >  virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> >  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> >  obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> > +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
> > diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
> > new file mode 100644
> > index 000..0906d2d
> > --- /dev/null
> > +++ b/drivers/virtio/virtio_pmem.c
> > @@ -0,0 +1,118 @@
> 
> SPDX license line?  See Documentation/process/license-rules.rst.

o.k. 

> 
> > +/* Virtio pmem Driver
> > + *
> > + * Discovers persitent memory range information
> 
> s/persitent/persistent/
> 
> > + * from host and provides a virtio based flushing
> > + * interface.
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> 
> Are all these headers really needed?  delay.h?  oom.h?

Will remove the ones that are not required. They are left over from the
previous RFC, which used *memremap* and other mm & block includes.

> 
> > +
> > +static int init_vq(struct virtio_pmem *vpmem)
> > +{
> > +   struct virtqueue *vq;
> > +
> > +   /* single vq */
> > +   vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > +   NULL, "flush_queue");
> > +
> > +   if (IS_ERR(vq))
> > +   return PTR_ERR(vq);
> > +
> > +   return 0;
> > +};
> > +
> > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > +{
> > +   int err = 0;
> > +   struct resource res;
> > +   struct virtio_pmem *vpmem;
> > +   struct nvdimm_bus *nvdimm_bus;
> > +   struct nd_region_desc ndr_desc;
> > +   int nid = dev_to_node(&vdev->dev);
> > +   static struct nvdimm_bus_descriptor nd_desc;
> > +
> > +   if (!vdev->config->get) {
> > +   dev_err(&vdev->dev, "%s failure: config disabled\n",
> > +   __func__);
> > +   return -EINVAL;
> > +   }
> > +
> > +   vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > +   GFP_KERNEL);
> > +   if (!vpmem) {
> > +   err = -ENOMEM;
> > +   goto out;
> > +   }
> > +
> > +   vpmem->vdev = vdev;
> > +   err = init_vq(vpmem);
> > +   if (err)
> > +   goto out;
> > +
> > +   virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > +   start, &vpmem->start);
> > +   virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > +   size, &vpmem->size);
> > +
> > +   res.start = vpmem->start;
> > +   res.end   = vpmem->start + vpmem->size-1;
> > +
> > +   memset(&nd_desc, 0, sizeof(nd_desc));
> > +   nd_desc.provider_name = "virtio-pmem";
> > +   nd_desc.module = THIS_MODULE;
> > +   nvdimm_bus = nvdimm_bus_reg

Re: [PATCH 3/6] arm64: untag user addresses in copy_from_user and others

2018-04-26 Thread Catalin Marinas

On Wed, Apr 18, 2018 at 08:53:12PM +0200, Andrey Konovalov wrote:
> @@ -238,12 +239,15 @@ static inline void uaccess_enable_not_uao(void)
>  /*
>   * Sanitise a uaccess pointer such that it becomes NULL if above the
>   * current addr_limit.
> + * Also untag user pointers that have the top byte tag set.
>   */
>  #define uaccess_mask_ptr(ptr) (__typeof__(ptr))__uaccess_mask_ptr(ptr)
>  static inline void __user *__uaccess_mask_ptr(const void __user *ptr)
>  {
>   void __user *safe_ptr;
>  
> + ptr = untagged_addr(ptr);
> +
>   asm volatile(
>   "   bics    xzr, %1, %2\n"
>   "   csel    %0, %1, xzr, eq\n"

First of all, passing a tagged user pointer throughout the kernel is
safe with uaccess routines but not suitable for find_vma() etc.

With this change, we may have an inconsistent behaviour on the tag
masking, depending on whether the entry code uses __uaccess_mask_ptr()
or not. We could preserve the tag with something like:

diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index e66b0fca99c2..ed15bfcbd797 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -244,10 +244,11 @@ static inline void __user *__uaccess_mask_ptr(const void 
__user *ptr)
void __user *safe_ptr;
 
asm volatile(
-   "   bics    xzr, %1, %2\n"
+   "   bics    xzr, %3, %2\n"
"   csel    %0, %1, xzr, eq\n"
: "=&r" (safe_ptr)
-   : "r" (ptr), "r" (current_thread_info()->addr_limit)
+   : "r" (ptr), "r" (current_thread_info()->addr_limit),
+ "r" (untagged_addr(ptr))
: "cc");
 
csdb();

-- 
Catalin
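
The idea in the diff above, checking the untagged address against the limit while returning the pointer with its tag intact, can be expressed in portable C. This is a sketch under stated assumptions: pointers are modelled as 64-bit integers, untag() mimics sign-extension from bit 55, the limit is a mask of the form 2^n - 1, and the function names are hypothetical. It leaves out the speculation barrier (csdb) that the real routine needs.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_MASK	UINT64_C(0xff00000000000000)
#define BIT55		(UINT64_C(1) << 55)

/* Replace the top byte with copies of bit 55 (sign-extend from bit 55). */
static uint64_t untag(uint64_t addr)
{
	return (addr & BIT55) ? (addr | TAG_MASK) : (addr & ~TAG_MASK);
}

/* Tag-sensitive check: any set bit above the limit yields NULL (0). */
static uint64_t mask_ptr_strict(uint64_t ptr, uint64_t limit)
{
	assert((limit & (limit + 1)) == 0);	/* limit must be 2^n - 1 */
	return (ptr & ~limit) == 0 ? ptr : 0;
}

/* Check the untagged address, but hand back the original tagged pointer. */
static uint64_t mask_ptr_keep_tag(uint64_t ptr, uint64_t limit)
{
	assert((limit & (limit + 1)) == 0);	/* limit must be 2^n - 1 */
	return (untag(ptr) & ~limit) == 0 ? ptr : 0;
}
```

For a 48-bit user limit, a user pointer carrying tag 0x2a is rejected by the strict check but passes the tag-preserving one unchanged, while a kernel address is rejected by both.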

Re: [PATCH] f2fs: fix to wait IO writeback in __revoke_inmem_pages()

2018-04-26 Thread Jaegeuk Kim

On 04/26, Chao Yu wrote:
> Thread A  Thread B
> - f2fs_ioc_commit_atomic_write
>  - commit_inmem_pages
>   - f2fs_submit_merged_write_cond
>   : write data
>   - write_checkpoint
>- do_checkpoint
>: commit all node within CP
>-> SPO
>   - f2fs_do_sync_file
>- file_write_and_wait_range
>: wait data writeback
> 
> In the above race condition, data/node can be flushed in reversed order when
> a checkpoint arrives before f2fs_do_sync_file; after SPOR, this results in
> atomically written data being corrupted.

Wait, what is the problem here? If Thread B's checkpoint succeeds, there is
no problem. If it fails, there is no fsync mark to recover from, so we can
simply ignore the last written data.

> 
> This patch adds f2fs_wait_on_page_writeback in __revoke_inmem_pages() to
> keep data and node of atomic file being flushed orderly.
> 
> Signed-off-by: Chao Yu 
> ---
>  fs/f2fs/file.c| 4 
>  fs/f2fs/segment.c | 3 +++
>  2 files changed, 7 insertions(+)
> 
> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index be7578774a47..a352804af244 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -217,6 +217,9 @@ static int f2fs_do_sync_file(struct file *file, loff_t 
> start, loff_t end,
>  
>   trace_f2fs_sync_file_enter(inode);
>  
> + if (atomic)
> + goto write_done;
> +
>   /* if fdatasync is triggered, let's do in-place-update */
>   if (datasync || get_dirty_pages(inode) <= SM_I(sbi)->min_fsync_blocks)
>   set_inode_flag(inode, FI_NEED_IPU);
> @@ -228,6 +231,7 @@ static int f2fs_do_sync_file(struct file *file, loff_t 
> start, loff_t end,
>   return ret;
>   }
>  
> +write_done:
>   /* if the inode is dirty, let's recover all the time */
>   if (!f2fs_skip_inode_update(inode, datasync)) {
>   f2fs_write_inode(inode, NULL);
> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
> index 584483426584..9ca3d0a43d93 100644
> --- a/fs/f2fs/segment.c
> +++ b/fs/f2fs/segment.c
> @@ -230,6 +230,8 @@ static int __revoke_inmem_pages(struct inode *inode,
>  
>   lock_page(page);
>  
> + f2fs_wait_on_page_writeback(page, DATA, true);
> +
>   if (recover) {
>   struct dnode_of_data dn;
>   struct node_info ni;
> @@ -415,6 +417,7 @@ static int __commit_inmem_pages(struct inode *inode)
>   /* drop all uncommitted pages */
>   __revoke_inmem_pages(inode, &fi->inmem_pages, true, false);
>   } else {
> + /* wait all committed IOs writeback and release them from list 
> */
>   __revoke_inmem_pages(inode, &revoke_list, false, false);
>   }
>  
> -- 
> 2.15.0.55.gc2ece9dc4de6

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

2018-04-26 Thread Paul E. McKenney

On Thu, Apr 26, 2018 at 11:13:16AM -0400, Mathieu Desnoyers wrote:
> - On Apr 25, 2018, at 7:13 PM, Joel Fernandes joe...@google.com wrote:
> 
> > Hi Mathieu,
> > 
> > On Wed, Apr 25, 2018 at 2:40 PM, Mathieu Desnoyers
> >  wrote:
> >> - On Apr 25, 2018, at 5:27 PM, Joel Fernandes joe...@google.com wrote:
> >>
> >>> On Tue, Apr 24, 2018 at 9:20 PM, Paul E. McKenney
> >>>  wrote:
> >>> [..]
> > >
> > > Sounds good, thanks.
> > >
> > > Also I found the reason for my boot issue. It was because the
> > > init_srcu_struct in the prototype was being done in an initcall.
> > > Instead if I do it in start_kernel before the tracepoint is used, it
> > > fixes it (although I don't know if this is dangerous to do like this
> > > but I can get it to boot atleast.. Let me know if this isn't the
> > > right way to do it, or if something else could go wrong)
> > >
> > > diff --git a/init/main.c b/init/main.c
> > > index 34823072ef9e..ecc88319c6da 100644
> > > --- a/init/main.c
> > > +++ b/init/main.c
> > > @@ -631,6 +631,7 @@ asmlinkage __visible void __init 
> > > start_kernel(void)
> > > WARN(!irqs_disabled(), "Interrupts were enabled early\n");
> > > early_boot_irqs_disabled = false;
> > >
> > > +   init_srcu_struct(&tracepoint_srcu);
> > > lockdep_init_early();
> > >
> > > local_irq_enable();
> > > --
> > >
> > > I benchmarked it and the performance also looks quite good compared
> > > to the rcu tracepoint version.
> > >
> > > If you, Paul and other think doing the init_srcu_struct like this
> > > should be Ok, then I can try to work more on your srcu prototype and
> > > roll into my series and post them in the next RFC series (or let me
> > > know if you wanted to work your srcu stuff in a separate series..).
> >
> > That is definitely not what I was expecting, but let's see if it works
> > anyway...  ;-)
> >
> > But first, I was instead expecting something like this:
> >
> > DEFINE_SRCU(tracepoint_srcu);
> >
> > With this approach, some of the initialization happens at compile time
> > and the rest happens at the first call_srcu().
> >
> > This will work -only- if the first call_srcu() doesn't happen until 
> > after
> > workqueue_init_early() has been invoked.  Which I believe must have been
> > the case in your testing, because otherwise it looks like __call_srcu()
> > would have complained bitterly.
> >
> > On the other hand, if you need to invoke call_srcu() before the call
> > to workqueue_init_early(), then you need the patch that I am beating
> > into shape.  Plus you would need to use DEFINE_SRCU() and to avoid
> > invoking init_srcu_struct().
> 
>  And here is the patch.  I do not intend to send it upstream unless it
>  actually proves necessary, and it appears that current SRCU does what
>  you need.
> 
>  You would only need this patch if you wanted to invoke call_srcu()
>  before workqueue_init_early() was called, which does not seem likely.
> >>>
> >>> Cool. So I was chatting with Paul and just to update everyone as well,
> >>> I tried the DEFINE_SRCU instead of the late init_srcu_struct call and
> >>> can make it past boot too (thanks Paul!). Also I don't see a reason we
> >>> need the RCU callback to execute early and its fine if it runs later.
> >>>
> >>> Also, I was thinking of introducing a separate trace_*event*_srcu API
> >>> as a replacement to the _rcuidle API. Then I can make use of it for my
> >>> tracepoints, and then later can use it for the other tracepoints
> >>> needing _rcuidle. After that we can finally get rid of the _rcuidle
> >>> API if there are no other users of it. This is just a rough plan, but
> >>> let me know if there's any issue with this plan that you can think
> >>> off.
> >>> IMO, I believe its simpler if the caller worries about whether it can
> >>> tolerate if tracepoint probes can block or not, than making it a
> >>> property of the tracepoint. That would also simplify the patch to
> >>> introduce srcu and keep the tracepoint creation API simple and less
> >>> confusing, but let me know if I'm missing something about this.
> >>
> >> One problem with your approach is that you can have multiple callers
> >> for the same tracepoint name, where some could be non-preemptible and
> >> others blocking. Also, there is then no clear way for the callback
> > 
> > Shouldn't it be responsibility of the caller to make sure it calls
> > correct API? So if you're wanting to allow probes to block, then you'd
> > call trace*blocking, if not then you don't. So the caller side can
> > just always do the right thing. That's a caller side issue.
> 
> The issue there is that tracepoint.c has APIs both for instrumentation
> and for registration of probe providers (callbacks). I want tracepoint.c
> to provide gua

Re: [PATCH v1] kthread/smpboot: Serialize kthread parking against wakeup

2018-04-26 Thread Kohli, Gaurav


On 4/26/2018 2:27 PM, Peter Zijlstra wrote:


On Thu, Apr 26, 2018 at 10:41:31AM +0200, Peter Zijlstra wrote:

diff --git a/kernel/kthread.c b/kernel/kthread.c
index cd50e99202b0..4b6503c6a029 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -177,12 +177,13 @@ void *kthread_probe_data(struct task_struct *task)
  
  static void __kthread_parkme(struct kthread *self)

  {
-   __set_current_state(TASK_PARKED);
-   while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
+   for (;;) {
+   __set_task_state(TASK_PARKED);

set_current_state(TASK_PARKED);

of course..


Hi Peter,

Maybe I am missing something, but that race can still occur, as we don't
treat TASK_PARKED as a special state.

Controller                          Hotplug

                                    Loop
                                    Task_Interruptible
Set SHOULD_PARK
wakeup -> Proceeds
                                    Set Running
                                    kthread_parkme
                                    SET TASK_PARKED
                                    schedule
Set TASK_RUNNING

Can you please correct me if I misunderstood this?




+   if (!test_bit(KTHREAD_SHOULD_PARK, &self->flags))
+   break;
if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
complete(&self->parked);
schedule();
-   __set_current_state(TASK_PARKED);
}
clear_bit(KTHREAD_IS_PARKED, &self->flags);
__set_current_state(TASK_RUNNING);



--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc. 
is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

Re: [PATCH v3 05/14] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath

2018-04-26 Thread Peter Zijlstra

On Thu, Apr 26, 2018 at 11:34:19AM +0100, Will Deacon wrote:
> @@ -290,58 +312,50 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, 
> u32 val)
>   }
>  
>   /*
> +  * If we observe any contention; queue.
> +  */
> + if (val & ~_Q_LOCKED_MASK)
> + goto queue;
> +
> + /*
>* trylock || pending
>*
>* 0,0,0 -> 0,0,1 ; trylock
>* 0,0,1 -> 0,1,1 ; pending
>*/
> + val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
> + if (!(val & ~_Q_LOCKED_MASK)) {
>   /*
> +  * we're pending, wait for the owner to go away.
> +  *
> +  * *,1,1 -> *,1,0

Tail must be 0 here, right?

> +  *
> +  * this wait loop must be a load-acquire such that we match the
> +  * store-release that clears the locked bit and create lock
> +  * sequentiality; this is because not all
> +  * clear_pending_set_locked() implementations imply full
> +  * barriers.
>*/
> + if (val & _Q_LOCKED_MASK) {
> + smp_cond_load_acquire(&lock->val.counter,
> +   !(VAL & _Q_LOCKED_MASK));
> + }
>  
>   /*
> +  * take ownership and clear the pending bit.
> +  *
> +  * *,1,0 -> *,0,1
>*/

Idem.

> + clear_pending_set_locked(lock);
>   return;
> + }
>  
>   /*
> +  * If pending was clear but there are waiters in the queue, then
> +  * we need to undo our setting of pending before we queue ourselves.
>*/
> + if (!(val & _Q_PENDING_MASK))
> + clear_pending(lock);

This is the branch for when we have !0 tail.

>  
>   /*
>* End of pending bit optimistic spinning and beginning of MCS

> @@ -445,15 +459,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, 
> u32 val)
>* claim the lock:
>*
>* n,0,0 -> 0,0,1 : lock, uncontended
> +  * *,*,0 -> *,*,1 : lock, contended
>*
> +  * If the queue head is the only one in the queue (lock value == tail)
> +  * and nobody is pending, clear the tail code and grab the lock.
> +  * Otherwise, we only need to grab the lock.
>*/
>   for (;;) {
>   /* In the PV case we might already have _Q_LOCKED_VAL set */
> + if ((val & _Q_TAIL_MASK) != tail || (val & _Q_PENDING_MASK)) {
>   set_locked(lock);
>   break;
>   }

This one hunk is terrible on the brain. I'm fairly sure I get it, but I
feel that comment can use help. Or at least, I need help reading it.

I'll try and cook up something when my brain starts working again.
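
For readers following the first hunk above, here is a minimal userspace sketch of the "pending or queue" decision, written with C11 atomics rather than the kernel's atomic_t API. The bit layout (locked in bits 0-7, pending in bits 8-15, tail in bits 16-31) is an assumption matching the NR_CPUS < 16K configuration, and the names pending_or_queue(), CHOICE_QUEUE and CHOICE_PENDING are hypothetical. It only models what the fetch-or observed at that instant; it does not model other CPUs queueing afterwards, nor the wait for the owner to release the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed lock word layout: bits 0-7 locked, 8-15 pending, 16-31 tail. */
#define Q_LOCKED_MASK   0x000000ffu
#define Q_PENDING_VAL   0x00000100u
#define Q_PENDING_MASK  0x0000ff00u
#define Q_TAIL_MASK     0xffff0000u

enum slowpath_choice { CHOICE_QUEUE, CHOICE_PENDING };

/*
 * Decide whether the caller becomes the pending waiter or must join the
 * MCS queue. 'val' is the snapshot of the lock word the caller already has.
 */
static enum slowpath_choice pending_or_queue(_Atomic uint32_t *lock, uint32_t val)
{
	uint32_t old;

	/* Any pending or tail bits in the snapshot mean contention: queue. */
	if (val & ~Q_LOCKED_MASK)
		return CHOICE_QUEUE;

	/* Unconditionally set pending and look at what was there before. */
	old = atomic_fetch_or_explicit(lock, Q_PENDING_VAL, memory_order_acquire);
	if (!(old & ~Q_LOCKED_MASK)) {
		/* Nothing but (possibly) the locked byte was set: tail was 0. */
		assert((old & Q_TAIL_MASK) == 0);
		return CHOICE_PENDING;
	}

	/*
	 * Someone beat us to it. If pending was clear before our fetch-or,
	 * the bit now set is ours and must be undone before queueing.
	 */
	if (!(old & Q_PENDING_MASK))
		atomic_fetch_and_explicit(lock, ~Q_PENDING_VAL, memory_order_relaxed);

	return CHOICE_QUEUE;
}
```

In this sketch the pending branch is only reached when the value returned by the fetch-or had no tail bits set, and the "undo" branch is only reached when the tail was non-zero while pending was clear.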

Re: [PATCH v3 00/14] kernel/locking: qspinlock improvements

2018-04-26 Thread Peter Zijlstra



Acked-by: Peter Zijlstra (Intel) 

Ingo, please queue.

Re: [PATCH v2] f2fs: avoid stucking GC due to atomic write

2018-04-26 Thread Jaegeuk Kim

On 04/24, Chao Yu wrote:
> f2fs doesn't allow abuse of the atomic write interface, so besides
> limiting the total memory used by in-mem pages, we need to limit
> atomic-write usage as well when the filesystem is seriously fragmented;
> otherwise we may run into an infinite loop during foreground GC because
> target blocks in the victim segment belong to a file that has been
> opened for atomic write for a long time.

How about using fi->i_gc_failure, as pin_file does?

> 
> Now we detect failures caused by atomic writes in foreground GC; if
> the count exceeds a threshold, we drop all atomic-written data in the
> cache. I expect this to keep the system running safely and to prevent
> DoS attacks.
> 
> In addition, this patch shows GC skip information in debugfs; for now
> it only shows the count of skips caused by atomic writes.
> 
> Signed-off-by: Chao Yu 
> ---
> v2:
> - add to show skip info in debugfs.
>  fs/f2fs/debug.c   |  8 
>  fs/f2fs/f2fs.h|  2 ++
>  fs/f2fs/file.c|  5 +
>  fs/f2fs/gc.c  | 29 +
>  fs/f2fs/gc.h  |  3 +++
>  fs/f2fs/segment.c |  1 +
>  fs/f2fs/segment.h |  2 ++
>  7 files changed, 46 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/f2fs/debug.c b/fs/f2fs/debug.c
> index 0fbd674c66fb..607b258a9b61 100644
> --- a/fs/f2fs/debug.c
> +++ b/fs/f2fs/debug.c
> @@ -104,6 +104,10 @@ static void update_general_status(struct f2fs_sb_info 
> *sbi)
>   si->avail_nids = NM_I(sbi)->available_nids;
>   si->alloc_nids = NM_I(sbi)->nid_cnt[PREALLOC_NID];
>   si->bg_gc = sbi->bg_gc;
> + si->skipped_atomic_files[BG_GC] =
> + sbi->gc_thread->skipped_atomic_files[BG_GC];
> + si->skipped_atomic_files[FG_GC] =
> + sbi->gc_thread->skipped_atomic_files[FG_GC];
>   si->util_free = (int)(free_user_blocks(sbi) >> sbi->log_blocks_per_seg)
>   * 100 / (int)(sbi->user_block_count >> sbi->log_blocks_per_seg)
>   / 2;
> @@ -341,6 +345,10 @@ static int stat_show(struct seq_file *s, void *v)
>   si->bg_data_blks);
>   seq_printf(s, "  - node blocks : %d (%d)\n", si->node_blks,
>   si->bg_node_blks);
> + seq_printf(s, "Skipped : atomic write %llu (%llu)\n",
> + si->skipped_atomic_files[BG_GC] +
> + si->skipped_atomic_files[FG_GC],
> + si->skipped_atomic_files[BG_GC]);
>   seq_puts(s, "\nExtent Cache:\n");
>   seq_printf(s, "  - Hit Count: L1-1:%llu L1-2:%llu L2:%llu\n",
>   si->hit_largest, si->hit_cached,
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index 75d3b4875429..c2b92cb377c6 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -2254,6 +2254,7 @@ enum {
>   FI_EXTRA_ATTR,  /* indicate file has extra attribute */
>   FI_PROJ_INHERIT,/* indicate file inherits projectid */
>   FI_PIN_FILE,/* indicate file should not be gced */
> + FI_ATOMIC_REVOKE_REQUEST,/* indicate atomic committed data has been 
> dropped */
>  };
>  
>  static inline void __mark_inode_dirty_flag(struct inode *inode,
> @@ -3010,6 +3011,7 @@ struct f2fs_stat_info {
>   int bg_node_segs, bg_data_segs;
>   int tot_blks, data_blks, node_blks;
>   int bg_data_blks, bg_node_blks;
> + unsigned long long skipped_atomic_files[2];
>   int curseg[NR_CURSEG_TYPE];
>   int cursec[NR_CURSEG_TYPE];
>   int curzone[NR_CURSEG_TYPE];
> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index a352804af244..0cfa65c21d3f 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -1702,6 +1702,7 @@ static int f2fs_ioc_start_atomic_write(struct file 
> *filp)
>  skip_flush:
>   set_inode_flag(inode, FI_HOT_DATA);
>   set_inode_flag(inode, FI_ATOMIC_FILE);
> + clear_inode_flag(inode, FI_ATOMIC_REVOKE_REQUEST);
>   f2fs_update_time(F2FS_I_SB(inode), REQ_TIME);
>  
>   F2FS_I(inode)->inmem_task = current;
> @@ -1750,6 +1751,10 @@ static int f2fs_ioc_commit_atomic_write(struct file 
> *filp)
>   ret = f2fs_do_sync_file(filp, 0, LLONG_MAX, 1, false);
>   }
>  err_out:
> + if (is_inode_flag_set(inode, FI_ATOMIC_REVOKE_REQUEST)) {
> + clear_inode_flag(inode, FI_ATOMIC_REVOKE_REQUEST);
> + ret = -EINVAL;
> + }
>   up_write(&F2FS_I(inode)->dio_rwsem[WRITE]);
>   inode_unlock(inode);
>   mnt_drop_write_file(filp);
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index 2febb84b2fd6..ec6cb7b417a1 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -135,6 +135,9 @@ int init_gc_context(struct f2fs_sb_info *sbi)
>  
>   init_rwsem(&gc_th->gc_rwsem);
>  
> + gc_th->skipped_atomic_files[BG_GC] = 0;
> + gc_th->skipped_atomic_files[FG_GC] = 0;
> +
>   sbi->gc_thread = gc_th;
>  
>   return 0;
> @@ -629,7 +632,7 @@ static bool is_alive(struct f2fs_sb_info *sbi, struct 
> f2fs_summary *sum

Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-26 Thread Mikulas Patocka

On Thu, 26 Apr 2018, James Bottomley wrote:

> So you're shifting your argument from "I have to do it as a Kconfig
> option because the distros require it" to "distributions will build
> separate kernel packages for this, but won't do enabling in a non
> kernel package"?  To be honest, I think the argument is nuts but I
> don't really care.  From my point of view it's usually me explaining to
> people how to debug stuff and "you have to build your own kernel with
> this Kconfig option" compared to "add this to the kernel command line
> and reboot" is much more effort for the debugger.
> 
> James

If you have to explain to the user that he needs to turn it on, it is 
already wrong.

In order to find the kvmalloc abuses, it should be tested by as many users 
as possible. And it could be tested by as many users as possible, if it 
can be enabled in a VISIBLE place (i.e. menuconfig) - or (in my opinion 
even better) it should be bound to a CONFIG_ option that is already 
enabled for debug kernels - then you won't have to explain anything to 
the user at all.

Hardly anyone - except for people who read this thread - will know about 
the new commandline parameters or debugfs files.

I'm not arguing that the commandline parameter or debugfs files are wrong. 
They are OK for advanced users to override the default settings. But they 
are useless for common users because common users won't know about them.

Mikulas

Re: [PATCH v3 14/14] MAINTAINERS: Add myself as a co-maintainer for LOCKING PRIMITIVES

2018-04-26 Thread Peter Zijlstra

On Thu, Apr 26, 2018 at 11:34:28AM +0100, Will Deacon wrote:
> I've been heavily involved with concurrency and memory ordering stuff
> (see ATOMIC INFRASTRUCTURE and LINUX KERNEL MEMORY CONSISTENCY MODEL)
> and with arm64 now using qrwlock with a view to using qspinlock in the
> near future, I'm going to continue being involved with the core locking
> primitives. Reflect this by adding myself as a co-maintainer alongside
> Ingo and Peter.

I much value your help on this, thanks!

Re: [PATCH 4/4] exit: Lockless iteration over task list in mm_update_next_owner()

2018-04-26 Thread Kirill Tkhai

On 26.04.2018 18:20, Peter Zijlstra wrote:
> On Thu, Apr 26, 2018 at 04:52:39PM +0300, Kirill Tkhai wrote:
>> In the patch I used the logic, that the below code:
>>
>>  x = A;
>>  spin_lock();
>>  spin_unlock();
>>  spin_lock();
>>  spin_unlock();
>>  y = B;
>>
>> cannot reorder more than:
>>
>>  spin_lock();
>>  x = A;  <- this can't become visible later than spin_unlock()
>>  spin_unlock();
>>  spin_lock();
>>  y = B;  <- this can't become visible earlier than spin_lock()
>>  spin_unlock();
>>
>> Is there a problem?
> 
> The two stores will be ordered, but only at the strength of an
> smp_wmb(). The above construct does not imply smp_mb(). The difference
> is observable on real hardware (Power).

Ah, thanks.

But hopefully, smp_rmb() should be enough here.

Re: [Intel-gfx] 4.17-rc2: Could not determine valid watermarks for inherited state

2018-04-26 Thread Dave Jones

On Thu, Apr 26, 2018 at 06:25:13PM +0300, Ville Syrjälä wrote:
 > On Thu, Apr 26, 2018 at 06:16:41PM +0300, Ville Syrjälä wrote:
 > > On Thu, Apr 26, 2018 at 05:56:14PM +0300, Ville Syrjälä wrote:
 > > > On Thu, Apr 26, 2018 at 10:27:19AM -0400, Dave Jones wrote:
 > > > > [1.176131] [drm:i9xx_get_initial_plane_config] pipe A/primary A 
 > > > > with fb: size=800x600@32, offset=0, pitch 3200, size 0x1d4c00
 > > > > [1.176161] [drm:i915_gem_object_create_stolen_for_preallocated] 
 > > > > creating preallocated stolen object: stolen_offset=0x, 
 > > > > gtt_offset=0x, size=0x001d5000
 > > > > [1.176312] [drm:intel_alloc_initial_plane_obj.isra.127] initial 
 > > > > plane fb obj (ptrval)
 > > > > [1.176351] [drm:intel_modeset_init] pipe A active planes 0x1
 > > > > [1.176456] [drm:drm_atomic_helper_check_plane_state] Plane must 
 > > > > cover entire CRTC
 > > > > [1.176481] [drm:drm_rect_debug_print] dst: 800x600+0+0
 > > > > [1.176494] [drm:drm_rect_debug_print] clip: 1366x768+0+0
 > > > 
 > > > OK, so that's the problem right there. The fb we took over from the
 > > > BIOS was 800x600, but now we're trying to set up a 1366x768 mode.
 > > > 
 > > > We seem to be missing checks to make sure the initial fb is actually
 > > > big enough for the mode we're currently using :(
 > > 
 > Hmm. Or maybe we should just stick to the pipe src size.
 > 
 > I'm curious whether this fixes the problem?
 > 
 > diff --git a/drivers/gpu/drm/i915/intel_display.c 
 > b/drivers/gpu/drm/i915/intel_display.c
 > index 0f8c7389e87d..30824beedef7 100644
 > --- a/drivers/gpu/drm/i915/intel_display.c
 > +++ b/drivers/gpu/drm/i915/intel_display.c
 > @@ -15284,6 +15284,8 @@ static void intel_modeset_readout_hw_state(struct 
 > drm_device *dev)
 >  memset(&crtc->base.mode, 0, sizeof(crtc->base.mode));
 >  if (crtc_state->base.active) {
 >  intel_mode_from_pipe_config(&crtc->base.mode, 
 > crtc_state);
 > +crtc->base.mode.hdisplay = crtc_state->pipe_src_w;
 > +crtc->base.mode.vdisplay = crtc_state->pipe_src_h;
 >  
 > intel_mode_from_pipe_config(&crtc_state->base.adjusted_mode, crtc_state);
 >  WARN_ON(drm_atomic_set_mode_for_crtc(crtc->base.state, 
 > &crtc->base.mode));
 > 

It does!

Feel free to throw a Tested-by: Dave Jones  in there.

Dave
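
The failing check in the log can be illustrated with a small standalone sketch. It assumes that, for a plane that cannot be positioned freely, the helper requires the destination rectangle to match the clip rectangle exactly, and that the clip is derived from the CRTC mode's hdisplay/vdisplay (which is what the log lines and the patch above suggest). struct rect, rect_from_size() and plane_matches_crtc() are hypothetical names for illustration, not the DRM API.

```c
#include <stdbool.h>

/* Hypothetical rectangle type: x2/y2 are exclusive, as in "WxH+X+Y". */
struct rect {
	int x1, y1, x2, y2;
};

/* Build a rectangle from a "WxH+X+Y" style description. */
static struct rect rect_from_size(int w, int h, int x, int y)
{
	struct rect r = { x, y, x + w, y + h };
	return r;
}

static bool rect_equals(const struct rect *a, const struct rect *b)
{
	return a->x1 == b->x1 && a->y1 == b->y1 &&
	       a->x2 == b->x2 && a->y2 == b->y2;
}

/*
 * Assumed rule for a primary plane that cannot be positioned: the
 * destination must be exactly the CRTC's visible area (the clip).
 */
static bool plane_matches_crtc(const struct rect *dst, const struct rect *clip)
{
	return rect_equals(dst, clip);
}
```

With dst 800x600+0+0 and clip 1366x768+0+0, as in the log, the check fails; once the mode's hdisplay/vdisplay are taken from the 800x600 pipe source size, the two rectangles match.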

[PATCH] mm: sections are not offlined during memory hotremove

2018-04-26 Thread Pavel Tatashin

Memory hotplug and hotremove operate with per-block granularity. If the
machine has a large amount of memory (more than 64G), a memory block
can span multiple sections. By mistake, during hotremove we set
only the first section to offline state.

The bug was discovered because a kernel selftest started to fail:
https://lkml.kernel.org/r/20180423011247.GK5563@yexl-desktop

The failure appeared after commit "mm/memory_hotplug: optimize probe
routine", but the bug is older than that commit. In that optimization we
also added a check for sections to be in a proper state during hotplug
operation.

Fixes: 2d070eab2e82 ("mm: consider zone which is not fully populated to have holes")

Signed-off-by: Pavel Tatashin 
---
 mm/sparse.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/sparse.c b/mm/sparse.c
index 62eef264a7bd..73dc2fcc0eab 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -629,7 +629,7 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
unsigned long pfn;
 
for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
-   unsigned long section_nr = pfn_to_section_nr(start_pfn);
+   unsigned long section_nr = pfn_to_section_nr(pfn);
struct mem_section *ms;
 
/*
-- 
1.8.3.1
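
The effect of the one-line fix can be reproduced in a small userspace model. This is only a sketch: the section bitmap, NR_SECTIONS, and the buggy flag are hypothetical, and PFN_SECTION_SHIFT is set to 15 on the assumption of 128 MiB sections with 4 KiB pages.

```c
#include <assert.h>
#include <stdbool.h>

#define PFN_SECTION_SHIFT 15UL                  /* assumed: 32768 pages per section */
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)
#define NR_SECTIONS       16UL                  /* hypothetical small machine */

static bool section_online[NR_SECTIONS];

static unsigned long pfn_to_section_nr(unsigned long pfn)
{
	return pfn >> PFN_SECTION_SHIFT;
}

static void online_all(void)
{
	for (unsigned long i = 0; i < NR_SECTIONS; i++)
		section_online[i] = true;
}

/*
 * Model of the loop in offline_mem_sections(). With buggy == true the
 * section number is computed from start_pfn on every iteration, as in
 * the code before the patch; otherwise it is computed from pfn.
 */
static void offline_sections(unsigned long start_pfn, unsigned long end_pfn,
			     bool buggy)
{
	for (unsigned long pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
		unsigned long nr = pfn_to_section_nr(buggy ? start_pfn : pfn);

		assert(nr < NR_SECTIONS);
		section_online[nr] = false;
	}
}

static unsigned long count_online(void)
{
	unsigned long n = 0;

	for (unsigned long i = 0; i < NR_SECTIONS; i++)
		n += section_online[i];
	return n;
}
```

Offlining a block that spans four sections leaves three of them marked online in the buggy variant, while the fixed variant clears all four.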

Re: [PATCH 2/3] drm/scheduler: Don't call wait_event_killable for signaled process.

2018-04-26 Thread Eric W. Biederman

Andrey Grodzovsky  writes:

> On 04/26/2018 08:34 AM, Andrey Grodzovsky wrote:
>>
>>
>> On 04/25/2018 08:01 PM, Eric W. Biederman wrote:
>>> Andrey Grodzovsky  writes:
>>>
 On 04/25/2018 01:17 PM, Oleg Nesterov wrote:
> On 04/25, Andrey Grodzovsky wrote:
>> here (drm_sched_entity_fini) is also a bad idea, but we still want to be
>> able to exit immediately
>> and not wait for GPU jobs completion when the reason for reaching this
>> code
>> is because of KILL
>> signal to the user process who opened the device file.
> Can you hook f_op->flush method?
 But this one is called for each task releasing a reference to the file,
 so
 not sure I see how this solves the problem.
>>> The big question is why do you need to wait during the final closing a
>>> file?
>>>
>>> The wait can be terminated so the wait does not appear to be simply a
>>> matter of correctness.
>>
>> Well, as I understand it, it just means that you don't want to abruptly
>> terminate GPU work in progress without a good
>> reason (such as KILL signal). When we exit we are going to release various
>> resources GPU is still using so we either
>> wait for it to complete or terminate the remaining jobs.

At the point of do_exit you might as well be a KILL signal however you
got there.

> Looked more into code, some correction, drm_sched_entity_fini means the SW job
> queue itself is about to die, so we must
> either wait for completion or terminate any outstanding jobs that are still in
> the SW queue. Anything which already in flight in HW
> will still complete.

It sounds like we don't care if we block the process that had the file
descriptor open; this is just bookkeeping.  That allows having a piece
of code that cleans up resources when the GPU is done with the queue but
does not make userspace wait.  (option 1)

For it to make sense that we let the process run there has to be
something that cares about the results being completed.  If all of the
file descriptors are closed and the process is killed I can't see who
will care that the software queue will continue to be processed.  So it
may be reasonable to simply kill the queue (option 2).

If userspace really needs the wait it is probably better done in
f_op->flush so that every close of the file descriptor blocks
until the queue is flushed (option 3).

Do you know if userspace cares about the gpu operations completing?

My skim of the code suggests that nothing actually cares about those
operations, but I really don't know the gpu well.

Eric

Re: [PATCH] f2fs: fix to wait IO writeback in __revoke_inmem_pages()

2018-04-26 Thread Chao Yu

On 2018/4/26 23:48, Jaegeuk Kim wrote:
> On 04/26, Chao Yu wrote:
>> Thread A                             Thread B
>> - f2fs_ioc_commit_atomic_write
>>  - commit_inmem_pages
>>   - f2fs_submit_merged_write_cond
>>   : write data
>>                                      - write_checkpoint
>>                                       - do_checkpoint
>>                                       : commit all node within CP
>>                                       -> SPO
>>  - f2fs_do_sync_file
>>   - file_write_and_wait_range
>>   : wait data writeback
>>
>> In the above race condition, data and node can be flushed in reversed order
>> when a checkpoint comes before f2fs_do_sync_file; after SPOR, this results
>> in the atomic-written data being corrupted.
> 
> Wait, what is the problem here? Thread B could succeed checkpoint, there is
> no problem. If it fails, there is no fsync mark where we can recover it, so

The node is flushed by the checkpoint before the data, i.e. in reversed order;
that's the problem.

Thanks,

> we can just ignore the last written data as nothing.
> 
>>
>> This patch adds f2fs_wait_on_page_writeback in __revoke_inmem_pages() to
>> ensure that data and node of the atomic file are flushed in order.
>>
>> Signed-off-by: Chao Yu 
>> ---
>>  fs/f2fs/file.c| 4 
>>  fs/f2fs/segment.c | 3 +++
>>  2 files changed, 7 insertions(+)
>>
>> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
>> index be7578774a47..a352804af244 100644
>> --- a/fs/f2fs/file.c
>> +++ b/fs/f2fs/file.c
>> @@ -217,6 +217,9 @@ static int f2fs_do_sync_file(struct file *file, loff_t 
>> start, loff_t end,
>>  
>>  trace_f2fs_sync_file_enter(inode);
>>  
>> +if (atomic)
>> +goto write_done;
>> +
>>  /* if fdatasync is triggered, let's do in-place-update */
>>  if (datasync || get_dirty_pages(inode) <= SM_I(sbi)->min_fsync_blocks)
>>  set_inode_flag(inode, FI_NEED_IPU);
>> @@ -228,6 +231,7 @@ static int f2fs_do_sync_file(struct file *file, loff_t 
>> start, loff_t end,
>>  return ret;
>>  }
>>  
>> +write_done:
>>  /* if the inode is dirty, let's recover all the time */
>>  if (!f2fs_skip_inode_update(inode, datasync)) {
>>  f2fs_write_inode(inode, NULL);
>> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
>> index 584483426584..9ca3d0a43d93 100644
>> --- a/fs/f2fs/segment.c
>> +++ b/fs/f2fs/segment.c
>> @@ -230,6 +230,8 @@ static int __revoke_inmem_pages(struct inode *inode,
>>  
>>  lock_page(page);
>>  
>> +f2fs_wait_on_page_writeback(page, DATA, true);
>> +
>>  if (recover) {
>>  struct dnode_of_data dn;
>>  struct node_info ni;
>> @@ -415,6 +417,7 @@ static int __commit_inmem_pages(struct inode *inode)
>>  /* drop all uncommitted pages */
>>  __revoke_inmem_pages(inode, &fi->inmem_pages, true, false);
>>  } else {
>> +/* wait all committed IOs writeback and release them from list 
>> */
>>  __revoke_inmem_pages(inode, &revoke_list, false, false);
>>  }
>>  
>> -- 
>> 2.15.0.55.gc2ece9dc4de6

Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-26 Thread Michael S. Tsirkin

On Thu, Apr 26, 2018 at 11:44:21AM -0400, Mikulas Patocka wrote:
> 
> 
> On Thu, 26 Apr 2018, James Bottomley wrote:
> 
> > On Thu, 2018-04-26 at 11:05 -0400, Mikulas Patocka wrote:
> > > 
> > > On Thu, 26 Apr 2018, James Bottomley wrote:
> > [...]
> > > > Perhaps find out beforehand instead of insisting on an approach
> > > > without knowing.  On openSUSE the grub config is built from the files in
> > > > /etc/grub.d/ so any package can add a kernel option (and various
> > > > conditions around activating it) simply by adding a new file.
> > > 
> > > And then, different versions of the debug kernel will clash when 
> > > attempting to create the same file.
> > 
> > Don't be silly ... there are many ways of coping with that in rpm/dpkg.
> 
> I know you can deal with it - but how many lines of code will that 
> consume? Multiplied by the total number of rpm-based distros.
> 
> Mikulas

I don't think debug kernels should inject faults by default.

IIUC debug kernels mainly exist so people who experience e.g. memory
corruption can try and debug the failure. In this case, CONFIG_DEBUG_SG
will *already* catch a failure early. Nothing special needs to be done.

There is a much smaller class of people like QA who go actively looking
for trouble. That's the kind of thing fault injection is good for, and
IMO you do not want your QA team to test a separate kernel - otherwise
you are never quite sure whatever was tested will work in the field.
So a config option won't help them either.

How do you make sure QA tests a specific corner case? Add it to
the test plan :)

I don't speak for Red Hat, etc.

-- 
MST

Re: [BUG] igb: reconnecting of cable not always detected

2018-04-26 Thread Alexander Duyck

On Thu, Apr 26, 2018 at 2:08 AM, Holger Schurig  wrote:
> Hi,
>
>> Thanks. I'm suspecting we may need to instrument igb_rd32 at this
>> point. In order to trigger what you are seeing I am assuming the
>> device has been detached due to a read failure of some sort.
>
> Okay, I added a printk to igb_rd32. And because no one calls this
> function directly (all access goes via the rd32/rd32_array macro) I also
> added the output of the calling function. This should help greatly in
> identifying the read from the hardware to the consumer.
>
> Finally, I noticed that igb_update_stats() produced a lot of churn that
> is most likely unrelated. So I added a helper variable to make output from
> this function go away.
>
> I installed this modified driver, rebooted, and removed / inserted the
> LAN cable until the error was present.
>
> As before, "ethtool" and "mii-tool" now said that the device is not
> there, while "ip link" showed the device as present.
>
>
> The full output of "journalctl -fk | grep igb" is 600 kB. So I put the
> whole file on Google Drive:
>
> https://drive.google.com/open?id=1p9cCT2d_EHnSHh29oS3AepUgFTKGFSeA
>
>
>
> I looked at the output to see patterns, e.g with
>
> grep -n igb_get_cfg_done_i210 igb.error.txt
> grep -n __igb_shutdown igb.error.txt
> ...
>
> (and almost all other function names). I hoped to see patterns. But to
> my untrained eye, nothing looked out of order.

Thanks for the data. It is actually useful. There are a few things
that I see that seem to point to an obvious issue.

The first are the following 2 lines from your dump:
Apr 26 10:42:49 kernel: igb :02:00.0 eth0: igb: eth0 NIC Link is
Up 1000 Mbps Half Duplex, Flow Control: RX
Apr 26 10:42:49 kernel: igb :02:00.0: EEE Disabled: unsupported at
half duplex. Re-enable using ethtool when at full duplex.

In case you aren't aware 1000Mbps Half Duplex is not a valid combination.

The other bit that catches my attention is:
Apr 26 10:42:51 kernel: igb :02:00.0: exceed max 2 second

Which appears to be a timeout error that is triggered in response to
the above error which I believe is the fact that it didn't actually
link at 1000Mbps.

As I get time I will try to look into this further. I will have to go
through the MDIC reads to figure out if there is something in there
that is providing us with bad information from the PHY or if we are
misinterpreting something.

Thanks.

- Alex
