from:"Sergey Senozhatsky"

Re: [PATCH v4 7/7] zram: writeback throttle

2018-12-02 Thread Sergey Senozhatsky

On (12/03/18 14:50), Sergey Senozhatsky wrote:
> On (12/03/18 11:40), Minchan Kim wrote:
> [..]
> > +   down_read(>init_lock);
> > +   atomic64_set(>stats.bd_wb_limit, val);
> > +   if (val == 0)
> > +   zram->stop_writeback = false;
> > +   up_read(>init_lock);
> 
> [..]
> 
> > +   if (zram->stop_writeback) {
> > +   ret = -EIO;
> > +   break;
> > +   }
> > +
> > if (!blk_idx) {
> > blk_idx = alloc_block_bdev(zram);
> > if (!blk_idx) {
> > @@ -694,6 +732,11 @@ static ssize_t writeback_store(struct device *dev,
> > zram_set_element(zram, index, blk_idx);
> > blk_idx = 0;
> > atomic64_inc(>stats.pages_stored);
> > +   if (atomic64_add_unless(>stats.bd_wb_limit,
> > +   -1 << (PAGE_SHIFT - 12), 0)) {
> > +   if (atomic64_read(>stats.bd_wb_limit) == 0)
> > +   zram->stop_writeback = true;
> > +   }
> 
> Do we need ->stop_writeback? It should be identical to
> 
>   atomic64_read(>stats.bd_wb_limit) == 0

Seems like I misread writeback_limit_store() a bit.

So, if I want to, say, let only 10M of writteback pages, I need to
do

echo 0 > writeback_limit
echo 10M > writeback_limit_store// memparse format is for
// simplicity only; I know
// it should be in 4K units.

every day. How about dropping the "echo 0" and ->stop_writeback?
So then we can just do

echo 10M > writeback_limit_store

every day: if we have ->bd_wb_limit budget then we writeback,
   otherwise we don't.

-ss

Re: [PATCH v4 7/7] zram: writeback throttle

2018-12-02 Thread Sergey Senozhatsky

On (12/03/18 14:50), Sergey Senozhatsky wrote:
> On (12/03/18 11:40), Minchan Kim wrote:
> [..]
> > +   down_read(>init_lock);
> > +   atomic64_set(>stats.bd_wb_limit, val);
> > +   if (val == 0)
> > +   zram->stop_writeback = false;
> > +   up_read(>init_lock);
> 
> [..]
> 
> > +   if (zram->stop_writeback) {
> > +   ret = -EIO;
> > +   break;
> > +   }
> > +
> > if (!blk_idx) {
> > blk_idx = alloc_block_bdev(zram);
> > if (!blk_idx) {
> > @@ -694,6 +732,11 @@ static ssize_t writeback_store(struct device *dev,
> > zram_set_element(zram, index, blk_idx);
> > blk_idx = 0;
> > atomic64_inc(>stats.pages_stored);
> > +   if (atomic64_add_unless(>stats.bd_wb_limit,
> > +   -1 << (PAGE_SHIFT - 12), 0)) {
> > +   if (atomic64_read(>stats.bd_wb_limit) == 0)
> > +   zram->stop_writeback = true;
> > +   }
> 
> Do we need ->stop_writeback? It should be identical to
> 
>   atomic64_read(>stats.bd_wb_limit) == 0

Seems like I misread writeback_limit_store() a bit.

So, if I want to, say, let only 10M of writteback pages, I need to
do

echo 0 > writeback_limit
echo 10M > writeback_limit_store// memparse format is for
// simplicity only; I know
// it should be in 4K units.

every day. How about dropping the "echo 0" and ->stop_writeback?
So then we can just do

echo 10M > writeback_limit_store

every day: if we have ->bd_wb_limit budget then we writeback,
   otherwise we don't.

-ss

Re: [PATCH v4 7/7] zram: writeback throttle

2018-12-02 Thread Sergey Senozhatsky

On (12/03/18 11:40), Minchan Kim wrote:
[..]
> + down_read(>init_lock);
> + atomic64_set(>stats.bd_wb_limit, val);
> + if (val == 0)
> + zram->stop_writeback = false;
> + up_read(>init_lock);

[..]

> + if (zram->stop_writeback) {
> + ret = -EIO;
> + break;
> + }
> +
>   if (!blk_idx) {
>   blk_idx = alloc_block_bdev(zram);
>   if (!blk_idx) {
> @@ -694,6 +732,11 @@ static ssize_t writeback_store(struct device *dev,
>   zram_set_element(zram, index, blk_idx);
>   blk_idx = 0;
>   atomic64_inc(>stats.pages_stored);
> + if (atomic64_add_unless(>stats.bd_wb_limit,
> + -1 << (PAGE_SHIFT - 12), 0)) {
> + if (atomic64_read(>stats.bd_wb_limit) == 0)
> + zram->stop_writeback = true;
> + }

Do we need ->stop_writeback? It should be identical to

atomic64_read(>stats.bd_wb_limit) == 0


Otherwise, looks good!

-ss

Re: [PATCH v4 7/7] zram: writeback throttle

2018-12-02 Thread Sergey Senozhatsky

On (12/03/18 11:40), Minchan Kim wrote:
[..]
> + down_read(>init_lock);
> + atomic64_set(>stats.bd_wb_limit, val);
> + if (val == 0)
> + zram->stop_writeback = false;
> + up_read(>init_lock);

[..]

> + if (zram->stop_writeback) {
> + ret = -EIO;
> + break;
> + }
> +
>   if (!blk_idx) {
>   blk_idx = alloc_block_bdev(zram);
>   if (!blk_idx) {
> @@ -694,6 +732,11 @@ static ssize_t writeback_store(struct device *dev,
>   zram_set_element(zram, index, blk_idx);
>   blk_idx = 0;
>   atomic64_inc(>stats.pages_stored);
> + if (atomic64_add_unless(>stats.bd_wb_limit,
> + -1 << (PAGE_SHIFT - 12), 0)) {
> + if (atomic64_read(>stats.bd_wb_limit) == 0)
> + zram->stop_writeback = true;
> + }

Do we need ->stop_writeback? It should be identical to

atomic64_read(>stats.bd_wb_limit) == 0


Otherwise, looks good!

-ss

Re: [PATCH v3 7/7] zram: writeback throttle

2018-12-02 Thread Sergey Senozhatsky

On (12/03/18 11:41), Minchan Kim wrote:
> On Mon, Dec 03, 2018 at 11:30:40AM +0900, Sergey Senozhatsky wrote:
> > On (12/03/18 08:18), Minchan Kim wrote:
> > > 
> > > Per andrew's comment:
> > > https://lkml.org/lkml/2018/11/27/156
> > > 
> > > I need to fix it to represent 4K always.
> > 
> > Aha.
> > 
> > Then we need to increase bd_writes PAGE_SIZE/4K times in writeback_store()?
> > 
> >wb_count = atomic64_inc_return(>stats.bd_writes);
> >...
> >if (wb_limit != 0 && wb_count >= wb_limit)
> >zram->stop_writeback = true;
> > 
> > bd_wb_limit is in 4K units; but in writeback_store() we alloc a full page
> > and write it to the backing device. So the actual number of written bytes
> > can be larger on systems with page_size > 4K. Right?
> 
> Hey Sergey,
> 
> I changed interface in recent version v4. I belive it would be more
> straigtforward for user. Could you review it?

Hi Minchan,

Just received v4. Will take a look!

-ss

Re: [PATCH v3 7/7] zram: writeback throttle

2018-12-02 Thread Sergey Senozhatsky

On (12/03/18 11:41), Minchan Kim wrote:
> On Mon, Dec 03, 2018 at 11:30:40AM +0900, Sergey Senozhatsky wrote:
> > On (12/03/18 08:18), Minchan Kim wrote:
> > > 
> > > Per andrew's comment:
> > > https://lkml.org/lkml/2018/11/27/156
> > > 
> > > I need to fix it to represent 4K always.
> > 
> > Aha.
> > 
> > Then we need to increase bd_writes PAGE_SIZE/4K times in writeback_store()?
> > 
> >wb_count = atomic64_inc_return(>stats.bd_writes);
> >...
> >if (wb_limit != 0 && wb_count >= wb_limit)
> >zram->stop_writeback = true;
> > 
> > bd_wb_limit is in 4K units; but in writeback_store() we alloc a full page
> > and write it to the backing device. So the actual number of written bytes
> > can be larger on systems with page_size > 4K. Right?
> 
> Hey Sergey,
> 
> I changed interface in recent version v4. I belive it would be more
> straigtforward for user. Could you review it?

Hi Minchan,

Just received v4. Will take a look!

-ss

Re: [PATCH v3 7/7] zram: writeback throttle

2018-12-02 Thread Sergey Senozhatsky

On (12/03/18 08:18), Minchan Kim wrote:
> 
> Per andrew's comment:
> https://lkml.org/lkml/2018/11/27/156
> 
> I need to fix it to represent 4K always.

Aha.

Then we need to increase bd_writes PAGE_SIZE/4K times in writeback_store()?

   wb_count = atomic64_inc_return(>stats.bd_writes);
   ...
   if (wb_limit != 0 && wb_count >= wb_limit)
   zram->stop_writeback = true;

bd_wb_limit is in 4K units; but in writeback_store() we alloc a full page
and write it to the backing device. So the actual number of written bytes
can be larger on systems with page_size > 4K. Right?

-ss

Re: [PATCH v3 7/7] zram: writeback throttle

2018-12-02 Thread Sergey Senozhatsky

On (12/03/18 08:18), Minchan Kim wrote:
> 
> Per andrew's comment:
> https://lkml.org/lkml/2018/11/27/156
> 
> I need to fix it to represent 4K always.

Aha.

Then we need to increase bd_writes PAGE_SIZE/4K times in writeback_store()?

   wb_count = atomic64_inc_return(>stats.bd_writes);
   ...
   if (wb_limit != 0 && wb_count >= wb_limit)
   zram->stop_writeback = true;

bd_wb_limit is in 4K units; but in writeback_store() we alloc a full page
and write it to the backing device. So the actual number of written bytes
can be larger on systems with page_size > 4K. Right?

-ss

Re: [PATCH RFC 14/15] lib: replace **** with a hug

2018-12-01 Thread Sergey Senozhatsky

On (11/30/18 12:59), Jarkko Sakkinen wrote:
> On Fri, Nov 30, 2018 at 02:41:11PM -0500, Steven Rostedt wrote:
> > Since the code has been greatly modified since that comment was added,
> > I would say the comment is simply out of date.
> > 
> > Just nuke the comment, and that will be an accurate change with or
> > without CoC.
> 
> Thanks, will do.

Well, nuking the comment is OK.

Could you please help me understand one thing:
  Linus has made a comment; in his own words, about his own
  code. Why would anyone be offended by this?

More importantly, if this is now non-CoC-complaint, will I face
any penalties for
"Yikes! I'm a moron; my patch broke NMI printk, sorry!"
email?

-ss

Re: [PATCH RFC 14/15] lib: replace **** with a hug

2018-12-01 Thread Sergey Senozhatsky

On (11/30/18 12:59), Jarkko Sakkinen wrote:
> On Fri, Nov 30, 2018 at 02:41:11PM -0500, Steven Rostedt wrote:
> > Since the code has been greatly modified since that comment was added,
> > I would say the comment is simply out of date.
> > 
> > Just nuke the comment, and that will be an accurate change with or
> > without CoC.
> 
> Thanks, will do.

Well, nuking the comment is OK.

Could you please help me understand one thing:
  Linus has made a comment; in his own words, about his own
  code. Why would anyone be offended by this?

More importantly, if this is now non-CoC-complaint, will I face
any penalties for
"Yikes! I'm a moron; my patch broke NMI printk, sorry!"
email?

-ss

Re: [PATCH v3 0/7] zram idle page writeback

2018-11-29 Thread Sergey Senozhatsky

On (11/27/18 14:54), Minchan Kim wrote:
> Inherently, swap device has many idle pages which are rare touched since
> it was allocated. It is never problem if we use storage device as swap.
> However, it's just waste for zram-swap.
> 
> This patchset supports zram idle page writeback feature.
> 
> * Admin can define what is idle page "no access since X time ago"
> * Admin can define when zram should writeback them
> * Admin can define when zram should stop writeback to prevent wearout
> 
> Detail is on each patch's description.
> 
> Below first two patches are -stable material so it could go first
> separately with others in this series.

I had some time to look at the patches
Reviewed-by: Sergey Senozhatsky 

Will give it some testing later; next week maybe.

-ss

Re: [PATCH v3 0/7] zram idle page writeback

2018-11-29 Thread Sergey Senozhatsky

On (11/27/18 14:54), Minchan Kim wrote:
> Inherently, swap device has many idle pages which are rare touched since
> it was allocated. It is never problem if we use storage device as swap.
> However, it's just waste for zram-swap.
> 
> This patchset supports zram idle page writeback feature.
> 
> * Admin can define what is idle page "no access since X time ago"
> * Admin can define when zram should writeback them
> * Admin can define when zram should stop writeback to prevent wearout
> 
> Detail is on each patch's description.
> 
> Below first two patches are -stable material so it could go first
> separately with others in this series.

I had some time to look at the patches
Reviewed-by: Sergey Senozhatsky 

Will give it some testing later; next week maybe.

-ss

Re: [PATCH] printk: don't unconditionally shortcut print_time()

2018-11-29 Thread Sergey Senozhatsky

On (11/29/18 15:19), Petr Mladek wrote:
> This fixes a real problem. It might even avoid a crash.
> 
> See syslog_print_all() and kmsg_dump_get_buffer(). They
> start with calling msg_print_text() many times to calculate
> how many messages fit into the given buffer. Then they
> blindly copy the messages into the buffer.

Hmm. Interesting find.

So can we just use a pessimistic approach then?

If IS_ENABLED(CONFIG_PRINTK_TIME) then always calculate
buffer size as if printk_time was true; but use the actual
printk_time when we copy out messages. Yes, the buffer can
be larger than needed, but at least we don't care about any
races anymore. And we also don't need to pass printk_time
snapshot while we copy out messages.

I really don't know who and why would enable CONFIG_PRINTK_TIME
but then disable printk_time via param. So most of the time that
pessimistic approach should calculate the right buffer size.
Probably.

[..]
> It seems worth supporting the disabled printk_time() properly.
> I am not sure who is disabling the timestamps.

Yeah, I'm curious as well. Can't see why would anyone switch
it back and forth. Kconfig and boot param work just fine.

-ss

Re: [PATCH] printk: don't unconditionally shortcut print_time()

2018-11-29 Thread Sergey Senozhatsky

On (11/29/18 15:19), Petr Mladek wrote:
> This fixes a real problem. It might even avoid a crash.
> 
> See syslog_print_all() and kmsg_dump_get_buffer(). They
> start with calling msg_print_text() many times to calculate
> how many messages fit into the given buffer. Then they
> blindly copy the messages into the buffer.

Hmm. Interesting find.

So can we just use a pessimistic approach then?

If IS_ENABLED(CONFIG_PRINTK_TIME) then always calculate
buffer size as if printk_time was true; but use the actual
printk_time when we copy out messages. Yes, the buffer can
be larger than needed, but at least we don't care about any
races anymore. And we also don't need to pass printk_time
snapshot while we copy out messages.

I really don't know who and why would enable CONFIG_PRINTK_TIME
but then disable printk_time via param. So most of the time that
pessimistic approach should calculate the right buffer size.
Probably.

[..]
> It seems worth supporting the disabled printk_time() properly.
> I am not sure who is disabling the timestamps.

Yeah, I'm curious as well. Can't see why would anyone switch
it back and forth. Kconfig and boot param work just fine.

-ss

Re: [PATCH 7/7] lib/lzo: separate lzo-rle from lzo

2018-11-29 Thread Sergey Senozhatsky

On (11/29/18 10:21), Dave Rodgman wrote:
> > [..]
> >> +++ b/drivers/block/zram/zcomp.c
> >> @@ -20,6 +20,7 @@
> >>   
> >>   static const char * const backends[] = {
> >>"lzo",
> >> +  "lzo-rle",
> >>   #if IS_ENABLED(CONFIG_CRYPTO_LZ4)
> >>"lz4",
> >>   #endif
> > 
> > [..]
> > 
> >> +++ b/drivers/block/zram/zram_drv.c
> >> @@ -41,7 +41,7 @@ static DEFINE_IDR(zram_index_idr);
> >>   static DEFINE_MUTEX(zram_index_mutex);
> >>   
> >>   static int zram_major;
> >> -static const char *default_compressor = "lzo";
> >> +static const char *default_compressor = "lzo-rle";
> > 
> > OK, so it's not just "separate lzo-rle", it's also "switch zram to
> > a new compression algorithm by default". I'd say that usually I'd
> > expect this to be separate patches.
> 
> Yes, fair point. akpm has picked this up now though, so probably a bit
> late to break it out into a separate patch?

Andrew accepts patches to patches, and patches to patches to patches, etc.
He is very flexible :)

I don't have a very strong opinion. Would probably be better to split
it, tho. Let's hear from Minchan.

-ss

Re: [PATCH 7/7] lib/lzo: separate lzo-rle from lzo

2018-11-29 Thread Sergey Senozhatsky

On (11/29/18 10:21), Dave Rodgman wrote:
> > [..]
> >> +++ b/drivers/block/zram/zcomp.c
> >> @@ -20,6 +20,7 @@
> >>   
> >>   static const char * const backends[] = {
> >>"lzo",
> >> +  "lzo-rle",
> >>   #if IS_ENABLED(CONFIG_CRYPTO_LZ4)
> >>"lz4",
> >>   #endif
> > 
> > [..]
> > 
> >> +++ b/drivers/block/zram/zram_drv.c
> >> @@ -41,7 +41,7 @@ static DEFINE_IDR(zram_index_idr);
> >>   static DEFINE_MUTEX(zram_index_mutex);
> >>   
> >>   static int zram_major;
> >> -static const char *default_compressor = "lzo";
> >> +static const char *default_compressor = "lzo-rle";
> > 
> > OK, so it's not just "separate lzo-rle", it's also "switch zram to
> > a new compression algorithm by default". I'd say that usually I'd
> > expect this to be separate patches.
> 
> Yes, fair point. akpm has picked this up now though, so probably a bit
> late to break it out into a separate patch?

Andrew accepts patches to patches, and patches to patches to patches, etc.
He is very flexible :)

I don't have a very strong opinion. Would probably be better to split
it, tho. Let's hear from Minchan.

-ss

Re: [PATCH v2 0/7] lib/lzo: performance improvements

2018-11-28 Thread Sergey Senozhatsky

On (11/27/18 16:19), Dave Rodgman wrote:
> > Right. The number is data dependent. Not all swapped out pages can be
> > compressed; compressed pages that end up being >= zs_huge_class_size() are
> > considered incompressible and stored as it.
> > 
> > I'd say that on my setups around 50-60% of pages are incompressible.
> 
> So, just to give a bit more detail: the test setup was a Samsung
> Chromebook Pro, cycling through 80 tabs in Chrome. With lzo-rle, only
> 5% of pages increased in size, and 90% of pages compress to 75% of
> original size (or better). Mean compression ratio was 41%. Importantly
> for lzo-rle, there are a lot of low-entropy pages where it can do well:
> in total about 20% of the data is zeros forming part of a run of 4 or
> more bytes.
> 
> As a quick summary of the impact of these patches on bigger chunks of
> data, I've compared the performance of four different variants of lzo
> on two large (~40 MB) files. The numbers show round-trip throughput
> in MB/s:
> 
> Variant | Low-entropy | High-entropy
> Current lzo |  242| 157
> Arm opts|  290| 159
> RLE |  876| 151
> Arm opts + RLE  | 1150| 181
> 
> So both the Arm optimisations (8,16-byte copy & CTZ patches), and the
> RLE implementation make a significant contribution to the overall
> performance uplift.

Cool!

-ss

Re: [PATCH v2 0/7] lib/lzo: performance improvements

2018-11-28 Thread Sergey Senozhatsky

On (11/27/18 16:19), Dave Rodgman wrote:
> > Right. The number is data dependent. Not all swapped out pages can be
> > compressed; compressed pages that end up being >= zs_huge_class_size() are
> > considered incompressible and stored as it.
> > 
> > I'd say that on my setups around 50-60% of pages are incompressible.
> 
> So, just to give a bit more detail: the test setup was a Samsung
> Chromebook Pro, cycling through 80 tabs in Chrome. With lzo-rle, only
> 5% of pages increased in size, and 90% of pages compress to 75% of
> original size (or better). Mean compression ratio was 41%. Importantly
> for lzo-rle, there are a lot of low-entropy pages where it can do well:
> in total about 20% of the data is zeros forming part of a run of 4 or
> more bytes.
> 
> As a quick summary of the impact of these patches on bigger chunks of
> data, I've compared the performance of four different variants of lzo
> on two large (~40 MB) files. The numbers show round-trip throughput
> in MB/s:
> 
> Variant | Low-entropy | High-entropy
> Current lzo |  242| 157
> Arm opts|  290| 159
> RLE |  876| 151
> Arm opts + RLE  | 1150| 181
> 
> So both the Arm optimisations (8,16-byte copy & CTZ patches), and the
> RLE implementation make a significant contribution to the overall
> performance uplift.

Cool!

-ss

Re: [PATCH 7/7] lib/lzo: separate lzo-rle from lzo

2018-11-28 Thread Sergey Senozhatsky

On (11/27/18 16:19), Dave Rodgman wrote:
>  Documentation/lzo.txt |  12 ++-
>  crypto/Makefile   |   2 +-
>  crypto/lzo-rle.c  | 175 
> ++
>  crypto/tcrypt.c   |   4 +-
>  drivers/block/zram/zcomp.c|   1 +
>  drivers/block/zram/zram_drv.c |   2 +-
>  include/linux/lzo.h   |   4 +
>  lib/lzo/lzo1x_compress.c  |  42 +++---
>  lib/lzo/lzodefs.h |   3 +-
>  9 files changed, 227 insertions(+), 18 deletions(-)
>  create mode 100644 crypto/lzo-rle.c

[..]

> +static struct crypto_alg alg = {
> + .cra_name   = "lzo-rle",
> + .cra_flags  = CRYPTO_ALG_TYPE_COMPRESS,
> + .cra_ctxsize= sizeof(struct lzorle_ctx),
> + .cra_module = THIS_MODULE,
> + .cra_init   = lzorle_init,
> + .cra_exit   = lzorle_exit,
> + .cra_u  = { .compress = {
> + .coa_compress   = lzorle_compress,
> + .coa_decompress = lzorle_decompress } }
> +};

A nitpick:
  indentation for .compress assignment is a bit confusing, maybe.


[..]
> +++ b/drivers/block/zram/zcomp.c
> @@ -20,6 +20,7 @@
>  
>  static const char * const backends[] = {
>   "lzo",
> + "lzo-rle",
>  #if IS_ENABLED(CONFIG_CRYPTO_LZ4)
>   "lz4",
>  #endif

[..]

> +++ b/drivers/block/zram/zram_drv.c
> @@ -41,7 +41,7 @@ static DEFINE_IDR(zram_index_idr);
>  static DEFINE_MUTEX(zram_index_mutex);
>  
>  static int zram_major;
> -static const char *default_compressor = "lzo";
> +static const char *default_compressor = "lzo-rle";

OK, so it's not just "separate lzo-rle", it's also "switch zram to
a new compression algorithm by default". I'd say that usually I'd
expect this to be separate patches.

-ss

Re: [PATCH 7/7] lib/lzo: separate lzo-rle from lzo

2018-11-28 Thread Sergey Senozhatsky

On (11/27/18 16:19), Dave Rodgman wrote:
>  Documentation/lzo.txt |  12 ++-
>  crypto/Makefile   |   2 +-
>  crypto/lzo-rle.c  | 175 
> ++
>  crypto/tcrypt.c   |   4 +-
>  drivers/block/zram/zcomp.c|   1 +
>  drivers/block/zram/zram_drv.c |   2 +-
>  include/linux/lzo.h   |   4 +
>  lib/lzo/lzo1x_compress.c  |  42 +++---
>  lib/lzo/lzodefs.h |   3 +-
>  9 files changed, 227 insertions(+), 18 deletions(-)
>  create mode 100644 crypto/lzo-rle.c

[..]

> +static struct crypto_alg alg = {
> + .cra_name   = "lzo-rle",
> + .cra_flags  = CRYPTO_ALG_TYPE_COMPRESS,
> + .cra_ctxsize= sizeof(struct lzorle_ctx),
> + .cra_module = THIS_MODULE,
> + .cra_init   = lzorle_init,
> + .cra_exit   = lzorle_exit,
> + .cra_u  = { .compress = {
> + .coa_compress   = lzorle_compress,
> + .coa_decompress = lzorle_decompress } }
> +};

A nitpick:
  indentation for .compress assignment is a bit confusing, maybe.


[..]
> +++ b/drivers/block/zram/zcomp.c
> @@ -20,6 +20,7 @@
>  
>  static const char * const backends[] = {
>   "lzo",
> + "lzo-rle",
>  #if IS_ENABLED(CONFIG_CRYPTO_LZ4)
>   "lz4",
>  #endif

[..]

> +++ b/drivers/block/zram/zram_drv.c
> @@ -41,7 +41,7 @@ static DEFINE_IDR(zram_index_idr);
>  static DEFINE_MUTEX(zram_index_mutex);
>  
>  static int zram_major;
> -static const char *default_compressor = "lzo";
> +static const char *default_compressor = "lzo-rle";

OK, so it's not just "separate lzo-rle", it's also "switch zram to
a new compression algorithm by default". I'd say that usually I'd
expect this to be separate patches.

-ss

Re: [PATCH 6/7] lib/lzo: implement run-length encoding

2018-11-28 Thread Sergey Senozhatsky

On (11/29/18 12:08), Sergey Senozhatsky wrote:
> On (11/27/18 16:19), Dave Rodgman wrote:
> >
> > This modifies the bitstream in a way which is backwards compatible
> > (i.e., we can decompress old bitstreams, but old versions of lzo
> > cannot decompress new bitstreams).
> >
> 
> Hmmm... Whoa. Help me understand this:
> 
> So a btrfs filesystem, compressed with the new lzo, say I run 4.21,
> won't be readable at all once I reboot under 4.19?
> 
> What about compressed net traffic between servers running different
> kernel version (one with new lzo, the other one with old lzo)?
> XFRM can be compressed with lzo, can't it?

Ah, false alarm. Sorry for the noise!

You create a new lzo version in the next patch.

-ss

Re: [PATCH 6/7] lib/lzo: implement run-length encoding

2018-11-28 Thread Sergey Senozhatsky

On (11/29/18 12:08), Sergey Senozhatsky wrote:
> On (11/27/18 16:19), Dave Rodgman wrote:
> >
> > This modifies the bitstream in a way which is backwards compatible
> > (i.e., we can decompress old bitstreams, but old versions of lzo
> > cannot decompress new bitstreams).
> >
> 
> Hmmm... Whoa. Help me understand this:
> 
> So a btrfs filesystem, compressed with the new lzo, say I run 4.21,
> won't be readable at all once I reboot under 4.19?
> 
> What about compressed net traffic between servers running different
> kernel version (one with new lzo, the other one with old lzo)?
> XFRM can be compressed with lzo, can't it?

Ah, false alarm. Sorry for the noise!

You create a new lzo version in the next patch.

-ss

Re: [PATCH 6/7] lib/lzo: implement run-length encoding

2018-11-28 Thread Sergey Senozhatsky

On (11/27/18 16:19), Dave Rodgman wrote:
>
> This modifies the bitstream in a way which is backwards compatible
> (i.e., we can decompress old bitstreams, but old versions of lzo
> cannot decompress new bitstreams).
>

Hmmm... Whoa. Help me understand this:

So a btrfs filesystem, compressed with the new lzo, say I run 4.21,
won't be readable at all once I reboot under 4.19?

What about compressed net traffic between servers running different
kernel version (one with new lzo, the other one with old lzo)?
XFRM can be compressed with lzo, can't it?

-ss

Re: [PATCH 6/7] lib/lzo: implement run-length encoding

2018-11-28 Thread Sergey Senozhatsky

On (11/27/18 16:19), Dave Rodgman wrote:
>
> This modifies the bitstream in a way which is backwards compatible
> (i.e., we can decompress old bitstreams, but old versions of lzo
> cannot decompress new bitstreams).
>

Hmmm... Whoa. Help me understand this:

So a btrfs filesystem, compressed with the new lzo, say I run 4.21,
won't be readable at all once I reboot under 4.19?

What about compressed net traffic between servers running different
kernel version (one with new lzo, the other one with old lzo)?
XFRM can be compressed with lzo, can't it?

-ss

Re: [PATCH v2] mm/zsmalloc.c: Fix zsmalloc 32-bit PAE support

2018-11-28 Thread Sergey Senozhatsky

On (11/27/18 18:33), Rafael David Tinoco wrote:
> On 11/20/18 10:18 PM, Rafael David Tinoco wrote:
> > 
> > Russell,
> > 
> > I have tried to place MAX_POSSIBLE_PHYSMEM_BITS in the best available
> > header for each architecture, considering different paging levels, PAE
> > existence, and existing similar definitions. Also, I have only
> > considered those architectures already having "sparsemem.h" header.
> > 
> > Would you mind reviewing it ?
> 
> Should I re-send the this v2 (as v3) with complete list of
> get_maintainer.pl ? I was in doubt because I'm touching headers from
> several archs and I'm not sure who, if it is accepted, would merge it.

Yes, resending and Cc-ing archs' maintainers if the right thing to do.
It's also possible that they will ask to split the patch and do a
per-arch change.

-ss

Re: [PATCH v2] mm/zsmalloc.c: Fix zsmalloc 32-bit PAE support

2018-11-28 Thread Sergey Senozhatsky

On (11/27/18 18:33), Rafael David Tinoco wrote:
> On 11/20/18 10:18 PM, Rafael David Tinoco wrote:
> > 
> > Russell,
> > 
> > I have tried to place MAX_POSSIBLE_PHYSMEM_BITS in the best available
> > header for each architecture, considering different paging levels, PAE
> > existence, and existing similar definitions. Also, I have only
> > considered those architectures already having "sparsemem.h" header.
> > 
> > Would you mind reviewing it ?
> 
> Should I re-send the this v2 (as v3) with complete list of
> get_maintainer.pl ? I was in doubt because I'm touching headers from
> several archs and I'm not sure who, if it is accepted, would merge it.

Yes, resending and Cc-ing archs' maintainers if the right thing to do.
It's also possible that they will ask to split the patch and do a
per-arch change.

-ss

Re: [PATCH v3 1/7] zram: fix lockdep warning of free block handling

2018-11-28 Thread Sergey Senozhatsky

On (11/27/18 14:54), Minchan Kim wrote:
> With writeback feature, zram_slot_free_notify could be called
> in softirq context by end_swap_bio_read. However, bitmap_lock
> is not aware of that so lockdep yell out. Thanks.
> 
> get_entry_bdev
> spin_lock(bitmap->lock);
> irq
> softirq
> end_swap_bio_read
> zram_slot_free_notify
> zram_slot_lock <-- deadlock prone
> zram_free_page
> put_entry_bdev
> spin_lock(bitmap->lock); <-- deadlock prone
> 
> With akpm's suggestion(i.e. bitmap operation is already atomic),
> we could remove bitmap lock. It might fail to find a empty slot
> if serious contention happens. However, it's not severe problem
> because huge page writeback has already possiblity to fail if there
> is severe memory pressure. Worst case is just keeping
> the incompressible in memory, not storage.
> 
> The other problem is zram_slot_lock in zram_slot_slot_free_notify.
> To make it safe is this patch introduces zram_slot_trylock where
> zram_slot_free_notify uses it. Although it's rare to be contented,
> this patch adds new debug stat "miss_free" to keep monitoring
> how often it happens.
> 
> Signed-off-by: Minchan Kim 

Reviewed-by: Sergey Senozhatsky 

-ss

Re: [PATCH v3 1/7] zram: fix lockdep warning of free block handling

2018-11-28 Thread Sergey Senozhatsky

On (11/27/18 14:54), Minchan Kim wrote:
> With writeback feature, zram_slot_free_notify could be called
> in softirq context by end_swap_bio_read. However, bitmap_lock
> is not aware of that so lockdep yell out. Thanks.
> 
> get_entry_bdev
> spin_lock(bitmap->lock);
> irq
> softirq
> end_swap_bio_read
> zram_slot_free_notify
> zram_slot_lock <-- deadlock prone
> zram_free_page
> put_entry_bdev
> spin_lock(bitmap->lock); <-- deadlock prone
> 
> With akpm's suggestion(i.e. bitmap operation is already atomic),
> we could remove bitmap lock. It might fail to find a empty slot
> if serious contention happens. However, it's not severe problem
> because huge page writeback has already possiblity to fail if there
> is severe memory pressure. Worst case is just keeping
> the incompressible in memory, not storage.
> 
> The other problem is zram_slot_lock in zram_slot_slot_free_notify.
> To make it safe is this patch introduces zram_slot_trylock where
> zram_slot_free_notify uses it. Although it's rare to be contented,
> this patch adds new debug stat "miss_free" to keep monitoring
> how often it happens.
> 
> Signed-off-by: Minchan Kim 

Reviewed-by: Sergey Senozhatsky 

-ss

Re: [PATCH v3 2/7] zram: fix double free backing device

2018-11-28 Thread Sergey Senozhatsky

On (11/27/18 14:54), Minchan Kim wrote:
> If blkdev_get fails, we shouldn't do blkdev_put. Otherwise,
> kernel emits below log. This patch fixes it.
> 
> [   31.073006] WARNING: CPU: 0 PID: 1893 at fs/block_dev.c:1828 
> blkdev_put+0x105/0x120
> [   31.075104] Modules linked in:
> [   31.075898] CPU: 0 PID: 1893 Comm: swapoff Not tainted 4.19.0+ #453
> [   31.077484] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.10.2-1 04/01/2014
> [   31.079589] RIP: 0010:blkdev_put+0x105/0x120
> [   31.080606] Code: 48 c7 80 a0 00 00 00 00 00 00 00 48 c7 c7 40 e7 40 96 e8 
> 6e 47 73 00 48 8b bb e0 00 00 00 e9 2c ff ff ff 0f 0b e9 75 ff ff ff <0f> 0b 
> e9 5a ff ff ff 48 c7 80 a0 00 00 00 00 00 00 00 eb 87 0f 1f
> [   31.085080] RSP: 0018:b409005c7ed0 EFLAGS: 00010297
> [   31.086383] RAX: 9779fe5a8040 RBX: 9779fbc17300 RCX: 
> b9fc37a4
> [   31.088105] RDX: 0001 RSI:  RDI: 
> 9640e740
> [   31.089850] RBP: 9779fbc17318 R08: 95499a89 R09: 
> 0004
> [   31.091201] R10: b409005c7e50 R11: 7a9ef6088ff4d4a1 R12: 
> 0083
> [   31.092276] R13: 9779fe607b98 R14:  R15: 
> 9779fe607a38
> [   31.093355] FS:  7fc118d9b840() GS:9779fc60() 
> knlGS:
> [   31.094582] CS:  0010 DS:  ES:  CR0: 80050033
> [   31.095541] CR2: 7fc11894b8dc CR3: 339f6001 CR4: 
> 00160ef0
> [   31.096781] Call Trace:
> [   31.097212]  __x64_sys_swapoff+0x46d/0x490
> [   31.097914]  do_syscall_64+0x5a/0x190
> [   31.098550]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [   31.099402] RIP: 0033:0x7fc11843ec27
> [   31.100013] Code: 73 01 c3 48 8b 0d 71 62 2c 00 f7 d8 64 89 01 48 83 c8 ff 
> c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 
> 01 f0 ff ff 73 01 c3 48 8b 0d 41 62 2c 00 f7 d8 64 89 01 48
> [   31.103149] RSP: 002b:7ffdf69be648 EFLAGS: 0206 ORIG_RAX: 
> 00a8
> [   31.104425] RAX: ffda RBX: 011d98c0 RCX: 
> 7fc11843ec27
> [   31.105627] RDX: 0001 RSI: 0001 RDI: 
> 011d98c0
> [   31.106847] RBP: 0001 R08: 7ffdf69be690 R09: 
> 0001
> [   31.108038] R10: 02b1 R11: 0206 R12: 
> 0001
> [   31.109231] R13:  R14:  R15: 
> 
> [   31.110433] irq event stamp: 4466
> [   31.111001] hardirqs last  enabled at (4465): [] 
> __free_pages_ok+0x1e3/0x490
> [   31.112437] hardirqs last disabled at (4466): [] 
> trace_hardirqs_off_thunk+0x1a/0x1c
> [   31.113973] softirqs last  enabled at (3420): [] 
> __do_softirq+0x333/0x446
> [   31.115364] softirqs last disabled at (3407): [] 
> irq_exit+0xd1/0xe0
> 
> Cc: sta...@vger.kernel.org # 4.14+
> Signed-off-by: Minchan Kim 

Good catch.

Reviewed-by: Sergey Senozhatsky 

-ss

Re: [PATCH v3 2/7] zram: fix double free backing device

2018-11-28 Thread Sergey Senozhatsky

On (11/27/18 14:54), Minchan Kim wrote:
> If blkdev_get fails, we shouldn't do blkdev_put. Otherwise,
> kernel emits below log. This patch fixes it.
> 
> [   31.073006] WARNING: CPU: 0 PID: 1893 at fs/block_dev.c:1828 
> blkdev_put+0x105/0x120
> [   31.075104] Modules linked in:
> [   31.075898] CPU: 0 PID: 1893 Comm: swapoff Not tainted 4.19.0+ #453
> [   31.077484] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.10.2-1 04/01/2014
> [   31.079589] RIP: 0010:blkdev_put+0x105/0x120
> [   31.080606] Code: 48 c7 80 a0 00 00 00 00 00 00 00 48 c7 c7 40 e7 40 96 e8 
> 6e 47 73 00 48 8b bb e0 00 00 00 e9 2c ff ff ff 0f 0b e9 75 ff ff ff <0f> 0b 
> e9 5a ff ff ff 48 c7 80 a0 00 00 00 00 00 00 00 eb 87 0f 1f
> [   31.085080] RSP: 0018:b409005c7ed0 EFLAGS: 00010297
> [   31.086383] RAX: 9779fe5a8040 RBX: 9779fbc17300 RCX: 
> b9fc37a4
> [   31.088105] RDX: 0001 RSI:  RDI: 
> 9640e740
> [   31.089850] RBP: 9779fbc17318 R08: 95499a89 R09: 
> 0004
> [   31.091201] R10: b409005c7e50 R11: 7a9ef6088ff4d4a1 R12: 
> 0083
> [   31.092276] R13: 9779fe607b98 R14:  R15: 
> 9779fe607a38
> [   31.093355] FS:  7fc118d9b840() GS:9779fc60() 
> knlGS:
> [   31.094582] CS:  0010 DS:  ES:  CR0: 80050033
> [   31.095541] CR2: 7fc11894b8dc CR3: 339f6001 CR4: 
> 00160ef0
> [   31.096781] Call Trace:
> [   31.097212]  __x64_sys_swapoff+0x46d/0x490
> [   31.097914]  do_syscall_64+0x5a/0x190
> [   31.098550]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [   31.099402] RIP: 0033:0x7fc11843ec27
> [   31.100013] Code: 73 01 c3 48 8b 0d 71 62 2c 00 f7 d8 64 89 01 48 83 c8 ff 
> c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 
> 01 f0 ff ff 73 01 c3 48 8b 0d 41 62 2c 00 f7 d8 64 89 01 48
> [   31.103149] RSP: 002b:7ffdf69be648 EFLAGS: 0206 ORIG_RAX: 
> 00a8
> [   31.104425] RAX: ffda RBX: 011d98c0 RCX: 
> 7fc11843ec27
> [   31.105627] RDX: 0001 RSI: 0001 RDI: 
> 011d98c0
> [   31.106847] RBP: 0001 R08: 7ffdf69be690 R09: 
> 0001
> [   31.108038] R10: 02b1 R11: 0206 R12: 
> 0001
> [   31.109231] R13:  R14:  R15: 
> 
> [   31.110433] irq event stamp: 4466
> [   31.111001] hardirqs last  enabled at (4465): [] 
> __free_pages_ok+0x1e3/0x490
> [   31.112437] hardirqs last disabled at (4466): [] 
> trace_hardirqs_off_thunk+0x1a/0x1c
> [   31.113973] softirqs last  enabled at (3420): [] 
> __do_softirq+0x333/0x446
> [   31.115364] softirqs last disabled at (3407): [] 
> irq_exit+0xd1/0xe0
> 
> Cc: sta...@vger.kernel.org # 4.14+
> Signed-off-by: Minchan Kim 

Good catch.

Reviewed-by: Sergey Senozhatsky 

-ss

Re: [PATCH v3 4/7] zram: introduce ZRAM_IDLE flag

2018-11-28 Thread Sergey Senozhatsky

On (11/27/18 14:54), Minchan Kim wrote:
> To support idle page writeback with upcoming patches, this patch
> introduces a new ZRAM_IDLE flag.
> 
> Userspace can mark zram slots as "idle" via
>   "echo all > /sys/block/zramX/idle"
> which marks every allocated zram slot as ZRAM_IDLE.
> User could see it by /sys/kernel/debug/zram/zram0/block_state.
> 
>   30075.033841 ...i
>   30163.806904 s..i
>   30263.806919 ..hi

Hopefully we will never have a 't' block state :)

-ss

Re: [PATCH v3 4/7] zram: introduce ZRAM_IDLE flag

2018-11-28 Thread Sergey Senozhatsky

On (11/27/18 14:54), Minchan Kim wrote:
> To support idle page writeback with upcoming patches, this patch
> introduces a new ZRAM_IDLE flag.
> 
> Userspace can mark zram slots as "idle" via
>   "echo all > /sys/block/zramX/idle"
> which marks every allocated zram slot as ZRAM_IDLE.
> User could see it by /sys/kernel/debug/zram/zram0/block_state.
> 
>   30075.033841 ...i
>   30163.806904 s..i
>   30263.806919 ..hi

Hopefully we will never have a 't' block state :)

-ss

Re: [PATCH v3 7/7] zram: writeback throttle

2018-11-28 Thread Sergey Senozhatsky

On (11/27/18 14:54), Minchan Kim wrote:
> diff --git a/Documentation/ABI/testing/sysfs-block-zram 
> b/Documentation/ABI/testing/sysfs-block-zram
> index 65fc33b2f53b..9d2339a485c8 100644
> --- a/Documentation/ABI/testing/sysfs-block-zram
> +++ b/Documentation/ABI/testing/sysfs-block-zram
> @@ -121,3 +121,12 @@ Contact: Minchan Kim 
>   The bd_stat file is read-only and represents backing device's
>   statistics (bd_count, bd_reads, bd_writes) in a format
>   similar to block layer statistics file format.
> +
> +What:/sys/block/zram/writeback_limit
> +Date:November 2018
> +Contact: Minchan Kim 
> +Description:
> + The writeback_limit file is read-write and specifies the maximum
> + amount of writeback ZRAM can do. The limit could be changed
> + in run time and "0" means disable the limit.
> + No limit is the initial state.
> diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
> index 906df97527a7..64b61925e475 100644
> --- a/Documentation/blockdev/zram.txt
> +++ b/Documentation/blockdev/zram.txt
> @@ -164,6 +164,8 @@ reset WOtrigger device reset
>  mem_used_max  WOreset the `mem_used_max' counter (see later)
>  mem_limit WOspecifies the maximum amount of memory ZRAM can use
>  to store the compressed data
> +writeback_limit   WOspecifies the maximum amount of write IO zram can
> + write out to backing device as 4KB unit
   
page size units?

-ss

Re: [PATCH v3 7/7] zram: writeback throttle

2018-11-28 Thread Sergey Senozhatsky

On (11/27/18 14:54), Minchan Kim wrote:
> diff --git a/Documentation/ABI/testing/sysfs-block-zram 
> b/Documentation/ABI/testing/sysfs-block-zram
> index 65fc33b2f53b..9d2339a485c8 100644
> --- a/Documentation/ABI/testing/sysfs-block-zram
> +++ b/Documentation/ABI/testing/sysfs-block-zram
> @@ -121,3 +121,12 @@ Contact: Minchan Kim 
>   The bd_stat file is read-only and represents backing device's
>   statistics (bd_count, bd_reads, bd_writes) in a format
>   similar to block layer statistics file format.
> +
> +What:/sys/block/zram/writeback_limit
> +Date:November 2018
> +Contact: Minchan Kim 
> +Description:
> + The writeback_limit file is read-write and specifies the maximum
> + amount of writeback ZRAM can do. The limit could be changed
> + in run time and "0" means disable the limit.
> + No limit is the initial state.
> diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
> index 906df97527a7..64b61925e475 100644
> --- a/Documentation/blockdev/zram.txt
> +++ b/Documentation/blockdev/zram.txt
> @@ -164,6 +164,8 @@ reset WOtrigger device reset
>  mem_used_max  WOreset the `mem_used_max' counter (see later)
>  mem_limit WOspecifies the maximum amount of memory ZRAM can use
>  to store the compressed data
> +writeback_limit   WOspecifies the maximum amount of write IO zram can
> + write out to backing device as 4KB unit
   
page size units?

-ss

Re: [PATCH] fat: Replaced 11 magic to MSDOS_NAME for volume label

2018-11-26 Thread Sergey Senozhatsky

On (11/26/18 21:23), Tamir Carmeli wrote:
> Thanks for the reply and sorry for my noob question:
> Wasn't I supposed to send this patch also to the FAT maintainer, OGAWA
> Hirofumi?

Yes.

> His name wasn't on the get_maintainer output, and I think
> that this specific file isn't listed anywhere under the maintainers
> file.

Right. Not all headers are tracked; so usually you'd want
to ./scripts/get_maintainer.pl fs/fat (where C files are).

-ss

Re: [PATCH] fat: Replaced 11 magic to MSDOS_NAME for volume label

2018-11-26 Thread Sergey Senozhatsky

On (11/26/18 21:23), Tamir Carmeli wrote:
> Thanks for the reply and sorry for my noob question:
> Wasn't I supposed to send this patch also to the FAT maintainer, OGAWA
> Hirofumi?

Yes.

> His name wasn't on the get_maintainer output, and I think
> that this specific file isn't listed anywhere under the maintainers
> file.

Right. Not all headers are tracked; so usually you'd want
to ./scripts/get_maintainer.pl fs/fat (where C files are).

-ss

Re: [PATCH] printk: deduplicate print_prefix() calls

2018-11-25 Thread Sergey Senozhatsky

On (11/24/18 23:28), Tetsuo Handa wrote:
> Since /sys/module/printk/parameters/time can change from N to Y between
> "msg_print_text() called print_prefix() with buf == NULL" and
> "msg_print_text() again calls print_prefix() with buf != NULL", it is not
> safe for print_time() to unconditionally return 0 if printk_time == false.
> 
> But print_prefix() is called by only msg_print_text(), and print_prefix()
> is called once more for calculating output length of prefix, and there is
> no need to recalculate it when one log entry contains multiple lines.

Good catch.

> Since max output length for syslog marker and timestamp are known to be
> small enough, let's close this race by deduplicating print_prefix() calls.
> 
> Signed-off-by: Tetsuo Handa 

Or we can change printk_time under logbuf lock? So msg_print_text will
not race with it.

Or we can "copy" the value of printk_time and use it in print_prefix(),
and we will pick up an updated value next time we do msg_print_text()?

Example:

---

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 1b2a029360b7..4c57f5d5b0b6 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1211,13 +1211,14 @@ static inline void boot_delay_msec(int level)
 #endif
 
 static bool printk_time = IS_ENABLED(CONFIG_PRINTK_TIME);
+static bool printk_time_saved;
 module_param_named(time, printk_time, bool, S_IRUGO | S_IWUSR);
 
 static size_t print_time(u64 ts, char *buf)
 {
unsigned long rem_nsec;
 
-   if (!printk_time)
+   if (!printk_time_saved)
return 0;
 
rem_nsec = do_div(ts, 10);
@@ -1258,6 +1259,8 @@ static size_t msg_print_text(const struct printk_log 
*msg, bool syslog, char *bu
size_t text_size = msg->text_len;
size_t len = 0;
 
+   printk_time_saved = printk_time;
+
do {
const char *next = memchr(text, '\n', text_size);
size_t text_len;

Re: [PATCH] printk: deduplicate print_prefix() calls

2018-11-25 Thread Sergey Senozhatsky

On (11/24/18 23:28), Tetsuo Handa wrote:
> Since /sys/module/printk/parameters/time can change from N to Y between
> "msg_print_text() called print_prefix() with buf == NULL" and
> "msg_print_text() again calls print_prefix() with buf != NULL", it is not
> safe for print_time() to unconditionally return 0 if printk_time == false.
> 
> But print_prefix() is called by only msg_print_text(), and print_prefix()
> is called once more for calculating output length of prefix, and there is
> no need to recalculate it when one log entry contains multiple lines.

Good catch.

> Since max output length for syslog marker and timestamp are known to be
> small enough, let's close this race by deduplicating print_prefix() calls.
> 
> Signed-off-by: Tetsuo Handa 

Or we can change printk_time under logbuf lock? So msg_print_text will
not race with it.

Or we can "copy" the value of printk_time and use it in print_prefix(),
and we will pick up an updated value next time we do msg_print_text()?

Example:

---

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 1b2a029360b7..4c57f5d5b0b6 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1211,13 +1211,14 @@ static inline void boot_delay_msec(int level)
 #endif
 
 static bool printk_time = IS_ENABLED(CONFIG_PRINTK_TIME);
+static bool printk_time_saved;
 module_param_named(time, printk_time, bool, S_IRUGO | S_IWUSR);
 
 static size_t print_time(u64 ts, char *buf)
 {
unsigned long rem_nsec;
 
-   if (!printk_time)
+   if (!printk_time_saved)
return 0;
 
rem_nsec = do_div(ts, 10);
@@ -1258,6 +1259,8 @@ static size_t msg_print_text(const struct printk_log 
*msg, bool syslog, char *bu
size_t text_size = msg->text_len;
size_t len = 0;
 
+   printk_time_saved = printk_time;
+
do {
const char *next = memchr(text, '\n', text_size);
size_t text_len;

Re: RFC: script to convert vsprintf uses of %pf to %ps and %pF to %pS

2018-11-25 Thread Sergey Senozhatsky

And I forgot to actually Cc Petr and Steven...
Top-posting.

---

Hi,

Cc-ing Petr and Steven

On (11/25/18 01:13), Joe Perches wrote:
> commit 04b8eb7a4ccd ("symbol lookup: introduce 
> dereference_symbol_descriptor()}"
> 
> deprecated vsprintf extension %pf and %pF.
> 
> There are approximately these total uses of the symbolic
> lookup vsprintf extensions %p[SsFf]:
> 
> $ git grep '".*[^%]%p[SsFf]' | \
>   grep -oh '%p[SsFf]' | sort | uniq -c | sort -rn
> 231 %pS
>  65 %ps
>  60 %pf
>  47 %pF

I didn't bother to remove "pf/pF" because I didn't want to count on:
  a) everyone running checkpatch.pl
  b) everyone testing all printk-s they added

I guess pf/pF still can sneak in.

But I don't have a really strong opinion on this. If general
consensus is that we shall remove deprecated specifiers, then
I'm fine.

> which is about a 3:1 ratio favoring %pS
> 
> so a script to convert all the %pf uses to %ps and %pF uses to %pS
> could be useful.
> 
> There are a few files that appear should not be converted.
> 
> $ git grep -w --name-only -i '%pf'| \
>   grep -vP '^(?:Documentation|tools|include/linux/freezer.h)'| \
>   xargs sed -i -e 's/%pf/%ps/g' -e 's/%pF/%pS/g'
> 
> If that script above is run, it leaves the following patch
> to be applied:

After running this script I still have a bunch of leftovers:

//
// these two are probably not really relevant, but still - %pF/%pf
//
tools/lib/traceevent/event-parse.c: if (asprintf(, "%%pf: 
(NO FORMAT FOUND at %llx)\n", addr) < 0)
tools/lib/traceevent/event-parse.c: if (asprintf(, "%s: %s", "%pf", 
printk->printk) < 0)

arch/um/kernel/sysrq.c: pr_info(" [<%08lx>] %s%pF\n", address, reliable ? "" : 
"? ",
arch/x86/xen/multicalls.c:  printk(KERN_DEBUG "  
call %2d/%d: op=%lu arg=[%lx] result=%ld\t%pF\n",
kernel/async.c: pr_debug("calling  %lli_%pF @ %i\n",
kernel/async.c: pr_debug("initcall %lli_%pF returned 0 after %lld 
usecs\n",

-ss

Re: RFC: script to convert vsprintf uses of %pf to %ps and %pF to %pS

2018-11-25 Thread Sergey Senozhatsky

And I forgot to actually Cc Petr and Steven...
Top-posting.

---

Hi,

Cc-ing Petr and Steven

On (11/25/18 01:13), Joe Perches wrote:
> commit 04b8eb7a4ccd ("symbol lookup: introduce 
> dereference_symbol_descriptor()}"
> 
> deprecated vsprintf extension %pf and %pF.
> 
> There are approximately these total uses of the symbolic
> lookup vsprintf extensions %p[SsFf]:
> 
> $ git grep '".*[^%]%p[SsFf]' | \
>   grep -oh '%p[SsFf]' | sort | uniq -c | sort -rn
> 231 %pS
>  65 %ps
>  60 %pf
>  47 %pF

I didn't bother to remove "pf/pF" because I didn't want to count on:
  a) everyone running checkpatch.pl
  b) everyone testing all printk-s they added

I guess pf/pF still can sneak in.

But I don't have a really strong opinion on this. If general
consensus is that we shall remove deprecated specifiers, then
I'm fine.

> which is about a 3:1 ratio favoring %pS
> 
> so a script to convert all the %pf uses to %ps and %pF uses to %pS
> could be useful.
> 
> There are a few files that appear should not be converted.
> 
> $ git grep -w --name-only -i '%pf'| \
>   grep -vP '^(?:Documentation|tools|include/linux/freezer.h)'| \
>   xargs sed -i -e 's/%pf/%ps/g' -e 's/%pF/%pS/g'
> 
> If that script above is run, it leaves the following patch
> to be applied:

After running this script I still have a bunch of leftovers:

//
// these two are probably not really relevant, but still - %pF/%pf
//
tools/lib/traceevent/event-parse.c: if (asprintf(, "%%pf: 
(NO FORMAT FOUND at %llx)\n", addr) < 0)
tools/lib/traceevent/event-parse.c: if (asprintf(, "%s: %s", "%pf", 
printk->printk) < 0)

arch/um/kernel/sysrq.c: pr_info(" [<%08lx>] %s%pF\n", address, reliable ? "" : 
"? ",
arch/x86/xen/multicalls.c:  printk(KERN_DEBUG "  
call %2d/%d: op=%lu arg=[%lx] result=%ld\t%pF\n",
kernel/async.c: pr_debug("calling  %lli_%pF @ %i\n",
kernel/async.c: pr_debug("initcall %lli_%pF returned 0 after %lld 
usecs\n",

-ss

Re: RFC: script to convert vsprintf uses of %pf to %ps and %pF to %pS

2018-11-25 Thread Sergey Senozhatsky

Hi,

Cc-ing Petr and Steven

On (11/25/18 01:13), Joe Perches wrote:
> commit 04b8eb7a4ccd ("symbol lookup: introduce 
> dereference_symbol_descriptor()}"
> 
> deprecated vsprintf extension %pf and %pF.
> 
> There are approximately these total uses of the symbolic
> lookup vsprintf extensions %p[SsFf]:
> 
> $ git grep '".*[^%]%p[SsFf]' | \
>   grep -oh '%p[SsFf]' | sort | uniq -c | sort -rn
> 231 %pS
>  65 %ps
>  60 %pf
>  47 %pF

I didn't bother to remove "pf/pF" because I didn't want to count on:
  a) everyone running checkpatch.pl
  b) everyone testing all printk-s they added

I guess pf/pF still can sneak in.

But I don't have a really strong opinion on this. If general
consensus is that we shall remove deprecated specifiers, then
I'm fine.

> which is about a 3:1 ratio favoring %pS
> 
> so a script to convert all the %pf uses to %ps and %pF uses to %pS
> could be useful.
> 
> There are a few files that appear should not be converted.
> 
> $ git grep -w --name-only -i '%pf'| \
>   grep -vP '^(?:Documentation|tools|include/linux/freezer.h)'| \
>   xargs sed -i -e 's/%pf/%ps/g' -e 's/%pF/%pS/g'
> 
> If that script above is run, it leaves the following patch
> to be applied:

After running this script I still have a bunch of leftovers:

//
// these two are probably not really relevant, but still - %pF/%pf
//
tools/lib/traceevent/event-parse.c: if (asprintf(, "%%pf: 
(NO FORMAT FOUND at %llx)\n", addr) < 0)
tools/lib/traceevent/event-parse.c: if (asprintf(, "%s: %s", "%pf", 
printk->printk) < 0)

arch/um/kernel/sysrq.c: pr_info(" [<%08lx>] %s%pF\n", address, reliable ? "" : 
"? ",
arch/x86/xen/multicalls.c:  printk(KERN_DEBUG "  
call %2d/%d: op=%lu arg=[%lx] result=%ld\t%pF\n",
kernel/async.c: pr_debug("calling  %lli_%pF @ %i\n",
kernel/async.c: pr_debug("initcall %lli_%pF returned 0 after %lld 
usecs\n",

-ss

Re: RFC: script to convert vsprintf uses of %pf to %ps and %pF to %pS

2018-11-25 Thread Sergey Senozhatsky

Hi,

Cc-ing Petr and Steven

On (11/25/18 01:13), Joe Perches wrote:
> commit 04b8eb7a4ccd ("symbol lookup: introduce 
> dereference_symbol_descriptor()}"
> 
> deprecated vsprintf extension %pf and %pF.
> 
> There are approximately these total uses of the symbolic
> lookup vsprintf extensions %p[SsFf]:
> 
> $ git grep '".*[^%]%p[SsFf]' | \
>   grep -oh '%p[SsFf]' | sort | uniq -c | sort -rn
> 231 %pS
>  65 %ps
>  60 %pf
>  47 %pF

I didn't bother to remove "pf/pF" because I didn't want to count on:
  a) everyone running checkpatch.pl
  b) everyone testing all printk-s they added

I guess pf/pF still can sneak in.

But I don't have a really strong opinion on this. If general
consensus is that we shall remove deprecated specifiers, then
I'm fine.

> which is about a 3:1 ratio favoring %pS
> 
> so a script to convert all the %pf uses to %ps and %pF uses to %pS
> could be useful.
> 
> There are a few files that appear should not be converted.
> 
> $ git grep -w --name-only -i '%pf'| \
>   grep -vP '^(?:Documentation|tools|include/linux/freezer.h)'| \
>   xargs sed -i -e 's/%pf/%ps/g' -e 's/%pF/%pS/g'
> 
> If that script above is run, it leaves the following patch
> to be applied:

After running this script I still have a bunch of leftovers:

//
// these two are probably not really relevant, but still - %pF/%pf
//
tools/lib/traceevent/event-parse.c: if (asprintf(, "%%pf: 
(NO FORMAT FOUND at %llx)\n", addr) < 0)
tools/lib/traceevent/event-parse.c: if (asprintf(, "%s: %s", "%pf", 
printk->printk) < 0)

arch/um/kernel/sysrq.c: pr_info(" [<%08lx>] %s%pF\n", address, reliable ? "" : 
"? ",
arch/x86/xen/multicalls.c:  printk(KERN_DEBUG "  
call %2d/%d: op=%lu arg=[%lx] result=%ld\t%pF\n",
kernel/async.c: pr_debug("calling  %lli_%pF @ %i\n",
kernel/async.c: pr_debug("initcall %lli_%pF returned 0 after %lld 
usecs\n",

-ss

Re: [PATCH] fat: Replaced 11 magic to MSDOS_NAME for volume label

2018-11-25 Thread Sergey Senozhatsky

On (11/24/18 17:01), Carmeli Tamir wrote:
> The FAT file system volume label file stored in the root directory should
> match the volume label field in the FAT boot sector. As consequence, the
> max length of these fields ought to be the same. This patch replaces the
> magic '11' usef in the struct fat_boot_sector with MSDOS_NAME,
> which is used in struct msdos_dir_entry.
>
> Please check the following references:
> 1. Microsoft FAT specification 2005
> (http://read.pudn.com/downloads77/ebook/294884/FAT32%20Spec%20%28SDA%20Contribution%29.pdf).
[..]
> Signed-off-by: Carmeli Tamir 

Looks good to me,
Reviewed-by: Sergey Senozhatsky 

-ss

Re: [PATCH] fat: Replaced 11 magic to MSDOS_NAME for volume label

2018-11-25 Thread Sergey Senozhatsky

On (11/24/18 17:01), Carmeli Tamir wrote:
> The FAT file system volume label file stored in the root directory should
> match the volume label field in the FAT boot sector. As consequence, the
> max length of these fields ought to be the same. This patch replaces the
> magic '11' usef in the struct fat_boot_sector with MSDOS_NAME,
> which is used in struct msdos_dir_entry.
>
> Please check the following references:
> 1. Microsoft FAT specification 2005
> (http://read.pudn.com/downloads77/ebook/294884/FAT32%20Spec%20%28SDA%20Contribution%29.pdf).
[..]
> Signed-off-by: Carmeli Tamir 

Looks good to me,
Reviewed-by: Sergey Senozhatsky 

-ss

Re: [PATCH] printk: Make printk_emit() local function.

2018-11-25 Thread Sergey Senozhatsky

On (11/24/18 13:10), Tetsuo Handa wrote:
> Subject: [PATCH v2] printk: Make printk_emit() local function.
> 
> printk_emit() is called from only devkmsg_write() in the same file.
> Save object size by making it a local function.
> 
> Signed-off-by: Tetsuo Handa 

Looks good to me,

Reviewed-by: Sergey Senozhatsky 

-ss

Re: [PATCH] printk: Make printk_emit() local function.

2018-11-25 Thread Sergey Senozhatsky

On (11/24/18 13:10), Tetsuo Handa wrote:
> Subject: [PATCH v2] printk: Make printk_emit() local function.
> 
> printk_emit() is called from only devkmsg_write() in the same file.
> Save object size by making it a local function.
> 
> Signed-off-by: Tetsuo Handa 

Looks good to me,

Reviewed-by: Sergey Senozhatsky 

-ss

Re: [PATCH 3/3] lockdep: Use line-buffered printk() for lockdep messages.

2018-11-25 Thread Sergey Senozhatsky

On (11/24/18 09:24), Tetsuo Handa wrote:
> >> Steven told me on Plumbers conference that even few initial
> >> characters saved him a day few times.
> > 
> > Yes, and that has happened more than once. I would reboot and retest
> > code that is crashing, and due to a triple fault, the machine would
> > reboot because of some race, and the little output I get from the
> > console would help tremendously.
> > 
> > Remember, debugging the kernel is a lot like forensics, especially when
> > it's from a customer's site. You look at all the evidence that you can
> > get, and sometimes it's just 10 characters in the output that gives you
> > an idea of where things went wrong. I'm really not liking the buffering
> > idea because of this.
> 
> Then, we should not enforce buffering in a way that requires modification of
> printk() callers. That is, we should not ask printk() callers to use their
> private buffer. What we can do is to enable/disable line buffering inside
> printk() depending on the problem the user wants to debug.

Right; overall I tend to agree with what you guys are saying and
I like Petr's "I am more and more wondering if the buffered printk
is worth the effort" comment; and I like Steven's comment on flushes;
and admire Tetsuo's efforts.

I think that printk_seq_buf/printk_buffer was never going to replace
pr_cont() and I never liked the idea. The printk_safe proposal for
lockdep had one OK thing about it - it would pass our normal marshaling
before it would reach the buffering stage. Which means - no buffering
for people who "detest" printk buffering.

This looks better:
printk->vprint_func->{early_printk/printk_safe/vprintk_emit}->buffering

Than this:

pr_buffer->buffering->vprintk_func->{early_printk/printk_safe/vprintk_emit}

Another thing is - printk seq_buf/printk_buffer doesn't really solve any
problem. People, who can use seq_buf/char buf[256]/etc. buffering, already
can do so; people who cannot - won't switch to a new buffering printk
anyway.

The bad thing about printk_safe proposal is that it's per-CPU; which
is OK for some paths (like lockdep), but not OK in general (e.g. OOM).

IMO, try_buffered_printk() attempts to solve the problem at the right
place - printk. And it does not break our normal marshaling, so we don't
"fix" printk users and we keep people, who does vprintk_func->early_printk
thing, happy. So I don't dislike try_buffered_printk() approach. And unlike
before, now we are talking about a single line buffering.

If we'd walk this way, I would prefer to NOT introduce any structs and
any new code, or any new "split and log_store() in the middle" rules. Just
a bunch of "struct cont" buffers:

static struct cont conts[N];

and cont_add()/cont_flush() to handle pr_cont, with all the flushes it
does; but on a per-context basis.
conts[0] should serve as a fallback cont buffer, in case if there are
no available cont buffers left. flush_on_panic() is still miserable,
for sure; probably we can do something about it.

Or... Instead.
We can just leave pr_cont() alone for now. And make it possible to
reconstruct messages - IOW, inject some info to printk messages. We
do this at Samsung (inject CPU number at the beginning of every
message. `cat serial.0 | grep "\[1\]"` to grep for all messages from
CPU1). Probably this would be the simplest thing.

> Also, we should allow disabling "struct cont" depending on the problem (in
> order to allow flushing the 10 characters in the "cont" buffer).
>
> By the way, is the comment
>
>   /*
>* Continuation lines are buffered, and not committed to the record buffer
>* until the line is complete, or a race forces it. The line fragments
>* though, are printed immediately to the consoles to ensure everything has
>* reached the console in case of a kernel crash.
>*/

printk does not do this anymore; you are right.

> appropriate despite we don't call cont_flush() upon a kernel crash?

I tend to count on flush_on_panic more than on a "last moment"
pr_cont->cont_flush(), which was guaranteed to happen immediately
only with early_con.

A kernel crash usually has enough pr_emerg/printk-s to force cont flush.
Even pr_info() will do. We look at the loglevel much later; so even
messages which never make it to the consoles still flush cont buffer.

-ss

Re: [PATCH 3/3] lockdep: Use line-buffered printk() for lockdep messages.

2018-11-25 Thread Sergey Senozhatsky

On (11/24/18 09:24), Tetsuo Handa wrote:
> >> Steven told me on Plumbers conference that even few initial
> >> characters saved him a day few times.
> > 
> > Yes, and that has happened more than once. I would reboot and retest
> > code that is crashing, and due to a triple fault, the machine would
> > reboot because of some race, and the little output I get from the
> > console would help tremendously.
> > 
> > Remember, debugging the kernel is a lot like forensics, especially when
> > it's from a customer's site. You look at all the evidence that you can
> > get, and sometimes it's just 10 characters in the output that gives you
> > an idea of where things went wrong. I'm really not liking the buffering
> > idea because of this.
> 
> Then, we should not enforce buffering in a way that requires modification of
> printk() callers. That is, we should not ask printk() callers to use their
> private buffer. What we can do is to enable/disable line buffering inside
> printk() depending on the problem the user wants to debug.

Right; overall I tend to agree with what you guys are saying and
I like Petr's "I am more and more wondering if the buffered printk
is worth the effort" comment; and I like Steven's comment on flushes;
and admire Tetsuo's efforts.

I think that printk_seq_buf/printk_buffer was never going to replace
pr_cont() and I never liked the idea. The printk_safe proposal for
lockdep had one OK thing about it - it would pass our normal marshaling
before it would reach the buffering stage. Which means - no buffering
for people who "detest" printk buffering.

This looks better:
printk->vprint_func->{early_printk/printk_safe/vprintk_emit}->buffering

Than this:

pr_buffer->buffering->vprintk_func->{early_printk/printk_safe/vprintk_emit}

Another thing is - printk seq_buf/printk_buffer doesn't really solve any
problem. People, who can use seq_buf/char buf[256]/etc. buffering, already
can do so; people who cannot - won't switch to a new buffering printk
anyway.

The bad thing about printk_safe proposal is that it's per-CPU; which
is OK for some paths (like lockdep), but not OK in general (e.g. OOM).

IMO, try_buffered_printk() attempts to solve the problem at the right
place - printk. And it does not break our normal marshaling, so we don't
"fix" printk users and we keep people, who does vprintk_func->early_printk
thing, happy. So I don't dislike try_buffered_printk() approach. And unlike
before, now we are talking about a single line buffering.

If we'd walk this way, I would prefer to NOT introduce any structs and
any new code, or any new "split and log_store() in the middle" rules. Just
a bunch of "struct cont" buffers:

static struct cont conts[N];

and cont_add()/cont_flush() to handle pr_cont, with all the flushes it
does; but on a per-context basis.
conts[0] should serve as a fallback cont buffer, in case if there are
no available cont buffers left. flush_on_panic() is still miserable,
for sure; probably we can do something about it.

Or... Instead.
We can just leave pr_cont() alone for now. And make it possible to
reconstruct messages - IOW, inject some info to printk messages. We
do this at Samsung (inject CPU number at the beginning of every
message. `cat serial.0 | grep "\[1\]"` to grep for all messages from
CPU1). Probably this would be the simplest thing.

> Also, we should allow disabling "struct cont" depending on the problem (in
> order to allow flushing the 10 characters in the "cont" buffer).
>
> By the way, is the comment
>
>   /*
>* Continuation lines are buffered, and not committed to the record buffer
>* until the line is complete, or a race forces it. The line fragments
>* though, are printed immediately to the consoles to ensure everything has
>* reached the console in case of a kernel crash.
>*/

printk does not do this anymore; you are right.

> appropriate despite we don't call cont_flush() upon a kernel crash?

I tend to count on flush_on_panic more than on a "last moment"
pr_cont->cont_flush(), which was guaranteed to happen immediately
only with early_con.

A kernel crash usually has enough pr_emerg/printk-s to force cont flush.
Even pr_info() will do. We look at the loglevel much later; so even
messages which never make it to the consoles still flush cont buffer.

-ss

Re: [PATCH] s390: Remove obsolete bust_spinlock() implementation

2018-11-22 Thread Sergey Senozhatsky

On (11/22/18 15:15), Petr Mladek wrote:
> The commit cefc8be82403cf ("Consolidate bust_spinlocks()") kept
> the s390-specific implementation because of the absence of CONFIG_VT.
> In fact, the only difference was calling console_unblank() instead of
> unblank_screen().
> 
> The common implementation in lib/bust_spinlocks.c started to call
> unblank_screen() explicitly since the commit b61312d353da187
> ("oops handling: ensure that any oops is flushed to the mtdoops
> console").
> 
> As a result, the custom implementation is not longer necessary.
> And we could get all the other improvements of the common
> implementation for free.

I believe I sent a similar patch several weeks ago and it's
in s390 patch queue as of now, waiting for the next merge
window.

lkml.kernel.org/r/20181025081108.GB26561@osiris

-ss

Re: [PATCH] s390: Remove obsolete bust_spinlock() implementation

2018-11-22 Thread Sergey Senozhatsky

On (11/22/18 15:15), Petr Mladek wrote:
> The commit cefc8be82403cf ("Consolidate bust_spinlocks()") kept
> the s390-specific implementation because of the absence of CONFIG_VT.
> In fact, the only difference was calling console_unblank() instead of
> unblank_screen().
> 
> The common implementation in lib/bust_spinlocks.c started to call
> unblank_screen() explicitly since the commit b61312d353da187
> ("oops handling: ensure that any oops is flushed to the mtdoops
> console").
> 
> As a result, the custom implementation is not longer necessary.
> And we could get all the other improvements of the common
> implementation for free.

I believe I sent a similar patch several weeks ago and it's
in s390 patch queue as of now, waiting for the next merge
window.

lkml.kernel.org/r/20181025081108.GB26561@osiris

-ss

Re: [PATCH 0/6] lib/lzo: performance improvements

2018-11-22 Thread Sergey Senozhatsky

On (11/21/18 12:06), Dave Rodgman wrote:
> 
> Overall, performance is improved by around 1.1 - 4.8x (data-dependent: data
> with many zero runs shows higher improvement). Under real-world testing with
> zram, time spent in (de)compression during swapping is reduced by around 27%.

Impressive.

I think we usually Cc Greg Kroah-Hartman and Andrew Morton on
lzo/lz4 patches.

> The graph below shows the weighted round-trip throughput of lzo, lz4 and
> lzo-rle, for randomly generated 4k chunks of data with varying levels of
> entropy. (To calculate weighted round-trip throughput, compression performance
> is emphasised to reflect the fact that zram does around 2.25x more compression
> than decompression.

Right. The number is data dependent. Not all swapped out pages can be
compressed; compressed pages that end up being >= zs_huge_class_size() are
considered incompressible and stored as it.

I'd say that on my setups around 50-60% of pages are incompressible.

-ss

Re: [PATCH 0/6] lib/lzo: performance improvements

2018-11-22 Thread Sergey Senozhatsky

On (11/21/18 12:06), Dave Rodgman wrote:
> 
> Overall, performance is improved by around 1.1 - 4.8x (data-dependent: data
> with many zero runs shows higher improvement). Under real-world testing with
> zram, time spent in (de)compression during swapping is reduced by around 27%.

Impressive.

I think we usually Cc Greg Kroah-Hartman and Andrew Morton on
lzo/lz4 patches.

> The graph below shows the weighted round-trip throughput of lzo, lz4 and
> lzo-rle, for randomly generated 4k chunks of data with varying levels of
> entropy. (To calculate weighted round-trip throughput, compression performance
> is emphasised to reflect the fact that zram does around 2.25x more compression
> than decompression.

Right. The number is data dependent. Not all swapped out pages can be
compressed; compressed pages that end up being >= zs_huge_class_size() are
considered incompressible and stored as it.

I'd say that on my setups around 50-60% of pages are incompressible.

-ss

Re: [PATCH 4/6] zram: support idle page writeback

2018-11-21 Thread Sergey Senozhatsky

On (11/22/18 15:31), Minchan Kim wrote:
> > 
> > I got what you mean now. Let's call it as "incompressible page wrieback"
> > to prevent confusing.
> > 
> > "incompressible page writeback" would be orthgonal feature. The goal is
> > "let's save memory at the cost of *latency*". If the page is swapped-in
> > soon, it's unfortunate. However, the design expects once it's swapped out,
> > it means it's non-workingset so soonish swappined-in would be rather not
> > many, theoritically compared to other workingset.
> > If's it's too frequent, it means system were heavily overcommitted.
> 
> Havid said, I agree it's not a good idea to enable incompressible page
> writeback with idle page writeback. If you don't oppose, I want to add
> new knob to "enable incompressible page writeback" so by default,
> although we enable CONFIG_ZRAM_WRITEBACK, incompressible page writeback
> is off until we enable the knob.
> It would make some regressison if someone have used the feature but
> I guess we are not too late.
> 
> What do you think?

Yes, totally works for me!

"IDLE writeback" is superior to "incompressible writeback".

"incompressible writeback" is completely unpredictable and
uncontrollable; it depens on data patterns and compression algorithms.
While "IDLE writeback" is predictable.

I even suspect, that, *ideally*, we can remove "incompressible
writeback". "IDLE pages" is a super set which also includes
"incompressible" pages. So, technically, we still can do
"incompressible writeback" from "IDLE writeback" path; but a much
more reasonable one, based on a page idling period.

I understand that you want to keep "direct incompressible writeback"
around. ZRAM is especially popular on devices which do suffer from
flash wearout, so I can see "incompressible writeback" path becoming
a dead code, long term.

-ss

Re: [PATCH 4/6] zram: support idle page writeback

2018-11-21 Thread Sergey Senozhatsky

On (11/22/18 15:31), Minchan Kim wrote:
> > 
> > I got what you mean now. Let's call it as "incompressible page wrieback"
> > to prevent confusing.
> > 
> > "incompressible page writeback" would be orthgonal feature. The goal is
> > "let's save memory at the cost of *latency*". If the page is swapped-in
> > soon, it's unfortunate. However, the design expects once it's swapped out,
> > it means it's non-workingset so soonish swappined-in would be rather not
> > many, theoritically compared to other workingset.
> > If's it's too frequent, it means system were heavily overcommitted.
> 
> Havid said, I agree it's not a good idea to enable incompressible page
> writeback with idle page writeback. If you don't oppose, I want to add
> new knob to "enable incompressible page writeback" so by default,
> although we enable CONFIG_ZRAM_WRITEBACK, incompressible page writeback
> is off until we enable the knob.
> It would make some regressison if someone have used the feature but
> I guess we are not too late.
> 
> What do you think?

Yes, totally works for me!

"IDLE writeback" is superior to "incompressible writeback".

"incompressible writeback" is completely unpredictable and
uncontrollable; it depens on data patterns and compression algorithms.
While "IDLE writeback" is predictable.

I even suspect, that, *ideally*, we can remove "incompressible
writeback". "IDLE pages" is a super set which also includes
"incompressible" pages. So, technically, we still can do
"incompressible writeback" from "IDLE writeback" path; but a much
more reasonable one, based on a page idling period.

I understand that you want to keep "direct incompressible writeback"
around. ZRAM is especially popular on devices which do suffer from
flash wearout, so I can see "incompressible writeback" path becoming
a dead code, long term.

-ss

Re: [PATCH 3/6] zram: introduce ZRAM_IDLE flag

2018-11-21 Thread Sergey Senozhatsky

On (11/22/18 14:11), Minchan Kim wrote:
> 
> It was a option when I imagined this idea first but problem from product
> division was memory waste of ac_time for every zram table.

OK, I see.

-ss

Re: [PATCH 3/6] zram: introduce ZRAM_IDLE flag

2018-11-21 Thread Sergey Senozhatsky

On (11/22/18 14:11), Minchan Kim wrote:
> 
> It was a option when I imagined this idea first but problem from product
> division was memory waste of ac_time for every zram table.

OK, I see.

-ss

Re: [PATCH 4/6] zram: support idle page writeback

2018-11-21 Thread Sergey Senozhatsky

On (11/22/18 14:04), Minchan Kim wrote:
> 
> > additionally, it's too simple. It writes-back pages which can be
> > swapped in immediately; which basically means that we do pointless
> > PAGE_SIZE writes to a device which doesn't really like pointless
> > writes.
> 
> This patchset aims for *IDLE page* writeback and you can define
> what is IDLE page by yourself. It doesn't do pointless writeback.
> > 
> > It's a whole different story with idle, compressible pages writeback.
> 
> I don't understand your point.

Seems you misunderstood me. I'm not saying that IDLE writeback is bad.
On the contrary, I think IDLE writeback is x100 better than writeback
which we currently have.

The "pointless writeback" comment was about the existing writeback,
when we WB pages which we couldn't compress. We can have a relative
huge percentage of incompressible pages, and not all of them will end
up being IDLE:
 - we swap out page
 - can't compress it
 - writeback PAGE_SIZE
 - swap it in two seconds later

// as example //

IDLE page writeback (this patch set) looks to me like a really
significant improvement. Especially if we can writeback compressed
objects and do, e.g., 300-bytes writes instead of PAGE_SIZE writes.

-ss

Re: [PATCH 4/6] zram: support idle page writeback

2018-11-21 Thread Sergey Senozhatsky

On (11/22/18 14:04), Minchan Kim wrote:
> 
> > additionally, it's too simple. It writes-back pages which can be
> > swapped in immediately; which basically means that we do pointless
> > PAGE_SIZE writes to a device which doesn't really like pointless
> > writes.
> 
> This patchset aims for *IDLE page* writeback and you can define
> what is IDLE page by yourself. It doesn't do pointless writeback.
> > 
> > It's a whole different story with idle, compressible pages writeback.
> 
> I don't understand your point.

Seems you misunderstood me. I'm not saying that IDLE writeback is bad.
On the contrary, I think IDLE writeback is x100 better than writeback
which we currently have.

The "pointless writeback" comment was about the existing writeback,
when we WB pages which we couldn't compress. We can have a relative
huge percentage of incompressible pages, and not all of them will end
up being IDLE:
 - we swap out page
 - can't compress it
 - writeback PAGE_SIZE
 - swap it in two seconds later

// as example //

IDLE page writeback (this patch set) looks to me like a really
significant improvement. Especially if we can writeback compressed
objects and do, e.g., 300-bytes writes instead of PAGE_SIZE writes.

-ss

Re: [PATCH 4/6] zram: support idle page writeback

2018-11-21 Thread Sergey Senozhatsky

On (11/21/18 05:34), Minchan Kim wrote:
> > 
> > Just a thought,
> > 
> > I wonder if it will make sense (and if it will be possible) to writeback
> > idle _compressed_ objects. Right now we decompress, say, a perfectly
> > fine 400-byte compressed object to a PAGE_SIZE-d object and then push
> > it to the WB device. In this particular case it has a x10 bigger IO
> > pressure on flash. If we can write/read compressed object then we
> > will write and read 400-bytes, instead of PAGE_SIZE.
> 
> Although it has pros/cons, that's the my final goal although it would
> add much complicated stuffs. Sometime, we should have the feature.

So you plan to switch to "compressed objects" writeback?

> However, I want to go simple one first which is very valuable, too.

Flash wearout is a serious problem; maybe less of a problem on smart
phones, but much bigger on TVs and on other embedded devices that have
lifespans of 5+ years. With "writeback idle compressed" we can remove
the existing "writeback incompressible pages" and writeback only
"idle, compressed" pages.

The existing incompressible writeback is way too aggressive, and,
additionally, it's too simple. It writes-back pages which can be
swapped in immediately; which basically means that we do pointless
PAGE_SIZE writes to a device which doesn't really like pointless
writes.

It's a whole different story with idle, compressible pages writeback.

-ss

Re: [PATCH 4/6] zram: support idle page writeback

2018-11-21 Thread Sergey Senozhatsky

On (11/21/18 05:34), Minchan Kim wrote:
> > 
> > Just a thought,
> > 
> > I wonder if it will make sense (and if it will be possible) to writeback
> > idle _compressed_ objects. Right now we decompress, say, a perfectly
> > fine 400-byte compressed object to a PAGE_SIZE-d object and then push
> > it to the WB device. In this particular case it has a x10 bigger IO
> > pressure on flash. If we can write/read compressed object then we
> > will write and read 400-bytes, instead of PAGE_SIZE.
> 
> Although it has pros/cons, that's the my final goal although it would
> add much complicated stuffs. Sometime, we should have the feature.

So you plan to switch to "compressed objects" writeback?

> However, I want to go simple one first which is very valuable, too.

Flash wearout is a serious problem; maybe less of a problem on smart
phones, but much bigger on TVs and on other embedded devices that have
lifespans of 5+ years. With "writeback idle compressed" we can remove
the existing "writeback incompressible pages" and writeback only
"idle, compressed" pages.

The existing incompressible writeback is way too aggressive, and,
additionally, it's too simple. It writes-back pages which can be
swapped in immediately; which basically means that we do pointless
PAGE_SIZE writes to a device which doesn't really like pointless
writes.

It's a whole different story with idle, compressible pages writeback.

-ss

Re: [PATCH 4/6] zram: support idle page writeback

2018-11-20 Thread Sergey Senozhatsky

On (11/16/18 16:20), Minchan Kim wrote:
> + zram_set_flag(zram, index, ZRAM_UNDER_WB);
> + zram_slot_unlock(zram, index);
> + if (zram_bvec_read(zram, , index, 0, NULL)) {
> + zram_slot_lock(zram, index);
> + zram_clear_flag(zram, index, ZRAM_UNDER_WB);
> + zram_slot_unlock(zram, index);
> + continue;
> + }
> +
> + bio_init(, _vec, 1);
> + bio_set_dev(, zram->bdev);
> + bio.bi_iter.bi_sector = blk_idx * (PAGE_SIZE >> 9);
> + bio.bi_opf = REQ_OP_WRITE | REQ_SYNC;
> +
> + bio_add_page(, bvec.bv_page, bvec.bv_len,
> + bvec.bv_offset);
> + /*
> +  * XXX: A single page IO would be inefficient for write
> +  * but it would be not bad as starter.
> +  */
> + ret = submit_bio_wait();
> + if (ret) {
> + zram_slot_lock(zram, index);
> + zram_clear_flag(zram, index, ZRAM_UNDER_WB);
> + zram_slot_unlock(zram, index);
> + continue;
> + }

Just a thought,

I wonder if it will make sense (and if it will be possible) to writeback
idle _compressed_ objects. Right now we decompress, say, a perfectly
fine 400-byte compressed object to a PAGE_SIZE-d object and then push
it to the WB device. In this particular case it has a x10 bigger IO
pressure on flash. If we can write/read compressed object then we
will write and read 400-bytes, instead of PAGE_SIZE.

-ss

Re: [PATCH 4/6] zram: support idle page writeback

2018-11-20 Thread Sergey Senozhatsky

On (11/16/18 16:20), Minchan Kim wrote:
> + zram_set_flag(zram, index, ZRAM_UNDER_WB);
> + zram_slot_unlock(zram, index);
> + if (zram_bvec_read(zram, , index, 0, NULL)) {
> + zram_slot_lock(zram, index);
> + zram_clear_flag(zram, index, ZRAM_UNDER_WB);
> + zram_slot_unlock(zram, index);
> + continue;
> + }
> +
> + bio_init(, _vec, 1);
> + bio_set_dev(, zram->bdev);
> + bio.bi_iter.bi_sector = blk_idx * (PAGE_SIZE >> 9);
> + bio.bi_opf = REQ_OP_WRITE | REQ_SYNC;
> +
> + bio_add_page(, bvec.bv_page, bvec.bv_len,
> + bvec.bv_offset);
> + /*
> +  * XXX: A single page IO would be inefficient for write
> +  * but it would be not bad as starter.
> +  */
> + ret = submit_bio_wait();
> + if (ret) {
> + zram_slot_lock(zram, index);
> + zram_clear_flag(zram, index, ZRAM_UNDER_WB);
> + zram_slot_unlock(zram, index);
> + continue;
> + }

Just a thought,

I wonder if it will make sense (and if it will be possible) to writeback
idle _compressed_ objects. Right now we decompress, say, a perfectly
fine 400-byte compressed object to a PAGE_SIZE-d object and then push
it to the WB device. In this particular case it has a x10 bigger IO
pressure on flash. If we can write/read compressed object then we
will write and read 400-bytes, instead of PAGE_SIZE.

-ss

Re: [PATCH 3/6] zram: introduce ZRAM_IDLE flag

2018-11-19 Thread Sergey Senozhatsky

Hello,

On (11/16/18 16:20), Minchan Kim wrote:
[..]
> +static ssize_t idle_store(struct device *dev,
> + struct device_attribute *attr, const char *buf, size_t len)
> +{
> + struct zram *zram = dev_to_zram(dev);
> + unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
> + int index;
> +
> + down_read(>init_lock);
> + if (!init_done(zram)) {
> + up_read(>init_lock);
> + return -EINVAL;
> + }
> +
> + for (index = 0; index < nr_pages; index++) {
> + zram_slot_lock(zram, index);
> + if (!zram_allocated(zram, index))
> + goto next;
> +
> + zram_set_flag(zram, index, ZRAM_IDLE);
> +next:
> + zram_slot_unlock(zram, index);
> + }
> +
> + up_read(>init_lock);
> +
> + return len;
> +}

This is one way of doing it.

The other one could, probabaly, be a bit more friendly to the cache
lines and CPU cycles. Basically, have a static timestamp variable,
which would keep the timestamp of last idle_store().

static idle_snapshot_ts;

static ssize_t idle_store(struct device *dev,
  struct device_attribute *attr,
  const char *buf, size_t len)
{
idle_snapshot_ts = ktime();
}

And then in read_block_state() compare handle access time and
idle_snapshot_ts (if it's not 0). If the page was not modified/access
since the last idle_snapshot_ts (handle access time <= idle_snapshot_ts),
then it's idle, otherwise (handle access time > idle_snapshot_ts) it's
not idle.

Would this do the trick?

-ss

Re: [PATCH 3/6] zram: introduce ZRAM_IDLE flag

2018-11-19 Thread Sergey Senozhatsky

Hello,

On (11/16/18 16:20), Minchan Kim wrote:
[..]
> +static ssize_t idle_store(struct device *dev,
> + struct device_attribute *attr, const char *buf, size_t len)
> +{
> + struct zram *zram = dev_to_zram(dev);
> + unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
> + int index;
> +
> + down_read(>init_lock);
> + if (!init_done(zram)) {
> + up_read(>init_lock);
> + return -EINVAL;
> + }
> +
> + for (index = 0; index < nr_pages; index++) {
> + zram_slot_lock(zram, index);
> + if (!zram_allocated(zram, index))
> + goto next;
> +
> + zram_set_flag(zram, index, ZRAM_IDLE);
> +next:
> + zram_slot_unlock(zram, index);
> + }
> +
> + up_read(>init_lock);
> +
> + return len;
> +}

This is one way of doing it.

The other one could, probabaly, be a bit more friendly to the cache
lines and CPU cycles. Basically, have a static timestamp variable,
which would keep the timestamp of last idle_store().

static idle_snapshot_ts;

static ssize_t idle_store(struct device *dev,
  struct device_attribute *attr,
  const char *buf, size_t len)
{
idle_snapshot_ts = ktime();
}

And then in read_block_state() compare handle access time and
idle_snapshot_ts (if it's not 0). If the page was not modified/access
since the last idle_snapshot_ts (handle access time <= idle_snapshot_ts),
then it's idle, otherwise (handle access time > idle_snapshot_ts) it's
not idle.

Would this do the trick?

-ss

Re: [PATCH v6 1/3] printk: Add line-buffered printk() API.

2018-11-12 Thread Sergey Senozhatsky

On (11/09/18 15:10), Petr Mladek wrote:
> > 
> > If I'm not mistaken, this is for the futute "printk injection" work.
> 
> The above code only tries to push complete lines to the main log buffer
> and consoles ASAP. It sounds like a Good Idea(tm).

Probably it is. So *quite likely* I'm wrong here.

Hmm... Thinking out loud.
At the same time, splitting a single logbuf entry gives a chance to other
events to mix in. Example:
pr_cont("Foo:");
pr_cont("\nbar");
pr_cont("\n");

Previously it could been stored in one logbuf entry (one logstore,
one console_trylock_spinning), which means that it could have been
printed in one go:

call_console_drivers()
spin_trylock_irqsave(>lock, flags);
uart_console_write( "Foo:\nbar\n");
spin_unlock_irqrestore(>lock, flags);

While with buffered printk, it will be store in two logbuf entries,
and printed in two goes:

call_console_drivers()
spin_trylock_irqsave(>lock, flags);
uart_console_write( "Foo:\nbar\n");
spin_unlock_irqrestore(>lock, flags);

<< ... console driver IRQ, TX port->state->xmit chars ... >>>

call_console_drivers()
spin_trylock_irqsave(>lock, flags);
uart_console_write( "Foo:\nbar\n");
spin_unlock_irqrestore(>lock, flags);

So, *technically*, we now let more events to happens between printk-s.

But, let's look at some of pr_cont() usage examples.
E.g. dump() from arch/h8300/kernel/traps.c. The code in question looks
as follows:

pr_info("\nKERNEL STACK:");
tp = ((unsigned char *) fp) - 0x40;
for (sp = (unsigned long *) tp, i = 0; (i < 0x40);  i += 4) {
if ((i % 0x10) == 0)
pr_info("\n%08x: ", (int) (tp + i));
pr_info("%08x ", (int) *sp++);
}
pr_info("\n");

dmesg

[   15.260099] :  0010  0040  0090
   0010:0100  0190  0240  0310
   0020:0400  0510  0640  0790
   0030:0900  0a90  0c40  0e10

So we have the entire KERNEL STACK stored as a single line, in
a single logbuf entry.

cat /dev/kmsg

4,687,15260099,c;\x0a:  0010  0040  0090  \x0a0010: 
   0100  0190  0240  0310  \x0a0020:0400  0510  
0640  0790  \x0a0030:0900  0a90  0c40  0e10

Shall we consider this as a "reference" representation: data that
pr_cont(), ideally, would have stored as a single logbuf entry? And
then compare the "reference" representation and what buffered printk
does.

Buffered printk always stores this as several lines, IOW several
logbuf entries.

cat /dev/kmsg

4,690,15260152,-;:  0010  0040  0090  
4,691,15260154,-;0010:0100  0190  0240  0310  
4,692,15260157,-;0020:0400  0510  0640  0790  
4,694,15260161,-;0030:0900  0a90  0c40  0e10  

So if we will have concurrent printk()-s happening on other CPUs,
then the KERNEL STACK data block still can be a bit hard to read:

[   15.260152] :  0010  0040  0090  
 ... foo bar foo bar
 ... foo bar foo bar
...
[   15.260154] 0010:0100  0190  0240  0310  
 ... foo bar foo bar
 ... foo bar foo bar
...
  ... and so on; you got the idea.

> No, please note that the for cycle searches for '\n' from the end
> of the string.

You are right, I didn't notice that. Indeed.

Tetsuo, lockdep report with buffered printks looks a bit different:

 kernel:  Possible unsafe locking scenario:
 kernel:CPU0CPU1
 kernel:
 kernel:   lock(bar_lock);
 kernel:lock(
 kernel: foo_lock);
 kernel:lock(bar_lock);
 kernel:   lock(foo_lock);
 kernel: 

> > len = vsprintf();
> > if (len && text[len - 1] == '\n' || overflow)
> > flush();
> 
> I had the same idea. Tetsuo ignored it. I looked for more arguments
> and found that '\n' is used in the middle of several pr_cont()
> calls

OK, so we probably can have that new semantics. But we might split
something that possibly was meant to be a single line with a bunch
of \n in the middle.

-ss

Re: [PATCH v6 1/3] printk: Add line-buffered printk() API.

2018-11-12 Thread Sergey Senozhatsky

On (11/09/18 15:10), Petr Mladek wrote:
> > 
> > If I'm not mistaken, this is for the futute "printk injection" work.
> 
> The above code only tries to push complete lines to the main log buffer
> and consoles ASAP. It sounds like a Good Idea(tm).

Probably it is. So *quite likely* I'm wrong here.

Hmm... Thinking out loud.
At the same time, splitting a single logbuf entry gives a chance to other
events to mix in. Example:
pr_cont("Foo:");
pr_cont("\nbar");
pr_cont("\n");

Previously it could been stored in one logbuf entry (one logstore,
one console_trylock_spinning), which means that it could have been
printed in one go:

call_console_drivers()
spin_trylock_irqsave(>lock, flags);
uart_console_write( "Foo:\nbar\n");
spin_unlock_irqrestore(>lock, flags);

While with buffered printk, it will be store in two logbuf entries,
and printed in two goes:

call_console_drivers()
spin_trylock_irqsave(>lock, flags);
uart_console_write( "Foo:\nbar\n");
spin_unlock_irqrestore(>lock, flags);

<< ... console driver IRQ, TX port->state->xmit chars ... >>>

call_console_drivers()
spin_trylock_irqsave(>lock, flags);
uart_console_write( "Foo:\nbar\n");
spin_unlock_irqrestore(>lock, flags);

So, *technically*, we now let more events to happens between printk-s.

But, let's look at some of pr_cont() usage examples.
E.g. dump() from arch/h8300/kernel/traps.c. The code in question looks
as follows:

pr_info("\nKERNEL STACK:");
tp = ((unsigned char *) fp) - 0x40;
for (sp = (unsigned long *) tp, i = 0; (i < 0x40);  i += 4) {
if ((i % 0x10) == 0)
pr_info("\n%08x: ", (int) (tp + i));
pr_info("%08x ", (int) *sp++);
}
pr_info("\n");

dmesg

[   15.260099] :  0010  0040  0090
   0010:0100  0190  0240  0310
   0020:0400  0510  0640  0790
   0030:0900  0a90  0c40  0e10

So we have the entire KERNEL STACK stored as a single line, in
a single logbuf entry.

cat /dev/kmsg

4,687,15260099,c;\x0a:  0010  0040  0090  \x0a0010: 
   0100  0190  0240  0310  \x0a0020:0400  0510  
0640  0790  \x0a0030:0900  0a90  0c40  0e10

Shall we consider this as a "reference" representation: data that
pr_cont(), ideally, would have stored as a single logbuf entry? And
then compare the "reference" representation and what buffered printk
does.

Buffered printk always stores this as several lines, IOW several
logbuf entries.

cat /dev/kmsg

4,690,15260152,-;:  0010  0040  0090  
4,691,15260154,-;0010:0100  0190  0240  0310  
4,692,15260157,-;0020:0400  0510  0640  0790  
4,694,15260161,-;0030:0900  0a90  0c40  0e10  

So if we will have concurrent printk()-s happening on other CPUs,
then the KERNEL STACK data block still can be a bit hard to read:

[   15.260152] :  0010  0040  0090  
 ... foo bar foo bar
 ... foo bar foo bar
...
[   15.260154] 0010:0100  0190  0240  0310  
 ... foo bar foo bar
 ... foo bar foo bar
...
  ... and so on; you got the idea.

> No, please note that the for cycle searches for '\n' from the end
> of the string.

You are right, I didn't notice that. Indeed.

Tetsuo, lockdep report with buffered printks looks a bit different:

 kernel:  Possible unsafe locking scenario:
 kernel:CPU0CPU1
 kernel:
 kernel:   lock(bar_lock);
 kernel:lock(
 kernel: foo_lock);
 kernel:lock(bar_lock);
 kernel:   lock(foo_lock);
 kernel: 

> > len = vsprintf();
> > if (len && text[len - 1] == '\n' || overflow)
> > flush();
> 
> I had the same idea. Tetsuo ignored it. I looked for more arguments
> and found that '\n' is used in the middle of several pr_cont()
> calls

OK, so we probably can have that new semantics. But we might split
something that possibly was meant to be a single line with a bunch
of \n in the middle.

-ss

Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-11-08 Thread Sergey Senozhatsky

On (11/01/18 09:05), Daniel Wang wrote:
> > Another deadlock scenario could be the following one:
> >
> > printk()
> >  console_trylock()
> >   down_trylock()
> >raw_spin_lock_irqsave(>lock, flags)
> > 
> >  panic()
> >   console_flush_on_panic()
> >console_trylock()
> > raw_spin_lock_irqsave(>lock, flags)// deadlock
> >
> > There are no patches addressing this one at the moment. And it's
> > unclear if you are hitting this scenario.
> 
> I am not sure, but Steven's patches did make the deadlock I saw go away...

You certainly can find cases when "busy spin on console_sem owner" logic
can reduce some possibilities.

But spin_lock(); NMI; spin_lock(); code is still in the kernel.

> A little swamped by other things lately but I'll run a test with it.
> If it works, would you recommend taking your patch alone

Let's first figure out if it works.

-ss

Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-11-08 Thread Sergey Senozhatsky

On (11/01/18 09:05), Daniel Wang wrote:
> > Another deadlock scenario could be the following one:
> >
> > printk()
> >  console_trylock()
> >   down_trylock()
> >raw_spin_lock_irqsave(>lock, flags)
> > 
> >  panic()
> >   console_flush_on_panic()
> >console_trylock()
> > raw_spin_lock_irqsave(>lock, flags)// deadlock
> >
> > There are no patches addressing this one at the moment. And it's
> > unclear if you are hitting this scenario.
> 
> I am not sure, but Steven's patches did make the deadlock I saw go away...

You certainly can find cases when "busy spin on console_sem owner" logic
can reduce some possibilities.

But spin_lock(); NMI; spin_lock(); code is still in the kernel.

> A little swamped by other things lately but I'll run a test with it.
> If it works, would you recommend taking your patch alone

Let's first figure out if it works.

-ss

Re: [PATCH 3/3] lockdep: Use line-buffered printk() for lockdep messages.

2018-11-08 Thread Sergey Senozhatsky

On (11/08/18 20:37), Tetsuo Handa wrote:
> On 2018/11/08 13:45, Sergey Senozhatsky wrote:
> > So, can we just do the following? /* a sketch */
> > 
> > lockdep.c
> > printk_safe_enter_irqsave(flags);
> > lockdep_report();
> > printk_safe_exit_irqrestore(flags);
> 
> If buffer size were large enough to hold messages from out_of_memory(),
> I would like to use it for out_of_memory() because delaying SIGKILL
> due to waiting for printk() to complete is not good. Surely we can't
> hold all messages because amount from dump_tasks() is unpredictable.
> Maybe we can hold all messages from dump_header() except dump_tasks().
> 
> But isn't it essentially same with
> http://lkml.kernel.org/r/1493560477-3016-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp
> which Linus does not want?

Dunno. I guess we still haven't heard from Linus because he did quite a good
job setting up his 'email filters' ;)

Converting the existing users to buffered printk is not so simple.
Apparently there are different paths; some can afford buffered printk, some
cannot. Some of 'cont' users tend to get advantage of transparent 'cont'
context: start 'cont' output in function A: A()->pr_cont(), continue it in
B: A()->B()->pr_cont(), and then in C: A()->B()->C()->pr_cont(), and
finally flush in A: A()->pr_cont(\n). And then some paths have the
early_printk requirement. We can break the 'transparent cont' by passing
buffer pointers around [it can get a bit hairy; looking at lockdep patch],
but early_printk requirement is a different beast.

So in my email I was not advertising printk_safe as a "buffered printk for
everyone", I was just talking about lockdep. It's a bit doubtful that Peter
will ACK lockdep transition to buffered printk.

-ss

Re: [PATCH 3/3] lockdep: Use line-buffered printk() for lockdep messages.

2018-11-08 Thread Sergey Senozhatsky

On (11/08/18 20:37), Tetsuo Handa wrote:
> On 2018/11/08 13:45, Sergey Senozhatsky wrote:
> > So, can we just do the following? /* a sketch */
> > 
> > lockdep.c
> > printk_safe_enter_irqsave(flags);
> > lockdep_report();
> > printk_safe_exit_irqrestore(flags);
> 
> If buffer size were large enough to hold messages from out_of_memory(),
> I would like to use it for out_of_memory() because delaying SIGKILL
> due to waiting for printk() to complete is not good. Surely we can't
> hold all messages because amount from dump_tasks() is unpredictable.
> Maybe we can hold all messages from dump_header() except dump_tasks().
> 
> But isn't it essentially same with
> http://lkml.kernel.org/r/1493560477-3016-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp
> which Linus does not want?

Dunno. I guess we still haven't heard from Linus because he did quite a good
job setting up his 'email filters' ;)

Converting the existing users to buffered printk is not so simple.
Apparently there are different paths; some can afford buffered printk, some
cannot. Some of 'cont' users tend to get advantage of transparent 'cont'
context: start 'cont' output in function A: A()->pr_cont(), continue it in
B: A()->B()->pr_cont(), and then in C: A()->B()->C()->pr_cont(), and
finally flush in A: A()->pr_cont(\n). And then some paths have the
early_printk requirement. We can break the 'transparent cont' by passing
buffer pointers around [it can get a bit hairy; looking at lockdep patch],
but early_printk requirement is a different beast.

So in my email I was not advertising printk_safe as a "buffered printk for
everyone", I was just talking about lockdep. It's a bit doubtful that Peter
will ACK lockdep transition to buffered printk.

-ss

Re: [PATCH 3/3] lockdep: Use line-buffered printk() for lockdep messages.

2018-11-08 Thread Sergey Senozhatsky

On (11/08/18 21:44), Sergey Senozhatsky wrote:
> 
> If lockdep and OOM people will ACK buffered printk transition in
> its current form, then we can go ahead.

That printk_safe approach in lockdep, BTW, does not change (convert)
any printk-s within lockdep, thus Peter's earlycon should not be
affected. So Peter will have earlycon working, syzbot folks will have
buffered lockdep print outs.

-ss

Re: [PATCH 3/3] lockdep: Use line-buffered printk() for lockdep messages.

2018-11-08 Thread Sergey Senozhatsky

On (11/08/18 21:44), Sergey Senozhatsky wrote:
> 
> If lockdep and OOM people will ACK buffered printk transition in
> its current form, then we can go ahead.

That printk_safe approach in lockdep, BTW, does not change (convert)
any printk-s within lockdep, thus Peter's earlycon should not be
affected. So Peter will have earlycon working, syzbot folks will have
buffered lockdep print outs.

-ss

Re: [PATCH 3/3] lockdep: Use line-buffered printk() for lockdep messages.

2018-11-08 Thread Sergey Senozhatsky

On (11/08/18 12:53), Petr Mladek wrote:
> > lockdep.c
> > printk_safe_enter_irqsave(flags);
> > lockdep_report();
> > printk_safe_exit_irqrestore(flags);
> 
> All this looks nice. Let's look it also from the other side.
> The following comes to my mind:
> 
> a) lockdep is not the only place when continuous lines get mixed.
>This patch mentions also RCU stalls. The other patch mentions
>OOM. I am sure that there will be more.
> 
> b) It is not obvious where printk_safe() would be necessary.
>While buffered printk is clearly connected with continuous
>lines.
> 
> c) I am not sure that disabling preemption would always be
>acceptable.
> 
> d) We might need to increase the size of the per-CPU buffers if
>they are used more widely.
> 
> e) People would need to learn a new (printk_safe) API when it is
>use outside printk sources.
> 
> f) Losing the entire log is more painful than loosing one line
>when the buffer never gets flushed.
> 
> Sigh, no solution is perfect. If only we could agree that one
> way was better than the other.

I agree with what you are saying. All of the above (in my email)
was for lockdep only, that's why I did mention lockdep several times.
Like I said, a random and wild idea.
I'm not proposing printk_safe as a "better" buffered printk for
everyone. The buffered_printk patch is pretty big, and comes with a
price tag.

If lockdep and OOM people will ACK buffered printk transition in
its current form, then we can go ahead.

It's debatable if we need a fixed size list of buffers; or we can
do kmalloc()+cont fallback. But if we will have ACKs, then we can
move forward.

-ss

Re: [PATCH 3/3] lockdep: Use line-buffered printk() for lockdep messages.

2018-11-08 Thread Sergey Senozhatsky

On (11/08/18 12:53), Petr Mladek wrote:
> > lockdep.c
> > printk_safe_enter_irqsave(flags);
> > lockdep_report();
> > printk_safe_exit_irqrestore(flags);
> 
> All this looks nice. Let's look it also from the other side.
> The following comes to my mind:
> 
> a) lockdep is not the only place when continuous lines get mixed.
>This patch mentions also RCU stalls. The other patch mentions
>OOM. I am sure that there will be more.
> 
> b) It is not obvious where printk_safe() would be necessary.
>While buffered printk is clearly connected with continuous
>lines.
> 
> c) I am not sure that disabling preemption would always be
>acceptable.
> 
> d) We might need to increase the size of the per-CPU buffers if
>they are used more widely.
> 
> e) People would need to learn a new (printk_safe) API when it is
>use outside printk sources.
> 
> f) Losing the entire log is more painful than loosing one line
>when the buffer never gets flushed.
> 
> Sigh, no solution is perfect. If only we could agree that one
> way was better than the other.

I agree with what you are saying. All of the above (in my email)
was for lockdep only, that's why I did mention lockdep several times.
Like I said, a random and wild idea.
I'm not proposing printk_safe as a "better" buffered printk for
everyone. The buffered_printk patch is pretty big, and comes with a
price tag.

If lockdep and OOM people will ACK buffered printk transition in
its current form, then we can go ahead.

It's debatable if we need a fixed size list of buffers; or we can
do kmalloc()+cont fallback. But if we will have ACKs, then we can
move forward.

-ss

Re: [PATCH v6 1/3] printk: Add line-buffered printk() API.

2018-11-08 Thread Sergey Senozhatsky

On (11/08/18 12:24), Petr Mladek wrote:
> I believe that I mentioned this more times. 16 buffers is the first
> attempt. We could improve it later in several ways

Sure. Like I said - maybe it is a normal development pattern; I really
don't know.

> > Let's have one more look at what we will fix and what we will break.
> > 
> > 'cont' has premature flushes.
> > 
> >   Why is it good.
> >   It preserves the correct order of events.
> > 
> >   pr_cont("calling foo->init()");
> >   foo->init()
> >printk("Can't allocate buffer\n");// premature flush
> >   pr_cont("...blah\h");
> > 
> >  Will end up in the logbuf as:
> >  [12345.123] calling foo->init()
> >  [12345.124] Can't allocate buffer
> >  [12345.125] ...blah
> > 
> >  Where buffered printk will endup as:
> >  [12345.123] Can't allocate buffer
> >  [12345.124] calling foo->init()...blah
> 
> We will always have this problem with API using explicit buffers.

Exactly.

> What do you suggest instead, please?

So this is why "let's not remove 'cont'" thing. Buffered printk
is not 100% identical to 'cont'. And 'cont' does a good job sometimes,
Because 'cont' looks like a buffered printk, but it behaves like a
normal printk when things go bad. Buffered printk is always buffered;
and not even aware of the fact that things go bad around.

> > If our problem is OOM and lockdep print outs, then let's address only
> > those two; let's not "fix" the rest of the kernel, especially the early
> > boot, - we can break more things than we can mend.
> 
> Do you have any alternative proposal how to handle OOM and lockdep, please?

You misunderstood me. I'm not against the buffered printk in lockdep and
OOM. Albeit I must admit that lockdep patch looks quite big. I don't like
the idea of "we will remove 'cont' and replace it with buffered printk
through out the kernel".

[..]
> > - It seems that buffered printk attempts to solve too many problems.
> >   I'd prefer it to address just one.
> 
> This API tries to handle continuous lines more reliably.
> Do I miss anything, please?

This part:

+   /* Flush already completed lines if any. */
+   for (pos = ptr->len - 1; pos >= 0; pos--) {
+   if (ptr->buf[pos] != '\n')
+   continue;
+   ptr->buf[pos++] = '\0';
+   printk("%s\n", ptr->buf);
+   ptr->len -= pos;
+   memmove(ptr->buf, ptr->buf + pos, ptr->len);
+   /* This '\0' will be overwritten by next vsnprintf() above. */
+   ptr->buf[ptr->len] = '\0';
+   break;
+   }
+   return r;

If I'm not mistaken, this is for the futute "printk injection" work.

Right now printk("foo\nbar\n") will end up to be a single logbuf
entry, with \n in the middle and at the end. So it will look like
two lines on the serial console:

[123.123] foo
  bar

Tetsuo does this \n lookup (on every vprintk_buffered) and split lines
(via memove) for "printk injection", so the output will be

[123.123] foo
[123.124] bar

Which makes it simpler to "inject" printk origin into every printed
line.

Without it we can just do

len = vsprintf();
if (len && text[len - 1] == '\n' || overflow)
flush();

Can't we?

-ss

Re: [PATCH v6 1/3] printk: Add line-buffered printk() API.

2018-11-08 Thread Sergey Senozhatsky

On (11/08/18 12:24), Petr Mladek wrote:
> I believe that I mentioned this more times. 16 buffers is the first
> attempt. We could improve it later in several ways

Sure. Like I said - maybe it is a normal development pattern; I really
don't know.

> > Let's have one more look at what we will fix and what we will break.
> > 
> > 'cont' has premature flushes.
> > 
> >   Why is it good.
> >   It preserves the correct order of events.
> > 
> >   pr_cont("calling foo->init()");
> >   foo->init()
> >printk("Can't allocate buffer\n");// premature flush
> >   pr_cont("...blah\h");
> > 
> >  Will end up in the logbuf as:
> >  [12345.123] calling foo->init()
> >  [12345.124] Can't allocate buffer
> >  [12345.125] ...blah
> > 
> >  Where buffered printk will endup as:
> >  [12345.123] Can't allocate buffer
> >  [12345.124] calling foo->init()...blah
> 
> We will always have this problem with API using explicit buffers.

Exactly.

> What do you suggest instead, please?

So this is why "let's not remove 'cont'" thing. Buffered printk
is not 100% identical to 'cont'. And 'cont' does a good job sometimes,
Because 'cont' looks like a buffered printk, but it behaves like a
normal printk when things go bad. Buffered printk is always buffered;
and not even aware of the fact that things go bad around.

> > If our problem is OOM and lockdep print outs, then let's address only
> > those two; let's not "fix" the rest of the kernel, especially the early
> > boot, - we can break more things than we can mend.
> 
> Do you have any alternative proposal how to handle OOM and lockdep, please?

You misunderstood me. I'm not against the buffered printk in lockdep and
OOM. Albeit I must admit that lockdep patch looks quite big. I don't like
the idea of "we will remove 'cont' and replace it with buffered printk
through out the kernel".

[..]
> > - It seems that buffered printk attempts to solve too many problems.
> >   I'd prefer it to address just one.
> 
> This API tries to handle continuous lines more reliably.
> Do I miss anything, please?

This part:

+   /* Flush already completed lines if any. */
+   for (pos = ptr->len - 1; pos >= 0; pos--) {
+   if (ptr->buf[pos] != '\n')
+   continue;
+   ptr->buf[pos++] = '\0';
+   printk("%s\n", ptr->buf);
+   ptr->len -= pos;
+   memmove(ptr->buf, ptr->buf + pos, ptr->len);
+   /* This '\0' will be overwritten by next vsnprintf() above. */
+   ptr->buf[ptr->len] = '\0';
+   break;
+   }
+   return r;

If I'm not mistaken, this is for the futute "printk injection" work.

Right now printk("foo\nbar\n") will end up to be a single logbuf
entry, with \n in the middle and at the end. So it will look like
two lines on the serial console:

[123.123] foo
  bar

Tetsuo does this \n lookup (on every vprintk_buffered) and split lines
(via memove) for "printk injection", so the output will be

[123.123] foo
[123.124] bar

Which makes it simpler to "inject" printk origin into every printed
line.

Without it we can just do

len = vsprintf();
if (len && text[len - 1] == '\n' || overflow)
flush();

Can't we?

-ss

Re: [PATCH 3/3] lockdep: Use line-buffered printk() for lockdep messages.

2018-11-07 Thread Sergey Senozhatsky

On (11/07/18 16:19), Petr Mladek wrote:
> > syzbot is sometimes getting mixed output like below due to concurrent
> > printk(). Mitigate such output by using line-buffered printk() API.
> > 
> > @@ -2421,18 +2458,20 @@ static void check_chain_key(struct task_struct 
> > *curr)
> >  print_usage_bug_scenario(struct held_lock *lock)
> >  {
> > struct lock_class *class = hlock_class(lock);
> > +   struct printk_buffer *buf = get_printk_buffer();
> >  
> > printk(" Possible unsafe locking scenario:\n\n");
> > printk("   CPU0\n");
> > printk("   \n");
> > -   printk("  lock(");
> > -   __print_lock_name(class);
> > -   printk(KERN_CONT ");\n");
> > +   printk_buffered(buf, "  lock(");
> > +   __print_lock_name(class, buf);
> > +   printk_buffered(buf, ");\n");
> > printk("  \n");
> > -   printk("lock(");
> > -   __print_lock_name(class);
> > -   printk(KERN_CONT ");\n");
> > +   printk_buffered(buf, "lock(");
> > +   __print_lock_name(class, buf);
> > +   printk_buffered(buf, ");\n");
> > printk("\n *** DEADLOCK ***\n\n");
> > +   put_printk_buffer(buf);
> >  }
> >  
> >  static int
> 
> I really hope that the maze of pr_cont() calls in lockdep.c is the most
> complicated one that we would meet.

Hmm... Yes, buffered/seq_buf printk looks like a hard-to-use API,
when it comes to real world cases like this.

So... here is a random and wild idea.

We actually already have an easy-to-use buffered printk. And it's per-CPU.
And it makes all printk-s on this CPU to behave like as if they were called
on UP system. And it cures pr_cont(). And it doesn't require anyone to learn
any new printk API names. And it doesn't require any additional maintenance
work. And it doesn't require any printk->buffered_printk conversions. And
it's already in the kernel. And we gave it a name. And it's printk_safe.

a) lockdep reporting path should be atomic. And it's not a hot path,
   so local_irq_save/local_irq_restore will not cause a lot of trouble
   there probably.

b) We already have some lockdep reports coming via printk_safe.
   All those
printk->console_driver->scheduler->lockdep
printk->console_driver->timekeeping->lockdep
etc.

   came via printk_safe path. So it's not a complete novelty.

c) printk_safe sections can nest.

d) No premature flushes. Everything looks the way it was supposed to
   look.

e) There are no out-of-line printk-s. We keep the actual order of events.

f) We flush it on panic.

g) Low maintenance costs.

So, can we just do the following? /* a sketch */

lockdep.c
printk_safe_enter_irqsave(flags);
lockdep_report();
printk_safe_exit_irqrestore(flags);

-ss

Re: [PATCH 3/3] lockdep: Use line-buffered printk() for lockdep messages.

2018-11-07 Thread Sergey Senozhatsky

On (11/07/18 16:19), Petr Mladek wrote:
> > syzbot is sometimes getting mixed output like below due to concurrent
> > printk(). Mitigate such output by using line-buffered printk() API.
> > 
> > @@ -2421,18 +2458,20 @@ static void check_chain_key(struct task_struct 
> > *curr)
> >  print_usage_bug_scenario(struct held_lock *lock)
> >  {
> > struct lock_class *class = hlock_class(lock);
> > +   struct printk_buffer *buf = get_printk_buffer();
> >  
> > printk(" Possible unsafe locking scenario:\n\n");
> > printk("   CPU0\n");
> > printk("   \n");
> > -   printk("  lock(");
> > -   __print_lock_name(class);
> > -   printk(KERN_CONT ");\n");
> > +   printk_buffered(buf, "  lock(");
> > +   __print_lock_name(class, buf);
> > +   printk_buffered(buf, ");\n");
> > printk("  \n");
> > -   printk("lock(");
> > -   __print_lock_name(class);
> > -   printk(KERN_CONT ");\n");
> > +   printk_buffered(buf, "lock(");
> > +   __print_lock_name(class, buf);
> > +   printk_buffered(buf, ");\n");
> > printk("\n *** DEADLOCK ***\n\n");
> > +   put_printk_buffer(buf);
> >  }
> >  
> >  static int
> 
> I really hope that the maze of pr_cont() calls in lockdep.c is the most
> complicated one that we would meet.

Hmm... Yes, buffered/seq_buf printk looks like a hard-to-use API,
when it comes to real world cases like this.

So... here is a random and wild idea.

We actually already have an easy-to-use buffered printk. And it's per-CPU.
And it makes all printk-s on this CPU to behave like as if they were called
on UP system. And it cures pr_cont(). And it doesn't require anyone to learn
any new printk API names. And it doesn't require any additional maintenance
work. And it doesn't require any printk->buffered_printk conversions. And
it's already in the kernel. And we gave it a name. And it's printk_safe.

a) lockdep reporting path should be atomic. And it's not a hot path,
   so local_irq_save/local_irq_restore will not cause a lot of trouble
   there probably.

b) We already have some lockdep reports coming via printk_safe.
   All those
printk->console_driver->scheduler->lockdep
printk->console_driver->timekeeping->lockdep
etc.

   came via printk_safe path. So it's not a complete novelty.

c) printk_safe sections can nest.

d) No premature flushes. Everything looks the way it was supposed to
   look.

e) There are no out-of-line printk-s. We keep the actual order of events.

f) We flush it on panic.

g) Low maintenance costs.

So, can we just do the following? /* a sketch */

lockdep.c
printk_safe_enter_irqsave(flags);
lockdep_report();
printk_safe_exit_irqrestore(flags);

-ss

Re: [PATCH v6 1/3] printk: Add line-buffered printk() API.

2018-11-07 Thread Sergey Senozhatsky

On (11/07/18 19:52), Tetsuo Handa wrote:
> > A question.
> > 
> > How bad would it actually be to:
> > 
> > - Allocate seq_buf 512-bytes buffer (GFP_ATOMIC) just-in-time, when we
> >   need it.
> > // How often systems cannot allocate a 512-byte buffer? //
> 
> It is a very bad thing to do GFP_ATOMIC without __GFP_NOWARN.

Absolutely, __GFP_NOWARN.

> "it does not sleep". Not suitable for printk() which might be called from
> critically dangerous situations.

So I'm really not convinced that we can use buffered printk in critically
dangerous situations. Premature 'cont' flushes and 'cont' flushes on panic
are nice and right in critically dangerous situations.

[..]
> > - Do not allocate seq_buf if we are in printk-safe or in printk-nmi mode.
> >   To avoid "buffering for the sake of buffering". IOW, when in printk-safe
> >   use printk-safe.
> 
> Why? Since printk_safe_flush_buffer() forcibly flushes the partial line

We need to leave printk_safe and enable local IRQs for that partial
flush to occur. I'm not sure that those "partial flushes" from printk_safe
actually happen.

-ss

Re: [PATCH v6 1/3] printk: Add line-buffered printk() API.

2018-11-07 Thread Sergey Senozhatsky

On (11/07/18 19:52), Tetsuo Handa wrote:
> > A question.
> > 
> > How bad would it actually be to:
> > 
> > - Allocate seq_buf 512-bytes buffer (GFP_ATOMIC) just-in-time, when we
> >   need it.
> > // How often systems cannot allocate a 512-byte buffer? //
> 
> It is a very bad thing to do GFP_ATOMIC without __GFP_NOWARN.

Absolutely, __GFP_NOWARN.

> "it does not sleep". Not suitable for printk() which might be called from
> critically dangerous situations.

So I'm really not convinced that we can use buffered printk in critically
dangerous situations. Premature 'cont' flushes and 'cont' flushes on panic
are nice and right in critically dangerous situations.

[..]
> > - Do not allocate seq_buf if we are in printk-safe or in printk-nmi mode.
> >   To avoid "buffering for the sake of buffering". IOW, when in printk-safe
> >   use printk-safe.
> 
> Why? Since printk_safe_flush_buffer() forcibly flushes the partial line

We need to leave printk_safe and enable local IRQs for that partial
flush to occur. I'm not sure that those "partial flushes" from printk_safe
actually happen.

-ss

Re: [PATCH v6 1/3] printk: Add line-buffered printk() API.

2018-11-07 Thread Sergey Senozhatsky

On (11/07/18 11:21), Petr Mladek wrote:
> > [..]
> > - The printk-fallback sounds like a hint that the existing 'cont' handling
> >   better stay in the kernel. I don't see how the existing 'cont' is
> >   significantly worse than
> > bpr_warn(NULL, ...)->printk() // no 'cont' support
> >   I don't see why would we want to do it, sorry. I don't see "it takes 16
> >   printk-buffers to make a thing go right" as a sure thing.
> 
> I see it the following way:
> 
>+ mixed cont lines are very rare but they happen
> 
>+ 16 buffers are more than 1 so it could only be better [*]

Right, so this is where things go a bit shady. I'll try to explain.

What is the problem:
- we have tons of CPUs, with tons of tasks running on them, with preemption,
  and interrupts, and potentially printk-s coming from various
  contexts/CPUs/tasks etc. so one 'cont' buffer is not enough.

What is the proposed solution:
- if 1 is not enough then 16 will do. And if 16 is not enough then this
  is not our problem anymore, it's a kernel misconfiguration and users'
  fault.

So, maybe it is a "common" kernel development pattern, I really don't know;
but it looks like we are just throwing the issue over the fence; and atop
of that we are killing off the existing 'cont'.

> Anyway, I do not think that both implementations are worth it.
> We could keep both for some transition period but we should
> remove the old one later.
>
[..]
>
> This would prevent removing the fallback to struct cont. OOM is
> one important scenario where continuous lines are used.

Let's have one more look at what we will fix and what we will break.

'cont' has premature flushes.

- it's annoying in some cases
- it's good otherwise

  Why is it good.
  It preserves the correct order of events.

  pr_cont("calling foo->init()");
  foo->init()
   printk("Can't allocate buffer\n");// premature flush
  pr_cont("...blah\h");

 Will end up in the logbuf as:
 [12345.123] calling foo->init()
 [12345.124] Can't allocate buffer
 [12345.125] ...blah

 Where buffered printk will endup as:
 [12345.123] Can't allocate buffer
 [12345.124] calling foo->init()...blah

It's very different.

Not to mention that buffered printk does not flush on panic.
So, frankly, as of now, I don't see buffered printk as a 'cont'
replacement.

If our problem is OOM and lockdep print outs, then let's address only
those two; let's not "fix" the rest of the kernel, especially the early
boot, - we can break more things than we can mend.

[..]
> I opened this problem once and it got lost. So I did not want to
> complicate it at this moment.

Right, I saw it. We have similar points; I raised those in private talks:

- we don't need buffered printk in printk_safe/printk_nmi

- we don't need br_cont()

- unlike 'cont' buffer there is no way for us to flush buffered printk
  buffers on panic

- I don't exactly like the completely of the vprintk_buffered. If
  buffered printk is for single line, then it must be for single
  line only. And I'm not buying the "we will need this for printk
  origin info injection" argument.
  Linus was very clear about this whole idea:

  Linus Torvalds wrote:
  :  Sergey Senozhatsky wrote:
  :  >
  :  > so... I think we don't have to update 'struct printk_log'. we can store
  :  > that "extended data" at the beginning of every message, right after the
  :  > prefix.
  :
  :   No, we really can't. That just means that all the tools would have to
  :   be changed to get the normal messages without the extra crud. And
  :   since it will have lost the difference, that's not even easy to do.
  :
  :   So this is exactly the wrong way around.
  :
  :   If people want to see the extra data, it really should be extra data
  :   that you can get with a new interface from the kernel logs. Not a
  :   "let's just a add it to all lines and make every line uglier and
  :   harder to read.

  So the "we need all that complexity to inject printk info in later
  patches" push doesn't look right.

- It seems that buffered printk attempts to solve too many problems.
  I'd prefer it to address just one.

-ss

Re: [PATCH v6 1/3] printk: Add line-buffered printk() API.

2018-11-07 Thread Sergey Senozhatsky

On (11/07/18 11:21), Petr Mladek wrote:
> > [..]
> > - The printk-fallback sounds like a hint that the existing 'cont' handling
> >   better stay in the kernel. I don't see how the existing 'cont' is
> >   significantly worse than
> > bpr_warn(NULL, ...)->printk() // no 'cont' support
> >   I don't see why would we want to do it, sorry. I don't see "it takes 16
> >   printk-buffers to make a thing go right" as a sure thing.
> 
> I see it the following way:
> 
>+ mixed cont lines are very rare but they happen
> 
>+ 16 buffers are more than 1 so it could only be better [*]

Right, so this is where things go a bit shady. I'll try to explain.

What is the problem:
- we have tons of CPUs, with tons of tasks running on them, with preemption,
  and interrupts, and potentially printk-s coming from various
  contexts/CPUs/tasks etc. so one 'cont' buffer is not enough.

What is the proposed solution:
- if 1 is not enough then 16 will do. And if 16 is not enough then this
  is not our problem anymore, it's a kernel misconfiguration and users'
  fault.

So, maybe it is a "common" kernel development pattern, I really don't know;
but it looks like we are just throwing the issue over the fence; and atop
of that we are killing off the existing 'cont'.

> Anyway, I do not think that both implementations are worth it.
> We could keep both for some transition period but we should
> remove the old one later.
>
[..]
>
> This would prevent removing the fallback to struct cont. OOM is
> one important scenario where continuous lines are used.

Let's have one more look at what we will fix and what we will break.

'cont' has premature flushes.

- it's annoying in some cases
- it's good otherwise

  Why is it good.
  It preserves the correct order of events.

  pr_cont("calling foo->init()");
  foo->init()
   printk("Can't allocate buffer\n");// premature flush
  pr_cont("...blah\h");

 Will end up in the logbuf as:
 [12345.123] calling foo->init()
 [12345.124] Can't allocate buffer
 [12345.125] ...blah

 Where buffered printk will endup as:
 [12345.123] Can't allocate buffer
 [12345.124] calling foo->init()...blah

It's very different.

Not to mention that buffered printk does not flush on panic.
So, frankly, as of now, I don't see buffered printk as a 'cont'
replacement.

If our problem is OOM and lockdep print outs, then let's address only
those two; let's not "fix" the rest of the kernel, especially the early
boot, - we can break more things than we can mend.

[..]
> I opened this problem once and it got lost. So I did not want to
> complicate it at this moment.

Right, I saw it. We have similar points; I raised those in private talks:

- we don't need buffered printk in printk_safe/printk_nmi

- we don't need br_cont()

- unlike 'cont' buffer there is no way for us to flush buffered printk
  buffers on panic

- I don't exactly like the completely of the vprintk_buffered. If
  buffered printk is for single line, then it must be for single
  line only. And I'm not buying the "we will need this for printk
  origin info injection" argument.
  Linus was very clear about this whole idea:

  Linus Torvalds wrote:
  :  Sergey Senozhatsky wrote:
  :  >
  :  > so... I think we don't have to update 'struct printk_log'. we can store
  :  > that "extended data" at the beginning of every message, right after the
  :  > prefix.
  :
  :   No, we really can't. That just means that all the tools would have to
  :   be changed to get the normal messages without the extra crud. And
  :   since it will have lost the difference, that's not even easy to do.
  :
  :   So this is exactly the wrong way around.
  :
  :   If people want to see the extra data, it really should be extra data
  :   that you can get with a new interface from the kernel logs. Not a
  :   "let's just a add it to all lines and make every line uglier and
  :   harder to read.

  So the "we need all that complexity to inject printk info in later
  patches" push doesn't look right.

- It seems that buffered printk attempts to solve too many problems.
  I'd prefer it to address just one.

-ss

Re: [PATCH v6 1/3] printk: Add line-buffered printk() API.

2018-11-06 Thread Sergey Senozhatsky

On (11/02/18 22:31), Tetsuo Handa wrote:
>   (1) Call get_printk_buffer() and acquire "struct printk_buffer *".
> 
>   (2) Rewrite printk() calls in the following way. The "ptr" is
>   "struct printk_buffer *" obtained in step (1).
> 
>   printk(fmt, ...) => printk_buffered(ptr, fmt, ...)
>   vprintk(fmt, args)   => vprintk_buffered(ptr, fmt, args)
>   pr_emerg(fmt, ...)   => bpr_emerg(ptr, fmt, ...)
>   pr_alert(fmt, ...)   => bpr_alert(ptr, fmt, ...)
>   pr_crit(fmt, ...)=> bpr_crit(ptr, fmt, ...)
>   pr_err(fmt, ...) => bpr_err(ptr, fmt, ...)
>   pr_warning(fmt, ...) => bpr_warning(ptr, fmt, ...)
>   pr_warn(fmt, ...)=> bpr_warn(ptr, fmt, ...)
>   pr_notice(fmt, ...)  => bpr_notice(ptr, fmt, ...)
>   pr_info(fmt, ...)=> bpr_info(ptr, fmt, ...)
>   pr_cont(fmt, ...)=> bpr_cont(ptr, fmt, ...)
> 
>   (3) Release "struct printk_buffer" by calling put_printk_buffer().

[..]

> Since we want to remove "struct cont" eventually, we will try to remove
> both "implicit printk() users who are expecting KERN_CONT behavior" and
> "explicit pr_cont()/printk(KERN_CONT) users". Therefore, converting to
> this API is recommended.

- The printk-fallback sounds like a hint that the existing 'cont' handling
  better stay in the kernel. I don't see how the existing 'cont' is
  significantly worse than
bpr_warn(NULL, ...)->printk() // no 'cont' support
  I don't see why would we want to do it, sorry. I don't see "it takes 16
  printk-buffers to make a thing go right" as a sure thing.

A question.

How bad would it actually be to:

- Allocate seq_buf 512-bytes buffer (GFP_ATOMIC) just-in-time, when we
  need it.
// How often systems cannot allocate a 512-byte buffer? //

- OK, assuming that systems around the world are so badly OOM like all the
  time and even kmalloc(512) is absolutely impossible, then have a fallback
  to the existing 'cont' handling; it just looks to me better than a plain
  printk()-fallback with removed 'cont' support.

- Do not allocate seq_buf if we are in printk-safe or in printk-nmi mode.
  To avoid "buffering for the sake of buffering". IOW, when in printk-safe
  use printk-safe.

-ss

Re: [PATCH v6 1/3] printk: Add line-buffered printk() API.

2018-11-06 Thread Sergey Senozhatsky

On (11/02/18 22:31), Tetsuo Handa wrote:
>   (1) Call get_printk_buffer() and acquire "struct printk_buffer *".
> 
>   (2) Rewrite printk() calls in the following way. The "ptr" is
>   "struct printk_buffer *" obtained in step (1).
> 
>   printk(fmt, ...) => printk_buffered(ptr, fmt, ...)
>   vprintk(fmt, args)   => vprintk_buffered(ptr, fmt, args)
>   pr_emerg(fmt, ...)   => bpr_emerg(ptr, fmt, ...)
>   pr_alert(fmt, ...)   => bpr_alert(ptr, fmt, ...)
>   pr_crit(fmt, ...)=> bpr_crit(ptr, fmt, ...)
>   pr_err(fmt, ...) => bpr_err(ptr, fmt, ...)
>   pr_warning(fmt, ...) => bpr_warning(ptr, fmt, ...)
>   pr_warn(fmt, ...)=> bpr_warn(ptr, fmt, ...)
>   pr_notice(fmt, ...)  => bpr_notice(ptr, fmt, ...)
>   pr_info(fmt, ...)=> bpr_info(ptr, fmt, ...)
>   pr_cont(fmt, ...)=> bpr_cont(ptr, fmt, ...)
> 
>   (3) Release "struct printk_buffer" by calling put_printk_buffer().

[..]

> Since we want to remove "struct cont" eventually, we will try to remove
> both "implicit printk() users who are expecting KERN_CONT behavior" and
> "explicit pr_cont()/printk(KERN_CONT) users". Therefore, converting to
> this API is recommended.

- The printk-fallback sounds like a hint that the existing 'cont' handling
  better stay in the kernel. I don't see how the existing 'cont' is
  significantly worse than
bpr_warn(NULL, ...)->printk() // no 'cont' support
  I don't see why would we want to do it, sorry. I don't see "it takes 16
  printk-buffers to make a thing go right" as a sure thing.

A question.

How bad would it actually be to:

- Allocate seq_buf 512-bytes buffer (GFP_ATOMIC) just-in-time, when we
  need it.
// How often systems cannot allocate a 512-byte buffer? //

- OK, assuming that systems around the world are so badly OOM like all the
  time and even kmalloc(512) is absolutely impossible, then have a fallback
  to the existing 'cont' handling; it just looks to me better than a plain
  printk()-fallback with removed 'cont' support.

- Do not allocate seq_buf if we are in printk-safe or in printk-nmi mode.
  To avoid "buffering for the sake of buffering". IOW, when in printk-safe
  use printk-safe.

-ss

Re: [PATCH 3/3] lockdep: Use line-buffered printk() for lockdep messages.

2018-11-06 Thread Sergey Senozhatsky

On (11/06/18 09:38), Petr Mladek wrote:
> 
> If you would want to avoid buffering, you could set the number
> of buffers to zero. Then it would always fallback to
> the direct printk().

This printk-fallback makes me wonder if 'cont' really can ever go away.
We would totally break cont printk-s that trapped into printk-fallback;
as opposed to current sometimes-cont-works-just-fine.

-ss

Re: [PATCH 3/3] lockdep: Use line-buffered printk() for lockdep messages.

2018-11-06 Thread Sergey Senozhatsky

On (11/06/18 09:38), Petr Mladek wrote:
> 
> If you would want to avoid buffering, you could set the number
> of buffers to zero. Then it would always fallback to
> the direct printk().

This printk-fallback makes me wonder if 'cont' really can ever go away.
We would totally break cont printk-s that trapped into printk-fallback;
as opposed to current sometimes-cont-works-just-fine.

-ss

Re: [PATCHv3] panic: avoid deadlocks in re-entrant console drivers

2018-10-31 Thread Sergey Senozhatsky

On (10/31/18 13:27), Petr Mladek wrote:
> > 
> > Signed-off-by: Sergey Senozhatsky 
> 
> The patch makes sense to me. The locks should stay busted also for
> console_flush_on_panic().
> 
> With the added #include :
> 
> Reviewed-by: Petr Mladek 

Thanks!

Since there are no objections - how shall we route it? Via printk tree?
I'd also prefer this patch to be in -stable, it fixes a real issue after
all.

-ss

Re: [PATCHv3] panic: avoid deadlocks in re-entrant console drivers

2018-10-31 Thread Sergey Senozhatsky

On (10/31/18 13:27), Petr Mladek wrote:
> > 
> > Signed-off-by: Sergey Senozhatsky 
> 
> The patch makes sense to me. The locks should stay busted also for
> console_flush_on_panic().
> 
> With the added #include :
> 
> Reviewed-by: Petr Mladek 

Thanks!

Since there are no objections - how shall we route it? Via printk tree?
I'd also prefer this patch to be in -stable, it fixes a real issue after
all.

-ss

Re: [PATCHv3] panic: avoid deadlocks in re-entrant console drivers

2018-10-25 Thread Sergey Senozhatsky

On (10/25/18 18:51), kbuild test robot wrote:
> 
> [auto build test ERROR on linux-sof-driver/master]
> [also build test ERROR on v4.19 next-20181019]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 

My bad, sorry!

+#include 

This should fix it.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

From: Sergey Senozhatsky 
Subject: [PATCH] panic: avoid deadlocks in re-entrant console drivers

>From printk()/serial console point of view panic() is special, because
it may force CPU to re-enter printk() or/and serial console driver.
Therefore, some of serial consoles drivers are re-entrant. E.g. 8250:

serial8250_console_write()
{
if (port->sysrq)
locked = 0;
else if (oops_in_progress)
locked = spin_trylock_irqsave(>lock, flags);
else
spin_lock_irqsave(>lock, flags);
...
}

panic() does set oops_in_progress via bust_spinlocks(1), so in theory
we should be able to re-enter serial console driver from panic():

CPU0

uart_console_write()
serial8250_console_write()  // if (oops_in_progress)
//spin_trylock_irqsave()
call_console_drivers()
console_unlock()
console_flush_on_panic()
bust_spinlocks(1)   // oops_in_progress++
panic()

spin_lock_irqsave(>lock, flags)   // spin_lock_irqsave()
serial8250_console_write()
call_console_drivers()
console_unlock()
printk()
...

However, this does not happen and we deadlock in serial console on
port->lock spinlock. And the problem is that console_flush_on_panic()
called after bust_spinlocks(0):

void panic(const char *fmt, ...)
{
bust_spinlocks(1);
...
bust_spinlocks(0);
console_flush_on_panic();
...
}

bust_spinlocks(0) decrements oops_in_progress, so oops_in_progress
can go back to zero. Thus even re-entrant console drivers will simply
spin on port->lock spinlock. Given that port->lock may already be
locked either by a stopped CPU, or by the very same CPU we execute
panic() on (for instance, NMI panic() on printing CPU) the system
deadlocks and does not reboot.

Fix this by removing bust_spinlocks(0), so oops_in_progress is always
set in panic() now and, thus, re-entrant console drivers will trylock
the port->lock instead of spinning on it forever, when we call them
from console_flush_on_panic().

Signed-off-by: Sergey Senozhatsky 
---
 kernel/panic.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/panic.c b/kernel/panic.c
index f6d549a29a5c..272ac1c34e4b 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 

 #define PANIC_TIMER_STEP 100
 #define PANIC_BLINK_SPD 18
@@ -237,7 +238,10 @@ void panic(const char *fmt, ...)
if (_crash_kexec_post_notifiers)
__crash_kexec(NULL);

-   bust_spinlocks(0);
+#ifdef CONFIG_VT
+   unblank_screen();
+#endif
+   console_unblank();

/*
 * We may have ended up stopping the CPU holding the lock (in
-- 
2.19.1

Re: [PATCHv3] panic: avoid deadlocks in re-entrant console drivers

2018-10-25 Thread Sergey Senozhatsky

On (10/25/18 18:51), kbuild test robot wrote:
> 
> [auto build test ERROR on linux-sof-driver/master]
> [also build test ERROR on v4.19 next-20181019]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 

My bad, sorry!

+#include 

This should fix it.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

From: Sergey Senozhatsky 
Subject: [PATCH] panic: avoid deadlocks in re-entrant console drivers

>From printk()/serial console point of view panic() is special, because
it may force CPU to re-enter printk() or/and serial console driver.
Therefore, some of serial consoles drivers are re-entrant. E.g. 8250:

serial8250_console_write()
{
if (port->sysrq)
locked = 0;
else if (oops_in_progress)
locked = spin_trylock_irqsave(>lock, flags);
else
spin_lock_irqsave(>lock, flags);
...
}

panic() does set oops_in_progress via bust_spinlocks(1), so in theory
we should be able to re-enter serial console driver from panic():

CPU0

uart_console_write()
serial8250_console_write()  // if (oops_in_progress)
//spin_trylock_irqsave()
call_console_drivers()
console_unlock()
console_flush_on_panic()
bust_spinlocks(1)   // oops_in_progress++
panic()

spin_lock_irqsave(>lock, flags)   // spin_lock_irqsave()
serial8250_console_write()
call_console_drivers()
console_unlock()
printk()
...

However, this does not happen and we deadlock in serial console on
port->lock spinlock. And the problem is that console_flush_on_panic()
called after bust_spinlocks(0):

void panic(const char *fmt, ...)
{
bust_spinlocks(1);
...
bust_spinlocks(0);
console_flush_on_panic();
...
}

bust_spinlocks(0) decrements oops_in_progress, so oops_in_progress
can go back to zero. Thus even re-entrant console drivers will simply
spin on port->lock spinlock. Given that port->lock may already be
locked either by a stopped CPU, or by the very same CPU we execute
panic() on (for instance, NMI panic() on printing CPU) the system
deadlocks and does not reboot.

Fix this by removing bust_spinlocks(0), so oops_in_progress is always
set in panic() now and, thus, re-entrant console drivers will trylock
the port->lock instead of spinning on it forever, when we call them
from console_flush_on_panic().

Signed-off-by: Sergey Senozhatsky 
---
 kernel/panic.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/panic.c b/kernel/panic.c
index f6d549a29a5c..272ac1c34e4b 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 

 #define PANIC_TIMER_STEP 100
 #define PANIC_BLINK_SPD 18
@@ -237,7 +238,10 @@ void panic(const char *fmt, ...)
if (_crash_kexec_post_notifiers)
__crash_kexec(NULL);

-   bust_spinlocks(0);
+#ifdef CONFIG_VT
+   unblank_screen();
+#endif
+   console_unblank();

/*
 * We may have ended up stopping the CPU holding the lock (in
-- 
2.19.1

[PATCHv3] panic: avoid deadlocks in re-entrant console drivers

2018-10-25 Thread Sergey Senozhatsky

>From printk()/serial console point of view panic() is special, because
it may force CPU to re-enter printk() or/and serial console driver.
Therefore, some of serial consoles drivers are re-entrant. E.g. 8250:

serial8250_console_write()
{
if (port->sysrq)
locked = 0;
else if (oops_in_progress)
locked = spin_trylock_irqsave(>lock, flags);
else
spin_lock_irqsave(>lock, flags);
...
}

panic() does set oops_in_progress via bust_spinlocks(1), so in theory
we should be able to re-enter serial console driver from panic():

CPU0

uart_console_write()
serial8250_console_write()  // if (oops_in_progress)
//spin_trylock_irqsave()
call_console_drivers()
console_unlock()
console_flush_on_panic()
bust_spinlocks(1)   // oops_in_progress++
panic()

spin_lock_irqsave(>lock, flags)   // spin_lock_irqsave()
serial8250_console_write()
call_console_drivers()
console_unlock()
printk()
...

However, this does not happen and we deadlock in serial console on
port->lock spinlock. And the problem is that console_flush_on_panic()
called after bust_spinlocks(0):

void panic(const char *fmt, ...)
{
bust_spinlocks(1);
...
bust_spinlocks(0);
console_flush_on_panic();
...
}

bust_spinlocks(0) decrements oops_in_progress, so oops_in_progress
can go back to zero. Thus even re-entrant console drivers will simply
spin on port->lock spinlock. Given that port->lock may already be
locked either by a stopped CPU, or by the very same CPU we execute
panic() on (for instance, NMI panic() on printing CPU) the system
deadlocks and does not reboot.

Fix this by removing bust_spinlocks(0), so oops_in_progress is always
set in panic() now and, thus, re-entrant console drivers will trylock
the port->lock instead of spinning on it forever, when we call them
from console_flush_on_panic().

Signed-off-by: Sergey Senozhatsky 
---

v2: do not do bust_spinlocks(1);bust_spinlocks(0);bust_spinlocks(1)
thing and just copy paste from lib/bust_spinlocks what
bust_spinlocks(0) does. (Petr)

 kernel/panic.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/panic.c b/kernel/panic.c
index f6d549a29a5c..fccd41628b24 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -237,7 +237,10 @@ void panic(const char *fmt, ...)
if (_crash_kexec_post_notifiers)
__crash_kexec(NULL);
 
-   bust_spinlocks(0);
+#ifdef CONFIG_VT
+   unblank_screen();
+#endif
+   console_unblank();
 
/*
 * We may have ended up stopping the CPU holding the lock (in
-- 
2.19.1

[PATCHv3] panic: avoid deadlocks in re-entrant console drivers

2018-10-25 Thread Sergey Senozhatsky

>From printk()/serial console point of view panic() is special, because
it may force CPU to re-enter printk() or/and serial console driver.
Therefore, some of serial consoles drivers are re-entrant. E.g. 8250:

serial8250_console_write()
{
if (port->sysrq)
locked = 0;
else if (oops_in_progress)
locked = spin_trylock_irqsave(>lock, flags);
else
spin_lock_irqsave(>lock, flags);
...
}

panic() does set oops_in_progress via bust_spinlocks(1), so in theory
we should be able to re-enter serial console driver from panic():

CPU0

uart_console_write()
serial8250_console_write()  // if (oops_in_progress)
//spin_trylock_irqsave()
call_console_drivers()
console_unlock()
console_flush_on_panic()
bust_spinlocks(1)   // oops_in_progress++
panic()

spin_lock_irqsave(>lock, flags)   // spin_lock_irqsave()
serial8250_console_write()
call_console_drivers()
console_unlock()
printk()
...

However, this does not happen and we deadlock in serial console on
port->lock spinlock. And the problem is that console_flush_on_panic()
called after bust_spinlocks(0):

void panic(const char *fmt, ...)
{
bust_spinlocks(1);
...
bust_spinlocks(0);
console_flush_on_panic();
...
}

bust_spinlocks(0) decrements oops_in_progress, so oops_in_progress
can go back to zero. Thus even re-entrant console drivers will simply
spin on port->lock spinlock. Given that port->lock may already be
locked either by a stopped CPU, or by the very same CPU we execute
panic() on (for instance, NMI panic() on printing CPU) the system
deadlocks and does not reboot.

Fix this by removing bust_spinlocks(0), so oops_in_progress is always
set in panic() now and, thus, re-entrant console drivers will trylock
the port->lock instead of spinning on it forever, when we call them
from console_flush_on_panic().

Signed-off-by: Sergey Senozhatsky 
---

v2: do not do bust_spinlocks(1);bust_spinlocks(0);bust_spinlocks(1)
thing and just copy paste from lib/bust_spinlocks what
bust_spinlocks(0) does. (Petr)

 kernel/panic.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/panic.c b/kernel/panic.c
index f6d549a29a5c..fccd41628b24 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -237,7 +237,10 @@ void panic(const char *fmt, ...)
if (_crash_kexec_post_notifiers)
__crash_kexec(NULL);
 
-   bust_spinlocks(0);
+#ifdef CONFIG_VT
+   unblank_screen();
+#endif
+   console_unblank();
 
/*
 * We may have ended up stopping the CPU holding the lock (in
-- 
2.19.1

Re: [RFC][PATCHv2 1/4] panic: avoid deadlocks in re-entrant console drivers

2018-10-25 Thread Sergey Senozhatsky

On (10/25/18 11:06), Petr Mladek wrote:
> 
> IMHO, the custom s390 implementation can get removed.
> The generic code should do the same job these days.
> 

Yep.

> > And console_unblank() is not guaranteed to print anything (unlike
> > console_flush_on_panic(), but oops is not panic() yet, so we can't
> > replace it with flush_on_panic()) - console_sem can be locked, so
> > console_unblank() would do nothing.
> 
> I see. I missed that console_unblank() returns early when
> down_trylock_console_sem() fails.
> 
> I still would like to refactor the code somehow to avoid
> the bust_spinlocks(0)/bust_spinlocks(1) ping-pong.
> 
> It might make sense to call console_unblank() from
> console_flush_on_panic().
> 
> I wonder if it would make sense to call unblank_screen() in
> console_unblank()...

These are interesting thoughts.

I can add one more thing to the list.
bust_spinlock() is probably not doing enough. It says that it

"clears any spinlocks which would prevent oops, die(), BUG()
 and panic() information from reaching the user."

But this is, technically, not true. Because bust_spinlock() does not remove
console_sem out of sight. And we do have several spinlocks behind it. E.g.
semaphore's ->lock. Both down() and down_trylock() do
raw_spin_lock_irqsave(>lock, flags), so if we got NMI panic while one
of the CPUs was under raw_spin_lock_irqsave(>lock, flags) trying to
lock the console_sem then we are done. And IIRC we had exactly this type
of a bug report from LG 1 or 2 years ago - deadlock on sem->lock.

-ss

Re: [RFC][PATCHv2 1/4] panic: avoid deadlocks in re-entrant console drivers

2018-10-25 Thread Sergey Senozhatsky

On (10/25/18 11:06), Petr Mladek wrote:
> 
> IMHO, the custom s390 implementation can get removed.
> The generic code should do the same job these days.
> 

Yep.

> > And console_unblank() is not guaranteed to print anything (unlike
> > console_flush_on_panic(), but oops is not panic() yet, so we can't
> > replace it with flush_on_panic()) - console_sem can be locked, so
> > console_unblank() would do nothing.
> 
> I see. I missed that console_unblank() returns early when
> down_trylock_console_sem() fails.
> 
> I still would like to refactor the code somehow to avoid
> the bust_spinlocks(0)/bust_spinlocks(1) ping-pong.
> 
> It might make sense to call console_unblank() from
> console_flush_on_panic().
> 
> I wonder if it would make sense to call unblank_screen() in
> console_unblank()...

These are interesting thoughts.

I can add one more thing to the list.
bust_spinlock() is probably not doing enough. It says that it

"clears any spinlocks which would prevent oops, die(), BUG()
 and panic() information from reaching the user."

But this is, technically, not true. Because bust_spinlock() does not remove
console_sem out of sight. And we do have several spinlocks behind it. E.g.
semaphore's ->lock. Both down() and down_trylock() do
raw_spin_lock_irqsave(>lock, flags), so if we got NMI panic while one
of the CPUs was under raw_spin_lock_irqsave(>lock, flags) trying to
lock the console_sem then we are done. And IIRC we had exactly this type
of a bug report from LG 1 or 2 years ago - deadlock on sem->lock.

-ss

Re: [RFC][PATCHv2 1/4] panic: avoid deadlocks in re-entrant console drivers

2018-10-25 Thread Sergey Senozhatsky

On (10/25/18 10:29), Petr Mladek wrote:
> 
> Yes, klogd is not a big deal. I just think that the bust_spinlocks()
> ping-pong would just confuse the code.

I agree; that's why I put some comments there.

> It might be better to keep the spinlocks busted and make sure that we do
> not cause regressions by not calling bust_spinlocks(0).

Sure, I understand. One reason to keep bust_spinlocks(0) there was "invoke
arch-specific bust_spinlocks(0), which might do something that common
bust_spinlocks() wouldn't do". Without going into details if any arch actually
does anything "special" in bust_spinlocks(0). Another reason was - this patch
looks like a -stable material to me; especially given that we have panic()
deadlock reports now. So I wanted to have a one liner which will not change
things for arch-s that re-define bust_spinlocks() and, at the same time,
fix the deadlock. Other than that I'm all for keeping spinlocks busted all
the time and just doing:

---
#ifdef CONFIG_VT
unblank_screen();
#endif
console_unblank();
---

in panic().

BTW, speaking of s390 bust_spinlocks(). It seems that starting from 4.21
all arch-s will use common bust_spinlocks() [1].

[..]
> > Hmm, I don't think I've seen any reports because of this. From 
> > printk/console
> > POV the locks which are not taken under oops_in_progress are not released.
> 
> Fair enough. Let's keep debug_locks_off() in panic().

Agreed.

[1] lkml.kernel.org/r/20181025081108.GB26561@osiris

-ss

Re: [RFC][PATCHv2 1/4] panic: avoid deadlocks in re-entrant console drivers

2018-10-25 Thread Sergey Senozhatsky

On (10/25/18 10:29), Petr Mladek wrote:
> 
> Yes, klogd is not a big deal. I just think that the bust_spinlocks()
> ping-pong would just confuse the code.

I agree; that's why I put some comments there.

> It might be better to keep the spinlocks busted and make sure that we do
> not cause regressions by not calling bust_spinlocks(0).

Sure, I understand. One reason to keep bust_spinlocks(0) there was "invoke
arch-specific bust_spinlocks(0), which might do something that common
bust_spinlocks() wouldn't do". Without going into details if any arch actually
does anything "special" in bust_spinlocks(0). Another reason was - this patch
looks like a -stable material to me; especially given that we have panic()
deadlock reports now. So I wanted to have a one liner which will not change
things for arch-s that re-define bust_spinlocks() and, at the same time,
fix the deadlock. Other than that I'm all for keeping spinlocks busted all
the time and just doing:

---
#ifdef CONFIG_VT
unblank_screen();
#endif
console_unblank();
---

in panic().

BTW, speaking of s390 bust_spinlocks(). It seems that starting from 4.21
all arch-s will use common bust_spinlocks() [1].

[..]
> > Hmm, I don't think I've seen any reports because of this. From 
> > printk/console
> > POV the locks which are not taken under oops_in_progress are not released.
> 
> Fair enough. Let's keep debug_locks_off() in panic().

Agreed.

[1] lkml.kernel.org/r/20181025081108.GB26561@osiris

-ss

Re: [PATCH] s390/fault: use wake_up_klogd() in bust_spinlocks()

2018-10-25 Thread Sergey Senozhatsky

On (10/25/18 10:11), Heiko Carstens wrote:
> > s390 is the only architecture that is using own bust_spinlocks()
> > variant, while other arch-s seem to be OK with the common
> > implementation.
> > 
> > Heiko Carstens [1] said he would prefer s390 to use the common
> > bust_spinlocks() as well:
> >   I did some code archaeology and this function is unchanged since ~17
> >   years. When it was introduced it was close to identical to the x86
> >   variant. All other architectures use the common code variant in the
> >   meantime. So if we change this I'd prefer that we switch s390 to the
> >   common code variant as well. Right now I can't see a reason for not
> >   doing that
> > 
> > This patch removes s390 bust_spinlocks() and drops the weak attribute
> > from the common bust_spinlocks() version.
> > 
> > [1] lkml.kernel.org/r/20181025062800.GB4037@osiris
> > Signed-off-by: Sergey Senozhatsky 
> > ---
> >  arch/s390/mm/fault.c | 24 
> >  lib/bust_spinlocks.c |  6 +++---
> >  2 files changed, 3 insertions(+), 27 deletions(-)
> 
> I gave this some testing and forced panic/die in interrupt as well as
> process context with different consoles as well as single and multi
> cpu systems. Everything still seems to work.

That was quick ;) Thanks.

> So I'm applying this to our internal queue first. It will hit upstream
> latest in the next merge window if there aren't any issues found.

Sure; sounds like a plan.

-ss

Re: [PATCH] s390/fault: use wake_up_klogd() in bust_spinlocks()

2018-10-25 Thread Sergey Senozhatsky

On (10/25/18 10:11), Heiko Carstens wrote:
> > s390 is the only architecture that is using own bust_spinlocks()
> > variant, while other arch-s seem to be OK with the common
> > implementation.
> > 
> > Heiko Carstens [1] said he would prefer s390 to use the common
> > bust_spinlocks() as well:
> >   I did some code archaeology and this function is unchanged since ~17
> >   years. When it was introduced it was close to identical to the x86
> >   variant. All other architectures use the common code variant in the
> >   meantime. So if we change this I'd prefer that we switch s390 to the
> >   common code variant as well. Right now I can't see a reason for not
> >   doing that
> > 
> > This patch removes s390 bust_spinlocks() and drops the weak attribute
> > from the common bust_spinlocks() version.
> > 
> > [1] lkml.kernel.org/r/20181025062800.GB4037@osiris
> > Signed-off-by: Sergey Senozhatsky 
> > ---
> >  arch/s390/mm/fault.c | 24 
> >  lib/bust_spinlocks.c |  6 +++---
> >  2 files changed, 3 insertions(+), 27 deletions(-)
> 
> I gave this some testing and forced panic/die in interrupt as well as
> process context with different consoles as well as single and multi
> cpu systems. Everything still seems to work.

That was quick ;) Thanks.

> So I'm applying this to our internal queue first. It will hit upstream
> latest in the next merge window if there aren't any issues found.

Sure; sounds like a plan.

-ss

Re: [PATCH] s390/fault: use wake_up_klogd() in bust_spinlocks()

2018-10-25 Thread Sergey Senozhatsky

On (10/25/18 08:28), Heiko Carstens wrote:
> 
> With your patch this looks nearly like the common code variant. I did
> some code archaeology and this function is unchanged since ~17 years.
> When it was introduced it was close to identical to the x86 variant.
> All other architectures use the common code variant in the
> meantime. So if we change this I'd prefer that we switch s390 to the
> common code variant as well.
> 
> Right now I can't see a reason for not doing that, but I might be
> wrong of course. So, could you please provide a new version which just
> removes this variant and makes s390 use the generic one too.
> 
> We'll see if there is any fallout...

Heiko, if this one works for you then I'll submit a format patch.
Let me know. Thanks.



From: Sergey Senozhatsky 
Subject: [PATCH] arch/s390: use common bust_spinlocks()

s390 is the only architecture that is using own bust_spinlocks()
variant, while other arch-s seem to be OK with the common
implementation.

Heiko Carstens [1] said he would prefer s390 to use the common
bust_spinlocks() as well:
  I did some code archaeology and this function is unchanged since ~17
  years. When it was introduced it was close to identical to the x86
  variant. All other architectures use the common code variant in the
  meantime. So if we change this I'd prefer that we switch s390 to the
  common code variant as well. Right now I can't see a reason for not
  doing that

This patch removes s390 bust_spinlocks() and drops the weak attribute
from the common bust_spinlocks() version.

[1] lkml.kernel.org/r/20181025062800.GB4037@osiris
Signed-off-by: Sergey Senozhatsky 
---
 arch/s390/mm/fault.c | 24 
 lib/bust_spinlocks.c |  6 +++---
 2 files changed, 3 insertions(+), 27 deletions(-)

diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 2b8f32f56e0c..11613362c4e7 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -81,30 +81,6 @@ static inline int notify_page_fault(struct pt_regs *regs)
return ret;
 }
 
-
-/*
- * Unlock any spinlocks which will prevent us from getting the
- * message out.
- */
-void bust_spinlocks(int yes)
-{
-   if (yes) {
-   oops_in_progress = 1;
-   } else {
-   int loglevel_save = console_loglevel;
-   console_unblank();
-   oops_in_progress = 0;
-   /*
-* OK, the message is on the console.  Now we call printk()
-* without oops_in_progress set so that printk will give klogd
-* a poke.  Hold onto your hats...
-*/
-   console_loglevel = 15;
-   printk(" ");
-   console_loglevel = loglevel_save;
-   }
-}
-
 /*
  * Find out which address space caused the exception.
  * Access register mode is impossible, ignore space == 3.
diff --git a/lib/bust_spinlocks.c b/lib/bust_spinlocks.c
index ab719495e2cb..8be59f84eaea 100644
--- a/lib/bust_spinlocks.c
+++ b/lib/bust_spinlocks.c
@@ -2,7 +2,8 @@
 /*
  * lib/bust_spinlocks.c
  *
- * Provides a minimal bust_spinlocks for architectures which don't have one of 
their own.
+ * Provides a minimal bust_spinlocks for architectures which don't
+ * have one of their own.
  *
  * bust_spinlocks() clears any spinlocks which would prevent oops, die(), BUG()
  * and panic() information from reaching the user.
@@ -16,8 +17,7 @@
 #include 
 #include 
 
-
-void __attribute__((weak)) bust_spinlocks(int yes)
+void bust_spinlocks(int yes)
 {
if (yes) {
++oops_in_progress;
-- 
2.19.1

Re: [PATCH] s390/fault: use wake_up_klogd() in bust_spinlocks()

2018-10-25 Thread Sergey Senozhatsky

On (10/25/18 08:28), Heiko Carstens wrote:
> 
> With your patch this looks nearly like the common code variant. I did
> some code archaeology and this function is unchanged since ~17 years.
> When it was introduced it was close to identical to the x86 variant.
> All other architectures use the common code variant in the
> meantime. So if we change this I'd prefer that we switch s390 to the
> common code variant as well.
> 
> Right now I can't see a reason for not doing that, but I might be
> wrong of course. So, could you please provide a new version which just
> removes this variant and makes s390 use the generic one too.
> 
> We'll see if there is any fallout...

Heiko, if this one works for you then I'll submit a format patch.
Let me know. Thanks.



From: Sergey Senozhatsky 
Subject: [PATCH] arch/s390: use common bust_spinlocks()

s390 is the only architecture that is using own bust_spinlocks()
variant, while other arch-s seem to be OK with the common
implementation.

Heiko Carstens [1] said he would prefer s390 to use the common
bust_spinlocks() as well:
  I did some code archaeology and this function is unchanged since ~17
  years. When it was introduced it was close to identical to the x86
  variant. All other architectures use the common code variant in the
  meantime. So if we change this I'd prefer that we switch s390 to the
  common code variant as well. Right now I can't see a reason for not
  doing that

This patch removes s390 bust_spinlocks() and drops the weak attribute
from the common bust_spinlocks() version.

[1] lkml.kernel.org/r/20181025062800.GB4037@osiris
Signed-off-by: Sergey Senozhatsky 
---
 arch/s390/mm/fault.c | 24 
 lib/bust_spinlocks.c |  6 +++---
 2 files changed, 3 insertions(+), 27 deletions(-)

diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 2b8f32f56e0c..11613362c4e7 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -81,30 +81,6 @@ static inline int notify_page_fault(struct pt_regs *regs)
return ret;
 }
 
-
-/*
- * Unlock any spinlocks which will prevent us from getting the
- * message out.
- */
-void bust_spinlocks(int yes)
-{
-   if (yes) {
-   oops_in_progress = 1;
-   } else {
-   int loglevel_save = console_loglevel;
-   console_unblank();
-   oops_in_progress = 0;
-   /*
-* OK, the message is on the console.  Now we call printk()
-* without oops_in_progress set so that printk will give klogd
-* a poke.  Hold onto your hats...
-*/
-   console_loglevel = 15;
-   printk(" ");
-   console_loglevel = loglevel_save;
-   }
-}
-
 /*
  * Find out which address space caused the exception.
  * Access register mode is impossible, ignore space == 3.
diff --git a/lib/bust_spinlocks.c b/lib/bust_spinlocks.c
index ab719495e2cb..8be59f84eaea 100644
--- a/lib/bust_spinlocks.c
+++ b/lib/bust_spinlocks.c
@@ -2,7 +2,8 @@
 /*
  * lib/bust_spinlocks.c
  *
- * Provides a minimal bust_spinlocks for architectures which don't have one of 
their own.
+ * Provides a minimal bust_spinlocks for architectures which don't
+ * have one of their own.
  *
  * bust_spinlocks() clears any spinlocks which would prevent oops, die(), BUG()
  * and panic() information from reaching the user.
@@ -16,8 +17,7 @@
 #include 
 #include 
 
-
-void __attribute__((weak)) bust_spinlocks(int yes)
+void bust_spinlocks(int yes)
 {
if (yes) {
++oops_in_progress;
-- 
2.19.1

< 3 4 5 6 7 8 9 10 11 12 >

701 - 800 of 6882 matches

Mail list logo