RE: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-15 Thread David Laight
From: Joe Perches
> Sent: 15 August 2020 00:52
...
> > This is why I think any discussion that says "people should buffer
> > their lines themselves and we should get rid if pr_cont()" is
> > fundamentally broken.
> >
> > Don't go down that hole. I won't take it. It's wrong.
> 
> I don't think it's wrong per se.
> 
> It's reasonable to avoid pr_cont when appropriate.
> 
> Trivial buffering, or adding and using YA vsprintf
> extension can avoid unnecessary message interleaving.
> 
> For instance, I just sent this patch to allow removal
> of print_vma_addr and its use of pr_cont.
> 
> https://lore.kernel.org/lkml/09f11651f0e913e159b955ac447cd8cadf36cb0d.ca...@perches.com/

ISTM that is a bit complex for a printf format.
Even with the 'noinline_for_stack' it is going to
use a lot of stack - and error printfs are already likely
to be near the stack limit.
The recursion for %pV might also cause grief.

In that case formatting the data into an on-stack char[]
before the printf is probably the simplest solution.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-15 Thread Linus Torvalds
On Fri, Aug 14, 2020 at 4:52 PM Joe Perches  wrote:
>
> On Fri, 2020-08-14 at 15:46 -0700, Linus Torvalds wrote:
> >
> > This is why I think any discussion that says "people should buffer
> > their lines themselves and we should get rid if pr_cont()" is
> > fundamentally broken.
> >
> > Don't go down that hole. I won't take it. It's wrong.
>
> I don't think it's wrong per se.

It's *absolutely* and 100% wrong.

Yes, any random *user* of pr_cont() can decide to buffer on it's own.

But when the discussion is about printk() - the implementation, not
the users - then it's complete and utter BS to talk about trying to
get rid of pr_cont().

See the difference?

Linus


Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-15 Thread Joe Perches
On Fri, 2020-08-14 at 19:33 -0700, Linus Torvalds wrote:
> On Fri, Aug 14, 2020 at 4:52 PM Joe Perches  wrote:
> > On Fri, 2020-08-14 at 15:46 -0700, Linus Torvalds wrote:
> > > This is why I think any discussion that says "people should buffer
> > > their lines themselves and we should get rid if pr_cont()" is
> > > fundamentally broken.
> > > 
> > > Don't go down that hole. I won't take it. It's wrong.
> > 
> > I don't think it's wrong per se.
> 
> It's *absolutely* and 100% wrong.
> 
> Yes, any random *user* of pr_cont() can decide to buffer on it's own.

Which I believe is the point of the discussion,
not the complete removal of KERN_CONT.

> But when the discussion is about printk() - the implementation, not
> the users - then it's complete and utter BS to talk about trying to
> get rid of pr_cont().
> 
> See the difference?

Sure, but I fail to see where anyone said get rid of pr_cont
in this thread.  It seems all that was discussed was just
various schemes to improve coalescing output.




Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-15 Thread Sergey Senozhatsky
On (20/08/14 15:46), Linus Torvalds wrote:
> On Thu, Aug 13, 2020 at 4:54 AM Sergey Senozhatsky
>  wrote:
> >
> > I think what Linus said a long time ago was that the initial purpose of
> > pr_cont was
> >
> > pr_info("Initialize feature foo...");
> > if (init_feature_foo() == 0)
> > pr_cont("ok\n");
> > else
> > pr_cont("not ok\n");
> >
> > And if init_feature_foo() crashes the kernel then the first printk()
> > form panic() will flush the cont buffer.
> 
> Right.
> 
> This is why I think any discussion that says "people should buffer
> their lines themselves and we should get rid if pr_cont()" is
> fundamentally broken.

I think what we've been talking about so far was "how do we buffer
cont lines now", what are the problems of current buffering and what
can we do to make that buffering SMP safe (preserving the context of
broken cont buffer, etc. etc.). I don't think that anyone concluded
to just "s/pr_cont/printk/g", I don't see this.

-ss


Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-14 Thread Joe Perches
On Fri, 2020-08-14 at 15:46 -0700, Linus Torvalds wrote:
> On Thu, Aug 13, 2020 at 4:54 AM Sergey Senozhatsky
>  wrote:
> > I think what Linus said a long time ago was that the initial purpose of
> > pr_cont was
> > 
> > pr_info("Initialize feature foo...");
> > if (init_feature_foo() == 0)
> > pr_cont("ok\n");
> > else
> > pr_cont("not ok\n");
> > 
> > And if init_feature_foo() crashes the kernel then the first printk()
> > form panic() will flush the cont buffer.
> 
> Right.
> 
> This is why I think any discussion that says "people should buffer
> their lines themselves and we should get rid if pr_cont()" is
> fundamentally broken.
> 
> Don't go down that hole. I won't take it. It's wrong.

I don't think it's wrong per se.

It's reasonable to avoid pr_cont when appropriate.

Trivial buffering, or adding and using YA vsprintf 
extension can avoid unnecessary message interleaving.

For instance, I just sent this patch to allow removal
of print_vma_addr and its use of pr_cont.

https://lore.kernel.org/lkml/09f11651f0e913e159b955ac447cd8cadf36cb0d.ca...@perches.com/

This is similar to the dump_flags_names removal back
in commit edf14cdbf9a0 ("mm, printk: introduce new format
string for flags")

> The fact is, pr_cont() goes back to the original kernel. No, it wasn't
> pr_cont() back then, and no, there were no actual explicit markers for
> "this is a continuation" at all, it was all just "the last printk
> didn't have a newline, so we continue where we left off".
> 
> We've added pr_cont (and KERN_CONT) since then, and I realize that a
> lot of people hate the complexity it introduces, but it's a
> fundamental complexity that you have to live with.
> 
> If you can't live with pr_cont(), you shouldn't be working on
> printk(), and find some other area of the kernel that you _can_ live
> with.
> 
> It really is that simple.
> 



Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-14 Thread Linus Torvalds
On Thu, Aug 13, 2020 at 4:54 AM Sergey Senozhatsky
 wrote:
>
> I think what Linus said a long time ago was that the initial purpose of
> pr_cont was
>
> pr_info("Initialize feature foo...");
> if (init_feature_foo() == 0)
> pr_cont("ok\n");
> else
> pr_cont("not ok\n");
>
> And if init_feature_foo() crashes the kernel then the first printk()
> form panic() will flush the cont buffer.

Right.

This is why I think any discussion that says "people should buffer
their lines themselves and we should get rid if pr_cont()" is
fundamentally broken.

Don't go down that hole. I won't take it. It's wrong.

The fact is, pr_cont() goes back to the original kernel. No, it wasn't
pr_cont() back then, and no, there were no actual explicit markers for
"this is a continuation" at all, it was all just "the last printk
didn't have a newline, so we continue where we left off".

We've added pr_cont (and KERN_CONT) since then, and I realize that a
lot of people hate the complexity it introduces, but it's a
fundamental complexity that you have to live with.

If you can't live with pr_cont(), you shouldn't be working on
printk(), and find some other area of the kernel that you _can_ live
with.

It really is that simple.

Linus


Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-14 Thread Petr Mladek
On Fri 2020-08-14 10:22:35, John Ogness wrote:
> On 2020-08-14, Sergey Senozhatsky  wrote:
> > One thing that we need to handle here, I believe, is that the context
> > which crashes the kernel should flush its cont buffer, because the
> > information there is relevant to the crash:
> >
> > pr_cont_alloc_info();
> > pr_cont(, "1");
> > pr_cont(, "2");
> > >>
> >oops
> >   panic()
> > <<
> > pr_cont_flush();
> >
> > We better flush that context's pr_cont buffer during panic().
> 
> I am not convinced of the general usefulness of partial messages, but as
> long as we have an API that includes registration, usage, and
> deregistration of some sort of handle, then we leave the window open for
> such implementations.

Registering some handle is an interesting idea. But it rules out
buffers on the stack. And we should avoid dynamic allocation.
printk() must be reliable also when there is not enough memory.

Please, keep it simple and do not add dependencies on new subsystems.
Using external code in printk() is always a call for problems.
The called code must be lockless and must not call printk().

Best Regards,
Petr


Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-14 Thread Petr Mladek
On Thu 2020-08-13 20:54:35, Sergey Senozhatsky wrote:
> On (20/08/13 10:41), Petr Mladek wrote:
> > > My concerns about this idea:
> > > 
> > > - What if the printk user does not correctly terminate the cont message?
> > >   There is no mechanism to allow that open record to be force-finalized
> > >   so that readers can read newer records.
> > 
> > This is a real problem. And it is the reason why the cont buffer is
> > currently flushed (finalized) by the next message from another context.
> 
> I understand that you think that this should be discussed and addressed
> later in a separate patch, but, since we are on pr_cont topic right now,
> can we slow down and maybe re-think what is actually expected from
> pr_cont()? IOW, have the "what is expect from this feature" thread?
> 
> For instance, is missing \n the one and only reason why printk-s from
> another context flush cont buffer now? Because I can see some more reasons
> for current behaviour and I'd like to question those reasons.
> 
> I think what Linus said a long time ago was that the initial purpose of
> pr_cont was
> 
>   pr_info("Initialize feature foo...");
>   if (init_feature_foo() == 0)
>   pr_cont("ok\n");
>   else
>   pr_cont("not ok\n");
> 
>   And if init_feature_foo() crashes the kernel then the first printk()
>   form panic() will flush the cont buffer.
> 
> We can handle this by realizing that new printk() message has LOG_NEWLINE
> and has different log_level (not pr_cont), maybe.

Yes, this is a handy behavior. But it is also complicated
on the implementation side. It requires that consoles are able to
print the existing part of the continuous line and print only
the rest later.

BTW: It used to work before the commit 5c2992ee7fd8a29d041 ("printk:
 remove console flushing special cases for partial buffered
 lines");

BTW2: It will be much easier to implement when only the last
 message can be partially shown on the consoles. Each console
 driver will need to track the position only in one message.

 Also it will be easier when the part of the message if stored
 in the main lockless ring buffer. Then the driver could just
 try to reread the last message and see if it was concatenated.


> Let's look at the more general case:
> 
> CPU0  .. CPU255
> pr_info("text");
>   pr_alert("boom\n");
> pr_cont("1);
> pr_cont("2\n");
> 
> Do we really need to preliminary flush CPU0 pr_cont buffer in this
> case? There is no connection between messages from CPU0 and CPU255.
> Maybe (maybe!) what matters here is keeping the order of messages
> per-context rather than globally system-wide?

Honestly, I have no idea how many newlines are missing. They are often
not noticed because the buffered message is flushed later by some
other one.

The chance that some other "random" message will flush the pending
message is much lower if we have many cont buffers per-context
and per-cpu.

I am not brave enough to add more cont buffers without some fallback
mechanism to flush them later (irq_work?, timer?) or without
audit of all callers.

Where the audit is implicit when all callers are converted to
the buffered printk API.


There is one more problem. Any buffering might cause that nobody
will be able to see the message when things go wrong. Flushing
during panic() might help but only when panic() is called
and when there are system-wide cont buffers.

By other words, the current pr_cont() behavior causes mixed output
from time to time. But it increases the chance to see the messages.
And it makes it easier to find them in crashdump. Perfect output
is nice. But it will not help when the messages gets lost.

All I want to say that it is not black and white.


My opinion:

I will leave the decision on John. If he thinks that converting
all pr_cont() users to a buffered API is easier I will be fine
with it. It was proposed and requested several times.

If John realizes that my proposal to allow to reopen committed
messages is easier, I will be fine with it as well. We could
create the buffered API and convert the most critical
users one by one later. Also the context information will
allow to connect the broken messages in userspace and
do not complicate the kernel side.

Anyway, the lockless printk() feature is more important for
me that a perfect output of continuous lines.

Best Regards,
Petr


Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-14 Thread John Ogness
On 2020-08-14, Sergey Senozhatsky  wrote:
> One thing that we need to handle here, I believe, is that the context
> which crashes the kernel should flush its cont buffer, because the
> information there is relevant to the crash:
>
>   pr_cont_alloc_info();
>   pr_cont(, "1");
>   pr_cont(, "2");
>   >>
>  oops
> panic()
>   <<
>   pr_cont_flush();
>
> We better flush that context's pr_cont buffer during panic().

I am not convinced of the general usefulness of partial messages, but as
long as we have an API that includes registration, usage, and
deregistration of some sort of handle, then we leave the window open for
such implementations.

John Ogness


Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-13 Thread Sergey Senozhatsky
On (20/08/13 12:35), John Ogness wrote:
> I believe I failed to recognize the fundamental problem. The fundamental
> problem is that the pr_cont() semantics are very poor.

The semantics is pretty clear - use it only in UP early bootup,
anything else is broken :)

  /*
   * Annotation for a "continued" line of log printout (only done after a
   * line that had no enclosing \n). Only to be used by core/arch code
   * during early bootup (a continued line is not SMP-safe otherwise).
   */
  #define KERN_CONT KERN_SOH "c"

> I now strongly believe that we need to fix those semantics by having the
> pr_cont() user take responsibility for buffering the message. Patching the
> ~2000 pr_cont() users will be far easier than continuing to twist ourselves
> around this madness.

I welcome this effort. We've been talking about the fact that pr_cont() is
not something we can ignore anymore (we have more and more SMP users of
it) since the Kernel Summit in Santa Fe, NM, but the general response back
then was "oh my god, who cares" (pretty sure this is very close to what Ted
Ts'o said during the printk session).

> Here is an example for a new pr_cont() API:
> 
> struct pr_cont c;
> 
> pr_cont_alloc_info();
>(or alternatively)
> dev_cont_alloc_info(dev, );
> 
> pr_cont(, "1");
> pr_cont(, "2");
> 
> pr_cont_flush();

This might be a bit more complex.

One thing that we need to handle here, I believe, is that the context
which crashes the kernel should flush its cont buffer, because the
information there is relevant to the crash:

pr_cont_alloc_info();
pr_cont(, "1");
pr_cont(, "2");
>>
   oops
  panic()
<<
pr_cont_flush();

We better flush that context's pr_cont buffer during panic().

Another example:


pr_cont_alloc_info();

for (i = 0; i < p->sz; i++)
pr_cont(, p->buf[i]);
>>
   page fault
exit
<<
pr_cont_flush();

I believe we need to preliminary flush pr_cont() in this case as well,
because the information there might be very helpful.

-ss


Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-13 Thread Sergey Senozhatsky
On (20/08/13 10:41), Petr Mladek wrote:
> > My concerns about this idea:
> > 
> > - What if the printk user does not correctly terminate the cont message?
> >   There is no mechanism to allow that open record to be force-finalized
> >   so that readers can read newer records.
> 
> This is a real problem. And it is the reason why the cont buffer is
> currently flushed (finalized) by the next message from another context.

I understand that you think that this should be discussed and addressed
later in a separate patch, but, since we are on pr_cont topic right now,
can we slow down and maybe re-think what is actually expected from
pr_cont()? IOW, have the "what is expect from this feature" thread?

For instance, is missing \n the one and only reason why printk-s from
another context flush cont buffer now? Because I can see some more reasons
for current behaviour and I'd like to question those reasons.

I think what Linus said a long time ago was that the initial purpose of
pr_cont was

pr_info("Initialize feature foo...");
if (init_feature_foo() == 0)
pr_cont("ok\n");
else
pr_cont("not ok\n");

And if init_feature_foo() crashes the kernel then the first printk()
form panic() will flush the cont buffer.

We can handle this by realizing that new printk() message has LOG_NEWLINE
and has different log_level (not pr_cont), maybe.

Let's look at the more general case:

CPU0.. CPU255
pr_info("text");
pr_alert("boom\n");
pr_cont("1);
pr_cont("2\n");

Do we really need to preliminary flush CPU0 pr_cont buffer in this
case? There is no connection between messages from CPU0 and CPU255.
Maybe (maybe!) what matters here is keeping the order of messages
per-context rather than globally system-wide?

-ss


Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-13 Thread Petr Mladek
On Thu 2020-08-13 12:35:47, John Ogness wrote:
> On 2020-08-13, Petr Mladek  wrote:
> > On Thu 2020-08-13 09:50:25, John Ogness wrote:
> >> On 2020-08-13, Sergey Senozhatsky  wrote:
> >> > This is not an unseen pattern, I'm afraid. And the problem here can
> >> > be more general:
> >> >
> >> >  pr_info("text");
> >> >  pr_cont("1");
> >> >  exception/IRQ/NMI
> >> >  pr_alert("text\n");
> >> >  pr_cont("2");
> >> >  pr_cont("\n");
> >> >
> >> > I guess the solution would be to store "last log_level" in task_struct
> >> > and get current (new) timestamp for broken cont line?
> >> 
> >> (Warning: new ideas ahead)
> >> 
> >> The fundamental problem is that there is no real association between
> >> the cont parts. So any interruption results in a broken record. If we
> >> really want to do this correctly, we need real association.
> 
> I believe I failed to recognize the fundamental problem. The fundamental
> problem is that the pr_cont() semantics are very poor. I now strongly
> believe that we need to fix those semantics by having the pr_cont() user
> take responsibility for buffering the message. Patching the ~2000
> pr_cont() users will be far easier than continuing to twist ourselves
> around this madness.

Yes, this would make the API much more reliable. But it is not completely
easy. It has been discussed several times. The most serious attempt
is 
https://lore.kernel.org/lkml/1541165517-3557-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp/#t

lockdep.c is probably the most complicated pr_cont() user. The
conversion complicated the code a bit, see
https://lore.kernel.org/lkml/1541165517-3557-3-git-send-email-penguin-ker...@i-love.sakura.ne.jp/

Linus also wanted to have an API to flush part of the message
before some action and finalize it later. It would be usable
for things like:

pr_info("Doing something dangerous... ");
printk_flush();
do_dangerous_thing();
pr_cont("ok\n");

Where pr_cont() would allow to flush part of the message to the
consoles and add the end of the line later when it did not crash.

It would need some support in the lockless ring buffer API anyway.
But it is only nice to have from my POV.


> The alloc function would need to work for any context, but that would
> not be an issue. If the cont message started to get too large, pr_cont()
> could do its own flushing in between, while still holding on to the
> context information. If for some reason the alloc function could not
> allocate a buffer, all the pr_cont() calls could fallback to logging the
> individual cont parts.

Also this was discussed several times. The most acceptable solution
was to use "small" buffer on the stack (128 bytes or so).


> I believe this would solve all cont-related problems while also allowing
> the new ringbuffer to remain as it already is in linux-next.

You are right. The buffering is really good step. It would make
the API sane and reliable. It has been proposed and discussed many
times. The main obstacle was that there was no volunteer to get this
done. If you are able to do so, it would be amazing.

I am just not sure what is more complicated. Getting lockless
operation correctly is always tricky. But going though 2000
callers individually is also a lot of work.

Best Regards,
Petr


Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-13 Thread John Ogness
On 2020-08-13, Petr Mladek  wrote:
> On Thu 2020-08-13 09:50:25, John Ogness wrote:
>> On 2020-08-13, Sergey Senozhatsky  wrote:
>> > This is not an unseen pattern, I'm afraid. And the problem here can
>> > be more general:
>> >
>> >pr_info("text");
>> >pr_cont("1");
>> >exception/IRQ/NMI
>> >pr_alert("text\n");
>> >pr_cont("2");
>> >pr_cont("\n");
>> >
>> > I guess the solution would be to store "last log_level" in task_struct
>> > and get current (new) timestamp for broken cont line?
>> 
>> (Warning: new ideas ahead)
>> 
>> The fundamental problem is that there is no real association between
>> the cont parts. So any interruption results in a broken record. If we
>> really want to do this correctly, we need real association.

I believe I failed to recognize the fundamental problem. The fundamental
problem is that the pr_cont() semantics are very poor. I now strongly
believe that we need to fix those semantics by having the pr_cont() user
take responsibility for buffering the message. Patching the ~2000
pr_cont() users will be far easier than continuing to twist ourselves
around this madness.

Here is an example for a new pr_cont() API:

struct pr_cont c;

pr_cont_alloc_info();
   (or alternatively)
dev_cont_alloc_info(dev, );

pr_cont(, "1");
pr_cont(, "2");

pr_cont_flush();

Using macro magic, there can be the usual dbg, warn, err, etc. variants
of the alloc functions.

The alloc function would need to work for any context, but that would
not be an issue. If the cont message started to get too large, pr_cont()
could do its own flushing in between, while still holding on to the
context information. If for some reason the alloc function could not
allocate a buffer, all the pr_cont() calls could fallback to logging the
individual cont parts.

I believe this would solve all cont-related problems while also allowing
the new ringbuffer to remain as it already is in linux-next.

John Ogness


Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-13 Thread Petr Mladek
On Thu 2020-08-13 09:50:25, John Ogness wrote:
> On 2020-08-13, Sergey Senozhatsky  wrote:
> > This is not an unseen pattern, I'm afraid. And the problem here can
> > be more general:
> >
> > pr_info("text");
> > pr_cont("1");
> > exception/IRQ/NMI
> > pr_alert("text\n");
> > pr_cont("2");
> > pr_cont("\n");
> >
> > I guess the solution would be to store "last log_level" in task_struct
> > and get current (new) timestamp for broken cont line?
> 
> (Warning: new ideas ahead)
> 
> The fundamental problem is that there is no real association between
> the cont parts. So any interruption results in a broken record. If we
> really want to do this correctly, we need real association.
> 
> With the new finalize flag for records, I thought about perhaps adding
> support for chaining data blocks.
> 
> A data block currently stores an unsigned long for the ID of the
> associated descriptor. But it could optionally include a second unsigned
> long, which is the lpos of the next text part. All the data blocks of a
> chain would point back to the same descriptor. The descriptor would only
> point to the first data block of the chain and include a flag that it is
> using chained data blocks.
>
> Then we would only need to track the sequence number of the open record
> and new data blocks could be added to the data block chain of the
> correct record. Readers cannot see the record until it is finalized.

Note that we still must try to append the continuous piece into the
same data block. We could not concatenate them in the reader API
because it would create the problem with unused/skipped sequence numbers.

Also it would complicate the reader code. It would need to know
whether a record has already been printed together with a previous
one.


> Also, since only finalized records can be invalidated, there are no
> races of chains becoming invalidated while being appended.
> 
> My concerns about this idea:
> 
> - What if the printk user does not correctly terminate the cont message?
>   There is no mechanism to allow that open record to be force-finalized
>   so that readers can read newer records.

This is a real problem. And it is the reason why the cont buffer is
currently flushed (finalized) by the next message from another context.


> - For tasks, the sequence number of the open record could be stored on
>   the task_struct. For non-tasks, we could use a global per-cpu variable
>   where each CPU stores 2 sequence numbers: the sequence number of the
>   open record for the non-task and the sequence number of the open
>   record for an interrupting NMI. Is that sufficient?

Yeah, it would be possible.

Anyway, this would fix an already existing problem. It might get
complicated and tricky. I am afraid that it comes from the "perfection
is the enemy of good" department.

A good compromise might be to just store "loglevel" from the previous
message in the given context. It could then be used for pr_cont()
messages in the same context.

Anyway, I would solve this later because it is already existing
problem.

Best Regards,
Petr


Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-13 Thread Petr Mladek
On Thu 2020-08-13 14:18:53, Sergey Senozhatsky wrote:
> On (20/08/13 02:30), John Ogness wrote:
> > 2. I haven't yet figured out how to preserve calling context when a
> > newline appears. For example:
> > 
> > pr_info("text");
> > pr_cont(" 1");
> > pr_cont(" 2\n");
> > pr_cont("3");
> > pr_cont(" 4\n");
> >
> > For "3" the calling context (info, timestamp) is lost because with "2"
> > the record is finalized. Perhaps the above is invalid usage of LOG_CONT?

If I get it correctly, the original code has the same problem.

The cont buffer is flushed when the cont piece ends with newline:

static bool cont_add(u32 caller_id, int facility, int level,
 enum log_flags flags, const char *text, size_t len)
{
   [...]

// The original flags come from the first line,
// but later continuations can add a newline.
if (flags & LOG_NEWLINE) {
cont.flags |= LOG_NEWLINE;
cont_flush();
}

return true;
}

cont_flush sets cont.len = 0;

static void cont_flush(void)
{
[...]
cont.len = 0;
}


The messages is appended only when cont.len != 0 in log_output:

static size_t log_output(int facility, int level, enum log_flags lflags, const 
char *dict, size_t dictlen, char *text, size_t text_len)
{
const u32 caller_id = printk_caller_id();

/*
 * If an earlier line was buffered, and we're a continuation
 * write from the same context, try to add it to the buffer.
 */
if (cont.len) {
if (cont.caller_id == caller_id && (lflags & LOG_CONT)) {
if (cont_add(caller_id, facility, level, lflags, text, 
text_len))
return text_len;
}

  [...]
}


Also the original context is overridden when the cont buffer is empty:

static bool cont_add(u32 caller_id, int facility, int level,
 enum log_flags flags, const char *text, size_t len)
{
[...]

if (!cont.len) {
cont.facility = facility;
cont.level = level;
cont.caller_id = caller_id;
cont.ts_nsec = local_clock();
cont.flags = flags;
}

[...]
}

So I would ignore this problem for now.


> This is not an unseen pattern, I'm afraid. And the problem here can
> be more general:
> 
>   pr_info("text");
>   pr_cont("1");
>   exception/IRQ/NMI
>   pr_alert("text\n");
>   pr_cont("2");
>   pr_cont("\n");
> 

Good point.


> I guess the solution would be to store "last log_level" in task_struct
> and get current (new) timestamp for broken cont line?

I think about storing the context in per-CPU and per-context array.
It should be more memory efficient than task_struct and it should
solve even the above problem.


> We have this problem now. E.g. early boot messages from one of my boxes:
> 
> 6,173,41837,-;x86: Booting SMP configuration:
> 6,174,41838,-; node  #0, CPUs:  #1 #2 #3 #4
> 4,175,44993,-;MDS CPU bug present and SMT on, data leak possible. See 
> https://www.kernel.org/doc/...
> 4,176,44993,c; #5 #6 #7
> 
> "CPUs: #1 #2 #3 #4 #5 #6 #7" is supposed to be one cont line with
> loglevel 6. But it gets interrupted and flushed so that "#5 #6 #7"
> appears with the different loglevel.

Nice example. It would be nice to fix this. But it should be done
separately.

Best Regards,
Petr


Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-13 Thread John Ogness
On 2020-08-13, Sergey Senozhatsky  wrote:
> This is not an unseen pattern, I'm afraid. And the problem here can
> be more general:
>
>   pr_info("text");
>   pr_cont("1");
>   exception/IRQ/NMI
>   pr_alert("text\n");
>   pr_cont("2");
>   pr_cont("\n");
>
> I guess the solution would be to store "last log_level" in task_struct
> and get current (new) timestamp for broken cont line?

(Warning: new ideas ahead)

The fundamental problem is that there is no real association between
the cont parts. So any interruption results in a broken record. If we
really want to do this correctly, we need real association.

With the new finalize flag for records, I thought about perhaps adding
support for chaining data blocks.

A data block currently stores an unsigned long for the ID of the
associated descriptor. But it could optionally include a second unsigned
long, which is the lpos of the next text part. All the data blocks of a
chain would point back to the same descriptor. The descriptor would only
point to the first data block of the chain and include a flag that it is
using chained data blocks.

Then we would only need to track the sequence number of the open record
and new data blocks could be added to the data block chain of the
correct record. Readers cannot see the record until it is finalized.

Also, since only finalized records can be invalidated, there are no
races of chains becoming invalidated while being appended.

My concerns about this idea:

- What if the printk user does not correctly terminate the cont message?
  There is no mechanism to allow that open record to be force-finalized
  so that readers can read newer records.

- For tasks, the sequence number of the open record could be stored on
  the task_struct. For non-tasks, we could use a global per-cpu variable
  where each CPU stores 2 sequence numbers: the sequence number of the
  open record for the non-task and the sequence number of the open
  record for an interrupting NMI. Is that sufficient?

John Ogness


Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-13 Thread Petr Mladek
On Thu 2020-08-13 02:30:02, John Ogness wrote:
> On 2020-08-12, Petr Mladek  wrote:
> > So, I have one crazy idea to add one more state bit so that we
> > could have:
> >
> >   + committed: set when the data are written into the data ring.
> >   + final: set when the data block could not longer get reopened
> >   + reuse: set when the desctiptor/data block could get reused
> >
> > "final" bit will define when the descriptor could not longer
> > get reopened (cleared committed bit) and the data block could
> > not get extended.
> 
> I implemented this solution for myself and am currently running more
> tests. Some things that I changed from your suggestion:
> 
> 1. I created a separate prb_reserve_cont() function. The reason for this
> is because the caller needs to understand what is happening. The caller
> is getting an existing record with existing data and must append new
> data. The @text_len field of the info reports how long the existing data
> is.

I see.

> So the LOG_CONT handling code in printk.c looks something like this:

Yeah, it makes sense.


> 2. I haven't yet figured out how to preserve calling context when a
> newline appears. For example:
> 
> pr_info("text");
> pr_cont(" 1");
> pr_cont(" 2\n");
> pr_cont("3");
> pr_cont(" 4\n");
> 
> For "3" the calling context (info, timestamp) is lost because with "2"
> the record is finalized. Perhaps the above is invalid usage of LOG_CONT?

I am going to answer Sergey's reply.


> 3. There are some memory barriers introduced, but it looks like it
> shouldn't add too much complexity.

Uff. I have hooped for this.

> I will continue to refine my working version and post a patch so that we
> have something to work with. This looks to be the most promising way
> forward. Thanks.

Uff, I am happy that it seems working. Several other approaches looked
much more complicated or they caused regressions.

Best Regards,
Petr


Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-12 Thread Sergey Senozhatsky
On (20/08/13 02:30), John Ogness wrote:
> 2. I haven't yet figured out how to preserve calling context when a
> newline appears. For example:
> 
> pr_info("text");
> pr_cont(" 1");
> pr_cont(" 2\n");
> pr_cont("3");
> pr_cont(" 4\n");
>
> For "3" the calling context (info, timestamp) is lost because with "2"
> the record is finalized. Perhaps the above is invalid usage of LOG_CONT?

This is not an unseen pattern, I'm afraid. And the problem here can
be more general:

pr_info("text");
pr_cont("1");
exception/IRQ/NMI
pr_alert("text\n");
pr_cont("2");
pr_cont("\n");

I guess the solution would be to store "last log_level" in task_struct
and get current (new) timestamp for broken cont line?

We have this problem now. E.g. early boot messages from one of my boxes:

6,173,41837,-;x86: Booting SMP configuration:
6,174,41838,-; node  #0, CPUs:  #1 #2 #3 #4
4,175,44993,-;MDS CPU bug present and SMT on, data leak possible. See 
https://www.kernel.org/doc/...
4,176,44993,c; #5 #6 #7

"CPUs: #1 #2 #3 #4 #5 #6 #7" is supposed to be one cont line with
loglevel 6. But it gets interrupted and flushed so that "#5 #6 #7"
appears with the different loglevel.

-ss


Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-12 Thread John Ogness
On 2020-08-12, Petr Mladek  wrote:
> So, I have one crazy idea to add one more state bit so that we
> could have:
>
>   + committed: set when the data are written into the data ring.
>   + final: set when the data block could not longer get reopened
>   + reuse: set when the desctiptor/data block could get reused
>
> "final" bit will define when the descriptor could not longer
> get reopened (cleared committed bit) and the data block could
> not get extended.

I had not thought of extending data blocks. That is clever!

I implemented this solution for myself and am currently running more
tests. Some things that I changed from your suggestion:

1. I created a separate prb_reserve_cont() function. The reason for this
is because the caller needs to understand what is happening. The caller
is getting an existing record with existing data and must append new
data. The @text_len field of the info reports how long the existing data
is. So the LOG_CONT handling code in printk.c looks something like this:

if (lflags & LOG_CONT) {
struct prb_reserved_entry e;
struct printk_record r;

prb_rec_init_wr(, text_len, 0);

if (prb_reserve_cont(, prb, , caller_id)) {
memcpy(_buf[r.info->text_len], text, text_len);
r.info->text_len += text_len;

if (lflags & LOG_NEWLINE)
r.info->flags |= LOG_NEWLINE;

if (r.info->flags & LOG_NEWLINE)
prb_commit_finalize();
else
prb_commit();

return text_len;
}
}

This seemed simpler than trying to extend prb_reserve() to secretly
support LOG_CONT records.

2. I haven't yet figured out how to preserve calling context when a
newline appears. For example:

pr_info("text");
pr_cont(" 1");
pr_cont(" 2\n");
pr_cont("3");
pr_cont(" 4\n");

For "3" the calling context (info, timestamp) is lost because with "2"
the record is finalized. Perhaps the above is invalid usage of LOG_CONT?

3. There are some memory barriers introduced, but it looks like it
shouldn't add too much complexity.

I will continue to refine my working version and post a patch so that we
have something to work with. This looks to be the most promising way
forward. Thanks.

John Ogness


POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-12 Thread Petr Mladek
On Tue 2020-08-11 18:05:51, Petr Mladek wrote:
> On Sat 2020-07-18 16:48:55, John Ogness wrote:
> > On 2020-07-17, Linus Torvalds  wrote:
> > > Make sure you test the case of "fast concurrent readers". The last
> > > time we did things like this, it was a disaster, because a concurrent
> > > reader would see and return the _incomplete_ line, and the next entry
> > > was still being generated on another CPU.
> > >
> > > The reader would then decide to return that incomplete line, because
> > > it had something.
> > >
> > > And while in theory this could then be handled properly in user space,
> > > in practice it wasn't. So you'd see a lot of logging tools that would
> > > then report all those continuations as separate log events.
> > >
> > > Which is the whole point of LOG_CONT - for that *not* to happen.
> > 
> > I expect this is handled correctly since the reader is not given any
> > parts until a full line is ready, but I will put more focus on testing
> > this to make sure. Thanks for the regression and testing tips.
> 
> Hmm, the current patchset has different problem. The continuation
> pieces are correctly passed as a single lines. But empty line is
> printed for each unused sequence number to avoid warnings about
> missed messages in journactl. It looks like:
> 
> I am afraid that the only working solution is to store all pieces
> in a single lockless transaction. I think that John already
> proposed using 2nd small lockless buffer for this. The problem
> might be how to synchronize flushing the pieces into the final
> buffer.

Do not panic! It might look scary. But I am less scared
after I wrote some pieces of the pseudo code.

So, I have one crazy idea to add one more state bit so that we
could have:

  + committed: set when the data are written into the data ring.
  + final: set when the data block could not longer get reopened
  + reuse: set when the desctiptor/data block could get reused


"final" bit will define when the descriptor could not longer
get reopened (cleared committed bit) and the data block could
not get extended.

The logic would be the following:

bool prb_reserve() {

desc = try_reopen_desc(seq);
if (desc) {
text_buf = data_alloc_continuous();
if (text_buf)
goto success;
else
/* commit the reopened desc back again */
prb_commit(desc);
}

/* Otherwise, do as before */
desc = desc_reserve();
if (!desc)
goto fail;

text_buf = data_alloc();
...

where:

static struct prb_desc *try_reopen_desc(seq)
{
struct prb_desc *desc;

enum desc_state d_state;
struct prb_desc desc;

d_state = desc_read(desc_ring, seq, );
if (d_state != committed_and_not_finalized)
return NULL;

if (!is_same_context(desc))
return NULL;

/* try to reopen only when the state is still the same */
if(!atomic_long_cmpxchg_relaxed(state_var,
val_committed_and_not_finished,
val_reserved))
return NULL;

return desc;
}

static char *data_alloc_continuous()
{
/*
 * Same as data_alloc() with one added parameter:
 * unsigned long requested_begin_lpos;
 */

begin_lpos = atomic_long_read(_ring->head_lpos);

do {
if (begin_lpos != requested_begin_lpos)
return NULL;

... same as before

} while (!atomic_long_try_cmpxchg(_ring->head_lpos, _lpos,
  next_lpos)); /* LMM(data_alloc:A) */

if (requested_begin_lpos) {
/* only update tail lpos */
blk_lpos->next = next_lpos;
/* return pointer to the new data space */
return >data[0];
}

/* For completely new block do everything as before */
blk = to_block(data_ring, begin_lpos);
blk->id = id; /* LMM(data_alloc:B) */
...
}

void prb_commit_and_finalize()
{
/* Same as prb_commit() + it will set also 'final' bit */
}


Addintional changes in the code:

+ desc_resrved() will also set 'final' bit in the previous
  descriptor so that the descriptor could not longer get reopended
  once committed.

+ prb_commit_and_finalize() will be called instead of prb_commit()
  when the message ends with '\n'.

+ prb_read() will allow to read the data only when
  the state is "committed_and_finalized".

+ desc_make_reusable() can be called only when the desciptor
  is in "commited_and_finalized" state.


I am not sure if it is everything. Also it might need some code
refactoring.

But it looks like it might work. And it should not require new barriers.