Re: misc details: Re: [PATCH 2/2] printk: use the lockless ringbuffer

2020-02-26 Thread Petr Mladek
On Tue 2020-02-25 21:11:31, John Ogness wrote:
> >> --- a/kernel/printk/printk.c
> >> +++ b/kernel/printk/printk.c
> >> - * Every record carries the monotonic timestamp in microseconds, as well 
> >> as
> >> - * the standard userspace syslog level and syslog facility. The usual
> >> + * Every record meta-data carries the monotonic timestamp in 
> >> microseconds, as
> >
> > I am afraid that we could not guarantee monotonic timestamp because
> > the writers are not synchronized. I hope that it will not create
> > real problems and we could just remove the word "monotonic" ;-)
> 
> I removed "monotonic". I hope userspace doesn't require the ringbuffer
> to be chronologically sorted. That would explain why the safe buffers
> use bogus timestamps. :-/

The timestamp was not stored into the safe buffers to keep the code
simple. And people request to add the proper timestamps from time
to time.

IMHO, the precise timestamps are more important than ordering. So
people should love the lockless ringbuffer from this POV ;-)


> >> @@ -1974,9 +1966,9 @@ asmlinkage int vprintk_emit(int facility, int level,
> >>  
> >>/* This stops the holder of console_sem just where we want him */
> >>logbuf_lock_irqsave(flags);
> >> -  curr_log_seq = log_next_seq;
> >> +  pending_output = !prb_read_valid(prb, console_seq, NULL);
> >>printed_len = vprintk_store(facility, level, dict, dictlen, fmt, args);
> >> -  pending_output = (curr_log_seq != log_next_seq);
> >> +  pending_output &= prb_read_valid(prb, console_seq, NULL);
> >
> > The original code checked whether vprintk_store() stored the text
> > into the main log buffer or only into the cont buffer.
> >
> > The new code checks whether console is behind which is something
> > different.
> 
> I would argue that they are the same thing in this context. Keep in mind
> that we are under the logbuf_lock. If there was previously nothing
> pending and now there is, this context is the only one that could have
> added it.

Right.

> This logic will change significantly when we remove the locks (and it
> will disappear once we go to kthreads). But we aren't that far at this
> stage and I'd like to keep the general logic somewhat close to the
> current mainline implementation for now.

OK, it is not a big deal from my POV. It is just an optimization.
It can be removed or improved later.

It caught my eyes primary because prb_read_valid() was relatively
complex function. I was not sure if it was worth the effort. But
I am fine with keeping your code for now. It will help to reduce
unrelated behavior changes.

Best Regards,
Petr

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: misc details: Re: [PATCH 2/2] printk: use the lockless ringbuffer

2020-02-25 Thread John Ogness
>> - Record meta-data is now stored in a separate array of descriptors.
>>   This is an additional 72 * (2 ^ ((CONFIG_LOG_BUF_SHIFT - 6))) bytes
>>   for the static array and 72 * (2 ^ ((log_buf_len - 6))) bytes for
>>   the dynamic array.
>
> It might help to show some examples. I mean to mention the sizes
> when CONFIG_LOG_BUF_SHIFT is 12 or so.

OK.

>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> - * Every record carries the monotonic timestamp in microseconds, as well as
>> - * the standard userspace syslog level and syslog facility. The usual
>> + * Every record meta-data carries the monotonic timestamp in microseconds, 
>> as
>
> I am afraid that we could not guarantee monotonic timestamp because
> the writers are not synchronized. I hope that it will not create
> real problems and we could just remove the word "monotonic" ;-)

I removed "monotonic". I hope userspace doesn't require the ringbuffer
to be chronologically sorted. That would explain why the safe buffers
use bogus timestamps. :-/

>> +/*
>> + * Define the average message size. This only affects the number of
>> + * descriptors that will be available. Underestimating is better than
>> + * overestimating (too many available descriptors is better than not 
>> enough).
>> + * The dictionary buffer will be the same size as the text buffer.
>> + */
>> +#define PRB_AVGBITS 6
>
> Do I get it correctly that '6' means 2^6 = 64 characters?

Correct.

> Some ugly counting on my test systems shows the average 49 chars:
>
> $> dmesg | cut -d ']' -f 2- | wc -c
> 30172
> $> dmesg | cut -d ']' -f 2- | wc -l
> 612
> $> echo $((30172 / 612))
> 49
>
> If I get it correctly then lower number is the more safe side.
> So, a more safe default should be 5?

For v2 the value will be lowered to 5.

>> -if (log_make_free_space(size)) {
>> +if (!prb_reserve(, prb, )) {
>>  /* truncate the message if it is too long for empty buffer */
>> -size = truncate_msg(_len, _msg_len,
>> -_len, _len);
>> +truncate_msg(_len, _msg_len, _len);
>> +r.text_buf_size = text_len + trunc_msg_len;

Note that the additional space for the trunc_msg_len is being reserved.

>> +r.dict_buf_size = dict_len;
>>  /* survive when the log buffer is too small for trunc_msg */
>> -if (log_make_free_space(size))
>> +if (!prb_reserve(, prb, ))
>>  return 0;
>>  }
>>  
>> -if (log_next_idx + size + sizeof(struct printk_log) > log_buf_len) {
>> -/*
>> - * This message + an additional empty header does not fit
>> - * at the end of the buffer. Add an empty header with len == 0
>> - * to signify a wrap around.
>> - */
>> -memset(log_buf + log_next_idx, 0, sizeof(struct printk_log));
>> -log_next_idx = 0;
>> -}
>> -
>>  /* fill message */
>> -msg = (struct printk_log *)(log_buf + log_next_idx);
>> -memcpy(log_text(msg), text, text_len);
>> -msg->text_len = text_len;
>> -if (trunc_msg_len) {
>> -memcpy(log_text(msg) + text_len, trunc_msg, trunc_msg_len);
>> -msg->text_len += trunc_msg_len;
>
> Note that the old code updates msg->text_len.

msg->text_len is equivalent to r.info->text_len, which was already set
by the prb_reserve() (and already includes the trunc_msg_len).

>> -}
>> -memcpy(log_dict(msg), dict, dict_len);
>> -msg->dict_len = dict_len;
>> -msg->facility = facility;
>> -msg->level = level & 7;
>> -msg->flags = flags & 0x1f;
>> +memcpy(_buf[0], text, text_len);
>> +if (trunc_msg_len)
>> +memcpy(_buf[text_len], trunc_msg, trunc_msg_len);
>
> The new one just appends the string.

That is all it needs to do here.

>> +if (r.dict_buf)
>> +memcpy(_buf[0], dict, dict_len);
>> +r.info->facility = facility;
>> +r.info->level = level & 7;
>> +r.info->flags = flags & 0x1f;
>>  if (ts_nsec > 0)
>> -msg->ts_nsec = ts_nsec;
>> +r.info->ts_nsec = ts_nsec;
>>  else
>> -msg->ts_nsec = local_clock();
>> -#ifdef CONFIG_PRINTK_CALLER
>> -msg->caller_id = caller_id;
>> -#endif
>> -memset(log_dict(msg) + dict_len, 0, pad_len);
>> -msg->len = size;
>> +r.info->ts_nsec = local_clock();
>> +r.info->caller_id = caller_id;
>>  
>>  /* insert message */
>> -log_next_idx += msg->len;
>> -log_next_seq++;
>> +prb_commit();
>>  
>> -return msg->text_len;
>> +return text_len;
>
> So, this should be text_len + trunc_msg_len.

Good catch! Yes. Fixed for v2. Thank you.

(Note that simply returning r.info->text_len is not allowed because the
writer must not access that data after calling prb_commit()).

>> @@ -1974,9 +1966,9 @@ asmlinkage int vprintk_emit(int facility, int level,
>>  
>>  /* This stops the holder of console_sem just where we want him */

Re: [PATCH 2/2] printk: use the lockless ringbuffer

2020-02-18 Thread lijiang
在 2020年01月29日 00:19, John Ogness 写道:
> Replace the existing ringbuffer usage and implementation with
> lockless ringbuffer usage. Even though the new ringbuffer does not
> require locking, all existing locking is left in place. Therefore,
> this change is purely replacing the underlining ringbuffer.
> 
> Changes that exist due to the ringbuffer replacement:
> 
> - The VMCOREINFO has been updated for the new structures.
> 
> - Dictionary data is now stored in a separate data buffer from the
>   human-readable messages. The dictionary data buffer is set to the
>   same size as the message buffer. Therefore, the total reserved
>   memory for messages is 2 * (2 ^ CONFIG_LOG_BUF_SHIFT) for the
>   initial static buffer and 2x the specified size in the log_buf_len
>   kernel parameter.
> 
> - Record meta-data is now stored in a separate array of descriptors.
>   This is an additional 72 * (2 ^ ((CONFIG_LOG_BUF_SHIFT - 6))) bytes
>   for the static array and 72 * (2 ^ ((log_buf_len - 6))) bytes for
>   the dynamic array.
> 
> Signed-off-by: John Ogness 
> ---
>  include/linux/kmsg_dump.h |   2 -
>  kernel/printk/Makefile|   1 +
>  kernel/printk/printk.c| 836 +++---
>  3 files changed, 416 insertions(+), 423 deletions(-)
> 
> diff --git a/include/linux/kmsg_dump.h b/include/linux/kmsg_dump.h
> index 2e7a1e032c71..ae6265033e31 100644
> --- a/include/linux/kmsg_dump.h
> +++ b/include/linux/kmsg_dump.h
> @@ -46,8 +46,6 @@ struct kmsg_dumper {
>   bool registered;
>  
>   /* private state of the kmsg iterator */
> - u32 cur_idx;
> - u32 next_idx;
>   u64 cur_seq;
>   u64 next_seq;
>  };
> diff --git a/kernel/printk/Makefile b/kernel/printk/Makefile
> index 4d052fc6bcde..eee3dc9b60a9 100644
> --- a/kernel/printk/Makefile
> +++ b/kernel/printk/Makefile
> @@ -2,3 +2,4 @@
>  obj-y= printk.o
>  obj-$(CONFIG_PRINTK) += printk_safe.o
>  obj-$(CONFIG_A11Y_BRAILLE_CONSOLE)   += braille.o
> +obj-$(CONFIG_PRINTK) += printk_ringbuffer.o
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 1ef6f75d92f1..d0d24ee1d1f4 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -56,6 +56,7 @@
>  #define CREATE_TRACE_POINTS
>  #include 
>  
> +#include "printk_ringbuffer.h"
>  #include "console_cmdline.h"
>  #include "braille.h"
>  #include "internal.h"
> @@ -294,30 +295,22 @@ enum con_msg_format_flags {
>  static int console_msg_format = MSG_FORMAT_DEFAULT;
>  
>  /*
> - * The printk log buffer consists of a chain of concatenated variable
> - * length records. Every record starts with a record header, containing
> - * the overall length of the record.
> + * The printk log buffer consists of a sequenced collection of records, each
> + * containing variable length message and dictionary text. Every record
> + * also contains its own meta-data (@info).
>   *
> - * The heads to the first and last entry in the buffer, as well as the
> - * sequence numbers of these entries are maintained when messages are
> - * stored.
> - *
> - * If the heads indicate available messages, the length in the header
> - * tells the start next message. A length == 0 for the next message
> - * indicates a wrap-around to the beginning of the buffer.
> - *
> - * Every record carries the monotonic timestamp in microseconds, as well as
> - * the standard userspace syslog level and syslog facility. The usual
> + * Every record meta-data carries the monotonic timestamp in microseconds, as
> + * well as the standard userspace syslog level and syslog facility. The usual
>   * kernel messages use LOG_KERN; userspace-injected messages always carry
>   * a matching syslog facility, by default LOG_USER. The origin of every
>   * message can be reliably determined that way.
>   *
> - * The human readable log message directly follows the message header. The
> - * length of the message text is stored in the header, the stored message
> - * is not terminated.
> + * The human readable log message of a record is available in @text, the 
> length
> + * of the message text in @text_len. The stored message is not terminated.
>   *
> - * Optionally, a message can carry a dictionary of properties (key/value 
> pairs),
> - * to provide userspace with a machine-readable message context.
> + * Optionally, a record can carry a dictionary of properties (key/value 
> pairs),
> + * to provide userspace with a machine-readable message context. The length 
> of
> + * the dictionary is available in @dict_len. The dictionary is not 
> terminated.
>   *
>   * Examples for well-defined, commonly used property names are:
>   *   DEVICE=b12:8   device identifier
> @@ -331,21 +324,19 @@ static int console_msg_format = MSG_FORMAT_DEFAULT;
>   * follows directly after a '=' character. Every property is terminated by
>   * a '\0' character. The last property is not terminated.
>   *
> - * Example of a message structure:
> - *     ff 8f 00 00 00 00 00 00  monotonic time in nsec
> 

Re: crashdump: Re: [PATCH 2/2] printk: use the lockless ringbuffer

2020-02-17 Thread John Ogness
On 2020-02-17, Petr Mladek  wrote:
>>> Should the "prb"(printk tb static) symbol be exported into
>>> vmcoreinfo?  Otherwise, do you happen to know how to walk through
>>> the log_buf and get all kernel logs from vmcore?
>> 
>> You are correct. This will need to be exported as well so that the
>> descriptors can be accessed. (log_buf is only the pure human-readable
>> text.) I am currently hacking the crash tool to see exactly what
>> needs to be made available in order to access all the data of the
>> ringbuffer.
>
> I am not sure which parts you are working on. Are you going to provide
> also patch for makedumpfile, please?

I'm working on crash first. makedumpfile is on my list as well.

> I get the following failure when creating the crashdump using:
>
> echo c >/proc/sysrq-trigger
>
>
> The kernel version is not supported.
> The makedumpfile operation may be incomplete.
> dump_dmesg: Can't find variable-length record symbols
> makedumpfile Failed.
> Running makedumpfile --dump-dmesg /proc/vmcore failed (1).

Yes, the symbols have changed (and some are missing). I will get this
sorted out for v2. And I will provide some heavily hacked code for crash
and makedumpfile to show that the necessary symbols are there and it
works.

John Ogness

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


crashdump: Re: [PATCH 2/2] printk: use the lockless ringbuffer

2020-02-17 Thread Petr Mladek
On Fri 2020-02-14 14:50:02, John Ogness wrote:
> Hi Lianbo,
> 
> On 2020-02-14, lijiang  wrote:
> >> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> >> index 1ef6f75d92f1..d0d24ee1d1f4 100644
> >> --- a/kernel/printk/printk.c
> >> +++ b/kernel/printk/printk.c
> >> @@ -1062,21 +928,16 @@ void log_buf_vmcoreinfo_setup(void)
> >>  {
> >>VMCOREINFO_SYMBOL(log_buf);
> >>VMCOREINFO_SYMBOL(log_buf_len);
> >
> > I notice that the "prb"(printk tb static) symbol is not exported into
> > vmcoreinfo as follows:
> >
> > +   VMCOREINFO_SYMBOL(prb);
> >
> > Should the "prb"(printk tb static) symbol be exported into vmcoreinfo?
> > Otherwise, do you happen to know how to walk through the log_buf and
> > get all kernel logs from vmcore?
> 
> You are correct. This will need to be exported as well so that the
> descriptors can be accessed. (log_buf is only the pure human-readable
> text.) I am currently hacking the crash tool to see exactly what needs
> to be made available in order to access all the data of the ringbuffer.

I am not sure which parts you are working on. Are you going to provide
also patch for makedumpfile, please? I get the following failure when
creating the crashdump using:

echo c >/proc/sysrq-trigger


The kernel version is not supported.
The makedumpfile operation may be incomplete.
dump_dmesg: Can't find variable-length record symbols
makedumpfile Failed.
Running makedumpfile --dump-dmesg /proc/vmcore failed (1).


Best Regards,
Petr

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


misc details: Re: [PATCH 2/2] printk: use the lockless ringbuffer

2020-02-17 Thread Petr Mladek
On Tue 2020-01-28 17:25:48, John Ogness wrote:
> Replace the existing ringbuffer usage and implementation with
> lockless ringbuffer usage. Even though the new ringbuffer does not
> require locking, all existing locking is left in place. Therefore,
> this change is purely replacing the underlining ringbuffer.
> 
> - Record meta-data is now stored in a separate array of descriptors.
>   This is an additional 72 * (2 ^ ((CONFIG_LOG_BUF_SHIFT - 6))) bytes
>   for the static array and 72 * (2 ^ ((log_buf_len - 6))) bytes for
>   the dynamic array.

It might help to show some examples. I mean to mention the sizes
when CONFIG_LOG_BUF_SHIFT is 12 or so.


> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -294,30 +295,22 @@ enum con_msg_format_flags {
>  static int console_msg_format = MSG_FORMAT_DEFAULT;
>  
>  /*
> - * The printk log buffer consists of a chain of concatenated variable
> - * length records. Every record starts with a record header, containing
> - * the overall length of the record.
> + * The printk log buffer consists of a sequenced collection of records, each
> + * containing variable length message and dictionary text. Every record
> + * also contains its own meta-data (@info).
>   *
> - * The heads to the first and last entry in the buffer, as well as the
> - * sequence numbers of these entries are maintained when messages are
> - * stored.
> - *
> - * If the heads indicate available messages, the length in the header
> - * tells the start next message. A length == 0 for the next message
> - * indicates a wrap-around to the beginning of the buffer.
> - *
> - * Every record carries the monotonic timestamp in microseconds, as well as
> - * the standard userspace syslog level and syslog facility. The usual
> + * Every record meta-data carries the monotonic timestamp in microseconds, as

I am afraid that we could not guarantee monotonic timestamp because
the writers are not synchronized. I hope that it will not create
real problems and we could just remove the word "monotonic" ;-)


>  /* record buffer */
> -#define LOG_ALIGN __alignof__(struct printk_log)
> +#define LOG_ALIGN __alignof__(unsigned long)
>  #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
>  #define LOG_BUF_LEN_MAX (u32)(1 << 31)
>  static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
>  static char *log_buf = __log_buf;
>  static u32 log_buf_len = __LOG_BUF_LEN;
>  
> +/*
> + * Define the average message size. This only affects the number of
> + * descriptors that will be available. Underestimating is better than
> + * overestimating (too many available descriptors is better than not enough).
> + * The dictionary buffer will be the same size as the text buffer.
> + */
> +#define PRB_AVGBITS 6

Do I get it correctly that '6' means 2^6 = 64 characters?

Some ugly counting on my test systems shows the average 49 chars:

$> dmesg | cut -d ']' -f 2- | wc -c
30172
$> dmesg | cut -d ']' -f 2- | wc -l
612
$> echo $((30172 / 612))
49

If I get it correctly then lower number is the more safe side.
So, a more safe default should be 5?


> +
> +_DECLARE_PRINTKRB(printk_rb_static, CONFIG_LOG_BUF_SHIFT - PRB_AVGBITS,
> +   PRB_AVGBITS, PRB_AVGBITS, &__log_buf[0]);
> +


> @@ -606,60 +488,42 @@ static int log_store(u32 caller_id, int facility, int 
> level,
>const char *dict, u16 dict_len,
>const char *text, u16 text_len)
>  {
> - struct printk_log *msg;
> - u32 size, pad_len;
> + struct prb_reserved_entry e;
> + struct printk_record r;
>   u16 trunc_msg_len = 0;
>  
> - /* number of '\0' padding bytes to next message */
> - size = msg_used_size(text_len, dict_len, _len);
> + r.text_buf_size = text_len;
> + r.dict_buf_size = dict_len;
>  
> - if (log_make_free_space(size)) {
> + if (!prb_reserve(, prb, )) {
>   /* truncate the message if it is too long for empty buffer */
> - size = truncate_msg(_len, _msg_len,
> - _len, _len);
> + truncate_msg(_len, _msg_len, _len);
> + r.text_buf_size = text_len + trunc_msg_len;
> + r.dict_buf_size = dict_len;
>   /* survive when the log buffer is too small for trunc_msg */
> - if (log_make_free_space(size))
> + if (!prb_reserve(, prb, ))
>   return 0;
>   }
>  
> - if (log_next_idx + size + sizeof(struct printk_log) > log_buf_len) {
> - /*
> -  * This message + an additional empty header does not fit
> -  * at the end of the buffer. Add an empty header with len == 0
> -  * to signify a wrap around.
> -  */
> - memset(log_buf + log_next_idx, 0, sizeof(struct printk_log));
> - log_next_idx = 0;
> - }
> -
>   /* fill message */
> - msg = (struct printk_log *)(log_buf + log_next_idx);
> - memcpy(log_text(msg), text, text_len);
> - 

Re: [PATCH 2/2] printk: use the lockless ringbuffer

2020-02-14 Thread lijiang
在 2020年02月14日 21:50, John Ogness 写道:
> Hi Lianbo,
> 
> On 2020-02-14, lijiang  wrote:
>>> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
>>> index 1ef6f75d92f1..d0d24ee1d1f4 100644
>>> --- a/kernel/printk/printk.c
>>> +++ b/kernel/printk/printk.c
>>> @@ -1062,21 +928,16 @@ void log_buf_vmcoreinfo_setup(void)
>>>  {
>>> VMCOREINFO_SYMBOL(log_buf);
>>> VMCOREINFO_SYMBOL(log_buf_len);
>>
>> I notice that the "prb"(printk tb static) symbol is not exported into
>> vmcoreinfo as follows:
>>
>> +VMCOREINFO_SYMBOL(prb);
>>
>> Should the "prb"(printk tb static) symbol be exported into vmcoreinfo?
>> Otherwise, do you happen to know how to walk through the log_buf and
>> get all kernel logs from vmcore?
> 
> You are correct. This will need to be exported as well so that the
> descriptors can be accessed. (log_buf is only the pure human-readable

Really agree, and I guess that there may be more structures and their offsets
to be exported, for example: struct prb_desc_ring, struct prb_data_ring, and
struct prb_desc, etc.

This makes sure that tools(such as makedumpfile and crash) can appropriately
access them. 

> text.) I am currently hacking the crash tool to see exactly what needs
> to be made available in order to access all the data of the ringbuffer.
> 
It makes sense and avoids exporting unnecessary symbols and offsets.

Thanks.
Lianbo


> John Ogness
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 2/2] printk: use the lockless ringbuffer

2020-02-14 Thread John Ogness
Hi Lianbo,

On 2020-02-14, lijiang  wrote:
>> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
>> index 1ef6f75d92f1..d0d24ee1d1f4 100644
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -1062,21 +928,16 @@ void log_buf_vmcoreinfo_setup(void)
>>  {
>>  VMCOREINFO_SYMBOL(log_buf);
>>  VMCOREINFO_SYMBOL(log_buf_len);
>
> I notice that the "prb"(printk tb static) symbol is not exported into
> vmcoreinfo as follows:
>
> + VMCOREINFO_SYMBOL(prb);
>
> Should the "prb"(printk tb static) symbol be exported into vmcoreinfo?
> Otherwise, do you happen to know how to walk through the log_buf and
> get all kernel logs from vmcore?

You are correct. This will need to be exported as well so that the
descriptors can be accessed. (log_buf is only the pure human-readable
text.) I am currently hacking the crash tool to see exactly what needs
to be made available in order to access all the data of the ringbuffer.

John Ogness

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 2/2] printk: use the lockless ringbuffer

2020-02-14 Thread John Ogness
On 2020-02-14, Sergey Senozhatsky  wrote:
>>> cat /dev/kmsg
>>> cat: /dev/kmsg: Broken pipe
>>
>> In mainline you can have this "problem" as well. Once the ringbuffer
>> has wrapped, any read to a newly opened /dev/kmsg when a new message
>> arrived will result in an EPIPE. This happens quite easily once the
>> ringbuffer has wrapped because each new message is overwriting the
>> oldest message.
>
> Hmm. Something doesn't add up.
>
> Looking at the numbers, both r->info->seq and prb_first_seq(prb)
> do increase, so there are new messages in the ring buffer
>
>u->seqr->seqprb_first_seq
> [..]
> cat: devkmsg_read() error 1981080   1982633   1981080
> cat: devkmsg_read() error 1981080   1982633   1981080
> cat: devkmsg_read() error 1981095   1982652   1981095
> cat: devkmsg_read() error 1981095   1982652   1981095
> cat: devkmsg_read() error 1981095   1982652   1981095
> [..]
>
> but 'cat' still wouldn't read anything from the logbuf - EPIPE.
>
> NOTE: I don't run 'cat /dev/kmsg' during the test. I run the test
> first, then I run 'cat /dev/kmsg', after the test, when
> printk-pressure is gone.

Sure. The problem is not the printk-pressure. The problem is you have
data-less records in your ringbuffer (from your previous
printk-pressure). If you used your own program that continued to read
after EPIPE, then you would see the sequence numbers jumping over the
data-less records.

> I can't reproduce it with current logbuf. 'cat' reads from /dev/kmsg
> after heavy printk-pressure test. So chances are some loggers can also
> experience problems. This might be a regression.

Mainline doesn't have data-less records. In mainline such failed
printk's are silently ignored (after attepting truncation). So for
mainline you can only reproduce the overflow case.

1. Boot 5.6.0-rc1 (without any console= slowing down printk)

2. Fill the ringbuffer and let it overflow with:

   $ while true; do echo filling buffer > /dev/kmsg; done &

3. Once you can see the ringbuffer has overflowed (and continues to
   overflow), try to read from /dev/kmsg

   $ strace head /dev/kmsg

In most cases you will see:

read(3, 0x7f7307ac1000, 4096)   = -1 EPIPE (Broken pipe)

Current readers need to be able to handle EPIPE. cat(1) does not and so
(unfortunately) is not a good candidate for reading the ringbuffer.

>>> ...
>>> systemd-journal: devkmsg_read() error 1979281 1982465 1980933
>>> systemd-journal: corrected seq 1982465 1982465
>>> cat: devkmsg_read() error 1980987 1982531 1980987
>>> cat: corrected seq 1982531 1982531
>>> cat: devkmsg_read() error 1981015 1982563 1981015
>>> cat: corrected seq 1982563 1982563
>>
>> The situation with a data-less record is the same as when the ringbuffer
>> wraps: cat is hitting that EPIPE. But re-opening the file descriptor is
>> not going to help because it will not be able to get past that data-less
>> record.
>
> So maybe this is the case with broken 'cat' on my system?

I think it is appropriate for an application to close the descriptor
after an EPIPE. /dev/kmsg is special because the reader should continue
reading anyway.

>> We could implement it such that devkmsg_read() will skip over data-less
>> records instead of issuing an EPIPE. (That is what dmesg does.) But then
>> do we need EPIPE at all? The reader can see that is has missed records
>> by tracking the sequence number, so could we just get rid of EPIPE? Then
>> cat(1) would be a great tool to view the raw ringbuffer. Please share
>> your thoughts on this.
>
> Looking at systemd/src/journal/journald-kmsg.c :
> server_read_dev_kmsg() -EPIPE is just one of the erronos they handle,
> nothing special.

Yes, but what does systemd-journald do when the EPIPE is _not_ returned
and instead there is a jump in the sequence number? Looking at
dev_kmsg_record(), systemd actually does it the way I would hope. It
tracks the sequence number correctly.

/* Did we lose any? */
if (serial > *s->kernel_seqnum)
 server_driver_message(s, 0,
   "MESSAGE_ID="
   SD_MESSAGE_JOURNAL_MISSED_STR,
   LOG_MESSAGE("Missed %"PRIu64" kernel messages",
   serial - *s->kernel_seqnum),
   NULL);

> Could it be the case that some other loggers would have special
> handling for EPIPE?  I'm not sure, let's look around.
>
> I'd say that EPIPE removal looks OK to me. But before we do that, I'm
> not sure that we have clear understanding of 'cat /dev/kmsg' behaviour
> change.

In mainline, with regard to /dev/kmsg, sequence numbers will never
jump. If there would be a jump (due to lost messages) then EPIPE is
issued. The reader can either:

1. continue reading and see the jump

2. reopen the file descriptor, possibly having missed a ton more
   messages due to reopening, and then start from the oldest available
   message

With my series, #2 is no longer an option because the lost 

Re: [PATCH 2/2] printk: use the lockless ringbuffer

2020-02-13 Thread Sergey Senozhatsky
On (20/02/14 10:41), Sergey Senozhatsky wrote:
> On (20/02/13 23:36), John Ogness wrote:
[..]
> > We could implement it such that devkmsg_read() will skip over data-less
> > records instead of issuing an EPIPE. (That is what dmesg does.) But then
> > do we need EPIPE at all? The reader can see that is has missed records
> > by tracking the sequence number, so could we just get rid of EPIPE? Then
> > cat(1) would be a great tool to view the raw ringbuffer. Please share
> > your thoughts on this.
> 
> Looking at systemd/src/journal/journald-kmsg.c : server_read_dev_kmsg()
> -EPIPE is just one of the erronos they handle, nothing special. Could it
> be the case that some other loggers would have special handling for EPIPE?
> I'm not sure, let's look around.

rsyslog

static void
readkmsg(void)
{
int i;
uchar pRcv[8192+1];
char errmsg[2048];

for (;;) {
dbgprintf("imkmsg waiting for kernel log line\n");

/* every read() from the opened device node receives one record 
of the printk buffer */
i = read(fklog, pRcv, 8192);

if (i > 0) {
/* successful read of message of nonzero length */
pRcv[i] = '\0';
} else if (i == -EPIPE) {
imkmsgLogIntMsg(LOG_WARNING,
"imkmsg: some messages in circular 
buffer got overwritten");
continue;
} else {
/* something went wrong - error or zero length message 
*/
if (i < 0 && errno != EINTR && errno != EAGAIN) {
/* error occured */
imkmsgLogIntMsg(LOG_ERR,
   "imkmsg: error reading kernel log - 
shutting down: %s",
rs_strerror_r(errno, errmsg, 
sizeof(errmsg)));
fklog = -1;
}
break;
}

submitSyslog(pRcv);
}
}


So EPIPE errno better stay around.

-ss

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 2/2] printk: use the lockless ringbuffer

2020-02-13 Thread Sergey Senozhatsky
On (20/02/13 23:36), John Ogness wrote:
> >> Here prb_read_valid() was successful, so a record _was_ read. The
> >> kerneldoc for the prb_read_valid() says:
> >
> > Hmm, yeah. That's true.
> >
> > OK, something weird...
> >
> > I ran some random printk-pressure test (mostly printks from IRQs;
> > + some NMI printk-s, but they are routed through nmi printk-safe
> > buffers; + some limited number of printk-safe printk-s, routed
> > via printk-safe buffer (so, once again, IRQ); + user-space
> > journalctl -f syslog reader), and after the test 'cat /dev/kmsg'
> > is terminally broken
> >
> > [..]
> > cat /dev/kmsg
> > cat: /dev/kmsg: Broken pipe
>
> In mainline you can have this "problem" as well. Once the ringbuffer has
> wrapped, any read to a newly opened /dev/kmsg when a new message arrived
> will result in an EPIPE. This happens quite easily once the ringbuffer
> has wrapped because each new message is overwriting the oldest message.

Hmm. Something doesn't add up.

Looking at the numbers, both r->info->seq and prb_first_seq(prb)
do increase, so there are new messages in the ring buffer

   u->seqr->seqprb_first_seq
[..]
cat: devkmsg_read() error 1981080   1982633   1981080
cat: devkmsg_read() error 1981080   1982633   1981080
cat: devkmsg_read() error 1981095   1982652   1981095
cat: devkmsg_read() error 1981095   1982652   1981095
cat: devkmsg_read() error 1981095   1982652   1981095
[..]

but 'cat' still wouldn't read anything from the logbuf - EPIPE.

NOTE: I don't run 'cat /dev/kmsg' during the test. I run the test first,
then I run 'cat /dev/kmsg', after the test, when printk-pressure is gone.

I can't reproduce it with current logbuf. 'cat' reads from /dev/kmsg after
heavy printk-pressure test. So chances are some loggers can also experience
problems. This might be a regression.

> > ...
> > systemd-journal: devkmsg_read() error 1979281 1982465 1980933
> > systemd-journal: corrected seq 1982465 1982465
> > cat: devkmsg_read() error 1980987 1982531 1980987
> > cat: corrected seq 1982531 1982531
> > cat: devkmsg_read() error 1981015 1982563 1981015
> > cat: corrected seq 1982563 1982563
>
> The situation with a data-less record is the same as when the ringbuffer
> wraps: cat is hitting that EPIPE. But re-opening the file descriptor is
> not going to help because it will not be able to get past that data-less
> record.

So maybe this is the case with broken 'cat' on my system?

> We could implement it such that devkmsg_read() will skip over data-less
> records instead of issuing an EPIPE. (That is what dmesg does.) But then
> do we need EPIPE at all? The reader can see that is has missed records
> by tracking the sequence number, so could we just get rid of EPIPE? Then
> cat(1) would be a great tool to view the raw ringbuffer. Please share
> your thoughts on this.

Looking at systemd/src/journal/journald-kmsg.c : server_read_dev_kmsg()
-EPIPE is just one of the erronos they handle, nothing special. Could it
be the case that some other loggers would have special handling for EPIPE?
I'm not sure, let's look around.

I'd say that EPIPE removal looks OK to me. But before we do that, I'm
not sure that we have clear understanding of 'cat /dev/kmsg' behaviour
change.

> On a side note (but related to data-less records): I hacked the
> ringbuffer code to inject data-less records at various times in order to
> verify your report. And I stumbled upon a bug in the ringbuffer, which
> can lead to an infinite loop in console_unlock(). The problem occurs at:
> 
> retry = prb_read_valid(prb, console_seq, NULL);
> 
> which will erroneously return true if console_seq is pointing to a
> data-less record but there are no valid records after it. The following
> patch fixes the bug. And yes, for v2 I have added comments to the
> desc_read_committed() code.

That's great to know!

-ss

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 2/2] printk: use the lockless ringbuffer

2020-02-13 Thread John Ogness
On 2020-02-13, Sergey Senozhatsky  wrote:
 @@ -890,30 +758,26 @@ static ssize_t devkmsg_read(struct file *file, char 
 __user *buf,
  
logbuf_unlock_irq();
ret = wait_event_interruptible(log_wait,
 - user->seq != log_next_seq);
 +  prb_read_valid(prb, user->seq, r));
if (ret)
goto out;
logbuf_lock_irq();
}
  
 -  if (user->seq < log_first_seq) {
 -  /* our last seen message is gone, return error and reset */
 -  user->idx = log_first_idx;
 -  user->seq = log_first_seq;
 +  if (user->seq < r->info->seq) {
 +  /* the expected message is gone, return error and reset */
 +  user->seq = r->info->seq;
ret = -EPIPE;
logbuf_unlock_irq();
goto out;
}
>>>
>>> Sorry, why doesn't this do something like
>>>
>>> if (user->seq < prb_first_seq(prb)) {
>>> /* the expected message is gone, return error and reset */
>>> user->seq = prb_first_seq(prb);
>>> ret = -EPIPE;
>>> ...
>>> }
>> 
>> Here prb_read_valid() was successful, so a record _was_ read. The
>> kerneldoc for the prb_read_valid() says:
>
> Hmm, yeah. That's true.
>
> OK, something weird...
>
> I ran some random printk-pressure test (mostly printks from IRQs;
> + some NMI printk-s, but they are routed through nmi printk-safe
> buffers; + some limited number of printk-safe printk-s, routed
> via printk-safe buffer (so, once again, IRQ); + user-space
> journalctl -f syslog reader), and after the test 'cat /dev/kmsg'
> is terminally broken
>
> [..]
> cat /dev/kmsg
> cat: /dev/kmsg: Broken pipe

In mainline you can have this "problem" as well. Once the ringbuffer has
wrapped, any read to a newly opened /dev/kmsg when a new message arrived
will result in an EPIPE. This happens quite easily once the ringbuffer
has wrapped because each new message is overwriting the oldest message.

Although it can be convenient, cat(1) is actually a poor tool for
viewing the ringbuffer for this reason. Unfortunately dmesg(1) is
sub-optimal as well because it does not show the sequence numbers. So
with dmesg(1) you cannot see if a message was dropped. :-/

> So I printed seq numbers from devksmg read to a seq buffer and dumped
> it via procfs, just seq numbers before we adjust user->seq (set to
> r->seq) and after
>
> +   offt += snprintf(BUF + offt,
> +   sizeof(BUF) - offt,
> +   "%s: devkmsg_read() error %llu %llu 
> %llu\n",
> +   current->comm,
> +   user->seq,
> +   r->info->seq,
> +   prb_first_seq(prb));
>
>
> ...
> systemd-journal: devkmsg_read() error 1979281 1982465 1980933
> systemd-journal: corrected seq 1982465 1982465
> cat: devkmsg_read() error 1980987 1982531 1980987
> cat: corrected seq 1982531 1982531
> cat: devkmsg_read() error 1981015 1982563 1981015
> cat: corrected seq 1982563 1982563

The situation with a data-less record is the same as when the ringbuffer
wraps: cat is hitting that EPIPE. But re-opening the file descriptor is
not going to help because it will not be able to get past that data-less
record.

We could implement it such that devkmsg_read() will skip over data-less
records instead of issuing an EPIPE. (That is what dmesg does.) But then
do we need EPIPE at all? The reader can see that is has missed records
by tracking the sequence number, so could we just get rid of EPIPE? Then
cat(1) would be a great tool to view the raw ringbuffer. Please share
your thoughts on this.


On a side note (but related to data-less records): I hacked the
ringbuffer code to inject data-less records at various times in order to
verify your report. And I stumbled upon a bug in the ringbuffer, which
can lead to an infinite loop in console_unlock(). The problem occurs at:

retry = prb_read_valid(prb, console_seq, NULL);

which will erroneously return true if console_seq is pointing to a
data-less record but there are no valid records after it. The following
patch fixes the bug. And yes, for v2 I have added comments to the
desc_read_committed() code.

I now have 2 bugfixes queued up for v2. The first one is here[0].

[0] https://lkml.kernel.org/r/87wo919grz@linutronix.de

John Ogness


diff --git a/kernel/printk/printk_ringbuffer.c 
b/kernel/printk/printk_ringbuffer.c
index 796257f226ee..31893051ad6b 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -1074,6 +1071,7 @@ static int desc_read_committed(struct prb_desc_ring 
*desc_ring,
   unsigned long id, u64 seq,
   struct 

Re: [PATCH 2/2] printk: use the lockless ringbuffer

2020-02-13 Thread Sergey Senozhatsky
On (20/02/13 10:42), John Ogness wrote:
> On 2020-02-13, Sergey Senozhatsky  wrote:
> >> -  while (user->seq == log_next_seq) {
> >> +  if (!prb_read_valid(prb, user->seq, r)) {
> >>if (file->f_flags & O_NONBLOCK) {
> >>ret = -EAGAIN;
> >>logbuf_unlock_irq();
> >> @@ -890,30 +758,26 @@ static ssize_t devkmsg_read(struct file *file, char 
> >> __user *buf,
> >>  
> >>logbuf_unlock_irq();
> >>ret = wait_event_interruptible(log_wait,
> >> - user->seq != log_next_seq);
> >> +  prb_read_valid(prb, user->seq, r));
> >>if (ret)
> >>goto out;
> >>logbuf_lock_irq();
> >>}
> >>  
> >> -  if (user->seq < log_first_seq) {
> >> -  /* our last seen message is gone, return error and reset */
> >> -  user->idx = log_first_idx;
> >> -  user->seq = log_first_seq;
> >> +  if (user->seq < r->info->seq) {
> >> +  /* the expected message is gone, return error and reset */
> >> +  user->seq = r->info->seq;
> >>ret = -EPIPE;
> >>logbuf_unlock_irq();
> >>goto out;
> >>}
> >
> > Sorry, why doesn't this do something like
> >
> > if (user->seq < prb_first_seq(prb)) {
> > /* the expected message is gone, return error and reset */
> > user->seq = prb_first_seq(prb);
> > ret = -EPIPE;
> > ...
> > }
> 
> Here prb_read_valid() was successful, so a record _was_ read. The
> kerneldoc for the prb_read_valid() says:

Hmm, yeah. That's true.

OK, something weird...

I ran some random printk-pressure test (mostly printks from IRQs;
+ some NMI printk-s, but they are routed through nmi printk-safe
buffers; + some limited number of printk-safe printk-s, routed
via printk-safe buffer (so, once again, IRQ); + user-space
journalctl -f syslog reader), and after the test 'cat /dev/kmsg'
is terminally broken

[..]
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
[..]

dmesg works. Reading from /dev/kmsg - doesn't; it did work, however,
before the test.

So I printed seq numbers from devksmg read to a seq buffer and dumped
it via procfs, just seq numbers before we adjust user->seq (set to
r->seq) and after

+   offt += snprintf(BUF + offt,
+   sizeof(BUF) - offt,
+   "%s: devkmsg_read() error %llu %llu 
%llu\n",
+   current->comm,
+   user->seq,
+   r->info->seq,
+   prb_first_seq(prb));


...
systemd-journal: devkmsg_read() error 1979235 1979236 1979236
systemd-journal: corrected seq 1979236 1979236
systemd-journal: devkmsg_read() error 1979237 1979243 1979243
systemd-journal: corrected seq 1979243 1979243
systemd-journal: devkmsg_read() error 1979244 1979250 1979250
systemd-journal: corrected seq 1979250 1979250
systemd-journal: devkmsg_read() error 1979251 1979257 1979257
systemd-journal: corrected seq 1979257 1979257
systemd-journal: devkmsg_read() error 1979258 1979265 1979265
systemd-journal: corrected seq 1979265 1979265
systemd-journal: devkmsg_read() error 1979266 1979272 1979272
systemd-journal: corrected seq 1979272 1979272
systemd-journal: devkmsg_read() error 1979272 1979273 1979273
systemd-journal: corrected seq 1979273 1979273
systemd-journal: 

Re: [PATCH 2/2] printk: use the lockless ringbuffer

2020-02-13 Thread John Ogness
On 2020-02-13, Sergey Senozhatsky  wrote:
>> -while (user->seq == log_next_seq) {
>> +if (!prb_read_valid(prb, user->seq, r)) {
>>  if (file->f_flags & O_NONBLOCK) {
>>  ret = -EAGAIN;
>>  logbuf_unlock_irq();
>> @@ -890,30 +758,26 @@ static ssize_t devkmsg_read(struct file *file, char 
>> __user *buf,
>>  
>>  logbuf_unlock_irq();
>>  ret = wait_event_interruptible(log_wait,
>> -   user->seq != log_next_seq);
>> +prb_read_valid(prb, user->seq, r));
>>  if (ret)
>>  goto out;
>>  logbuf_lock_irq();
>>  }
>>  
>> -if (user->seq < log_first_seq) {
>> -/* our last seen message is gone, return error and reset */
>> -user->idx = log_first_idx;
>> -user->seq = log_first_seq;
>> +if (user->seq < r->info->seq) {
>> +/* the expected message is gone, return error and reset */
>> +user->seq = r->info->seq;
>>  ret = -EPIPE;
>>  logbuf_unlock_irq();
>>  goto out;
>>  }
>
> Sorry, why doesn't this do something like
>
>   if (user->seq < prb_first_seq(prb)) {
>   /* the expected message is gone, return error and reset */
>   user->seq = prb_first_seq(prb);
>   ret = -EPIPE;
>   ...
>   }

Here prb_read_valid() was successful, so a record _was_ read. The
kerneldoc for the prb_read_valid() says:

 * On success, the reader must check r->info.seq to see which record was
 * actually read.

The value will either be the requested user->seq or some higher value
because user->seq is not available.

There are 2 reasons why user->seq is not available (and a later record
_is_ available):

1. The ringbuffer overtook user->seq. In this case, comparing and then
   setting using prb_first_seq() could be appropriate. And r->info->seq
   might even already be what prb_first_seq() would return. (More on
   this below.)

2. The record with user->seq has no data because the writer failed to
   allocate dataring space. In this case, resetting back to
   prb_first_seq() would be incorrect. And since r->info->seq is the
   next valid record, it is appropriate that the next devkmsg_read()
   starts there.

Rather than checking these cases separately, it is enough just to check
for the 2nd case. For the 1st case, prb_first_seq() could be less than
r->info->seq if all the preceeding records have no data. But this just
means the whole set of records with missing data are skipped, which
matches existing behavior. (For example, currently when devkmsg is
behind 10 messages, there are not 10 -EPIPE returns. Instead it
immediately catches up to the next available record.)

Perhaps the new comment should be:

/*
 * The expected message is gone, return error and
 * reset to the next available message.
 */

John Ogness

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 2/2] printk: use the lockless ringbuffer

2020-02-13 Thread Sergey Senozhatsky
On (20/01/28 17:25), John Ogness wrote:
[..]
> - while (user->seq == log_next_seq) {
> + if (!prb_read_valid(prb, user->seq, r)) {
>   if (file->f_flags & O_NONBLOCK) {
>   ret = -EAGAIN;
>   logbuf_unlock_irq();
> @@ -890,30 +758,26 @@ static ssize_t devkmsg_read(struct file *file, char 
> __user *buf,
>  
>   logbuf_unlock_irq();
>   ret = wait_event_interruptible(log_wait,
> -user->seq != log_next_seq);
> + prb_read_valid(prb, user->seq, r));
>   if (ret)
>   goto out;
>   logbuf_lock_irq();
>   }
>  
> - if (user->seq < log_first_seq) {
> - /* our last seen message is gone, return error and reset */
> - user->idx = log_first_idx;
> - user->seq = log_first_seq;
> + if (user->seq < r->info->seq) {
> + /* the expected message is gone, return error and reset */
> + user->seq = r->info->seq;
>   ret = -EPIPE;
>   logbuf_unlock_irq();
>   goto out;
>   }

Sorry, why doesn't this do something like

if (user->seq < prb_first_seq(prb)) {
/* the expected message is gone, return error and reset */
user->seq = prb_first_seq(prb);
ret = -EPIPE;
...
}

?

-ss

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH 2/2] printk: use the lockless ringbuffer

2020-01-30 Thread John Ogness
Replace the existing ringbuffer usage and implementation with
lockless ringbuffer usage. Even though the new ringbuffer does not
require locking, all existing locking is left in place. Therefore,
this change is purely replacing the underlining ringbuffer.

Changes that exist due to the ringbuffer replacement:

- The VMCOREINFO has been updated for the new structures.

- Dictionary data is now stored in a separate data buffer from the
  human-readable messages. The dictionary data buffer is set to the
  same size as the message buffer. Therefore, the total reserved
  memory for messages is 2 * (2 ^ CONFIG_LOG_BUF_SHIFT) for the
  initial static buffer and 2x the specified size in the log_buf_len
  kernel parameter.

- Record meta-data is now stored in a separate array of descriptors.
  This is an additional 72 * (2 ^ ((CONFIG_LOG_BUF_SHIFT - 6))) bytes
  for the static array and 72 * (2 ^ ((log_buf_len - 6))) bytes for
  the dynamic array.

Signed-off-by: John Ogness 
---
 include/linux/kmsg_dump.h |   2 -
 kernel/printk/Makefile|   1 +
 kernel/printk/printk.c| 836 +++---
 3 files changed, 416 insertions(+), 423 deletions(-)

diff --git a/include/linux/kmsg_dump.h b/include/linux/kmsg_dump.h
index 2e7a1e032c71..ae6265033e31 100644
--- a/include/linux/kmsg_dump.h
+++ b/include/linux/kmsg_dump.h
@@ -46,8 +46,6 @@ struct kmsg_dumper {
bool registered;
 
/* private state of the kmsg iterator */
-   u32 cur_idx;
-   u32 next_idx;
u64 cur_seq;
u64 next_seq;
 };
diff --git a/kernel/printk/Makefile b/kernel/printk/Makefile
index 4d052fc6bcde..eee3dc9b60a9 100644
--- a/kernel/printk/Makefile
+++ b/kernel/printk/Makefile
@@ -2,3 +2,4 @@
 obj-y  = printk.o
 obj-$(CONFIG_PRINTK)   += printk_safe.o
 obj-$(CONFIG_A11Y_BRAILLE_CONSOLE) += braille.o
+obj-$(CONFIG_PRINTK)   += printk_ringbuffer.o
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 1ef6f75d92f1..d0d24ee1d1f4 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -56,6 +56,7 @@
 #define CREATE_TRACE_POINTS
 #include 
 
+#include "printk_ringbuffer.h"
 #include "console_cmdline.h"
 #include "braille.h"
 #include "internal.h"
@@ -294,30 +295,22 @@ enum con_msg_format_flags {
 static int console_msg_format = MSG_FORMAT_DEFAULT;
 
 /*
- * The printk log buffer consists of a chain of concatenated variable
- * length records. Every record starts with a record header, containing
- * the overall length of the record.
+ * The printk log buffer consists of a sequenced collection of records, each
+ * containing variable length message and dictionary text. Every record
+ * also contains its own meta-data (@info).
  *
- * The heads to the first and last entry in the buffer, as well as the
- * sequence numbers of these entries are maintained when messages are
- * stored.
- *
- * If the heads indicate available messages, the length in the header
- * tells the start next message. A length == 0 for the next message
- * indicates a wrap-around to the beginning of the buffer.
- *
- * Every record carries the monotonic timestamp in microseconds, as well as
- * the standard userspace syslog level and syslog facility. The usual
+ * Every record meta-data carries the monotonic timestamp in microseconds, as
+ * well as the standard userspace syslog level and syslog facility. The usual
  * kernel messages use LOG_KERN; userspace-injected messages always carry
  * a matching syslog facility, by default LOG_USER. The origin of every
  * message can be reliably determined that way.
  *
- * The human readable log message directly follows the message header. The
- * length of the message text is stored in the header, the stored message
- * is not terminated.
+ * The human readable log message of a record is available in @text, the length
+ * of the message text in @text_len. The stored message is not terminated.
  *
- * Optionally, a message can carry a dictionary of properties (key/value 
pairs),
- * to provide userspace with a machine-readable message context.
+ * Optionally, a record can carry a dictionary of properties (key/value pairs),
+ * to provide userspace with a machine-readable message context. The length of
+ * the dictionary is available in @dict_len. The dictionary is not terminated.
  *
  * Examples for well-defined, commonly used property names are:
  *   DEVICE=b12:8   device identifier
@@ -331,21 +324,19 @@ static int console_msg_format = MSG_FORMAT_DEFAULT;
  * follows directly after a '=' character. Every property is terminated by
  * a '\0' character. The last property is not terminated.
  *
- * Example of a message structure:
- *     ff 8f 00 00 00 00 00 00  monotonic time in nsec
- *   0008  34 00record is 52 bytes long
- *   000a0b 00  text is 11 bytes long
- *   000c  1f 00dictionary is 23 bytes long
- *   000e03 00  LOG_KERN (facility)