Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2020-01-04 Thread Andrea Parri
On Fri, Jan 03, 2020 at 11:24:20AM +0100, Petr Mladek wrote:
> On Mon 2019-12-23 17:01:00, John Ogness wrote:
> > Hi Andrea,
> > 
> > On 2019-12-21, Andrea Parri  wrote:
> > >> +*desc_out = READ_ONCE(*desc);
> > >> +
> > >> +/* Load data before re-checking state. */
> > >> +smp_rmb(); /* matches LMM_REF(desc_reserve:A) */
> > >
> > > I looked for a matching WRITE_ONCE() or some other type of marked write,
> > > but I could not find it.  What is the rationale?  Or what did I miss?
> 
> Good question. READ_ONCE() looks superfluous here because it is
> surrounded by two read barriers. In any case, there is no
> corresponding WRITE_ONCE().
> 
> Note that we are copying the entire struct prb_desc here. All values
> are written only when state_val is in desc_reserved state. It happens
> between two full write barriers:
> 
>   + A writer is allowed to modify the descriptor after successful
> cmpxchg in desc_reserve(), see LMM_TAG(desc_reserve:A).
> 
>   + The writer must not touch the descriptor after changing
> state_var to committed state, see
> LMM_TAG(prb_commit:A) in prb_commit().
> 
> These barriers are mentioned in the comments for the two
> read barriers here.

Thanks for these remarks.  As usual, I'd recommend trying to map those
comments into litmus tests and checking them with the LKMM simulator.
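
For the smp_rmb() in question, such a litmus test could look roughly
like the following message-passing sketch, where the cmpxchg in
desc_reserve() is modeled as a marked write followed by the full
barrier it implies:

C printk-rb-desc-read

(*
 * P0 models a writer reserving a descriptor and then filling it in.
 * P1 models a reader copying the descriptor and re-checking its state.
 *)

{}

P0(int *state, int *data)
{
	WRITE_ONCE(*state, 1);	/* cmpxchg in desc_reserve() */
	smp_mb();		/* LMM_TAG(desc_reserve:A) */
	WRITE_ONCE(*data, 1);	/* writer modifies the descriptor */
}

P1(int *state, int *data)
{
	int r0;
	int r1;

	r0 = READ_ONCE(*data);	/* copy the descriptor */
	smp_rmb();		/* "Load data before re-checking state." */
	r1 = READ_ONCE(*state);	/* re-check the state */
}

exists (1:r0=1 /\ 1:r1=0)

The LKMM forbids the "exists" clause: if the reader observes the
writer's modification, the re-checked state cannot be stale.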


> BTW: Documentation/memory-barriers.txt describes various aspects of
> the memory barriers. It describes implicit barriers provided
> by spin locks, mutexes, semaphores, and various scheduler-related
> operations.
> 
> But I can't find any explanation of the various variants of the atomic
> operations: acquire, release, fetch, return, try, relaxed. I can find
> some clues here and there but it is hard to get the picture.

Documentation/atomic_t.txt could serve this purpose.  Please have a look
there and let me know if you have any comments.
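
For quick reference, the ordering rules documented there can be
summarized roughly as follows (paraphrased; see the file itself for
the authoritative wording):

/*
 * Rough summary of Documentation/atomic_t.txt:
 *
 * - RMW ops without a return value (atomic_add(), atomic_inc(), ...)
 *   are fully unordered.
 * - RMW ops with a return value (atomic_add_return(),
 *   atomic_fetch_add(), atomic_cmpxchg(), ...) are fully ordered.
 * - The _relaxed variants provide no ordering, _acquire orders only
 *   the load part, and _release orders only the store part.
 * - Conditional ops such as atomic_try_cmpxchg() provide their
 *   ordering only on success; a failed cmpxchg implies no ordering.
 * - atomic_fetch_*() returns the old value, atomic_*_return() the
 *   new value.
 */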

Thanks,
  Andrea



Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2020-01-03 Thread Petr Mladek
On Mon 2019-12-23 17:01:00, John Ogness wrote:
> Hi Andrea,
> 
> On 2019-12-21, Andrea Parri  wrote:
> >> +  *desc_out = READ_ONCE(*desc);
> >> +
> >> +  /* Load data before re-checking state. */
> >> +  smp_rmb(); /* matches LMM_REF(desc_reserve:A) */
> >
> > I looked for a matching WRITE_ONCE() or some other type of marked write,
> > but I could not find it.  What is the rationale?  Or what did I miss?

Good question. READ_ONCE() looks superfluous here because it is
surrounded by two read barriers. In any case, there is no
corresponding WRITE_ONCE().

Note that we are copying the entire struct prb_desc here. All values
are written only when state_val is in desc_reserved state. It happens
between two full write barriers:

  + A writer is allowed to modify the descriptor after successful
cmpxchg in desc_reserve(), see LMM_TAG(desc_reserve:A).

  + The writer must not touch the descriptor after changing
state_var to committed state, see
LMM_TAG(prb_commit:A) in prb_commit().

These barriers are mentioned in the comments for the two
read barriers here.

> >> +  do {
> >> +  next_lpos = get_next_lpos(data_ring, begin_lpos, size);
> >> +
> >> +  if (!data_push_tail(rb, data_ring,
> >> +  next_lpos - DATA_SIZE(data_ring))) {
> >> +  /* Failed to allocate, specify a data-less block. */
> >> +  blk_lpos->begin = INVALID_LPOS;
> >> +  blk_lpos->next = INVALID_LPOS;
> >> +  return NULL;
> >> +  }
> >> +  } while (!atomic_long_try_cmpxchg(&data_ring->head_lpos, &begin_lpos,
> >> +                                     next_lpos));
> >> +
> >> +  /*
> >> +   * No barrier is needed here. The data validity is defined by
> >> +   * the state of the associated descriptor. They are marked as
> >> +   * invalid at the moment. And only the winner of the above
> >> +   * cmpxchg() could write here.
> >> +   */
> >
> > The (successful) CMPXCHG provides a full barrier.  This comment suggests
> > that that could be somehow relaxed?  Or the comment could be improved?
> 
> You are correct. There is no need for the full barrier here. This code
> is based on Petr's POC. I focussed on making sure needed barriers are in
> place, but did not try to eliminate excessive barriers.

I hope that I'll get a better understanding of the guarantees of the
different atomic operations one day. There are so many variants now.

BTW: Documentation/memory-barriers.txt describes various aspects of
the memory barriers. It describes implicit barriers provided
by spin locks, mutexes, semaphores, and various scheduler-related
operations.

But I can't find any explanation of the various variants of the atomic
operations: acquire, release, fetch, return, try, relaxed. I can find
some clues here and there but it is hard to get the picture.

Best Regards,
Petr



Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-23 Thread John Ogness
Hi Andrea,

On 2019-12-21, Andrea Parri  wrote:
>> +*desc_out = READ_ONCE(*desc);
>> +
>> +/* Load data before re-checking state. */
>> +smp_rmb(); /* matches LMM_REF(desc_reserve:A) */
>
> I looked for a matching WRITE_ONCE() or some other type of marked write,
> but I could not find it.  What is the rationale?  Or what did I miss?

This smp_rmb() matches LMM_TAG(desc_reserve:A).

>> +do {
>> +next_lpos = get_next_lpos(data_ring, begin_lpos, size);
>> +
>> +if (!data_push_tail(rb, data_ring,
>> +next_lpos - DATA_SIZE(data_ring))) {
>> +/* Failed to allocate, specify a data-less block. */
>> +blk_lpos->begin = INVALID_LPOS;
>> +blk_lpos->next = INVALID_LPOS;
>> +return NULL;
>> +}
>> +} while (!atomic_long_try_cmpxchg(&data_ring->head_lpos, &begin_lpos,
>> +  next_lpos));
>> +
>> +/*
>> + * No barrier is needed here. The data validity is defined by
>> + * the state of the associated descriptor. They are marked as
>> + * invalid at the moment. And only the winner of the above
>> + * cmpxchg() could write here.
>> + */
>
> The (successful) CMPXCHG provides a full barrier.  This comment suggests
> that that could be somehow relaxed?  Or the comment could be improved?

You are correct. There is no need for the full barrier here. This code
is based on Petr's POC. I focussed on making sure needed barriers are in
place, but did not try to eliminate excessive barriers.

> (The patch introduces a number of CMPXCHG: similar questions would
> apply to those other instances...)

LMM_TAG(data_push_tail:A) is the only CMPXCHG that requires its full
barrier. All others can be relaxed. I will make this change for the next
series.
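
For instance, for the head_lpos update quoted above, the change would
presumably look like this (a sketch, not the final patch):

	/* Before: try_cmpxchg() implies a full barrier on success. */
	} while (!atomic_long_try_cmpxchg(&data_ring->head_lpos,
					  &begin_lpos, next_lpos));

	/*
	 * After: data validity is conveyed by the state of the
	 * associated descriptor, not by this update, so the relaxed
	 * variant (no implied ordering) is sufficient.
	 */
	} while (!atomic_long_try_cmpxchg_relaxed(&data_ring->head_lpos,
						  &begin_lpos, next_lpos));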

Thanks for taking time for this.

John Ogness



Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-21 Thread Andrea Parri
Hi John,

Sorry for the delay.

I don't have an overall understanding of the patch(-set) yet, so I will
limit myself to a couple of general questions about the memory barriers
introduced by the patch.  Please see the inline comments.


> + *desc_out = READ_ONCE(*desc);
> +
> + /* Load data before re-checking state. */
> + smp_rmb(); /* matches LMM_REF(desc_reserve:A) */

I looked for a matching WRITE_ONCE() or some other type of marked write,
but I could not find it.  What is the rationale?  Or what did I miss?


> + do {
> + next_lpos = get_next_lpos(data_ring, begin_lpos, size);
> +
> + if (!data_push_tail(rb, data_ring,
> + next_lpos - DATA_SIZE(data_ring))) {
> + /* Failed to allocate, specify a data-less block. */
> + blk_lpos->begin = INVALID_LPOS;
> + blk_lpos->next = INVALID_LPOS;
> + return NULL;
> + }
> + } while (!atomic_long_try_cmpxchg(&data_ring->head_lpos, &begin_lpos,
> +   next_lpos));
> +
> + /*
> +  * No barrier is needed here. The data validity is defined by
> +  * the state of the associated descriptor. They are marked as
> +  * invalid at the moment. And only the winner of the above
> +  * cmpxchg() could write here.
> +  */

The (successful) CMPXCHG provides a full barrier.  This comment suggests
that that could be somehow relaxed?  Or the comment could be improved?

(The patch introduces a number of CMPXCHG: similar questions would apply
to those other instances...)

Thanks,
  Andrea

P. S.  Please use my @gmail.com address for future communications.



Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-09 Thread John Ogness
On 2019-12-09, Sergey Senozhatsky  wrote:
>> + * Sample reader code::
>> + *
>> + *  struct printk_info info;
>> + *  char text_buf[32];
>> + *  char dict_buf[32];
>> + *  u64 next_seq = 0;
>> + *  struct printk_record r = {
>> + *  .info   = &info,
>> + *  .text_buf   = &text_buf[0],
>> + *  .dict_buf   = &dict_buf[0],
>> + *  .text_buf_size  = sizeof(text_buf),
>> + *  .dict_buf_size  = sizeof(dict_buf),
>> + *  };
>> + *
>> + *  while (prb_read_valid(&rb, next_seq, &r)) {
>> + *  if (info.seq != next_seq)
>> + *  pr_warn("lost %llu records\n", info.seq - next_seq);
>> + *
>> + *  if (info.text_len > r.text_buf_size) {
>> + *  pr_warn("record %llu text truncated\n", info.seq);
>> + *  text_buf[sizeof(text_buf) - 1] = 0;
>> + *  }
>> + *
>> + *  if (info.dict_len > r.dict_buf_size) {
>> + *  pr_warn("record %llu dict truncated\n", info.seq);
>> + *  dict_buf[sizeof(dict_buf) - 1] = 0;
>> + *  }
>> + *
>> + *  pr_info("%llu: %llu: %s;%s\n", info.seq, info.ts_nsec,
>> + *  &text_buf[0], info.dict_len ? &dict_buf[0] : "");
>> + *
>> + *  next_seq = info.seq + 1;
>> + *  }
>> + */
>
> Will this loop ever end? :)
>
> pr_info() adds data to ringbuffer, which prb_read_valid() reads, so
> pr_info() can add more data, which prb_read_valid() will read, so
> pr_info()...

The sample code is assuming that @rb is not the same ringbuffer used by
kernel/printk/printk.c. (For example, the test module is doing that to
stress test the ringbuffer code without actually affecting printk.) I
can add a sentence to clarify that.

John Ogness



Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-09 Thread Sergey Senozhatsky
On (19/11/28 02:58), John Ogness wrote:
> + * Sample reader code::
> + *
> + *   struct printk_info info;
> + *   char text_buf[32];
> + *   char dict_buf[32];
> + *   u64 next_seq = 0;
> + *   struct printk_record r = {
> + *   .info   = &info,
> + *   .text_buf   = &text_buf[0],
> + *   .dict_buf   = &dict_buf[0],
> + *   .text_buf_size  = sizeof(text_buf),
> + *   .dict_buf_size  = sizeof(dict_buf),
> + *   };
> + *
> + *   while (prb_read_valid(&rb, next_seq, &r)) {
> + *   if (info.seq != next_seq)
> + *   pr_warn("lost %llu records\n", info.seq - next_seq);
> + *
> + *   if (info.text_len > r.text_buf_size) {
> + *   pr_warn("record %llu text truncated\n", info.seq);
> + *   text_buf[sizeof(text_buf) - 1] = 0;
> + *   }
> + *
> + *   if (info.dict_len > r.dict_buf_size) {
> + *   pr_warn("record %llu dict truncated\n", info.seq);
> + *   dict_buf[sizeof(dict_buf) - 1] = 0;
> + *   }
> + *
> + *   pr_info("%llu: %llu: %s;%s\n", info.seq, info.ts_nsec,
> + *   &text_buf[0], info.dict_len ? &dict_buf[0] : "");
> + *
> + *   next_seq = info.seq + 1;
> + *   }
> + */

Will this loop ever end? :)

pr_info() adds data to ringbuffer, which prb_read_valid() reads, so pr_info()
can add more data, which prb_read_valid() will read, so pr_info()...

-ss



Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-09 Thread Sergey Senozhatsky
On (19/12/02 16:48), Petr Mladek wrote:
> Hi,
> 
> I have seen a few preliminary versions before this public one.
> I am either happy with it or blind to the new problems[*].

I guess I'm happy with it.

-ss



Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-09 Thread John Ogness
On 2019-12-09, Sergey Senozhatsky  wrote:
>> +#define _DATA_SIZE(sz_bits) (1UL << (sz_bits))
>> +#define _DESCS_COUNT(ct_bits)   (1U << (ct_bits))
>> +#define DESC_SV_BITS (sizeof(int) * 8)
>> +#define DESC_COMMITTED_MASK (1U << (DESC_SV_BITS - 1))
>
> What does SV stand for? State Value?

Yes. Originally this thing was just called the state. But it was a bit
confusing in the code because there is also an enum desc_state (used for
state queries), which is _not_ the value that is stored in the state
variable. That's why the code is using state_var/state_val (SV) for the
actual data values, keeping it separate from desc_state/d_state for
the state queries.
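
To illustrate the split (a sketch, not verbatim from the patch;
DESC_ID() and DESC_REUSE_MASK are assumed to exist alongside the
quoted DESC_COMMITTED_MASK):

enum desc_state {
	desc_miss,	/* ID mismatch (value from another wrap) */
	desc_reserved,	/* reserved, being written by the owner */
	desc_committed,	/* committed, may be read */
	desc_reusable,	/* committed and free to be recycled */
};

/* Query the state of a descriptor with respect to a given ID. */
static enum desc_state get_desc_state(u32 id, u32 state_val)
{
	if (id != DESC_ID(state_val))
		return desc_miss;
	if (state_val & DESC_REUSE_MASK)
		return desc_reusable;
	if (state_val & DESC_COMMITTED_MASK)
		return desc_committed;
	return desc_reserved;
}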

John Ogness



Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-08 Thread Sergey Senozhatsky
On (19/11/28 02:58), John Ogness wrote:
[..]
> +
> +#define _DATA_SIZE(sz_bits)  (1UL << (sz_bits))
> +#define _DESCS_COUNT(ct_bits)(1U << (ct_bits))
> +#define DESC_SV_BITS (sizeof(int) * 8)
> +#define DESC_COMMITTED_MASK  (1U << (DESC_SV_BITS - 1))

What does SV stand for? State Value?

-ss



Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-05 Thread John Ogness
On 2019-12-03, Steven Rostedt  wrote:
>>>>> +/* Reserve a new descriptor, invalidating the oldest if necessary. */
>>>>> +static bool desc_reserve(struct printk_ringbuffer *rb, u32 *id_out)
>>>>> +{
>>>>> +	struct prb_desc_ring *desc_ring = &rb->desc_ring;
>>>>> +	struct prb_desc *desc;
>>>>> +	u32 id_prev_wrap;
>>>>> +	u32 head_id;
>>>>> +	u32 id;
>>>>> +
>>>>> +	head_id = atomic_read(&desc_ring->head_id);
>>>>> +
>>>>> +	do {
>>>>> +		desc = to_desc(desc_ring, head_id);
>>>>> +
>>>>> +		id = DESC_ID(head_id + 1);
>>>>> +		id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
>>>>> +
>>>>> +		if (id_prev_wrap == atomic_read(&desc_ring->tail_id)) {
>>>>> +			if (!desc_push_tail(rb, id_prev_wrap))
>>>>> +				return false;
>>>>> +		}
>>>>> +	} while (!atomic_try_cmpxchg(&desc_ring->head_id, &head_id, id));

>>>> Hmm, in theory, ABA problem might cause that we successfully
>>>> move desc_ring->head_id when tail has not been pushed yet.
>>>>
>>>> As a result we would never call desc_push_tail() until
>>>> it overflows again.
>>>>
>>>> I am not sure if we need to take care of it. The code is called
>>>> with interrupts disabled. IMHO, only NMI could cause ABA problem in
>>>> reality. But the game (debugging) is lost anyway when NMI overwrites
>>>> the buffer several times.
>>>>
>>>> Also it should not be a complete failure. We still could bail out
>>>> when the previous state was not reusable. We are on the safe side
>>>> when it was reusable.
>>>>
>>>> Also we could probably detect when desc_ring->tail_id is not
>>>> updated any longer and do something about it.
>>>>
>>>> BTW: If I am counting correctly, the ABA problem would happen when
>>>> exactly 2^30 (1G) messages are written in the meantime.
>>> 
>>> All the ringbuffer code assumes that the use of index numbers
>>> handles the ABA problem (i.e. there must not be 1 billion printk's
>>> within an NMI). If we want to support 1 billion+ printk's within an
>>> NMI, then perhaps the index number should be increased. For 64-bit
>>> systems it would be no problem to go to 62 bits. For 32-bit systems,
>>> I don't know how well the 64-bit atomic operations are supported.
>> 
>> ftrace dumps from NMI (DUMP_ALL type ftrace_dump_on_oops on a $BIG
>> machine)? 1G seems large enough, but who knows.
>
> ftrace dump from NMI is the most likely case to hit this, but when
> that happens, you are in debugging mode, and the system usually
> becomes unreliable at this moment. I agree with Petr, that we should
> not complicate the code more to handle this theoretical condition.

I will change the data block ID size to "unsigned long". Then it is
really only an issue for 32-bit systems.

AFAICT, the only real issue is that the head can be pushed to the
descriptor index that the tail is pointing to. From there, the head can
be advanced beyond and the tail may never move again. So writers just
need to make sure the tail gets pushed beyond the newly reserved head
before making any changes to that descriptor.

Aside from moving to "unsigned long" ID's, I believe this can be handled
with the following 4 changes when reserving a new descriptor:

- also push the tail forward if it falls behind

- after pushing the head, re-check if the tail is still ok

- verify the state of the reserved descriptor before changing it

- use cmpxchg() to change it

Below is the patch snippet I use for this. (Note that the patch is
against a version where ID's have already been changed to unsigned
longs.)

John Ogness


@@ -468,19 +470,53 @@ static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out)
id = DESC_ID(head_id + 1);
id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
 
-   if (id_prev_wrap == atomic_long_read(&desc_ring->tail_id)) {
-   if (!desc_push_tail(rb, id_prev_wrap))
+   /*
+* Make space for the new descriptor by pushing the tail
+* beyond the descriptor to be reserved. Normally this only
+* requires pushing the tail once. But due to possible ABA
+* problems with the ID, the tail could be too far behind.
+* Use a loop to push the tail where it needs to be.
+*/
+   tail_id = atomic_long_read(&desc_ring->tail_id);
+   while (DESC_ID(id_prev_wrap - tail_id) <
+  DESCS_COUNT(desc_ring)) {
+
+   if (!desc_push_tail(rb, tail_id))
return false;
+   tail_id = DESC_ID(tail_id + 1);
}
} while (!atomic_long_try_cmpxchg(&desc_ring->head_id, &head_id, id));
 
+   /*
+* Before moving the head, it was ensured that the tail was pushed to
+* where it needs to be. Due to possible ABA problems with the ID, the
+* reserved descriptor may not be what was expected. Re-check the tail
+* to see if it is still where it needs to be.

Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-03 Thread Petr Mladek
On Tue 2019-12-03 15:13:36, John Ogness wrote:
> On 2019-12-02, Petr Mladek  wrote:
> >> +/*
> >> + * Sanity checker for reserve size. The ringbuffer code assumes that a data
> >> + * block does not exceed the maximum possible size that could fit within the
> >> + * ringbuffer. This function provides that basic size check so that the
> >> + * assumption is safe.
> >> + */
> >> +static bool data_check_size(struct prb_data_ring *data_ring, unsigned int size)
> >> +{
> >> +  struct prb_data_block *db = NULL;
> >> +
> >> +  /* Writers are not allowed to write data-less records. */
> >> +  if (size == 0)
> >> +  return false;
> >
> > I would personally leave this decision to the API caller.
> >
> > The code actually has to support data-less records. They are used
> > when the descriptor is reserved but the data block can't get reserved.
> 
> Exactly. Data-less records are how the ringbuffer identifies that data
> has been lost. If users were allowed to store data-less records, that
> destinction is no longer possible (unless I created some extra field in
> the descriptor). Does it even make sense for printk to store data-less
> records explicitly?

From my POV, it does not make much sense.

> The current printk implementation silently ignores empty messages.

I am not able to find it. I only see that empty continuation fragments
are ignored in log_output(). Normal empty lines seem to be stored.

Well, I can't see any use case for this. I think that we could ignore
all empty strings.


> > The above statement might create false feeling that it could not
> > happen later in the code. It might lead to bugs in writer code.
> 
> Then let me change the statement to describe that data-less records are
> used internally by the ringbuffer and cannot be explicitly created by
> writers.

Sounds good to me.

> > Also reader API users might not expect to get a "valid" data-less
> > record.
> 
> Readers will never see them. The reader API implementation skips over
> data-less records.

Yeah. They will see a bump in the seq number. We would need to
distinguish empty records and lost records as you wrote above.
It looks better to ignore empty records already during the write.

Best Regards,
Petr



Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-03 Thread Steven Rostedt
On Tue, 3 Dec 2019 10:17:21 +0900
Sergey Senozhatsky  wrote:

> > > BTW: If I am counting correctly, the ABA problem would happen when
> > > exactly 2^30 (1G) messages are written in the meantime.
> > 
> > All the ringbuffer code assumes that the use of index numbers handles
> > the ABA problem (i.e. there must not be 1 billion printk's within an
> > NMI). If we want to support 1 billion+ printk's within an NMI, then
> > perhaps the index number should be increased. For 64-bit systems it
> > would be no problem to go to 62 bits. For 32-bit systems, I don't know
> > how well the 64-bit atomic operations are supported.  
> 
> ftrace dumps from NMI (DUMP_ALL type ftrace_dump_on_oops on a $BIG
> machine)? 1G seems large enough, but who knows.

ftrace dump from NMI is the most likely case to hit this, but when that
happens, you are in debugging mode, and the system usually becomes
unreliable at this moment. I agree with Petr, that we should not
complicate the code more to handle this theoretical condition.

-- Steve



Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-03 Thread John Ogness
On 2019-12-02, Petr Mladek  wrote:
>> +/*
>> + * Sanity checker for reserve size. The ringbuffer code assumes that a data
>> + * block does not exceed the maximum possible size that could fit within the
>> + * ringbuffer. This function provides that basic size check so that the
>> + * assumption is safe.
>> + */
>> +static bool data_check_size(struct prb_data_ring *data_ring, unsigned int size)
>> +{
>> +struct prb_data_block *db = NULL;
>> +
>> +/* Writers are not allowed to write data-less records. */
>> +if (size == 0)
>> +return false;
>
> I would personally leave this decision to the API caller.
>
> The code actually has to support data-less records. They are used
> when the descriptor is reserved but the data block can't get reserved.

Exactly. Data-less records are how the ringbuffer identifies that data
has been lost. If users were allowed to store data-less records, that
distinction is no longer possible (unless I created some extra field in
the descriptor). Does it even make sense for printk to store data-less
records explicitly?

The current printk implementation silently ignores empty messages.

> The above statement might create false feeling that it could not
> happen later in the code. It might lead to bugs in writer code.

Then let me change the statement to describe that data-less records are
used internally by the ringbuffer and cannot be explicitly created by
writers.

> Also reader API users might not expect to get a "valid" data-less
> record.

Readers will never see them. The reader API implementation skips over
data-less records.

John Ogness



Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-03 Thread Petr Mladek
On Mon 2019-12-02 17:37:26, John Ogness wrote:
> On 2019-12-02, Petr Mladek  wrote:
> >> > +/* Reserve a new descriptor, invalidating the oldest if necessary. */
> >> > +static bool desc_reserve(struct printk_ringbuffer *rb, u32 *id_out)
> >> > +{
> >> > +struct prb_desc_ring *desc_ring = &rb->desc_ring;
> >> > +struct prb_desc *desc;
> >> > +u32 id_prev_wrap;
> >> > +u32 head_id;
> >> > +u32 id;
> >> > +
> >> > +head_id = atomic_read(&desc_ring->head_id);
> >> > +
> >> > +do {
> >> > +desc = to_desc(desc_ring, head_id);
> >> > +
> >> > +id = DESC_ID(head_id + 1);
> >> > +id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
> >> > +
> >> > +if (id_prev_wrap == atomic_read(&desc_ring->tail_id)) {
> >> > +if (!desc_push_tail(rb, id_prev_wrap))
> >> > +return false;
> >> > +}
> >> > +} while (!atomic_try_cmpxchg(&desc_ring->head_id, &head_id, id));
> >> 
> >> Hmm, in theory, ABA problem might cause that we successfully
> >> move desc_ring->head_id when tail has not been pushed yet.
> >> 
> >> As a result we would never call desc_push_tail() until
> >> it overflows again.
> >> 
> >> I am not sure if we need to take care of it. The code is called with
> >> interrupts disabled. IMHO, only NMI could cause ABA problem
> >> in reality. But the game (debugging) is lost anyway when NMI overwrites
> >> the buffer several times.
> >
> > BTW: If I am counting correctly, the ABA problem would happen when
> > exactly 2^30 (1G) messages are written in the meantime.
> 
> All the ringbuffer code assumes that the use of index numbers handles
> the ABA problem (i.e. there must not be 1 billion printk's within an
> NMI). If we want to support 1 billion+ printk's within an NMI, then
> perhaps the index number should be increased. For 64-bit systems it
> would be no problem to go to 62 bits. For 32-bit systems, I don't know
> how well the 64-bit atomic operations are supported.

I am not super happy but I really think that it does not make sense
to complicate the code too much because of this theoretical race.

1 billion printks in NMI is crazy. Also the race is a problem only
when we hit exactly the 2^30th message. If we hit 2^30 + 1 or more,
then everything is fine again.

In the worst case, printk() will stop working. I think that there
are other situations that are much more likely when printk() will
not work (people will not see the messages).


> >> Also it should not be a complete failure. We still could bail out when
> >> the previous state was not reusable. We are on the safe side
> >> when it was reusable.
> >> 
> >> Also we could probably detect when desc_ring->tail_id is not
> >> updated any longer and do something about it.
> >> 
> >> > +
> >> > +desc = to_desc(desc_ring, id);
> >> > +
> >> > +/* Set the descriptor's ID and also set its state to reserved. */
> >> > +atomic_set(&desc->state_var, id | 0);
> >> 
> >> We should check here that the original state id from the previous
> >> wrap in reusable state (or 0 for bootstrap). On error, we know that
> >> there was the ABA problem and would need to do something about it.
> >
> > I believe that it should be enough to add this check and
> > bail out in case of error.
> 
> I need to go through the code again in detail and see how many locations
> are affected by ABA. All the code was written with the assumption that
> this type of ABA will not happen.
>
> As you've stated, we could add minimal handling so that the ringbuffer
> at least does not break or get stuck.
> 
> BTW: The same assumption is made for logical positions. There are 4
> times as many of these (on 32-bit systems) but logical positions advance
> much faster. I will review these as well.

The logical positions are assigned only when a descriptor is reserved.
Such a descriptor could not be reused until committed and marked
reusable. It limits the logical position movement to:

   max_record_size * number_of_descriptors

Printk records are limited to 1k. So we could safely support
up to 1 million fully sized lines printed when NMI interrupted
printk() on one CPU.

The most important thing is that it must not crash the system.
It would be nice if we are able to detect this situation and
do something about it. But IMHO, it is perfectly fine when
printk() would stop working in this corner case.

The only problem is that it might be hard to debug. But it
should be possible with crashdump. And I think that we will
first hit some other bugs that we do not see at the moment ;-)

Best Regards,
Petr



Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-02 Thread Sergey Senozhatsky
On (19/12/02 17:37), John Ogness wrote:
> On 2019-12-02, Petr Mladek  wrote:
> >> > +/* Reserve a new descriptor, invalidating the oldest if necessary. */
> >> > +static bool desc_reserve(struct printk_ringbuffer *rb, u32 *id_out)
> >> > +{
> >> > +struct prb_desc_ring *desc_ring = &rb->desc_ring;
> >> > +struct prb_desc *desc;
> >> > +u32 id_prev_wrap;
> >> > +u32 head_id;
> >> > +u32 id;
> >> > +
> >> > +head_id = atomic_read(&desc_ring->head_id);
> >> > +
> >> > +do {
> >> > +desc = to_desc(desc_ring, head_id);
> >> > +
> >> > +id = DESC_ID(head_id + 1);
> >> > +id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
> >> > +
> >> > +if (id_prev_wrap == atomic_read(&desc_ring->tail_id)) {
> >> > +if (!desc_push_tail(rb, id_prev_wrap))
> >> > +return false;
> >> > +}
> >> > +} while (!atomic_try_cmpxchg(&desc_ring->head_id, &head_id, id));
> >> 
> >> Hmm, in theory, ABA problem might cause that we successfully
> >> move desc_ring->head_id when tail has not been pushed yet.
> >> 
> >> As a result we would never call desc_push_tail() until
> >> it overflows again.
> >> 
> >> I am not sure if we need to take care of it. The code is called with
> >> interrupts disabled. IMHO, only NMI could cause ABA problem
> >> in reality. But the game (debugging) is lost anyway when NMI overwrites
> >> the buffer several times.
> >
> > BTW: If I am counting correctly, the ABA problem would happen when
> > exactly 2^30 (1G) messages are written in the meantime.
> 
> All the ringbuffer code assumes that the use of index numbers handles
> the ABA problem (i.e. there must not be 1 billion printk's within an
> NMI). If we want to support 1 billion+ printk's within an NMI, then
> perhaps the index number should be increased. For 64-bit systems it
> would be no problem to go to 62 bits. For 32-bit systems, I don't know
> how well the 64-bit atomic operations are supported.

ftrace dumps from NMI (DUMP_ALL type ftrace_dump_on_oops on a $BIG
machine)? 1G seems large enough, but who knows.

-ss



Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-02 Thread John Ogness
On 2019-12-02, Petr Mladek  wrote:
>> > +/* Reserve a new descriptor, invalidating the oldest if necessary. */
>> > +static bool desc_reserve(struct printk_ringbuffer *rb, u32 *id_out)
>> > +{
>> > +  struct prb_desc_ring *desc_ring = &rb->desc_ring;
>> > +  struct prb_desc *desc;
>> > +  u32 id_prev_wrap;
>> > +  u32 head_id;
>> > +  u32 id;
>> > +
>> > +  head_id = atomic_read(&desc_ring->head_id);
>> > +
>> > +  do {
>> > +  desc = to_desc(desc_ring, head_id);
>> > +
>> > +  id = DESC_ID(head_id + 1);
>> > +  id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
>> > +
>> > +  if (id_prev_wrap == atomic_read(&desc_ring->tail_id)) {
>> > +  if (!desc_push_tail(rb, id_prev_wrap))
>> > +  return false;
>> > +  }
>> > +  } while (!atomic_try_cmpxchg(&desc_ring->head_id, &head_id, id));
>> 
>> Hmm, in theory, ABA problem might cause that we successfully
>> move desc_ring->head_id when tail has not been pushed yet.
>> 
>> As a result we would never call desc_push_tail() until
>> it overflows again.
>> 
>> I am not sure if we need to take care of it. The code is called with
>> interrupts disabled. IMHO, only NMI could cause ABA problem
>> in reality. But the game (debugging) is lost anyway when NMI overwrites
>> the buffer several times.
>
> BTW: If I am counting correctly, the ABA problem would happen when
> exactly 2^30 (1G) messages are written in the meantime.

All the ringbuffer code assumes that the use of index numbers handles
the ABA problem (i.e. there must not be 1 billion printk's within an
NMI). If we want to support 1 billion+ printk's within an NMI, then
perhaps the index number should be increased. For 64-bit systems it
would be no problem to go to 62 bits. For 32-bit systems, I don't know
how well the 64-bit atomic operations are supported.

>> Also it should not be a complete failure. We still could bail out when
>> the previous state was not reusable. We are on the safe side
>> when it was reusable.
>> 
>> Also we could probably detect when desc_ring->tail_id is not
>> updated any longer and do something about it.
>> 
>> > +
>> > +  desc = to_desc(desc_ring, id);
>> > +
>> > +  /* Set the descriptor's ID and also set its state to reserved. */
>> > +  atomic_set(&desc->state_var, id | 0);
>> 
>> We should check here that the original state id from the previous
>> wrap in reusable state (or 0 for bootstrap). On error, we know that
>> there was the ABA problem and would need to do something about it.
>
> I believe that it should be enough to add this check and
> bail out in case of error.

I need to go through the code again in detail and see how many locations
are affected by ABA. All the code was written with the assumption that
this type of ABA will not happen.

As you've stated, we could add minimal handling so that the ringbuffer
at least does not break or get stuck.

BTW: The same assumption is made for logical positions. There are 4
times as many of these (on 32-bit systems) but logical positions advance
much faster. I will review these as well.

John Ogness



Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-02 Thread Petr Mladek
On Mon 2019-12-02 16:48:41, Petr Mladek wrote:
> > +/* Reserve a new descriptor, invalidating the oldest if necessary. */
> > +static bool desc_reserve(struct printk_ringbuffer *rb, u32 *id_out)
> > +{
> > +   struct prb_desc_ring *desc_ring = &rb->desc_ring;
> > +   struct prb_desc *desc;
> > +   u32 id_prev_wrap;
> > +   u32 head_id;
> > +   u32 id;
> > +
> > +   head_id = atomic_read(&desc_ring->head_id);
> > +
> > +   do {
> > +   desc = to_desc(desc_ring, head_id);
> > +
> > +   id = DESC_ID(head_id + 1);
> > +   id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
> > +
> > +   if (id_prev_wrap == atomic_read(&desc_ring->tail_id)) {
> > +   if (!desc_push_tail(rb, id_prev_wrap))
> > +   return false;
> > +   }
> > +   } while (!atomic_try_cmpxchg(&desc_ring->head_id, &head_id, id));
> 
> Hmm, in theory, ABA problem might cause that we successfully
> move desc_ring->head_id when tail has not been pushed yet.
> 
> As a result we would never call desc_push_tail() until
> it overflows again.
> 
> I am not sure if we need to take care of it. The code is called with
> interrupts disabled. IMHO, only NMI could cause ABA problem
> in reality. But the game (debugging) is lost anyway when NMI overwrites
> the buffer several times.

BTW: If I am counting correctly, the ABA problem would happen when
exactly 2^30 (1G) messages are written in the meantime.

> Also it should not be a complete failure. We still could bail out when
> the previous state was not reusable. We are on the safe side
> when it was reusable.
> 
> Also we could probably detect when desc_ring->tail_id is not
> updated any longer and do something about it.
> 
> > +
> > +   desc = to_desc(desc_ring, id);
> > +
> > +   /* Set the descriptor's ID and also set its state to reserved. */
> > +   atomic_set(&desc->state_var, id | 0);
> 
> We should check here that the original state id from the previous
> wrap in reusable state (or 0 for bootstrap). On error, we know that
> there was the ABA problem and would need to do something about it.

I believe that it should be enough to add this check and
bail out in case of error.
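
A minimal form of that check might look like this (a sketch; it reuses
the id_prev_wrap value computed in the reserve loop and assumes a
DESC_REUSE_MASK flag marking the reusable state):

	u32 prev_state_val = atomic_read(&desc->state_var);

	/*
	 * The descriptor being recycled must come from the previous
	 * wrap and already be reusable (or never used at all, i.e. 0).
	 * Anything else means the head_id cmpxchg suffered an ABA
	 * collision: bail out rather than corrupt a live descriptor.
	 */
	if (prev_state_val &&
	    prev_state_val != (id_prev_wrap | DESC_REUSE_MASK))
		return false;

	/* Set the descriptor's ID and also set its state to reserved. */
	atomic_set(&desc->state_var, id | 0);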

Best Regards,
Petr



Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-02 Thread Petr Mladek
Hi,

I have seen a few preliminary versions before this public one.
I am either happy with it or blind to the new problems[*].

It would be great if anyone else could look at it. Especially
I am interested in:

  + Whether the algorithm can be understood by people who
see it for the "first" time.

  + Whether there are any obvious races. I wonder if our assumptions
about the atomic_cmpxchg() guarantees are correct.


[*] I have found one theoretical ABA problem. I think that it is
really rather theoretical and should not block this patch.
I think that we could do something to prevent it either
now or later. See below for more details


On Thu 2019-11-28 02:58:33, John Ogness wrote:
> Add a new lockless ringbuffer implementation to be used for printk.
> First only add support for writers. Reader support will be added in
> a follow-up commit.
> 
> Signed-off-by: John Ogness 
> ---
>  kernel/printk/printk_ringbuffer.c | 676 ++
>  kernel/printk/printk_ringbuffer.h | 239 +++
>  2 files changed, 915 insertions(+)
>  create mode 100644 kernel/printk/printk_ringbuffer.c
>  create mode 100644 kernel/printk/printk_ringbuffer.h
> 
> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
> new file mode 100644
> index ..09c32e52fd40
> --- /dev/null
> +++ b/kernel/printk/printk_ringbuffer.c
> @@ -0,0 +1,676 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include "printk_ringbuffer.h"
> +
> +/**
> + * DOC: printk_ringbuffer overview
> + *
> + * Data Structure
> + * --------------
> + * The printk_ringbuffer is made up of 3 internal ringbuffers::
> + *
> + * * desc_ring:  A ring of descriptors. A descriptor contains all record
> + *   meta-data (sequence number, timestamp, loglevel, etc.) as
> + *   well as internal state information about the record and
> + *   logical positions specifying where in the other
> + *   ringbuffers the text and dictionary strings are located.
> + *
> + * * text_data_ring: A ring of data blocks. A data block consists of a 32-bit
> + *   integer (ID) that maps to a desc_ring index followed by
> + *   the text string of the record.
> + *
> + * * dict_data_ring: A ring of data blocks. A data block consists of a 32-bit
> + *   integer (ID) that maps to a desc_ring index followed by
> + *   the dictionary string of the record.
> + *

I slightly miss some top-level description of the approach, especially
of how the code deals with races. The following comes to mind:


The state of each entry is always defined by desc->state_val.
It consists of a 30-bit id and a 2-bit state value; see DESC_INDEX()
and get_desc_state().

A prb_reserve() call returns a descriptor in the reserved state. It points
to reserved data blocks in the data rings.

The algorithm starts by using never-used descriptors and data
blocks. Later it reuses descriptors and data blocks that are in
the reusable state.

The descriptors and data space are reused in a round-robin fashion.
Only records in the committed state can be moved to the reusable state.

prb_reserve() has to do several steps. Namely it has to update head
and tail positions in all ring buffers. Also it has to manipulate
desc->state_val. Most races are avoided by using atomic_cmpxchg()
operations. They make sure that all values are moved from one
well-defined state to another well-defined state.

ABA races are avoided by using logical positions and indexes.
The ringbuffer must be reused many times before these values
overflow. Anyway, logical positions in the data ring cannot overflow
wildly. They are manipulated only when the descriptor is already
reserved. A descriptor can get reused only when it has been committed
before. It means that there is a limited number of writers
until they would need to reuse a particular descriptor. [*]

Back to the prb_reserve() algorithm. The repeated pattern is that
it does not matter which caller invalidates the oldest entries. They are
fine when it has already been done by another writer in the meantime.
The real winner is the caller that is able to move the head position
and reserve the space. Others need to go back to invalidating the oldest
entry, etc.

prb_reserve() caller has exclusive write access to the reserved
descriptor and data blocks. It has to call prb_commit() when finished.
It is a signal that the data are valid for readers. But it is also
a signal that the descriptor and the space might get reused by
other writers.
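
For example, prb_commit() would then presumably end with something
like this (a sketch; the mask mirrors the quoted DESC_COMMITTED_MASK):

	/* The writer must not touch the descriptor after this store. */
	smp_mb(); /* LMM_TAG(prb_commit:A) */
	atomic_set(&desc->state_var, id | DESC_COMMITTED_MASK);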


Finally, readers just need to check the state of the descriptor
before and after reading the data. The data are correct only
when the descriptor is in committed state.
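
This reader-side pattern could be sketched as follows (a hypothetical
helper, not verbatim from the patch):

/* Sketch: take a consistent snapshot of a committed descriptor. */
static bool desc_read_committed(struct prb_desc *desc, u32 id,
				struct prb_desc *desc_out)
{
	u32 state_val = atomic_read(&desc->state_var);

	/* Check #1: only committed descriptors may be read. */
	if (get_desc_state(id, state_val) != desc_committed)
		return false;

	/* Load state before copying the data. */
	smp_rmb();

	*desc_out = READ_ONCE(*desc);

	/* Load data before re-checking state. */
	smp_rmb();

	/* Check #2: the descriptor must not have been reused meanwhile. */
	state_val = atomic_read(&desc->state_var);
	return get_desc_state(id, state_val) == desc_committed;
}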

[*] Hmm, the ABA problem might theoretically happen in desc_reserve(),
see below.



> +/*
> + * Sanity checker for reserve size. The ringbuffer code assumes that a data
> + * block