Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)
On Fri, Jan 03, 2020 at 11:24:20AM +0100, Petr Mladek wrote:
> On Mon 2019-12-23 17:01:00, John Ogness wrote:
> > Hi Andrea,
> >
> > On 2019-12-21, Andrea Parri wrote:
> > >> +	*desc_out = READ_ONCE(*desc);
> > >> +
> > >> +	/* Load data before re-checking state. */
> > >> +	smp_rmb(); /* matches LMM_REF(desc_reserve:A) */
> > >
> > > I looked for a matching WRITE_ONCE() or some other type of marked write,
> > > but I could not find it. What is the rationale? Or what did I miss?
>
> Good question. READ_ONCE() looks superfluous here because it is
> surrounded by two read barriers. In each case, there is no
> corresponding WRITE_ONCE().
>
> Note that we are copying the entire struct prb_desc here. All values
> are written only when state_val is in desc_reserved state. It happens
> between two full write barriers:
>
>   + A writer is allowed to modify the descriptor after successful
>     cmpxchg in desc_reserve(), see LMM_TAG(desc_reserve:A).
>
>   + The writer must not touch the descriptor after changing
>     state_var to committed state, see
>     LMM_TAG(prb_commit:A) in prb_commit().
>
> These barriers are mentioned in the comments for the two
> read barriers here.

Thanks for these remarks. As usual, I'd recommend to (try to) map those
comments into litmus tests and check with the LKMM simulator.

> BTW: Documentation/memory-barriers.txt describes various aspects of
> the memory barriers. It describes implicit barriers provided
> by spin locks, mutexes, semaphores, and various scheduler-related
> operations.
>
> But I can't find any explanation of the various variants of the atomic
> operations: acquire, release, fetch, return, try, relaxed. I can find
> some clues here and there but it is hard to get the picture.

Documentation/atomic_t.txt could serve this purpose. Please have a look
there and let me know if you have any comments.

Thanks,
  Andrea

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
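[Editor's note: a hypothetical litmus-test sketch of the pattern being discussed, in the herd7 C-litmus style Andrea refers to. The variable names and the reduction of "writer fills descriptor, then commits" to a single data word are assumptions for illustration; the real descriptor copy involves a whole struct and a cmpxchg-based reservation.]

```
C prb-commit-recheck-sketch

(*
 * Writer: publish data, then flip the state to "committed"
 * (prb_commit's write barrier). Reader: load data, then re-check
 * the state (the smp_rmb() quoted above). The "exists" clause asks
 * whether the reader can see the committed state together with
 * stale data; with the barriers in place, LKMM forbids it.
 *)

{}

P0(int *data, int *state)
{
	*data = 1;
	smp_wmb();	/* order data before state, as in prb_commit() */
	WRITE_ONCE(*state, 1);
}

P1(int *data, int *state)
{
	int d;
	int s;

	s = READ_ONCE(*state);
	smp_rmb();	/* order state before data */
	d = *data;
}

exists (1:s=1 /\ 1:d=0)
```

Run through `herd7 -conf linux-kernel.cfg` from the kernel's tools/memory-model, the expected result for such a message-passing shape is "Never" for the exists clause.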
Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)
On Mon 2019-12-23 17:01:00, John Ogness wrote:
> Hi Andrea,
>
> On 2019-12-21, Andrea Parri wrote:
> >> +	*desc_out = READ_ONCE(*desc);
> >> +
> >> +	/* Load data before re-checking state. */
> >> +	smp_rmb(); /* matches LMM_REF(desc_reserve:A) */
> >
> > I looked for a matching WRITE_ONCE() or some other type of marked write,
> > but I could not find it. What is the rationale? Or what did I miss?

Good question. READ_ONCE() looks superfluous here because it is
surrounded by two read barriers. In each case, there is no
corresponding WRITE_ONCE().

Note that we are copying the entire struct prb_desc here. All values
are written only when state_val is in desc_reserved state. It happens
between two full write barriers:

  + A writer is allowed to modify the descriptor after successful
    cmpxchg in desc_reserve(), see LMM_TAG(desc_reserve:A).

  + The writer must not touch the descriptor after changing
    state_var to committed state, see
    LMM_TAG(prb_commit:A) in prb_commit().

These barriers are mentioned in the comments for the two
read barriers here.

> >> +	do {
> >> +		next_lpos = get_next_lpos(data_ring, begin_lpos, size);
> >> +
> >> +		if (!data_push_tail(rb, data_ring,
> >> +				    next_lpos - DATA_SIZE(data_ring))) {
> >> +			/* Failed to allocate, specify a data-less block. */
> >> +			blk_lpos->begin = INVALID_LPOS;
> >> +			blk_lpos->next = INVALID_LPOS;
> >> +			return NULL;
> >> +		}
> >> +	} while (!atomic_long_try_cmpxchg(&data_ring->head_lpos, &begin_lpos,
> >> +					  next_lpos));
> >> +
> >> +	/*
> >> +	 * No barrier is needed here. The data validity is defined by
> >> +	 * the state of the associated descriptor. They are marked as
> >> +	 * invalid at the moment. And only the winner of the above
> >> +	 * cmpxchg() could write here.
> >> +	 */
> >
> > The (successful) CMPXCHG provides a full barrier. This comment suggests
> > that that could be somehow relaxed? Or the comment could be improved?
>
> You are correct. There is no need for the full barrier here. This code
> is based on Petr's POC. I focussed on making sure needed barriers are
> in place, but did not try to eliminate excessive barriers.

I hope that I'll get a better understanding of the guarantees of
different atomic operations one day. There are so many variants now.

BTW: Documentation/memory-barriers.txt describes various aspects of
the memory barriers. It describes implicit barriers provided
by spin locks, mutexes, semaphores, and various scheduler-related
operations.

But I can't find any explanation of the various variants of the atomic
operations: acquire, release, fetch, return, try, relaxed. I can find
some clues here and there but it is hard to get the picture.

Best Regards,
Petr
Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)
Hi Andrea,

On 2019-12-21, Andrea Parri wrote:
>> +	*desc_out = READ_ONCE(*desc);
>> +
>> +	/* Load data before re-checking state. */
>> +	smp_rmb(); /* matches LMM_REF(desc_reserve:A) */
>
> I looked for a matching WRITE_ONCE() or some other type of marked write,
> but I could not find it. What is the rationale? Or what did I miss?

This smp_rmb() matches LMM_TAG(desc_reserve:A).

>> +	do {
>> +		next_lpos = get_next_lpos(data_ring, begin_lpos, size);
>> +
>> +		if (!data_push_tail(rb, data_ring,
>> +				    next_lpos - DATA_SIZE(data_ring))) {
>> +			/* Failed to allocate, specify a data-less block. */
>> +			blk_lpos->begin = INVALID_LPOS;
>> +			blk_lpos->next = INVALID_LPOS;
>> +			return NULL;
>> +		}
>> +	} while (!atomic_long_try_cmpxchg(&data_ring->head_lpos, &begin_lpos,
>> +					  next_lpos));
>> +
>> +	/*
>> +	 * No barrier is needed here. The data validity is defined by
>> +	 * the state of the associated descriptor. They are marked as
>> +	 * invalid at the moment. And only the winner of the above
>> +	 * cmpxchg() could write here.
>> +	 */
>
> The (successful) CMPXCHG provides a full barrier. This comment suggests
> that that could be somehow relaxed? Or the comment could be improved?

You are correct. There is no need for the full barrier here. This code
is based on Petr's POC. I focussed on making sure needed barriers are in
place, but did not try to eliminate excessive barriers.

> (The patch introduces a number of CMPXCHG: similar questions would
> apply to those other instances...)

LMM_TAG(data_push_tail:A) is the only CMPXCHG that requires its full
barrier. All others can be relaxed. I will make this change for the
next series.

Thanks for taking time for this.

John Ogness
Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)
Hi John,

Sorry for the delay. I don't have an overall understanding of the
patch(-set) yet, so I limit to a couple of general questions about the
memory barriers introduced by the patch. Please see inline comments.

> +	*desc_out = READ_ONCE(*desc);
> +
> +	/* Load data before re-checking state. */
> +	smp_rmb(); /* matches LMM_REF(desc_reserve:A) */

I looked for a matching WRITE_ONCE() or some other type of marked write,
but I could not find it. What is the rationale? Or what did I miss?

> +	do {
> +		next_lpos = get_next_lpos(data_ring, begin_lpos, size);
> +
> +		if (!data_push_tail(rb, data_ring,
> +				    next_lpos - DATA_SIZE(data_ring))) {
> +			/* Failed to allocate, specify a data-less block. */
> +			blk_lpos->begin = INVALID_LPOS;
> +			blk_lpos->next = INVALID_LPOS;
> +			return NULL;
> +		}
> +	} while (!atomic_long_try_cmpxchg(&data_ring->head_lpos, &begin_lpos,
> +					  next_lpos));
> +
> +	/*
> +	 * No barrier is needed here. The data validity is defined by
> +	 * the state of the associated descriptor. They are marked as
> +	 * invalid at the moment. And only the winner of the above
> +	 * cmpxchg() could write here.
> +	 */

The (successful) CMPXCHG provides a full barrier. This comment suggests
that that could be somehow relaxed? Or the comment could be improved?

(The patch introduces a number of CMPXCHG: similar questions would
apply to those other instances...)

Thanks,
  Andrea

P. S. Please use my @gmail.com address for future communications.
Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)
On 2019-12-09, Sergey Senozhatsky wrote:
>> + * Sample reader code::
>> + *
>> + *	struct printk_info info;
>> + *	char text_buf[32];
>> + *	char dict_buf[32];
>> + *	u64 next_seq = 0;
>> + *	struct printk_record r = {
>> + *		.info		= &info,
>> + *		.text_buf	= &text_buf[0],
>> + *		.dict_buf	= &dict_buf[0],
>> + *		.text_buf_size	= sizeof(text_buf),
>> + *		.dict_buf_size	= sizeof(dict_buf),
>> + *	};
>> + *
>> + *	while (prb_read_valid(&rb, next_seq, &r)) {
>> + *		if (info.seq != next_seq)
>> + *			pr_warn("lost %llu records\n", info.seq - next_seq);
>> + *
>> + *		if (info.text_len > r.text_buf_size) {
>> + *			pr_warn("record %llu text truncated\n", info.seq);
>> + *			text_buf[sizeof(text_buf) - 1] = 0;
>> + *		}
>> + *
>> + *		if (info.dict_len > r.dict_buf_size) {
>> + *			pr_warn("record %llu dict truncated\n", info.seq);
>> + *			dict_buf[sizeof(dict_buf) - 1] = 0;
>> + *		}
>> + *
>> + *		pr_info("%llu: %llu: %s;%s\n", info.seq, info.ts_nsec,
>> + *			&text_buf[0], info.dict_len ? &dict_buf[0] : "");
>> + *
>> + *		next_seq = info.seq + 1;
>> + *	}
>> + */
>
> Will this loop ever end? :)
>
> pr_info() adds data to ringbuffer, which prb_read_valid() reads, so
> pr_info() can add more data, which prb_read_valid() will read, so
> pr_info()...

The sample code is assuming that @rb is not the same ringbuffer used by
kernel/printk/printk.c. (For example, the test module is doing that to
stress test the ringbuffer code without actually affecting printk.) I
can add a sentence to clarify that.

John Ogness
Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)
On (19/11/28 02:58), John Ogness wrote:
> + * Sample reader code::
> + *
> + *	struct printk_info info;
> + *	char text_buf[32];
> + *	char dict_buf[32];
> + *	u64 next_seq = 0;
> + *	struct printk_record r = {
> + *		.info		= &info,
> + *		.text_buf	= &text_buf[0],
> + *		.dict_buf	= &dict_buf[0],
> + *		.text_buf_size	= sizeof(text_buf),
> + *		.dict_buf_size	= sizeof(dict_buf),
> + *	};
> + *
> + *	while (prb_read_valid(&rb, next_seq, &r)) {
> + *		if (info.seq != next_seq)
> + *			pr_warn("lost %llu records\n", info.seq - next_seq);
> + *
> + *		if (info.text_len > r.text_buf_size) {
> + *			pr_warn("record %llu text truncated\n", info.seq);
> + *			text_buf[sizeof(text_buf) - 1] = 0;
> + *		}
> + *
> + *		if (info.dict_len > r.dict_buf_size) {
> + *			pr_warn("record %llu dict truncated\n", info.seq);
> + *			dict_buf[sizeof(dict_buf) - 1] = 0;
> + *		}
> + *
> + *		pr_info("%llu: %llu: %s;%s\n", info.seq, info.ts_nsec,
> + *			&text_buf[0], info.dict_len ? &dict_buf[0] : "");
> + *
> + *		next_seq = info.seq + 1;
> + *	}
> + */

Will this loop ever end? :)

pr_info() adds data to ringbuffer, which prb_read_valid() reads, so
pr_info() can add more data, which prb_read_valid() will read, so
pr_info()...

	-ss
Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)
On (19/12/02 16:48), Petr Mladek wrote:
> Hi,
>
> I have seen a few preliminary versions before this public one.
> I am either happy with it or blind to see new problems[*].

I guess I'm happy with it.

	-ss
Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)
On 2019-12-09, Sergey Senozhatsky wrote:
>> +#define _DATA_SIZE(sz_bits)	(1UL << (sz_bits))
>> +#define _DESCS_COUNT(ct_bits)	(1U << (ct_bits))
>> +#define DESC_SV_BITS		(sizeof(int) * 8)
>> +#define DESC_COMMITTED_MASK	(1U << (DESC_SV_BITS - 1))
>
> What does SV stand for? State Value?

Yes. Originally this thing was just called the state. But it was a bit
confusing in the code because there is also an enum desc_state (used
for state queries), which is _not_ the value that is stored in the state
variable. That's why the code is using state_var/state_val (SV) for the
actual data values, keeping it separate from desc_state/d_state for the
state queries.

John Ogness
Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)
On (19/11/28 02:58), John Ogness wrote:
[..]
> +
> +#define _DATA_SIZE(sz_bits)	(1UL << (sz_bits))
> +#define _DESCS_COUNT(ct_bits)	(1U << (ct_bits))
> +#define DESC_SV_BITS		(sizeof(int) * 8)
> +#define DESC_COMMITTED_MASK	(1U << (DESC_SV_BITS - 1))

What does SV stand for? State Value?

	-ss
Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)
On 2019-12-03, Steven Rostedt wrote:
>>>>> +/* Reserve a new descriptor, invalidating the oldest if necessary. */
>>>>> +static bool desc_reserve(struct printk_ringbuffer *rb, u32 *id_out)
>>>>> +{
>>>>> +	struct prb_desc_ring *desc_ring = &rb->desc_ring;
>>>>> +	struct prb_desc *desc;
>>>>> +	u32 id_prev_wrap;
>>>>> +	u32 head_id;
>>>>> +	u32 id;
>>>>> +
>>>>> +	head_id = atomic_read(&desc_ring->head_id);
>>>>> +
>>>>> +	do {
>>>>> +		desc = to_desc(desc_ring, head_id);
>>>>> +
>>>>> +		id = DESC_ID(head_id + 1);
>>>>> +		id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
>>>>> +
>>>>> +		if (id_prev_wrap == atomic_read(&desc_ring->tail_id)) {
>>>>> +			if (!desc_push_tail(rb, id_prev_wrap))
>>>>> +				return false;
>>>>> +		}
>>>>> +	} while (!atomic_try_cmpxchg(&desc_ring->head_id, &head_id, id));
>>>>
>>>> Hmm, in theory, ABA problem might cause that we successfully
>>>> move desc_ring->head_id when tail has not been pushed yet.
>>>> As a result we would never call desc_push_tail() until
>>>> it overflows again.
>>>>
>>>> I am not sure if we need to take care of it. The code is called with
>>>> interrupts disabled. IMHO, only NMI could cause ABA problem
>>>> in reality. But the game (debugging) is lost anyway when NMI
>>>> overwrites the buffer several times.
>>>>
>>>> Also it should not be a complete failure. We still could bail out
>>>> when the previous state was not reusable. We are on the safe side
>>>> when it was reusable.
>>>>
>>>> Also we could probably detect when desc_ring->tail_id is not
>>>> updated any longer and do something about it.
>>>>
>>>> BTW: If I am counting correctly. The ABA problem would happen when
>>>> exactly 2^30 (1G) messages is written in the mean time.
>>>
>>> All the ringbuffer code assumes that the use of index numbers
>>> handles the ABA problem (i.e. there must not be 1 billion printk's
>>> within an NMI). If we want to support 1 billion+ printk's within an
>>> NMI, then perhaps the index number should be increased. For 64-bit
>>> systems it would be no problem to go to 62 bits. For 32-bit systems,
>>> I don't know how well the 64-bit atomic operations are supported.
>>
>> ftrace dumps from NMI (DUMP_ALL type ftrace_dump_on_oops on a $BIG
>> machine)? 1G seems large enough, but who knows.
>
> ftrace dump from NMI is the most likely case to hit this, but when
> that happens, you are in debugging mode, and the system usually
> becomes unreliable at this moment. I agree with Petr, that we should
> not complicate the code more to handle this theoretical condition.

I will change the data block ID size to "unsigned long". Then it is
really only an issue for 32-bit systems.

AFAICT, the only real issue is that the head can be pushed to the
descriptor index that the tail is pointing to. From there, the head can
be advanced beyond and the tail may never move again. So writers just
need to make sure the tail gets pushed beyond the newly reserved head
before making any changes to that descriptor.

Aside from moving to "unsigned long" ID's, I believe this can be
handled with the following 4 changes when reserving a new descriptor:

- also push the tail forward if it falls behind
- after pushing the head, re-check if the tail is still ok
- verify the state of the reserved descriptor before changing it
- use cmpxchg() to change it

Below is the patch snippet I use for this. (Note that the patch is
against a version where ID's have already been changed to unsigned
longs.)

John Ogness

@@ -468,19 +470,53 @@ static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out)
 		id = DESC_ID(head_id + 1);
 		id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
 
-		if (id_prev_wrap == atomic_long_read(&desc_ring->tail_id)) {
-			if (!desc_push_tail(rb, id_prev_wrap))
+		/*
+		 * Make space for the new descriptor by pushing the tail
+		 * beyond the descriptor to be reserved. Normally this only
+		 * requires pushing the tail once. But due to possible ABA
+		 * problems with the ID, the tail could be too far behind.
+		 * Use a loop to push the tail where it needs to be.
+		 */
+		tail_id = atomic_long_read(&desc_ring->tail_id);
+		while (DESC_ID(id_prev_wrap - tail_id) <
+		       DESCS_COUNT(desc_ring)) {
+
+			if (!desc_push_tail(rb, tail_id))
 				return false;
+			tail_id = DESC_ID(tail_id + 1);
 		}
 	} while (!atomic_long_try_cmpxchg(&desc_ring->head_id, &head_id, id));
 
+	/*
+	 * Before moving the head, it was ensured that the tail was pushed to
+	 * where it needs to be. Due to possible ABA problems with the ID, the
+	 * reserved descriptor may not be what was expected. Re-check the tail
+	 * to see if it is still where it needs to
Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)
On Tue 2019-12-03 15:13:36, John Ogness wrote:
> On 2019-12-02, Petr Mladek wrote:
> >> +/*
> >> + * Sanity checker for reserve size. The ringbuffer code assumes that a data
> >> + * block does not exceed the maximum possible size that could fit within the
> >> + * ringbuffer. This function provides that basic size check so that the
> >> + * assumption is safe.
> >> + */
> >> +static bool data_check_size(struct prb_data_ring *data_ring, unsigned int size)
> >> +{
> >> +	struct prb_data_block *db = NULL;
> >> +
> >> +	/* Writers are not allowed to write data-less records. */
> >> +	if (size == 0)
> >> +		return false;
> >
> > I would personally let this decision to the API caller.
> >
> > The code actually have to support data-less records. They are used
> > when the descriptor is reserved but the data block can't get reserved.
>
> Exactly. Data-less records are how the ringbuffer identifies that data
> has been lost. If users were allowed to store data-less records, that
> distinction is no longer possible (unless I created some extra field in
> the descriptor). Does it even make sense for printk to store data-less
> records explicitly?

From my POV, it does not make much sense.

> The current printk implementation silently ignores empty messages.

I am not able to find it. I only see that empty continuous fragments
are ignored in log_output(). Normal empty lines seem to be stored.

Well, I can't see any usecase for this. I think that we could ignore
all empty strings.

> > The above statement might create false feeling that it could not
> > happen later in the code. It might lead to bugs in writer code.
>
> Then let me change the statement to describe that data-less records are
> used internally by the ringbuffer and cannot be explicitly created by
> writers.

Sounds good to me.

> > Also reader API users might not expect to get a "valid" data-less
> > record.
>
> Readers will never see them. The reader API implementation skips over
> data-less records.

Yeah. They will see a bump in the seq number. We would need to
distinguish empty records and lost records as you wrote above. It looks
better to ignore empty records already during write.

Best Regards,
Petr
Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)
On Tue, 3 Dec 2019 10:17:21 +0900
Sergey Senozhatsky wrote:
> > > BTW: If I am counting correctly. The ABA problem would happen when
> > > exactly 2^30 (1G) messages is written in the mean time.
> >
> > All the ringbuffer code assumes that the use of index numbers handles
> > the ABA problem (i.e. there must not be 1 billion printk's within an
> > NMI). If we want to support 1 billion+ printk's within an NMI, then
> > perhaps the index number should be increased. For 64-bit systems it
> > would be no problem to go to 62 bits. For 32-bit systems, I don't know
> > how well the 64-bit atomic operations are supported.
>
> ftrace dumps from NMI (DUMP_ALL type ftrace_dump_on_oops on a $BIG
> machine)? 1G seems large enough, but who knows.

ftrace dump from NMI is the most likely case to hit this, but when that
happens, you are in debugging mode, and the system usually becomes
unreliable at this moment. I agree with Petr, that we should not
complicate the code more to handle this theoretical condition.

-- Steve
Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)
On 2019-12-02, Petr Mladek wrote:
>> +/*
>> + * Sanity checker for reserve size. The ringbuffer code assumes that a data
>> + * block does not exceed the maximum possible size that could fit within the
>> + * ringbuffer. This function provides that basic size check so that the
>> + * assumption is safe.
>> + */
>> +static bool data_check_size(struct prb_data_ring *data_ring, unsigned int size)
>> +{
>> +	struct prb_data_block *db = NULL;
>> +
>> +	/* Writers are not allowed to write data-less records. */
>> +	if (size == 0)
>> +		return false;
>
> I would personally let this decision to the API caller.
>
> The code actually have to support data-less records. They are used
> when the descriptor is reserved but the data block can't get reserved.

Exactly. Data-less records are how the ringbuffer identifies that data
has been lost. If users were allowed to store data-less records, that
distinction is no longer possible (unless I created some extra field in
the descriptor). Does it even make sense for printk to store data-less
records explicitly? The current printk implementation silently ignores
empty messages.

> The above statement might create false feeling that it could not
> happen later in the code. It might lead to bugs in writer code.

Then let me change the statement to describe that data-less records are
used internally by the ringbuffer and cannot be explicitly created by
writers.

> Also reader API users might not expect to get a "valid" data-less
> record.

Readers will never see them. The reader API implementation skips over
data-less records.

John Ogness
Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)
On Mon 2019-12-02 17:37:26, John Ogness wrote:
> On 2019-12-02, Petr Mladek wrote:
> >> > +/* Reserve a new descriptor, invalidating the oldest if necessary. */
> >> > +static bool desc_reserve(struct printk_ringbuffer *rb, u32 *id_out)
> >> > +{
> >> > +	struct prb_desc_ring *desc_ring = &rb->desc_ring;
> >> > +	struct prb_desc *desc;
> >> > +	u32 id_prev_wrap;
> >> > +	u32 head_id;
> >> > +	u32 id;
> >> > +
> >> > +	head_id = atomic_read(&desc_ring->head_id);
> >> > +
> >> > +	do {
> >> > +		desc = to_desc(desc_ring, head_id);
> >> > +
> >> > +		id = DESC_ID(head_id + 1);
> >> > +		id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
> >> > +
> >> > +		if (id_prev_wrap == atomic_read(&desc_ring->tail_id)) {
> >> > +			if (!desc_push_tail(rb, id_prev_wrap))
> >> > +				return false;
> >> > +		}
> >> > +	} while (!atomic_try_cmpxchg(&desc_ring->head_id, &head_id, id));
> >>
> >> Hmm, in theory, ABA problem might cause that we successfully
> >> move desc_ring->head_id when tail has not been pushed yet.
> >>
> >> As a result we would never call desc_push_tail() until
> >> it overflows again.
> >>
> >> I am not sure if we need to take care of it. The code is called with
> >> interrupts disabled. IMHO, only NMI could cause ABA problem
> >> in reality. But the game (debugging) is lost anyway when NMI
> >> overwrites the buffer several times.
> >
> > BTW: If I am counting correctly. The ABA problem would happen when
> > exactly 2^30 (1G) messages is written in the mean time.
>
> All the ringbuffer code assumes that the use of index numbers handles
> the ABA problem (i.e. there must not be 1 billion printk's within an
> NMI). If we want to support 1 billion+ printk's within an NMI, then
> perhaps the index number should be increased. For 64-bit systems it
> would be no problem to go to 62 bits. For 32-bit systems, I don't know
> how well the 64-bit atomic operations are supported.

I am not super happy but I really think that it does not make sense to
complicate the code too much because of this theoretical race. 1 billion
printks in NMI is crazy. Also the race is a problem only when we hit
exactly the 2^30th message. If we hit 2^30 + 1 or more, then everything
is fine again.

In the worst case, printk() will stop working. I think that there are
other situations that are much more likely when printk() will not work
(people will not see the messages).

> >> Also it should not be a complete failure. We still could bail out
> >> when the previous state was not reusable. We are on the safe side
> >> when it was reusable.
> >>
> >> Also we could probably detect when desc_ring->tail_id is not
> >> updated any longer and do something about it.
> >>
> >> > +
> >> > +	desc = to_desc(desc_ring, id);
> >> > +
> >> > +	/* Set the descriptor's ID and also set its state to reserved. */
> >> > +	atomic_set(&desc->state_var, id | 0);
> >>
> >> We should check here that the original state id from the previous
> >> wrap in reusable state (or 0 for bootstrap). On error, we know that
> >> there was the ABA problem and would need to do something about it.
> >
> > I believe that it should be enough to add this check and
> > bail out in case of error.
>
> I need to go through the code again in detail and see how many locations
> are affected by ABA. All the code was written with the assumption that
> this type of ABA will not happen.
>
> As you've stated, we could add minimal handling so that the ringbuffer
> at least does not break or get stuck.
>
> BTW: The same assumption is made for logical positions. There are 4
> times as many of these (on 32-bit systems) but logical positions advance
> much faster. I will review these as well.

The logical positions are assigned only when a descriptor is reserved.
Such a descriptor could not be reused until committed and marked
reusable. It limits the logical position movement to:

	max_record_size * number_of_descriptors

Printk records are limited to 1k. So we could safely support up to
1 million fully sized lines printed when NMI interrupted printk() on
one CPU.

The most important thing is that it must not crash the system. It would
be nice if we are able to detect this situation and do something about
it. But IMHO, it is perfectly fine when printk() would stop working in
this corner case.

The only problem is that it might be hard to debug. But it should be
possible with crashdump. And I think that we will first hit some other
bugs that we do not see at the moment ;-)

Best Regards,
Petr
Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)
On (19/12/02 17:37), John Ogness wrote:
> On 2019-12-02, Petr Mladek wrote:
> >> > +/* Reserve a new descriptor, invalidating the oldest if necessary. */
> >> > +static bool desc_reserve(struct printk_ringbuffer *rb, u32 *id_out)
> >> > +{
> >> > +	struct prb_desc_ring *desc_ring = &rb->desc_ring;
> >> > +	struct prb_desc *desc;
> >> > +	u32 id_prev_wrap;
> >> > +	u32 head_id;
> >> > +	u32 id;
> >> > +
> >> > +	head_id = atomic_read(&desc_ring->head_id);
> >> > +
> >> > +	do {
> >> > +		desc = to_desc(desc_ring, head_id);
> >> > +
> >> > +		id = DESC_ID(head_id + 1);
> >> > +		id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
> >> > +
> >> > +		if (id_prev_wrap == atomic_read(&desc_ring->tail_id)) {
> >> > +			if (!desc_push_tail(rb, id_prev_wrap))
> >> > +				return false;
> >> > +		}
> >> > +	} while (!atomic_try_cmpxchg(&desc_ring->head_id, &head_id, id));
> >>
> >> Hmm, in theory, ABA problem might cause that we successfully
> >> move desc_ring->head_id when tail has not been pushed yet.
> >>
> >> As a result we would never call desc_push_tail() until
> >> it overflows again.
> >>
> >> I am not sure if we need to take care of it. The code is called with
> >> interrupts disabled. IMHO, only NMI could cause ABA problem
> >> in reality. But the game (debugging) is lost anyway when NMI
> >> overwrites the buffer several times.
> >
> > BTW: If I am counting correctly. The ABA problem would happen when
> > exactly 2^30 (1G) messages is written in the mean time.
>
> All the ringbuffer code assumes that the use of index numbers handles
> the ABA problem (i.e. there must not be 1 billion printk's within an
> NMI). If we want to support 1 billion+ printk's within an NMI, then
> perhaps the index number should be increased. For 64-bit systems it
> would be no problem to go to 62 bits. For 32-bit systems, I don't know
> how well the 64-bit atomic operations are supported.

ftrace dumps from NMI (DUMP_ALL type ftrace_dump_on_oops on a $BIG
machine)? 1G seems large enough, but who knows.

	-ss
Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)
On 2019-12-02, Petr Mladek wrote:
>> > +/* Reserve a new descriptor, invalidating the oldest if necessary. */
>> > +static bool desc_reserve(struct printk_ringbuffer *rb, u32 *id_out)
>> > +{
>> > +	struct prb_desc_ring *desc_ring = &rb->desc_ring;
>> > +	struct prb_desc *desc;
>> > +	u32 id_prev_wrap;
>> > +	u32 head_id;
>> > +	u32 id;
>> > +
>> > +	head_id = atomic_read(&desc_ring->head_id);
>> > +
>> > +	do {
>> > +		desc = to_desc(desc_ring, head_id);
>> > +
>> > +		id = DESC_ID(head_id + 1);
>> > +		id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
>> > +
>> > +		if (id_prev_wrap == atomic_read(&desc_ring->tail_id)) {
>> > +			if (!desc_push_tail(rb, id_prev_wrap))
>> > +				return false;
>> > +		}
>> > +	} while (!atomic_try_cmpxchg(&desc_ring->head_id, &head_id, id));
>>
>> Hmm, in theory, ABA problem might cause that we successfully
>> move desc_ring->head_id when tail has not been pushed yet.
>>
>> As a result we would never call desc_push_tail() until
>> it overflows again.
>>
>> I am not sure if we need to take care of it. The code is called with
>> interrupts disabled. IMHO, only NMI could cause ABA problem
>> in reality. But the game (debugging) is lost anyway when NMI
>> overwrites the buffer several times.
>
> BTW: If I am counting correctly. The ABA problem would happen when
> exactly 2^30 (1G) messages is written in the mean time.

All the ringbuffer code assumes that the use of index numbers handles
the ABA problem (i.e. there must not be 1 billion printk's within an
NMI). If we want to support 1 billion+ printk's within an NMI, then
perhaps the index number should be increased. For 64-bit systems it
would be no problem to go to 62 bits. For 32-bit systems, I don't know
how well the 64-bit atomic operations are supported.

>> Also it should not be a complete failure. We still could bail out when
>> the previous state was not reusable. We are on the safe side
>> when it was reusable.
>>
>> Also we could probably detect when desc_ring->tail_id is not
>> updated any longer and do something about it.
>>
>> > +
>> > +	desc = to_desc(desc_ring, id);
>> > +
>> > +	/* Set the descriptor's ID and also set its state to reserved. */
>> > +	atomic_set(&desc->state_var, id | 0);
>>
>> We should check here that the original state id from the previous
>> wrap in reusable state (or 0 for bootstrap). On error, we know that
>> there was the ABA problem and would need to do something about it.
>
> I believe that it should be enough to add this check and
> bail out in case of error.

I need to go through the code again in detail and see how many locations
are affected by ABA. All the code was written with the assumption that
this type of ABA will not happen.

As you've stated, we could add minimal handling so that the ringbuffer
at least does not break or get stuck.

BTW: The same assumption is made for logical positions. There are 4
times as many of these (on 32-bit systems) but logical positions advance
much faster. I will review these as well.

John Ogness
Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)
On Mon 2019-12-02 16:48:41, Petr Mladek wrote:
> > +/* Reserve a new descriptor, invalidating the oldest if necessary. */
> > +static bool desc_reserve(struct printk_ringbuffer *rb, u32 *id_out)
> > +{
> > +	struct prb_desc_ring *desc_ring = &rb->desc_ring;
> > +	struct prb_desc *desc;
> > +	u32 id_prev_wrap;
> > +	u32 head_id;
> > +	u32 id;
> > +
> > +	head_id = atomic_read(&desc_ring->head_id);
> > +
> > +	do {
> > +		desc = to_desc(desc_ring, head_id);
> > +
> > +		id = DESC_ID(head_id + 1);
> > +		id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
> > +
> > +		if (id_prev_wrap == atomic_read(&desc_ring->tail_id)) {
> > +			if (!desc_push_tail(rb, id_prev_wrap))
> > +				return false;
> > +		}
> > +	} while (!atomic_try_cmpxchg(&desc_ring->head_id, &head_id, id));
>
> Hmm, in theory, an ABA problem might cause us to successfully
> move desc_ring->head_id when the tail has not been pushed yet.
>
> As a result we would never call desc_push_tail() until
> it overflows again.
>
> I am not sure if we need to take care of it. The code is called with
> interrupts disabled. IMHO, only NMI could cause an ABA problem
> in reality. But the game (debugging) is lost anyway when NMI overwrites
> the buffer several times.

BTW: If I am counting correctly, the ABA problem would happen when
exactly 2^30 (1G) messages are written in the meantime.

> Also it should not be a complete failure. We still could bail out when
> the previous state was not reusable. We are on the safe side
> when it was reusable.
>
> Also we could probably detect when desc_ring->tail_id is not
> updated any longer and do something about it.
>
> > +
> > +	desc = to_desc(desc_ring, id);
> > +
> > +	/* Set the descriptor's ID and also set its state to reserved. */
> > +	atomic_set(&desc->state_var, id | 0);
>
> We should check here that the original state is the id from the
> previous wrap in the reusable state (or 0 for bootstrap). On error, we
> know that there was the ABA problem and would need to do something
> about it.
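The "2^30 (1G) messages" figure follows from the width of the descriptor ID field. A minimal sketch of the arithmetic, where `DESC_ID_BITS`, `DESC_ID_MASK` and `DESC_ID()` are modeled on the patch's macros but the exact definitions are assumptions (30 bits of the 32-bit state variable for the ID, 2 for the state):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed split of the 32-bit state variable: 30-bit ID, 2-bit state. */
#define DESC_ID_BITS 30u
#define DESC_ID_MASK ((1u << DESC_ID_BITS) - 1u)
#define DESC_ID(id)  ((id) & DESC_ID_MASK)

/*
 * An ABA collision on head_id requires the ID counter to wrap all the
 * way around the 30-bit space back to a stale snapshot.
 */
static uint32_t aba_window(void)
{
	return DESC_ID_MASK + 1u; /* 2^30 == 1073741824 reserves */
}
```

Widening the ID field to 62 bits, as suggested for 64-bit systems, would make the window astronomically large at the cost of requiring 64-bit atomics.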
I believe that it should be enough to add this check and
bail out in case of error.

Best Regards,
Petr
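Why the plain 32-bit cmpxchg loop cannot see the wrap on its own can be shown with C11 atomics; `try_advance_head()` below is a hypothetical stand-in for the loop's exit condition, not the kernel's `atomic_try_cmpxchg()`:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Compare-and-exchange only compares the current value with the
 * expected snapshot; it has no memory of the values seen in between.
 * If head_id wraps through the whole ID space back to the snapshot,
 * the exchange succeeds even though ~2^30 reserves happened in the
 * meantime. That is the ABA scenario in this thread.
 */
static bool try_advance_head(atomic_uint *head, uint32_t snapshot,
			     uint32_t new_id)
{
	return atomic_compare_exchange_strong(head, &snapshot, new_id);
}
```

This is why the proposed fix checks the descriptor's old state_var as well: the state check detects the wrap that the cmpxchg alone cannot.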
Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)
Hi,

I have seen a few preliminary versions before this public one. I am
either happy with it or blind to new problems[*]. It would be great if
anyone else could look at it. Especially I am interested in:

  + Whether the algorithm can be understood by people who see it
    for the "first" time.

  + Whether there are any obvious races. I wonder if our assumptions
    about atomic_cmpxchg() guarantees are correct.

[*] I have found one theoretical ABA problem. I think that it is
really rather theoretical and should not block this patch. I think
that we could do something to prevent it either now or later. See
below for more details.

On Thu 2019-11-28 02:58:33, John Ogness wrote:
> Add a new lockless ringbuffer implementation to be used for printk.
> First only add support for writers. Reader support will be added in
> a follow-up commit.
>
> Signed-off-by: John Ogness
> ---
>  kernel/printk/printk_ringbuffer.c | 676 ++
>  kernel/printk/printk_ringbuffer.h | 239 +++
>  2 files changed, 915 insertions(+)
>  create mode 100644 kernel/printk/printk_ringbuffer.c
>  create mode 100644 kernel/printk/printk_ringbuffer.h
>
> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
> new file mode 100644
> index ..09c32e52fd40
> --- /dev/null
> +++ b/kernel/printk/printk_ringbuffer.c
> @@ -0,0 +1,676 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include "printk_ringbuffer.h"
> +
> +/**
> + * DOC: printk_ringbuffer overview
> + *
> + * Data Structure
> + * --------------
> + * The printk_ringbuffer is made up of 3 internal ringbuffers::
> + *
> + * * desc_ring: A ring of descriptors. A descriptor contains all record
> + *              meta-data (sequence number, timestamp, loglevel, etc.) as
> + *              well as internal state information about the record and
> + *              logical positions specifying where in the other
> + *              ringbuffers the text and dictionary strings are located.
> + *
> + * * text_data_ring: A ring of data blocks.
A data block consists of a 32-bit
> + *                   integer (ID) that maps to a desc_ring index
> + *                   followed by the text string of the record.
> + *
> + * * dict_data_ring: A ring of data blocks. A data block consists of a
> + *                   32-bit integer (ID) that maps to a desc_ring index
> + *                   followed by the dictionary string of the record.
> + *

I slightly miss some top level description of the approach, especially
how the code deals with races. The following comes to my mind:

The state of each entry is always defined by desc->state_val. It
consists of a 30-bit id and a 2-bit state value, see DESC_INDEX()
and get_desc_state().

A prb_reserve() call returns a descriptor in the reserved state. It
points to reserved data blocks in the data rings.

The algorithm starts by using never-used descriptors and data blocks.
Later it uses descriptors and data blocks in the reusable state. The
descriptors and data space are reused in a round-robin fashion. Only
records in the committed state can be moved to the reusable state.

prb_reserve() has to do several steps. Namely, it has to update head
and tail positions in all ring buffers. It also has to manipulate
desc->state_val. Most races are avoided by using atomic_cmpxchg()
operations. They make sure that all values are moved from one
well-defined state to another well-defined state.

ABA races are avoided by using logical positions and indexes. The
ringbuffer must be reused many times before these values overflow.
Anyway, logical positions in the data rings cannot overflow wildly.
They are manipulated only when the descriptor is already reserved. The
descriptor can get reused only when it was committed before. It means
that there is a limited number of writers until they would need to
reuse a particular descriptor. [*]

Back to the prb_reserve() algorithm. The repeated pattern is that it
does not matter which caller invalidates the oldest entries. They are
fine when it has already been done by another writer in the meantime.
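The state_val encoding described above (a 30-bit id plus a 2-bit state) could be modeled as below. The enum names and helper signatures are assumptions for illustration, not the patch's exact definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state names; the patch defines its own values. */
enum desc_state { desc_miss, desc_reserved, desc_committed, desc_reusable };

#define DESC_ID_BITS 30u
#define DESC_ID_MASK ((1u << DESC_ID_BITS) - 1u)

/* Pack a 30-bit ID and a 2-bit state into one 32-bit state_val. */
static uint32_t make_state(uint32_t id, enum desc_state st)
{
	return (id & DESC_ID_MASK) | ((uint32_t)st << DESC_ID_BITS);
}

static enum desc_state get_desc_state(uint32_t state_val)
{
	return (enum desc_state)(state_val >> DESC_ID_BITS);
}

static uint32_t desc_id(uint32_t state_val)
{
	return state_val & DESC_ID_MASK;
}
```

Because the id and state live in one 32-bit word, a single atomic_cmpxchg() can transition both at once, which is what makes the well-defined state transitions possible.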
The real winner is the caller that is able to move the head position
and reserve the space. Others need to go back to invalidating the
oldest entry, etc.

The prb_reserve() caller has exclusive write access to the reserved
descriptor and data blocks. It has to call prb_commit() when finished.
It is a signal that the data are valid for readers. But it is also a
signal that the descriptor and the space might get reused by other
writers.

Finally, readers just need to check the state of the descriptor before
and after reading the data. The data are correct only when the
descriptor is in the committed state.

[*] Hmm, the ABA problem might theoretically happen in desc_reserve(),
see below.

> +/*
> + * Sanity checker for reserve size. The ringbuffer code assumes that a data
> + * block